250+ TOP MCQs on Analysis and Experimental Design and Answers

Data Science Multiple Choice Questions on “Analysis and Experimental Design”.

1. If X predicts Y, it means that X causes Y.
a) True
b) False

Answer: b
Explanation: If X predicts Y, it does not mean X causes Y.
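The difference between prediction and causation can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: a hypothetical hidden variable Z drives both X and Y, and X has no effect on Y. All names and noise levels are made up for the example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)
n = 5000

# Hypothetical hidden confounder Z drives both X and Y; X has no effect on Y.
z = [rng.gauss(0, 1) for _ in range(n)]
x_observed = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

# Intervention: X is set at random, independently of Z, and Y is left unchanged.
x_randomized = [rng.gauss(0, 1) for _ in range(n)]

r_observed = pearson(x_observed, y)
r_randomized = pearson(x_randomized, y)
print(f"observational r = {r_observed:.2f}, randomized r = {r_randomized:.2f}")
```

In the observational data X predicts Y well, yet setting X at random leaves Y untouched, so the correlation disappears.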

2. Point out the correct statement.
a) If equations are known but the parameters are not, they may be inferred with data analysis
b) If equations are not known but the parameters are, they may be inferred with data analysis
c) If neither the equations nor the parameters are known, they may be inferred with data analysis
d) None of the mentioned

Answer: a
Explanation: Usually the random component of data is measurement error.

3. Which of the following is the most important thing in data science?
a) answer
b) question
c) data
d) none of the mentioned

Answer: b
Explanation: The question is the most important thing; the data is the second most important.

4. Which of the following approaches should be used if you can’t fix the variable?
a) randomize it
b) non stratify it
c) generalize it
d) none of the mentioned

Answer: a
Explanation: If you can’t fix the variable, randomize it. If you could fix the variable but choose not to, stratify it.
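A minimal sketch of randomization, assuming a hypothetical nuisance variable (here called age) that cannot be held fixed. Subjects are assigned to two arms at random, so the nuisance variable ends up roughly balanced between them. The sample size and distribution are invented for illustration.

```python
import random
from statistics import mean

rng = random.Random(1)

# Hypothetical nuisance variable (say, age) that cannot be held fixed.
ages = [rng.gauss(50, 10) for _ in range(1000)]

# Randomize: shuffle subject indices and split them into two equal arms.
subjects = list(range(len(ages)))
rng.shuffle(subjects)
treatment, control = subjects[:500], subjects[500:]

mean_treatment = mean(ages[i] for i in treatment)
mean_control = mean(ages[i] for i in control)
print(f"mean age: treatment={mean_treatment:.1f}, control={mean_control:.1f}")
```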

5. Point out the wrong statement.
a) Randomized studies are not used to identify causation
b) Complicated approaches exist for inferring causation
c) Causal relationships may not apply to every individual
d) All of the mentioned

Answer: a
Explanation: Randomized studies are usually used to identify causation.

6. Which of the following is a good way of performing experiments in data science?
a) Measure variability
b) Generalize to the problem
c) Have Replication
d) All of the mentioned

Answer: d
Explanation: Experiments on causal relationships investigate the effect of one or more variables on one or more outcome variables.

7. Which of the following is commonly referred to as ‘data fishing’?
a) Data bagging
b) Data booting
c) Data merging
d) None of the mentioned

Answer: d
Explanation: Data dredging is sometimes referred to as “data fishing”.

8. Which of the following data mining techniques is used to uncover patterns in data?
a) Data bagging
b) Data booting
c) Data merging
d) Data Dredging

Answer: d
Explanation: Data dredging, also called data snooping, refers to the practice of misusing data mining techniques to present misleading scientific ‘research’.
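Why data dredging misleads can be sketched with pure noise. The example below is an illustration only: it correlates 200 random predictors with a random outcome and counts how many pass a conventional significance cutoff. The cutoff 0.361 is the approximate two-sided 5% critical value of the correlation coefficient for 30 samples; the sizes are arbitrary choices.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(42)
n_samples, n_predictors = 30, 200

# Outcome and predictors are all independent noise: no real relationship exists.
outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
predictors = [
    [rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_predictors)
]

# |r| > 0.361 corresponds roughly to p < 0.05 (two-sided) when n = 30.
threshold = 0.361
hits = [
    j for j, col in enumerate(predictors) if abs(pearson(col, outcome)) > threshold
]
print(f"{len(hits)} of {n_predictors} pure-noise predictors look 'significant'")
```

Reporting only the predictors in `hits` without mentioning the other tests is the kind of misuse the explanation describes.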

250+ TOP MCQs on Plotting Systems and Answers

Data Science Multiple Choice Questions on “Plotting Systems”.

1. How many stages commonly occur in the creation of a plot?
a) 2
b) 5
c) 8
d) All of the mentioned

Answer: a
Explanation: The base plotting system is highly flexible. A plot is typically built in two stages: creating the plot and then annotating it.

2. Base graphics are used most commonly for creating 2D graphics.
a) True
b) False

Answer: a
Explanation: Base graphics is a very powerful system for creating 2D graphics.

3. Which of the following annotation functions is used to add or modify text?
a) word
b) graph
c) lines
d) all of the mentioned

Answer: d
Explanation: points and axis are other well-known annotation functions.

4. Which of the following packages is the lattice plotting system built on?
a) grDevices
b) grid
c) graphics
d) all of the mentioned

Answer: b
Explanation: The lattice plotting system is implemented on top of the grid package.

5. Point out the wrong statement.
a) Plots are created with multiple function calls only
b) Plots are created with both single and multiple function calls
c) Annotation in plot is not especially intuitive
d) None of the mentioned

Answer: a
Explanation: Plots can also be created with a single function call.

6. Which of the following parameters defines the line type, such as dashed or dotted?
a) lty
b) pch
c) lwd
d) all of the mentioned

Answer: a
Explanation: lty sets the line type; lwd is used for line width.

7. The core plotting engine is encapsulated in the graphics package.
a) True
b) False

Answer: a
Explanation: The graphics package contains the plotting functions of the base graphics system.

8. Which of the following arguments of the par function specifies the margin size?
a) las
b) bg
c) mar
d) all of the mentioned

Answer: c
Explanation: The par function is used to specify global graphics parameters; mar sets the margin size.

250+ TOP MCQs on caret and Answers

Data Science Multiple Choice Questions on “Caret”.

1. Which of the following functions is a wrapper for different lattice plots to visualize the data?
a) levelplot
b) featurePlot
c) plotsample
d) none of the mentioned

Answer: b
Explanation: featurePlot is used for data visualization in caret.

2. Point out the wrong statement.
a) In every situation, the data generating mechanism can create predictors that only have a single unique value
b) Predictors might have only a handful of unique values that occur with very low frequencies
c) The function findLinearCombos uses the QR decomposition of a matrix to enumerate sets of linear combinations
d) All of the mentioned

Answer: a
Explanation: In some situations, the data generating mechanism can create predictors that only have a single unique value.

3. Which of the following functions can be used to identify near zero-variance variables?
a) zeroVar
b) nearVar
c) nearZeroVar
d) all of the mentioned

Answer: c
Explanation: The saveMetrics argument can be used to show the details and usually defaults to FALSE.
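The idea behind a near zero-variance check can be sketched in a few lines. This is a Python illustration of the concept, not caret's implementation: the helper `near_zero_var` is hypothetical, and the cutoffs 95/5 and 10 are assumed here to mirror the defaults caret documents for `freqCut` and `uniqueCut`.

```python
from collections import Counter


def near_zero_var(values, freq_cut=95 / 5, unique_cut=10):
    """Return (freq_ratio, percent_unique, flagged) for one predictor column."""
    counts = Counter(values).most_common()
    percent_unique = 100 * len(counts) / len(values)
    zero_var = len(counts) == 1
    # Ratio of the most common value's count to the second most common one.
    freq_ratio = counts[0][1] / counts[1][1] if not zero_var else 0.0
    flagged = zero_var or (freq_ratio > freq_cut and percent_unique <= unique_cut)
    return freq_ratio, percent_unique, flagged


skewed = ["a"] * 98 + ["b"] * 2
balanced = ["a"] * 50 + ["b"] * 50
constant = ["a"] * 100
for name, col in [("skewed", skewed), ("balanced", balanced), ("constant", constant)]:
    print(name, near_zero_var(col))
```

The skewed and constant columns are flagged; the balanced one is not.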

4. Which of the following functions can be used to flag predictors for removal?
a) searchCorrelation
b) findCausation
c) findCorrelation
d) none of the mentioned

Answer: c
Explanation: Some models thrive on correlated predictors.

5. Point out the correct statement.
a) findLinearColumns will also return a vector of column positions that can be removed to eliminate the linear dependencies
b) findLinearCombos will return a list that enumerates dependencies
c) the function findLinearRows can be used to generate a complete set of row variables from one factor
d) none of the mentioned

Answer: b
Explanation: For each linear combination, it will incrementally remove columns from the matrix and test to see if the dependencies have been resolved.

6. Which of the following can be used to impute data sets based only on information in the training set?
a) postProcess
b) preProcess
c) process
d) all of the mentioned

Answer: b
Explanation: The imputation can be done with K-nearest neighbors, for example.

7. The function preProcess estimates the required parameters for each operation.
a) True
b) False

Answer: a
Explanation: predict.preProcess is used to apply them to specific data sets.
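The estimate-then-apply pattern described here can be sketched in Python. This is not caret code: the helpers `fit_center_scale` and `apply_center_scale` are hypothetical names that play the roles of preProcess (estimating parameters from the training set) and predict.preProcess (applying them to any data set), for centering and scaling only.

```python
from statistics import mean, stdev


def fit_center_scale(train_rows):
    """Estimate per-column mean and standard deviation from training rows only."""
    columns = list(zip(*train_rows))
    return [(mean(col), stdev(col)) for col in columns]


def apply_center_scale(params, rows):
    """Center and scale rows using previously estimated parameters."""
    return [[(v - m) / s for v, (m, s) in zip(row, params)] for row in rows]


train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test = [[4.0, 40.0]]

params = fit_center_scale(train)  # estimation step, training data only
train_scaled = apply_center_scale(params, train)
test_scaled = apply_center_scale(params, test)  # reuses the training parameters
print(test_scaled)
```

Keeping the two steps separate ensures that nothing from the test set leaks into the estimated parameters.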

8. Which of the following can also be used to find new variables that are linear combinations of the original set with independent components?
a) ICA
b) SCA
c) PCA
d) None of the mentioned

Answer: a
Explanation: ICA stands for independent component analysis.

9. Which of the following functions is used to generate the class distances?
a) preprocess.classDist
b) predict.classDist
c) predict.classDistance
d) all of the mentioned

Answer: b
Explanation: By default, the distances are logged.

10. The preProcess class can be used for many operations on predictors.
a) True
b) False

Answer: a
Explanation: Operations include centering and scaling.

250+ TOP MCQs on Time Deltas and Answers

Data Science Multiple Choice Questions on “Time Deltas”.

1. Which of the following operations are supported on Time Frames?
a) idxmax
b) ixmax
c) ixmin
d) none of the mentioned

Answer: a
Explanation: Operands can also appear in a reversed order.

2. Point out the correct statement.
a) Timedeltas are differences in times, expressed in different units
b) You can construct a Timedelta scalar through various arguments
c) DateOffsets cannot be used in construction
d) All of the mentioned

Answer: a
Explanation: Timedeltas can be both positive and negative.

3. Numeric reduction operation for timedelta64[ns] will return _________ objects.
a) Timeseries
b) Timeplus
c) Timedelta
d) None of the mentioned

Answer: c
Explanation: NaT are skipped during evaluation.
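A short pandas sketch of this behaviour, assuming pandas is installed; the values are arbitrary. Reductions on a timedelta Series return Timedelta scalars and skip NaT.

```python
import pandas as pd

# A timedelta Series with one missing value (NaT).
s = pd.Series(pd.to_timedelta(["1 days", "nat", "3 days"]))

total = s.sum()        # NaT is skipped
average = s.mean()     # mean of 1 day and 3 days
longest = s.max()
position = s.idxmax()  # index label of the largest value

print(type(average).__name__, average, position)
```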

4. Which of the following can be converted to other ‘frequencies’ by casting with astype to a specific timedelta type?
a) Timedelta Series
b) TimedeltaIndex
c) Timedelta
d) All of the mentioned

Answer: d
Explanation: These operations yield Series and propagate NaT -> nan.

5. Point out the wrong statement.
a) min, max, idxmin, idxmax operations are supported on Series
b) You cannot pass a timedelta to get a particular value
c) Division by the numpy scalar is true division
d) None of the mentioned

Answer: b
Explanation: Dividing or multiplying a timedelta64[ns] Series by an integer or integer Series yields another timedelta64[ns] dtype Series.
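The arithmetic above can be tried out with a small pandas sketch (pandas and NumPy assumed installed; values are arbitrary). Dividing by another timedelta is true division and gives floats, while multiplying by an integer keeps the timedelta dtype.

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.to_timedelta(["1 days", "nat", "36 hours"]))

# True division by a timedelta converts to a float "frequency"; NaT becomes NaN.
in_days = s / np.timedelta64(1, "D")
in_hours = s / pd.Timedelta(hours=1)

# Multiplying by an integer keeps the timedelta dtype.
doubled = s * 2

print(in_days.tolist(), doubled.dtype)
```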

6. Which of the following is used to generate an index with time delta?
a) TimeIndex
b) TimedeltaIndex
c) LeadIndex
d) None of the mentioned

Answer: b
Explanation: Using TimedeltaIndex you can pass string-like, Timedelta, timedelta, or np.timedelta64 objects.

7. Combination of TimedeltaIndex with DatetimeIndex allow certain combination operations that are NaT preserving.
a) True
b) False

Answer: a
Explanation: You can also convert indices to yield another index.
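A brief pandas sketch of the two preceding points, assuming pandas and NumPy are installed; the dates and durations are made up. A TimedeltaIndex accepts mixed inputs, and adding one to a DatetimeIndex keeps NaT in place.

```python
import datetime

import numpy as np
import pandas as pd

# Mixed inputs: a string, a NumPy timedelta, and a standard-library timedelta.
mixed = pd.TimedeltaIndex(
    ["1 days", np.timedelta64(2, "D"), datetime.timedelta(hours=12)]
)

# NaT-preserving combination with a DatetimeIndex.
tdi = pd.TimedeltaIndex(["1 days", pd.NaT, "2 days"])
dti = pd.date_range("2024-01-01", periods=3, freq="D")
shifted = dti + tdi

print(mixed)
print(shifted)
```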

8. Using _________ on categorical data will produce similar output to a Series or DataFrame of type string.
a) .desc()
b) .describe()
c) .rank()
d) none of the mentioned

Answer: b
Explanation: Categorical data has a categories property and an ordered property.

9. Which of the following methods can be used to rename categorical data?
a) Categorical.rename_categories()
b) Categorical.rename()
c) Categorical.mv_categories()
d) None of the mentioned

Answer: a
Explanation: Renaming categories is done with the rename_categories() method. Older pandas versions also allowed assigning new values to the Series.cat.categories property.

10. All values of categorical data are either in categories or np.nan.
a) True
b) False

Answer: a
Explanation: Categoricals are a pandas data type.
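The categorical behaviour covered in the last three questions can be checked with a short pandas sketch (pandas assumed installed; the labels are arbitrary).

```python
import pandas as pd

# Values outside the declared categories become NaN.
cat = pd.Categorical(["a", "b", "z"], categories=["a", "b"])

s = pd.Series(["a", "b", "a"], dtype="category")

# Rename categories with a mapping from old labels to new labels.
renamed = s.cat.rename_categories({"a": "x", "b": "y"})

# describe() reports count, unique, top and freq, as it does for strings.
summary = s.describe()

print(cat)
print(renamed.tolist())
print(summary)
```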

250+ TOP MCQs on Clustering and Answers

Data Science Multiple Choice Questions on “Clustering”.

1. K-means is not deterministic and it also consists of a number of iterations.
a) True
b) False

Answer: a
Explanation: K-means clustering produces the final estimate of cluster centroids.

2. Point out the correct statement.
a) The choice of an appropriate metric will influence the shape of the clusters
b) Hierarchical clustering is also called HCA
c) In general, the merges and splits are determined in a greedy manner
d) All of the mentioned

Answer: d
Explanation: Some elements may be close to one another according to one distance and farther away according to another.

3. Which of the following is finally produced by Hierarchical Clustering?
a) final estimate of cluster centroids
b) tree showing how close things are to each other
c) assignment of each point to clusters
d) all of the mentioned

Answer: b
Explanation: Hierarchical clustering is an agglomerative approach.

4. Which of the following is required by K-means clustering?
a) defined distance metric
b) number of clusters
c) initial guess as to cluster centroids
d) all of the mentioned

Answer: d
Explanation: K-means clustering follows partitioning approach.
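The three requirements listed in this question map directly onto a basic implementation. The following is a minimal sketch of Lloyd's algorithm in plain Python, assuming squared Euclidean distance and random data points as the initial centroid guess; the function name and test data are illustrative only.

```python
import random


def kmeans(points, k, n_iter=20, seed=None):
    """Lloyd's algorithm with squared Euclidean distance on tuples of floats."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial guess: k random data points
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters


rng = random.Random(0)
blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
blob_b = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(50)]
centroids, clusters = kmeans(blob_a + blob_b, k=2, seed=1)
print(sorted(centroids))
```

Because the starting centroids are drawn at random, different seeds can give different results on less clearly separated data, which is why k-means is described as not deterministic.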

5. Point out the wrong statement.
a) k-means clustering is a method of vector quantization
b) k-means clustering aims to partition n observations into k clusters
c) k-nearest neighbor is same as k-means
d) none of the mentioned

Answer: c
Explanation: k-nearest neighbor has nothing to do with k-means.

6. Which of the following combination is incorrect?
a) Continuous – euclidean distance
b) Continuous – correlation similarity
c) Binary – manhattan distance
d) None of the mentioned

Answer: d
Explanation: You should choose a distance/similarity that makes sense for your problem.
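For reference, the three measures named in the options can be written out as small Python functions. This is a plain illustration with hypothetical function names, not a library API.

```python
import math


def euclidean(u, v):
    """Straight-line distance, commonly used for continuous data."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of absolute differences; on 0/1 data it counts mismatched positions."""
    return sum(abs(a - b) for a, b in zip(u, v))


def correlation_similarity(u, v):
    """Pearson correlation between two continuous vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


print(euclidean((0, 0), (3, 4)))
print(manhattan((1, 0, 1, 1), (0, 0, 1, 0)))
print(correlation_similarity((1, 2, 3), (2, 4, 6)))
```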

7. Hierarchical clustering should be primarily used for exploration.
a) True
b) False

Answer: a
Explanation: Hierarchical clustering is deterministic.

8. Which of the following functions is used for k-means clustering?
a) k-means
b) k-mean
c) heatmap
d) none of the mentioned

Answer: a
Explanation: K-means requires a number of clusters.

9. Which of the following clustering methods requires a merging approach?
a) Partitional
b) Hierarchical
c) Naive Bayes
d) None of the mentioned

Answer: b
Explanation: Hierarchical clustering requires a defined distance as well.
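The merging approach can be sketched as a greedy loop. The example below is a simplified illustration, assuming one-dimensional points and single linkage (distance between the nearest members of two clusters); real implementations support other linkages and metrics.

```python
def single_linkage(points):
    """Greedy agglomerative clustering of 1-D points; returns the merge history."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters (single linkage: nearest members).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Record the merge, then fuse cluster j into cluster i.
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges


merges = single_linkage([0.0, 1.0, 10.0, 12.0, 50.0])
for left, right, height in merges:
    print(left, "+", right, "at distance", height)
```

The recorded merge distances are the heights at which branches join in the resulting tree, and the same input always produces the same merge history.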

250+ TOP MCQs on caret and Answers

Data Science Multiple Choice Questions on “Caret”.

1. varImp is a wrapper around the evimp function in the _______ package.
a) numpy
b) earth
c) plot
d) none of the mentioned

Answer: b
Explanation: The earth package is an implementation of Jerome Friedman’s Multivariate Adaptive Regression Splines.

2. Point out the wrong statement.
a) The trapezoidal rule is used to compute the area under the ROC curve
b) For regression, the relationship between each predictor and the outcome is evaluated
c) An argument, para, is used to pick the model fitting technique
d) All of the mentioned

Answer: c
Explanation: An argument, nonpara, is used to pick the model fitting technique.

3. Which of the following curve analyses is conducted on each predictor for classification?
a) NOC
b) ROC
c) COC
d) All of the mentioned

Answer: b
Explanation: For two class problems, a series of cutoffs is applied to the predictor data to predict the class.
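The cutoff sweep and the trapezoidal rule can be sketched as follows. This is a plain Python illustration for a single predictor in a two-class problem, not caret's code; the function name `roc_auc` and the toy data are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the trapezoidal rule.

    labels are 1 (positive) or 0 (negative); higher scores suggest the positive class.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sweep cutoffs from high to low; each distinct score is one cutoff.
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    # Trapezoidal rule over consecutive (false positive rate, true positive rate) points.
    return sum(
        (x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


print(roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
print(roc_auc([0.9, 0.7, 0.6, 0.3], [1, 0, 1, 0]))
```

A predictor that separates the classes perfectly gets an area of 1, and partial overlap between the classes lowers it.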

4. Which of the following functions tracks the changes in model statistics?
a) varImp
b) varImpTrack
c) findTrack
d) none of the mentioned

Answer: a
Explanation: GCV change value can also be tracked.

5. Point out the correct statement.
a) The difference between the class centroids and the overall centroid is used to measure the variable influence
b) The Bagged Trees output contains variable usage statistics
c) Boosted Trees use a different approach from a single tree
d) None of the mentioned

Answer: a
Explanation: The larger the difference between the class centroid and the overall center of the data, the larger the separation between the classes.

6. Which of the following models includes a backwards elimination feature selection routine?
a) MCV
b) MARS
c) MCRS
d) All of the mentioned

Answer: b
Explanation: MARS stands for Multivariate Adaptive Regression Splines.

7. The advantage of using a model-based approach is that it is more closely tied to the model performance.
a) True
b) False

Answer: a
Explanation: Model-based approach is able to incorporate the correlation structure between the predictors into the importance calculation.

8. Which of the following models sums the importance over each boosting iteration?
a) Boosted trees
b) Bagged trees
c) Partial least squares
d) None of the mentioned

Answer: a
Explanation: gbm package can be used here.

9. Which of the following arguments is used to set importance values?
a) scale
b) set
c) value
d) all of the mentioned

Answer: a
Explanation: All measures of importance are scaled to have a maximum value of 100, unless the scale argument is set to FALSE.

10. For most classification models, each predictor will have a separate variable importance for each class.
a) True
b) False

Answer: a
Explanation: The exceptions are classification trees, bagged trees and boosted trees.