ML Interview Questions and Answers
1. What is K-Means clustering?
Ans. K-Means is an unsupervised learning algorithm. K-Means clustering is the task of
grouping a set of data points in such a way that data points in the same group (called a
cluster) are closer to each other than to those in other groups (clusters).
Steps for K-Means clustering: -
1. In the dataset provided, consider the variables required for clustering
2. Randomly Initialize the cluster centroid
3. Calculate Euclidean distance between each observation and initial cluster centroids
4. Based on Euclidean distance each observation is assigned to one of the clusters -
based on minimum distance.
5. Update cluster centroid by taking the mean of the variables in a cluster
6. Repeat the process till convergence is achieved i.e. there is no further change in the
cluster centroids.
Example:- Segmenting individuals into different clusters based on their height and weight (see the sketch below)
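A minimal sketch of these steps, assuming scikit-learn is available (the height/weight values below are made up for illustration):

# A minimal K-Means sketch on made-up height/weight data (values are illustrative only).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Height (cm) and weight (kg) for a handful of individuals
X = np.array([[150, 50], [152, 53], [160, 60],
              [175, 80], [178, 85], [182, 90]])

# Standardise so that both variables contribute comparably to Euclidean distance
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means with 2 clusters; the algorithm iterates the assign/update steps until convergence
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_scaled)

print("Cluster labels:", kmeans.labels_)
print("Cluster centroids (scaled space):", kmeans.cluster_centers_)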
Additional Information:-
Link for various distance measures:-
http://www.sthda.com/english/articles/26-clustering-basics/86-clustering-distance-measures-essentials/
Steps and pseudo code for K-means clustering:-
http://mnemstudio.org/clustering-k-means-example-1.htm
2. What are the important considerations in K-means clustering?
Ans. a) The scale of measurement influences Euclidean distance, so variable standardisation becomes necessary (illustrated in the sketch after this list)
b) Outlier treatment is necessary depending on the problem statement
c) K-Means clustering may be biased by the initial centroids, called cluster seeds
d) The number of clusters to be created is an input to the algorithm and it
impacts the clusters getting created
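A small illustration of point (a): without standardisation, the variable with the larger scale dominates the Euclidean distance (the numbers are made up):

# Illustration of how measurement scale dominates Euclidean distance (made-up numbers).
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two customers: [annual income in rupees, age in years]
a = np.array([500000.0, 25.0])
b = np.array([520000.0, 60.0])
print("Raw distance:", np.linalg.norm(a - b))    # dominated entirely by the income difference

# Standardising (here on a small sample of customers) puts both variables on a comparable scale
sample = np.array([[500000, 25], [520000, 60], [300000, 40], [800000, 35]], dtype=float)
scaled = StandardScaler().fit_transform(sample)
print("Scaled distance:", np.linalg.norm(scaled[0] - scaled[1]))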
3. How is the number of clusters identified in K-means clustering?
Ans Elbow method:-
a) Compute clustering algorithm (e.g., k-means clustering) for different values of k. For
instance, by varying k from 1 to 10 clusters.
b) For each k, calculate the total within-cluster sum of squares (WSS).
c) Plot the curve of wss according to the number of clusters k.
d) The location of a bend (knee) in the plot is generally considered as an indicator of the
appropriate number of clusters.
Average silhouette Method
A. Compute clustering algorithm (e.g., k-means clustering) for different values of k. For
instance, by varying k from 1 to 10 clusters.
B. For each k, calculate the average silhouette of observations (avg.sil).
C. Plot the curve of avg.sil according to the number of clusters k.
D. The location of the maximum is considered as the appropriate number of clusters (both methods are sketched below).
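A minimal sketch of both methods, assuming scikit-learn; the data are generated with make_blobs purely for illustration:

# Elbow (WSS) and average-silhouette curves for k = 2..10 (data generated only for illustration).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wss = km.inertia_                      # total within-cluster sum of squares
    avg_sil = silhouette_score(X, km.labels_)
    print(f"k={k}: WSS={wss:.1f}, avg silhouette={avg_sil:.3f}")

# Plot WSS vs k and look for the "elbow"; plot avg silhouette vs k and pick the maximum.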
4. What is an ensemble model?
Ans. An ensemble is a collection of multiple models (usually supervised) which is used to
obtain better predictive performance than any of the individual models.
The two most common types of ensembles are bagging and boosting.
Bagging stands for bootstrapped aggregation, and RF is the most common implementation
of bagging.
-- If asked to explain what is random forest:
An RF is basically a collection of multiple decision trees, say 500. To make a decision, a
majority vote of the 500 trees is taken for each data point.
To build an RF, we take bootstrapped samples, i.e. we sample with replacement (e.g. take a random 40% of the data points from the training data n times, and use them to build n trees). This ensures that each decision tree is trained on a different training set and is evaluated on the data points that were not in its 40% sample; this is called the out-of-bag (OOB) error, since the evaluation is done on points not used for training. If each tree is performing well (as measured by OOB error), the entire forest is likely performing well, i.e. if the average OOB error is low, we can be confident that the model (RF) will not overfit.
Also, each node in a tree is split using only a random subset of the features. This is because if all the features were available at every node, the top nodes of each tree (the important ones) would almost always contain the most important variables, and all the trees would look similar. This is not desirable because we want diversity in the ensemble model (the entire forest): if all trees are similar, there is no point taking a majority vote. But if the trees are different, the majority vote is unlikely to be a result of overfitting, since even if some trees overfit, the others likely will not.
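A minimal sketch of these ideas, assuming scikit-learn: bootstrapped trees, a random feature subset at each split (max_features), and the out-of-bag score as a built-in check (the data are synthetic, for illustration only):

# Random forest with out-of-bag (OOB) evaluation (synthetic data, illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

rf = RandomForestClassifier(
    n_estimators=500,       # 500 bootstrapped trees
    max_features="sqrt",    # each split considers only a random subset of features
    oob_score=True,         # evaluate each tree on the points it did not see
    random_state=42,
)
rf.fit(X, y)

print("OOB accuracy:", rf.oob_score_)   # a high OOB accuracy suggests the forest generalises well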
-- If asked about boosting: Read below
5. Difference between Bagging and Boosting?
Ans. Bagging and Boosting are "meta-algorithms": approaches that combine several machine learning models into one predictive model. Their purpose is to decrease the variance (bagging) or the bias (boosting), and thereby improve the predictions. Each approach consists of two steps:
1. Producing a distribution of simple ML models on subsets of the original data.
2. Combining the distribution into one "aggregated" model.
Here is a short description of the two methods: in bagging, models are trained in parallel on bootstrapped samples and their predictions are averaged or majority-voted, which mainly reduces variance; in boosting, weak models are trained sequentially, with each new model giving more weight to the examples the previous models got wrong, which mainly reduces bias.
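A minimal sketch contrasting the two, assuming scikit-learn's BaggingClassifier and GradientBoostingClassifier as stand-ins for generic bagging and boosting (synthetic data, illustrative only):

# Bagging (parallel, variance reduction) vs boosting (sequential, bias reduction) on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bagging: many trees fit on bootstrap samples (the default base learner is a decision tree),
# with predictions combined by majority vote
bagging = BaggingClassifier(n_estimators=100, random_state=0)

# Boosting: shallow trees fit sequentially, each one focusing on the errors of the previous ones
boosting = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)

print("Bagging CV accuracy :", cross_val_score(bagging, X, y, cv=5).mean())
print("Boosting CV accuracy:", cross_val_score(boosting, X, y, cv=5).mean())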
If the actual label of a data point, y(xi), is zero and the predicted probability for xi is one, the cost of the logistic function is very high. Similarly, if the actual label y(xi) and the predicted probability for xi agree, the cost is close to zero. So we need to find an estimate β^ such that the cost function is minimised. The logistic cost function is a convex function, so we do not need to worry about local minima. However, unlike linear regression, it is not possible to find the global minimum with a closed-form solution, because the sigmoid function is nonlinear; iterative methods such as gradient descent are used instead.
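For reference, the logistic (cross-entropy) cost being described can be written (in LaTeX notation) as:

J(\beta) = -\frac{1}{n}\sum_{i=1}^{n}\left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right],
\qquad p_i = \frac{1}{1 + e^{-x_i^{T}\beta}}

Minimising J(\beta) with respect to \beta (e.g. by gradient descent) gives the estimate β^.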
Ans. Filter methods: These are generally used as a preprocessing step. The selection of features is independent of any machine learning algorithm; instead, features are selected on the basis of their scores in various statistical tests of their correlation with the outcome variable.
Wrapper methods: Here we try to use a subset of features and train a model using them; the problem is essentially reduced to a search problem. These methods are usually computationally very expensive. Some common examples of wrapper methods are forward feature selection, backward feature elimination, recursive feature elimination, etc.
Embedded methods: Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have their own built-in feature selection, the most popular examples being LASSO and Ridge regression, which have inbuilt penalisation functions to reduce overfitting.
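A minimal sketch of all three families, assuming scikit-learn; the particular test, estimator and penalty used here are illustrative choices:

# Filter, wrapper and embedded feature selection on synthetic regression data.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression, Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Filter: score each feature with a univariate statistical test, keep the top k
filt = SelectKBest(score_func=f_regression, k=4).fit(X, y)
print("Filter keeps features  :", filt.get_support(indices=True))

# Wrapper: recursive feature elimination repeatedly trains a model and drops the weakest feature
rfe = RFE(estimator=LinearRegression(), n_features_to_select=4).fit(X, y)
print("Wrapper keeps features :", rfe.get_support(indices=True))

# Embedded: LASSO's L1 penalty drives uninformative coefficients to (near) zero during training
lasso = Lasso(alpha=1.0).fit(X, y)
print("Embedded keeps features:", [i for i, c in enumerate(lasso.coef_) if abs(c) > 1e-6])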
The main differences between the filter and wrapper methods for feature selection
are:
● Filter methods measure the relevance of features by their correlation with
dependent variable while wrapper methods measure the usefulness of a subset of
feature by actually training a model on it.
● Filter methods are much faster compared to wrapper methods as they do not
involve training the models. On the other hand, wrapper methods are
computationally very expensive as well.
● Filter methods use statistical methods for evaluation of a subset of features while
wrapper methods use cross validation.
● Filter methods may fail to find the best subset of features on many occasions, whereas wrapper methods search for a good subset directly and can usually find a better one.
● Using the subset of features from a wrapper method makes the model more prone to overfitting than using a subset from a filter method.
Additional Information:-
https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/
21. Difference between linear and logistic regression?
Ans. The differences between linear and logistic regression are as follows:-
a) Equation: linear regression fits y = β0 + β1x + ε, while logistic regression models log(p/(1-p)) = β0 + β1x, i.e. p = 1 / (1 + e^-(β0 + β1x)).
b) Curve: linear regression aims at finding the best-fitting straight line, also called the regression line. In logistic regression, changing the coefficients changes both the direction and the steepness of the S-shaped logistic curve: positive slopes give an S-shaped curve and negative slopes give a Z-shaped curve.
c) Error term: linear regression requires the error term to be normally distributed; logistic regression does not.
d) Link function: linear regression uses the identity link function of the Gaussian family; logistic regression uses the logit function of the binomial family.
24. In which cases you would use Generalized linear models?
Ans. The cases in which Generalized linear models can be used are as follows:-
(a) If the response variable is categorical
(b) When the distribution of the residuals is non-normal or non-Gaussian
(c) If the distribution of the residuals belongs to the exponential family, e.g. the gamma distribution (a sketch of such a model follows below)
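A minimal sketch, assuming the statsmodels library, for a positive and right-skewed (non-normal) response; the data are simulated only to illustrate the API:

# Generalized linear model with a Gamma response and log link (simulated data, illustrative only).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=200)
mu = np.exp(1.0 + 0.5 * x)                      # mean depends on x through a log link
y = rng.gamma(shape=2.0, scale=mu / 2.0)        # positive, skewed (non-normal) response

X = sm.add_constant(x)
model = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log()))
print(model.fit().summary())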
25. Why is a naïve Bayes model called naïve?
Ans Naïve Bayes machine learning algorithm is considered Naïve because the
assumptions the algorithm makes are virtually impossible to find in real-life data.
Conditional probability is calculated as a pure product of individual probabilities of
components. This means that the algorithm assumes the presence or absence of a specific
feature of a class is not related to the presence or absence of any other feature (absolute
independence of features), given the class variable. For instance, a fruit may be considered
to be a banana if it is yellow, long and about 5 inches in length. However, if these features
depend on each other or are based on the existence of other features, a naïve Bayes
classifier will assume all these properties to contribute independently to the probability
that this fruit is a banana. The assumption that all features in a given dataset are equally important and independent rarely holds in real-world scenarios.
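A minimal sketch of this independence assumption in action, assuming scikit-learn's GaussianNB (the Iris data are used only for illustration):

# Naive Bayes: class-conditional probabilities are multiplied as if features were independent.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)   # fits one univariate Gaussian per feature per class
print("Test accuracy:", nb.score(X_test, y_test))
print("Class probabilities for one flower:", nb.predict_proba(X_test[:1]))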
26. How do you handle class imbalance in a dataset. Explain what could you do at the
data level and at the model level?
Ans. a) Collecting more data
b) Changing the performance metric (it has been observed in many cases that accuracy does not work well with imbalanced datasets)
c) Resampling the dataset: add samples of the class that is under-represented (oversampling) or delete samples of the class that is over-represented (undersampling)
d) Generating synthetic samples, e.g. by randomly sampling the attributes from instances of the minority class (SMOTE is a popular technique of this kind)
e) Spot checking of different algorithms
f) Penalizing the models: penalized classification imposes an additional cost on the model for making classification mistakes on the minority class during training. These penalties can bias the model to pay more attention to the minority class. (A simple sketch of options (c) and (f) follows below.)
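A minimal sketch of points (c) and (f), assuming scikit-learn; simple random oversampling is shown here, while libraries such as imbalanced-learn provide SMOTE and more principled resampling:

# Handling imbalance: random oversampling of the minority class and a class-weighted model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# (c) Oversample the minority class by resampling its rows with replacement
minority = np.where(y_tr == 1)[0]
extra = np.random.default_rng(0).choice(minority, size=4 * len(minority), replace=True)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])
oversampled = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# (f) Penalize mistakes on the minority class more heavily via class weights
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

print(classification_report(y_te, oversampled.predict(X_te)))
print(classification_report(y_te, weighted.predict(X_te)))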
Additional information:-
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
27. What is the difference between R2 and adj-R2?
Ans. One major difference between R-squared and the adjusted R-squared is that R-squared assumes that every independent variable in the model helps explain the variation in the dependent variable. R-squared cannot tell whether the coefficient estimates and predictions are biased, and it does not by itself show whether a regression model is adequate: a good model can have a low R-squared, and a model that does not fit the data can have a high R-squared.
The adjusted R-squared compares the descriptive power of regression models that include different numbers of predictors. Every predictor added to a model increases R-squared and never decreases it, so a model with more terms may seem to have a better fit simply because it has more terms. The adjusted R-squared compensates for the addition of variables: it increases only if a new term improves the model by more than would be expected by chance, and it decreases when a predictor improves the model by less than expected by chance. In an overfitting situation, an inflated value of R-squared is obtained even though the ability to predict on new data decreases; this is not the case with the adjusted R-squared.
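For reference, with n observations and p predictors the adjustment is (in LaTeX notation):

R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}

so adding a predictor raises the adjusted R-squared only if it improves the fit by more than chance alone would.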
28. Why is "having" used when "where" is already used ?
Both the HAVING and WHERE clauses filter data, with one key difference: the WHERE clause filters individual rows, while the HAVING clause filters on aggregated values, so it applies only to groups as a whole. To see why HAVING is needed when WHERE already exists, consider an example: you are booking a movie ticket at a specific type of theatre, e.g. IMAX (as you have a specific taste), but due to a tight budget you want to list ONLY the IMAX theatres whose average movie ticket price is below Rs. 300, or only the IMAXs with more than 5 shows per day.
Here the WHERE clause eliminates the theatres that are not IMAX before the average ticket price is calculated. To filter on the average ticket price we need the HAVING clause, since that requires both grouping and summarizing, e.g. count(*) or avg(). Also, when using HAVING you must have a GROUP BY clause in the query.
https://docs.microsoft.com/en-us/sql/ssms/visual-db-tools/use-having-and-where-clauses-in-the-same-query-visual-database-tools
http://www.java2s.com/Code/Oracle/Select-Query/UsingtheWHEREGROUPBYandHAVINGClausesTogether.htm
29. What are window functions?
A window function performs a calculation across a set of rows and returns a value for each row. This is similar to the kind of calculation that can be done with an aggregate function, but with an important difference.
When we use aggregate functions with the GROUP BY clause, we “lose” the individual rows.
We can’t mix attributes from an individual row with the results of an aggregate function;
the function is performed on the rows as an entire group. But unlike regular aggregate
functions, use of a window function does not cause rows to become grouped into a single
output row — the rows retain their separate identities. We can generate a result set with
some attributes of an individual row together with the results of the window function. This
makes windowing one of the coolest features of SQL.
e.g., if you want to compare each player's auction value in the IPL to the average auction amount spent per player of his team, in the same table:
SELECT team_name, player_name, auction_amt,
       avg(auction_amt) OVER (PARTITION BY team_name)
FROM auction_table;
Refer PostgreSQL documentation:
https://www.postgresql.org/docs/9.1/static/tutorial-window.html
https://community.modeanalytics.com/sql/tutorial/sql-window-functions/
30. Difference between logistic and linear regression.Is logistic regression a linear
model? Why or why not?
Linear regression: This algorithm's principle is to find a linear relation within your data. Once the linear relation is found, predicting a new value is done with respect to this relation. Linear regression is used when the desired output takes a continuous value based on whatever input/dataset is given to the algorithm.
Let us say, you ask a child in fifth grade to arrange people in his class by increasing order of
weight, without asking them their weights! What do you think the child will do? He / she
would likely look (visually analyze) at the height and build of people and arrange them
using a combination of these visible parameters. This is linear regression in real life! The child has actually figured out that height and build are correlated with weight through a relationship that looks like the linear equation weight ≈ β0 + β1·height + β2·build.
Logistic Regression: Don't go by its name! It is a classification algorithm, not a regression algorithm. Let's say your friend gives you a puzzle to solve. There are only 2 outcome scenarios – either you solve it or you don't.
Linear regression is used when the output is continuous in nature based on the corresponding input; it can take any one of an infinite number of possible values. Consider the weather forecasting problem where you want to predict tomorrow's temperature, % humidity, etc.
Now suppose your problem was not to predict the average temperature or % humidity, but what type of day it will be (e.g., sunny, cloudy, stormy, rainy, etc.). This problem gives an output belonging to a predefined set of values, hence it is basically classifying your output into categories. Classification problems can be either binary (yes/no, 0/1, like you either solve the problem or not) or multiclass (like the problem described above). Logistic regression is used for classification problems in machine learning.
https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/
For mathematical POV:
https://medium.com/deep-math-machine-learning-ai/chapter-1-complete-linear-regression-with-math-25b2639dde23
https://medium.com/deep-math-machine-learning-ai/chapter-2-0-logistic-regression-with-math-e9cbb3ec6077
Also, logistic regression is called a generalized linear model not because the estimated probability of the response event is linear, but because the logit of the estimated probability of the response is a linear function of the parameters.
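In symbols, the probability p itself is a nonlinear (sigmoid) function of the inputs, but the logit is linear in the parameters (LaTeX notation):

\operatorname{logit}(p) = \log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,
\qquad p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}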
https://www.slideshare.net/SatishGupta4/ihcc-logistic-regression
Linear means linear (degree = 1) in the betas (the coefficients), but not necessarily in the x's (the independent variables).
https://stats.stackexchange.com/questions/88603/why-is-logistic-regression-a-linear-model
31. What is a p-value, how do you read it, and how do you calculate it?
Imagine that India, at full strength (all top players in their best form), got into a one-on-one match with Zimbabwe, but it turned out that India lost the game.
Fans were stunned. And frustrated. And angry.
The reasoning goes like this: if India had played as usual, they would have been highly unlikely to be defeated. But they lost the game! So fans had every reason to cast doubt on the team's fair play (some might even allege match fixing).
To put it another way, the reasoning goes like this:
We have a hypothesis, that India rocks as usual. If the hypothesis had been true, the probability of India losing would have been very small, say, less than 5%. But India lost the game.
So the unlikelihood was taken as evidence against the team's fair play.
You may say the p-value is a measure of the weirdness of your observations according to your current beliefs: the smaller it is, the weirder. You believe in lots of things; for some of them, reality reconfirms your beliefs by giving you expected outcomes, but for others, reality challenges you by throwing weird, unexpected outcomes at you, and at some point you can't deny it any more, so you start to realise that what you once believed may be wrong.
The P value, or calculated probability, is the probability of finding the observed, or more
extreme, results when the null hypothesis (H0) of a study question is true – the definition of
‘extreme’ depends on how the hypothesis is being tested. The p-value is defined as the
probability, under the null hypothesis H, of obtaining a result equal to or more extreme
than what was actually observed.
Calculation of P-value:
Look up your test statistic on the appropriate distribution, in this case the standard normal (Z) distribution (using a Z-table).
Pr(X >= x | H) for a right-tail event
Pr(X <= x | H) for a left-tail event
2 min(Pr(X >= x | H), Pr(X <= x | H)) for a double-tail event
Relationship between significance level and p-value: the p-value is the smallest significance level at which the null hypothesis would be rejected.
Reject the null hypothesis if P is "small".
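A minimal sketch of these lookups, assuming scipy and a standard normal test statistic z (the value of z below is made up):

# One- and two-tailed p-values for a standard normal test statistic (illustrative z value).
from scipy import stats

z = 1.8                                  # made-up observed test statistic

p_right = stats.norm.sf(z)               # Pr(X >= z | H0), right-tail event
p_left = stats.norm.cdf(z)               # Pr(X <= z | H0), left-tail event
p_two = 2 * min(p_right, p_left)         # double-tail event

print(p_right, p_left, p_two)
# Reject H0 at significance level alpha if the relevant p-value < alpha (e.g. 0.05).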
https://en.wikipedia.org/wiki/P-value
http://www.perfendo.org/docs/BayesProbability/twelvePvaluemisconceptions.pdf
https://www.students4bestevidence.net/p-value-in-plain-english-2/
https://www.quora.com/What-is-a-p-value-explained-in-layman%E2%80%99s-terms
To know how to do Hypothesis testing:
http://people.cas.uab.edu/~mpogwizd/ma180-fall-2014/HypothesisTesting.pdf
32. What is the difference between the median and the mean, and when do you choose which?
The average (mean) is the sum of a set of numbers divided by the count of numbers in the
data set.
Whereas Median is the middle number in the data set, which can be determined by sorting
the numbers in order and finding the middle number in the data set
Now, which one to choose depends on the data distribution and the purpose.
The mean is the average value: if you want to find the per capita income, i.e. "What is the average income of the country?", you use the mean. In terms of accuracy, for a bell-shaped (symmetric) population distribution the mean is the more accurate measure.
The median is more of a central measure, where you draw a middle line: "What is the income of an average person?" On that basis you can tell who is below the poverty line. For a heavy-tailed distribution, or if the data are skewed in one direction or the other (which is the usual case), the median is the more accurate measure.
The forte of the median shows when you want to handle outliers, as in the small illustration below.
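A small illustration of how a single outlier pulls the mean but barely moves the median (the incomes are made up):

# Mean vs median on skewed data with an outlier (made-up numbers).
import numpy as np

incomes = np.array([25_000, 30_000, 32_000, 35_000, 40_000])
print(np.mean(incomes), np.median(incomes))              # 32400.0 and 32000.0

incomes_with_outlier = np.append(incomes, 10_000_000)    # one very rich person joins
print(np.mean(incomes_with_outlier), np.median(incomes_with_outlier))
# The mean jumps to roughly 1.69 million while the median only moves to 33500.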
https://learnandteachstatistics.wordpress.com/2013/04/29/median/
https://math.stackexchange.com/questions/2304710/mean-vs-median-when-to-use
33. What is VIF?
A variance inflation factor(VIF) detects multicollinearity (a predictor/independent variable
can be linearly predicted from the others with a significant accuracy) in regression analysis.
Multicollinearity is when there’s correlation between predictors (i.e. independent variables)
in a model;
its presence can adversely affect your regression results. The VIF estimates how much the variance of a regression coefficient is inflated due to the multicollinearity existing in the model.
VIFs are calculated by taking a predictor and regressing it against every other predictor in the model. This gives the R-squared value R_i^2 for predictor i (e.g. x1 or x2), which is then plugged into the VIF formula:
VIF_i = 1 / (1 - R_i^2)
R^2 is the coefficient of determination (the proportion of the variance in the dependent variable that is predictable from the independent variables).
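A minimal sketch that computes VIFs by regressing each predictor on the others, assuming scikit-learn (the data are synthetic, with x1 and x2 deliberately made nearly collinear):

# VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing predictor i on the other predictors.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)                        # independent of the others
X = np.column_stack([x1, x2, x3])

for i in range(X.shape[1]):
    others = np.delete(X, i, axis=1)
    r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
    print(f"VIF for x{i + 1}: {1 / (1 - r2):.1f}")
# x1 and x2 get very large VIFs; x3 stays close to 1.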
https://en.wikipedia.org/wiki/Coefficient_of_determination
https://onlinecourses.science.psu.edu/stat501/node/347
https://en.wikipedia.org/wiki/Variance_inflation_factor
Let's say the hypothesized mean wait time for an Ola cab is 5 minutes. If your random sample had a mean wait time of 5.1 minutes, the signal is 5.1 - 5 = 0.1 minutes. The difference is relatively small, so the signal in the numerator is weak.
However, if the customers in your random sample had a mean wait time of 8 minutes, the difference is much larger: 8 - 5 = 3 minutes, so the signal is stronger.
The denominator is the noise: a measure of variability known as the standard error of the mean, s/sqrt(n), so the t-statistic is t = (sample mean - hypothesized mean) / (s/sqrt(n)).
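A minimal sketch of the one-sample t-test behind this signal-to-noise picture, assuming scipy (the wait times are made up):

# One-sample t-test: t = (sample mean - hypothesized mean) / (s / sqrt(n)); made-up wait times.
import numpy as np
from scipy import stats

waits = np.array([4.5, 6.2, 5.8, 7.9, 8.4, 9.1, 7.2, 8.8, 6.9, 8.0])   # minutes
t_stat, p_value = stats.ttest_1samp(waits, popmean=5.0)

print("sample mean:", waits.mean())
print("t statistic:", t_stat, "p-value:", p_value)
# A large |t| (a strong signal relative to the standard error) gives a small p-value.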
http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-t-tests-t-values-and-t-distributions
http://blog.minitab.com/blog/adventures-in-statistics-2/understanding-t-tests%3A-1-sample%2C-2-sample%2C-and-paired-t-tests
36. How do you know if the clusters generated are good?
To measure the quality of clustering results, there are two kinds of validity indices: external
indices and internal indices.
An external index is a measure of agreement between two partitions where the first
partition is the a priori known clustering structure, and the second results from the
clustering procedure (Dudoit et al., 2002).
Internal indices are used to measure the goodness of a clustering structure without
external information (Tseng et al., 2005).
For external indices, we evaluate the results of a clustering algorithm based on a known
cluster structure of a data set (or cluster labels).
For internal indices, we evaluate the results using quantities and features inherent in the
data set. The optimal number of clusters is usually determined based on an internal validity
index.
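A minimal sketch of one internal index (silhouette) and one external index (adjusted Rand index), assuming scikit-learn; the data and true labels come from make_blobs purely for illustration:

# Internal index (silhouette, no labels needed) vs external index (ARI, compares to known labels).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, true_labels = make_blobs(n_samples=300, centers=3, random_state=42)
pred_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

print("Silhouette (internal)       :", silhouette_score(X, pred_labels))
print("Adjusted Rand index (external):", adjusted_rand_score(true_labels, pred_labels))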
http://www.sthda.com/english/articles/29-cluster-validation-essentials/96-determining-the-optimal-number-of-clusters-3-must-know-methods/
https://link.springer.com/article/10.1007/s40595-016-0086-9
If your unsupervised learning method is probabilistic, another option is to evaluate some probability measure (log-likelihood, perplexity, etc.) on held-out data. The motivation here is that if the method assigns high probability to similar data that wasn't used to fit the parameters, it has probably done a good job of capturing the distribution of interest.
http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
https://stats.stackexchange.com/questions/21807/evaluation-measure-of-clustering-without-having-truth-labels
https://www.analyticsvidhya.com/blog/2013/11/getting-clustering-right-part-ii/
37. What is ANOVA?
ANOVA checks the variation among and between the considered groups; basically, you are testing whether there is a difference between the group means or not. For example: students from different coaching institutes are taking the IIT JEE and we want to see which coaching institute gives better results. Or a researcher conducts a study to investigate the effect of 3 different teaching methods on the reading ability of school children.
T-test vs ANOVA
It's very similar to the t-test, just for more than two groups. When comparing only two groups (A and B), you test the difference between them with a Student t-test. So when comparing three groups (A, B, and C), it is natural to think of testing each of the three possible two-group comparisons (A – B, A – C, and B – C) with a t-test.
But running an exhaustive set of two-group t-tests can be risky, because as the number of groups goes up, the number of two-group comparisons goes up even faster.
So here ANOVA comes to the rescue.
The underlying quantity is the sample variance, s^2 = sum((x_i - x_bar)^2) / (n - 1), where n - 1 is the degrees of freedom (DF), the summation is called the sum of squares (SS), the result is called the mean square (MS), and the squared terms are deviations from the sample mean.
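A minimal sketch of a one-way ANOVA for the three-teaching-methods example, assuming scipy (the reading scores are made up):

# One-way ANOVA comparing three groups (made-up reading scores for three teaching methods).
from scipy import stats

method_a = [72, 75, 78, 80, 74]
method_b = [68, 70, 65, 72, 69]
method_c = [85, 88, 90, 84, 87]

f_stat, p_value = stats.f_oneway(method_a, method_b, method_c)
print("F statistic:", f_stat, "p-value:", p_value)
# A small p-value suggests at least one group mean differs from the others.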
https://www.edanzediting.com/blogs/statistics-anova-explained
http://www.dummies.com/education/science/biology/the-basic-idea-of-an-analysis-of-variance-anova/
It always helps to practice ANOVA; a few problems to understand it:
http://rstudio-pubs-static.s3.amazonaws.com/228015_d8d0ddab79664707890681a9a75cf16d.html
38. Basics of Neural Networks?
https://towardsdatascience.com/a-gentle-introduction-to-neural-networks-series-part-1-2b90b87795bc
Basics of convolutional neural networks (CNNs) and recurrent neural networks (RNNs):
https://medium.com/machine-learning-for-humans/neural-networks-deep-learning-cdad8aeae49b
39. What is KS statistics?
The Kolmogorov Smirnov test is basically a goodness-of-fit test. It compares the cumulative distribution function of a variable with a "specified distribution". Suppose that we have an i.i.d. sample X1, . . . , Xn with some unknown distribution "D" and we would like to test the hypothesis that "D" is equal to a particular distribution "D0", i.e. H0: D = D0. The two-sample version of the KS test tries to determine whether two datasets differ significantly or not. The advantage of the KS test is that it is agnostic to the distribution of the samples considered.
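A minimal sketch of a one-sample and a two-sample KS test, assuming scipy (the samples are simulated only for illustration):

# One-sample KS test against a reference distribution, and two-sample KS test between datasets.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample1 = rng.normal(loc=0, scale=1, size=500)
sample2 = rng.exponential(scale=1.0, size=500)

print(stats.kstest(sample1, "norm"))        # is sample1 consistent with N(0, 1)?
print(stats.ks_2samp(sample1, sample2))     # do the two samples come from the same distribution?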
Programming POV: http://daithiocrualaoich.github.io/kolmogorov_smirnov/
Mathematics POV:
https://ocw.mit.edu/courses/mathematics/18-443-statistics-for-applications-fall-2006/lecture-notes/lecture14.pdf
40. What is cross-Validation? Why is it important ? What are the different methods
of cross validation?
Ans. There is always a need to validate the stability of a machine learning model. There is no assurance that the model created will work well on unseen data; we need some kind of assurance that the model has captured most of the patterns in the data correctly and is not picking up too much of the noise, or in other words that it is low on bias and variance.
Validation
This process of deciding whether the numerical results quantifying hypothesized relationships between variables are acceptable as descriptions of the data is known as validation. Generally, an error estimate for the model is made after training, better known as evaluation of residuals: a numerical estimate of the difference between the predicted and original responses, also called the training error. However, this only gives us an idea of how well our model does on the data used to train it; the model may still be underfitting or overfitting that data. So the problem with this evaluation technique is that it does not indicate how well the learner will generalize to an independent/unseen data set. Getting this idea about our model is known as cross validation.
Holdout Method
Now a basic remedy for this involves removing a part of the training data and using it to get
predictions from the model trained on rest of the data. The error estimation then tells how
our model is doing on unseen data or the validation set. This is a simple kind of cross
validation technique, also known as the holdout method. Although this method doesn’t
take any overhead to compute and is better than traditional validation, it still suffers from
issues of high variance. This is because it is not certain which data points will end up in the
validation set and the result might be entirely different for different sets.
K-Fold Cross Validation
As there is never enough data to train your model, removing a part of it for validation
poses a problem of underfitting. By reducing the training data, we risk losing important
patterns/ trends in data set, which in turn increases error induced by bias. So, what we
require is a method that provides ample data for training the model and also leaves ample
data for validation. K Fold cross validation does exactly that.
In K-Fold cross validation, the data is divided into k subsets. Now the holdout method is repeated k times, such that each time one of the k subsets is used as the test/validation set and the other k-1 subsets are put together to form a training set. The error estimate is averaged over all k trials to get the total effectiveness of our model. As can be seen, every data point gets to be in the validation set exactly once and in the training set k-1 times. This significantly reduces bias, as we are using most of the data for fitting, and also significantly reduces variance, as most of the data is also being used in the validation sets.
Interchanging the training and test sets also adds to the effectiveness of this method. As a
general rule and empirical evidence, K = 5 or 10 is generally preferred, but nothing’s fixed
and it can take any value.
Stratified K-Fold Cross Validation
In some cases, there may be a large imbalance in the response variable. For example, in a dataset concerning house prices, there might be a large number of houses with a very high price; or in the case of classification, there might be several times more negative samples than
positive samples. For such problems, a slight variation in the K Fold cross validation
technique is made, such that each fold contains approximately the same percentage of
samples of each target class as the complete set, or in case of prediction problems, the
mean response value is approximately equal in all the folds. This variation is also known as
Stratified K Fold.
The validation techniques explained above are also referred to as non-exhaustive cross validation methods: they do not compute all ways of splitting the original sample, i.e. you just have to decide how many subsets need to be made. A sketch of all three techniques follows.
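A minimal sketch of the holdout, K-fold and stratified K-fold ideas, assuming scikit-learn (the data are synthetic and imbalanced, for illustration):

# Holdout, K-fold and stratified K-fold cross-validation on synthetic imbalanced data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000)

# Holdout: a single train/validation split
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
print("Holdout accuracy:", model.fit(X_tr, y_tr).score(X_val, y_val))

# K-fold: every point is used for validation exactly once across the k splits
kf = KFold(n_splits=5, shuffle=True, random_state=0)
print("5-fold accuracy:", cross_val_score(model, X, y, cv=kf).mean())

# Stratified K-fold: each fold preserves the class proportions of the full dataset
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print("Stratified 5-fold accuracy:", cross_val_score(model, X, y, cv=skf).mean())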