How To Choose The Right Test Options When Evaluating Machine Learning Algorithms
The test options you use when evaluating machine learning algorithms can mean the difference between over-
learning, a mediocre result, and a usable state-of-the-art result that you can confidently shout from the rooftops
(you really do feel like doing that sometimes).
In this post you will discover the standard test options you can use in your algorithm evaluation test harness and
how to choose the right options next time.
Randomness
The root of the difficulty in choosing the right test options is randomness. Most (almost all) machine learning
algorithms use randomness in some way. The randomness may be explicit in the algorithm or may be in the sample
of the data selected to train the algorithm.
Photo by afoncubierta, some rights reserved
This does not mean that the algorithms produce random results; it means that they produce results with some noise or variance. We call this type of limited variance stochastic, and the algorithms that exploit it stochastic algorithms.
This matters only if you want to use the model to make predictions on unseen data.
Split Test
A simple way to use one dataset to both train and estimate the performance of the algorithm on unseen data is to
split the dataset. You take the dataset, and split it into a training dataset and a test dataset. For example, you
randomly select 66% of the instances for training and use the remaining 34% as a test dataset.
The algorithm is trained on the training dataset, the resulting model is assessed on the test dataset, and you get a performance measure, let's say 87% classification accuracy.
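For example, here is a minimal sketch of a single split test, assuming scikit-learn; the iris dataset and logistic regression model are illustrative choices, not part of the original example:

```python
# A minimal sketch of a single 66%/34% split test (assumes scikit-learn;
# the iris data and logistic regression model are illustrative choices).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold back 34% of the instances as the test dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.66, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy=%.3f" % accuracy_score(y_test, model.predict(X_test)))
```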
Split tests are fast and great when you have a lot of data or when training a model is expensive (in resources or time). A split test on a very large dataset can produce an accurate estimate of the actual performance of the algorithm.
How good is the algorithm on the data? Can we confidently say it can achieve an accuracy of 87%?
A problem is that if we split the dataset again into a different 66%/34% split, we would get a different result from our algorithm. This is called model variance.
For example, let's say we split our dataset 66%/34%, ran our algorithm and got an accuracy score, and we did this 10 times with 10 different splits. We might have 10 accuracy scores as follows: 87, 87, 88, 89, 88, 86, 88, 87, 88, 87.
The average performance of our model is 87.5, with a standard deviation of about 0.85.
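And a minimal sketch of repeating that split test with different random splits and summarizing the spread of scores (same illustrative dataset and model):

```python
# A minimal sketch of repeating the split test with 10 different random
# splits and summarizing the spread of scores (illustrative setup).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

scores = []
for seed in range(10):
    # A different 66%/34% split on each repeat
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.66, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores.append(accuracy_score(y_test, model.predict(X_test)))

# The standard deviation indicates the model variance across splits
print("mean=%.3f std=%.3f" % (np.mean(scores), np.std(scores)))
```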
Coin toss photo by ICMA Photos, some rights reserved
A problem with multiple split tests is that it is possible that some data instances are never included for training or testing, whereas others may be selected multiple times. The effect is that this may skew results and not give a meaningful idea of the accuracy of the algorithm.
Cross Validation
A solution to the problem of ensuring each instance is used for training and testing an equal number of times while
reducing the variance of an accuracy score is to use cross validation. Specifically k-fold cross validation, where k is
the number of splits to make in the dataset.
For example, let’s choose a value of k=10 (very common). This will split the dataset into 10 parts (10 folds) and the
algorithm will be run 10 times. Each time the algorithm is run, it will be trained on 90% of the data and tested on
10%, and each run of the algorithm will change which 10% of the data the algorithm is tested on.
In this example, each data instance will be used as a training instance exactly 9 times and as a test instance 1
time. The accuracy reported is not a mean and a standard deviation, but a single accuracy score computed from how many correct predictions were made across all of the folds.
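A minimal sketch of 10-fold cross validation, assuming scikit-learn; the dataset and model are illustrative placeholders:

```python
# A minimal sketch of k-fold cross validation with k=10 (assumes
# scikit-learn; the dataset and model are illustrative choices).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Each instance is used for testing exactly once across the 10 folds
cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
print("mean accuracy=%.3f" % scores.mean())
```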
The k-fold cross validation method is the go-to method for evaluating the performance of an algorithm on a dataset.
You want to choose a k value that gives you good-sized training and test datasets for your algorithm, not too disproportionate (too large or too small for training or testing). If you have a lot of data, you may have to resort to either sampling the data or reverting to a split test.
Cross validation does give an unbiased estimate of the algorithm's performance on unseen data, but what if the algorithm itself uses randomness? The algorithm would produce different results for the same training data each time it is trained with a different random number seed (the start of the sequence of pseudo-randomness). Cross validation does not account for variance in the algorithm's predictions.
Another point of concern is that cross validation itself uses randomness to decide how to split the dataset into k
folds. Cross validation does not estimate how the algorithm performs with different sets of folds.
This only matters if you want to understand how robust the algorithm is on the dataset.
Multiple Cross Validation
A solution is to repeat the cross validation process multiple times, each time with a different random number seed and therefore a different split of the data into folds. This will give you an estimate of the performance of the algorithm on the dataset and an estimate of how robust (the size of the standard deviation) that performance is.
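A minimal sketch of repeated k-fold cross validation, assuming scikit-learn's RepeatedKFold; the model and dataset are illustrative placeholders:

```python
# A minimal sketch of repeating 10-fold cross validation with different
# fold splits (assumes scikit-learn's RepeatedKFold; the model and the
# dataset are illustrative choices).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 10 folds, repeated 5 times, each repeat with a different split into folds
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=1)
scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)

# The standard deviation indicates how robust the performance estimate is
print("mean=%.3f std=%.3f" % (scores.mean(), scores.std()))
```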
If you have one mean and standard deviation for algorithm A and another mean and standard deviation for
algorithm B and they differ (for example, algorithm A has a higher accuracy), how do you know if the difference is
meaningful?
This only matters if you want to compare the results between algorithms.
Statistical Significance
A solution to comparing algorithm performance measures when using multiple runs of k-fold cross validation is to
use statistical significance tests (like the Student’s t-test).
The results from multiple runs of k-fold cross validation are a list of numbers. We like to summarize these numbers
using the mean and standard deviation. You can think of these numbers as a sample from an underlying
population. A statistical significance test answers the question: are two samples drawn from the same population?
(no difference). If the answer is “yes”, then, even if the mean and standard deviations differ, the difference can be
said to be not statistically significant.
We can use statistical significance tests to give meaning to the differences (or lack thereof) between algorithm results when using multiple runs (like multiple runs of k-fold cross validation with different random number seeds). This can help when we want to make accurate claims about results (algorithm A was better than algorithm B and the difference was statistically significant).
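A minimal sketch of such a comparison, assuming scikit-learn and SciPy; the two models and the dataset are illustrative placeholders, and a paired test on matched folds would be a reasonable alternative:

```python
# A minimal sketch of comparing two algorithms with a Student's t-test on
# their repeated cross validation scores (assumes scikit-learn and SciPy;
# the models and dataset are illustrative choices).
from scipy.stats import ttest_ind
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=1)

scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_b = cross_val_score(KNeighborsClassifier(), X, y, cv=cv)

# A small p-value (e.g. below 0.05) suggests the difference in mean
# accuracy is unlikely to be due to chance alone
t_stat, p_value = ttest_ind(scores_a, scores_b)
print("t=%.3f p=%.3f" % (t_stat, p_value))
```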
This is not the end of the story, because there are different statistical significance tests (parametric and
nonparametric) and parameters to those tests (p-value). I’m going to draw the line here because if you have
followed me this far, you now know enough about selecting test options to produce rigorous (publishable!) results.
Summary
In this post you have discovered the difference between the main test options available to you when designing a
test harness to evaluate machine learning algorithms.
When in doubt, use k-fold cross validation (k=10) and use multiple runs of k-fold cross validation with statistical
significance tests when you want to meaningfully compare algorithms on your dataset.
About Jason Brownlee
Jason Brownlee, PhD is a machine learning specialist who teaches developers how to get results with modern
machine learning methods via hands-on tutorials.
56 Responses to How To Choose The Right Test Options When Evaluating Machine Learning Algorithms
Mickael March 4, 2014 at 4:41 am #
jasonb March 4, 2014 at 5:17 am #
Thanks Mickael.
Lawrence February 28, 2015 at 2:14 am #
Great post.
About the last part, are there situations where performance comparison for different algorithms needs to involve a nonparametric test?
Could you please give us some examples of that, if they exist?
Thanks for your help.
Jason Brownlee March 1, 2015 at 8:24 pm #
Lawrence March 3, 2015 at 7:13 pm #
Thanks;)
Melody July 23, 2015 at 5:05 am #
You suggest that running multiple k-fold cross validation with statistical significance tests would help draw conclusions when comparing algorithms.
I wonder how to set aside a fold for validating and tuning a model if I use k-fold cross-validation.
Is it OK to run multiple splits of training/validation/test datasets with statistical significance tests, even though some data may occur multiple times in the same split of datasets?
Surajit August 25, 2015 at 9:59 pm #
Hi Dr Jason,
This is another great post. I have learned a lot from this post.
Keep up.
Regards,
Surajit
Leena February 8, 2016 at 10:53 pm #
Hi Jason – I'm new to ML and Python. I just learnt your algorithm for implementing naive Bayes from scratch in Python. First, thanks for explaining it beautifully.
Also learnt implementation of the same algorithm with scikit learn.
DR Venugopala Rao Manneni April 6, 2016 at 6:06 pm #
Jason Brownlee April 8, 2016 at 1:37 pm #
Thanks.
anbu May 2, 2016 at 1:51 am #
thanks
Jason Brownlee May 2, 2016 at 6:25 pm #
Mohammed June 11, 2016 at 6:47 am #
Thank you very much for this magnificent post, Jason. Please keep up the good work.
Jason Brownlee June 14, 2016 at 8:17 am #
hew August 2, 2016 at 3:06 am #
I’m trying to understand how K-fold CV could help evaluate a neural network model trained using
backpropagation. As I understand it, in K-fold CV, every batch is used to train a single neural network model. Would that mean that, at the end of every batch, the model is more or less overfitted to that particular batch? So in a 10-fold CV, at the end of the last batch, we have a model that is potentially overfitted to the last batch, where it could perform well on that batch's 90% training data but poorly on the 10% test data?
Anyway, thank you for sharing, your posts have been very useful to me.
Jason Brownlee August 2, 2016 at 5:27 am #
Hi hew.
Yes, 10 different models are trained and evaluated. We report the average score and throw the models away.
Cross validation is only used to estimate the performance of the model on unseen data, not to train the model.
If we are happy with the performance, we can then train the model on the entire training dataset and begin to use
it.
Time series is difficult to use with cross validation. Normally, I use a train/test split and a sliding window to evaluate models on time series data.
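One way to sketch an ordered evaluation of this kind, assuming scikit-learn's TimeSeriesSplit and a toy sequence:

```python
# A minimal sketch of an ordered train/test evaluation for time series
# (assumes scikit-learn's TimeSeriesSplit; the data is a toy sequence).
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time steps of a toy series
y = np.arange(20)

# Each split trains on the past and tests on the next block of the series
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    print("train t=0..%d, test t=%d..%d"
          % (train_idx[-1], test_idx[0], test_idx[-1]))
```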
Hew August 3, 2016 at 3:59 am #
Hi Jason,
Thank you for your reply. I'm not sure if I understand you completely. To estimate the performance of our model on unseen data? Is the definition of model here the ML method used? What does it mean if we score well on a k-fold exercise? All I would assume is that the dataset we use contains evenly spread common features that can be used for prediction on our dataset. It doesn't prove its effectiveness against out-of-sample (OOS) data. Perhaps the dataset that we selected for the exercise is biased in a way that the features we found never work on an OOS dataset.
Jason Brownlee August 3, 2016 at 8:26 am #
Great comment.
The goal of predictive modeling is to develop a model from a sample of data of the domain to perform well on
data it has not seen before. To make predictions on unseen data.
If this is not the goal, then you are doing stats and developing a descriptive rather than a predictive model and
trying to understand the domain.
Knowing how well a model does on unseen data is a hard problem. We can hold back a sample and use that to
estimate the skill of the model. A more advanced technique is to do this many times – cross validation.
Cross validation does not prove a model or modeling methodology (data prep + model) will do well, but it gives
us confidence that it will do well.
Indeed, we must be very concerned with the quality of our data sample, otherwise the ability to generalize will be
compromised.
Poonam November 15, 2016 at 6:46 pm #
Hi Jason,
May I ask what the alternative is for fitting the data when we have very few data points (say 15 to 20 only) but the number of predictors is large (e.g. 9 to 10)?
Jason Brownlee November 16, 2016 at 9:28 am #
You may be better served with small n statistics (e.g. statistical methods). You just don’t have enough
observations for machine learning methods to learn from and generalize.
Rita Zhang March 6, 2017 at 2:49 pm #
Hi Jason,
Could you please explain the differences between a test set and a validation set? If I split the data into a test set and a training set, do I need to set up a validation set?
Thank you so much!
Jason Brownlee March 7, 2017 at 9:31 am #
Hi Rita,
A training set is used to fit the model, a test set is used to evaluate a model. A single large dataset can be split
into multiple train and test sets – e.g. k-fold cross validation.
A validation dataset is held back and used as a final check. It is optional but recommended.
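One way to sketch such a split, assuming scikit-learn; the 60%/20%/20% proportions are illustrative:

```python
# A minimal sketch of a train/validation/test split (roughly 60%/20%/20%);
# assumes scikit-learn, and the proportions are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold back 20% as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1)

# Then split the remainder into training and validation sets
# (0.25 of the remaining 80% gives 20% of the original data)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1)

print(len(X_train), len(X_val), len(X_test))
```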
Seo young jae April 9, 2017 at 2:39 pm #
2. Regarding the data split, you said that you recommended a validation dataset in the question above. Then, when I split the dataset as 60% (train), 20% (test), 20% (validation), what is the difference between the test and validation sets? I think the algorithm is evaluated using the test data. So what is the role of validation? Is it the same role? I'm confused…
Omogbehin Azeez June 10, 2017 at 6:23 am #
Hello Dr,
Thank you for the good work. I want to ask how someone can plot the accuracy of each of the models developed in k-fold cross validation.
Purvi Chokshi July 15, 2017 at 7:52 am #
Thank you for the very detailed explanation, Dr Jason. I see that for any example of k-fold cross validation, the number of splits is always 10. How do we determine the right number of splits for cross validation? Some of my datasets have 700 rows and some have 7000 rows of data. Always splitting the dataset into 10 folds is not the right decision, I think.
Jason Brownlee July 15, 2017 at 9:47 am #
10 is reported for most datasets because further groups result in diminishing returns on the bias/variance of the mean skill score.
You can try a sensitivity analysis and see how k values affect the distribution of skill scores. Not a bad idea.
Balaji July 22, 2017 at 4:00 am #
Which is the best technique to minimize Mean absolute error?
Jason Brownlee July 22, 2017 at 8:36 am #
Joseph August 8, 2017 at 6:27 am #
Hello Jason. I want to learn how to estimate the accuracy of a regression model. I have gone through the post but I don't seem to cope. Please, I need to learn machine learning from scratch… I mean from the very foundation. Please help with any material. Thanks
Jason Brownlee August 8, 2017 at 7:54 am #
You can estimate the skill of a regression model by looking at prediction error.
It is common to estimate prediction error using Root Mean Squared Error (RMSE) or Mean Absolute Error (MAE).
These measures are provided in top platforms like scikit-learn, Weka and caret in R.
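A minimal sketch of computing both measures, assuming scikit-learn; the dataset and model are illustrative placeholders:

```python
# A minimal sketch of estimating regression error with RMSE and MAE
# (assumes scikit-learn; the dataset and model are illustrative choices).
from numpy import sqrt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

yhat = LinearRegression().fit(X_train, y_train).predict(X_test)
print("RMSE=%.3f" % sqrt(mean_squared_error(y_test, yhat)))
print("MAE=%.3f" % mean_absolute_error(y_test, yhat))
```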
surya prasad August 17, 2017 at 9:40 pm #
Sir, for a two-class defect detection problem, apart from accuracy which other parameters have to be
evaluated?
Jason Brownlee August 18, 2017 at 6:18 am #
mohammed temraz September 22, 2017 at 8:36 am #
Hello Jason
Secondly, if I decide to use cross validation, when should I use it? On the training or the test data?
In my scenario, I'll build some classification algorithms using the training dataset, then I'll use the test set to evaluate the performance of my models.
Should I use cross validation when I build my models or when I test them?
Jason Brownlee September 23, 2017 at 5:33 am #
Yes, CV is perhaps the best method we have for developing unbiased (less biased) estimators.
Jesús Martínez March 2, 2018 at 2:44 am #
Nice article, Jason. I really like the tools you provided here. I’ve got a question: In case you’re on a tight
deadline (like in a competition or dealing with a really impatient boss), what method would you use? I’m asking
because although CV and multiple CV runs are a great way to gain confidence in your results, they consume lots of
time.
Jason Brownlee March 2, 2018 at 5:36 am #
I seem to always fall back to repeated CV and significance tests. It’s my background in stochastic
optimization that makes me not settle for anything less.
András Novoszáth May 12, 2018 at 2:07 am #
Hi,
Have you written, or do you know of, any introduction/tutorial you would advise for applying significance tests to cross-validation results? It would be really useful.
Jason Brownlee May 12, 2018 at 6:47 am #
Habib October 24, 2018 at 8:40 am #
Hi,
This is how the overall accuracy is computed. But in the paper "Steven L Salzberg. 1997. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery 1, 3 (1997), 317-328", under "A recommended approach", the overall accuracy is averaged across all k partitions. These k values also give an estimate of the variance of the algorithms. Then, to compare algorithms, the binomial test or the McNemar test is suggested.
Thanks
Jason Brownlee October 24, 2018 at 2:42 pm #
There are many statistical hypothesis test methods that can be used, I explain more here:
https://machinelearningmastery.com/statistical-significance-tests-for-comparing-machine-learning-algorithms/
Ayobami July 12, 2019 at 3:24 am #
Please, I would like to know the level of efficiency of all the machine learning algorithms. I'm actually on a project and it's very important for me to know this. Please respond as soon as possible. Thanks
Jason Brownlee July 12, 2019 at 8:46 am #
suv July 23, 2019 at 4:37 pm #
Hi Jason,
what is the acceptable standard deviation (limits?) between cross-validated model errors?
Kind Regards,
Jason Brownlee July 24, 2019 at 7:50 am #
It is problem specific. Compare your results to a naive baseline model. More details here:
https://machinelearningmastery.com/faq/single-faq/how-to-know-if-a-model-has-good-performance
halima October 27, 2019 at 10:25 am #
Hello Jason, I would like to clarify with you the statement about running the algorithm on the training dataset and the model being created and assessed on the test dataset. How is the model built for the test dataset from the results of the run on the training dataset? I am not clear about this statement. Could you please advise?
Jason Brownlee October 28, 2019 at 6:01 am #
Skylar May 11, 2020 at 3:19 pm #
Hi Jason,
I want to make sure with you: the “Multiple Cross Validation” that you mentioned in your post, does it mean “repeated
cross-validation” in caret method=”repeatedcv” in the “trainControl” function? Thank you!
Jason Brownlee May 12, 2020 at 6:34 am #
Yes.
Rom August 15, 2020 at 2:26 am #
I trained a learner on the training dataset and got an accuracy of 99%. When I trained the same learner with 10-fold CV my accuracy decreased to 40%. Clearly my model is overfitting. After performing random search CV I obtained the best estimator, which gives an accuracy of 55%. How do I know that my tuned hyperparameter model has overcome the problem of overfitting? In other words, on which dataset should I assess my tuned learner model? Obviously I cannot use the test set as I'm still not confident about my learner.
Jason Brownlee August 15, 2020 at 6:34 am #
Maybe not overfitting, maybe just the first case was a poor estimate of expected performance and the cv
was more reliable.
You can grid search on a hold-out dataset, or grid search within each cross-validation fold, which is called nested cross-validation.
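A minimal sketch of nested cross validation, assuming scikit-learn; the SVC model and parameter grid are illustrative placeholders:

```python
# A minimal sketch of nested cross validation: hyperparameters are tuned
# inside each outer fold (assumes scikit-learn; the SVC model and the
# parameter grid are illustrative choices).
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)  # tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # evaluation

# The grid search runs inside every outer training fold, so the outer
# score is an estimate that is not biased by the tuning process
search = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=inner_cv)
scores = cross_val_score(search, X, y, cv=outer_cv)
print("nested CV accuracy: mean=%.3f std=%.3f" % (scores.mean(), scores.std()))
```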
David November 9, 2020 at 10:17 pm #
Hello Mr.Brownlee,
Great post. I would like to ask you a few doubts if you don’t mind:
1. In the k-fold cross-validation technique the dataset is used to both build and validate the model, which is not ideal, right?
2. Even if the model classifies accurately, will it be limited by the specificity of the data?
3. How can you calculate the accuracy of classifying a novel, unseen sample?
Thanks in advance
Jason Brownlee November 10, 2020 at 6:42 am #
k-fold cv estimates the performance of the model when used to make predictions on new data not seen during
training.
Saeed January 20, 2021 at 2:56 pm #
Do we need test data for the evaluation of an unsupervised model where the model is used to generate synthetic data?
Jason Brownlee January 21, 2021 at 6:43 am #
It depends on the type of model, e.g. it is common to evaluate clustering methods like a classification
model:
https://machinelearningmastery.com/faq/single-faq/how-do-i-evaluate-a-clustering-algorithm
Anjali Budhiraja December 24, 2021 at 4:35 am #
Dear sir… your articles are always very nice… when I do a search and your article is among the options I always prefer to read yours. My query is: I was trying RandomizedSearchCV using cross validation to select the hyperparameters for the F1 score. It's giving me the best model for the F1 score of class 1, but the F1 score for class 2, which is my minority class, is coming out very poor. I want to select the best hyperparameters for the model for class 2. What should I do in that case?
James Carmichael February 18, 2022 at 1:03 pm #
https://stackoverflow.com/questions/62672842/how-to-improve-f1-score-for-classification