Why You Should Be Spot-Checking Algorithms On Your Machine Learning Problems
Spot-checking algorithms is about getting a quick assessment of a bunch of different algorithms on your machine learning problem so that you know which algorithms to focus on and which to discard.
Spot-Checking Algorithms
Spot-checking algorithms is a part of the process of applied machine learning. On a new problem, you need to quickly determine which types or classes of algorithms are good at picking out the structure in your problem and which are not.
The alternative to spot-checking is to feel so overwhelmed by the vast number of algorithms and algorithm types that you could try that you end up trying very few, or going with what has worked for you in the past. This results in wasted time and sub-par results.
Speed: You could spend a lot of time playing around with different algorithms, tuning parameters and thinking about which algorithms will do well on your problem. I have been there and ended up testing the same algorithms over and over because I was not being systematic. A single spot-check experiment can save hours, days and even weeks of noodling around.
Objective: There is a tendency to go with what has worked for you before. We pick our favorite algorithm (or
algorithms) and apply them to every problem we see. The power of machine learning is that there are so many
different ways to approach a given problem. A spot-check experiment allows you to automatically and
objectively discover those algorithms that are the best at picking out the structure in the problem so you can
focus your attention.
Results: Spot-checking algorithms gets you usable results, fast. You may discover a good enough solution in
the first spot experiment. Alternatively, you may quickly learn that your dataset does not expose enough
structure for any mainstream algorithm to do well. Spot-checking gives you the results you need to decide
whether to move forward and optimize a given model or backward and revisit the presentation of the problem.
I think spot checking mainstream algorithms on your problem is a no-brainer first step.
Below are 5 tips to ensure you are getting the most from spot-checking machine learning algorithms on your
problem.
Algorithm Diversity: You want a good mix of algorithm types. I like to include instance-based methods (like LVQ and kNN), functions and kernels (like neural nets, regression and SVM), rule systems (like Decision Table and RIPPER) and decision trees (like CART, ID3 and C4.5).
Best Foot Forward: Each algorithm needs to be given a chance to put its best foot forward. This does not mean performing a sensitivity analysis on the parameters of each algorithm, but using experiments and heuristics to give each algorithm a fair chance. For example, if kNN is in the mix, give it 3 chances with k values of 1, 5 and 7.
Formal Experiment: Don’t play. There is a huge temptation to try lots of different things in an informal manner, to play around with algorithms on your problem. The idea of spot-checking is to get to the methods that do well on the problem, fast. Design the experiment, run it, then analyze the results (a minimal harness is sketched after this list). Be methodical. I like to rank algorithms by their statistically significant wins (in pairwise comparisons) and take the top 3-5 as a basis for tuning.
Jumping-off Point: The best performing algorithms are a starting point, not the solution to the problem. The
algorithms that are shown to be effective may not be the best algorithms for the job. They are most likely to be
useful pointers to types of algorithms that perform well on the problem. For example, if kNN does well,
consider follow-up experiments on all the instance based methods and variations of kNN you can think of.
Build Your Short-list: As you learn and try many different algorithms you can add new algorithms to the suite
of algorithms that you use in a spot-check experiment. When I discover a particularly powerful configuration of
an algorithm, I like to generalize it and include it in my suite, making my suite more robust for the next problem.
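For a concrete illustration of the tips above, here is a minimal spot-check harness sketched with Python and scikit-learn. It is an assumed setup, not the author's own tooling: the synthetic dataset, the particular suite of models and the 10-fold cross-validation are placeholders you would replace with your own problem and short-list.

# A minimal spot-check harness (sketch): a diverse suite of algorithms,
# each given a fair configuration, evaluated with the same cross-validation
# and ranked by mean accuracy. The dataset is synthetic for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# A mix of algorithm types; kNN gets several k values (1, 5, 7) so it can
# put its best foot forward.
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "knn-1": KNeighborsClassifier(n_neighbors=1),
    "knn-5": KNeighborsClassifier(n_neighbors=5),
    "knn-7": KNeighborsClassifier(n_neighbors=7),
    "svm-rbf": SVC(),
    "cart": DecisionTreeClassifier(),
    "naive-bayes": GaussianNB(),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
results = []
for name, model in models.items():
    # Scale inside the pipeline so no information leaks from the test folds.
    scores = cross_val_score(make_pipeline(StandardScaler(), model), X, y, cv=cv, scoring="accuracy")
    results.append((name, scores.mean(), scores.std()))

# Rank the suite; the top few become candidates for follow-up tuning.
for name, mean, std in sorted(results, key=lambda r: -r[1]):
    print(f"{name:12s} {mean:.3f} (+/- {std:.3f})")

The same harness could rank by another metric, or feed the per-fold scores into pairwise significance tests as suggested under Formal Experiment.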
Top 10 Algorithms
There was a paper published in 2008 titled “Top 10 algorithms in data mining“. Who could go past a title like that? It was also turned into a book, “The Top Ten Algorithms in Data Mining”, and inspired the structure of another book, “Machine Learning in Action“.
This might be a good paper for you to jump-start your short-list of algorithms to spot-check on your next machine learning problem. The top 10 algorithms for data mining listed in the paper were as follows (a rough mapping of the supervised methods to scikit-learn estimators is sketched after the list):
C4.5. A decision tree algorithm from a family that includes the earlier ID3 and the later C5.0 algorithms.
k-means. The go-to clustering algorithm.
Support Vector Machines. This is really a huge field of study.
Apriori. This is the go-to algorithm for rule extraction.
EM. Along with k-means, a go-to clustering algorithm.
PageRank. I rarely touch graph-based problems.
AdaBoost. This is really the family of boosting ensemble methods.
kNN (k-nearest neighbors). A simple and effective instance-based method.
Naive Bayes. Simple and robust use of Bayes theorem on data.
CART (classification and regression trees). Another tree-based method.
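For the supervised methods on this list, a rough (assumed) mapping to scikit-learn estimators might look like the sketch below; note that scikit-learn does not ship C4.5, so a CART-style tree with the entropy criterion stands in only as an approximation.

# A rough mapping from the paper's supervised algorithms to scikit-learn
# estimators, suitable for dropping into a spot-check suite. C4.5 is not in
# scikit-learn; an entropy-criterion CART tree only approximates it.
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

top10_supervised = {
    "c4.5-like": DecisionTreeClassifier(criterion="entropy"),
    "cart": DecisionTreeClassifier(criterion="gini"),
    "svm": SVC(),
    "adaboost": AdaBoostClassifier(),
    "knn": KNeighborsClassifier(),
    "naive-bayes": GaussianNB(),
}
# k-means, EM (e.g. GaussianMixture), Apriori and PageRank are clustering,
# association-rule or graph methods and would be checked with different
# experiments and metrics.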
There is also a great Quora question on this topic that you could mine for ideas of algorithms to try on your
problem.
Resources
Top 10 algorithms in data mining (2008)
Quora: What are some Machine Learning algorithms that you should always have a strong understanding of,
and why?
Jeremy June 6, 2016 at 11:28 pm #
Hi Jason. When you say each algorithm needs to put its “best foot forward” by using a range of parameter
values, how do you know what values to use? Do you have any resources regarding what parameter values to test
for some of the most common algorithms? Thanks, Jeremy
Jason Brownlee June 14, 2016 at 8:24 am #
Generally, you can gather this information from papers, posts and competition outcomes, as well as experience. It is hard-earned knowledge and sadly not written down anywhere.
In the meantime, you may be best served by grid searching the parameters of a given algorithm on a suite of standard datasets to start building up an intuition for “classes of configuration”.
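For illustration, here is a minimal grid-search sketch with scikit-learn; the kNN parameter grid and the synthetic data are assumptions chosen only to show the pattern, not values from the reply.

# Grid search one algorithm's parameters to build intuition about which
# configurations tend to work; the grid and data here are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

param_grid = {"n_neighbors": [1, 3, 5, 7, 9], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)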
Nick December 22, 2016 at 5:54 am #
I see the benefit of spot checking, but how do you know that a model that underperforms in spot checking
wouldn’t be better to use once fully tuned? For example, suppose model A has a 65% classification accuracy with no
tuning, and model B is 70% accurate with no tuning. Is it possible for model A to overtake model B once they have
been tuned? Or is it common for models to maintain the same performance relative to one another, even after
tuning? For the sake of argument, I’m ignoring the effect of any overfitting, but perhaps that is part of the answer.
Jason Brownlee December 22, 2016 at 6:38 am #
It’s hard.
You need to give each algorithm its best chance but pull back from full algorithm tuning.
Generally, I would advise designing a suite of transformed inputs (views of the data) and a suite of algorithm/configs, running all combinations to see what floats to the top, and then doubling down.
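A sketch of that views-by-algorithms grid, assuming scikit-learn pipelines; the particular transforms and models below are placeholders rather than recommendations from the reply.

# Cross every data "view" (transform) with every algorithm/config and score
# each combination with the same cross-validation; placeholders throughout.
from itertools import product
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PowerTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

views = {
    "standardized": StandardScaler(),
    "normalized": MinMaxScaler(),
    "power": PowerTransformer(),
}
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "cart": DecisionTreeClassifier(),
}

results = []
for (vname, view), (mname, model) in product(views.items(), models.items()):
    scores = cross_val_score(make_pipeline(view, model), X, y, cv=5)
    results.append((f"{vname}+{mname}", scores.mean()))

# See what floats to the top, then double down on those combinations.
for name, mean in sorted(results, key=lambda r: -r[1]):
    print(f"{name:25s} {mean:.3f}")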
Glen February 6, 2018 at 10:38 pm #
Great article Jason. I really like your advice on not trying too hard to make an algorithm work.
Just a question around tools. Could you mention the different tools that you use to automate and run through
different algorithms?
Jason Brownlee February 7, 2018 at 9:24 am #
Skylar May 12, 2020 at 8:39 am #
Hi Jason,
After reading this post, I realize how important it is to spot-check algorithms first! I am very curious about the question raised by Nick on Dec 22, 2016 (though it was a long time ago), because I have exactly the same question. To follow up on it: when we are doing the spot-check, after we select a bunch of different types of algorithms to test, should we compare them after config tuning? What do you mean by “give each algorithm a fair chance” in the “Best Foot Forward” tip? Should we achieve this by automatically and randomly selecting several values for each parameter to tune, or should we use a stricter “grid” tuning strategy so that each algorithm is fully tuned before we compare them in the spot-check step?
Jason Brownlee May 12, 2020 at 1:30 pm #
One approach is to spot check a suite of algorithms with a “standard” config for each, then tune. The risk is overfitting the dataset.
Alternatively, you can tune each algorithm as part of the spot check, using so-called nested cross-validation.
Thank you Jason for your reply! Your description of these two approaches makes me even more curious, and I want to ask:
1. In the first approach, you mentioned “spot check a suite of algorithms with ‘standard’ config then tune”. I understand model configs to mean model hyperparameters; is that correct? If so, do you mean that we first evaluate a suite of algorithms with some “standard” hyperparameters and, once we find the best model, then tune its hyperparameters systematically with a grid search?
2. The “nested cross-validation” sounds amazing and interesting! Do you mean that in the inner CV loop we tune the hyperparameters with a grid search for each model we want to compare, while in the outer CV loop we measure the performance of the model with the hyperparameter combination that won the inner CV loop? If I have not understood correctly, have you clarified this somewhere, or could you please suggest related materials I can learn from?
Many thanks!
2. Yes. I have a tutorial on this written and scheduled. Generally, you can write the loop yourself and perform the search within the loop, or apply the cross-validation to a grid search object directly. The latter is less code.
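A minimal sketch of the “less code” variant with scikit-learn, assuming a grid-searched kNN as the model; the data and grid are illustrative only.

# Nested cross-validation by applying cross-validation to a grid search
# object directly: the inner CV tunes, the outer CV estimates performance.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

inner = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=3)
scores = cross_val_score(inner, X, y, cv=10)  # outer loop
print(scores.mean(), scores.std())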
Jesús Martínez February 27, 2018 at 12:05 am #
Great resource, Jason. Thanks for publishing it! I think it would be very useful for my next machine learning
endeavor 🙂
Does this process work with deep learning? Given that deep learning tasks tend to take a lot more time than their machine learning counterparts, is it feasible to spot-check different architectures?
Jason Brownlee February 27, 2018 at 6:33 am #
Absolutely.
Jonathan Moregård April 16, 2018 at 2:21 am #
Is it possible to do this in WEKA? I tried to download packages for all the algorithms you mentioned in the introduction, but can’t seem to find them inside the Experimenter.
Jason Brownlee April 16, 2018 at 6:11 am #
Shay Geller April 16, 2019 at 12:57 am #
How would you split your dataset for this spot checking experiment?
Let’s assume you have 20K samples in your data.
Would you use all of it? Part of it?
How many train/test splits would you consider? Would you get results over k-fold CV?
Another problem is with the hyperparameters. If you have multiple parameter options for each model, would you get the results with nested cross-validation splits or a regular CV?
My opinion is to split the 20K into 80% train and 20% test, and for each model perform k-fold cross-validation *only on the train data* (not inner CV; choose k to be 5 or 10).
Then use these results to perform pair-wise significance tests as you proposed.
Pick the top 3-4 models (regardless of their hyper-parameters).
That way you will not get any assessments from the test set, which is good.
Thanks
Jason Brownlee April 16, 2019 at 6:51 am #
Ida June 20, 2019 at 8:53 pm #
Thank you Jason. Your posts are great. I really learn a lot from every post.
Jason Brownlee June 21, 2019 at 6:36 am #
Thanks.
Skylar May 12, 2020 at 8:06 am #
Hi Jason,
You mentioned “go-to algorithm” in the Top 10 Algorithms section of this post; I wonder what it means? Sorry, English is not my mother tongue, and I don’t want to misunderstand your meaning :-) Thanks!
Jason Brownlee May 12, 2020 at 1:29 pm #
Skylar May 12, 2020 at 4:26 pm #
Jason Brownlee May 13, 2020 at 6:25 am #
You’re welcome.