From Excel To Machine Learning Bootcamp
Introduction
I am a hands-on analytics professional and experienced leader. I mention this only to provide
credibility for what I am about to say next.
Over the years I have performed analytics on behalf of teams in Marketing, Product
Management, Finance, Customer Service, and Supply Chain.
In the majority of cases, these analyses could be performed by the business subject matter
experts (SMEs). The techniques I use are not that hard to learn. This includes business SMEs
acquiring machine learning skills to analyze their data.
My philosophy is to let my content do the talking - which is why this document exists.
If your team wants to have more business impact using data, I invite you to keep reading.
-Dave
FROM EXCEL TO MACHINE LEARNING BOOTCAMP
Document Goals
This document is structured to accomplish the following goals:
1. To demonstrate that any business team with Excel skills can learn R programming.
2. To demonstrate that any business team can acquire machine learning skills.
3. To provide an outline of the 3-day bootcamp curriculum.
A secondary benefit of this document is that you will acquire some actual machine learning
knowledge as the result of your reading.
Table of Contents
Overfitting Intuition
The Bugbear of Machine Learning
The Model Is Good - Or Is It?
What Happened?
Model Tuning Intuition
Gini Impurity
Impurity Intuition
Gini Impurity
Gini Impurity Example
Wrap Up
While most Excel users don't think of it this way, they spend a lot of time writing and
debugging code in Excel. In fact, Microsoft Excel is by far and away the world's most
popular programming environment.
Take the image below as an example. The user is using Excel's AVERAGE function to
calculate the average of a column (i.e., Petal.Width) of a table (i.e., iris_data) in a
worksheet.
Once the user hits the <enter> key, Excel attempts to interpret the instructions in the
cell and perform the desired operation. If Excel doesn't understand what the user
typed, it reports an error.
That's coding!
R Code
While it isn't the only way to code in Excel, calling Excel functions as depicted on the
previous page is by far the most common. When using Excel in this way, Excel is
operating as a code interpreter.
Using R as an interpreter is very common. The R user types some code and hits the
<enter> key. R then tries to interpret the code, throwing an error if it doesn't
understand what was typed by the user.
In this way Excel and R are very similar, but it doesn't stop there. Even the code is
very similar!
The image below depicts the same scenario as on the previous page, but using R
instead.
As depicted, the user is calculating the average (mean is just another name
for the average) of the Petal.Width column of the iris_data table.
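The call described above can be sketched in a couple of lines. This is a minimal sketch, assuming the iris_data table is simply a copy of R's built-in iris data set (how the original table was loaded is not shown in this document):

```r
# Assumption: iris_data is a copy of R's built-in iris data set
iris_data <- iris

# Average of the Petal.Width column, the R counterpart of Excel's AVERAGE
avg_petal_width <- mean(iris_data$Petal.Width)
print(avg_petal_width)
```

The dollar sign plays the same role as the column reference inside an Excel table formula: table name first, column name second.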
While this example might seem simple, it demonstrates why R is the fastest, easiest
way for ANY team to unlock advanced analytics.
As you will see through the rest of this document, Excel is a powerful analytical tool
with many concepts and skills that need to be mastered to use Excel effectively.
This knowledge makes the learning process an exercise in mapping Excel skills to R.
As depicted below, Excel features above the water line (e.g., Pivot Tables) only
scratch the surface of Excel's capabilities. However, these features represent the bulk
of Excel's use in practice.
Another similarity between Excel and R is the "choose your own adventure" aspect
of the technologies. Just as many Excel users never learn Power Query, not every R
user needs to learn statistical analysis to be effective in their work.
Excel               R
Tables              Data Frames
Common Functions    Common Functions
Pivot Tables        dplyr
Charts              ggplot2
Excel tables can also be thought of as container objects. Tables contain rows, columns,
cells, data formats, etc.
You probably can see where I'm going with this already.
When analyzing data with R, it's all about the tables - just like Excel.
Once again, your Excel knowledge directly translates to R.
The image below demonstrates how Excel tables are objects. For example, every
table in Excel has a name - whether you explicitly name a table or not. Table names
allow you to directly access/manipulate tables using Excel code.
Things work in R exactly the same way. Tables of data in R (known as "data frames")
have names just like Excel tables so that you can write R code to access/manipulate
tables of data.
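To make the idea of a named table concrete, here is a small sketch. It assumes the data frame is named iris_data and is copied from R's built-in iris data set; both choices are for illustration only:

```r
# Assumption: the table is R's built-in iris data, stored under the name iris_data
iris_data <- iris

# The name gives you access to the whole table, just like an Excel table name
table_class   <- class(iris_data)   # what kind of object the name refers to
row_count     <- nrow(iris_data)    # number of rows in the table
column_names  <- names(iris_data)   # the column headers

print(table_class)
print(row_count)
print(column_names)
```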
Working with cells of data is very common in Excel. It is useful to think of cells as
objects contained within tables - as depicted below. Once again, you use Excel code
to access cells.
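The same idea carries over to R, where a cell is addressed by its row and its column. A minimal sketch, assuming the iris_data table is a copy of R's built-in iris data set:

```r
# Assumption: iris_data is a copy of R's built-in iris data set
iris_data <- iris

# Row 1 of the Petal.Width column, addressed as table[row, column]
first_cell <- iris_data[1, "Petal.Width"]

# The same cell, reached by picking the column first and then the row
same_cell <- iris_data$Petal.Width[1]

print(first_cell)
```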
Excel code supports different ways of accessing columns of data within tables. Two
examples:
Notice how similar the actual R code is to Excel when using object names.
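For reference, here is a sketch of common ways to reach a column by name in R. It assumes the iris_data table is a copy of R's built-in iris data set; the original figures may show different examples:

```r
# Assumption: iris_data is a copy of R's built-in iris data set
iris_data <- iris

# Three equivalent ways to get the Petal.Width column by name
by_dollar   <- iris_data$Petal.Width
by_brackets <- iris_data[["Petal.Width"]]
by_position <- iris_data[, "Petal.Width"]

# Show the first few values of the column
print(head(by_dollar))
```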
The bulk of the code Excel users write calls functions. Often, these function calls are nested
and can be difficult to debug (again, that's coding!).
The contrived example from the previous page is repeated using R code.
An IsSetosa column is being added to the iris_data table (or data frame) and populated
with new data derived from the existing Species column.
First, notice how the workflow is exactly the same as in Excel - only everything is
done in code with R.
Second, notice how similar the R ifelse function call is to Excel code.
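A sketch of what this step can look like in R follows. The table is assumed to be a copy of R's built-in iris data set, and the "Yes"/"No" values are an assumption for illustration, since the original figure is not reproduced here:

```r
# Assumption: iris_data is a copy of R's built-in iris data set
iris_data <- iris

# Add a new IsSetosa column derived from the existing Species column.
# The "Yes"/"No" values are illustrative, not taken from the original figure.
iris_data$IsSetosa <- ifelse(iris_data$Species == "setosa", "Yes", "No")

# Count how many rows landed in each group
print(table(iris_data$IsSetosa))
```

Like Excel's IF, ifelse takes three arguments: the test, the value when the test is true, and the value when it is false.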
Common Functions
Like Excel, R comes out of the box with many, many functions to work with columns
of data.
Many of the R functions share the same name with the corresponding Excel
function. In other cases, mapping your Excel knowledge to R is straightforward, as
depicted below.
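A few such mappings are sketched here. The selection is mine and is not necessarily the list shown in the original figure; the data comes from R's built-in iris data set:

```r
# Assumption: the column comes from R's built-in iris data set
petal_width <- iris$Petal.Width

total    <- sum(petal_width)      # Excel: SUM
average  <- mean(petal_width)     # Excel: AVERAGE
middle   <- median(petal_width)   # Excel: MEDIAN
smallest <- min(petal_width)      # Excel: MIN
largest  <- max(petal_width)      # Excel: MAX
rounded  <- round(average, 1)     # Excel: ROUND
n_values <- length(petal_width)   # Excel: COUNT (this column has no blanks)
shouted  <- toupper("setosa")     # Excel: UPPER
n_chars  <- nchar("setosa")       # Excel: LEN

print(c(total = total, average = average, smallest = smallest, largest = largest))
```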
Excel is a great data visualization tool, supporting many ways to analyze data
visually.
R Data Visualizations
R also easily produces data visualizations that are difficult, or not possible, to do with
out-of-the-box Excel features.
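As a small taste of charting in code, here is a sketch that uses only the plotting functions that ship with R, not the ggplot2 package. The data is R's built-in iris data set, and the chart choice is an assumption made for illustration:

```r
# Assumption: iris_data is a copy of R's built-in iris data set
iris_data <- iris

# One box per species, showing how Petal.Width is distributed in each group
box_stats <- boxplot(Petal.Width ~ Species,
                     data = iris_data,
                     main = "Petal width by species",
                     xlab = "Species",
                     ylab = "Petal width")
```

The whole chart is one function call, so changing the column or the grouping is a matter of editing a single line.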
Turns out that one of the most useful machine learning algorithms for business data
is also one of the simplest to learn - decision trees. Almost everyone is familiar with a
decision tree. You start at the top of the tree and move down each decision in the
tree until you reach the bottom.
Machine learning algorithms learn from data. In the case of decision trees, the
algorithm learns rules from the data.
Some sample data that was used to create the tree on the previous page is depicted
below. When we apply the decision tree algorithm to the data, the algorithm tries to
learn the rules to accurately predict the label using the features in the data set.
Most of us are familiar with decision trees being depicted in a graphical form as
shown earlier. However, you can also represent a decision tree using text.
The tree learned from the data listed on the previous page is represented below as
text.
What is clear from this representation is that trees are collections of rules used to
make predictions.
The textual rules below clearly show what the algorithm learned from the data in
terms of predicting income levels (i.e., the label).
For example, using the rules below, a person who had an occupation of "Adm-clerical"
is predicted to have an annual income of less than or equal to $50,000 USD.
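To see a tree printed as text yourself, here is a minimal sketch. It assumes the rpart package (one of the recommended packages that normally ships with R) and uses R's built-in iris data set rather than the income data discussed above, so the rules it prints are about flower species, not income:

```r
# Assumption: rpart is used here for illustration; the data is the built-in iris set
library(rpart)

iris_data <- iris

# Learn rules that predict the label (Species) from all other columns (the features)
tree <- rpart(Species ~ ., data = iris_data, method = "class")

# Print the learned rules as text, one line per node of the tree
print(tree)

# Use the rules to predict the label for the first row of the table
first_prediction <- predict(tree, newdata = iris_data[1, ], type = "class")
print(first_prediction)
```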
Machine learning algorithms work by pursuing some sort of objective. Think of the
objective as how the algorithm determines if it has learned all that it can from the
data.
In the case of classification trees, the objective is to minimize impurity. Later you will
learn about the math of impurity; for now, let's focus on the intuition.
A collection of data is pure when all the labels are the same. Conversely, a collection
of data is impure when the labels are not all the same.
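The definition of purity fits in one line of R. In this sketch the function name is_pure and the label values "<=50K" and ">50K" are assumptions made for illustration:

```r
# Hypothetical helper: a collection is pure when it holds exactly one distinct label
is_pure <- function(labels) {
  length(unique(labels)) == 1
}

# Illustrative label values (assumed, not taken from the original data)
all_same <- c("<=50K", "<=50K", "<=50K")
mixed    <- c("<=50K", ">50K", "<=50K")

print(is_pure(all_same))
print(is_pure(mixed))
```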
In this contrived example we have 10 observations (or rows) and 2 features of data.
The classification tree algorithm loves a single feature value that covers lots of
observations sharing only a single label value.
This is the first conceptual step of the classification tree algorithm learning from the
contrived data set.
Graphically, we can represent what the algorithm has learned so far as depicted
below.
However, not all the data has been used in the algorithm's learning process, so we
are not done yet!
As depicted below, the tree so far has only used 40% of the data for learning.
The observations used so far have been hidden by bars to clearly show what data is
left for the algorithm to use.
Of the data left for use by the algorithm, a pure split using the relationship feature is
highlighted.
The math of impurity tells the algorithm that after using the relationship feature
value of Wife, there are no valid ways of splitting the data further.
Voilà! You now have a conceptual understanding of how classification trees learn
from data.
Overfitting Intuition
The Bugbear of Machine Learning
Simply put, overfitting is where your model's predictions are much less "accurate"
when presented with new data than they were on the data used to train it.
The full bootcamp covers overfitting in depth; this document will focus on the
intuition.
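One way to look for overfitting is to hold back some rows, train on the rest, and compare accuracy on both groups. The sketch below assumes the rpart package and R's built-in iris data set; the 100-row training split, the seed, and the settings that let the tree grow very specialized are all assumptions made for illustration. On a data set this small and clean, the gap between the two numbers may be modest:

```r
# Assumption: rpart and the built-in iris data are used purely for illustration
library(rpart)

set.seed(42)  # make the random split repeatable
iris_data <- iris

# Hold back 50 rows as "new data" the tree never sees during training
train_rows <- sample(nrow(iris_data), size = 100)
train_data <- iris_data[train_rows, ]
new_data   <- iris_data[-train_rows, ]

# Let the tree grow as specialized as it can on the training rows
deep_tree <- rpart(Species ~ ., data = train_data, method = "class",
                   control = rpart.control(cp = 0, minsplit = 2))

# Share of rows where the predicted label matches the real label
accuracy <- function(model, data) {
  mean(predict(model, data, type = "class") == data$Species)
}

train_accuracy <- accuracy(deep_tree, train_data)
new_accuracy   <- accuracy(deep_tree, new_data)

print(c(training = train_accuracy, new_data = new_accuracy))
```

A training score that is clearly higher than the score on the held-back rows is the warning sign to watch for.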
Or Is It?
What Happened?
The bootcamp has robust coverage of model tuning. For the purposes of this
document, the focus will be on the intuition behind tuning machine learning
models.
Conceptually, think of model tuning like tuning your automobile for optimal
performance.
The decision tree "engine" has a number of settings that you can control as the
machine learning practitioner.
These settings are called hyperparameters and are typically changed as needed for
optimal decision tree performance - just like a mechanic can change the engine
settings in your automobile for optimal performance.
In the case of decision trees, tuning manifests as controlling how specialized (i.e.,
how complex) a tree can become from the data used to train it.
In the case of the tree on the previous page, you would tune the hyperparameters
to make a less complex tree to pursue better predictions on new data.
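Here is a sketch of what changing those settings can look like. It assumes the rpart package and R's built-in iris data set, and the specific values chosen for maxdepth, minsplit, and cp are illustrative rather than recommended:

```r
# Assumption: rpart and the built-in iris data are used purely for illustration
library(rpart)

iris_data <- iris

# Settings that let the tree become very specialized (complex)
complex_tree <- rpart(Species ~ ., data = iris_data, method = "class",
                      control = rpart.control(cp = 0, minsplit = 2))

# Settings that hold the tree back: at most 2 levels of decisions,
# and a node needs at least 20 rows before it may be split
simple_tree <- rpart(Species ~ ., data = iris_data, method = "class",
                     control = rpart.control(maxdepth = 2, minsplit = 20, cp = 0.01))

# Each row of the frame is one node, so the row count measures tree size
complex_size <- nrow(complex_tree$frame)
simple_size  <- nrow(simple_tree$frame)

print(c(complex = complex_size, simple = simple_size))
```

Which settings work best depends on the data, which is why tuning is usually done by trying several values and checking the predictions on data held back from training.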
Gini Impurity
Impurity Intuition
The bootcamp has robust coverage of the mathematics of decision trees. Here's the
good news:
This document will focus on the intuition of impurity math, starting with a 2-label
(i.e., binary) case.
Using our example, we can think about "buckets" of labels and impurity as a
spectrum...
Gini Impurity
Turns out there are several calculations that manifest the intuition of the previous
page.
Gini impurity is widely used and is the default calculation used by the R packages
taught in the bootcamp.
Don't panic!
While the math may appear intimidating, when you think about it in English, it is
really quite easy...
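For reference, the standard textbook definition of Gini impurity is given here; the notation is an assumption and may differ from the original figure. For a bucket holding k distinct labels, where p_i is the fraction of the bucket carrying label i:

```latex
G = 1 - \sum_{i=1}^{k} p_i^{2}
```

In English: square each label's share of the bucket, add the squares up, and subtract the total from 1. With only 2 labels and a share p for the first label, this becomes G = 1 - p^2 - (1 - p)^2 = 2p(1 - p).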
As we know, the classification tree algorithm loves to split up the data into groups
where the labels are all the same (i.e., pure).
Using the math from the previous page we can now do some calculations and see
how pure buckets of labels have the smallest Gini impurity scores.
Notice that the worst-case scenario is a 50/50 split of labels. This makes intuitive
sense when you have 2 labels - it is essentially a coin flip.
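These calculations are easy to try in R. The sketch below uses the standard definition of Gini impurity (1 minus the sum of the squared label shares); the function name gini_impurity and the label values "A" and "B" are assumptions made for illustration:

```r
# Hypothetical helper: Gini impurity of a bucket of labels
gini_impurity <- function(labels) {
  shares <- table(labels) / length(labels)  # fraction of the bucket per label
  1 - sum(shares^2)
}

pure_bucket   <- gini_impurity(c("A", "A", "A", "A"))  # all labels the same
uneven_bucket <- gini_impurity(c("A", "A", "A", "B"))  # a 75/25 split
even_bucket   <- gini_impurity(c("A", "A", "B", "B"))  # a 50/50 split

print(c(pure = pure_bucket, uneven = uneven_bucket, even = even_bucket))
# pure = 0, uneven = 0.375, even = 0.5
```

The pure bucket scores 0, and the score climbs as the labels get more mixed, peaking at 0.5 for the 50/50 case.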
My speciality is customizing analytics training for your team to ensure your investment
generates business impact.
Many of my clients augment bootcamp training with 1-on-1 coaching services for their
team. By coaching your team on how to apply the skills learned to your specific business
problems, your return on investment is ensured.
BTW - For my coaching clients I offer a guarantee. I will continue to coach your team free
of charge until they are successful with the learned skills.