TUNING DECISION TREES WITH PYTHON
WHICH TREE IS BETTER?

Let me know in the chat:
• Name & where you are attending from
• OK to connect with folks on LinkedIn?
About Me
Hands-on analytics consultant and instructor.

• I’ve been in tech for 26 years and doing hands-on analytics for 12+ years.
• I’ve supported all manner of business functions and advised leaders.
• I have successfully trained 1000+ professionals in a live classroom setting.
• Trained 1000s more via my online courses and tutorials.
Housekeeping
Questions • Polls • Offers • Chat • Handouts

Please hold “industry-specific” questions until the end.

The Code is the Easy Part!
Supervised Learning
Machine learning encompasses many areas of study. The focus of this course will be supervised learning…

[Diagram: You (the supervisor – e.g., a data analyst or teacher) supply Data and an Algorithm; Training produces a Model in the Student (the machine).]

An example DataFrame:

education   sex      hours_per_week   age   income
Bachelors   Male     52               39    >50K
Doctorate   Female   23               53    <=50K
HS-grad     Male     40               31    <=50K
Masters     Female   43               26    >50K
Did the Machine Learn?
As the “teacher” supervising the student’s learning, you want to evaluate how much the machine has learned.

Just as with humans, this involves testing. Your data is split into training data (used to teach the student/machine) and test data (used by you, the supervisor, to evaluate what was learned).
Splitting Your Data
How Much Training/Test Data?
The real answer is “it depends,” though reserving 20–30% of the data for testing is a common starting point.
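
As a minimal sketch of creating such a split with scikit-learn (assuming the cleaned Adult Census data is already loaded into a pandas DataFrame named adult with an income column – the names here are illustrative):

```python
from sklearn.model_selection import train_test_split

# Features and label (assumes a DataFrame named `adult` with an `income` column)
X = adult.drop(columns=["income"])
y = adult["income"]

# Hold out 20% for testing; stratify so both splits keep the same label balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```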
Forms of Supervised Learning

You can think of supervised learning as coming in two forms – classification and regression.

Classification models predict labels, whereas regression models predict numeric values (i.e., targets).

Some ML algorithms are classification or regression only. Tree-based algorithms can do both!

CLASSIFICATION
Types of predictions: coded labels (e.g., code 0 = FALSE, code 1 = TRUE)
Business scenarios:
• Fraud detection
• Churn prevention
• Conversion modeling
• Underwriting

REGRESSION
Types of predictions: numeric values – anything with a decimal point!
Business scenarios:
• Marketing mix
• Price/cost modeling
• Customer lifetime value
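
To make the “trees can do both” point concrete, here is a hedged sketch using scikit-learn’s two tree estimators on synthetic stand-in data (the real course data comes next):

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: predict a discrete label (synthetic data stands in for real data)
X_c, y_c = make_classification(n_samples=500, random_state=42)
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_c, y_c)

# Regression: predict a numeric value using the same tree machinery
X_r, y_r = make_regression(n_samples=500, random_state=42)
reg = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_r, y_r)
```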
The Data
The Adult Census Income Dataset
Here is a summarized description of the Adult Census dataset:

• “Extraction was done by Barry Becker from the 1994 Census database.

Prediction task is to determine whether a person makes over 50K a year.”

A cleaned version of this dataset will be used as a running example throughout the course lectures.

This dataset represents a classification scenario – the value to be predicted is a categorical label.

Most of the course will focus on classification, as this knowledge is directly transferable to regression.

More information on the Adult Census dataset can be found at the UCI Machine Learning Repository:

• https://archive.ics.uci.edu/ml/datasets/adult
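
As a hedged sketch of loading the raw UCI file with pandas (the column names follow the UCI documentation; the cleaned version used in the course may differ in file name and preprocessing):

```python
import pandas as pd

# Column names per the UCI "adult" documentation (assumption: raw file, not the cleaned course version)
columns = [
    "age", "work_class", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country", "income",
]
adult = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    header=None, names=columns, skipinitialspace=True, na_values="?",
)
print(adult.shape)  # roughly (32561, 15) for the training portion
```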
The Adult Census Income Dataset
Variable – Description – Values

age – Age of observation in years.
work_class – Categorical feature denoting type of employment. Values: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt – Numeric feature. Statistical calculation of demographics.
education – Categorical feature denoting level of education. Values: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education_num – Numeric feature. Years of education completed.
marital_status – Categorical feature. Values: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation – Categorical feature denoting occupation type. Values: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship – Categorical feature denoting familial relationship. Values: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race – Categorical feature denoting racial assignment. Values: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex – Categorical feature denoting gender. Values: Female, Male.
capital_gain – Numeric feature. Any reported capital gains.
capital_loss – Numeric feature. Any reported capital losses.
hours_per_week – Numeric feature. Employment hours worked per week.
native_country – Categorical feature denoting country of citizenry before immigration to the US. Values: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, etc.
income – Categorical feature (the label) denoting income level. Values: <=50K, >50K.
Under/Overfitting
It’s All About the Fit!
In crafting valuable machine learning models, a critical idea is underfitting vs. overfitting.

The concept of a spectrum is useful…

[Spectrum: Less Complex = Underfitting (e.g., a decision stump); Goldilocks Zone = models as complex as needed, but no more complex; More Complex = Overfitting.]


Controlling Complexity
The DecisionTreeClassifier class offers many options for controlling complexity.

These options are what machine learning practitioners call hyperparameters.

While the DecisionTreeClassifier supports many hyperparameters, the following are the most useful:

Hyperparameter (default) – Description

max_depth (default: None) – The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split observations.

min_samples_split (default: 2) – The minimum number of observations required to perform a split. If the value is an integer, it is the minimum count; if it is a float, it is a fraction of the observations.

min_samples_leaf (default: 1) – The minimum number of observations required to make a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. If the value is an integer, it is the minimum count; if it is a float, it is a fraction of the observations.

min_impurity_decrease (default: 0.0) – A node will be split if the split produces a decrease in the weighted impurity greater than or equal to this value.
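
A minimal sketch of setting these hyperparameters on a DecisionTreeClassifier (the specific values below are illustrative, not recommendations – finding good values is the tuning problem covered later):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,                  # cap how deep the tree may grow
    min_samples_split=100,        # need at least 100 observations to consider a split
    min_samples_leaf=50,          # every leaf must keep at least 50 training observations
    min_impurity_decrease=0.001,  # a split must reduce weighted impurity by at least this much
    random_state=42,
)
# tree.fit(X_train, y_train)  # assumes the train/test split created earlier
```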
Max Depth
Controlling complexity using the max_depth hyperparameter…

[Figure: tree trained with max_depth = 1 vs. tree trained with max_depth = 3]
Min Samples Per Split
Controlling complexity using the min_samples_split hyperparameter…

[Figure: tree trained with min_samples_split = 1000 vs. min_samples_split = 10000]


Min Samples Per Leaf
Controlling complexity using the min_samples_leaf hyperparameter…

[Figure: tree trained with min_samples_leaf = 1000 vs. min_samples_leaf = 10000]


Min Impurity Decrease
Controlling complexity using the min_impurity_decrease hyperparameter…

[Figure: tree trained with min_impurity_decrease = 0.001 vs. min_impurity_decrease = 0.01]


The Bias-Variance Tradeoff
Dave Heads Down to the Pub
The bias-variance tradeoff is arguably one of the most important concepts in machine learning.

To gain intuitive understanding, let’s use the example of throwing darts at the pub…

• Underfitting / High bias, low variance: Dave is good at darts, but his board at home is too high – his throws cluster tightly, but away from the bullseye.
• High bias, high variance: Dave is good at darts, his board at home is too high, and he’s had a few – his throws are both off-target and scattered.
• The goal / Low bias, low variance: Dave is good at darts and his board is regulation – his throws cluster tightly on the bullseye.
• Overfitting / Low bias, high variance: Dave is good at darts, his board at home is regulation, and he’s had a few – his throws center on the bullseye but are scattered.

High Bias, Low Variance Model

We saw an example of a high bias, low variance model in the last section:

This model exhibits very high bias and very low variance – it always predicts <=50K!
Low Bias, High Variance Model
We saw an example of a low bias, high variance model in the last section:

This model has likely overfit the training data!
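
A hedged way to see both failure modes in code (assuming the earlier train/test split and features that have already been numerically encoded; the size of the train/test gap, not the exact numbers, is the point):

```python
from sklearn.tree import DecisionTreeClassifier

# An unconstrained tree tends toward low bias / high variance:
# near-perfect training accuracy, noticeably worse test accuracy.
deep_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("deep tree - train:", deep_tree.score(X_train, y_train),
      "test:", deep_tree.score(X_test, y_test))

# A decision stump tends toward high bias / low variance:
# similar, but mediocre, accuracy on both splits.
stump = DecisionTreeClassifier(max_depth=1, random_state=42).fit(X_train, y_train)
print("stump     - train:", stump.score(X_train, y_train),
      "test:", stump.score(X_test, y_test))
```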


The Tradeoff

[Spectrum: as model complexity increases, bias decreases and variance increases. Less Complex = Underfitting (e.g., a decision stump); Goldilocks Zone = models as complex as needed, but no more complex; More Complex = Overfitting.]


Cross-Validation
Supervising the Data
Remember our classroom analogy?

It turns out that your most important duty is supervising the data.

The key to optimizing the bias-variance tradeoff is the intersection of data and training regimen.

To gain intuition for how to supervise data, we’ll continue with the classroom analogy.
Back to the Teacher
Imagine you are the teacher. Your goal is to teach most effectively.

How do you achieve this? How do you know when you are successful?

In a word – testing. However, good teachers don’t just jump into testing. Good teachers also provide practice to students…

Classic machine learning practice is to segment your data into Training, Validation, and Testing datasets.

[Diagram: 100 questions with answers are split into practice questions with answers (training), quiz questions (validation), and final test questions (testing).]


Data Trumps Algorithm

A popular saying in machine learning is, “data trumps algorithm.” The core idea is that you can craft more useful models with more data.

NOTE – Garbage in, garbage out (GIGO) applies here!

Applying this to our example: training data is a precious resource, and we want as much of it as we can get. However, we still need to validate progress and perform final testing.

[Diagram: “So much practice…” (all practice questions used for training, with only a final test held out) vs. “½ the practice!” (training questions cut in half to make room for a validation quiz and a final test).]
Cross-Validation
We can’t escape the need to pull out some data for final testing (i.e., a holdout set).

However, it would be ideal if we didn’t also need a separate validation holdout set.

Enter cross-validation…

[Diagram: Step 1 – Repurpose the validation split. Step 2 – Hold out 25% of the data for final testing. Step 3 – Cross-validate on the remaining 75%, rotating which 25% slice serves as validation. With cross-validation, we get to use more of the data for training.]
3-Fold Cross-Validation
Cross-validation is a technique to make maximum use of your training data.

Using cross-validation, your training regimen gets “multiple looks at the training data.”

Here’s how it works…

[Diagram: Step 1 – Split the training data into three equal folds. Step 2 – Train on folds 1–2, validate on fold 3. Step 3 – Train on folds 1 and 3, validate on fold 2. Step 4 – Train on folds 2–3, validate on fold 1.]

Each split is known as a fold. The number of folds is referred to as k (i.e., k-fold cross-validation). Using cross-validation, you train and evaluate k models.
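
A minimal sketch of k-fold cross-validation with scikit-learn (cv=3 matches the illustration above; X_train and y_train are assumed from the earlier split, with features already encoded):

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=5, random_state=42)

# 3-fold cross-validation: trains and evaluates k = 3 models on rotating folds
scores = cross_val_score(tree, X_train, y_train, cv=3, scoring="accuracy")
print("fold accuracies:", scores)
print("mean:", scores.mean(), "range:", scores.max() - scores.min())
```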
Model Tuning
Back to the Darts
We can now combine everything we’ve learned so far.

Let’s assume you’ve got data and some DecisionTreeClassifier hyperparameter values.

You then perform 10-fold cross-validation with the above, where each CV fold is conceptually a dart…

[Figure: four dartboards of CV results – hyperparameter set 1 (high bias, high variance), set 2 (high bias, low variance), set 3 (low bias, high variance), and set 4 (low bias, low variance).]

This process is the essence of model tuning.


Making the Darts Real
Let’s assume you are optimizing your DecisionTreeClassifier for accuracy.

Also, let’s assume you’re using 10-fold cross-validation to evaluate the tradeoff…

CV results with four hyperparameter sets (the mean accuracy reflects bias; the range across folds reflects variance):

                    Set 1    Set 2    Set 3    Set 4
Mean (bias)         84.5     85.35    90.06    95.47
Range (variance)    9        2.2      10.8     2
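
A hedged sketch of automating this search with scikit-learn’s GridSearchCV (the grid values are illustrative, not the course’s exact settings; each combination is scored with 10-fold cross-validation):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Illustrative hyperparameter grid - every combination becomes one "dartboard"
param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 50, 200],
    "min_impurity_decrease": [0.0, 0.001, 0.01],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid, cv=10, scoring="accuracy",
)
search.fit(X_train, y_train)
print("best mean CV accuracy:", search.best_score_)
print("best hyperparameters:", search.best_params_)
```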
Estimating Generalization Error
Useful machine learning models generalize well – that is, they produce “accurate” predictions on new, unseen data. How do you know that any given model will generalize well?

You leverage cross-validation to tune your models and estimate the generalization error.

Set 1: Mean = 84.5, Range = 9
Set 2: Mean = 85.35, Range = 2.2
Set 3: Mean = 90.06, Range = 10.8
Set 4: Mean = 95.47, Range = 2   ← Winner!
What About the Test Holdout?
The test holdout set is your final estimate of generalization error.

WARNING – The test holdout set cannot be used to influence training in any way, or it is useless!

Here’s the process (assuming you can’t get more data):

1. Acquire your data.
2. Split data into training and test datasets.
3. Explore the training data.
4. Clean the training data.
5. Select your algorithm (e.g., DecisionTreeClassifier).
6. Engineer features with the training data.
7. Train and tune your model with cross-validation.
8. If your model doesn’t have low enough bias and variance, go back to step 3.
9. Use the test set once to estimate the generalization error.
10. If generalization error meets business requirements, you may have a useful model!
11. Train a new model using all the data.

There is no guarantee at the beginning of your work that you will craft a useful model!
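
A hedged sketch of steps 9–11, assuming the tuned GridSearchCV object from the earlier example and test data that received the same preparation as the training data:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Step 9: use the untouched test holdout once to estimate generalization error
test_accuracy = search.best_estimator_.score(X_test, y_test)
print("estimated generalization accuracy:", test_accuracy)

# Step 11: if that meets business requirements, retrain with the winning
# hyperparameters on ALL the data before putting the model to work
final_model = DecisionTreeClassifier(**search.best_params_, random_state=42)
final_model.fit(pd.concat([X_train, X_test]), pd.concat([y_train, y_test]))
```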
Model Tuning with Python
Loading the Training Data
Prepping the Training Data
Model Tuning
Bias and Variance
Prepping the Test Data
Model Testing
Wrap-Up
Continue Your Learning
Q&A
