Tuning Decision Trees with Python
DECISION TREES WITH PYTHON
WHICH TREE IS BETTER?
[Diagram: supervised learning – you (the supervisor) oversee algorithm training; the student is the machine.]
DataFrame

education   sex     hours_per_week  age  income
Bachelors   Male    52              39   >50K
Doctorate   Female  23              53   <=50K
HS-grad     Male    40              31   <=50K
Masters     Female  43              26   >50K
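For reference, a minimal pandas sketch that builds the small example DataFrame above (the four rows come from the slide, not the full dataset):

import pandas as pd

# Build the small example DataFrame shown above (illustrative rows only).
df = pd.DataFrame({
    "education": ["Bachelors", "Doctorate", "HS-grad", "Masters"],
    "sex": ["Male", "Female", "Male", "Female"],
    "hours_per_week": [52, 23, 40, 43],
    "age": [39, 53, 31, 26],
    "income": [">50K", "<=50K", "<=50K", ">50K"],
})
print(df)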
Did the Machine Learn?
As the “teacher” supervising the student’s learning, you want to evaluate how much
the machine has learned.
[Diagram: the student (machine) learns from training data and is evaluated on held-out test data.]
Splitting Your Data
How Much Training/Test Data?
The real answer is “it depends.”
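As a minimal sketch, scikit-learn's train_test_split handles the split; the 80/20 ratio below is a common convention rather than a rule, and df is the example DataFrame from earlier:

from sklearn.model_selection import train_test_split

# Separate features from the label (df is the example DataFrame above).
X = df.drop(columns="income")
y = df["income"]

# Hold out 20% of the rows for testing; random_state makes the split
# reproducible. The 80/20 ratio is a convention, not a requirement.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)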
Forms of Supervised Learning
You can think of supervised learning as coming in two forms – classification and regression.
Classification models predict labels, whereas regression models predict numeric values (i.e., targets).
Some ML algorithms are classification or regression only. Tree-based algorithms can do both!
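A minimal sketch of trees doing both, using scikit-learn's two decision tree estimators (the hyperparameter values are illustrative):

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: predict a label, such as income (<=50K vs. >50K).
clf = DecisionTreeClassifier(max_depth=3)
# Regression: predict a numeric target, such as hours_per_week.
reg = DecisionTreeRegressor(max_depth=3)
# Both expose the same fit/predict API, e.g.:
#   clf.fit(X_train, y_train); clf.predict(X_test)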
[Slide layout: two columns contrasting CLASSIFICATION and REGRESSION, each listing types of predictions and business scenarios.]
• “Extraction was done by Barry Becker from the 1994 Census database.”
• A cleaned version of this dataset will be used as a running example throughout the course lectures.
• This dataset represents a classification scenario – the data to be predicted is a categorical label.
• Most of the course will focus on classification, as this knowledge is directly transferable to regression.
• More information on the Adult Census dataset can be found at the UCI Machine Learning Repository:
• https://archive.ics.uci.edu/ml/datasets/adult
The Adult Census Income Dataset
Variable – Description – Values
age – Numeric feature. Age of observation in years.
work_class – Categorical feature denoting type of employment. Values: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
fnlwgt – Numeric feature. Statistical calculation of demographics.
education – Categorical feature denoting level of education. Values: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
education_num – Numeric feature. Years of education completed.
marital_status – Categorical feature denoting marital status. Values: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
occupation – Categorical feature denoting occupation type. Values: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
relationship – Categorical feature denoting familial relationship. Values: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
race – Categorical feature denoting racial assignment. Values: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
sex – Categorical feature denoting gender. Values: Female, Male
capital_gain – Numeric feature. Any reported capital gains.
capital_loss – Numeric feature. Any reported capital losses.
hours_per_week – Numeric feature. Employment hours worked per week.
native_country – Categorical feature denoting country of citizenry before immigration to the US. Values: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, etc.
income – Categorical feature (the label) denoting income level. Values: <=50K, >50K
Under/Overfitting
It’s All About the Fit!
In crafting valuable machine learning models, a critical idea is underfitting vs. overfitting.
Models as complex as needed, but no more complex!
[Figure: a decision stump – a decision tree with a single split.]
While the DecisionTreeClassifier supports many hyperparameters, the following are the most useful:
[Figures: decision trees trained with max_depth = 1 and max_depth = 3.]
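A minimal sketch of the effect of max_depth, using synthetic data (make_classification) rather than the Adult dataset:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for illustration.
X, y = make_classification(n_samples=500, random_state=42)

# max_depth caps how many levels of splits the tree may grow.
stump = DecisionTreeClassifier(max_depth=1).fit(X, y)   # a decision stump
deeper = DecisionTreeClassifier(max_depth=3).fit(X, y)  # more capacity
print(stump.get_depth(), deeper.get_depth())            # 1 3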
Min Samples Per Split
Controlling complexity using the min_samples_split hyperparameter…
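A minimal sketch of min_samples_split on the same kind of synthetic data; larger values forbid splitting small nodes and so yield simpler trees:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)

# A node must contain at least min_samples_split samples to be split.
strict = DecisionTreeClassifier(min_samples_split=50).fit(X, y)
loose = DecisionTreeClassifier(min_samples_split=2).fit(X, y)  # sklearn default
print(strict.get_n_leaves(), loose.get_n_leaves())  # fewer leaves vs. more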
To gain intuitive understanding, let’s use the example of throwing darts at the pub…
Underfitting
[Figure: two dartboards.]
High bias, low variance: Dave is good at darts, but his board at home is too high.
High bias, high variance: Dave is good at darts, his board at home is too high, and he's had a few.
We saw an example of a high bias, low variance model in the last section:
[Figure: bias vs. variance – the decision stump. Models as complex as needed, but no more complex!]
[Diagram (repeated): you (the supervisor) oversee algorithm training of the student (machine).]
How do you achieve this? How do you know when you are successful?
In a word – testing. However, good teachers don't just jump into testing.
Good teachers also provide practice to students. Classic machine learning practice is to segment your data into Training, Validation, and Testing datasets.
[Diagram: e.g., 100 questions with answers are segmented into 50 training questions with answers and a 25-question quiz for validation (75 questions with answers used for practice), with the remainder held out for testing.]
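A minimal sketch of the three-way segmentation, using stand-in data; two chained train_test_split calls yield a 50/25/25 split that mirrors the question counts in the diagram:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=42)  # stand-in data

# First carve off the final test set (the "exam").
X_practice, X_test, y_practice, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
# Then split the remainder into training and validation (the "quiz").
X_train, X_val, y_train, y_val = train_test_split(
    X_practice, y_practice, test_size=1/3, random_state=42)
print(len(X_train), len(X_val), len(X_test))  # 50 25 25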
A popular saying in machine learning is, "data trumps algorithm." The core idea is that you can craft more useful models with more data.
Using cross-validation, your training regimen gets “multiple looks at the training data.”
[Diagram: 4-fold cross-validation – the data is divided into four 25% chunks; each round uses a different chunk for validation and trains on the rest.]
Each split is known as a fold. The number of folds is referred to as k (i.e., k-fold cross-validation). Using cross-validation, you train and evaluate k models.
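A minimal sketch with scikit-learn's cross_val_score (synthetic data; k = 4 to match the diagram):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)
model = DecisionTreeClassifier(max_depth=3)

# k-fold CV trains and evaluates k models, returning one score per fold.
scores = cross_val_score(model, X, y, cv=4)
print(scores)                        # one accuracy value per fold
print(scores.mean(), scores.std())   # average skill and its spread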
Model Tuning
Back to the Darts
We can now combine everything we’ve learned so far.
Let’s assume you’ve got data and some DecisionTreeClassifier hyperparameter values.
You then perform 10-fold cross-validation with the above, where each CV fold is conceptually a dart…
[Figure: four dartboards – CV results with hyperparameter sets 1 through 4; each dart is the result of one CV fold.]
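A minimal sketch of the dartboards in code: compare hyperparameter sets by the mean and spread of their 10-fold CV scores (synthetic data; the parameter values are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)
param_sets = [
    {"max_depth": 1},
    {"max_depth": 3},
    {"max_depth": 3, "min_samples_split": 50},
    {"max_depth": None},  # grow without a depth cap
]
for i, params in enumerate(param_sets, start=1):
    scores = cross_val_score(DecisionTreeClassifier(**params), X, y, cv=10)
    # The mean is where the darts cluster; the std is how widely they scatter.
    print(f"set {i}: mean={scores.mean():.3f}, std={scores.std():.3f}")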
Also, let's assume you're using 10-fold cross-validation to evaluate the bias/variance tradeoff…
You leverage cross-validation to tune your models and estimate the generalization error.
There is no guarantee at the beginning of your work that you will craft a useful model!
Model Tuning with Python
Loading the Training Data
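A minimal sketch, assuming a cleaned local CSV of the Adult Census training data; the file name adult_train.csv is hypothetical (use the path from your course materials):

import pandas as pd

# Load the training data. The file name is hypothetical; substitute the
# cleaned CSV provided with the course materials.
adult_train = pd.read_csv("adult_train.csv")
print(adult_train.shape)
print(adult_train.head())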
Prepping the Training Data
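A minimal sketch of typical prep, assuming adult_train from the loading step (the course demos may prep differently): separate the label, then one-hot encode the categorical features.

import pandas as pd

# Separate the label from the features.
y_train = adult_train["income"]
X_train = adult_train.drop(columns="income")

# One-hot encode categorical features so the tree receives numeric input.
X_train = pd.get_dummies(X_train)
print(X_train.shape)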
Model Tuning
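A minimal sketch of tuning with scikit-learn's GridSearchCV, assuming X_train and y_train from the prep step; the grid values are illustrative:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Try every combination in the grid with 10-fold cross-validation.
grid = {
    "max_depth": [3, 5, 7],
    "min_samples_split": [2, 20, 50],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)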
Bias and Variance
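A minimal sketch of inspecting bias and variance signals, assuming the fitted search object from the tuning sketch above: the mean fold score indicates skill (bias), while the spread across folds indicates variance.

import numpy as np

# Pull the 10 per-fold scores for the best hyperparameter combination.
results = search.cv_results_
best = search.best_index_
fold_scores = [results[f"split{i}_test_score"][best] for i in range(10)]
print(np.mean(fold_scores), np.std(fold_scores))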
Prepping the Test Data
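A minimal sketch, assuming a hypothetical adult_test.csv and the X_train columns from the training prep; the test set must receive the same prep, with its one-hot columns aligned to training:

import pandas as pd

# Load and prep the test data exactly like the training data.
adult_test = pd.read_csv("adult_test.csv")  # hypothetical file name
y_test = adult_test["income"]
X_test = pd.get_dummies(adult_test.drop(columns="income"))

# Align columns with training: categories absent from the test set get
# zero-filled columns; unseen test-only categories are dropped.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)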
Model Testing
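A minimal sketch: score the tuned model on the held-out test set (GridSearchCV refits the best estimator on all the training data by default), giving the generalization estimate.

# Accuracy on unseen data; this is the generalization estimate.
print(search.score(X_test, y_test))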
Wrap-Up
Continue Your Learning
Q&A