TUNING DECISION TREES WITH PYTHON
WHICH TREE IS BETTER?

Let me know in the chat:
• Name & where you are attending from
• OK to connect with folks on LinkedIn?
About Me
Hands-on analytics consultant and instructor.

• I’ve been in tech for 26 years and doing hands-on analytics for 12+ years.
• I’ve supported all manner of business functions and advised leaders.
• I have successfully trained 1000+ professionals in a live classroom setting.
• Trained 1000s more via my online courses and tutorials.
Housekeeping
Questions • Polls • Offers • Chat • Handouts

Please hold “industry-specific” questions until the end.

The Code is the Easy Part!
Supervised Learning
Machine learning encompasses many areas of study. The focus of this course will be supervised learning…

[Diagram: You (the supervisor – e.g., a data analyst or teacher) supply Data and an Algorithm; Training produces a Model in the Student (the machine).]

An example DataFrame:

education   sex      hours_per_week   age   income
Bachelors   Male     52               39    >50K
Doctorate   Female   23               53    <=50K
HS-grad     Male     40               31    <=50K
Masters     Female   43               26    >50K
Did the Machine Learn?
As the “teacher” supervising the student’s learning, you want to evaluate how much the machine has learned.

Just as with humans, this involves testing. Your data is split into training data (used to teach the student/machine) and test data (used by you, the supervisor, to evaluate what was learned).
Splitting Your Data
How Much Training/Test Data?
The real answer is “it depends,” though reserving 20–30% of the data for testing is a common starting point.
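
As a minimal sketch of creating such a split with scikit-learn (assuming the cleaned Adult Census data is already loaded into a pandas DataFrame named adult with an income column – the names here are illustrative):

```python
from sklearn.model_selection import train_test_split

# Features and label (assumes a DataFrame named `adult` with an `income` column)
X = adult.drop(columns=["income"])
y = adult["income"]

# Hold out 20% for testing; stratify so both splits keep the same label balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```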
Forms of Supervised Learning

You can think of supervised learning as coming in two forms – classification and regression.

Classification models predict labels, whereas regression models predict numeric values (i.e., targets).

Some ML algorithms are classification or regression only. Tree-based algorithms can do both!

CLASSIFICATION
Types of predictions: coded labels (e.g., code 0 = FALSE, code 1 = TRUE)
Business scenarios:
• Fraud detection
• Churn prevention
• Conversion modeling
• Underwriting

REGRESSION
Types of predictions: numeric values – anything with a decimal point!
Business scenarios:
• Marketing mix
• Price/cost modeling
• Customer lifetime value
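
To make the “trees can do both” point concrete, here is a hedged sketch using scikit-learn’s two tree estimators on synthetic stand-in data (the real course data comes next):

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: predict a discrete label (synthetic data stands in for real data)
X_c, y_c = make_classification(n_samples=500, random_state=42)
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_c, y_c)

# Regression: predict a numeric value using the same tree machinery
X_r, y_r = make_regression(n_samples=500, random_state=42)
reg = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_r, y_r)
```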
The Data
The Adult Census Income Dataset
Here is a summarized description of the Adult Census dataset:

• “Extraction was done by Barry Becker from the 1994 Census database.

Prediction task is to determine whether a person makes over 50K a year.”

A cleaned version of this dataset will be used as a running example throughout the course lectures.

This dataset represents a classification scenario – the value to be predicted is a categorical label.

Most of the course will focus on classification, as this knowledge is directly transferable to regression.

More information on the Adult Census dataset can be found at the UCI Machine Learning Repository:

• https://archive.ics.uci.edu/ml/datasets/adult
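
As a hedged sketch of loading the raw UCI file with pandas (the column names follow the UCI documentation; the cleaned version used in the course may differ in file name and preprocessing):

```python
import pandas as pd

# Column names per the UCI "adult" documentation (assumption: raw file, not the cleaned course version)
columns = [
    "age", "work_class", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country", "income",
]
adult = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    header=None, names=columns, skipinitialspace=True, na_values="?",
)
print(adult.shape)  # roughly (32561, 15) for the training portion
```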
The Adult Census Income Dataset
Variable – Description – Values

age – Age of observation in years.
work_class – Categorical feature denoting type of employment. Values: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt – Numeric feature. Statistical calculation of demographics.
education – Categorical feature denoting level of education. Values: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education_num – Numeric feature. Years of education completed.
marital_status – Categorical feature. Values: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation – Categorical feature denoting occupation type. Values: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship – Categorical feature denoting familial relationship. Values: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race – Categorical feature denoting racial assignment. Values: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex – Categorical feature denoting gender. Values: Female, Male.
capital_gain – Numeric feature. Any reported capital gains.
capital_loss – Numeric feature. Any reported capital losses.
hours_per_week – Numeric feature. Employment hours worked per week.
native_country – Categorical feature denoting country of citizenry before immigration to the US. Values: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, etc.
income – Categorical feature (the label) denoting income level. Values: <=50K, >50K.
Under/Overfitting
It’s All About the Fit!
In crafting valuable machine learning models, a critical idea is underfitting vs. overfitting.

The concept of a spectrum is useful…

[Spectrum: Less Complex = Underfitting (e.g., a decision stump); Goldilocks Zone = models as complex as needed, but no more complex; More Complex = Overfitting.]


Controlling Complexity
The DecisionTreeClassifier class offers many options for controlling complexity.

These options are what machine learning practitioners call hyperparameters.

While the DecisionTreeClassifier supports many hyperparameters, the following are the most useful:

Hyperparameter (default) – Description

max_depth (default: None) – The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split observations.

min_samples_split (default: 2) – The minimum number of observations required to perform a split. If the value is an integer, it is the minimum count; if it is a float, it is a fraction of the observations.

min_samples_leaf (default: 1) – The minimum number of observations required to make a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. If the value is an integer, it is the minimum count; if it is a float, it is a fraction of the observations.

min_impurity_decrease (default: 0.0) – A node will be split if the split produces a decrease in the weighted impurity greater than or equal to this value.
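
A minimal sketch of setting these hyperparameters on a DecisionTreeClassifier (the specific values below are illustrative, not recommendations – finding good values is the tuning problem covered later):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,                  # cap how deep the tree may grow
    min_samples_split=100,        # need at least 100 observations to consider a split
    min_samples_leaf=50,          # every leaf must keep at least 50 training observations
    min_impurity_decrease=0.001,  # a split must reduce weighted impurity by at least this much
    random_state=42,
)
# tree.fit(X_train, y_train)  # assumes the train/test split created earlier
```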
Max Depth
Controlling complexity using the max_depth hyperparameter…

[Figure: tree trained with max_depth = 1 vs. tree trained with max_depth = 3]
Min Samples Per Split
Controlling complexity using the min_samples_split hyperparameter…

[Figure: tree trained with min_samples_split = 1000 vs. min_samples_split = 10000]


Min Samples Per Leaf
Controlling complexity using the min_samples_leaf hyperparameter…

[Figure: tree trained with min_samples_leaf = 1000 vs. min_samples_leaf = 10000]


Min Impurity Decrease
Controlling complexity using the min_impurity_decrease hyperparameter…

[Figure: tree trained with min_impurity_decrease = 0.001 vs. min_impurity_decrease = 0.01]


The Bias-Variance Tradeoff
Dave Heads Down to the Pub
The bias-variance tradeoff is arguably one of the most important concepts in machine learning.

To gain intuitive understanding, let’s use the example of throwing darts at the pub…

• Underfitting / High bias, low variance: Dave is good at darts, but his board at home is too high – his throws cluster tightly, but away from the bullseye.
• High bias, high variance: Dave is good at darts, his board at home is too high, and he’s had a few – his throws are both off-target and scattered.
• The goal / Low bias, low variance: Dave is good at darts and his board is regulation – his throws cluster tightly on the bullseye.
• Overfitting / Low bias, high variance: Dave is good at darts, his board at home is regulation, and he’s had a few – his throws center on the bullseye but are scattered.

High Bias, Low Variance Model

We saw an example of a high bias, low variance model in the last section:

This model exhibits very high bias and very low variance – it always predicts <=50K!
Low Bias, High Variance Model
We saw an example of a low bias, high variance model in the last section:

This model has likely overfit the training data!
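
A hedged way to see both failure modes in code (assuming the earlier train/test split and features that have already been numerically encoded; the size of the train/test gap, not the exact numbers, is the point):

```python
from sklearn.tree import DecisionTreeClassifier

# An unconstrained tree tends toward low bias / high variance:
# near-perfect training accuracy, noticeably worse test accuracy.
deep_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("deep tree - train:", deep_tree.score(X_train, y_train),
      "test:", deep_tree.score(X_test, y_test))

# A decision stump tends toward high bias / low variance:
# similar, but mediocre, accuracy on both splits.
stump = DecisionTreeClassifier(max_depth=1, random_state=42).fit(X_train, y_train)
print("stump     - train:", stump.score(X_train, y_train),
      "test:", stump.score(X_test, y_test))
```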


The Tradeoff

[Spectrum: as model complexity increases, bias decreases and variance increases. Less Complex = Underfitting (e.g., a decision stump); Goldilocks Zone = models as complex as needed, but no more complex; More Complex = Overfitting.]


Cross-Validation
Supervising the Data
Remember our classroom analogy?

It turns out that your most important duty is supervising the data.

The key to optimizing the bias-variance tradeoff is the intersection of data and training regimen.

To gain intuition for how to supervise data, we’ll continue with the classroom analogy.
Back to the Teacher
Imagine you are the teacher. Your goal is to teach most effectively.

How do you achieve this? How do you know when you are successful?

In a word – testing. However, good teachers don’t just jump into testing. Good teachers also provide practice to students…

Classic machine learning practice is to segment your data into Training, Validation, and Testing datasets.

[Diagram: 100 questions with answers are split into practice questions with answers (training), quiz questions (validation), and final test questions (testing).]


Data Trumps Algorithm

A popular saying in machine learning is, “data trumps algorithm.” The core idea is that you can craft more useful models with more data.

NOTE – Garbage in, garbage out (GIGO) applies here!

Applying this to our example: training data is a precious resource, and we want as much of it as we can get. However, we still need to validate progress and perform final testing.

[Diagram: “So much practice…” (all practice questions used for training, with only a final test held out) vs. “½ the practice!” (training questions cut in half to make room for a validation quiz and a final test).]
Cross-Validation
We can’t escape the need to pull out some data for final testing (i.e., a holdout set).

However, it would be ideal if we didn’t also need a separate validation holdout set.

Enter cross-validation…

[Diagram: Step 1 – Repurpose the validation split. Step 2 – Hold out 25% of the data for final testing. Step 3 – Cross-validate on the remaining 75%, rotating which 25% slice serves as validation. With cross-validation, we get to use more of the data for training.]
3-Fold Cross-Validation
Cross-validation is a technique to make maximum use of your training data.

Using cross-validation, your training regimen gets “multiple looks at the training data.”

Here’s how it works…

[Diagram: Step 1 – Split the training data into three equal folds. Step 2 – Train on folds 1–2, validate on fold 3. Step 3 – Train on folds 1 and 3, validate on fold 2. Step 4 – Train on folds 2–3, validate on fold 1.]

Each split is known as a fold. The number of folds is referred to as k (i.e., k-fold cross-validation). Using cross-validation, you train and evaluate k models.
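
A minimal sketch of k-fold cross-validation with scikit-learn (cv=3 matches the illustration above; X_train and y_train are assumed from the earlier split, with features already encoded):

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=5, random_state=42)

# 3-fold cross-validation: trains and evaluates k = 3 models on rotating folds
scores = cross_val_score(tree, X_train, y_train, cv=3, scoring="accuracy")
print("fold accuracies:", scores)
print("mean:", scores.mean(), "range:", scores.max() - scores.min())
```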
Model Tuning
Back to the Darts
We can now combine everything we’ve learned so far.

Let’s assume you’ve got data and some DecisionTreeClassifier hyperparameter values.

You then perform 10-fold cross-validation with the above, where each CV fold is conceptually a dart…

[Figure: four dartboards of CV results – hyperparameter set 1 (high bias, high variance), set 2 (high bias, low variance), set 3 (low bias, high variance), and set 4 (low bias, low variance).]

This process is the essence of model tuning.


Making the Darts Real
Let’s assume you are optimizing your DecisionTreeClassifier for accuracy.

Also, let’s assume you’re using 10-fold cross-validation to evaluate the tradeoff…

CV results with four hyperparameter sets (the mean accuracy reflects bias; the range across folds reflects variance):

                    Set 1    Set 2    Set 3    Set 4
Mean (bias)         84.5     85.35    90.06    95.47
Range (variance)    9        2.2      10.8     2
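
A hedged sketch of automating this search with scikit-learn’s GridSearchCV (the grid values are illustrative, not the course’s exact settings; each combination is scored with 10-fold cross-validation):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Illustrative hyperparameter grid - every combination becomes one "dartboard"
param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 50, 200],
    "min_impurity_decrease": [0.0, 0.001, 0.01],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid, cv=10, scoring="accuracy",
)
search.fit(X_train, y_train)
print("best mean CV accuracy:", search.best_score_)
print("best hyperparameters:", search.best_params_)
```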
Estimating Generalization Error
Useful machine learning models generalize well – that is, they produce “accurate” predictions on new, unseen data. How do you know that any given model will generalize well?

You leverage cross-validation to tune your models and estimate the generalization error.

Set 1: Mean = 84.5, Range = 9
Set 2: Mean = 85.35, Range = 2.2
Set 3: Mean = 90.06, Range = 10.8
Set 4: Mean = 95.47, Range = 2   ← Winner!
What About the Test Holdout?
The test holdout set is your final estimate of generalization error.

WARNING – The test holdout set cannot be used to influence training in any way, or it is useless!

Here’s the process (assuming you can’t get more data):

1. Acquire your data.
2. Split data into training and test datasets.
3. Explore the training data.
4. Clean the training data.
5. Select your algorithm (e.g., DecisionTreeClassifier).
6. Engineer features with the training data.
7. Train and tune your model with cross-validation.
8. If your model doesn’t have low enough bias and variance, go back to step 3.
9. Use the test set once to estimate the generalization error.
10. If generalization error meets business requirements, you may have a useful model!
11. Train a new model using all the data.

There is no guarantee at the beginning of your work that you will craft a useful model!
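
A hedged sketch of steps 9–11, assuming the tuned GridSearchCV object from the earlier example and test data that received the same preparation as the training data:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Step 9: use the untouched test holdout once to estimate generalization error
test_accuracy = search.best_estimator_.score(X_test, y_test)
print("estimated generalization accuracy:", test_accuracy)

# Step 11: if that meets business requirements, retrain with the winning
# hyperparameters on ALL the data before putting the model to work
final_model = DecisionTreeClassifier(**search.best_params_, random_state=42)
final_model.fit(pd.concat([X_train, X_test]), pd.concat([y_train, y_test]))
```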
Model Tuning with Python
Loading the Training Data
Prepping the Training Data
Model Tuning
Bias and Variance
Prepping the Test Data
Model Testing
Wrap-Up
Continue Your Learning
Q&A
