4.0 Classification Methodologies
• Predictor Variables / Independent Variables / Control Variables
• Response Variable / Dependent Variable / Class Variable / Label Variable / Target Variable
What is Classification?
[Diagram: a training dataset is used to train classification models; the models are evaluated against a testing dataset, and the chosen model is then applied to a new dataset.]
All attributes are nominal.

Outlook    Temperature  Humidity  Windy  Play (Class)
overcast   hot          high      false  yes
overcast   cool         normal    true   yes
overcast   mild         high      true   yes
overcast   hot          normal    false  yes
rainy      mild         high      false  yes
rainy      cool         normal    false  yes
rainy      cool         normal    true   no
rainy      mild         normal    false  yes
rainy      mild         high      true   no
sunny      hot          high      false  no
sunny      hot          high      true   no
sunny      mild         high      false  no
sunny      cool         normal    false  yes
sunny      mild         normal    true   yes
• Bayes' Theorem:

$P(C \mid A) = P(A \mid C)\,\frac{P(C)}{P(A)}$

• Interpreted as:
– The probability of C happening given that A is true equals the probability of A happening given that C is true, times the ratio of the probability of C happening to the probability of A happening.
• Prediction Scenario:
– Given $A_{Outlook} = Rainy$ and $A_{Temp} = Hot$
– Will I play golf?
• Historical data was gathered as follows:

Outlook    Temp  Play (Class)
overcast   hot   yes
overcast   cool  yes
overcast   mild  yes
overcast   hot   yes
rainy      mild  yes
rainy      cool  yes
rainy      cool  no
rainy      mild  yes
rainy      mild  no
sunny      hot   no
sunny      hot   no
sunny      mild  no
sunny      cool  yes
sunny      mild  yes
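To make the scenario concrete, here is a minimal sketch in plain Python (function and variable names are illustrative, not from the slides) that scores both classes for Outlook = rainy, Temp = hot using the Naïve Bayes product of the class prior and the per-attribute conditional probabilities:

```python
# Historical data: (outlook, temp, play) triples from the table above.
data = [
    ("overcast", "hot", "yes"), ("overcast", "cool", "yes"),
    ("overcast", "mild", "yes"), ("overcast", "hot", "yes"),
    ("rainy", "mild", "yes"), ("rainy", "cool", "yes"),
    ("rainy", "cool", "no"), ("rainy", "mild", "yes"),
    ("rainy", "mild", "no"), ("sunny", "hot", "no"),
    ("sunny", "hot", "no"), ("sunny", "mild", "no"),
    ("sunny", "cool", "yes"), ("sunny", "mild", "yes"),
]

def naive_bayes_score(outlook, temp, cls):
    """P(cls) * P(outlook|cls) * P(temp|cls), proportional to P(cls|A)."""
    rows = [r for r in data if r[2] == cls]
    n_c = len(rows)
    p_prior = n_c / len(data)
    p_outlook = sum(1 for r in rows if r[0] == outlook) / n_c
    p_temp = sum(1 for r in rows if r[1] == temp) / n_c
    return p_prior * p_outlook * p_temp

# Scenario: Outlook = rainy, Temp = hot.
for cls in ("yes", "no"):
    print(cls, round(naive_bayes_score("rainy", "hot", cls), 4))
# yes: 9/14 * 3/9 * 2/9 = 0.0476; no: 5/14 * 2/5 * 2/5 = 0.0571 -> predict "no"
```

Since the "no" score is larger, the model predicts that golf will not be played.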
Naïve Bayes Classifier
$Original: P(A_i \mid C) = \frac{N_{ic}}{N_c}$

• Laplace Probability Estimation:

$Laplace: P(A_i \mid C) = \frac{N_{ic} + 1}{N_c + c}$

– where $c$ is the number of classes and $p$ is the prior probability
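A minimal sketch of both estimators, assuming (as the slide states) that $c$ is the number of classes; it shows why the Laplace correction matters when a count is zero:

```python
def conditional_prob(n_ic, n_c, c=0, laplace=False):
    """Estimate P(A_i | C). n_ic: records in class C with attribute value
    A_i; n_c: records in class C; c: number of classes (Laplace only)."""
    if laplace:
        return (n_ic + 1) / (n_c + c)
    return n_ic / n_c

# No "hot" days among 5 records of class "no", with c = 2 classes:
print(conditional_prob(0, 5))                     # 0.0, zero wipes out the product
print(conditional_prob(0, 5, c=2, laplace=True))  # 1/7 = 0.143, never exactly zero
```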
• New Data:
• Results:
[Figure: a decision tree splitting on Outlook = Overcast, then Wind = True, then Outlook = Rain, with Play = Yes / Play = No at the leaves.]
Decision Tree Generation: The ID3 Algorithm
$Gini(t) = 1 - \sum_j p(j \mid t)^2$

– where $p(j \mid t)$ is the relative frequency of class $j$ at node $t$
• Example: splitting on Refund (Yes branch: 3 records, No branch: 7 records)
[Figure: Refund? node with a pure Yes branch (Gini = 0) and a mixed No branch (Gini ≈ 0.48).]
• Gini Child: $Gini_{Split} = 0 \cdot \frac{3}{10} + 0.48 \cdot \frac{7}{10} = 0.34$
• Candidate splits for a continuous attribute (class counts on each side and the resulting Gini at each position):

No (≤, >):  0,7   1,6   2,5   3,4   3,4   3,4   3,4   4,3   5,2   6,1   7,0
Gini:      0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

• Parent node: $Gini(t) = 0.42$
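A minimal sketch of the two computations above in plain Python (function names are illustrative); it reproduces the parent Gini of 0.42 and the Refund split Gini of 0.34:

```python
def gini(counts):
    """Gini(t) = 1 - sum_j p(j|t)^2, from the class counts at node t."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted average of child Gini values for a candidate split."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

print(round(gini([7, 3]), 2))                  # parent node: 0.42
print(round(gini_split([[3, 0], [4, 3]]), 2))  # Refund split: 0.34
```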
[Figure: split on Marital Status: Married vs. Single, Divorced.]
Information Gain
[Figures: the tree grown in stages. First split: Marital Status (Married → Cheat = No; Single, Divorced → continue). Next split: Refund (Yes → Cheat = No; No → continue). Final split: Income (>= 77.5K vs. < 77.5K).]
Classification Rules
[Figure: Refund node; Yes branch → class NO, No branch → class YES. Extracted rule: (Refund = Yes) ==> No.]
• Rules are mutually exclusive and exhaustive
• The rule set contains as much information as the tree
• Exhaustive rules
– Classifier has exhaustive coverage if it accounts for every possible
combination of attribute values
– Each record is covered by at least one rule
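A minimal sketch of an ordered rule set applied to records like the ones in the table that follows; the rules here are illustrative stand-ins in the spirit of the vertebrate example, not the slides' actual rule set, and the default rule is what makes the set exhaustive:

```python
# Each rule is (condition, label); rules are tried in order and the
# first matching rule fires.
rules = [
    (lambda r: r["give_birth"] == "no" and r["can_fly"] == "yes", "bird"),
    (lambda r: r["give_birth"] == "no" and r["live_in_water"] == "yes", "fish"),
    (lambda r: r["give_birth"] == "yes" and r["blood_type"] == "warm", "mammal"),
    (lambda r: True, "reptile"),  # default rule: guarantees exhaustive coverage
]

def classify(record):
    for condition, label in rules:
        if condition(record):
            return label

lemur = {"blood_type": "warm", "give_birth": "yes",
         "can_fly": "no", "live_in_water": "no"}
print(classify(lemur))  # mammal
```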
Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
lemur          warm        yes         no       no             ?
turtle         cold        no          no       sometimes      ?
dogfish shark  cold        yes         no       yes            ?
• Rule: If (Status = Single) then Cheat = No
• Covered record: Tid = 10, Refund = No, Status = Single, Income = 90K, Cheat = Yes
• Basic idea:
– If it walks like a duck, quacks like a duck, then it’s probably a duck
• Compute the distance from the test record to the training records
[Figure: a test record plotted among training records; its k nearest neighbors determine its class.]
• Euclidean distance: $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
• Determine the class from nearest neighbor list
– take the majority vote of class labels among the k-nearest
neighbors
– Weigh the vote according to distance
– Weight factor: $w = \frac{1}{d^2}$
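A minimal distance-weighted k-NN sketch in plain Python (the toy training set is an illustration); it uses the Euclidean distance and the $w = 1/d^2$ weighting from above:

```python
import math
from collections import defaultdict

def knn_predict(train, query, k=3):
    """Distance-weighted k-NN vote. train: list of (point, label) pairs;
    query: a feature tuple. Each neighbor votes with weight 1 / d^2."""
    dist = lambda p, q: math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))
    neighbors = sorted(train, key=lambda t: dist(t[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = dist(point, query)
        votes[label] += 1 / (d ** 2 + 1e-9)  # epsilon guards against d = 0
    return max(votes, key=votes.get)

train = [((1, 2), "yes"), ((2, 3), "yes"), ((6, 5), "no"), ((7, 8), "no")]
print(knn_predict(train, (2, 2), k=3))  # "yes": the two close positives outvote
```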
[Figure: the perceptron as a black box: input nodes $x_1, x_2, x_3$ (plus bias input $x_0$) feed, through weights, a single output node $y$.]

$\hat{y} = sign(\beta^T x)$
• Graphically: [Figure: Default and Did Not Default points plotted against $x_1$ (BUSAGE) and $x_2$ (DAYSDELQ).]
• Algebraically:

$x = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, y = 1; \quad x = \begin{pmatrix} -1 \\ 2 \end{pmatrix}, y = -1; \quad x = \begin{pmatrix} 0 \\ -1 \end{pmatrix}, y = -1$
Simple Perceptron Induction Example
• Graphically: [Figures: the decision boundary $x_1 - 0.8x_2 = 0$ plotted on the $x_1$ (BUSAGE) vs. $x_2$ (DAYSDELQ) chart, separating Default from Did Not Default; the following slides repeat the plot as the example proceeds.]
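A minimal perceptron-training sketch on the three points from the algebraic example above (each point is appended with a bias input $x_0 = 1$); the learning rate and zero initialization are assumptions for illustration, not values from the slides:

```python
import numpy as np

# Training points (x1, x2, bias) and labels from the example above.
X = np.array([[1, 2, 1], [-1, 2, 1], [0, -1, 1]], dtype=float)
y = np.array([1, -1, -1])

beta = np.zeros(3)
eta = 0.1  # learning rate (assumed)
for epoch in range(100):
    errors = 0
    for xi, yi in zip(X, y):
        y_hat = 1 if beta @ xi >= 0 else -1  # y_hat = sign(beta^T x)
        if y_hat != yi:
            beta += eta * yi * xi  # nudge the weights toward the mistake
            errors += 1
    if errors == 0:  # converged: every point is classified correctly
        break

print(beta)  # one separating hyperplane; x1 - 0.8*x2 = 0 is another
```

The perceptron rule converges on this data because the three points are linearly separable; different initializations yield different but equally valid boundaries.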
Perceptron Induction Usage
• Training an ANN means learning the weights of the neurons
[Figure: a multilayer network with an input layer ($x_1$ through $x_5$), a hidden layer, and an output layer producing $y$. Each neuron $i$ combines its inputs $I_1, I_2, I_3$ with weights $w_{i1}, w_{i2}, w_{i3}$ into $S_i$, then applies an activation function $g(S_i)$ with threshold $t$ to produce its output $O_i$.]
Choice of Hidden Layers
http://www.r-bloggers.com/using-neural-networks-for-credit-scoring-a-simple-example/
Business Scenario: Credit Scoring
• Find a linear hyperplane (decision boundary) that will separate the data
Support Vector Machines
[Figure: candidate decision boundaries $B_1$ and $B_2$; the margin of each is bounded by the hyperplanes $b_{11}, b_{12}$ and $b_{21}, b_{22}$.]

• Decision boundary: $\beta^T x + b = 0$
• Margin hyperplanes: $\beta^T x + b = -1$ and $\beta^T x + b = 1$
• $Margin = \frac{2}{\|\beta\|^2}$

$f(x) = \begin{cases} 1 & \text{if } \beta^T x + b \ge 1 \\ -1 & \text{if } \beta^T x + b \le -1 \end{cases}$
Support Vector Machines
• We want to maximize: $Margin = \frac{2}{\|\beta\|^2}$
– This is equivalent to minimizing: $Z(\beta) = \frac{\|\beta\|^2}{2}$
– But subject to the following constraint:

$y_i(\beta^T x_i + b) \ge 1, \; \forall \, (y_i, x_i), \; i = 1, 2, \ldots, n$

• If $y_i = 1$, the prediction must satisfy $\beta^T x_i + b \ge 1$; if $y_i = -1$, then $\beta^T x_i + b \le -1$
• Soft margin: slack variables $\xi_i$ allow some records to violate the margin
• Nonlinear SVM: transform the data via $\Phi: x \to \varphi(x)$
– Kernel: $k(x_1, x_2) = (1 + x_1 x_2^T)^2$
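A minimal soft-margin SVM sketch using scikit-learn with the polynomial kernel above; the toy data and the penalty value C are assumptions for illustration. In scikit-learn's parameterization, degree=2, coef0=1, gamma=1.0 gives exactly $(1 + x_1 x_2^T)^2$:

```python
from sklearn.svm import SVC

X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]]
y = [1, 1, 1, -1, -1, -1]

# C controls the penalty on the slack variables in the soft-margin objective.
model = SVC(kernel="poly", degree=2, coef0=1, gamma=1.0, C=1.0)
model.fit(X, y)
print(model.support_vectors_)   # the points that define the margin
print(model.predict([[2, 2]]))  # [1]
```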
http://www.r-bloggers.com/using-neural-networks-for-credit-scoring-a-simple-example/
Business Scenario: Credit Scoring
• Step 1: Create multiple data sets $D_1, D_2, \ldots, D_{t-1}, D_t$
• Step 2: Build multiple classifiers $C_1, C_2, \ldots, C_{t-1}, C_t$
• Step 3: Combine the classifiers into $C^*$
Generation of Datasets: Bagging
• Ensemble Model: three base classifiers combined by majority vote
– Round 1: If $x \le 0.35 \to y = 1$, else $y = -1$
– Round 2: If $x \le 0.75 \to y = -1$, else $y = 1$
– Round 3: $y = 1$ for all $x$

x         0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
Round 1    1    1    1   -1   -1   -1   -1   -1   -1   -1
Round 2   -1   -1   -1   -1   -1   -1   -1    1    1    1
Round 3    1    1    1    1    1    1    1    1    1    1
Majority   1    1    1   -1   -1   -1   -1    1    1    1
True       1    1    1   -1   -1   -1   -1    1    1    1
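A minimal sketch reproducing the majority vote above: the three stumps from the bagging rounds are evaluated on the ten points and their votes are summed:

```python
import numpy as np

x = np.arange(0.1, 1.01, 0.1)
round1 = np.where(x <= 0.35, 1, -1)  # if x <= 0.35 -> y = 1, else y = -1
round2 = np.where(x <= 0.75, -1, 1)  # if x <= 0.75 -> y = -1, else y = 1
round3 = np.ones_like(round1)        # y = 1 for all x

votes = round1 + round2 + round3
majority = np.where(votes > 0, 1, -1)
print(majority)  # [ 1  1  1 -1 -1 -1 -1  1  1  1], matching the true labels
```

No single stump classifies every point correctly, but the ensemble vote does, which is exactly the point of the example.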
• Boosting adaptively resamples the data; the record indices drawn in each round were:

Original Data       1   2   3   4   5   6   7   8   9   10
Boosting (Round 1)  7   3   2   8   7   9   4   10  6   3
Boosting (Round 2)  5   4   9   4   2   5   1   7   4   2
Boosting (Round 3)  4   4   8   10  4   5   4   6   3   4
Business Scenario: Bank Data
• Sample record: 32, Private, 270335, Bachelors, 13, Married-civ-spouse, Adm-clerical, Other-relative, White, Male, 0, 0, 40, Philippines, class = ?
Business Scenario: Income Data
Handling Missing Data: Needs Complete Data / Needs Complete Data / Needs Complete Data / Can run with incomplete data / Depends on the algorithm
Interpretability: High / Blackbox / Blackbox / Blackbox / Blackbox
[Diagram: training data is used to generate the classification model.]
• Error:

$error = 1 \text{ if } \hat{y} \ne y$

– If the predicted value of the classifier model is not equal to the actual value, then we define that as an error.
• We would like to minimize error
• Given a binary classification model, how do we count the
error predictions?
                      PREDICTED CLASS
                      Class=Yes   Class=No
ACTUAL   Class=Yes    a (TP)      b (FN)
CLASS    Class=No     c (FP)      d (TN)

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
[Diagram: all data (600 rows) generate the classification model; the same 600 rows measure performance; the model is then used to predict new rows.]
[Diagram: all data (600 rows) are split into train data (400 rows), which generate the classification model, and test data (200 rows), which measure performance; the model is then used to predict new rows.]
• Holdout
– Reserve x% for training and 100-x% for testing
• Cross validation
– Estimates the performance of the Model generated using all data.
– Partition data into k disjoint subsets
– k-fold: train on 𝑘 − 1 partitions, test on the remaining one
– If $k = 10$, 11 decision tree models will be created: one for each fold, and one using all the data.
[Figure: k-fold cross-validation: each fold serves once as the testing data while the remaining folds form the training data.]
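A minimal cross-validation sketch with scikit-learn; the synthetic 600-row dataset stands in for the course data, and the decision tree matches the slide's example of 11 models at $k = 10$:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# k = 10: ten models, each trained on 9 folds and tested on the tenth.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean(), scores.std())  # estimated performance of the model

# The final (11th) model is then fit on all of the data.
final_model = DecisionTreeClassifier(random_state=0).fit(X, y)
```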
• Underfit
– Simple Model
– Lots of Errors
– Stable Prediction
• Overfit
– Complex Model
– No Errors
– Unstable Prediction
– Predicts Even Noise
• Just Right
– Small Errors
– Stable Predictions
[Figure: training vs. test error as model complexity increases, illustrating overfitting.]
Underfitting: when the model is too simple, both training and test errors are large
Notes on Overfitting
• Post-pruning
– Grow model to its entirety
– Trim the complexity of the model
– If generalization error improves after trimming, replace the
complex model with the less complex model
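A minimal post-pruning sketch using scikit-learn's cost-complexity pruning; this is one concrete way to "grow then trim" a tree, not necessarily the slides' exact procedure, and the data is a synthetic stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

# Grow the tree to its entirety, then trim with increasing ccp_alpha and
# keep a pruned tree only if its generalization (test) accuracy improves.
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
best, best_score = full, full.score(X_te, y_te)
for alpha in full.cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_tr, y_tr)
    score = pruned.score(X_te, y_te)
    if score >= best_score:
        best, best_score = pruned, score

print(best.get_n_leaves(), best_score)  # the simpler tree that generalizes best
```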
[Slides: successive models compared on complexity, training error, and testing error.]
[Slide: the confusion matrix (PREDICTED CLASS) with a cost matrix applied, and the resulting cost-weighted totals.]
$Precision = \frac{TP}{TP + FP}$

$Recall = \frac{TP}{TP + FN}$

$F\text{-}Measure = \frac{2\,TP}{2\,TP + FN + FP}$
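A minimal sketch computing these metrics from confusion-matrix counts, applied to the three breast-cancer models discussed below (returning 0 for an undefined precision, as the slides do for Model 1):

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F-measure from the confusion matrix."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0  # nothing predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = 2 * tp / (2 * tp + fn + fp) if tp else 0.0
    return accuracy, precision, recall, f_measure

print(metrics(tp=0, fn=85, fp=0, tn=201))    # Model 1: 70.3%, 0, 0, 0
print(metrics(tp=85, fn=0, fp=201, tn=0))    # Model 2: 29.7%, 29.7%, 100%, 45.8%
print(metrics(tp=10, fn=75, fp=13, tn=188))  # Model 3: 69.2%, 43%, 12%, 18.5%
```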
Model 1
• Predicts that all women will not have breast cancer again
• $Accuracy = \frac{201}{286} = 70.3\%$
• $Precision = \frac{0}{0 + 0} = 0$
• $Recall = \frac{0}{0 + 85} = 0$
• $F\text{-}Measure = \frac{2 \cdot 0}{2 \cdot 0 + 0 + 85} = 0$
• 85 women will incorrectly think their breast cancer is not going to reoccur, but it will reoccur
Model 2
• $Accuracy = \frac{85}{286} = 29.7\%$
• $Precision = \frac{85}{85 + 201} = 29.7\%$
• $Recall = \frac{85}{0 + 85} = 100\%$
• $F\text{-}Measure = \frac{2 \cdot 85}{2 \cdot 85 + 201 + 0} = 45.8\%$
• 201 women will incorrectly think their breast cancer is going to reoccur, but it will not reoccur
Model 3
• $Accuracy = \frac{198}{286} = 69.2\%$
• $Precision = \frac{10}{10 + 13} = 43\%$
• $Recall = \frac{10}{10 + 75} = 12\%$
• $F\text{-}Measure = \frac{2 \cdot 10}{2 \cdot 10 + 13 + 75} = 18.5\%$
Model Summaries
• (TP,FP):
• (0,0): declare everything
to be negative class
• (1,1): declare everything
to be positive class
• (1,0): ideal
• Diagonal line:
– Random guessing
– Below diagonal line:
• prediction is opposite of the true class
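A minimal ROC sketch with scikit-learn: a classifier's scores are swept over thresholds to trace the (FPR, TPR) points described above; the labels and scores here are illustrative:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, scores)
print(list(zip(fpr, tpr)))            # curve runs from (0,0) up to (1,1)
print(roc_auc_score(y_true, scores))  # area under the curve; 0.5 = random guessing
```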
• Predicting Churn
– Must have high true positive rate = very low false negative rate
• Must be able to capture all Churn Customers
– Ok to have high false positive rate
• Predicting that you will Churn but actually will Not Churn
– Choose Model 2
• Predicting Flu
– Must have low false positive rate
• Predicting You Have Flu but actually have no Flu
– Must have an acceptable true positive rate
• Must be able to detect Flu at least 60% of the time
– Choose Model 1
http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
Algorithm Cheat Sheet
http://www.datasciencecentral.com/profiles/blogs/key-tools-of-big-data-for-transformation-review-case-study
This Session’s Outline