
ML3 - Evaluation


Evaluation

REFERENCES
Book: Machine Learning with Python for Everyone (Chapter 5)
Training – Validating – Testing Steps
Book: Machine Learning Bookcamp (Chapter 1)
Bias Variance Tradeoff
Understanding the Bias-Variance Tradeoff | by Seema Singh | Towards Data Science
Estimating Future Performance
Book: Practical Machine Learning in R – Nwanganga & Chapple (Chapter 1 & 9)
Performance Metrics – Classification
https://medium.com/@hemaanushatangellamudi/performance-metrics-in-machine-learning-1ec2e48771b5
Performance Metrics – Regression
Background
• Two pitfalls when we learn
• Limited Capacity
• Distraction by Noise
• Equivalent pitfalls when machines learn
• Bias
• Variance
• To avoid “learning data by heart”
• Learn using Training Dataset
• Evaluate using Test Dataset
• Wouldn’t Learning using All Data lead to a better Model?
Three Step Process
• Two Step Process
• Train (setting parameters of the model – like m and b in y=mx+b)
• Test (check how well the model generalises)
• What about which model? What about hyper-parameters?
• k-NN or Naïve Bayes?
• k = 3 or 10 or 20?
• Three Step Process
• Train – Training set
• Select (1st evaluation) – Validation (test) set [ValS]
• Test / Assess (2nd evaluation) – Hold-out Test set [HOT]
• Train – Validate – Test Split?
• 50% - 25% - 25%
• What if learning curve takes longer to plateau?
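
Below is a minimal sketch of the 50% – 25% – 25% train–validate–test split described above, using scikit-learn's train_test_split. The synthetic X and y arrays and the random_state values are illustrative assumptions, not part of the original slides.

# Minimal sketch: 50% train / 25% validation / 25% hold-out test split
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(200, 4)   # illustrative synthetic features
y = np.random.rand(200)      # illustrative synthetic target

# First cut: 50% for training, 50% held back
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=42)

# Second cut: split the held-back half into validation (25%) and hold-out test (25%)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 100 50 50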
Underfitting and Overfitting
• Why might humans do badly in an exam (test)?
• Not bringing enough capacity while studying
• Focussing too much on irrelevant details (noise)
• Why may machine learning have errors?
• Underfitting
• Overfitting
• Example using Synthetic data and Linear / Polynomial Regression
• Error reduces going from line (degree 1) to parabola (degree 2) but worsens
going to nonic (degree 9)
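
A minimal sketch of the synthetic-data experiment described above, fitting degree-1, degree-2, and degree-9 polynomials with scikit-learn; the noisy quadratic target and the train/test split are illustrative assumptions.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = 1.5 * X[:, 0] ** 2 - 2 * X[:, 0] + rng.normal(0, 2, size=60)   # noisy parabola

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 2, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_err = mean_squared_error(y_tr, model.predict(X_tr))   # training error
    test_err = mean_squared_error(y_te, model.predict(X_te))    # test error
    print(degree, train_err, test_err)

Typically the test error drops from degree 1 to degree 2 and rises again at degree 9, matching the underfit / just-right / overfit pattern above.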
Underfitting and Overfitting
• Model 1 (Complexity 1) – Straight Line
• Not enough capacity to capture the complexity of target
• Too biased towards flatness
• Underfitting
• Model 3 (Complexity 9) – wiggly curve
• Has good capacity (captures the complexity of training data perfectly)
• Has memorised the noise (randomness in data) – does bad on test data
• Overfitting
• Model 2 (Complexity 2) – parabola
• Just enough capacity to capture the complexity (but not exactly due to noise)
• Lowest test error
• Just right!
Underfitting and Overfitting
• Underfitting
• A very simple model may not be able to learn the pattern in the training data.
• It also does poorly on the testing data.

• Overfitting
• A very complex model may learn the training data perfectly.
• However, it does poorly on the testing data because it also learned irrelevant
relationships in the training data.

• Just-right
• A medium-complexity model performs well on the training and testing data.
Understanding the Bias-Variance Tradeoff
• When we make a mistake (error) in prediction
• Predict an incorrect class (in classification problems)
• Predict a value with high MSE (in regression problems)
• We have no control over: Variance(Data)
• Actual randomness in relationship between input features & output target
• Example: Wide range of possible incomes earned by
• {college grad, economics, 5 years experience}
• The degree to which our data is affected by randomness, either in measurement or in real-world differences, is called the variance of the data
• Irreducible Error – measure of amount of noise in our data
• No matter how good we make our model, our data will have a certain amount of
noise or irreducible error that cannot be removed
Understanding the Bias-Variance Tradeoff
• We have some control over: Variance(Learner/Model | Training)
• The way models vary due to the random selection of the data we train on is called the variance of the model
• Example: Linear Regression parameters m & b
• Will differ depending on the randomly selected training data set
• We have most control over: Bias(Learner/Model)
• When we choose between two models, one may have a fundamentally better
resonance with the relationship between the inputs and outputs
• Example: line has great difficulty following the path of a parabola
• A model that cannot match the actual relationship between the inputs and
outputs—after we ignore the inherent noisiness in the data—has higher bias
• Model with high bias has difficulty capturing complicated patterns
• Model with low bias can follow more complicated patterns
Understanding the Bias-Variance Tradeoff
• Three components of error
• Inherent variability in our data
• Variability in creating our predicting model from the training data
• Bias of our model
• Bias-Variance Decomposition
• Error = Bias(Learner) + Variance(Learner | Training) + Variance(Data)
Bias-Variance using a bull's-eye diagram
Bias-Variance tradeoff

• If model is too simple and has very few parameters
• it may have high bias and low variance
• If model is complex and has large number of parameters
• It may have high variance and low bias
• We need to find the right/good balance
• Without overfitting and underfitting the data
• Tradeoff in complexity is why there is a tradeoff between bias and variance
• An algorithm can’t be more complex and less complex at the same time.
Bias-Variance tradeoff
Bias-Variance for k-NN
• 1-NN
• “If I’m a new example, find who is most like me and label me with its target”
• Potential to have a very jagged or wiggly border (highly variable!)
• Once we find the closest example, we ignore what everyone else says
• If there were 10 training examples, once we find our closest neighbor, nothing about the
other nine matters!
• 10-NN
• “I’m a new example. Find my ten closest neighbors and average their target. That’s my predicted target.” With only 10 training examples, this is equivalent to predicting the overall training mean.
• Our predictions have no border: they are all exactly the same (very biased!)
• Every new example is going to have the same 10 neighbors
• We predict the same value regardless of the input predictor values!
Bias-Variance for k-NN
• Increasing the number of neighbors increases our bias and decreases
our variance
• Decreasing the number of neighbors increases our variance and
decreases our bias
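
A minimal sketch contrasting k = 1 and k = 10 nearest-neighbor regressors on illustrative synthetic data (the noisy sine target and the split are assumptions): the 1-NN fit tracks the training points almost exactly (low bias, high variance), while the 10-NN fit is much smoother (higher bias, lower variance).

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=100)   # noisy sine target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

for k in (1, 10):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    print(k,
          mean_squared_error(y_tr, knn.predict(X_tr)),   # training error (near 0 for k = 1)
          mean_squared_error(y_te, knn.predict(X_te)))   # test error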
Bias-Variance for Linear Regression
• Variations of Linear Regression Model based on Features
• Constant linear: include no features, wi = 0 for all i ≠ 0 (only the intercept w0 remains)
• Predicting a flat horizontal line or surface
• Few: include a few features, most wi = 0
• Many: include many features, a few wi = 0
• Plain linear: include all features, no wi = 0
• Predicting a standard line or plane-like model that can incline and tilt
• Adding features decreases bias but increases variance
• Conversely, forcing the weights of features to zero increases bias and decreases variance
Bias-Variance for Linear Regression
• Complexity decreases as we choose fewer features
• Losing features (dimensions) restricts our view of the world and increases our
bias

• Complexity increases as we choose more features, as well as when we extend our features (add polynomial terms)
• Adding complex terms lets our model wiggle more and increases our variance
Bias-Variance tradeoff
Scenario: High bias & low variance
• Examples: more neighbors; zero or smaller linear regression coefficients; low-degree polynomial
• Good: resists noise
• Bad: misses pattern; forced to generalize
• Risk: underfit
Scenario: Low bias & high variance
• Examples: fewer neighbors; bigger linear regression coefficients; high-degree polynomial
• Good: follows complex patterns
• Bad: follows noise; memorises training data
• Risk: overfit
Estimating Future Performance
Problem with simple Train-Validate-Test Split
• When we don’t have a large amount of data to work with
• all or some of our data partitions may not be adequately representative of the
original dataset
• Example: model that predicts whether a bank customer will or will not default on their
loan
• Class distribution of observed data
• 95 percent Not Default
• 5 percent Default
• With a small enough dataset, it is possible that the random sampling approach used to generate the training, validation, and test partitions results in samples that do not evenly represent the class distribution of the original dataset
Problem even in Stratified Sampling
• Even if stratified sampling approach were used
• some of the partitions may also have too many or too few examples of the
easy or difficult-to-predict patterns that exist in the original dataset
• Scenario 1
• Hard training data & Easy test data
• Overestimating (how well we will do in real world)
• Scenario 2
• Easy training data & Hard test data
• Underestimating (how well we will do in real world)
Resampling (making more from less)
• Can we generate multiple estimates for evaluation? (use ‘crowd’
estimation instead of ‘single’ estimation)
• Need multiple datasets (or)
• Convert one dataset into many!
• Repeated holdout or Resampling
• Repeatedly using different samples of the original data to train and validate a
model
• Performance of the model across the different iterations is averaged to yield
an overall performance estimate for the model
Cross Validation
• k-Fold Cross Validation
• After test data has been held out
• Remaining data is divided into k completely separate random partitions of
approximately equal size (folds)
• Each fold in turn serves as the validation data for one of the k iterations of the repeated holdout, while the remaining k-1 folds are used for training
Cross Validation
• k-Fold Cross Validation
• higher value of k leads to less biased model (but large variance might lead to
overfit)
• lower value of k is similar to the simple train-test split approach
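
A minimal sketch of k-fold cross-validation with scikit-learn's cross_val_score, which handles the fold bookkeeping; the logistic-regression model and synthetic data are assumptions.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)   # illustrative data

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)    # 5-fold CV
print(scores)          # accuracy on each of the 5 validation folds
print(scores.mean())   # averaged overall performance estimate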
Cross Validation
• Leave One Out Cross Validation (LOOCV)
• essentially k-fold cross-validation with k set to n (the number of instances in
the dataset)
Cross Validation
• Leave One Out Cross Validation (LOOCV)
• Benefits
• Greatest amount of data is used each time we train the model (helps with the
accuracy of the model)
• Approach is deterministic (no randomness in LOOCV unlike random sampling to
create k-folds in k-Fold CV; we are training model on every possible combination
of observations)
• Drawbacks
• High Computational Cost (model is being trained n times – rather
expensive with complex models and large datasets)
• Validation set is not stratified (with single instance in validation set, it is
impossible to mimic the statistics of overall dataset)
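
A minimal sketch of LOOCV using scikit-learn's LeaveOneOut splitter; the model and data are illustrative, and note that this performs n separate fits.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=100, n_features=5, random_state=0)   # small illustrative dataset

loo = LeaveOneOut()   # k = n folds, one instance held out per iteration
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)
print(scores.mean())  # fraction of single held-out instances predicted correctly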
Cross Validation
• Random or Monte Carlo Cross Validation
• Similar to k-Fold CV but instead of creating a set number of folds (validation
sets) at the beginning, the random sample that makes up the validation set is
created during each iteration
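
A minimal sketch of random (Monte Carlo) cross-validation via scikit-learn's ShuffleSplit, which draws a fresh random validation sample on every iteration; the model, sizes, and data are assumptions.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 10 iterations, each with a freshly drawn 25% validation sample
mc_cv = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=mc_cv)
print(scores.mean())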
Bootstrap Sampling
• Basic idea is to create a training set from the original dataset using random
sampling with replacement approach
• 0.632 Bootstrap
• randomly sampling a dataset of n instances, n times with replacement, to create another dataset also with n instances
• new dataset is used for training, while the instances from the original data, which
were not selected as part of the training data, are used for validation
Bootstrap Sampling
• 0.632 Bootstrap
• Results in rather pessimistic performance estimates against the validation
data
• By using sampling with replacement, the probability that an instance will be
selected in the training set is statistically 63.2 %
• With training data that is only 63.2% of the available data, performance is worse than that of a model trained on 100% or 90% of the data
• To account for this, bootstrap method calculates final performance as a function
of the performance on both the training (resubstitution error) and validation
(misclassification error) datasets
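
A minimal sketch of one 0.632 bootstrap iteration using scikit-learn's resample helper. The decision-tree model and synthetic data are assumptions; the 0.632 / 0.368 weighting of validation and resubstitution performance follows the combination described above, and in practice the whole procedure is repeated and averaged.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
n = len(X)

# Draw n instances with replacement for training (the bootstrap sample)
train_idx = resample(np.arange(n), replace=True, n_samples=n, random_state=0)
oob_idx = np.setdiff1d(np.arange(n), train_idx)   # instances never selected: used for validation

model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])

acc_resub = accuracy_score(y[train_idx], model.predict(X[train_idx]))   # resubstitution accuracy
acc_oob = accuracy_score(y[oob_idx], model.predict(X[oob_idx]))         # validation (out-of-bag) accuracy

acc_632 = 0.632 * acc_oob + 0.368 * acc_resub   # 0.632 bootstrap estimate
print(acc_632)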
Bootstrap Sampling
• 0.632 Bootstrap Advantages over Cross Validation
• Faster and Simpler
• By using sampling with replacement, bootstrapping tends to be a better way
to estimate model performance for small datasets
• 0.632 Bootstrap Drawback
• Similar to random cross validation approach, some instances in the original
datasets may be used more than once for validation and training, and some
instances may never be used at all
• Model may never learn or be evaluated against some of the patterns in the
data
Performance Metrics for Classification
• Confusion Matrix

• AUC-ROC Curve

• LogLoss
Confusion Matrix
• Confusion matrix is an N x N matrix
• Used for evaluating the performance of a classification model,
• Where N is the number of target classes
• Compares the actual target values with those predicted by the machine
learning model
• Shows the errors in the model performance in the form of a matrix,
hence also known as an error matrix
Confusion Matrix
• For a Binary Classification Problem
• Confusion Matrix is 2 x 2 (N = 2)
Confusion Matrix
• True Positive (TP) :
• The predicted value matches the actual value
• The actual value was positive and the model predicted a positive value
• True Negative (TN) :
• The predicted value matches the actual value
• The actual value was negative and the model predicted a negative value
• False Positive (FP) :
• The predicted value does not match the actual value
• The actual value was negative but the model predicted a positive value
• Also known as the Type 1 error
• False Negative (FN) :
• The predicted value does not match the actual value
• The actual value was positive but the model predicted a negative value
• Also known as the Type 2 error
Confusion Matrix
• Rate
• A measure derived from the confusion matrix
• 4 types: TPR, FPR, TNR, FNR
• True Positive Rate
• Sensitivity / Recall
• True Negative Rate
• Specificity
• For better performance
• TPR, TNR should be high and
• FNR, FPR should be low
Calculations using Confusion Matrix
• Classification Accuracy
• How often the model predicts the correct output
• Ratio of the number of correct predictions made by the classifier to the total number of predictions made by the classifier
• Accuracy = (TP + TN) / (TP + TN + FP + FN)

• Misclassification Rate (Error Rate)
• How often the model gives wrong predictions
• Ratio of the number of incorrect predictions made by the classifier to the total number of predictions made by the classifier
• Error Rate = (FP + FN) / (TP + TN + FP + FN)
Calculations using Confusion Matrix
• Precision
• Out of all the instances the model predicted as positive, how many were actually positive
• Precision = TP / (TP + FP)

• Recall
• Out of all the instances that are actually positive, how many did the model predict correctly
• Recall = TP / (TP + FN)
• Recall must be as high as possible
Calculations using Confusion Matrix
• F1 Score
• Recall and precision are two separate scores
• The F1-Score aggregates them as the harmonic mean of precision and recall
• F1 = 2 × (Precision × Recall) / (Precision + Recall)
• Tells how precise your classifier is, as well as how robust it is
• Maximum when Precision is equal to Recall
• High precision but lower recall gives you extremely accurate predictions, but misses a large number of instances that are difficult to classify
• The greater the F1 Score, the better the performance of our model
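
A minimal sketch computing the confusion matrix and the metrics above with scikit-learn; the actual and predicted labels are illustrative.

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # illustrative actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # illustrative predicted labels

print(confusion_matrix(y_true, y_pred))   # rows: actual class, columns: predicted class
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall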
AUC – ROC Curve
• AUC (Area Under Curve)-ROC
(Receiver Operating Characteristic) is a
performance metric
• Based on varying threshold values
• For classification problems
• ROC is a probability curve
• AUC represents the degree or measure of
separability
• Tells how much the model is capable of
distinguishing between classes
• The higher the AUC, the better the model is at predicting
• 0 classes as 0 and
• 1 classes as 1
AUC – ROC Curve
• When AUC = 1
• Classifier is able to perfectly distinguish between all the Positive and the
Negative class points correctly
AUC – ROC Curve
• When AUC = 0
• Classifier would be predicting all Negatives as Positives, and all Positives as
Negatives
AUC – ROC Curve
• When 0.5<AUC<1
• High chance that the classifier will be able to distinguish the positive class
values from the negative class values
• Classifier is able to detect more numbers of True positives and True negatives
than False negatives and False positives
AUC – ROC Curve
• When AUC=0.5
• Classifier is not able to distinguish between Positive and Negative class points
• Either the classifier is predicting random class or constant class for all the data
points
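
A minimal sketch computing the ROC curve and AUC with scikit-learn; the synthetic data and logistic-regression model are assumptions.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_proba = model.predict_proba(X_te)[:, 1]         # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_te, y_proba)   # one (FPR, TPR) point per threshold
print(roc_auc_score(y_te, y_proba))               # area under the ROC curve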
Relationship between Sensitivity, Specificity,
FPR and Threshold
• Sensitivity and Specificity are inversely proportional to each other: when we increase Sensitivity, Specificity decreases, and vice versa
• (Sensitivity ↑, Specificity ↓) and (Sensitivity ↓, Specificity ↑)
• When we decrease the threshold, we get more positive predictions, which increases sensitivity and decreases specificity
• Similarly, when we increase the threshold, we get more negative predictions, which gives higher specificity and lower sensitivity
• Since FPR is 1 − Specificity, when we increase TPR, FPR also increases, and vice versa
• (TPR ↑, FPR ↑) and (TPR ↓, FPR ↓)
AUC-ROC for Multi Class Model
• We can plot N AUC-ROC curves for N classes using the One vs All methodology
• For example, if you have three classes named X, Y, and Z
• you will have one ROC for X classified against Y and Z
• another ROC for Y classified against X and Z
• and a third ROC for Z classified against Y and X
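
A minimal sketch of one-vs-rest AUC for a three-class problem with scikit-learn; the classes, model, and data are illustrative.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=600, n_features=10, n_classes=3,
                           n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_proba = model.predict_proba(X_te)   # one probability column per class

# One-vs-rest: each class is scored against the other two, then averaged
print(roc_auc_score(y_te, y_proba, multi_class='ovr'))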
LOG LOSS (Logarithmic Loss)
• Also called Logistic regression loss or cross-entropy loss
• Indicative of how close the prediction probability is to the
corresponding actual/true value (0 or 1 in case of binary classification)
• The more the predicted probability diverges from the actual value, the higher the log-loss value
• Quantifies the accuracy of a classifier by penalising false
classifications
• Minimising the Log Loss is basically equivalent to maximising the
accuracy of the classifier
• In order to calculate Log Loss the classifier must assign a probability
to each class rather than simply yielding the most likely class
LOG LOSS (Logarithmic Loss)
• Mathematically defined (for N instances and M classes, where yij is 1 if instance i belongs to class j and pij is the predicted probability) as
• Log Loss = −(1/N) Σi Σj yij log(pij)

• For Binary Classification
• Log Loss = −(1/N) Σi [ yi log(pi) + (1 − yi) log(1 − pi) ]
LOG LOSS (Logarithmic Loss)
• Log Loss contribution from a single positive instance where the predicted
probability ranges from 0 (completely wrong prediction) to 1 (correct prediction)
• It’s apparent from the gentle downward slope towards the right that the Log Loss
gradually declines as the predicted probability improves
• Moving in the opposite direction though, the Log Loss ramps up very rapidly as
the predicted probability approaches 0
• Log Loss heavily penalises classifiers that are confident about an incorrect
classification
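
A minimal sketch computing log loss from predicted probabilities with scikit-learn; the labels and probabilities are illustrative.

from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1, 0]                      # illustrative actual labels
y_proba = [0.9, 0.1, 0.8, 0.35, 0.05]         # predicted probability of class 1

print(log_loss(y_true, y_proba))              # lower is better

# A confident but wrong prediction is penalised heavily
print(log_loss([1], [0.01], labels=[0, 1]))   # roughly 4.6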
Performance Metrics for Regression
• R2 (Coefficient of Determination)

• MAE (Mean Absolute Error)

• RMSE (Root Mean Square Error)


Residuals
• Difference between actual and predicted values
• Can be thought of as ‘Distances’
• Closer the residual is to zero, the better our model performs in making its
predictions
R² Score (Coefficient of Determination)
• Statistical measure that tells us how well our model is making all its
predictions on a scale of zero to one
• Determines the accuracy of our model in terms of distance or residual
R² Score (Coefficient of Determination)
• Gives the accuracy of your model on a percentage scale, that is 0–100


• Python example to compute R2
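
Below is a minimal sketch using scikit-learn's r2_score; the actual and predicted values are illustrative.

from sklearn.metrics import r2_score

y_actual = [3.0, 5.0, 7.5, 9.0, 11.0]      # illustrative actual values
y_predicted = [2.8, 5.3, 7.0, 9.4, 10.5]   # illustrative model predictions

print(r2_score(y_actual, y_predicted))     # closer to 1 means a better fit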
MAE (Mean Absolute Error)
• Sum of all the distances/residuals (the differences between the actual
and predicted values) divided by the total number of points in the
dataset
• The average absolute distance between our model's predictions and the actual values
MAE (Mean Absolute Error)
• Gives how close the predictions are to the actual values on average
• Low MAE values indicate that the model is correctly predicting
• Larger MAE values indicate that the model is poor at prediction
• Python example to compute MAE
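
Below is a minimal sketch using scikit-learn's mean_absolute_error; the values are illustrative.

from sklearn.metrics import mean_absolute_error

y_actual = [3.0, 5.0, 7.5, 9.0, 11.0]
y_predicted = [2.8, 5.3, 7.0, 9.4, 10.5]

print(mean_absolute_error(y_actual, y_predicted))   # average absolute residual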
RMSE (Root Mean Square Error)
• Square root of the average squared distance / residual (difference
between actual and predicted value)
• Square root of all the squares of the distances divided by the total
number of points

• RMSE functions similarly to MAE (that is, you use it to determine how close the
prediction is to the actual value on average), but with a minor difference
• Used to determine whether there are any large errors or distances that could be
caused if the model
• overestimated the prediction (that is the model predicted values that were significantly higher
than the actual value) or
• underestimated the predictions (that is, predicted values significantly lower than the actual value)
RMSE (Root Mean Square Error)
• If you are concerned about large errors, RMSE is a
good metric to use
• Popular evaluation metric for regression problems
because it not only calculates how close the
prediction is to the actual value on average, but it
also indicates the effect of large errors
• Python example to compute RMSE
• Scikit-learn evaluation metric library has no RMSE
metric, but it does include the mean squared error
method
• We use the Numpy square root method to find the
square root of mean squared error
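
Below is a minimal sketch that takes the square root of scikit-learn's mean squared error, as described above; the values are illustrative.

import numpy as np
from sklearn.metrics import mean_squared_error

y_actual = [3.0, 5.0, 7.5, 9.0, 11.0]
y_predicted = [2.8, 5.3, 7.0, 9.4, 10.5]

mse = mean_squared_error(y_actual, y_predicted)   # average squared residual
print(np.sqrt(mse))                               # RMSE penalises large errors more than MAE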
