ML3 - Evaluation
REFERENCES
Book: Machine Learning with Python for Everyone (Chapter 5)
Training – Validating – Testing Steps
Book: Machine Learning Bookcamp (Chapter 1)
Bias Variance Tradeoff
Understanding the Bias-Variance Tradeoff | by Seema Singh | Towards Data Science
Estimating Future Performance
Book: Practical Machine Learning in R – Nwanganga & Chapple (Chapter 1 & 9)
Performance Metrics – Classification
https://medium.com/@hemaanushatangellamudi/performance-metrics-in-machine-learning-1ec2e48771b5
Performance Metrics – Regression
Background
• Two pitfalls when we learn
• Limited Capacity
• Distraction by Noise
• Equivalent pitfalls when machines learn
• Bias
• Variance
• To avoid “learning data by heart”
• Learn using Training Dataset
• Evaluate using Test Dataset
• Wouldn’t Learning using All Data lead to a better Model?
Three Step Process
• Two Step Process
• Train (setting parameters of the model – like m and b in y=mx+b)
• Test (check how well the model generalises)
• What about which model? What about hyper-parameters?
• k-NN or Naïve Bayes?
• k = 3 or 10 or 20?
• Three Step Process
• Train – Training set
• Select (1st evaluation) – Validation (test) set [ValS]
• Test / Assess (2nd evaluation) – Hold-out Test set [HOT]
• Train – Validate – Test Split?
• 50% – 25% – 25% (a split like the one sketched after this list)
• What if learning curve takes longer to plateau?
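As a rough illustration of the 50% – 25% – 25% split above, here is a minimal Python sketch that applies scikit-learn's train_test_split twice; the synthetic data and variable names are illustrative assumptions, not part of the original slides.

# Sketch of a 50/25/25 train-validate-test split by applying train_test_split twice
# (synthetic data; all names here are illustrative).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))        # hypothetical feature matrix
y = rng.integers(0, 2, size=1000)     # hypothetical binary target

# First split: 50% training, 50% held back
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=42)

# Second split: the held-back half is divided equally -> 25% validation, 25% hold-out test
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 500 250 250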
Underfitting and Overfitting
• Why may humans do badly in an exam (test)?
• Not bringing enough capacity while studying
• Focussing too much on irrelevant details (noise)
• Why may machine learning have errors?
• Underfitting
• Overfitting
• Example using synthetic data and Linear / Polynomial Regression (sketched in code below)
• Test error reduces going from a line (degree 1) to a parabola (degree 2) but worsens going to a degree-9 (nonic) polynomial
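A minimal sketch of the experiment described above, assuming a noisy parabola as the synthetic target; the degrees, noise level, and split used here are illustrative choices.

# Underfitting/overfitting on synthetic data: fit polynomials of degree 1, 2 and 9
# to a noisy parabola and compare their test error.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=60).reshape(-1, 1)
y = 2 * x.ravel() ** 2 - x.ravel() + rng.normal(scale=2.0, size=60)  # noisy parabola

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=0)

for degree in (1, 2, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    test_mse = mean_squared_error(y_test, model.predict(x_test))
    print(f"degree {degree}: test MSE = {test_mse:.2f}")
# Typically: degree 1 underfits, degree 2 is about right, degree 9 overfits.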
Underfitting and Overfitting
• Model 1 (Complexity 1) – Straight Line
• Not enough capacity to capture the complexity of target
• Too biased towards flatness
• Underfitting
• Model 3 (Complexity 9) – wiggly curve
• Has good capacity (captures the complexity of training data perfectly)
• Has memorised the noise (randomness in data) – does bad on test data
• Overfitting
• Model 2 (Complexity 2) – parabola
• Just enough capacity to capture the complexity (but not exactly due to noise)
• Lowest test error
• Just right!
Underfitting and Overfitting
• Underfitting
• A very simple model may not be able to learn the pattern in the training data.
• It also does poorly on the testing data.
• Overfitting
• A very complex model may learn the training data perfectly.
• However, it does poorly on the testing data because it also learned irrelevant
relationships in the training data.
• Just-right
• A medium-complexity model performs well on the training and testing data.
Understanding the Bias-Variance Tradeoff
• When we make a mistake (error) in prediction
• Predict an incorrect class (in classification problems)
• Predict a value with high MSE (in regression problems)
• We have no control over: Variance_Data
• Actual randomness in relationship between input features & output target
• Example: Wide range of possible incomes earned by
• {college grad, economics, 5 years experience}
• The degree to which our data is affected by randomness, either in measurement or in real-world differences, is called the variance of the data
• Irreducible Error – measure of amount of noise in our data
• No matter how good we make our model, our data will have a certain amount of
noise or irreducible error that cannot be removed
Understanding the Bias-Variance Tradeoff
• We have some control over: Variance_Learner/Model(Training)
• The way models vary due to the random selection of the data we train on is called the variance of the model
• Example: Linear Regression parameters m & b
• Will differ depending on the randomly selected training data set
• We have most control over: Bias_Learner/Model
• When we choose between two models, one may have a fundamentally better
resonance with the relationship between the inputs and outputs
• Example: line has great difficulty following the path of a parabola
• A model that cannot match the actual relationship between the inputs and
outputs—after we ignore the inherent noisiness in the data—has higher bias
• Model with high bias has difficulty capturing complicated patterns
• Model with low bias can follow more complicated patterns
Understanding the Bias-Variance Tradeoff
• Three components of error
• Inherent variability in our data
• Variability in creating our predicting model from the training data
• Bias of our model
• Bias-Variance Decomposition
• Error = Bias_Learner + Variance_Learner(Training) + Variance_Data (written out in full below)
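For squared-error loss, the decomposition above can be written out explicitly (a standard result stated here as a complement to the slide's informal sum; note that in this form the learner's bias enters squared):

\mathbb{E}\!\left[\big(y - \hat{f}(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}_{\text{Learner}}^{2}}
  + \underbrace{\mathbb{E}\!\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\right]}_{\text{Variance}_{\text{Learner(Training)}}}
  + \underbrace{\sigma^2}_{\text{Variance}_{\text{Data}}}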
Bias-Variance Using a Bulls-Eye Diagram
Bias-Variance Tradeoff
Performance Metrics – Classification
• AUC-ROC Curve
• LogLoss
Confusion Matrix
• A confusion matrix is an N × N matrix used for evaluating the performance of a classification model, where N is the number of target classes
• Compares the actual target values with those predicted by the machine learning model
• Shows the errors in model performance in the form of a matrix, hence also known as an error matrix
Confusion Matrix
• For a binary classification problem
• The confusion matrix is 2 × 2 (N = 2)
Confusion Matrix
• True Positive (TP) :
• The predicted value matches the actual value
• The actual value was positive and the model predicted a positive value
• True Negative (TN) :
• The predicted value matches the actual value
• The actual value was negative and the model predicted a negative value
• False Positive (FP) :
• The predicted value does not match the actual value
• The actual value was negative but the model predicted a positive value
• Also known as the Type 1 error
• False Negative (FN) :
• The predicted value does not match the actual value
• The actual value was positive but the model predicted a negative value
• Also known as the Type 2 error
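A minimal Python sketch, assuming scikit-learn's confusion_matrix; the actual and predicted labels below are hypothetical.

# Build a 2 x 2 confusion matrix and read off TN, FP, FN, TP
# (scikit-learn orders the flattened cells as tn, fp, fn, tp for binary labels 0/1).
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]   # hypothetical actual values
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # hypothetical model predictions

cm = confusion_matrix(y_actual, y_predicted)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")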
Confusion Matrix
• Rate
• A measure derived from the confusion matrix
• 4 types: TPR, FPR, TNR, FNR (defined below)
• True Positive Rate
• Sensitivity / Recall
• True Negative Rate
• Specificity
• For better performance
• TPR, TNR should be high and
• FNR, FPR should be low
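In terms of the confusion-matrix counts, the four rates are (standard definitions, written here in LaTeX):

\mathrm{TPR} = \frac{TP}{TP + FN} \qquad
\mathrm{TNR} = \frac{TN}{TN + FP} \qquad
\mathrm{FPR} = \frac{FP}{FP + TN} = 1 - \mathrm{TNR} \qquad
\mathrm{FNR} = \frac{FN}{FN + TP} = 1 - \mathrm{TPR}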
Calculations using Confusion Matrix
• Classification Accuracy
• How often the model predicts the correct output
• Ratio of the number of correct predictions made by the classifier to the total number of predictions made by the classifier
• Recall
• Out of all instances that are actually positive, how many are predicted correctly by the model
• Recall should be as high as possible
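The same confusion-matrix counts give the two metrics above, plus Precision, which the F1 slide that follows relies on (standard definitions):

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad
\text{Precision} = \frac{TP}{TP + FP} \qquad
\text{Recall} = \frac{TP}{TP + FN}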
Calculations using Confusion Matrix
• F1 Score
• Recall and Precision are two scores; their aggregate is the F1-Score
• Harmonic mean of Precision and Recall
• Tells how precise your classifier is, as well as how robust it is
• Maximum when Precision equals Recall
• High precision but lower recall gives you extremely accurate predictions, but misses a large number of instances that are difficult to classify
• The greater the F1 score, the better the performance of the model
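A minimal Python sketch computing these scores with scikit-learn on hypothetical labels (the same illustrative values as the confusion-matrix example above):

# Accuracy, Precision, Recall and F1 with scikit-learn (hypothetical labels).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print("accuracy :", accuracy_score(y_actual, y_predicted))
print("precision:", precision_score(y_actual, y_predicted))
print("recall   :", recall_score(y_actual, y_predicted))
print("f1       :", f1_score(y_actual, y_predicted))  # harmonic mean of precision and recall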
AUC – ROC Curve
• AUC (Area Under Curve) - ROC (Receiver Operating Characteristic) is a performance metric for classification problems, based on varying threshold values
• ROC is a probability curve
• AUC represents the degree or measure of separability
• Tells how much the model is capable of distinguishing between classes
• The higher the AUC, the better the model is at predicting class 0 as 0 and class 1 as 1
AUC – ROC Curve
• When AUC = 1
• The classifier is able to perfectly distinguish between all the Positive and Negative class points
AUC – ROC Curve
• When AUC = 0
• The classifier would be predicting all Negatives as Positives, and all Positives as Negatives
AUC – ROC Curve
• When 0.5 < AUC < 1
• There is a high chance that the classifier will be able to distinguish the positive class values from the negative class values
• The classifier detects more True Positives and True Negatives than False Negatives and False Positives
AUC – ROC Curve
• When AUC = 0.5
• The classifier is not able to distinguish between Positive and Negative class points
• The classifier is either predicting a random class or a constant class for all the data points
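A minimal Python sketch of how the ROC curve and AUC are obtained in practice, assuming a logistic-regression classifier on synthetic data; roc_curve sweeps the decision threshold over the predicted probabilities and roc_auc_score summarises the result.

# ROC curve and AUC from a classifier's predicted probabilities,
# obtained by varying the decision threshold (synthetic data, illustrative only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, proba)  # TPR vs FPR at each threshold
auc = roc_auc_score(y_test, proba)
print(f"AUC = {auc:.3f}")   # 0.5 = no discrimination, 1.0 = perfect separation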
Relationship between Sensitivity, Specificity, FPR and Threshold
• Sensitivity and Specificity are inversely proportional to each other: when we increase Sensitivity, Specificity decreases, and vice versa
• When we decrease the threshold, we get more positive predictions, which increases Sensitivity and decreases Specificity; when we increase the threshold, we get more negative predictions, which decreases Sensitivity and increases Specificity
• FPR = 1 − Specificity, so increasing Sensitivity (TPR) also increases FPR
R² Score (Coefficient of Determination)
• A statistical measure that tells us how well our model is making all its predictions, on a scale of zero to one
• Determines the accuracy of our model in terms of distance or residual
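A minimal Python sketch using scikit-learn's r2_score on hypothetical values:

# Coefficient of determination (R^2) with scikit-learn (hypothetical values).
from sklearn.metrics import r2_score

y_actual    = [3.0, -0.5, 2.0, 7.0]
y_predicted = [2.5,  0.0, 2.0, 8.0]

print("R^2:", r2_score(y_actual, y_predicted))  # 1.0 would be a perfect fit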
RMSE (Root Mean Square Error)
• RMSE functions similarly to MAE (that is, you use it to determine how close the prediction is to the actual value on average), but with a minor difference
• Used to determine whether there are any large errors or distances that could be caused if the model
• overestimated the prediction (that is, the model predicted values significantly higher than the actual value) or
• underestimated the prediction (that is, the model predicted values significantly lower than the actual value)
RMSE (Root Mean Square Error)
• If you are concerned about large errors, RMSE is a good metric to use
• Popular evaluation metric for regression problems because it not only calculates how close the prediction is to the actual value on average, but also indicates the effect of large errors
• Python example to compute RMSE (sketched below)
• The scikit-learn evaluation metric library has no dedicated RMSE metric, but it does include the mean squared error method
• We use the NumPy square root method to take the square root of the mean squared error
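A minimal Python sketch of the approach described above: mean_squared_error from scikit-learn combined with NumPy's square root (the values are hypothetical; recent scikit-learn releases also ship a dedicated RMSE function, but this approach works with any version).

# RMSE via scikit-learn's mean_squared_error and NumPy's sqrt (hypothetical values).
import numpy as np
from sklearn.metrics import mean_squared_error

y_actual    = [3.0, -0.5, 2.0, 7.0]
y_predicted = [2.5,  0.0, 2.0, 8.0]

mse = mean_squared_error(y_actual, y_predicted)
rmse = np.sqrt(mse)
print("RMSE:", rmse)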