Performance Measurement
Usman Khan
Confusion Matrix (1)
• A confusion matrix is a table that is often used
to describe the performance of a classification
model (or “classifier”) on a set of test data for
which the true values are known
• It allows the visualization of the performance of
an algorithm
Confusion Matrix (2)
• It allows easy identification of confusion
between classes e.g. one class is commonly
mislabeled as the other.
• Most performance measures are computed
from the confusion matrix.
Confusion Matrix (3)
• A confusion matrix is a summary of prediction
results on a classification problem
• The numbers of correct and incorrect
predictions are summarized with count values
and broken down by each class. This is the key
to the confusion matrix
Confusion Matrix (4)
• The confusion matrix shows the ways in which
your classification model is confused when it
makes predictions
• It gives us insight not only into the errors being
made by a classifier but, more importantly, into
the types of errors that are being made
Confusion Matrix (5)
Confusion Matrix (6)
• Here,
Class 1 : Positive
Class 2 : Negative
Definition of the Terms:
• Positive (P) : Observation is positive (for example: is an apple).
• Negative (N) : Observation is not positive (for example: is not an apple).
Confusion Matrix (7)
• True Positive (TP) : Observation is positive, and is predicted to be positive.
• False Negative (FN) : Observation is positive, but is predicted negative.
• True Negative (TN) : Observation is negative, and is predicted to be negative.
• False Positive (FP) : Observation is negative, but is predicted positive.
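As a quick illustration of these four counts, here is a minimal Python sketch that tallies them from two hypothetical label lists (the data below is made up for illustration only):

```python
# Hypothetical true and predicted labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # positive, predicted positive
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positive, predicted negative
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # negative, predicted negative
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negative, predicted positive

print(tp, fn, tn, fp)  # 3 1 3 1
```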
Confusion Matrix (8)
• The total number of test samples is 165
Classification Rate/Accuracy
• Classification Rate or Accuracy is given by the
relation:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
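As a minimal sketch of this relation in Python (the counts below are hypothetical, not taken from the slide above):

```python
# Hypothetical confusion-matrix counts, chosen only for illustration
tp, tn, fp, fn = 100, 50, 10, 5

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct predictions / all predictions
print(round(accuracy, 3))  # 0.909
```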
Confusion Matrix (9)
Sensitivity and Specificity
• Sensitivity and specificity values can be used
to quantify the performance of a case
definition or the results of a diagnostic test.
• Even with a highly specific diagnostic test, if a
disease is uncommon among those people
tested, a large proportion of positive test
results will be false positive, and the positive
predictive value will be low.
Sensitivity and Specificity
• If the test is applied more selectively such that
the proportion of people tested who truly have
disease is greater, the test's predictive value will
be improved
• Thus, sensitivity and specificity are characteristics
of the test, whereas predictive values depend
both on test sensitivity and specificity and on the
disease prevalence in the population in which
the test is applied
Sensitivity and Specificity
• Sensitivity/Recall
• Sensitivity (Se) is defined as the proportion of
truly positive individuals (e.g. those who actually
have the disease) who receive a positive test result:
Se = TP / (TP + FN)
Sensitivity and Specificity
• Specificity
• Specificity (Sp) is defined as the proportion of
truly negative individuals (e.g. those who do not
have the disease) who receive a negative test result:
Sp = TN / (TN + FP)
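A small sketch of both formulas in Python, using hypothetical confusion-matrix counts:

```python
# Hypothetical counts for illustration
tp, fn, tn, fp = 100, 5, 50, 10

sensitivity = tp / (tp + fn)  # share of truly positive cases that test positive (recall)
specificity = tn / (tn + fp)  # share of truly negative cases that test negative

print(round(sensitivity, 3), round(specificity, 3))  # 0.952 0.833
```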
Precision
• To get the value of precision, we divide the total
number of correctly classified positive examples by
the total number of predicted positive examples:
Precision = TP / (TP + FP)
High precision indicates that an example labeled as
positive is indeed positive (a small number of FP).
precision
Precision is the fraction of true positive examples
among the examples that the model classified as
positive. In other words, the number of true
positives divided by the number of false positives
plus true positives.
recall
Recall, also known as sensitivity, is the fraction of
the truly positive examples that the model classified
as positive. In other words, the number of true
positives divided by the number of true positives
plus false negatives.
TP
The number of true positives classified by the
model.
FN
The number of false negatives classified by the
model.
FP
The number of false positives classified by the
model.
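If scikit-learn is available, precision and recall can be computed directly from labels. The snippet below is a sketch on made-up data and simply reproduces the formulas TP / (TP + FP) and TP / (TP + FN):

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels for illustration (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

print(precision_score(y_true, y_pred))  # 0.75 -> TP / (TP + FP) = 3 / 4
print(recall_score(y_true, y_pred))     # 0.75 -> TP / (TP + FN) = 3 / 4
```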
F1 Score
• The F-score, also called the F1-score, is a measure of a model’s accuracy
on a dataset. It is used to evaluate binary classification systems,
which classify examples into ‘positive’ or ‘negative’.
• The F-score is a way of combining the precision and recall of the model,
and it is defined as the harmonic mean of the model’s precision and recall
(the harmonic mean is the reciprocal of the arithmetic mean of the reciprocals):
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Calculating F-score
• Let us imagine we have a tree with ten apples on it. Seven are ripe and
three are still unripe, but we do not know which ones are which. We have
an AI which is trained to recognize which apples are ripe for picking, and
to pick all the ripe apples and no unripe apples. We would like to calculate
the F-score, and we consider both precision and recall to be equally
important, so we use the F1-score.
The AI picks five ripe apples but also picks one unripe apple.
Confusion Matrix for Model 1
Ripe Unripe
Picked 5 1
Unpicked 2 2
Precision and Recall for model 1
• Precision = 0.83
• Recall = 0.71
• F1 Score = 0.77
Confusion Matrix for Model 2
Ripe Unripe
Picked 4 1
Unpicked 2 3
Precision and Recall for Model 2
• Precision = 0.8
• Recall = 0.666
• F1 Score = 0.72
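Both sets of figures can be checked with a short Python sketch that applies the formulas above to the confusion-matrix counts exactly as given in the two tables (ripe is treated as the positive class):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
    return round(precision, 2), round(recall, 2), round(f1, 2)

# Model 1: picked 5 ripe (TP) and 1 unripe (FP); left 2 ripe unpicked (FN)
print(precision_recall_f1(tp=5, fp=1, fn=2))  # (0.83, 0.71, 0.77)

# Model 2: picked 4 ripe (TP) and 1 unripe (FP); left 2 ripe unpicked (FN)
print(precision_recall_f1(tp=4, fp=1, fn=2))  # (0.8, 0.67, 0.73) -- the slide rounds the F1 to 0.72
```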
Conclusion
• High recall, low precision:
This means that most of the positive examples
are correctly recognized (low FN) but there are a
lot of false positives.
• Low recall, high precision:
This shows that we miss a lot of positive
examples (high FN) but those we predict as
positive are indeed positive (low FP)
F-score vs Accuracy
• There are a number of metrics which can be used to evaluate a binary
classification model, and accuracy is one of the simplest to understand.
Accuracy is defined as simply the number of correctly categorized
examples divided by the total number of examples. Accuracy can be useful
but does not take into account the subtleties of class imbalances, or
differing costs of false negatives and false positives.
• The F1-score is useful:
where there are either differing costs of false positives or false negatives,
• or where there is a large class imbalance, such as if 10% of apples on
trees tend to be unripe. In this case the accuracy would be misleading,
since a classifier that classifies all apples as ripe would automatically get
90% accuracy but would be useless for real-life applications.
• The accuracy has the advantage that it is very easily interpretable, but the
disadvantage that it is not robust when the data is unevenly distributed, or
where there is a higher cost associated with a particular type of error.
Mean Absolute Error or MAE
• We know that an error basically is the absolute difference
between the actual or true values and the values that are
predicted. Absolute difference means that if the result has a
negative sign, it is ignored.
• Hence, the error for a single sample is |true value − predicted value|.
• MAE takes the average of this error over every sample in the
dataset:
MAE = (1/N) × Σ |true valueᵢ − predicted valueᵢ|
Mean Squared Error or MSE
• MSE is calculated by taking the average of the
square of the difference between the original
and predicted values of the data.
• Hence, MSE = (1/N) × Σ (true valueᵢ − predicted valueᵢ)²
Root Mean Squared Error or RMSE
• RMSE is simply the square root of the MSE:
RMSE = √MSE = √( (1/N) × Σ (true valueᵢ − predicted valueᵢ)² )
R Squared
• R² (the coefficient of determination) measures the proportion of the
variance in the true values that the predictions explain:
R² = 1 − Σ (true valueᵢ − predicted valueᵢ)² / Σ (true valueᵢ − mean of true values)²
Where to use which Metric to determine the Performance of a
Machine Learning Model?
• MAE: It is not very sensitive to outliers in comparison to MSE since it
doesn't punish huge errors. It is usually used when the performance is
measured on continuous variable data. It gives a linear value, which
averages the weighted individual differences equally. The lower the value,
the better the model's performance.
• MSE: It is one of the most commonly used metrics, but it is least useful when a
single bad prediction would ruin the entire model's predicting abilities, i.e.
when the dataset contains a lot of noise. It is most useful when the
dataset contains outliers or unexpected values (too-high or too-low
values).
• RMSE: In RMSE, the errors are squared before they are averaged. This
basically implies that RMSE assigns a higher weight to larger errors. This
indicates that RMSE is much more useful when large errors are present
and they drastically affect the model's performance. It avoids taking the
absolute value of the error, and this trait is useful in many mathematical
calculations. In this metric too, the lower the value, the better the
performance of the model.
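A minimal sketch of MAE, MSE, RMSE and R² on made-up numbers, using only the Python standard library:

```python
import math

# Hypothetical actual and predicted values for illustration
actual    = [3.0, 5.0, 2.5, 7.0, 4.5]
predicted = [2.5, 5.0, 4.0, 8.0, 4.0]

errors = [a - p for a, p in zip(actual, predicted)]
n = len(errors)

mae  = sum(abs(e) for e in errors) / n   # mean absolute error
mse  = sum(e ** 2 for e in errors) / n   # mean squared error
rmse = math.sqrt(mse)                    # square root of the MSE

mean_actual = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)                   # residual sum of squares
ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
r2 = 1 - ss_res / ss_tot                               # coefficient of determination

print(mae, mse, round(rmse, 3), round(r2, 3))  # 0.7 0.75 0.866 0.705
```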
Cross Validation
Usman Khan
Cross Validation (1)
• A common practice in machine learning is to not use the entire data
set when training a learner.
• Some of the data is removed before training
begins.
• Then, when training is done, the data that was
removed can be used to test the performance of
the learned model on "new" data.
• This is the basic idea for a whole class of model
evaluation methods called cross validation
Cross Validation (2)
• A method of estimating the expected prediction
error
• Helps select the best-fit model
• Helps ensure the model is not overfit
Cross Validation (3)
1) Holdout method
2) K-Fold CV
3) Leave one out CV
4) Bootstrap methods
Holdout method
• The holdout cross validation method is the
simplest of all.
• In this method, you randomly assign data
points to two sets, usually called the training set
and the test set. The size of each set is arbitrary,
although typically the test set is smaller than the
training set.
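A minimal holdout sketch, assuming scikit-learn is available; the data set, model and 70/30 split below are hypothetical choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data set
X, y = make_classification(n_samples=200, random_state=0)

# Randomly hold out 30% of the points as a test set; the split ratio is a design choice
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out "new" data
```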
K-FOLD
• K-fold cross validation is one way to improve
over the holdout method. The data set is
divided into k subsets and the holdout
method is repeated k times
• Each time, one of the k subsets is used as the
test set and the other k-1 subsets are put
together to form a training set
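A k-fold sketch under the same assumptions (scikit-learn, hypothetical data); with k = 5, each point is used once for testing and four times for training:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)  # hypothetical data

# 5 folds: each fold serves once as the test set, the other k-1 folds form the training set
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores, scores.mean())  # one score per fold, plus their average
```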
K-FOLD
• Disadvantages
• ???
• Stratified K-Fold
Leave one out CV (1)
• Leave-one-out cross validation is K-fold cross
validation taken to its logical extreme, with K
equal to N, the number of data points in the set
• That means that, N separate times, the function
approximator is trained on all the data except for
one point and a prediction is made for that point
• As before, the average error is computed and
used to evaluate the model.
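A leave-one-out sketch under the same assumptions (scikit-learn, hypothetical data); here K equals N, so the model is refit N times:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=50, random_state=0)  # small hypothetical data set (N = 50)

# Each point is held out once and predicted by a model trained on the remaining N-1 points
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(len(scores), scores.mean())  # 50 per-point 0/1 scores and their average (the LOO accuracy)
```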
Leave one out CV (2)
• Specific case of K-fold validation
Leave one out CV (3)
• Disadvantages
• ???
Bootstrap (1)
• Randomly draw datasets, with replacement, from
the training sample
• Each bootstrap sample is the same size as the training sample
• Refit the model with the bootstrap samples
• Examine the model
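A bootstrap sketch under the same assumptions (scikit-learn, hypothetical data): each resample is drawn with replacement at the size of the original sample, and the model is refit on it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X, y = make_classification(n_samples=200, random_state=0)  # hypothetical data

rng = np.random.RandomState(0)
scores = []
for _ in range(100):
    # Draw a bootstrap sample: same size as the training sample, drawn with replacement
    X_boot, y_boot = resample(X, y, replace=True, n_samples=len(X), random_state=rng)
    model = LogisticRegression(max_iter=1000).fit(X_boot, y_boot)  # refit on the bootstrap sample
    scores.append(model.score(X, y))  # examine the refit model (here: accuracy on the original data)

print(np.mean(scores))  # average performance across the 100 bootstrap refits
```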
Bootstrap (2)