
19-Performance Metrics


Module5_Performance_Metrics

References:
1. Ethem Alpaydin, "Introduction to Machine Learning", MIT Press / Prentice Hall of India.
Performance evaluation
• How predictive is the model we learned?
– For regression, usually R² or MSE
– For classification, many options
• Accuracy, Precision, and Recall can be used
– Performance on the training data (the data used to build the model) is not a good indicator of performance on future data
• Because new data will probably not be exactly the same as the training data! (A short sketch follows below.)
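To make the train-versus-future-data point concrete, here is a minimal Python sketch (not from the reference; it assumes scikit-learn is installed and uses a synthetic dataset) contrasting training accuracy with held-out test accuracy:

```python
# Minimal sketch: training accuracy usually overstates how the model will
# do on unseen data, so we hold out a test set and report metrics on it.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))  # typically ~1.0
print("test accuracy: ", accuracy_score(y_test, model.predict(X_test)))    # usually lower
```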
Measuring Classifier Performance
• For classification, a variety of measures have been proposed.
• There are four possible cases, as shown in table 19.1.
• A confusion matrix is a table that is often used to describe the
performance of a classification model (or "classifier") on a set of test data
• For a positive example,
– if the prediction is also positive, this is a true positive;
– if our prediction is negative for a positive example, this is a false negative.
• For a negative example,
– if the prediction is also negative, we have a true negative, and
– if we predict a negative example as positive, we have a false positive.
Measuring Classifier Performance
• p = TP + FN (total actual positives); p′ = TP + FP (total predicted positives)
• n = FP + TN (total actual negatives); n′ = TN + FN (total predicted negatives)
Confusion Matrix
• A confusion matrix is a table that is often used to describe the performance of a
classification model (or "classifier") on a set of test data
• TRUE POSITIVE (TP):
– A document which is actually SPAM (positive) and classified as SPAM (positive).
• TRUE NEGATIVE (TN):
– A document which is actually NONSPAM (negative) and classified as NONSPAM (negative).
• FALSE POSITIVE (FP):
– A document which is actually NONSPAM (negative) and classified as SPAM (positive).
• FALSE NEGATIVE (FN):
– A document which is actually SPAM (positive) and classified as NONSPAM (negative).

                     Predicted: Positive   Predicted: Negative
Actual: Positive     TP                    FN
Actual: Negative     FP                    TN
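A minimal Python sketch (an illustration, not from the reference) of how the four cells above can be counted from actual and predicted labels, treating SPAM as the positive class:

```python
# Count TP, FN, FP, TN for the SPAM / NONSPAM setting described above.
def confusion_counts(actual, predicted, positive="SPAM"):
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1                      # actual SPAM, predicted SPAM
        elif a == positive:
            fn += 1                      # actual SPAM, predicted NONSPAM
        elif p == positive:
            fp += 1                      # actual NONSPAM, predicted SPAM
        else:
            tn += 1                      # actual NONSPAM, predicted NONSPAM
    return tp, fn, fp, tn

actual    = ["SPAM", "SPAM", "NONSPAM", "NONSPAM", "SPAM"]
predicted = ["SPAM", "NONSPAM", "NONSPAM", "SPAM", "SPAM"]
print(confusion_counts(actual, predicted))  # (2, 1, 1, 1)
```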
Confusion Matrix
A confusion matrix is a table that is often used to describe the performance
of a classification model (or "classifier") on a set of test data
• True Positives (TP): These are cases in which we predicted positive (they
have the disease), and they do have the disease.
• True Negatives (TN): Predicted negative, and they don't have the disease.
• False Positives (FP): Predicted positive, but they don't actually have the
disease. (Also known as a "Type I error.")
• False Negatives (FN): Predicted negative, but they actually do have the
disease. (Also known as a "Type II error.")

                     Predicted: Positive   Predicted: Negative
Actual: Positive     TP                    FN
Actual: Negative     FP                    TN
Precision and Recall
• Accuracy = Number of correct predictions / Total number of predictions
• Precision measures how many of the instances the model predicted as positive were actually positive.
– Precision P = TP / (TP + FP)
• Recall measures how many of the actual positive instances the model correctly identified.
– Recall R = TP / (TP + FN)
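The three formulas above, written as plain Python functions (a small illustrative sketch, not from the reference):

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# Example with made-up counts: TP=8, FP=2, FN=4, TN=86
print(accuracy(8, 86, 2, 4))  # 0.94
print(precision(8, 2))        # 0.8
print(recall(8, 4))           # ~0.667
```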
Accuracy
• Overall, how often is the classifier correct?
– Number of correct predictions / Total number of predictions
– Accuracy = (TP + TN) / (TP + FP + FN + TN)

                      Actual: Positive    Actual: Negative
Predicted: Positive   1 (TP)              1 (FP)
Predicted: Negative   8 (FN)              90 (TN)

– Accuracy = (1 + 90) / (1 + 1 + 8 + 90) = 0.91
• 91 correct predictions out of 100 total examples
– Precision = 1/2 and Recall = 1/9
– Accuracy alone doesn't tell the full story when you're working with a class-imbalanced data set (see the sketch below).
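Reproducing the imbalanced example above in a few lines of Python (illustrative only): accuracy looks strong while precision and recall reveal the weakness.

```python
# From the table above: TP = 1, FP = 1, FN = 8, TN = 90.
tp, fp, fn, tn = 1, 1, 8, 90

accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)

print(accuracy)   # 0.91   -- looks good in isolation
print(precision)  # 0.5    -- half of the positive predictions are wrong
print(recall)     # ~0.111 -- only 1 of 9 actual positives was found
```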
F Measure (F1/Harmonic Mean)
• Consider Recall and Precision together, or combine them into a single metric.
• F-Measure:
– A measure that combines precision and recall is the harmonic mean of precision and recall:
– F1 = 2 * (Precision * Recall) / (Precision + Recall)
– F1 is the most commonly used metric.
F Measure (F1)
• One measure of performance that takes into account both recall and precision.
• Harmonic mean of recall and precision:
– F1 = 2 * P * R / (P + R)
• Why harmonic mean?
– The harmonic mean emphasizes the importance of small values, whereas the arithmetic mean is affected more by unusually large outliers.
– When data are extremely skewed (e.g., over 99% of documents are non-relevant), accuracy is not an appropriate measure.
– Compared to the arithmetic mean, both precision and recall need to be high for the harmonic mean to be high (see the sketch below).
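A small illustrative sketch (not from the reference) of why the harmonic mean is preferred: F1 stays low unless both precision and recall are high, while the arithmetic mean can be pulled up by a single large value.

```python
def f1(p, r):
    return 2 * p * r / (p + r)

print(f1(1.0, 0.01))      # ~0.020 -- dragged down by the very low recall
print((1.0 + 0.01) / 2)   # 0.505  -- arithmetic mean looks deceptively good
print(f1(0.9, 0.8))       # ~0.847 -- high only when both values are high
```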
Example
• Example of classifying 100 documents
(which includes 40 SPAM and the remaining
60 are NONSPAM) as SPAM and NONSPAM.
– Out of 40 SPAM documents 30 documents are
classified correctly and the remaining 10 SPAM
documents are classified as NONSPAM by an
algorithm.
– Out of the 60 NONSPAM documents, 55 are
classified as NONSPAM and the remaining 5 are
classified as SPAM.
– Here, TN = 55, FP = 5, FN = 10, TP = 30.
Example
• Here, TN = 55, FP = 5, FN = 10, TP = 30.
• Accuracy = (55 + 30) / (55 + 5 + 30 + 10) = 0.85, i.e., 85%.
• Precision = 30 / (30 + 5) = 0.857
• Recall = 30 / (30 + 10) = 0.75
• F1 Score = 2 * (0.857 * 0.75) / (0.857 + 0.75) = 0.799
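The same computation in Python (illustrative sketch), starting from the four counts above:

```python
# TN = 55, FP = 5, FN = 10, TP = 30 (the 100-document spam example above)
tp, fp, fn, tn = 30, 5, 10, 55

accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.85 0.857 0.75 0.8
```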
Example
• In a Spam detection model handling 10,000
emails,
– 600 spam emails were correctly flagged (TP = 600)
– 9,000 non-spam emails were accurately identified (TN = 9,000)
– 100 non-spam emails were incorrectly marked as spam (FP = 100)
– 300 spam emails were missed (FN = 300)
Example
• Accuracy: (TP + TN) / (TP + FP + FN + TN) = 96%
• Precision: TP / (TP + FP) = 86%
• Recall: TP / (TP + FN) = 67%
• F1 Score: 2 * (Precision * Recall) / (Precision + Recall) = 75%

Metric      Formula                               Value
Accuracy    (600 + 9000) / 10000                  96%
Precision   600 / (600 + 100)                     86%
Recall      600 / (600 + 300)                     67%
F1 Score    2 * (0.86 * 0.67) / (0.86 + 0.67)     75%
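The same numbers can be checked with scikit-learn's metric functions (a sketch that assumes scikit-learn and NumPy are installed; the label arrays are constructed to match the counts above):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 10,000 emails: TP = 600, FN = 300, FP = 100, TN = 9000 (1 = spam, 0 = non-spam)
y_true = np.array([1] * 900 + [0] * 9100)
y_pred = np.concatenate([
    np.ones(600), np.zeros(300),      # actual spam: 600 caught, 300 missed
    np.ones(100), np.zeros(9000),     # actual non-spam: 100 false alarms
]).astype(int)

print(accuracy_score(y_true, y_pred))   # 0.96
print(precision_score(y_true, y_pred))  # ~0.857
print(recall_score(y_true, y_pred))     # ~0.667
print(f1_score(y_true, y_pred))         # 0.75
```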
Example
• Exercise: find the precision and recall for the prediction results given in the confusion matrix on the slide.
Cost-Sensitive Learning
• Learning to minimize the expected cost of
misclassifications.
• Most classification learning algorithms
attempt to minimize the expected number of
misclassification errors.
• In many applications, different kinds of
classification errors have different costs, so we
need cost-sensitive methods.
Examples of Applications with Unequal
Misclassification Costs
• Medical Diagnosis:
– Cost of false positive error: Unnecessary treatment;
unnecessary worry
– Cost of false negative error: Postponed treatment or
failure to treat; death or injury
• Fraud Detection:
– False positive: resources wasted investigating non-fraud
– False negative: failure to detect fraud could be very
expensive
Cost Matrix
Model 1: Confusion matrix (rows = actual, columns = predicted)

             Predicted: P   Predicted: N
Actual: P    150            40 (FN)
Actual: N    60 (FP)        250

Model 2: Confusion matrix (rows = actual, columns = predicted)

             Predicted: P   Predicted: N
Actual: P    250            45
Actual: N    5              200

Cost matrix (rows = actual, columns = predicted)

             Predicted: P   Predicted: N
Actual: P    -1             100
Actual: N    1              0

• Model 1: Accuracy = 80%; Cost = 150 × (-1) + 40 × 100 + 60 × 1 = 3910
• Model 2: Accuracy = 90%; Cost = 250 × (-1) + 45 × 100 + 5 × 1 = 4255
• If we focus on accuracy, we choose Model 2 (and compromise on cost); if we focus on cost, we choose Model 1 (and compromise on accuracy). A sketch of this cost computation follows below.
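A minimal NumPy sketch (illustrative, assumes NumPy is installed) of the comparison above: the total cost is the sum of the element-wise product of each confusion matrix with the cost matrix.

```python
import numpy as np

# Rows = actual (P, N), columns = predicted (P, N)
cost   = np.array([[-1, 100],
                   [ 1,   0]])
model1 = np.array([[150, 40],
                   [ 60, 250]])
model2 = np.array([[250, 45],
                   [  5, 200]])

for name, cm in [("Model 1", model1), ("Model 2", model2)]:
    accuracy   = np.trace(cm) / cm.sum()      # correct predictions are on the diagonal
    total_cost = (cm * cost).sum()            # element-wise product with the cost matrix
    print(name, "accuracy:", accuracy, "cost:", total_cost)
# Model 1 accuracy: 0.8 cost: 3910
# Model 2 accuracy: 0.9 cost: 4255
```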
ROC Curves
• A receiver operating characteristic curve, i.e. ROC curve, is
a graphical plot that illustrates the diagnostic ability of a
binary classifier system as its discrimination threshold is
varied.
• The diagnostic performance of a test, or the accuracy of a test to discriminate diseased cases from normal cases, is evaluated using Receiver Operating Characteristic (ROC) curve analysis.
• A ROC Curve is a way to compare diagnostic tests.
• It is a plot of the true positive rate against the false positive
rate.
ROC Curves

• Ideal situation (AUC close to 1): the model has an ideal measure of separability and is perfectly able to distinguish between the positive class and the negative class.
• Worst situation (AUC approximately 0.5): the model has no discrimination capacity to distinguish between the positive class and the negative class; its predictions are essentially random.
Multiple ROC Curves
• Comparison of multiple classifiers is usually straightforward, especially when no curves cross each other.
• Curves close to the perfect ROC curve have a better performance level than those closer to the baseline (see the sketch below).
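A sketch (assumes scikit-learn and matplotlib are installed; the dataset and classifiers are arbitrary choices for illustration) that plots ROC curves for two classifiers on the same axes, together with the random baseline:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, clf in [("LogReg", LogisticRegression(max_iter=1000)),
                  ("Tree", DecisionTreeClassifier(max_depth=3, random_state=0))]:
    scores = clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]   # positive-class scores
    fpr, tpr, _ = roc_curve(y_te, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC={roc_auc_score(y_te, scores):.2f})")

plt.plot([0, 1], [0, 1], "k--", label="Random (AUC=0.5)")    # baseline
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```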
PR Curves Vs ROC Curves
• A ROC curve represents a relation between sensitivity
(Recall) and False Positive Rate (Not Precision).
– ROC curve plots True Positive Rate Vs. False Positive Rate;
whereas, PR curve plots Precision Vs. Recall.
• Conclusion:
– If the question is "how well can this classifier be expected to perform in general?", go with a ROC curve.
– If true negatives are not particularly valuable to the problem, or negative examples are abundant, then a PR curve is typically more appropriate.
• For example, if the class is highly imbalanced and positive samples are very rare, use a PR curve.
– A PR curve answers "how meaningful is a positive result from my classifier?" (see the sketch below).
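A sketch (assumes scikit-learn is installed; the synthetic dataset with roughly 1% positives is an arbitrary choice) showing how ROC AUC and PR AUC (average precision) can tell different stories on imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

# Highly imbalanced problem: roughly 1% positive samples.
X, y = make_classification(n_samples=20000, weights=[0.99, 0.01],
                           flip_y=0.01, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("ROC AUC:          ", roc_auc_score(y_te, scores))
print("PR AUC (avg prec):", average_precision_score(y_te, scores))  # often much lower
```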
