How to calculate confidence intervals for performance metrics in Machine Learning using an automatic bootstrap method
Are your model performance measurements very precise due to a “large” test set, or very uncertain due to a “small” or imbalanced test set?
Introduction
If you report your classifier’s performance as having Accuracy=94.8% and F1=92.3% on a test set, this doesn’t mean much without knowing something about the size and composition of the test set. The margin of error of those performance measurements depends heavily on the size of the test set, or, for an imbalanced dataset, primarily on how many independent instances of the minority class it contains (more copies of the same instances from oversampling don’t help for this purpose).
If you were able to collect another, independent test set of similar origin, the Accuracy and F1 of your model on that set would be unlikely to match the original values, but how different might they plausibly be? In statistics, this kind of question is answered by the confidence interval of the measurement.
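To make this concrete, here is a minimal sketch of one common way to estimate such an interval: a percentile bootstrap, which resamples the test set with replacement and recomputes the metric on each resample. The labels, predictions, and the 90%-accurate classifier below are hypothetical, for illustration only, and this is not necessarily the exact method the rest of the article develops.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical test set: 500 binary labels and predictions from a
# classifier that is right about 90% of the time (illustrative only).
y_true = rng.integers(0, 2, size=500)
y_pred = np.where(rng.random(500) < 0.9, y_true, 1 - y_true)

def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI: resample test instances with replacement,
    recompute the metric each time, and take the central quantiles."""
    n = len(y_true)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample with replacement
        stats[b] = metric(y_true[idx], y_pred[idx])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

accuracy = lambda yt, yp: np.mean(yt == yp)
lo, hi = bootstrap_ci(y_true, y_pred, accuracy)
print(f"Accuracy = {accuracy(y_true, y_pred):.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

On a small or imbalanced test set, the interval produced this way will be noticeably wide, which is exactly the uncertainty that a bare point estimate like "Accuracy=94.8%" hides.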