How to calculate confidence intervals for performance metrics in Machine Learning using an automatic bootstrap method
Are your model performance measurements very precise due to a “large” test set, or very uncertain due to a “small” or imbalanced test set?
Introduction
If you report your classifier’s performance as having Accuracy=94.8% and F1=92.3% on a test set, this doesn’t mean much without knowing something about the size and composition of the test set. The margin of error of those performance measurements depends heavily on the size of the test set, or, for an imbalanced dataset, primarily on how many independent instances of the minority class it contains (more copies of the same instances from oversampling don’t help for this purpose).
If you were able to collect another, independent test set of similar origin, the Accuracy and F1 of your model on that set would be unlikely to match the original values, but how different might they plausibly be? In statistics, this kind of question is answered by the confidence interval of the measurement.
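To make this concrete, here is a minimal sketch of one common way to estimate such an interval: a percentile bootstrap, which resamples the test set with replacement and recomputes the metric on each resample. The labels, predictions, and the 90%-accurate classifier below are hypothetical, for illustration only, and this is not necessarily the exact method the rest of the article develops.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical test set: 500 binary labels and predictions from a
# classifier that is right about 90% of the time (illustrative only).
y_true = rng.integers(0, 2, size=500)
y_pred = np.where(rng.random(500) < 0.9, y_true, 1 - y_true)

def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI: resample test instances with replacement,
    recompute the metric each time, and take the central quantiles."""
    n = len(y_true)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample with replacement
        stats[b] = metric(y_true[idx], y_pred[idx])
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

accuracy = lambda yt, yp: np.mean(yt == yp)
lo, hi = bootstrap_ci(y_true, y_pred, accuracy)
print(f"Accuracy = {accuracy(y_true, y_pred):.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

On a small or imbalanced test set, the interval produced this way will be noticeably wide, which is exactly the uncertainty that a bare point estimate like "Accuracy=94.8%" hides.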