Machine Learning and Pattern Recognition Week 2 Error Bars
It’s good practice to give some indication of uncertainty or expected variability in experimental results. You will need to report experimental results in your coursework. Many of you will also write up experimental results in dissertations this year, and you will want to know how seriously to take numbers that you measure in your future work.
We will discuss some different “standard deviations” that you might see reported, or want
to report, including “standard errors on the mean”.
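The estimators x̄ and σ̂² referred to below are defined earlier in the full note and do not appear in this extract. As a reminder of the standard definitions being assumed here, for independent samples x₁, …, x_N:
\[
\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n,
\qquad
\hat{\sigma}^2 = \frac{1}{N-1}\sum_{n=1}^{N}\left(x_n - \bar{x}\right)^2.
\]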
The (N − 1) in the estimator for the variance, rather than N (Bessel’s correction), is a small detail you don’t need to worry about for this course.1
The estimator x̄ is itself a random variable: if we gathered a second dataset and computed its mean in the same way, we would get a different x̄. For some datasets x̄ will be bigger than the underlying true mean µ; for others it will be smaller. The mean of x̄ is the correct answer µ. That is, x̄ is an unbiased estimator.
Using the rules of expectations and variances (see the note in the background section), we
can estimate the variance of x̄. We assume here that the observations are independent:
\begin{align}
\mathrm{var}[\bar{x}] &= \frac{1}{N^2}\sum_{n=1}^{N} \mathrm{var}[x_n] \tag{3}\\
&= \frac{1}{N^2}\,N\sigma^2 \;=\; \sigma^2/N \;\approx\; \hat{\sigma}^2/N. \tag{4}
\end{align}
A “typical” deviation from the mean is given by the standard deviation (not the variance).
So we write:
\[
\mu = \bar{x} \pm \hat{\sigma}/\sqrt{N}, \tag{5}
\]
to give an indication of how precisely we think we have measured the mean of the distribu-
tion with our N samples. Some papers might report ± two standard deviations.
1. The (N − 1) normalization makes the variance estimator unbiased and is what the Matlab/Octave var function does by default. NumPy’s np.var requires the option ddof=1 to get the unbiased estimator. However, if N is small enough that this difference matters, you need to be more careful about the statistics than we are in this note.
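To make equations (3)–(5) concrete, here is a minimal NumPy sketch. The data are synthetic draws from a distribution with a known mean, purely for illustration; the code computes x̄, the Bessel-corrected σ̂, and the standard error σ̂/√N:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
true_mu = 2.0                                     # known only because the data are synthetic
x = rng.normal(loc=true_mu, scale=3.0, size=N)    # N independent samples

x_bar = np.mean(x)                  # estimate of the mean, x̄
sigma_hat = np.std(x, ddof=1)       # sample std with Bessel's correction (ddof=1)
std_err = sigma_hat / np.sqrt(N)    # standard error on the mean, as in equation (5)

print(f"mean = {x_bar:.3f} +/- {std_err:.3f}   (true mean is {true_mu})")
```

Running this with different seeds shows x̄ wobbling around the true mean by roughly one standard error, which is exactly what the error bar is meant to convey.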
3 Reliability of a method
A standard error on the test set loss indicates how much the future performance of a particular fitted model might deviate from the performance we have estimated. It doesn’t tell us whether the machine learning method would work well in future if training were run again to create a new model.
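As a concrete sketch of this first kind of error bar: given the per-test-case losses of one fitted model (the array below contains made-up illustrative numbers), the standard error on the mean test loss is computed just as above:

```python
import numpy as np

# Hypothetical per-test-case losses for one fitted model (made-up numbers).
per_example_losses = np.array([0.31, 0.07, 0.55, 0.12, 0.48, 0.09, 0.33, 0.21])

mean_loss = per_example_losses.mean()
std_err = per_example_losses.std(ddof=1) / np.sqrt(per_example_losses.size)

# This error bar reflects the finite test set only: it quantifies uncertainty in
# this particular model's estimated generalization loss, not how a refit would behave.
print(f"test loss = {mean_loss:.3f} +/- {std_err:.3f}")
```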
Readers of a paper may also want to know how variable the performance of a model can be across different fits, that is, how robust the method is. The fitted models could vary for multiple reasons: we might gather new data; some machine learning methods depend on random choices; somewhat horrifyingly, even machine learning code that uses no random numbers and runs on the same data often gives different results!2 To summarize one of these effects, we could report the standard deviation of the models’ performances (not a standard error on the mean) to indicate how much a future fit will typically vary from the average performance when something is changed.
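One possible way to report this is sketched below, assuming a stand-in routine train_and_evaluate (not from the note) that fits the model with a given random seed and returns its test-set loss; the fake losses it returns here are for illustration only:

```python
import numpy as np

def train_and_evaluate(seed):
    """Stand-in for: fit the model using this random seed, return its test-set loss."""
    rng = np.random.default_rng(seed)
    return 0.30 + 0.02 * rng.standard_normal()   # fake loss, for illustration only

losses = np.array([train_and_evaluate(seed) for seed in range(10)])

# Standard deviation across fits (NOT a standard error on the mean):
# roughly how far a single future fit will typically land from the average fit.
print(f"loss over {losses.size} fits: "
      f"mean {losses.mean():.3f}, std {losses.std(ddof=1):.3f}")
```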
Important: Papers are sometimes not clear on what their “error bars” are reporting. Sometimes they show the standard deviation of results under different conditions; other times they show a standard error indicating uncertainty in the estimated generalization error due to a finite test set. Always be clear about precisely what standard deviation/error you are reporting and why.
Here the δ’s are the per-test-case differences in loss between two models A and B evaluated on the same test cases, with the sign chosen so that a positive δ means A did better on that case. If the mean of the δ’s is several standard errors greater than zero, we would report that A is the better model. (Non-examinable: you could perform a paired t-test if you wanted to turn this idea into a formal hypothesis test.)
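A minimal sketch of this comparison, assuming the two models have been evaluated on the same test cases so that their per-case losses line up (the loss arrays below are hypothetical, and the optional t-test uses SciPy if it is available):

```python
import numpy as np
from scipy import stats   # only needed for the optional paired t-test below

# Hypothetical per-test-case losses for models A and B on the same test cases.
loss_A = np.array([0.21, 0.35, 0.10, 0.44, 0.28, 0.19, 0.37, 0.25])
loss_B = np.array([0.29, 0.41, 0.12, 0.47, 0.36, 0.22, 0.35, 0.31])

deltas = loss_B - loss_A                         # positive delta: A did better on that case
mean_delta = deltas.mean()
std_err = deltas.std(ddof=1) / np.sqrt(deltas.size)
print(f"mean delta = {mean_delta:.3f} +/- {std_err:.3f}")

# Non-examinable: the corresponding formal paired t-test.
t_stat, p_value = stats.ttest_rel(loss_B, loss_A)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.3f}")
```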