Notes On ANOVA For Comparing Multiple Algorithms
1. Introduction
When comparing the performance of multiple algorithms, we often train and test several
algorithms on multiple datasets to evaluate their error rates. Given L algorithms and K training
sets, we induce K classifiers for each algorithm and test them on K validation sets. This results in
L groups of K error rates each. The goal is to determine if there are statistically significant
differences in error rates among these algorithms.
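In practice, the L groups of K error rates are collected by running K-fold cross-validation for each candidate algorithm. The sketch below is a minimal illustration assuming scikit-learn; the dataset and the three classifiers are arbitrary stand-ins for "L algorithms," not something prescribed by these notes.

```python
# Sketch: obtain L groups of K validation error rates via K-fold CV.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)              # illustrative dataset
algorithms = {                                           # L = 3 illustrative algorithms
    "logreg": LogisticRegression(max_iter=5000),
    "tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)     # K = 5 folds

# error_rates[name] holds the K validation error rates of one algorithm
error_rates = {
    name: 1.0 - cross_val_score(clf, X, y, cv=cv)        # error = 1 - accuracy
    for name, clf in algorithms.items()
}
```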
2. ANOVA Framework
Objective:
Test whether there are significant differences in mean error rates across L algorithms.
Hypotheses:
Null Hypothesis (H0): All algorithms have the same mean error rate, i.e., μ1 = μ2 = ⋯ = μL.
Alternative Hypothesis (H1): At least one algorithm has a different mean error rate.
Data Assumptions:
Error rates Xij are normally distributed with mean μj and common variance σ².
Each error rate is approximately normal because it is the average of many Bernoulli (0/1) validation outcomes, i.e., a scaled binomial (a quick normality check is sketched below).
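The normality assumption can be spot-checked per algorithm, e.g., with a Shapiro-Wilk test. A small sketch, assuming SciPy and made-up error rates (with only K = 5 folds the check has limited power):

```python
# Sketch: spot-check normality of one algorithm's K error rates.
from scipy import stats

errors = [0.10, 0.12, 0.11, 0.13, 0.12]             # K = 5 validation error rates (illustrative)
stat, p = stats.shapiro(errors)
print(f"Shapiro-Wilk W = {stat:.3f}, p = {p:.3f}")  # large p: no evidence against normality
```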
3. ANOVA Procedure
a. Estimators of Variance:
1. Between-Group Variance (Estimator σ̂²b):
o Group Mean: mj = (1/K) ∑_{i=1}^{K} Xij
o Overall Mean: m = (1/L) ∑_{j=1}^{L} mj
o Between-Group Sum of Squares (SSB): SSB = K ∑_{j=1}^{L} (mj − m)²
o Estimator: σ̂²b = SSB / (L − 1)
2. Within-Group Variance (Estimator σ̂²w):
o Group Variance: S²j = (1/(K − 1)) ∑_{i=1}^{K} (Xij − mj)²
o Within-Group Sum of Squares (SSW): SSW = ∑_{j=1}^{L} ∑_{i=1}^{K} (Xij − mj)²
o Estimator: σ̂²w = SSW / (L·(K − 1)) (see the NumPy sketch below)
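A NumPy sketch of these two estimators, assuming the error rates are arranged in an L × K array (rows = algorithms, columns = folds):

```python
import numpy as np

def anova_estimators(X):
    """X: L x K array of error rates. Returns (between, within) variance estimates."""
    L, K = X.shape
    m_j = X.mean(axis=1)                        # group mean of each algorithm
    m = m_j.mean()                              # overall mean
    ssb = K * np.sum((m_j - m) ** 2)            # between-group sum of squares
    ssw = np.sum((X - m_j[:, None]) ** 2)       # within-group sum of squares
    sigma2_b = ssb / (L - 1)                    # between-group variance estimator
    sigma2_w = ssw / (L * (K - 1))              # within-group variance estimator
    return sigma2_b, sigma2_w
```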
b. F-Ratio Calculation:
F-Ratio: F0 = σ̂²b / σ̂²w
o The F-Ratio compares the variance between groups to the variance within groups.
c. Decision Rule:
If F0 is greater than the critical value F_{α, L−1, L(K−1)} from the F-distribution table, reject the null hypothesis (sketched below).
If F0 is not significant, fail to reject the null hypothesis.
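A sketch of the complete test, assuming SciPy is available for the critical value; the function name and array layout are illustrative:

```python
import numpy as np
from scipy import stats

def anova_f_test(X, alpha=0.05):
    """X: L x K array of error rates. Returns (F0, critical value, reject H0?)."""
    L, K = X.shape
    m_j = X.mean(axis=1)
    m = m_j.mean()
    sigma2_b = K * np.sum((m_j - m) ** 2) / (L - 1)             # between-group variance
    sigma2_w = np.sum((X - m_j[:, None]) ** 2) / (L * (K - 1))  # within-group variance
    f0 = sigma2_b / sigma2_w
    f_crit = stats.f.ppf(1 - alpha, L - 1, L * (K - 1))         # F_{alpha, L-1, L(K-1)}
    return f0, f_crit, f0 > f_crit                              # True -> reject H0
```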
4. ANOVA Table
Total Sum of Squares (SST): SST = ∑_{j=1}^{L} ∑_{i=1}^{K} (Xij − m)², and SST = SSB + SSW.
The results are conventionally summarized as:

Source         | Sum of Squares | df       | Mean Square            | F
Between groups | SSB            | L − 1    | σ̂²b = SSB / (L − 1)    | F0 = σ̂²b / σ̂²w
Within groups  | SSW            | L(K − 1) | σ̂²w = SSW / (L(K − 1)) |
Total          | SST            | LK − 1   |                        |

5. Post Hoc Tests
Purpose:
To identify which specific groups differ after finding a significant difference with ANOVA.
Common Tests:
Least Significant Difference (LSD) Test: t = (mi − mj) / √(2σ̂²w / K) (see the sketch below)
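A sketch of LSD-style pairwise comparisons, assuming SciPy for the t critical value and the same L × K array layout as above:

```python
import numpy as np
from itertools import combinations
from scipy import stats

def lsd_pairwise(X, alpha=0.05):
    """X: L x K array of error rates. Returns [(pair, t statistic, significant?), ...]."""
    L, K = X.shape
    m_j = X.mean(axis=1)
    sigma2_w = np.sum((X - m_j[:, None]) ** 2) / (L * (K - 1))
    se = np.sqrt(2.0 * sigma2_w / K)                   # standard error of a mean difference
    t_crit = stats.t.ppf(1 - alpha / 2, L * (K - 1))   # two-sided critical value
    return [((i, j), (m_j[i] - m_j[j]) / se, abs(m_j[i] - m_j[j]) / se > t_crit)
            for i, j in combinations(range(L), 2)]
```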
ANOVA is a powerful tool for comparing the performance of multiple algorithms. It assesses whether the observed differences in error rates are statistically significant by analyzing the variance within and between groups. A significant result indicates that at least one algorithm's mean error rate differs from the others, warranting further investigation through post hoc tests.
Let's go through a detailed example of ANOVA with calculations and post hoc testing. Suppose
we have three algorithms, and we want to compare their error rates using a 5-fold cross-
validation. Here’s the step-by-step process:
Example Dataset
Let's assume we have the following error rates (in percent) for three algorithms (A, B, and C) across 5 folds:

Fold | Algorithm A | Algorithm B | Algorithm C
1    | 10          | 15          | 20
2    | 12          | 14          | 19
3    | 11          | 16          | 21
4    | 13          | 17          | 22
5    | 12          | 15          | 20
Group Means:
Algorithm A: mA = (10 + 12 + 11 + 13 + 12) / 5 = 58 / 5 = 11.6
Algorithm B: mB = (15 + 14 + 16 + 17 + 15) / 5 = 77 / 5 = 15.4
Algorithm C: mC = (20 + 19 + 21 + 22 + 20) / 5 = 102 / 5 = 20.4
Overall Mean: m = (58 + 77 + 102) / 15 = 237 / 15 = 15.8
Group Variances:
Algorithm A: S²A = [(10 − 11.6)² + (12 − 11.6)² + (11 − 11.6)² + (13 − 11.6)² + (12 − 11.6)²] / (5 − 1) = 5.2 / 4 = 1.3
Algorithm B: S²B = [(15 − 15.4)² + (14 − 15.4)² + (16 − 15.4)² + (17 − 15.4)² + (15 − 15.4)²] / (5 − 1) = 5.2 / 4 = 1.3
Algorithm C: S²C = [(20 − 20.4)² + (19 − 20.4)² + (21 − 20.4)² + (22 − 20.4)² + (20 − 20.4)²] / (5 − 1) = 5.2 / 4 = 1.3
Sums of Squares:
SSB = K ∑_{j=1}^{L} (mj − m)² = 5 · [(11.6 − 15.8)² + (15.4 − 15.8)² + (20.4 − 15.8)²] = 5 · 38.96 = 194.8, so MSB = σ̂²b = 194.8 / (3 − 1) = 97.4
SSW = 5.2 + 5.2 + 5.2 = 15.6, so MSW = σ̂²w = 15.6 / (3 · (5 − 1)) = 1.3
F-Ratio:
F0 = MSB / MSW = 97.4 / 1.3 ≈ 74.9
Since F0 ≈ 74.9 far exceeds the critical value F_{0.05, 2, 12} ≈ 3.89, we reject H0: the algorithms do not all have the same mean error rate (a quick cross-check in code follows).
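As a quick cross-check of the arithmetic, the same F statistic can be obtained with SciPy's one-way ANOVA routine (a sketch; the inputs are the example error rates above):

```python
from scipy import stats

a = [10, 12, 11, 13, 12]   # Algorithm A error rates (%)
b = [15, 14, 16, 17, 15]   # Algorithm B error rates (%)
c = [20, 19, 21, 22, 20]   # Algorithm C error rates (%)

f0, p = stats.f_oneway(a, b, c)
print(f"F0 = {f0:.1f}, p = {p:.2e}")   # F0 ≈ 74.9, p far below 0.05
```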
Post Hoc Comparisons (LSD Test):
Standard Error: √(2σ̂²w / K) = √(2 · 1.3 / 5) = √0.52 ≈ 0.721, with L(K − 1) = 12 degrees of freedom and two-sided critical value t_{0.025, 12} ≈ 2.18.
o Algorithm A vs. B: t = (15.4 − 11.6) / 0.721 = 3.8 / 0.721 ≈ 5.27
o Algorithm A vs. C: t = (20.4 − 11.6) / 0.721 = 8.8 / 0.721 ≈ 12.20
o Algorithm B vs. C: t = (20.4 − 15.4) / 0.721 = 5.0 / 0.721 ≈ 6.93
Summary: All pairwise comparisons are significant, indicating that all algorithms have
significantly different error rates.
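The pairwise statistics can be reproduced the same way (a sketch using the group means and within-group variance computed above):

```python
import numpy as np
from scipy import stats

means = {"A": 11.6, "B": 15.4, "C": 20.4}
sigma2_w, K, df = 1.3, 5, 12                 # MSW, folds, L(K-1) degrees of freedom
se = np.sqrt(2 * sigma2_w / K)               # ≈ 0.72
t_crit = stats.t.ppf(0.975, df)              # ≈ 2.18 (two-sided, alpha = 0.05)

for x, y in [("A", "B"), ("A", "C"), ("B", "C")]:
    t = abs(means[x] - means[y]) / se
    print(f"{x} vs {y}: t = {t:.2f}, significant = {t > t_crit}")
```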
By following these steps, we have used ANOVA to determine that there are significant
differences in error rates among the algorithms and used post hoc tests to pinpoint where those
differences lie.