Evaluating Machine Learning Models
A Beginner's Guide to Key Concepts and Pitfalls
Alice Zheng
September 2015: First Edition
978-1-491-93246-9
Table of Contents

Preface
Orientation
    The Machine Learning Workflow
    Evaluation Metrics
    Hyperparameter Search
    Online Testing Mechanisms
Evaluation Metrics
    Classification Metrics
    Ranking Metrics
    Regression Metrics
    Caution: The Difference Between Training Metrics and Evaluation Metrics
    Caution: Skewed Datasets - Imbalanced Classes, Outliers, and Rare Data
    Related Reading
    Software Packages
Offline Evaluation Mechanisms: Hold-Out Validation, Cross-Validation, and Bootstrapping
Hyperparameter Tuning
The Pitfalls of A/B Testing
Orientation
Cross-validation, RMSE, and grid search walk into a bar. The bartender looks up and says, "Who the heck are you?"

That was my attempt at a joke. If you've spent any time trying to decipher machine learning jargon, then maybe that made you chuckle. Machine learning as a field is full of technical terms, making it difficult for beginners to get started. One might see things like "deep learning," "the kernel trick," "regularization," "overfitting," "semi-supervised learning," "cross-validation," etc. But what in the world do they mean?
One of the core tasks in building a machine learning model is to evaluate its performance. It's fundamental, and it's also really hard. My mentors in machine learning research taught me to ask these questions at the outset of any project: "How can I measure success for this project?" and "How would I know when I've succeeded?" These questions allow me to set my goals realistically, so that I know when to stop. Sometimes they prevent me from working on ill-formulated projects where good measurement is vague or infeasible. It's important to think about evaluation up front.

So how would one measure the success of a machine learning model? How would we know when to stop and call it good? To answer these questions, let's take a tour of the landscape of machine learning model evaluation.
The Machine Learning Workflow
A machine learning model goes through multiple stages of development, so there are multiple places where one needs to evaluate it. Roughly speaking, the first phase involves prototyping, where we try out different models to find the best one (model selection). Once we are satisfied with a prototype model, we deploy it into production, where it will go through further testing on live data.¹ Figure 1-1 illustrates this workflow.

¹ For the sake of simplicity, we focus on batch training and deployment in this report.
Evaluation Metrics
Chapter 2 focuses on evaluation metrics. Different machine learning tasks have different performance metrics. If I build a classifier to detect spam emails versus normal emails, then I can use classification performance metrics such as average accuracy, log-loss, and area under the curve (AUC). If I'm trying to predict a numeric score, such as Apple's daily stock price, then I might consider the root-mean-square error (RMSE). If I am ranking items by relevance to a query, then there are ranking metrics such as precision-recall and NDCG.
Hyperparameter Search
You may have heard of terms like "hyperparameter search," "autotuning" (which is just a shorter way of saying hyperparameter search), or "grid search" (a possible method for hyperparameter search). Where do those terms fit in? To understand hyperparameter search, we have to talk about the difference between a model parameter and a hyperparameter. In brief, model parameters are the knobs that the training algorithm knows how to tweak; they are learned from data. Hyperparameters, on the other hand, are not learned by the training method, but they also need to be tuned.
Evaluation Metrics
Evaluation metrics are tied to machine learning tasks. There are different metrics for the tasks of classification, regression, ranking, clustering, topic modeling, etc. Some metrics, such as precision-recall, are useful for multiple tasks. Classification, regression, and ranking are examples of supervised learning, which constitutes a majority of machine learning applications. We'll focus on metrics for supervised learning models in this report.
Classification Metrics
Classification is about predicting class labels given input data. In binary classification, there are two possible output classes. In multiclass classification, there are more than two possible classes. I'll focus on binary classification here, but all of the metrics can be extended to the multiclass scenario.
An example of binary classification is spam detection, where the input data could include the email text and metadata (sender, sending time), and the output label is either "spam" or "not spam." (See Figure 2-1.) Sometimes, people use generic names for the two classes: "positive" and "negative," or "class 1" and "class 0."

There are many ways of measuring classification performance. Accuracy, confusion matrix, log-loss, and AUC are some of the most popular metrics. Precision-recall is also widely used; I'll explain it in "Ranking Metrics."
Accuracy
Accuracy simply measures how often the classifier makes the correct prediction. It's the ratio between the number of correct predictions and the total number of predictions (the number of data points in the test set):
$$\text{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of data points}}$$
Confusion Matrix
Accuracy looks easy enough. However, it makes no distinction between classes; correct answers for class 0 and class 1 are treated equally. Sometimes this is not enough. One might want to look at how many examples failed for class 0 versus class 1, because the cost of misclassification might differ for the two classes, or one might have a lot more test data of one class than the other. For example, a medical diagnosis that a patient has cancer when he doesn't (known as a false positive) has very different consequences than the call that a patient doesn't have cancer when he does (a false negative). A confusion matrix (or confusion table) shows a more detailed breakdown of correct and incorrect classifications for each class. The rows of the matrix correspond to ground truth labels, and the columns represent the prediction.
Suppose the test dataset contains 100 examples in the positive class
and 200 examples in the negative class; then, the confusion table
might look something like this:
                      Predicted as positive   Predicted as negative
Labeled as positive   80                      20
Labeled as negative   5                       195
Looking at the matrix, one can clearly tell that the positive class has
lower accuracy (80/(20 + 80) = 80%) than the negative class (195/
(5 + 195) = 97.5%). This information is lost if one only looks at the
overall accuracy, which in this case would be (80 + 195)/(100 + 200)
= 91.7%.
Per-Class Accuracy
A variation of accuracy is the average per-class accuracy: the average of the accuracy for each class. Accuracy is an example of what's known as a micro-average, and average per-class accuracy is a macro-average. In the above example, the average per-class accuracy would be (80% + 97.5%)/2 = 88.75%. Note that in this case, the average per-class accuracy is quite different from the overall accuracy.
In general, when there are different numbers of examples per class,
the average per-class accuracy will be different from the accuracy.
(Exercise for the curious reader: Try proving this mathematically!)
Why is this important? When the classes are imbalanced, i.e., there
are a lot more examples of one class than the other, then the accuracy will give a very distorted picture, because the class with more
examples will dominate the statistic. In that case, you should look at
the per-class accuracy, both the average and the individual per-class
accuracy numbers.
Per-class accuracy is not without its own caveats. For instance, if
there are very few examples of one class, then test statistics for that
class will have a large variance, which means that its accuracy estimate is not as reliable as that of the other classes. Taking the average of all the
classes obscures the confidence measurement of individual classes.
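To make the arithmetic concrete, here is a minimal sketch (not from the report) that reproduces the example above with scikit-learn, one of the packages listed later under "Software Packages":

```python
# Minimal sketch: confusion matrix, overall accuracy, and average per-class accuracy.
# The label arrays reconstruct the 80/20/5/195 example table above.
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = np.array([1] * 100 + [0] * 200)                       # 100 positives, 200 negatives
y_pred = np.array([1] * 80 + [0] * 20 + [1] * 5 + [0] * 195)   # classifier predictions

cm = confusion_matrix(y_true, y_pred, labels=[1, 0])   # rows = true labels, columns = predictions
print(cm)                                              # [[ 80  20]
                                                       #  [  5 195]]

print("overall accuracy:", accuracy_score(y_true, y_pred))      # 0.9167

per_class = cm.diagonal() / cm.sum(axis=1)             # accuracy within each class
print("per-class accuracy:", per_class)                # [0.8, 0.975]
print("average per-class accuracy:", per_class.mean()) # 0.8875
```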
Log-Loss
Log-loss, or logarithmic loss, gets into the finer details of a classifier. In particular, if the raw output of the classifier is a numeric probability instead of a class label of 0 or 1, then log-loss can be used. The probability can be understood as a gauge of confidence. If the true label is 0 but the classifier thinks it belongs to class 1 with probability 0.51, then even though the classifier would be making a mistake, it's a near miss because the probability is very close to the decision boundary of 0.5. Log-loss is a "soft" measurement of accuracy that incorporates this idea of probabilistic confidence.
Mathematically, log-loss for a binary classifier looks like this:

$$\text{log-loss} = -\frac{1}{N}\sum_{i=1}^{N} \left[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\right]$$

Here, $y_i$ is the true label of the $i$th data point (0 or 1), and $p_i$ is the classifier's predicted probability that the $i$th data point belongs to class 1.
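As a quick illustration, scikit-learn computes this quantity directly; the labels and probabilities below are invented:

```python
# Minimal sketch: log-loss of a probabilistic classifier (invented labels and probabilities).
from sklearn.metrics import log_loss

y_true = [0, 0, 1, 1]
p_class1 = [0.1, 0.51, 0.9, 0.6]   # predicted probability of class 1 for each data point

print(log_loss(y_true, p_class1))  # smaller is better; 0.51 on a true 0 is a "near miss"
```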
AUC
AUC stands for area under the curve. Here, the curve is the receiver operating characteristic curve, or ROC curve for short. This exotic-sounding name originated in the 1950s from radio signal analysis, and was made popular by a 1978 paper by Charles Metz called "Basic Principles of ROC Analysis." The ROC curve shows the sensitivity of the classifier by plotting the rate of true positives against the rate of false positives (see Figure 2-2). In other words, it shows you how many correct positive classifications can be gained as you allow for more and more false positives. The perfect classifier that makes no mistakes would hit a true positive rate of 100% immediately, without incurring any false positives; this almost never happens in practice.
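A minimal sketch of tracing the ROC curve and computing AUC with scikit-learn (the scores below are invented):

```python
# Minimal sketch: ROC curve points and AUC (invented labels and classifier scores).
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.3, 0.6, 0.2, 0.8, 0.4, 0.9, 0.7]   # higher score = more likely positive

fpr, tpr, thresholds = roc_curve(y_true, scores)    # false/true positive rates per threshold
print(list(zip(fpr, tpr)))
print("AUC:", roc_auc_score(y_true, scores))        # area under the ROC curve; 1.0 is perfect
```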
Ranking Metrics
We've arrived at ranking metrics. But wait! We are not quite out of the classification woods yet. One of the primary ranking metrics, precision-recall, is also popular for classification tasks.

Ranking is related to binary classification. Let's look at Internet search, for example. The search engine acts as a ranker. When the user types in a query, the search engine returns a ranked list of web pages that it considers to be relevant to the query. Conceptually, one can think of the task of ranking as first a binary classification of "relevant to the query" versus "irrelevant to the query," followed by ordering the results so that the most relevant items appear at the top of the list. In an underlying implementation, the classifier may assign a numeric score to each item instead of a categorical class label, and the ranker may simply order the items by the raw score.
Another example of a ranking problem is personalized recommendation. The recommender might act either as a ranker or a score predictor. In the first case, the output is a ranked list of items for each user. In the case of score prediction, the recommender needs to return a predicted score for each user-item pair; this is an example of a regression model, which we will discuss later.
Precision-Recall
Precision and recall are actually two metrics, but they are often used together. Precision answers the question, "Out of the items that the ranker/classifier predicted to be relevant, how many are truly relevant?" Recall answers the question, "Out of all the items that are truly relevant, how many are found by the ranker/classifier?" Figure 2-3 contains a simple Venn diagram that illustrates precision versus recall.
Frequently, one might look at only the top k items from the ranker, k = 5, 10, 20, 100, etc. Then the metrics would be called "precision@k" and "recall@k."
When dealing with a recommender, there are multiple queries of
interest; each user is a query into the pool of items. In this case, we
can average the precision and recall scores for each query and look
at "average precision@k" and "average recall@k." (This is analogous
to the relationship between accuracy and average per-class accuracy
for classification.)
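Here is a sketch of how precision@k and recall@k might be computed for a single query; the item IDs and relevance judgments are invented:

```python
# Minimal sketch: precision@k and recall@k for one ranked result list (invented data).
def precision_recall_at_k(ranked_items, relevant_items, k):
    top_k = ranked_items[:k]
    num_hits = len(set(top_k) & set(relevant_items))   # relevant items found in the top k
    return num_hits / k, num_hits / len(relevant_items)

ranked = ["d3", "d7", "d1", "d9", "d4", "d2"]   # ranker output, best first
relevant = {"d1", "d2", "d3"}                   # ground-truth relevant items

print(precision_recall_at_k(ranked, relevant, k=5))   # (0.4, 0.6666...): 2 of the top 5 are relevant
```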
A single summary number that combines precision and recall is the F1 score, their harmonic mean:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
Unlike the arithmetic mean, the harmonic mean tends toward the
smaller of the two elements. Hence the F1 score will be small if
either precision or recall is small.
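For classification, scikit-learn provides these metrics directly; a quick sketch with invented labels:

```python
# Minimal sketch: precision, recall, and F1 for a binary classifier (invented labels).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

p = precision_score(y_true, y_pred)    # 2 of 3 predicted positives are correct -> 0.667
r = recall_score(y_true, y_pred)       # 2 of 4 true positives are found       -> 0.5
print(p, r, f1_score(y_true, y_pred))  # F1 = 2*p*r / (p + r) ≈ 0.571
```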
NDCG
Precision and recall treat all retrieved items equally; a relevant item
in position k counts just as much as a relevant item in position 1.
But this is not usually how people think. When we look at the results
from a search engine, the top few answers matter much more than
answers that are lower down on the list.
NDCG tries to take this behavior into account. NDCG stands for
normalized discounted cumulative gain. There are three closely
related metrics here: cumulative gain (CG), discounted cumulative
gain (DCG), and finally, normalized discounted cumulative gain.
Cumulative gain sums up the relevance of the top k items. Discounted cumulative gain discounts items that are further down the list. Normalized discounted cumulative gain, true to its name, is a normalized version of discounted cumulative gain. It divides the DCG by the perfect DCG score, so that the normalized score always lies between 0.0 and 1.0. See the Wikipedia article for detailed mathematical formulas.
DCG and NDCG are important metrics in information retrieval and
in any application where the positioning of the returned items is
important.
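The following sketch spells out one common form of these formulas, using the log2 position discount; the relevance scores are invented:

```python
# Minimal sketch of CG, DCG, and NDCG for one ranked list, using the common
# log2 position discount. Relevance scores are invented for illustration.
import numpy as np

def dcg(relevances, k):
    rel = np.asarray(relevances)[:k]
    positions = np.arange(1, len(rel) + 1)
    return np.sum(rel / np.log2(positions + 1))    # items further down the list count less

def ndcg(relevances, k):
    ideal = sorted(relevances, reverse=True)       # best possible ordering of the same items
    return dcg(relevances, k) / dcg(ideal, k)

rels = [3, 2, 0, 0, 1, 2]            # graded relevance of items in ranked order
print("CG@5:", sum(rels[:5]))
print("DCG@5:", dcg(rels, 5))
print("NDCG@5:", ndcg(rels, 5))      # 1.0 means the ranking matches the ideal ordering
```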
Regression Metrics
In a regression task, the model learns to predict numeric scores. For
example, when we try to predict the price of a stock on future days
given past price history and other information about the company
and the market, we can treat it as a regression task. Another example
is personalized recommenders that try to explicitly predict a user's
rating for an item. (A recommender can alternatively optimize for
ranking.)
RMSE
The most commonly used metric for regression tasks is RMSE
(root-mean-square error), also known as RMSD (root-mean-square
deviation). This is defined as the square root of the average squared
distance between the actual score and the predicted score:
$$\text{RMSE} = \sqrt{\frac{\sum_i (y_i - \hat{y}_i)^2}{n}}$$
Here, $y_i$ denotes the true score for the $i$th data point, and $\hat{y}_i$ denotes the predicted value. One intuitive way to understand this formula is that it is the Euclidean distance between the vector of the true scores and the vector of the predicted scores, divided by $\sqrt{n}$, where $n$ is the number of data points.
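A one-line NumPy sketch of the formula above, with invented values:

```python
# Minimal sketch: RMSE with NumPy (invented true and predicted values).
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)   # ~0.612
```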
Quantiles of Errors
RMSE may be the most common metric, but it has some problems.
Most crucially, because it is an average, it is sensitive to large outliers. If the regressor performs really badly on a single data point, the
average error could be very big. In statistical terms, we say that the
mean is not robust (to large outliers).
Quantiles (or percentiles), on the other hand, are much more
robust. To see why this is, let's take a look at the median (the 50th
percentile), which is the element of a set that is larger than half of
the set, and smaller than the other half. If the largest element of a set
changes from 1 to 100, the mean should shift, but the median would
not be affected at all.
One thing that is certain with real data is that there will always be
outliers. The model will probably not perform very well on them.
So it's important to look at robust estimators of performance that aren't affected by large outliers. It is useful to look at the median absolute percentage error (MAPE):
$$\text{MAPE} = \text{median}\left( \left| \frac{y_i - \hat{y}_i}{y_i} \right| \right)$$
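A sketch of these robust summaries with NumPy (invented values); note how a single huge error inflates the RMSE but barely moves the median-based metrics:

```python
# Minimal sketch: robust error summaries with NumPy (invented values).
# One wildly wrong prediction dominates RMSE but not the quantile-based metrics.
import numpy as np

y_true = np.array([10.0, 12.0, 9.0, 11.0, 10.0])
y_pred = np.array([11.0, 11.5, 9.5, 10.0, 60.0])   # the last prediction is a big miss

errors = y_true - y_pred
print("RMSE:", np.sqrt(np.mean(errors ** 2)))             # blown up by the outlier (~22.4)
print("MAPE:", np.median(np.abs(errors / y_true)))        # median absolute percentage error (~0.09)
print("90th percentile abs error:", np.percentile(np.abs(errors), 90))
```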
Caution: Skewed Datasets - Imbalanced Classes, Outliers, and Rare Data
Rare users and rare items also cause problems for a recommender, both during training and evaluation. When not enough data is available in the training data, a recommender model would not be able to learn the user's preferences, or the items that are similar to a rare item. Rare users and items in the evaluation data would lead to a very low estimate of the recommender's performance, which compounds the problem of having a badly trained recommender.

Outliers are another kind of data skew. Large outliers can cause problems for a regressor. For instance, in the Million Song Dataset, a user's score for a song is taken to be the number of times the user has listened to this song. The highest score is greater than 16,000! This means that any error made by the regressor on this data point would dwarf all other errors. The effect of large outliers during evaluation can be mitigated through robust metrics such as quantiles of errors. But this would not solve the problem for the training phase. Effective solutions for large outliers would probably involve careful data cleaning, and perhaps reformulating the task so that it's not sensitive to large outliers.
Related Reading
"An Introduction to ROC Analysis." Tom Fawcett. Pattern Recognition Letters, 2006.

Chapter 7 of Data Science for Business discusses the use of expected value as a useful classification metric, especially in cases of skewed datasets.
Software Packages
Many of the metrics (and more) are implemented in various software packages for data science.

R: Metrics package.

Python: scikit-learn's model evaluation methods and GraphLab Create's fledgling evaluation module.
Offline Evaluation Mechanisms: Hold-Out Validation, Cross-Validation, and Bootstrapping
Now that we've discussed the metrics, let's re-situate ourselves in the machine learning model workflow that we unveiled in Figure 1-1. We are still in the prototyping phase. This stage is where we tweak everything: features, types of model, training methods, etc. Let's dive a little deeper into model selection.
Hold-Out Validation
Hold-out validation is simple. Assuming that all data points are i.i.d.
(independently and identically distributed), we simply randomly
hold out part of the data for validation. We train the model on the
larger portion of the data and evaluate validation metrics on the
smaller hold-out set.
Computationally speaking, hold-out validation is simple to program
and fast to run. The downside is that it is less powerful statistically.
The validation results are derived from a small subset of the data,
hence its estimate of the generalization error is less reliable. It is also
difficult to compute any variance information or confidence intervals on a single dataset.

Use hold-out validation when there is enough data such that a subset can be held out, and this subset is big enough to ensure reliable statistical estimates.
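A minimal hold-out split with scikit-learn might look like this; the dataset and model are placeholders:

```python
# Minimal sketch: hold-out validation with scikit-learn (synthetic data, placeholder model).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, random_state=0)

# Randomly hold out 20% of the data for validation; train on the remaining 80%.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("hold-out accuracy:", accuracy_score(y_val, model.predict(X_val)))
```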
Cross-Validation
Cross-validation is another validation technique. It is not the only validation technique, and it is not the same as hyperparameter tuning. So be careful not to get the three (the concept of model validation, cross-validation, and hyperparameter tuning) confused with each other. Cross-validation is simply a way of generating training and validation sets for the process of hyperparameter tuning. Hold-out validation, another validation technique, is also valid for hyperparameter tuning, and is in fact computationally much cheaper.

There are many variants of cross-validation. The most commonly used is k-fold cross-validation. In this procedure, we first divide the training dataset into k folds (see Figure 3-2). For a given hyperparameter setting, each of the k folds takes turns being the hold-out validation set; a model is trained on the rest of the k - 1 folds and measured on the held-out fold. The overall performance is taken to be the average of the performance on all k folds. Repeat this procedure for all of the hyperparameter settings that need to be evaluated, then pick the hyperparameters that resulted in the highest k-fold average.

Another variant of cross-validation is leave-one-out cross-validation. This is essentially the same as k-fold cross-validation, where k is equal to the total number of data points in the dataset.
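A sketch of the k-fold procedure for picking a hyperparameter with scikit-learn; the candidate values and the model are illustrative:

```python
# Minimal sketch: k-fold cross-validation for hyperparameter selection.
# The candidate regularization values and the model choice are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

candidates = [0.01, 0.1, 1.0, 10.0]            # hyperparameter settings to evaluate
avg_scores = []
for C in candidates:
    scores = cross_val_score(LogisticRegression(C=C, max_iter=1000), X, y, cv=5)
    avg_scores.append(scores.mean())            # average accuracy over the 5 folds

best = candidates[int(np.argmax(avg_scores))]
print("k-fold averages:", avg_scores, "-> best C:", best)
```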
Bootstrapping
The bootstrap resamples the training dataset with replacement to form new datasets; one can train a model on a bootstrapped sample and validate results on the rest of the unselected data. The effects are very similar to what one would get from cross-validation.
Summary
To recap, here are the important points for offline evaluation and
model validation:
1. During the model prototyping phase, one needs to do model selection. This involves hyperparameter tuning as well as model training.
Related Reading
"The Bootstrap: Statisticians Can Reuse Their Data to Quantify the Uncertainty of Complex Models." Cosma Shalizi. American Scientist, May-June 2010.
Software Packages
R: cvTools
Python: scikit-learn provides a cross-validation module and out-of-bag estimators that follow the same idea as bootstrapping. GraphLab Create offers hold-out validation and cross-validation.
Hyperparameter Tuning
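In outline, hyperparameter tuning wraps an outer loop around ordinary training: for each candidate hyperparameter setting, train a model, evaluate it on validation data, and keep the best setting. Here is a minimal Python sketch of that loop; the function names train_model and eval_model are placeholders, not a particular library's API.

```python
# Minimal sketch of the hyperparameter tuning outer loop.
# train_model and eval_model are hypothetical placeholders for your own training
# and evaluation routines; hp_candidates is a list of hyperparameter settings.
def tune_hyperparameters(hp_candidates, train_data, validation_data,
                         train_model, eval_model):
    results = []
    for hp in hp_candidates:
        model = train_model(train_data, hp)           # inner loop: ordinary training
        score = eval_model(model, validation_data)    # validation metric for this setting
        results.append((score, hp))
    best_score, best_hp = max(results, key=lambda r: r[0])   # keep the best-scoring setting
    return best_hp, best_score
```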
A loop like this is correct for grid search and random search. But the smart search methods do not require a static list of candidate settings as input; rather, they do something smarter than a for-loop through a fixed set of candidates. We'll see how later.
Grid Search
Grid search, true to its name, picks out a grid of hyperparameter values, evaluates every one of them, and returns the winner. For example, if the hyperparameter is the number of leaves in a decision tree, then the grid could be 10, 20, 30, ..., 100. For regularization parameters, it's common to use an exponential scale: 1e-5, 1e-4, 1e-3, ..., 1. Some guesswork is necessary to specify the minimum and maximum values. So sometimes people run a small grid, see if the optimum lies at either endpoint, and then expand the grid in that direction. This is called manual grid search.
Grid search is dead simple to set up and trivial to parallelize. It is the most expensive method in terms of total computation time. However, if run in parallel, it is fast in terms of wall clock time.
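A sketch with scikit-learn's grid search helper, which combines the grid with k-fold cross-validation; the grid values and model are illustrative:

```python
# Minimal sketch: grid search over a regularization parameter with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

param_grid = {"C": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]}   # exponential scale, as above
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, n_jobs=-1)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```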
Random Search
I love movies where the underdog wins, and I love machine learning papers where simple solutions are shown to be surprisingly effective. This is the storyline of "Random Search for Hyper-Parameter Optimization" by Bergstra and Bengio. Random search is a slight variation on grid search. Instead of searching over the entire grid, random search only evaluates a random sample of points on the grid. This makes random search a lot cheaper than grid search. Random search wasn't taken very seriously before. This is because it doesn't search over all the grid points, so it cannot possibly beat the optimum found by grid search. But then along came Bergstra and Bengio. They showed that, in surprisingly many instances, random search performs about as well as grid search. All in all, trying 60 random points sampled from the grid seems to be good enough.
In hindsight, there is a simple probabilistic explanation for the result: for any distribution over a sample space with a finite maximum, the maximum of 60 random observations lies within the top 5% of the true maximum, with 95% probability. That may sound complicated, but it's not. Imagine the 5% interval around the true maximum. Now imagine that we sample points from this space and see if any of them land within that interval. Each random draw has a 5% chance of landing in that interval; if we draw n points independently, then the probability that all of them miss the desired interval is (1 - 0.05)^n. So the probability that at least one of them succeeds in hitting the interval is 1 minus that quantity. We want at least a 0.95 probability of success. To figure out the number of draws we need, just solve for n in the following equation:

$$1 - (1 - 0.05)^n > 0.95$$

We get n >= 60. Ta-da!
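A one-line sanity check of this claim: with 60 draws, the probability that at least one lands in the top-5% interval already exceeds 0.95.

```python
# Quick check of the random-search argument: probability that at least one of
# 60 independent draws lands in the top-5% interval around the maximum.
n = 60
print(1 - (1 - 0.05) ** n)   # ~0.954, which exceeds the desired 0.95
```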
The moral of the story is: if at least 5% of the points on the grid yield
a close-to-optimal solution, then random search with 60 trials will find
that region with high probability. The condition of the if-statement is
very important. It can be satisfied if either the close-to-optimal
region is large, or if somehow there is a high concentration of grid
points in that region. The former is more likely, because a good
machine learning model should not be overly sensitive to the hyperparameters, i.e., the close-to-optimal region is large.

With its utter simplicity and surprisingly reasonable performance, random search is my go-to method for hyperparameter tuning. It's trivially parallelizable, just like grid search, but it takes far fewer trials and performs almost as well most of the time.
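A sketch with scikit-learn's randomized search, drawing 60 trials from a log-uniform range instead of a fixed grid; the distribution and model are illustrative:

```python
# Minimal sketch: random search over a regularization parameter with scikit-learn.
# 60 trials are sampled from a log-uniform range rather than a fixed grid.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

param_dist = {"C": loguniform(1e-5, 1e1)}
search = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_dist,
                            n_iter=60, cv=5, random_state=0, n_jobs=-1)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```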
Related Reading
"Random Search for Hyper-Parameter Optimization." James Bergstra and Yoshua Bengio. Journal of Machine Learning Research, 2012.

"Algorithms for Hyper-Parameter Optimization." James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Neural Information Processing Systems, 2011. See also a SciPy 2013 talk by the authors.

"Practical Bayesian Optimization of Machine Learning Algorithms." Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Neural Information Processing Systems, 2012.

"Sequential Model-Based Optimization for General Algorithm Configuration." Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Learning and Intelligent Optimization, 2011.

"Lazy Paired Hyper-Parameter Tuning." Alice Zheng and Mikhail Bilenko. International Joint Conference on Artificial Intelligence, 2013.

Introduction to Derivative-Free Optimization (MPS-SIAM Series on Optimization). Andrew R. Conn, Katya Scheinberg, and Luis N. Vicente, 2009.

"Gradient-Based Hyperparameter Optimization Through Reversible Learning." Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. arXiv, 2015.
Software Packages
Grid search and random search: GraphLab Create, scikit-learn.
Bayesian optimization using Gaussian processes: Spearmint (from Snoek et al.)
Bayesian optimization using Tree-based Parzen Estimators:
Hyperopt (from Bergstra et al.)
Random forest tuning: SMAC (from Hutter et al.)
Hyper gradient: hypergrad (from Maclaurin et al.)
The Pitfalls of A/B Testing
Recall that there are roughly two regimes for machine learning evaluation: offline and online. Offline evaluation happens during the prototyping phase, where one tries out different features, models, and hyperparameters. It's an iterative process of many rounds of evaluation against a chosen baseline on a set of chosen evaluation metrics. Once you have a model that performs reasonably well, the next step is to deploy the model to production and evaluate its performance online, i.e., on live data. This chapter discusses online testing.
2. Which Metric?
The next important question is, on which metric should you evaluate the model? Ultimately, the right metric is probably a business metric. But this may not be easily measurable in the system. For instance, search engines care about the number of users, how long they spend on the site, and their overall market share. Comparison statistics like these are not readily available to the live system, so they will need to approximate the ultimate business metric of market share with measurable ones like the number of unique visitors per day and the average session length. In practice, short-term, measurable live metrics may not always align with long-term business metrics, and it can be tricky to design the right metric.
Backing up for a second, there are four classes of metrics to think about: business metrics, measurable live metrics, offline evaluation metrics, and training metrics. We just discussed the difference between business metrics and live metrics that can be measured. Offline evaluation metrics are things like the classification, regression, and ranking metrics we discussed previously. The training metric is the loss function that is optimized during the training process. (For example, a support vector machine optimizes a combination of the norm of the weight vector and misclassification penalties.)
The optimal scenario is where all four of those metrics are either exactly the same or are linearly aligned with each other. The former is impossible. The latter is unlikely. So the next thing to shoot for is that these metrics always increase or decrease with each other. However, you may still encounter situations where a linear decrease in RMSE (a regression metric) does not translate to a linear increase in click-through rates. (Kohavi et al. described some interesting examples in their KDD 2012 paper.) Keep this in mind and save your efforts to optimize where it counts the most. You should always be tracking all of these metrics, so that you know when things go out of whack; that is usually a sign of distribution drift or software and instrumentation bugs.
Like question #2, this is probably not solely a data science question but a business question. Pick a reasonable value up front and stick to it. Avoid the temptation to shift it later, as you start to see the results.
Some metrics are distinctly not Gaussian, and their standard error may not decrease with longer tests. For example, metrics involving counts are better modeled as negative binomials.

When these assumptions are violated, the distribution may take longer than usual to converge to a Gaussian, or not at all. Usually, the average of more than 30 observations starts to look like a Gaussian. When there is a mixture of populations, however, it will take much longer. Here are a few rules of thumb that can mitigate the violation of t-test assumptions:
1. If the metric is nonnegative and has a long tail, i.e., it's a count of some sort, take the log transform.
2. Alternatively, the family of power transforms tends to stabilize the variance (decrease the variance or at least make it not dependent on the mean) and make the distribution more Gaussian-like.
3. The negative binomial is a better distribution for counts.
4. If the distribution looks nowhere near a Gaussian, don't use the t-test. Pick a nonparametric test that doesn't make the Gaussian assumption, such as the Mann-Whitney U test (see the sketch after this list).
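A minimal sketch of rules 1 and 4 with SciPy; the two samples of per-user counts are invented:

```python
# Minimal sketch: handling a long-tailed count metric in an A/B test (invented counts).
import numpy as np
from scipy import stats

clicks_a = np.array([0, 1, 0, 2, 1, 0, 3, 40, 1, 0])   # per-user counts, bucket A
clicks_b = np.array([1, 2, 0, 3, 2, 1, 4, 1, 2, 55])   # per-user counts, bucket B

# Rule 1: log-transform a nonnegative, long-tailed count metric before a t-test.
t, p = stats.ttest_ind(np.log1p(clicks_a), np.log1p(clicks_b))
print("t-test on log counts: p =", p)

# Rule 4: or skip the Gaussian assumption entirely with the Mann-Whitney U test.
u, p = stats.mannwhitneyu(clicks_a, clicks_b, alternative="two-sided")
print("Mann-Whitney U test: p =", p)
```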
Unlike the standard two-sample t-test, Welch's t-test does not assume that the two populations have equal variance. In practice, this may not make too big of a difference, because the t-distribution is well approximated by the Gaussian when the sample sizes are larger than 20. However, Welch's t-test is a safe choice that works regardless of sample size or whether the variances are equal. So why not?
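In SciPy, Welch's version is just a flag on the standard two-sample t-test; the samples below are invented:

```python
# Minimal sketch: Welch's t-test via SciPy (equal_var=False drops the equal-variance
# assumption of the standard two-sample t-test). The samples are invented.
import numpy as np
from scipy import stats

metric_a = np.random.default_rng(0).normal(loc=0.10, scale=0.05, size=200)   # bucket A
metric_b = np.random.default_rng(1).normal(loc=0.11, scale=0.08, size=200)   # bucket B

t, p = stats.ttest_ind(metric_a, metric_b, equal_var=False)   # Welch's t-test
print("t =", t, "p =", p)
```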
If one test has a false positive rate of 0.05, then the probability that none of 20 simultaneous tests makes a false positive drops precipitously to (1 - 0.05)^20 ≈ 0.36. What's more, this calculation assumes that the tests are independent. If the tests are not independent (i.e., maybe your 32 models all came from the same training dataset?), then the probability of a false positive may be even higher.
Benjamini and Hochberg proposed a useful method for dealing with false positives in multiple tests. In their 1995 paper, "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," they proposed a modified procedure that orders the p-values from each test and rejects the null hypothesis for the tests with the smallest normalized p-values (those satisfying p(i) <= (i/m) q, where q is the desired significance level, m is the total number of tests, and i is the rank of the p-value in ascending order). This procedure does not assume that the tests are independent or are normally distributed, and has more statistical power than the classic Bonferroni correction.
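Here is a sketch of the Benjamini-Hochberg step-up rule in plain NumPy, with invented p-values; statsmodels implements the same rule as multipletests(..., method="fdr_bh"):

```python
# Minimal sketch of the Benjamini-Hochberg step-up procedure (invented p-values).
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                          # rank the p-values, smallest first
    thresholds = q * (np.arange(1, m + 1) / m)     # sliding threshold (i/m) * q
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()             # largest rank still under its threshold
        rejected[order[:k + 1]] = True             # reject all tests up to that rank
    return rejected

p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.21, 0.76]
print(benjamini_hochberg(p_vals, q=0.05))          # True = null hypothesis rejected
```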
Even without running multiple tests simultaneously, you may still
run into the multiple hypothesis testing scenario. For instance, if
you are changing your model based on live test data, submitting
new models until something achieves the acceptance threshold, then
you are essentially running multiple tests sequentially. It's a good
idea to apply the Benjamini-Hochberg procedure (or one of its
derivatives) to control the false discovery rate in this situation as
well.
Another concern is distribution drift, where the behavior of the user changes faster than one can collect enough observations. (See question #12.)
When determining the length of a trial, it's important to go beyond what's known as the novelty effect. When users are switched to a new experience, their initial reactions may not be their long-term reactions. In other words, if you are testing a new color for a button, the user may initially love the button and click it more often, just because it's novel, or she may hate the new color and never touch it, but eventually she would get used to the new color and behave as she did before. It's important to run the trial long enough to get past the period of the "shock of the new."
The metric may also display seasonality. For instance, website traffic may behave one way during the day and another way at night, or perhaps people buy different types of clothes in the summer versus fall. It's important to take this into account and discount foreseeable changes when collecting data for the trial.
Related Reading
"Deploying Machine Learning in Production," slides from my Strata London 2015 talk.

"So, You Need a Statistically Significant Sample?" Kim Larsen, Stitch Fix blog post, May 2015.