WEB TECHNOLOGIES
Online Experiments:
Practical Lessons
Ron Kohavi, Roger Longbotham,
and Toby Walker, Microsoft
When running online experiments, getting numbers is easy;
getting numbers you can trust is hard.
From ancient times through the 19th century, physicians used bloodletting to treat acne, cancer, diabetes, jaundice, plague, and hundreds of other diseases and ailments (D. Wootton, Doctors Doing Harm since Hippocrates, Oxford Univ. Press, 2006). It was judged most effective to bleed patients while they were sitting upright or standing erect, and blood was often removed until the patient fainted. On 12 December 1799, 67-year-old President George Washington rode his horse in heavy snowfall to inspect his plantation at Mount Vernon. A day later, he was in respiratory distress and his doctors extracted nearly half of his blood over 10 hours, causing anemia and hypotension; he died that night. Today, we know that bloodletting is unhelpful because in 1828 a Parisian doctor named Pierre Louis did a controlled experiment. He treated 78 people suffering from pneumonia with early and frequent bloodletting or less aggressive measures and found that bloodletting didn’t help survival rates or recovery times.

Having roots in agriculture and medicine, controlled experiments have spread into the online world of websites and services. In an earlier Web Technologies article (R. Kohavi and R. Longbotham, “Online Experiments: Lessons Learned,” Computer, Sept. 2007, pp. 85-87) and a related survey (R. Kohavi et al., “Controlled Experiments on the Web: Survey and Practical Guide,” Data Mining and Knowledge Discovery, Feb. 2009, pp. 140-181), Microsoft’s Experimentation Platform team introduced basic practices of good online experimentation. Three years later and having run hundreds of experiments on more than 20 websites, including some of the world’s largest, like msn.com and bing.com, we have learned some important practical lessons about the limitations of standard statistical formulas and about data traps. These lessons, even for seemingly simple univariate experiments, aren’t taught in Statistics 101. After reading this article, we hope you’ll have better negative introspection: to know what you don’t know.

ONLINE EXPERIMENTS: A QUICK REVIEW

In an online controlled experiment, users are randomly assigned to two or more groups for some period of time and exposed to different variants of the website. The most common online experiment, the A/B test, has two variants: the A version of the site is the control and the B version is the treatment.

The experimenters define an overall evaluation criterion (OEC) and compute a statistic—for example, the mean of the OEC—for each variant. The OEC statistic is also referred to as a key performance indicator (KPI); in statistics, the OEC is often called the response or dependent variable.

The difference between the OEC statistic for the treatment and control groups is the treatment effect. If the experiment was designed and executed properly, the only thing consistently different between the two variants is the planned change between the control and treatment. Consequently, any statistically significant effect is the result of the planned change, establishing causality with high probability.

Common extensions to simple A/B tests include multiple variants along a single axis—for example, A/B/C/D—and multivariable tests that expose users to changes along several axes, such as font choice, size, and color.
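To make these definitions concrete, here is a minimal sketch (an illustration using SciPy, not the Experimentation Platform's code) that takes per-user OEC values for each variant, computes the OEC statistic as a mean, and tests whether the treatment effect is statistically significant:

```python
import numpy as np
from scipy import stats

def analyze_ab_test(control_oec, treatment_oec, alpha=0.05):
    """Compare the mean OEC of treatment vs. control with a two-sample t-test.

    control_oec, treatment_oec: per-user OEC values (for example, clicks per
    user), one entry per randomized user; users are assumed independent.
    """
    control_oec = np.asarray(control_oec, dtype=float)
    treatment_oec = np.asarray(treatment_oec, dtype=float)

    effect = treatment_oec.mean() - control_oec.mean()  # treatment effect

    # Welch's t-test; it does not assume equal variances in the two groups.
    _, p_value = stats.ttest_ind(treatment_oec, control_oec, equal_var=False)

    return {
        "control_mean": control_oec.mean(),
        "treatment_mean": treatment_oec.mean(),
        "treatment_effect": effect,
        "p_value": p_value,
        "significant": p_value < alpha,
    }
```

If the assignment was properly randomized, a small p-value here supports attributing the observed effect to the planned change, which is the causal claim described above.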
LIMITATIONS OF STATISTICAL FORMULAS

Most online experiments randomly assign users to variants based on a user ID stored in a cookie. Thus “users” are statistically independent of one another, and any metric calculated by user can employ standard statistical methods requiring independence—t-tests, analysis of variance (ANOVA), and so on. Standard metrics like clicks per user work well for this analysis, but many important metrics aren’t defined by user but by, for example, page view or session.

One metric used extensively in online experiments is click-through rate (CTR), which is the total number of clicks divided by the total number of impressions. This formula can be used to estimate the average CTR, but estimating the standard deviation is difficult. It’s tempting to treat this as a sequence of independent Bernoulli trials, with each page view either getting a click or not, but this will underestimate the standard deviation, and too many experiments will incorrectly appear as statistically significant. The assumption of independence of page views doesn’t hold because clicks and page views from the same user are correlated. For the same reason, metrics calculated by user-day or by session are also correlated.

To address this problem, we use both bootstrapping and the delta method. Bootstrapping “resamples” the dataset to estimate the standard deviation. It’s computationally intensive, especially for large datasets. The delta method, on the other hand, is computationally simpler because it uses a simple, low-order Taylor series approximation to compute the variance. For large sample sizes it works well because higher-order terms converge to zero.
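As an illustration of both approaches, here is a minimal sketch that estimates the standard error of CTR from per-user click and page-view totals, first with the delta method and then with a user-level bootstrap (the function names and the assumption that per-user totals are already aggregated are ours, not the platform's):

```python
import numpy as np

def delta_method_ctr_se(clicks, pageviews):
    """Standard error of CTR = sum(clicks) / sum(pageviews) via the delta
    method; clicks[i] and pageviews[i] are totals for user i, and users
    (not page views) are treated as independent."""
    clicks = np.asarray(clicks, dtype=float)
    pageviews = np.asarray(pageviews, dtype=float)
    n = len(clicks)
    mu_c, mu_p = clicks.mean(), pageviews.mean()
    var_c, var_p = clicks.var(ddof=1), pageviews.var(ddof=1)
    cov_cp = np.cov(clicks, pageviews)[0, 1]
    # First-order Taylor expansion of the ratio mu_c / mu_p.
    ratio_var = (var_c / mu_p**2
                 - 2.0 * mu_c * cov_cp / mu_p**3
                 + mu_c**2 * var_p / mu_p**4) / n
    return np.sqrt(ratio_var)

def bootstrap_ctr_se(clicks, pageviews, n_boot=1000, seed=0):
    """Standard error of CTR by resampling whole users with replacement."""
    clicks = np.asarray(clicks, dtype=float)
    pageviews = np.asarray(pageviews, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(clicks)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample users, not page views
        estimates[b] = clicks[idx].sum() / pageviews[idx].sum()
    return estimates.std(ddof=1)
```

Resampling whole users, rather than individual page views, preserves the within-user correlation that the naive Bernoulli formula ignores, which is why both estimates are typically larger than the Bernoulli one.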
AVOIDABLE DATA TRAPS

In our work we’ve encountered many common data traps that are avoidable. Some are seemingly obvious; others are obvious in hindsight, but the hindsight was painfully won.

Variant assignment

A basic requirement of variant assignment is that randomization work correctly. We’ve observed subtle examples of variant assignment bias that significantly impacted the results. For example, in one Bing experiment, a small misconfiguration caused internal Microsoft traffic to always be assigned to the control. In another experiment on the MSN homepage, cobranded users, whose page is slightly different, were always shown the control. Even a small 1 percent imbalance between the control and treatment can easily cause a 5 percent delta to the treatment effect. Such users, if not assigned randomly, must be excluded from the analysis.
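For context, a common way to implement cookie-based assignment (a generic sketch, not necessarily how Microsoft's platform does it) hashes the user ID together with an experiment identifier into a bucket and compares the bucket against the target allocation; any population that bypasses this path, such as internal traffic or cobranded pages, is effectively assigned nonrandomly and must be excluded:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   treatment_percent: float = 50.0) -> str:
    """Deterministically map a user ID to 'control' or 'treatment'.

    Hashing user_id together with experiment_id gives each experiment an
    independent randomization; a user keeps the same variant as long as the
    cookie-stored user_id is stable.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10000  # 0..9999
    return "treatment" if bucket < treatment_percent * 100 else "control"
```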
User identification

User IDs are typically stored in browser cookies. In Microsoft’s ecosystem, which has multiple domains (msn.com, microsoft.com, bing.com, live.com), synchronization of identity across domains is complex. Because user assignment to variants is based on user ID, the user’s experience may “flip” to another variant if a user’s cookie changes due to the identity synchronization. In addition, if one of the variants updates the cookie more frequently, there will be a consistent bias.

Cookie-related bugs are extremely hard to debug. In one experiment, the Microsoft support site cached page outputs and didn’t update the cookies properly only for one variant, causing a significant bias. It’s therefore important to check for consistent allocation of the users against the target percentages.

Data loss

Online experiments are vulnerable to data loss at all stages of the pipeline: logging, extract, transform, and load. We recommend monitoring for data loss at both the aggregate and machine level. At best, the data loss is unbiased and you’ll have lost some sample power, but in some cases, the loss can bias the results. In one experiment, users were sent to one website in the control and a different one in the treatment. The website in the control redirected non-US users to their international site, which wasn’t logging to the same system, and they were “lost” to the experiment. Using browser redirects to send users to a variant if they aren’t in the control also introduces subtle biases, including performance differences.

Event tracking

Most online experiments log clicks to the servers using some form of callback, such as JavaScript handlers or URL redirects. Such tracking mechanisms can cause problems. In one experiment, an unexpectedly strong negative impact on CTR surfaced. An investigation revealed a problem with the callback mechanism used for tracking clicks that affected only Internet Explorer 6 (IE6). Another experiment used redirects for tracking ads in the control and JavaScript callbacks for the treatment. Because the loss rate for these is different and because some robots don’t respect JavaScript, there was significant bias in click-through metrics.
Server differences

A tempting experimental design is to run one variant—say, the control—on an existing server fleet and the treatment on another fleet. If there are systematic differences in the servers—different hardware capabilities, a different network, patches, more crashes, different data centers, and so on—any or all of these can severely bias the experiment.

System-wide interference

Because online experiments for the same site typically run in the same physical systems, experiment variants can interfere with one another or with other experiments when running at the same time. For example, if variants use a shared resource—say, a least recently used (LRU) cache—differential use of that resource can have unexpected consequences on the entire system and thus the experiments.

For example, in one experiment, many key performance metrics went down when a small change was made to how results were generated for a page. The cause turned out to be an unexpected cache-hit-rate difference between the control and treatment. Websites cache results for commonly requested pages or searches to improve performance. While both the control and treatment cached results in the same cache, they used different keys for the same request. Because the control received much more traffic than the treatment, it occupied a much larger part of the shared cache. The result was that the treatment had a much lower cache hit rate than the control, resulting in significantly slower performance.

The solution in this case was simple: create a special “dummy” control whose allocation matches the treatment; this equalizes the cache hit rate of the control and treatment. The lesson: look for timing differences in the user experience between variants.

UNAVOIDABLE DATA TRAPS

It’s harder to keep other data traps from affecting your online experiment.

Robots

Robot requests are website requests that don’t come from users’ direct interaction with the website. Robot traffic can have nonmalicious sources—ranging from browser add-ins that prefetch pages linked from the current one, to search-engine crawlers, to a graduate student scraping webpages—as well as malicious sources like ad-click spam. For a large search engine or online portal site, robots can commonly represent 15 to 30 percent of page views.

It’s naïve to hope that robots won’t affect experimental results. Almost all experiments are conducted to optimize the website for human users. Not only are robots not part of the population of interest, adding background noise, they can seriously impact the conclusions.

Robots can easily bias experimental results because some robots have many more events—page views, clicks, searches, and so on—than members of the true user population. This biases the mean and increases the variance of the groups into which they’re randomized. Increased variance means less chance to detect effects. For example, a robot we discovered in one experiment claimed it was IE8 yet “clicked” 600,000 times in a month. A single robot like this significantly skews multiple metrics if not detected and removed.

Some metrics are more sensitive to robot bias than others. Any metric with no upper bound is much more sensitive to outliers than metrics having a bounded value. For example, a metric that computes average number of queries per user can be heavily influenced by a single robot that issues thousands of queries. Metrics such as conversion rate that take only 0 or 1 for each user are least affected by robots or outliers. When business requirements dictate nonrobust metrics, we recommend calculating related robust metrics and investigating directional differences—for example, from robots.

Detecting and removing robots is both critical and challenging. Most robots, especially nonmalicious ones, can be removed using simple techniques. For example, you could remove users with an unlikely large number of queries, who issue a significant number of queries but never click, or who issue queries too rapidly; you could also blacklist known user agents. Some robots will still slip through, so it’s a good practice to use manual or automated methods to look for outliers. Finally, you should use robust statistical methods—nonparametric or rank-based techniques, robust ANOVA, and so on—to determine whether the data are similar to those of a standard analysis using, say, t-tests or ANOVA. A large discrepancy in the results warrants further investigation.

Finally, note that robot detection is an adversarial AI-complete problem: your goal is to identify the nonhuman robots, while adversaries write robots that masquerade as humans—for example, using distributed botnets. Above all, be vigilant.
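As a concrete illustration of the simple techniques above, here is a sketch of a pre-analysis filter (the thresholds, column names, and user-agent list are illustrative assumptions, not Microsoft's actual rules):

```python
import pandas as pd

# Illustrative thresholds; real values would be tuned per site.
MAX_QUERIES = 500                  # unlikely large number of queries
MIN_QUERIES_WITHOUT_CLICK = 100    # many queries but never a click
MIN_SECONDS_BETWEEN_QUERIES = 1.0  # queries issued too rapidly
KNOWN_ROBOT_AGENTS = {"Googlebot", "bingbot", "AhrefsBot"}

def remove_likely_robots(events: pd.DataFrame) -> pd.DataFrame:
    """events: one row per query with columns
    ['user_id', 'timestamp', 'clicks', 'user_agent']; timestamp is datetime64."""
    per_user = events.groupby("user_id").agg(
        queries=("timestamp", "size"),
        clicks=("clicks", "sum"),
        min_gap=("timestamp",
                 lambda t: t.sort_values().diff().dt.total_seconds().min()),
    )
    suspicious = per_user[
        (per_user["queries"] > MAX_QUERIES)
        | ((per_user["queries"] >= MIN_QUERIES_WITHOUT_CLICK)
           & (per_user["clicks"] == 0))
        | (per_user["min_gap"] < MIN_SECONDS_BETWEEN_QUERIES)
    ].index
    by_agent = events.loc[events["user_agent"].isin(KNOWN_ROBOT_AGENTS),
                          "user_id"].unique()
    robots = suspicious.union(pd.Index(by_agent))
    return events[~events["user_id"].isin(robots)]
```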
Configuration drift

In a typical online system, operation is controlled by many different configuration settings distributed across multiple systems. These settings change while the experiment runs. There’s a danger of the control and treatment drifting apart so they no longer match the original experiment’s design. A good example is a bug fix made only to the control.
IDENTIFYING DATA TRAPS USING STATISTICS

We use a three-step process to help validate experimental data:

• logging audits,
• A/A experiments, and
• data validation checks.

We’ve previously written about the importance of conducting logging audits and A/A experiments before doing the first A/B experiment (“Controlled Experiments on the Web: Survey and Practical Guide”). As an experiment is running, you should conduct ongoing and final data validation checks. While there are nonstatistical checks that you should always do—for example, check for time periods with missing data—here we focus on statistical data validation checks.
We recommend defining statistical integrity constraints, criteria that should hold with high probability even if there’s a large treatment effect, and checking them with statistical hypothesis tests. For example, if an experiment assigns 50 percent of users to the control and 50 percent to the treatment, you could use a binomial or chi-squared test to check that the percent in each isn’t significantly different from the expected 50 percent. If the percentages turned out to be, say, 50.1 and 49.9, that could mean almost a 0.5 percent bias in key metrics depending on the reason for the imbalance. If the p-value of the test is small, you should assume some bias could exist and investigate.
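A minimal sketch of such a check (generic SciPy usage with an assumed 50/50 design, not the platform's implementation) applies a chi-squared goodness-of-fit test to the observed user counts per variant:

```python
from scipy import stats

def check_allocation(control_users, treatment_users, expected_split=(0.5, 0.5)):
    """Chi-squared test that observed user counts match the target allocation.

    A small p-value suggests the split deviates from the design and the
    experiment's results shouldn't be trusted until the cause is understood.
    """
    total = control_users + treatment_users
    expected = [total * expected_split[0], total * expected_split[1]]
    return stats.chisquare([control_users, treatment_users], f_exp=expected)

# Example: 500,000 vs. 498,000 users looks close to 50/50, yet the test
# yields p of roughly 0.045, flagging a possible assignment problem.
chi2, p_value = check_allocation(500_000, 498_000)
```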
Data-quality metrics are an important type of statistical integrity constraint. These are metrics for both the control and treatment that the experiment shouldn’t affect—for example, cache hit rates. A significant difference in a data-quality metric calls for further investigation.

We also recommend looking for unexpected changes over time in an experiment’s key statistics: sample sizes, treatment effect, means, and variance. Our analysis of one experiment showed a 2 percent increase of the treatment over the control. However, a graph of the treatment and control means over time revealed an unexpected increase in the treatment effect for a seven-hour period, as Figure 1 shows. We determined that the increase was due to an uncontrolled difference in the websites during this period (editorial error); after we removed the data for this seven-hour period, there was no longer a difference between the treatment and control.

Figure 1. Hourly click-through rate (CTR) for the treatment and control of one experiment revealed an uncontrolled difference in the websites during a seven-hour period.
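A small sketch of this kind of over-time monitoring (the column names and variant labels are assumptions for illustration) aggregates the OEC by hour and flags hours where the treatment effect departs sharply from its typical level:

```python
import pandas as pd

def hourly_effect(events: pd.DataFrame) -> pd.DataFrame:
    """Hourly OEC means per variant plus their difference, for plots like
    Figure 1. events needs columns ['timestamp', 'variant', 'oec'], with
    variant values 'control' and 'treatment'."""
    hourly = (events
              .assign(hour=events["timestamp"].dt.floor("H"))
              .pivot_table(index="hour", columns="variant",
                           values="oec", aggfunc="mean"))
    hourly["effect"] = hourly["treatment"] - hourly["control"]
    return hourly

def flag_anomalous_hours(hourly: pd.DataFrame, z: float = 3.0) -> pd.DataFrame:
    """Return hours whose treatment effect is more than z standard
    deviations from the mean hourly effect."""
    scores = (hourly["effect"] - hourly["effect"].mean()) / hourly["effect"].std(ddof=1)
    return hourly[scores.abs() > z]
```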
Because data issues are the norm, not the exception, in online experiments, we advocate a healthy degree of “data paranoia.” Finding and understanding such issues are key to getting the right results but also require a substantial investment of time and energy. Even the smallest anomalies can lead to new insights. Getting numbers is easy; getting numbers you can trust is hard.

Ron Kohavi is the general manager of Microsoft’s Experimentation Platform, Roger Longbotham is the team’s analytics manager, and Toby Walker works on the Bing Data Mining Team as technical lead for experiment analytics. Contact them at http://exp-platform.com.

Editor: Simon S.Y. Shim, Dept. of Computer Engineering, San Jose State Univ., San Jose, CA; simon.shim@sjsu.edu