WEB TECHNOLOGIES
Online Experiments:
Practical Lessons
Ron Kohavi, Roger Longbotham,
and Toby Walker, Microsoft
When running online experiments, getting numbers is easy;
getting numbers you can trust is hard.
From ancient times through the 19th century, physicians used bloodletting to treat acne, cancer, diabetes, jaundice, plague, and hundreds of other diseases and ailments (D. Wootton, Doctors Doing Harm since Hippocrates, Oxford Univ. Press, 2006). It was judged most effective to bleed patients while they were sitting upright or standing erect, and blood was often removed until the patient fainted. On 12 December 1799, 67-year-old President George Washington rode his horse in heavy snowfall to inspect his plantation at Mount Vernon. A day later, he was in respiratory distress and his doctors extracted nearly half of his blood over 10 hours, causing anemia and hypotension; he died that night. Today, we know that bloodletting is unhelpful because in 1828 a Parisian doctor named Pierre Louis did a controlled experiment. He treated 78 people suffering from pneumonia with early and frequent bloodletting or less aggressive measures and found that bloodletting didn’t help survival rates or recovery times.

Having roots in agriculture and medicine, controlled experiments have spread into the online world of websites and services. In an earlier Web Technologies article (R. Kohavi and R. Longbotham, “Online Experiments: Lessons Learned,” Computer, Sept. 2007, pp. 85-87) and a related survey (R. Kohavi et al., “Controlled Experiments on the Web: Survey and Practical Guide,” Data Mining and Knowledge Discovery, Feb. 2009, pp. 140-181), Microsoft’s Experimentation Platform team introduced basic practices of good online experimentation. Three years later and having run hundreds of experiments on more than 20 websites, including some of the world’s largest, like msn.com and bing.com, we have learned some important practical lessons about the limitations of standard statistical formulas and about data traps. These lessons, even for seemingly simple univariate experiments, aren’t taught in Statistics 101. After reading this article, we hope you’ll have better negative introspection: to know what you don’t know.

ONLINE EXPERIMENTS: A QUICK REVIEW

In an online controlled experiment, users are randomly assigned to two or more groups for some period of time and exposed to different variants of the website. The most common online experiment, the A/B test, has two variants: the A version of the site is the control and the B version is the treatment.

The experimenters define an overall evaluation criterion (OEC) and compute a statistic—for example, the mean of the OEC—for each variant. The OEC statistic is also referred to as a key performance indicator (KPI); in statistics, the OEC is often called the response or dependent variable.

The difference between the OEC statistic for the treatment and control groups is the treatment effect. If the experiment was designed and executed properly, the only thing consistently different between the two variants is the planned change between the control and treatment. Consequently, any statistically significant effect is the result of the planned change, establishing causality with high probability.

Common extensions to simple A/B tests include multiple variants along a single axis—for example, A/B/C/D—and multivariable tests that expose users to changes along several axes, such as font choice, size, and color.
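To make these definitions concrete, here is a minimal sketch (an illustration using SciPy, not the Experimentation Platform's code) that takes per-user OEC values for each variant, computes the OEC statistic as a mean, and tests whether the treatment effect is statistically significant:

```python
import numpy as np
from scipy import stats

def analyze_ab_test(control_oec, treatment_oec, alpha=0.05):
    """Compare the mean OEC of treatment vs. control with a two-sample t-test.

    control_oec, treatment_oec: per-user OEC values (for example, clicks per
    user), one entry per randomized user; users are assumed independent.
    """
    control_oec = np.asarray(control_oec, dtype=float)
    treatment_oec = np.asarray(treatment_oec, dtype=float)

    effect = treatment_oec.mean() - control_oec.mean()  # treatment effect

    # Welch's t-test; it does not assume equal variances in the two groups.
    _, p_value = stats.ttest_ind(treatment_oec, control_oec, equal_var=False)

    return {
        "control_mean": control_oec.mean(),
        "treatment_mean": treatment_oec.mean(),
        "treatment_effect": effect,
        "p_value": p_value,
        "significant": p_value < alpha,
    }
```

If the assignment was properly randomized, a small p-value here supports attributing the observed effect to the planned change, which is the causal claim described above.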
LIMITATIONS OF STATISTICAL FORMULAS

Most online experiments randomly assign users to variants based on a user ID stored in a cookie. Thus “users” are statistically independent of one another, and any metric calculated by user can employ standard statistical methods requiring independence—t-tests, analysis of variance (ANOVA), and so on. Standard metrics like clicks per user work well for this analysis, but many important metrics aren’t defined by user but by, for example, page view or session.

One metric used extensively in online experiments is click-through rate (CTR), which is the total number of clicks divided by the total number of impressions. This formula can be used to estimate the average CTR, but estimating the standard deviation is difficult. It’s tempting to treat this as a sequence of independent Bernoulli trials, with each page view either getting a click or not, but this will underestimate the standard deviation, and too many experiments will incorrectly appear as statistically significant. The assumption of independence of page views doesn’t hold because clicks and page views from the same user are correlated. For the same reason, metrics calculated by user-day or by session are also correlated.

To address this problem, we use both bootstrapping and the delta method. Bootstrapping “resamples” the dataset to estimate the standard deviation. It’s computationally intensive, especially for large datasets. The delta method, on the other hand, is computationally simpler because it uses a simple, low-order Taylor series approximation to compute the variance. For large sample sizes it works well because higher-order terms converge to zero.
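As an illustration of both approaches, here is a minimal sketch that estimates the standard error of CTR from per-user click and page-view totals, first with the delta method and then with a user-level bootstrap (the function names and the assumption that per-user totals are already aggregated are ours, not the platform's):

```python
import numpy as np

def delta_method_ctr_se(clicks, pageviews):
    """Standard error of CTR = sum(clicks) / sum(pageviews) via the delta
    method; clicks[i] and pageviews[i] are totals for user i, and users
    (not page views) are treated as independent."""
    clicks = np.asarray(clicks, dtype=float)
    pageviews = np.asarray(pageviews, dtype=float)
    n = len(clicks)
    mu_c, mu_p = clicks.mean(), pageviews.mean()
    var_c, var_p = clicks.var(ddof=1), pageviews.var(ddof=1)
    cov_cp = np.cov(clicks, pageviews)[0, 1]
    # First-order Taylor expansion of the ratio mu_c / mu_p.
    ratio_var = (var_c / mu_p**2
                 - 2.0 * mu_c * cov_cp / mu_p**3
                 + mu_c**2 * var_p / mu_p**4) / n
    return np.sqrt(ratio_var)

def bootstrap_ctr_se(clicks, pageviews, n_boot=1000, seed=0):
    """Standard error of CTR by resampling whole users with replacement."""
    clicks = np.asarray(clicks, dtype=float)
    pageviews = np.asarray(pageviews, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(clicks)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample users, not page views
        estimates[b] = clicks[idx].sum() / pageviews[idx].sum()
    return estimates.std(ddof=1)
```

Resampling whole users, rather than individual page views, preserves the within-user correlation that the naive Bernoulli formula ignores, which is why both estimates are typically larger than the Bernoulli one.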
AVOIDABLE DATA TRAPS

In our work we’ve encountered many common data traps that are avoidable. Some are seemingly obvious; others are obvious in hindsight, but the hindsight was painfully won.

Variant assignment

A basic requirement of variant assignment is that randomization work correctly. We’ve observed subtle examples of variant assignment bias that significantly impacted the results. For example, in one Bing experiment, a small misconfiguration caused internal Microsoft traffic to always be assigned to the control. In another experiment on the MSN homepage, cobranded users, whose page is slightly different, were always shown the control. Even a small 1 percent imbalance between the control and treatment can easily cause a 5 percent delta to the treatment effect. Such users, if not assigned randomly, must be excluded from the analysis.
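For context, a common way to implement cookie-based assignment (a generic sketch, not necessarily how Microsoft's platform does it) hashes the user ID together with an experiment identifier into a bucket and compares the bucket against the target allocation; any population that bypasses this path, such as internal traffic or cobranded pages, is effectively assigned nonrandomly and must be excluded:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   treatment_percent: float = 50.0) -> str:
    """Deterministically map a user ID to 'control' or 'treatment'.

    Hashing user_id together with experiment_id gives each experiment an
    independent randomization; a user keeps the same variant as long as the
    cookie-stored user_id is stable.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10000  # 0..9999
    return "treatment" if bucket < treatment_percent * 100 else "control"
```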
User identification

User IDs are typically stored in browser cookies. In Microsoft’s ecosystem, which has multiple domains (msn.com, microsoft.com, bing.com, live.com), synchronization of identity across domains is complex. Because user assignment to variants is based on user ID, the user’s experience may “flip” to another variant if a user’s cookie changes due to the identity synchronization. In addition, if one of the variants updates the cookie more frequently, there will be a consistent bias.

Cookie-related bugs are extremely hard to debug. In one experiment, the Microsoft support site cached page outputs and didn’t update the cookies properly only for one variant, causing a significant bias. It’s therefore important to check for consistent allocation of the users against the target percentages.

Data loss

Online experiments are vulnerable to data loss at all stages of the pipeline: logging, extract, transform, and load. We recommend monitoring for data loss at both the aggregate and machine level. At best, the data loss is unbiased and you’ll have lost some sample power, but in some cases, the loss can bias the results. In one experiment, users were sent to one website in the control and a different one in the treatment. The website in the control redirected non-US users to their international site, which wasn’t logging to the same system, and they were “lost” to the experiment. Using browser redirects to send users to a variant if they aren’t in the control also introduces subtle biases, including performance differences.

Event tracking

Most online experiments log clicks to the servers using some form of callback, such as JavaScript handlers or URL redirects. Such tracking mechanisms can cause problems. In one experiment, an unexpectedly strong negative impact on CTR surfaced. An investigation revealed a problem with the callback mechanism used for tracking clicks that affected only Internet Explorer 6 (IE6). Another experiment used redirects for tracking ads in the control and JavaScript callbacks for the treatment. Because the loss rate for these is different and because some robots don’t respect JavaScript, there was significant bias in click-through metrics.
Server differences

A tempting experimental design is to run one variant—say, the control—on an existing server fleet and the treatment on another fleet. If there are systematic differences in the servers—different hardware capabilities, a different network, patches, more crashes, different data centers, and so on—any or all of these can severely bias the experiment.

System-wide interference

Because online experiments for the same site typically run in the same physical systems, experiment variants can interfere with one another or with other experiments when running at the same time. For example, if variants use a shared resource—say, a least recently used (LRU) cache—differential use of that resource can have unexpected consequences on the entire system and thus the experiments.

For example, in one experiment, many key performance metrics went down when a small change was made to how results were generated for a page. The cause turned out to be an unexpected cache-hit-rate difference between the control and treatment. Websites cache results for commonly requested pages or searches to improve performance. While both the control and treatment cached results in the same cache, they used different keys for the same request. Because the control received much more traffic than the treatment, it occupied a much larger part of the shared cache. The result was that the treatment had a much lower cache hit rate than the control, resulting in significantly slower performance.

The solution in this case was simple: create a special “dummy” control whose allocation matches the treatment; this equalizes the cache hit rate of the control and treatment. The lesson: look for timing differences in the user experience between variants.

UNAVOIDABLE DATA TRAPS

It’s harder to keep other data traps from affecting your online experiment.

Robots

Robot requests are website requests that don’t come from users’ direct interaction with the website. Robot traffic can have nonmalicious sources—ranging from browser add-ins that prefetch pages linked from the current one, to search-engine crawlers, to a graduate student scraping webpages—as well as malicious sources like ad-click spam. For a large search engine or online portal site, robots can commonly represent 15 to 30 percent of page views.

It’s naïve to hope that robots won’t affect experimental results. Almost all experiments are conducted to optimize the website for human users. Not only are robots not part of the population of interest, adding background noise, they can seriously impact the conclusions.

Robots can easily bias experimental results because some robots have many more events—page views, clicks, searches, and so on—than members of the true user population. This biases the mean and increases the variance of the groups into which they’re randomized. Increased variance means less chance to detect effects. For example, a robot we discovered in one experiment claimed it was IE8 yet “clicked” 600,000 times in a month. A single robot like this significantly skews multiple metrics if not detected and removed.

Some metrics are more sensitive to robot bias than others. Any metric with no upper bound is much more sensitive to outliers than metrics having a bounded value. For example, a metric that computes average number of queries per user can be heavily influenced by a single robot that issues thousands of queries. Metrics such as conversion rate that take only 0 or 1 for each user are least affected by robots or outliers. When business requirements dictate nonrobust metrics, we recommend calculating related robust metrics and investigating directional differences—for example, from robots.

Detecting and removing robots is both critical and challenging. Most robots, especially nonmalicious ones, can be removed using simple techniques. For example, you could remove users with an unlikely large number of queries, who issue a significant number of queries but never click, or who issue queries too rapidly; you could also blacklist known user agents. Some robots will still slip through, so it’s a good practice to use manual or automated methods to look for outliers. Finally, you should use robust statistical methods—nonparametric or rank-based techniques, robust ANOVA, and so on—to determine whether the data are similar to those of a standard analysis using, say, t-tests or ANOVA. A large discrepancy in the results warrants further investigation.

Finally, note that robot detection is an adversarial AI-complete problem: your goal is to identify the nonhuman robots, while adversaries write robots that masquerade as humans—for example, using distributed botnets. Above all, be vigilant.
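As a concrete illustration of the simple techniques above, here is a sketch of a pre-analysis filter (the thresholds, column names, and user-agent list are illustrative assumptions, not Microsoft's actual rules):

```python
import pandas as pd

# Illustrative thresholds; real values would be tuned per site.
MAX_QUERIES = 500                  # unlikely large number of queries
MIN_QUERIES_WITHOUT_CLICK = 100    # many queries but never a click
MIN_SECONDS_BETWEEN_QUERIES = 1.0  # queries issued too rapidly
KNOWN_ROBOT_AGENTS = {"Googlebot", "bingbot", "AhrefsBot"}

def remove_likely_robots(events: pd.DataFrame) -> pd.DataFrame:
    """events: one row per query with columns
    ['user_id', 'timestamp', 'clicks', 'user_agent']; timestamp is datetime64."""
    per_user = events.groupby("user_id").agg(
        queries=("timestamp", "size"),
        clicks=("clicks", "sum"),
        min_gap=("timestamp",
                 lambda t: t.sort_values().diff().dt.total_seconds().min()),
    )
    suspicious = per_user[
        (per_user["queries"] > MAX_QUERIES)
        | ((per_user["queries"] >= MIN_QUERIES_WITHOUT_CLICK)
           & (per_user["clicks"] == 0))
        | (per_user["min_gap"] < MIN_SECONDS_BETWEEN_QUERIES)
    ].index
    by_agent = events.loc[events["user_agent"].isin(KNOWN_ROBOT_AGENTS),
                          "user_id"].unique()
    robots = suspicious.union(pd.Index(by_agent))
    return events[~events["user_id"].isin(robots)]
```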
Configuration drift

In a typical online system, operation is controlled by many different configuration settings distributed across multiple systems. These settings change while the experiment runs. There’s a danger of the control and treatment drifting apart so they no longer match the original experiment’s design. A good example is a bug fix made only to the control.
IDENTIFYING DATA TRAPS USING STATISTICS

We use a three-step process to help validate experimental data:

• logging audits,
• A/A experiments, and
• data validation checks.

We’ve previously written about the importance of conducting logging audits and A/A experiments before doing the first A/B experiment (“Controlled Experiments on the Web: Survey and Practical Guide”). As an experiment is running, you should conduct ongoing and final data validation checks. While there are nonstatistical checks that you should always do—for example, check for time periods with missing data—here we focus on statistical data validation checks.
We recommend defining statistical integrity constraints, criteria that should hold with high probability even if there’s a large treatment effect, and checking them with statistical hypothesis tests. For example, if an experiment assigns 50 percent of users to the control and 50 percent to the treatment, you could use a binomial or chi-squared test to check that the percent in each isn’t significantly different from the expected 50 percent. If the percentages turned out to be, say, 50.1 and 49.9, that could mean almost a 0.5 percent bias in key metrics depending on the reason for the imbalance. If the p-value of the test is small, you should assume some bias could exist and investigate.
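A minimal sketch of such a check (generic SciPy usage with an assumed 50/50 design, not the platform's implementation) applies a chi-squared goodness-of-fit test to the observed user counts per variant:

```python
from scipy import stats

def check_allocation(control_users, treatment_users, expected_split=(0.5, 0.5)):
    """Chi-squared test that observed user counts match the target allocation.

    A small p-value suggests the split deviates from the design and the
    experiment's results shouldn't be trusted until the cause is understood.
    """
    total = control_users + treatment_users
    expected = [total * expected_split[0], total * expected_split[1]]
    return stats.chisquare([control_users, treatment_users], f_exp=expected)

# Example: 500,000 vs. 498,000 users looks close to 50/50, yet the test
# yields p of roughly 0.045, flagging a possible assignment problem.
chi2, p_value = check_allocation(500_000, 498_000)
```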
Data-quality metrics are an important type of statistical integrity constraint. These are metrics for both the control and treatment that the experiment shouldn’t affect—for example, cache hit rates. A significant difference in a data-quality metric calls for further investigation.

We also recommend looking for unexpected changes over time in an experiment’s key statistics: sample sizes, treatment effect, means, and variance. Our analysis of one experiment showed a 2 percent increase of the treatment over the control. However, a graph of the treatment and control means over time revealed an unexpected increase in the treatment effect for a seven-hour period, as Figure 1 shows. We determined that the increase was due to an uncontrolled difference in the websites during this period (editorial error); after we removed the data for this seven-hour period, there was no longer a difference between the treatment and control.

Figure 1. Hourly click-through rate (CTR) for the treatment and control of one experiment revealed an uncontrolled difference in the websites during a seven-hour period.
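A small sketch of this kind of over-time monitoring (the column names and variant labels are assumptions for illustration) aggregates the OEC by hour and flags hours where the treatment effect departs sharply from its typical level:

```python
import pandas as pd

def hourly_effect(events: pd.DataFrame) -> pd.DataFrame:
    """Hourly OEC means per variant plus their difference, for plots like
    Figure 1. events needs columns ['timestamp', 'variant', 'oec'], with
    variant values 'control' and 'treatment'."""
    hourly = (events
              .assign(hour=events["timestamp"].dt.floor("H"))
              .pivot_table(index="hour", columns="variant",
                           values="oec", aggfunc="mean"))
    hourly["effect"] = hourly["treatment"] - hourly["control"]
    return hourly

def flag_anomalous_hours(hourly: pd.DataFrame, z: float = 3.0) -> pd.DataFrame:
    """Return hours whose treatment effect is more than z standard
    deviations from the mean hourly effect."""
    scores = (hourly["effect"] - hourly["effect"].mean()) / hourly["effect"].std(ddof=1)
    return hourly[scores.abs() > z]
```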
Because data issues are the norm, not the exception, in online experiments, we advocate a healthy degree of “data paranoia.” Finding and understanding such issues are key to getting the right results but also require a substantial investment of time and energy. Even the smallest anomalies can lead to new insights. Getting numbers is easy; getting numbers you can trust is hard.

Ron Kohavi is the general manager of Microsoft’s Experimentation Platform, Roger Longbotham is the team’s analytics manager, and Toby Walker works on the Bing Data Mining Team as technical lead for experiment analytics. Contact them at http://exp-platform.com.

Editor: Simon S.Y. Shim, Dept. of Computer Engineering, San Jose State Univ., San Jose, CA; simon.shim@sjsu.edu