research-article

How A/B Tests Could Go Wrong: Automatic Diagnosis of Invalid Online Experiments

Authors:

Ya Xu

WSDM '19: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining

Pages 501 - 509

https://doi.org/10.1145/3289600.3291000

Published: 30 January 2019

Abstract

We have seen massive growth in online experiments at Internet companies. Although conceptually simple, A/B tests can easily go wrong in the hands of inexperienced users and on an A/B testing platform with little governance. An invalid A/B test hurts the business by leading to non-optimal decisions. It is therefore more important than ever to create an intelligent A/B platform that democratizes A/B testing and allows everyone to make quality decisions through built-in detection and diagnosis of invalid tests. In this paper, we share how we mined historical A/B tests and identified the most common causes of invalid tests, ranging from biased design and self-selection bias to attempts to generalize A/B test results beyond the experiment population and time frame. We also developed scalable algorithms to automatically detect invalid A/B tests and diagnose the root cause of invalidity. Surfacing invalidity not only improved decision quality but also served as user education and reduced problematic experiment designs in the long run.


Cited By

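Quin F, Weyns D, Baresi L, Ma X, Pasquale L (2024). Automating Pipelines of A/B Tests with Population Split Using Self-Adaptation and Machine Learning. Proceedings of the 19th International Symposium on Software Engineering for Adaptive and Self-Managing Systems (84-97). Online publication date: 15-Apr-2024
https://dl.acm.org/doi/10.1145/3643915.3644087
Xiong H, Bian J, Li Y, Li X, Du M, Wang S, Yin D, Helal S (2024). When Search Engine Services meet Large Language Models: Visions and Challenges. IEEE Transactions on Services Computing (1-23). Online publication date: 2024
https://doi.org/10.1109/TSC.2024.3451185
Quin F, Weyns D, Galster M, Silva C (2024). A/B testing. Journal of Systems and Software 211:C. Online publication date: 2-Jul-2024
https://dl.acm.org/doi/10.1016/j.jss.2024.112011
Le T, Deng A, Frommholz I, Hopfgartner F, Lee M, Oakes M, Lalmas M, Zhang M, Santos R (2023). The Price is Right: Removing A/B Test Bias in a Marketplace of Expirable Goods. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (4681-4687). Online publication date: 21-Oct-2023
https://dl.acm.org/doi/10.1145/3583780.3615502
Kohavi R, Longbotham R (2023). Online Controlled Experiments and A/B Tests. Encyclopedia of Machine Learning and Data Science (1-13). Online publication date: 8-Mar-2023
https://doi.org/10.1007/978-1-4899-7502-7_891-2
Nie K, Zhang Z, Xu B, Yuan T, Al Hasan M, Xiong L (2022). Ensure A/B Test Quality at Scale with Automated Randomization Validation and Sample Ratio Mismatch Detection. Proceedings of the 31st ACM International Conference on Information & Knowledge Management (3391-3399). Online publication date: 17-Oct-2022
https://dl.acm.org/doi/10.1145/3511808.3557087
Sadeghi S, Gupta S, Gramatovici S, Lu J, Ai H, Zhang R (2022). Novelty and Primacy: A Long-Term Estimator for Online Experiments. Technometrics 64:4 (524-534). Online publication date: 8-Nov-2022
https://doi.org/10.1080/00401706.2022.2124309
Ramanathan M, Clapp L, Barik R, Sridharan M, Rothermel G, Bae D (2020). Piranha. Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering in Practice (221-230). Online publication date: 27-Jun-2020
https://dl.acm.org/doi/10.1145/3377813.3381350
Puha Z, Kaptein M, Lemmens A (2020). Batch Mode Active Learning for Individual Treatment Effect Estimation. 2020 International Conference on Data Mining Workshops (ICDMW) (859-866). Online publication date: Nov-2020
https://doi.org/10.1109/ICDMW51313.2020.00123
Kohavi R, Tang D, Xu Y (2020). Trustworthy Online Controlled Experiments. Online publication date: 13-Mar-2020
https://doi.org/10.1017/9781108653985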

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Knowledge representation and reasoning
      1. Causal reasoning and diagnostics
2. Mathematics of computing
  1. Probability and statistics
    1. Probabilistic inference problems

Published In

WSDM '19: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining

January 2019

874 pages

ISBN:9781450359405

DOI:10.1145/3289600

General Chairs: J. Shane Culpepper (RMIT University), Alistair Moffat (The University of Melbourne)

Program Chairs: Paul N. Bennett (Microsoft), Kristina Lerman (University of Southern California)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

WSDM '19

WSDM '19: The Twelfth ACM International Conference on Web Search and Data Mining

February 11 - 15, 2019

Melbourne VIC, Australia

Acceptance Rates

WSDM '19 Paper Acceptance Rate 84 of 511 submissions, 16%

Overall Acceptance Rate 498 of 2,863 submissions, 17%

