DOI: 10.1145/2806416.2806496
Research article

Practical Aspects of Sensitivity in Online Experimentation with User Engagement Metrics

Published: 17 October 2015

Abstract

Online controlled experiments (e.g., A/B tests) are the state-of-the-art approach used by modern Internet companies to improve their services through data-driven decisions. The most challenging problem is to define an appropriate online metric of user behavior, the so-called Overall Evaluation Criterion (OEC), which is both interpretable and sensitive. A typical OEC consists of a key metric and an evaluation statistic, and the sensitivity of an OEC to the treatment effect of an A/B test is measured by a statistical significance test. We introduce the notion of an Overall Acceptance Criterion (OAC), which comprises both components of an OEC together with a statistical significance test. While existing studies of A/B tests concentrate mostly on the first component of an OAC, its key metric, we study the latter two in depth, comparing several statistics and several significance tests with respect to user engagement metrics over hundreds of A/B experiments run on real users of Yandex. We find that applying the state-of-the-art Student's t-test to several major user engagement metrics can underestimate the false-positive rate by an order of magnitude. We investigate both well-known and novel techniques to overcome this issue in practical settings. Finally, we propose the entropy and the quantiles as novel OECs that reflect the diversity and extreme cases of user engagement.
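The OAC described above combines three parts: a key metric, an evaluation statistic, and a statistical significance test. As a minimal illustrative sketch of how those parts fit together, the snippet below computes three candidate evaluation statistics (mean, 90th percentile, and entropy, the latter two being the OECs the abstract proposes) and two candidate significance tests on simulated per-user engagement data. All distributions, parameters, and names here are invented for illustration and are not taken from the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-user engagement metric (e.g., sessions per user over the
# experiment period), drawn heavy-tailed as engagement metrics typically are.
control = rng.lognormal(mean=1.0, sigma=1.0, size=5000)
treatment = rng.lognormal(mean=1.02, sigma=1.0, size=5000)

def shannon_entropy(x, bins=20):
    """Entropy of the empirical distribution of a metric: an evaluation
    statistic reflecting the diversity of user engagement."""
    counts, _ = np.histogram(x, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

# Candidate evaluation statistics applied to the key metric.
for name, f in [("mean", np.mean),
                ("q90", lambda x: np.quantile(x, 0.9)),
                ("entropy", shannon_entropy)]:
    print(f"{name}: control={f(control):.3f} treatment={f(treatment):.3f}")

# Candidate significance tests: Welch's t-test vs. the Mann-Whitney U test.
t_p = stats.ttest_ind(treatment, control, equal_var=False).pvalue
u_p = stats.mannwhitneyu(treatment, control).pvalue
print(f"t-test p={t_p:.4f}, Mann-Whitney p={u_p:.4f}")
```

The paper's warning is precisely that the choice in the last step matters: for some engagement metrics the t-test's nominal significance level does not match its empirical false-positive rate, so the test itself must be validated (e.g., on A/A experiments) alongside the metric and statistic.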



Published In

CIKM '15: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management
October 2015
1998 pages
ISBN:9781450337946
DOI:10.1145/2806416

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. a/b test
  2. evaluation statistic
  3. online controlled experiment
  4. overall acceptance criterion
  5. p-value
  6. quality metrics
  7. sensitivity
  8. significance level
  9. user engagement

Qualifiers

  • Research-article

Conference

CIKM'15
Acceptance Rates

CIKM '15 paper acceptance rate: 165 of 646 submissions (26%)
Overall acceptance rate: 1,861 of 8,427 submissions (22%)

Article Metrics

  • Downloads (last 12 months): 48
  • Downloads (last 6 weeks): 8

Reflects downloads up to 23 Dec 2024


Cited By

  • (2024) A/B testing. Journal of Systems and Software, 211(C). DOI: 10.1016/j.jss.2024.112011. Online: 2 Jul 2024.
  • (2023) Statistical Challenges in Online Controlled Experiments: A Review of A/B Testing Methodology. The American Statistician, 78(2), 135-149. DOI: 10.1080/00031305.2023.2257237. Online: 18 Oct 2023.
  • (2022) Using Survival Models to Estimate User Engagement in Online Experiments. Proceedings of the ACM Web Conference 2022, 3186-3195. DOI: 10.1145/3485447.3512038. Online: 25 Apr 2022.
  • (2020) Dealing with Ratio Metrics in A/B Testing at the Presence of Intra-user Correlation and Segments. Web Information Systems Engineering - WISE 2020, 563-577. DOI: 10.1007/978-3-030-62008-0_39. Online: 20 Oct 2020.
  • (2019) Effective Online Evaluation for Web Search. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 1399-1400. DOI: 10.1145/3331184.3331378. Online: 18 Jul 2019.
  • (2018) Consistent Transformation of Ratio Metrics for Efficient Online Controlled Experiments. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 55-63. DOI: 10.1145/3159652.3159699. Online: 2 Feb 2018.
  • (2018) Das Experiment gestern und heute, oder: die normative Kraft des Faktischen [The experiment then and now, or: the normative power of the factual]. Qualität und Data Science in der Marktforschung, 217-241. DOI: 10.1007/978-3-658-19660-8_14. Online: 28 Apr 2018.
  • (2017) Beyond Success Rate. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 757-765. DOI: 10.1145/3132847.3132850. Online: 6 Nov 2017.
  • (2017) Characterizing and Predicting Supply-side Engagement on Video Sharing Platforms Using a Hawkes Process Model. Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, 159-166. DOI: 10.1145/3121050.3121077. Online: 1 Oct 2017.
  • (2017) Using the Delay in a Treatment Effect to Improve Sensitivity and Preserve Directionality of Engagement Metrics in A/B Experiments. Proceedings of the 26th International Conference on World Wide Web, 1301-1310. DOI: 10.1145/3038912.3052664. Online: 3 Apr 2017.
