DOI: 10.1145/3511808.3557123
Research article | Open access

Debiased Balanced Interleaving at Amazon Search

Published: 17 October 2022

Abstract

Interleaving is an online evaluation technique that has been shown to be orders of magnitude more sensitive than traditional A/B tests. It presents users with a single merged result of the compared rankings and then attributes user actions back to the evaluated rankers. The interleaving methods in the literature each have advantages and limitations with respect to unbiasedness, sensitivity, preservation of user experience, and implementation and computation complexity. We propose a new interleaving method that uses a counterfactual evaluation framework for credit attribution while retaining the simple ranking merge policy of balanced interleaving, and we formally derive an unbiased estimator for comparing rankers with theoretical guarantees. We then confirm the effectiveness of our method with both synthetic and real experiments. We also discuss practical considerations of bringing different interleaving methods from the literature into a large-scale experiment, and show that our method achieves a favorable tradeoff in implementation and computation complexity while preserving statistical power and reliability. We have successfully implemented our method and produced consistent conclusions at the scale of billions of search queries. We report 10 online experiments that apply our method to e-commerce search and observe a 60x sensitivity gain over A/B tests. We also find high correlations between our proposed estimator and the corresponding A/B metrics, which helps interpret interleaving results on the scale of A/B measurements.
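The merge policy the abstract refers to is the classic balanced-interleaving construction: the two rankings are merged by letting the ranker that has so far been consumed less supply its next document, with a per-query coin flip deciding which ranker leads and already-shown documents skipped. The Python sketch below illustrates only that merge step under these assumptions; it does not reproduce the paper's debiased credit-attribution estimator, and all names are illustrative.

import random

def balanced_interleave(ranking_a, ranking_b, a_first=None, max_len=10):
    # Classic balanced-interleaving merge of two rankings (illustrative sketch).
    # The ranker whose list has been consumed less supplies its next document;
    # ties are broken by a per-query coin flip (a_first). Documents already
    # shown are skipped, so the interleaved top-k roughly contains the top-k
    # of both input rankings.
    if a_first is None:
        a_first = random.random() < 0.5  # randomize which ranker leads this query
    interleaved, seen = [], set()
    ka = kb = 0  # how many documents of A and B have been consumed so far
    while ka < len(ranking_a) and kb < len(ranking_b) and len(interleaved) < max_len:
        if ka < kb or (ka == kb and a_first):
            doc, ka = ranking_a[ka], ka + 1
        else:
            doc, kb = ranking_b[kb], kb + 1
        if doc not in seen:
            seen.add(doc)
            interleaved.append(doc)
    return interleaved, a_first

# Example: two rankers ordering the same candidates differently.
merged, a_led = balanced_interleave(["d1", "d2", "d3", "d4"], ["d2", "d4", "d1", "d3"])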

Supplementary Material

MP4 File (CIKM22-app179.mp4)
Interleaving is an online evaluation technique that has been shown to be orders of magnitude more sensitive than traditional A/B tests. In this paper we propose a new interleaving method that uses a counterfactual evaluation framework for credit attribution while retaining the simple ranking merge policy of balanced interleaving, and we formally derive an unbiased estimator for comparing rankers with theoretical guarantees. We discuss practical considerations of bringing different interleaving methods from the literature into a large-scale experiment, and show that our method achieves a favorable tradeoff in implementation and computation complexity while preserving statistical power and reliability. We have successfully implemented our method at the scale of billions of search queries. We report 10 online experiments that apply our method to e-commerce search, where we observe a 60x sensitivity gain over A/B tests as well as high correlations between our estimator and A/B metrics.
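The credit-attribution side described above relies on a counterfactual evaluation framework. The paper derives its own unbiased estimator with theoretical guarantees, which is not reproduced here; as a loose, generic illustration of the underlying inverse-propensity idea, each logged click can be reweighted by the probability that the interleaving policy's randomization produced the impression under which it was observed. The log format and names below are hypothetical.

def ips_preference_estimate(logs):
    # Generic inverse-propensity-scoring (IPS) sketch; NOT the paper's estimator.
    # logs: iterable of (clicked, credited_to_a, credited_to_b, propensity)
    # tuples, one per displayed document impression. propensity is the
    # probability, under the interleaving policy's randomization (e.g. the
    # coin flip deciding which ranker leads the merge), of observing this
    # impression as it was shown.
    delta, n = 0.0, 0
    for clicked, credited_to_a, credited_to_b, propensity in logs:
        n += 1
        if not clicked:
            continue
        if credited_to_a:
            delta += 1.0 / propensity  # upweight clicks from rarely-shown configurations
        if credited_to_b:
            delta -= 1.0 / propensity
    # Positive values favor ranker A, negative values favor ranker B.
    return delta / max(n, 1)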

Cited By

  • (2023) RankFormer: Listwise Learning-to-Rank Using Listwide Labels. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 3762-3773. https://doi.org/10.1145/3580305.3599892. Online publication date: 6-Aug-2023.
  • (2023) Interleaved Online Testing in Large-Scale Systems. Companion Proceedings of the ACM Web Conference 2023, 921-926. https://doi.org/10.1145/3543873.3587572. Online publication date: 30-Apr-2023.
  • (2023) How Well do Offline Metrics Predict Online Performance of Product Ranking Models? Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 3415-3420. https://doi.org/10.1145/3539618.3591865. Online publication date: 19-Jul-2023.

Published In

CIKM '22: Proceedings of the 31st ACM International Conference on Information & Knowledge Management
October 2022
5274 pages
ISBN: 9781450392365
DOI: 10.1145/3511808
General Chairs: Mohammad Al Hasan, Li Xiong
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 October 2022

Author Tags

  1. A/B testing
  2. bias correction
  3. e-commerce search
  4. interleaved evaluation
  5. online evaluation

Qualifiers

  • Research-article

Conference

CIKM '22

Acceptance Rates

CIKM '22 paper acceptance rate: 621 of 2,257 submissions (28%)
Overall acceptance rate: 1,861 of 8,427 submissions (22%)

Article Metrics

  • Downloads (last 12 months): 220
  • Downloads (last 6 weeks): 16
Reflects downloads up to 10 Oct 2024
