DOI: 10.1145/3511808.3557123
Research article | Open access

Debiased Balanced Interleaving at Amazon Search

Published: 17 October 2022

Abstract

Interleaving is an online evaluation technique that has been shown to be orders of magnitude more sensitive than traditional A/B tests. It presents users with a single merged result of the compared rankings and then attributes user actions back to the evaluated rankers. The interleaving methods in the literature each have advantages and limitations with respect to unbiasedness, sensitivity, preservation of user experience, and implementation and computation complexity. We propose a new interleaving method that uses a counterfactual evaluation framework for credit attribution while retaining the simple ranking merge policy of balanced interleaving, and we formally derive an unbiased estimator for comparing rankers with theoretical guarantees. We then confirm the effectiveness of our method with both synthetic and real experiments. We also discuss practical considerations of bringing different interleaving methods from the literature into a large-scale experiment, and show that our method achieves a favorable tradeoff in implementation and computation complexity while preserving statistical power and reliability. We have successfully implemented our method and produced consistent conclusions at the scale of billions of search queries. We report 10 online experiments that apply our method to e-commerce search and observe a 60x sensitivity gain over A/B tests. We also find high correlations between our proposed estimator and the corresponding A/B metrics, which helps interpret interleaving results on the scale of A/B measurements.
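The merge policy the abstract refers to is the classic balanced-interleaving construction: the two rankings are merged by letting the ranker that has so far been consumed less supply its next document, with a per-query coin flip deciding which ranker leads and already-shown documents skipped. The Python sketch below illustrates only that merge step under these assumptions; it does not reproduce the paper's debiased credit-attribution estimator, and all names are illustrative.

import random

def balanced_interleave(ranking_a, ranking_b, a_first=None, max_len=10):
    # Classic balanced-interleaving merge of two rankings (illustrative sketch).
    # The ranker whose list has been consumed less supplies its next document;
    # ties are broken by a per-query coin flip (a_first). Documents already
    # shown are skipped, so the interleaved top-k roughly contains the top-k
    # of both input rankings.
    if a_first is None:
        a_first = random.random() < 0.5  # randomize which ranker leads this query
    interleaved, seen = [], set()
    ka = kb = 0  # how many documents of A and B have been consumed so far
    while ka < len(ranking_a) and kb < len(ranking_b) and len(interleaved) < max_len:
        if ka < kb or (ka == kb and a_first):
            doc, ka = ranking_a[ka], ka + 1
        else:
            doc, kb = ranking_b[kb], kb + 1
        if doc not in seen:
            seen.add(doc)
            interleaved.append(doc)
    return interleaved, a_first

# Example: two rankers ordering the same candidates differently.
merged, a_led = balanced_interleave(["d1", "d2", "d3", "d4"], ["d2", "d4", "d1", "d3"])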

Supplementary Material

MP4 File (CIKM22-app179.mp4)
Interleaving is an online evaluation technique that has been shown to be orders of magnitude more sensitive than traditional A/B tests. In this paper we propose a new interleaving method that uses a counterfactual evaluation framework for credit attribution while retaining the simple ranking merge policy of balanced interleaving, and we formally derive an unbiased estimator for comparing rankers with theoretical guarantees. We discuss practical considerations of bringing different interleaving methods from the literature into a large-scale experiment, and show that our method achieves a favorable tradeoff in implementation and computation complexity while preserving statistical power and reliability. We have successfully implemented our method at the scale of billions of search queries. We report 10 online experiments that apply our method to e-commerce search, where we observe a 60x sensitivity gain over A/B tests as well as high correlations between our estimator and A/B metrics.
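The credit-attribution side described above relies on a counterfactual evaluation framework. The paper derives its own unbiased estimator with theoretical guarantees, which is not reproduced here; as a loose, generic illustration of the underlying inverse-propensity idea, each logged click can be reweighted by the probability that the interleaving policy's randomization produced the impression under which it was observed. The log format and names below are hypothetical.

def ips_preference_estimate(logs):
    # Generic inverse-propensity-scoring (IPS) sketch; NOT the paper's estimator.
    # logs: iterable of (clicked, credited_to_a, credited_to_b, propensity)
    # tuples, one per displayed document impression. propensity is the
    # probability, under the interleaving policy's randomization (e.g. the
    # coin flip deciding which ranker leads the merge), of observing this
    # impression as it was shown.
    delta, n = 0.0, 0
    for clicked, credited_to_a, credited_to_b, propensity in logs:
        n += 1
        if not clicked:
            continue
        if credited_to_a:
            delta += 1.0 / propensity  # upweight clicks from rarely-shown configurations
        if credited_to_b:
            delta -= 1.0 / propensity
    # Positive values favor ranker A, negative values favor ranker B.
    return delta / max(n, 1)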

Cited By

  • (2023) RankFormer: Listwise Learning-to-Rank Using Listwide Labels. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 3762-3773. https://doi.org/10.1145/3580305.3599892. Online publication date: 6-Aug-2023.
  • (2023) Interleaved Online Testing in Large-Scale Systems. Companion Proceedings of the ACM Web Conference 2023, 921-926. https://doi.org/10.1145/3543873.3587572. Online publication date: 30-Apr-2023.
  • (2023) How Well do Offline Metrics Predict Online Performance of Product Ranking Models? Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 3415-3420. https://doi.org/10.1145/3539618.3591865. Online publication date: 19-Jul-2023.

Published In

CIKM '22: Proceedings of the 31st ACM International Conference on Information & Knowledge Management
October 2022
5274 pages
ISBN: 9781450392365
DOI: 10.1145/3511808
General Chairs: Mohammad Al Hasan, Li Xiong
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 October 2022

Author Tags

  1. A/B testing
  2. bias correction
  3. e-commerce search
  4. interleaved evaluation
  5. online evaluation

Qualifiers

  • Research-article

Conference

CIKM '22

Acceptance Rates

CIKM '22 paper acceptance rate: 621 of 2,257 submissions (28%)
Overall acceptance rate: 1,861 of 8,427 submissions (22%)

Article Metrics

  • Downloads (last 12 months): 220
  • Downloads (last 6 weeks): 16
Reflects downloads up to 10 Oct 2024
