research-article

Toward Predicting the Outcome of an A/B Experiment for Search Relevance

Authors:

Imed Zitouni

WSDM '15: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining

Pages 37 - 46

https://doi.org/10.1145/2684822.2685311

Published: 02 February 2015

Abstract

A standard approach to estimating online click-based metrics of a ranking function is to run it in a controlled experiment on live users. While reliable and popular in practice, online experiments are cumbersome and time-intensive to configure and run. In this work, inspired by recent successes of offline evaluation techniques for recommender systems, we study an alternative that uses historical search logs to reliably predict online click-based metrics of a new ranking function, without actually running it on live users. To tackle novel challenges encountered in Web search, we propose two variations of the basic techniques. The first takes advantage of the diversified behavior of a search engine over a long period of time to simulate randomized data collection, so that our approach can be used at very low cost. The second replaces exact matching (of recommended items in previous work) with fuzzy matching (of search result pages) to increase data efficiency, via a better trade-off of bias and variance. Extensive experimental results based on large-scale real search data from a major commercial search engine in the US market demonstrate that our approach is promising and has potential for wide use in Web search.
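
As a rough illustration of the two ideas in the abstract (treating natural variation in historical logs as if it were randomized data collection, and matching result pages fuzzily rather than exactly), the following minimal Python sketch may help. It is not the authors' estimator, which the abstract does not specify. The record layout `(query, result_page, reward)`, the names `page_key` and `predict_metric`, the top-k prefix rule used for fuzzy matching, and the use of empirical display frequencies as propensities are all assumptions made for illustration.

```python
from collections import Counter


def page_key(page, k=None):
    """Matching key for a result page (hypothetical rule).

    k=None compares the whole ranked list (exact matching);
    an integer k compares only the top-k results (fuzzy matching).
    """
    return tuple(page) if k is None else tuple(page[:k])


def predict_metric(log, new_ranker, k=None):
    """Estimate the average reward of new_ranker from logged records.

    log is a list of (query, result_page, reward) tuples. The historical
    frequency with which a page key was shown for a query is used as an
    assumed stand-in for the randomization probability.
    """
    if not log:
        raise ValueError("log must contain at least one record")

    query_counts = Counter(q for q, _, _ in log)
    shown_counts = Counter((q, page_key(p, k)) for q, p, _ in log)

    total = 0.0
    for query, page, reward in log:
        key = page_key(page, k)
        # Reuse a record only if the logged page matches what the new ranker would show.
        if key == page_key(new_ranker(query), k):
            propensity = shown_counts[(query, key)] / query_counts[query]
            total += reward / propensity
    return total / len(log)


# Toy data (made up): four impressions of one query, reward 1.0 means a click.
toy_log = [
    ("q1", ["a", "b", "c"], 1.0),
    ("q1", ["a", "b", "d"], 0.0),
    ("q1", ["b", "a", "c"], 1.0),
    ("q1", ["a", "b", "c"], 0.0),
]


def toy_ranker(query):
    """Hypothetical new ranking function whose exact page never appears in the log."""
    return ["a", "b", "e"]


exact_estimate = predict_metric(toy_log, toy_ranker)        # no logged page matches exactly
fuzzy_estimate = predict_metric(toy_log, toy_ranker, k=2)   # three records share the top-2 prefix
```

With `k=None` the whole page must match, so few logged records can be reused and the estimate is noisy or, as in the toy data, uninformative. A small `k` reuses more records at the price of some bias, which is one way to picture the bias and variance trade-off the abstract mentions.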

Index Terms

1. Information systems
  1. Information systems applications

Published In

WSDM '15: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining

February 2015

482 pages

ISBN: 9781450333177

DOI: 10.1145/2684822

General Chairs: Xueqi Cheng (ICT, Chinese Academy of Sciences, China), Hang Li (Huawei Technologies, China)

Program Chairs: Evgeniy Gabrilovich (Google, USA), Jie Tang (Tsinghua University, China)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

WSDM 2015

WSDM 2015: Eighth ACM International Conference on Web Search and Data Mining

February 2 - 6, 2015

Shanghai, China

Acceptance Rates

WSDM '15 Paper Acceptance Rate 39 of 238 submissions, 16%;

Overall Acceptance Rate 498 of 2,863 submissions, 17%

Cited By

Quin F, Weyns D, Baresi L, Ma X, Pasquale L (2024). Automating Pipelines of A/B Tests with Population Split Using Self-Adaptation and Machine Learning. Proceedings of the 19th International Symposium on Software Engineering for Adaptive and Self-Managing Systems, pages 84-97. Online publication date: 15-Apr-2024.
https://dl.acm.org/doi/10.1145/3643915.3644087

Quin F, Weyns D, Galster M, Silva C (2024). A/B testing. Journal of Systems and Software, 211:C. Online publication date: 2-Jul-2024.
https://dl.acm.org/doi/10.1016/j.jss.2024.112011

Breuer T, Fuhr N, Schaer P (2023). Validating Synthetic Usage Data in Living Lab Environments. Journal of Data and Information Quality. Online publication date: 24-Sep-2023.
https://dl.acm.org/doi/10.1145/3623640

Chu Z, Wang H, Xiao Y, Long B, Wu L, Chua T, Lauw H, Si L, Terzi E, Tsaparas P (2023). Meta Policy Learning for Cold-Start Conversational Recommendation. Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pages 222-230. Online publication date: 27-Feb-2023.
https://dl.acm.org/doi/10.1145/3539597.3570443

Gao C, Li S, Lei W, Chen J, Li B, Jiang P, He X, Mao J, Chua T, Al Hasan M, Xiong L (2022). KuaiRec. Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 540-550. Online publication date: 17-Oct-2022.
https://dl.acm.org/doi/10.1145/3511808.3557220

Vinay V, Kilaru M, Arbour D, Amigo E, Castells P, Gonzalo J, Carterette B, Culpepper J, Kazai G (2022). Offline Evaluation of Ranked Lists using Parametric Estimation of Propensities. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 622-632. Online publication date: 6-Jul-2022.
https://dl.acm.org/doi/10.1145/3477495.3532032

Vlassis N, Chandrashekar A, Gil F, Kallus N, Ranzato M, Beygelzimer A, Dauphin Y, Liang P, Vaughan J (2021). Control variates for slate off-policy evaluation. Proceedings of the 35th International Conference on Neural Information Processing Systems, pages 3667-3679. Online publication date: 6-Dec-2021.
https://dl.acm.org/doi/10.5555/3540261.3540541

Zhao X, Xia L, Zou L, Liu H, Yin D, Tang J (2021). UserSim: User Simulation via Supervised Generative Adversarial Network. Proceedings of the Web Conference 2021, pages 3582-3589. Online publication date: 19-Apr-2021.
https://dl.acm.org/doi/10.1145/3442381.3450125

Huang J, Oosterhuis H, de Rijke M, van Hoof H (2020). Keeping Dataset Biases out of the Simulation. Proceedings of the 14th ACM Conference on Recommender Systems, pages 190-199. Online publication date: 22-Sep-2020.
https://dl.acm.org/doi/10.1145/3383313.3412252

Cotta R, Hu M, Jiang D, Liao P, Culpepper J, Moffat A, Bennett P, Lerman K (2019). Off-Policy Evaluation of Probabilistic Identity Data in Lookalike Modeling. Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 483-491. Online publication date: 30-Jan-2019.
https://dl.acm.org/doi/10.1145/3289600.3291033
