Validating Synthetic Usage Data in Living Lab Environments

Published: 06 March 2024

Abstract

    Evaluating retrieval performance without editorial relevance judgments is challenging, but user interactions can be used as relevance signals instead. Living labs offer a way for small-scale platforms to validate information retrieval systems with real users. If enough user interaction data are available, click models can be parameterized from historical sessions to evaluate systems before exposing users to experimental rankings. However, interaction data are sparse in living labs, and little is known about how click models can be validated for reliable user simulations when click data are available only in moderate amounts.
    This work introduces an evaluation approach for validating synthetic usage data generated by click models in data-sparse human-in-the-loop environments like living labs. We ground our methodology in the click model’s estimates for a system ranking compared to a reference ranking whose relative performance is known. Our experiments compare different click models and assess their reliability and robustness as more session log data become available. In our setup, simple click models can reliably determine the relative system performance with as few as 20 logged sessions for 50 queries. In contrast, more complex click models require more session data for reliable estimates, but they are the better choice in simulated interleaving experiments when enough session data are available. While it is easier for click models to distinguish between more diverse systems, it is harder to reproduce the ranking of systems that are based on the same retrieval algorithm with different interpolation weights. Our setup is entirely open, and we share the code to reproduce the experiments.
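
    The following minimal sketch illustrates the workflow the abstract describes: a simple position-based click model is parameterized from a handful of logged sessions and then used to estimate how an experimental ranking would perform before real users are exposed to it. This is not the authors’ released code; the function names, the count-based parameter estimates, and the toy numbers are illustrative assumptions (a position-based model would normally be fitted with expectation-maximization).

```python
import random
from collections import defaultdict

def estimate_pbm(sessions, n_ranks):
    """Crudely estimate a position-based model (PBM) from logged sessions.

    Each session is a list of (doc_id, clicked) pairs ordered by rank.
    A real PBM is usually fitted with expectation-maximization; this
    single-pass, count-based estimate only illustrates the idea.
    """
    clicks = defaultdict(int)    # doc_id -> number of clicks
    shows = defaultdict(int)     # doc_id -> number of impressions
    rank_clicks = [0] * n_ranks  # clicks observed at each rank
    for session in sessions:
        for rank, (doc, clicked) in enumerate(session[:n_ranks]):
            shows[doc] += 1
            if clicked:
                clicks[doc] += 1
                rank_clicks[rank] += 1
    top = max(rank_clicks) or 1
    examination = [c / top for c in rank_clicks]   # rough P(examine rank)
    attractiveness = {d: clicks[d] / shows[d] for d in shows}
    return examination, attractiveness

def expected_clicks(ranking, examination, attractiveness, default_attr=0.05):
    """Expected clicks per session for a ranking under the fitted PBM."""
    return sum(
        examination[rank] * attractiveness.get(doc, default_attr)
        for rank, doc in enumerate(ranking[:len(examination)])
    )

def sample_session(ranking, examination, attractiveness, default_attr=0.05):
    """Draw one synthetic session: a click iff examined and attractive."""
    return [
        (doc, random.random() < examination[rank]
              and random.random() < attractiveness.get(doc, default_attr))
        for rank, doc in enumerate(ranking[:len(examination)])
    ]

# Toy usage: 20 logged sessions for one query, then compare a reference
# ranking against an experimental ranking without exposing real users.
logged = [[("d1", True), ("d2", False), ("d3", False)] for _ in range(20)]
exam, attr = estimate_pbm(logged, n_ranks=3)
print(expected_clicks(["d1", "d2", "d3"], exam, attr))  # reference system
print(expected_clicks(["d3", "d2", "d1"], exam, attr))  # experimental system
```

    In the setting described above, such synthetic sessions could also drive the simulated interleaving experiments mentioned in the abstract, where two systems’ rankings are merged and credited by the clicks they attract.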


    Cited By

    • (2024) Context-Driven Interactive Query Simulations Based on Generative Large Language Models. Advances in Information Retrieval, 173–188. DOI: 10.1007/978-3-031-56060-6_12. Online publication date: 24-Mar-2024.


    Published In

    Journal of Data and Information Quality, Volume 16, Issue 1
    March 2024
    187 pages
    ISSN: 1936-1955
    EISSN: 1936-1963
    DOI: 10.1145/3613486

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 06 March 2024
    Online AM: 24 September 2023
    Accepted: 30 August 2023
    Revised: 21 July 2023
    Received: 14 March 2023
    Published in JDIQ Volume 16, Issue 1

    Author Tags

    1. Synthetic usage data
    2. click signals
    3. system evaluation
    4. living labs

    Qualifiers

    • Research-article

    Article Metrics

    • Downloads (Last 12 months): 206
    • Downloads (Last 6 weeks): 30
