Validating Synthetic Usage Data in Living Lab Environments

Published: 06 March 2024

Abstract

    Evaluating retrieval performance without editorial relevance judgments is challenging, but user interactions can be used as relevance signals instead. Living labs offer a way for small-scale platforms to validate information retrieval systems with real users. If enough user interaction data are available, click models can be parameterized from historical sessions to evaluate systems before exposing users to experimental rankings. However, interaction data are sparse in living labs, and little is known about how click models can be validated for reliable user simulations when click data are available only in moderate amounts.
    This work introduces an evaluation approach for validating synthetic usage data generated by click models in data-sparse human-in-the-loop environments like living labs. We ground our methodology in the click model’s estimates for a system ranking compared to a reference ranking whose relative performance is known. Our experiments compare different click models and assess their reliability and robustness as more session log data become available. In our setup, simple click models can reliably determine the relative system performance with as few as 20 logged sessions for 50 queries. In contrast, more complex click models require more session data for reliable estimates, but they are the better choice in simulated interleaving experiments when enough session data are available. While it is easier for click models to distinguish between more diverse systems, it is harder to reproduce the ranking of systems that are based on the same retrieval algorithm with different interpolation weights. Our setup is entirely open, and we share the code to reproduce the experiments.
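
    The following minimal sketch illustrates the workflow the abstract describes: a simple position-based click model is parameterized from a handful of logged sessions and then used to estimate how an experimental ranking would perform before real users are exposed to it. This is not the authors’ released code; the function names, the count-based parameter estimates, and the toy numbers are illustrative assumptions (a position-based model would normally be fitted with expectation-maximization).

```python
import random
from collections import defaultdict

def estimate_pbm(sessions, n_ranks):
    """Crudely estimate a position-based model (PBM) from logged sessions.

    Each session is a list of (doc_id, clicked) pairs ordered by rank.
    A real PBM is usually fitted with expectation-maximization; this
    single-pass, count-based estimate only illustrates the idea.
    """
    clicks = defaultdict(int)    # doc_id -> number of clicks
    shows = defaultdict(int)     # doc_id -> number of impressions
    rank_clicks = [0] * n_ranks  # clicks observed at each rank
    for session in sessions:
        for rank, (doc, clicked) in enumerate(session[:n_ranks]):
            shows[doc] += 1
            if clicked:
                clicks[doc] += 1
                rank_clicks[rank] += 1
    top = max(rank_clicks) or 1
    examination = [c / top for c in rank_clicks]   # rough P(examine rank)
    attractiveness = {d: clicks[d] / shows[d] for d in shows}
    return examination, attractiveness

def expected_clicks(ranking, examination, attractiveness, default_attr=0.05):
    """Expected clicks per session for a ranking under the fitted PBM."""
    return sum(
        examination[rank] * attractiveness.get(doc, default_attr)
        for rank, doc in enumerate(ranking[:len(examination)])
    )

def sample_session(ranking, examination, attractiveness, default_attr=0.05):
    """Draw one synthetic session: a click iff examined and attractive."""
    return [
        (doc, random.random() < examination[rank]
              and random.random() < attractiveness.get(doc, default_attr))
        for rank, doc in enumerate(ranking[:len(examination)])
    ]

# Toy usage: 20 logged sessions for one query, then compare a reference
# ranking against an experimental ranking without exposing real users.
logged = [[("d1", True), ("d2", False), ("d3", False)] for _ in range(20)]
exam, attr = estimate_pbm(logged, n_ranks=3)
print(expected_clicks(["d1", "d2", "d3"], exam, attr))  # reference system
print(expected_clicks(["d3", "d2", "d1"], exam, attr))  # experimental system
```

    In the setting described above, such synthetic sessions could also drive the simulated interleaving experiments mentioned in the abstract, where two systems’ rankings are merged and credited by the clicks they attract.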


    Cited By

    • (2024) Context-Driven Interactive Query Simulations Based on Generative Large Language Models. Advances in Information Retrieval, 173–188. DOI: 10.1007/978-3-031-56060-6_12. Online publication date: 24-Mar-2024.


    Published In

    Journal of Data and Information Quality, Volume 16, Issue 1
    March 2024
    187 pages
    ISSN: 1936-1955
    EISSN: 1936-1963
    DOI: 10.1145/3613486

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 06 March 2024
    Online AM: 24 September 2023
    Accepted: 30 August 2023
    Revised: 21 July 2023
    Received: 14 March 2023
    Published in JDIQ Volume 16, Issue 1

    Author Tags

    1. Synthetic usage data
    2. click signals
    3. system evaluation
    4. living labs

    Qualifiers

    • Research-article

    Article Metrics

    • Downloads (Last 12 months): 206
    • Downloads (Last 6 weeks): 30
