
Online and Offline Evaluation in Search Clarification

Published: 04 November 2024

Abstract

The effectiveness of clarification question models in engaging users within search systems is currently constrained, casting doubt on their overall usefulness. To improve the performance of these models, it is crucial to employ assessment approaches that encompass both real-time feedback from users (online evaluation) and the characteristics of clarification questions as judged by human assessors (offline evaluation). However, the relationship between online and offline evaluations has long been debated in information retrieval. This study investigates whether this discordance also holds in search clarification. We use user engagement as ground truth and employ several offline labels to examine to what extent offline ranked lists of clarifications resemble the ideal ranked lists derived from online user engagement. Contrary to the current understanding that offline evaluations fall short of supporting online evaluations, we show that when identifying the most engaging clarification questions from the user’s perspective, online and offline evaluations correspond with each other. We show that query length does not influence the relationship between online and offline evaluations, and that reducing uncertainty in online evaluation strengthens this relationship. We illustrate that an engaging clarification needs to excel from multiple perspectives, and that SERP quality and the characteristics of the clarification are equally important. We also investigate whether human labels can enhance the performance of Large Language Models (LLMs) and Learning-to-Rank (LTR) models in identifying the most engaging clarification questions from the user’s perspective by incorporating offline evaluations as input features. Our results indicate that LTR models do not perform better than individual offline labels. However, GPT, an LLM, emerges as the standout performer, surpassing all LTR models and offline labels.
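To make the comparison described above concrete, the following is a minimal sketch (not the authors' code; the clarification questions, engagement scores, and labels are purely hypothetical) of how a ranking of clarification questions induced by an offline label can be checked against the ideal ranking induced by online user engagement via rank correlation, assuming SciPy is available.

# Minimal sketch: rank correlation between an offline-label ranking and the
# ranking induced by online user engagement for one query's clarifications.
# All data below is hypothetical and for illustration only.
from scipy.stats import kendalltau, spearmanr

clarifications    = ["cq1", "cq2", "cq3", "cq4", "cq5"]
online_engagement = [0.42, 0.10, 0.33, 0.05, 0.27]   # e.g., engagement observed in logs
offline_label     = [4,    2,    5,    1,    3]       # e.g., a human-assessed quality label

# Values near 1 mean the offline label orders the clarifications
# similarly to online engagement; values near -1 mean the opposite order.
tau, tau_p = kendalltau(online_engagement, offline_label)
rho, rho_p = spearmanr(online_engagement, offline_label)
print(f"Kendall tau = {tau:.3f} (p = {tau_p:.3f})")
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")

In this toy example both coefficients are positive but not perfect, which is the kind of partial agreement the abstract refers to when contrasting offline ranked lists with the engagement-based ideal.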


Published In

ACM Transactions on Information Systems, Volume 43, Issue 1
January 2025
87 pages
EISSN: 1558-2868
DOI: 10.1145/3702036
Editor: Min Zhang

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 November 2024
Online AM: 25 July 2024
Accepted: 16 July 2024
Revised: 11 July 2024
Received: 14 March 2024
Published in TOIS Volume 43, Issue 1


Author Tags

  1. search clarification
  2. online evaluation
  3. offline evaluation
  4. large language model

Qualifiers

  • Research-article

Funding Sources

  • Australian Research Council
  • Office of Naval Research contract
  • NSF grant number
