
Online and Offline Evaluation in Search Clarification

Published: 04 November 2024

Abstract

The effectiveness of clarification question models in engaging users within search systems is currently constrained, casting doubt on their overall usefulness. To improve the performance of these models, it is crucial to employ assessment approaches that encompass both real-time feedback from users (online evaluation) and the characteristics of clarification questions as judged by human assessors (offline evaluation). However, the relationship between online and offline evaluations has long been debated in information retrieval. This study investigates whether this discordance also holds in search clarification. We use user engagement as ground truth and employ several offline labels to examine to what extent offline ranked lists of clarifications resemble the ideal ranked lists derived from online user engagement. Contrary to the current understanding that offline evaluations fall short of supporting online evaluations, we show that when identifying the most engaging clarification questions from the user’s perspective, online and offline evaluations correspond with each other. We show that query length does not influence the relationship between online and offline evaluations, and that reducing uncertainty in online evaluation strengthens this relationship. We illustrate that an engaging clarification needs to excel from multiple perspectives, and that SERP quality and the characteristics of the clarification are equally important. We also investigate whether human labels can enhance the performance of Large Language Models (LLMs) and Learning-to-Rank (LTR) models in identifying the most engaging clarification questions from the user’s perspective by incorporating offline evaluations as input features. Our results indicate that LTR models do not perform better than individual offline labels. However, GPT, an LLM, emerges as the standout performer, surpassing all LTR models and offline labels.
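To make the comparison described above concrete, the following is a minimal sketch (not the authors' code; the clarification questions, engagement scores, and labels are purely hypothetical) of how a ranking of clarification questions induced by an offline label can be checked against the ideal ranking induced by online user engagement via rank correlation, assuming SciPy is available.

# Minimal sketch: rank correlation between an offline-label ranking and the
# ranking induced by online user engagement for one query's clarifications.
# All data below is hypothetical and for illustration only.
from scipy.stats import kendalltau, spearmanr

clarifications    = ["cq1", "cq2", "cq3", "cq4", "cq5"]
online_engagement = [0.42, 0.10, 0.33, 0.05, 0.27]   # e.g., engagement observed in logs
offline_label     = [4,    2,    5,    1,    3]       # e.g., a human-assessed quality label

# Values near 1 mean the offline label orders the clarifications
# similarly to online engagement; values near -1 mean the opposite order.
tau, tau_p = kendalltau(online_engagement, offline_label)
rho, rho_p = spearmanr(online_engagement, offline_label)
print(f"Kendall tau = {tau:.3f} (p = {tau_p:.3f})")
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")

In this toy example both coefficients are positive but not perfect, which is the kind of partial agreement the abstract refers to when contrasting offline ranked lists with the engagement-based ideal.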


Published In

ACM Transactions on Information Systems, Volume 43, Issue 1
January 2025
87 pages
EISSN: 1558-2868
DOI: 10.1145/3702036
Editor: Min Zhang

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 November 2024
Online AM: 25 July 2024
Accepted: 16 July 2024
Revised: 11 July 2024
Received: 14 March 2024
Published in TOIS Volume 43, Issue 1


Author Tags

  1. search clarification
  2. online evaluation
  3. offline evaluation
  4. large language model

Qualifiers

  • Research-article

Funding Sources

  • Australian Research Council
  • Office of Naval Research contract
  • NSF grant number
