Abstract
Complex dynamic search tasks typically involve multi-aspect information needs and repeated interactions with an information retrieval system. Various metrics have been proposed to evaluate dynamic search systems, including the Cube Test, Expected Utility, and Session Discounted Cumulative Gain. While these complex metrics attempt to measure overall system “goodness” based on a combination of dimensions – such as topical relevance, novelty, or user effort – it remains an open question how well each of the competing evaluation dimensions is reflected in the final score. To investigate this, we adapt two meta-analysis frameworks: the Intuitiveness Test and Metric Unanimity. This study is the first to apply these frameworks to the analysis of dynamic search metrics, and the first to study how well the two approaches agree with each other. Our analysis shows that the complex metrics differ markedly in the extent to which they reflect these dimensions, and also demonstrates that the behaviors of the metrics change as a session progresses. Finally, our investigation of the two meta-analysis frameworks demonstrates a high level of agreement between the two approaches. Our findings can help to inform the choice and design of appropriate metrics for the evaluation of dynamic search systems.
Notes
1. Code is available at https://github.com/aalbahem/ir-eval-meta-analysis.
2. Sakai [16] evaluated metrics considering diversity and relevance simultaneously, but the procedure was not detailed.
3. Also known as Intent Recall [16].
4. Due to space limitations, Table 1 only shows results for the TREC DD 2016 runs; 2016 was the second edition of the track and had almost twice as many runs as the final edition.
5. Other combinations and iterations are not reported due to lack of space, but overall trends were consistent with these settings. We also calculated the ranking of metrics based directly on their intuitiveness test relationship (i.e. without taking statistical significance into account); overall trends were again consistent with those presented here.
6. The Metric Unanimity framework differs from the Intuitiveness Test framework in that it has no equivalent concept of an underlying “number of successes”; therefore, a significance test analogous to the sign test used in the Intuitiveness Test framework cannot be carried out (a minimal sketch of that sign-test procedure is given after these notes).
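The sign test mentioned in notes 5 and 6 operates on the counts of run pairs for which exactly one of the two complex metrics being compared agrees with a simple “gold” metric (e.g. precision for topical relevance). The sketch below is a minimal, self-contained illustration of that pairwise comparison and of a two-sided sign test, assuming per-run scores are available as plain dictionaries; the function names, toy scores, and the choice of precision as the gold metric are illustrative assumptions and are not taken from the paper or its released code (see note 1 for the actual implementation).

```python
# Minimal sketch (illustrative only): comparing two complex metrics against a
# simple "gold" metric in the Intuitiveness Test style, plus a two-sided sign test.
from itertools import combinations
from math import comb


def sign_test(wins_a, wins_b):
    """Two-sided sign test over the pairs where exactly one metric 'won'."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return min(p, 1.0)


def intuitiveness(metric_a, metric_b, gold, runs):
    """Count how often each complex metric agrees with the gold metric on run
    pairs where the two complex metrics disagree about which run is better."""
    wins_a = wins_b = 0
    for r1, r2 in combinations(runs, 2):
        da = metric_a[r1] - metric_a[r2]
        db = metric_b[r1] - metric_b[r2]
        dg = gold[r1] - gold[r2]
        if da * db >= 0 or dg == 0:  # keep only disagreement pairs with a gold preference
            continue
        if da * dg > 0:
            wins_a += 1
        else:
            wins_b += 1
    return wins_a, wins_b


# Toy usage with hypothetical scores for three runs.
runs = ["run1", "run2", "run3"]
cube_test = {"run1": 0.42, "run2": 0.35, "run3": 0.50}
sdcg = {"run1": 0.30, "run2": 0.38, "run3": 0.55}
precision = {"run1": 0.60, "run2": 0.45, "run3": 0.70}  # gold: topical relevance

a, b = intuitiveness(cube_test, sdcg, precision, runs)
print(a, b, sign_test(a, b))  # counts of agreements with gold, and the p-value
```

In the full framework, this comparison is repeated for every pair of complex metrics and for each dimension's gold metric; Metric Unanimity has no analogous per-pair success counts, which is why no corresponding sign test is reported for it (note 6).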
References
Albahem, A., Spina, D., Scholer, F., Moffat, A., Cavedon, L.: Desirable properties for diversity and truncated effectiveness metrics. In: Proceedings of Australasian Document Computing Symposium, pp. 9:1–9:7 (2018)
Amigó, E., Gonzalo, J., Verdejo, F.: A general evaluation measure for document organization tasks. In: Proceedings of SIGIR, pp. 643–652 (2013)
Amigó, E., Spina, D., Carrillo-de Albornoz, J.: An axiomatic analysis of diversity evaluation metrics: introducing the rank-biased utility metric. In: Proceedings of SIGIR, pp. 625–634 (2018)
Busin, L., Mizzaro, S.: Axiometrics: an axiomatic approach to information retrieval effectiveness metrics. In: Proceedings of ICTIR, pp. 8:22–8:29 (2013)
Carterette, B., Kanoulas, E., Hall, M., Clough, P.: Overview of the TREC 2014 session track. In: Proceedings of TREC (2014)
Chapelle, O., Metzler, D., Zhang, Y., Grinspan, P.: Expected reciprocal rank for graded relevance. In: Proceedings of CIKM, pp. 621–630 (2009)
Chuklin, A., Zhou, K., Schuth, A., Sietsma, F., de Rijke, M.: Evaluating intuitiveness of vertical-aware click models. In: Proceedings of SIGIR, pp. 1075–1078 (2014)
Clarke, C.L., Craswell, N., Soboroff, I., Ashkan, A.: A comparative analysis of cascade measures for novelty and diversity. In: Proceedings of WSDM, pp. 75–84 (2011)
Clarke, C.L., et al.: Novelty and diversity in information retrieval evaluation. In: Proceedings of SIGIR, pp. 659–666 (2008)
Ferrante, M., Ferro, N., Maistro, M.: Towards a formal framework for utility-oriented measurements of retrieval effectiveness. In: Proceedings of ICTIR, pp. 21–30 (2015)
Jiang, J., He, D., Allan, J.: Comparing in situ and multidimensional relevance judgments. In: Proceedings of SIGIR, pp. 405–414 (2017)
Jin, X., Sloan, M., Wang, J.: Interactive exploratory search for multi page search results. In: Proceedings of WWW, pp. 655–666 (2013)
Kanoulas, E., Azzopardi, L., Yang, G.H.: Overview of the CLEF dynamic search evaluation lab 2018. In: Bellot, P., et al. (eds.) CLEF 2018. LNCS, vol. 11018, pp. 362–371. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-98932-7_31
Luo, J., Wing, C., Yang, H., Hearst, M.: The water filling model and the cube test: multi-dimensional evaluation for professional search. In: Proceedings of CIKM, pp. 709–714 (2013)
Moffat, A.: Seven numeric properties of effectiveness metrics. In: Banchs, R.E., Silvestri, F., Liu, T.-Y., Zhang, M., Gao, S., Lang, J. (eds.) AIRS 2013. LNCS, vol. 8281, pp. 1–12. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-45068-6_1
Sakai, T.: Evaluation with informational and navigational intents. In: Proceedings of WWW, pp. 499–508 (2012)
Sakai, T.: How intuitive are diversified search metrics? Concordance test results for the diversity U-Measures. In: Banchs, R.E., Silvestri, F., Liu, T.-Y., Zhang, M., Gao, S., Lang, J. (eds.) AIRS 2013. LNCS, vol. 8281, pp. 13–24. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-45068-6_2
Smucker, M.D., Clarke, C.L.: Time-based calibration of effectiveness measures. In: Proceedings of SIGIR, pp. 95–104 (2012)
Tang, Z., Yang, G.H.: Investigating per topic upper bound for session search evaluation. In: Proceedings of ICTIR, pp. 185–192 (2017)
Turpin, A., Scholer, F.: User performance versus precision measures for simple web search tasks. In: Proceedings of SIGIR, pp. 11–18 (2006)
Yang, H., Frank, J., Soboroff, I.: TREC 2015 dynamic domain track overview. In: Proceedings of TREC (2015)
Yang, H., Soboroff, I.: TREC 2016 dynamic domain track overview. In: Proceedings of TREC (2016)
Yang, H., Tang, Z., Soboroff, I.: TREC 2017 dynamic domain track overview. In: Proceedings of TREC (2017)
Zhou, K., Lalmas, M., Sakai, T., Cummins, R., Jose, J.M.: On the reliability and intuitiveness of aggregated search metrics. In: Proceedings of CIKM, pp. 689–698 (2013)
Acknowledgement
This research was partially supported by the Australian Research Council (projects LP130100563 and LP150100252) and by Real Thing Entertainment Pty Ltd.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Albahem, A., Spina, D., Scholer, F., Cavedon, L. (2019). Meta-evaluation of Dynamic Search: How Do Metrics Capture Topical Relevance, Diversity and User Effort? In: Azzopardi, L., Stein, B., Fuhr, N., Mayr, P., Hauff, C., Hiemstra, D. (eds) Advances in Information Retrieval. ECIR 2019. Lecture Notes in Computer Science, vol. 11437. Springer, Cham. https://doi.org/10.1007/978-3-030-15712-8_39
DOI: https://doi.org/10.1007/978-3-030-15712-8_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-15711-1
Online ISBN: 978-3-030-15712-8