Meta-evaluation of Conversational Search Evaluation Metrics

Published: 01 September 2021

Abstract

Conversational search systems, such as Google Assistant and Microsoft Cortana, enable users to interact with search systems over multiple rounds of natural language dialogue. Evaluating such systems is very challenging, given that any natural language response could be generated and that users commonly interact for multiple semantically coherent rounds to accomplish a search task. Although prior studies have proposed many evaluation metrics, the extent to which those measures effectively capture user preference remains to be investigated. In this article, we systematically meta-evaluate a variety of conversational search metrics. We specifically study three perspectives on those metrics: (1) reliability: the ability to detect “actual” performance differences as opposed to those observed by chance; (2) fidelity: the ability to agree with ultimate user preference; and (3) intuitiveness: the ability to capture any property deemed important: adequacy, informativeness, and fluency in the context of conversational search. By conducting experiments on two test collections, we find that the performance of different metrics varies significantly across different scenarios, and that, consistent with prior studies, existing metrics achieve only weak correlation with ultimate user preference and satisfaction. Considering all three perspectives, METEOR is, comparatively speaking, the best existing single-turn metric. We also demonstrate that adapted session-based evaluation metrics can be used to measure multi-turn conversational search, achieving moderate concordance with user satisfaction. To our knowledge, our work establishes the most comprehensive meta-evaluation of conversational search to date.
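To make the first two meta-evaluation perspectives concrete, here is a minimal, hypothetical sketch (not the authors' code, and not their exact protocol): reliability is approximated by a paired randomization test over per-query metric scores, in the spirit of discriminative-power analysis, and fidelity by Kendall's tau between metric scores and user preference labels. All system names, scores, and labels below are invented for illustration.

```python
"""Illustrative meta-evaluation sketch on toy data (hypothetical throughout)."""
import random

from scipy.stats import kendalltau


def paired_randomization_test(scores_a, scores_b, trials=10_000):
    """Two-sided paired randomization test on per-query metric scores.

    Estimates a p-value for the null hypothesis that the two systems perform
    equally, by randomly flipping the sign of each per-query score difference.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = sum(
        abs(sum(d if random.random() < 0.5 else -d for d in diffs)) >= observed
        for _ in range(trials)
    )
    return extreme / trials


def discriminative_power(per_system_scores, alpha=0.05):
    """Reliability: fraction of system pairs judged significantly different."""
    systems = list(per_system_scores)
    pairs = [(a, b) for i, a in enumerate(systems) for b in systems[i + 1:]]
    significant = sum(
        paired_randomization_test(per_system_scores[a], per_system_scores[b]) < alpha
        for a, b in pairs
    )
    return significant / len(pairs)


def fidelity(metric_scores, user_preferences):
    """Fidelity: rank correlation between metric scores and user judgments."""
    tau, _ = kendalltau(metric_scores, user_preferences)
    return tau


if __name__ == "__main__":
    # Hypothetical per-query scores (e.g., from METEOR) for three systems.
    scores = {
        "sysA": [0.31, 0.28, 0.40, 0.25, 0.37],
        "sysB": [0.22, 0.30, 0.28, 0.21, 0.26],
        "sysC": [0.30, 0.27, 0.39, 0.26, 0.35],
    }
    print("discriminative power:", discriminative_power(scores))
    # Hypothetical graded user satisfaction labels for sysA's five responses.
    print("fidelity (Kendall tau):", fidelity(scores["sysA"], [3, 2, 5, 2, 4]))
```

A metric that yields high discriminative power but low tau against user labels would be reliable yet unfaithful, which is precisely the kind of trade-off the three-perspective analysis is designed to expose.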


Published In

ACM Transactions on Information Systems, Volume 39, Issue 4
October 2021
482 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/3477247

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 September 2021
Accepted: 01 December 2020
Revised: 01 December 2020
Received: 01 May 2020
Published in TOIS Volume 39, Issue 4

Author Tags

  1. Conversational search
  2. meta-evaluation
  3. metric
  4. discriminative power

Qualifiers

  • Research-article
  • Refereed
