Meta-evaluation of Conversational Search Evaluation Metrics

Published: 01 September 2021

Abstract

Conversational search systems, such as Google Assistant and Microsoft Cortana, enable users to interact with search systems over multiple rounds of natural language dialogue. Evaluating such systems is very challenging, given that any natural language response could be generated and that users commonly interact for multiple semantically coherent rounds to accomplish a search task. Although prior studies have proposed many evaluation metrics, the extent to which those measures effectively capture user preference remains to be investigated. In this article, we systematically meta-evaluate a variety of conversational search metrics. We specifically study three perspectives on those metrics: (1) reliability: the ability to detect “actual” performance differences as opposed to those observed by chance; (2) fidelity: the ability to agree with ultimate user preference; and (3) intuitiveness: the ability to capture any property deemed important: adequacy, informativeness, and fluency in the context of conversational search. By conducting experiments on two test collections, we find that the performance of different metrics varies significantly across different scenarios, and that, consistent with prior studies, existing metrics achieve only weak correlation with ultimate user preference and satisfaction. Considering all three perspectives, METEOR is, comparatively speaking, the best existing single-turn metric. We also demonstrate that adapted session-based evaluation metrics can be used to measure multi-turn conversational search, achieving moderate concordance with user satisfaction. To our knowledge, our work establishes the most comprehensive meta-evaluation of conversational search to date.
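To make the first two meta-evaluation perspectives concrete, here is a minimal, hypothetical sketch (not the authors' code, and not their exact protocol): reliability is approximated by a paired randomization test over per-query metric scores, in the spirit of discriminative-power analysis, and fidelity by Kendall's tau between metric scores and user preference labels. All system names, scores, and labels below are invented for illustration.

```python
"""Illustrative meta-evaluation sketch on toy data (hypothetical throughout)."""
import random

from scipy.stats import kendalltau


def paired_randomization_test(scores_a, scores_b, trials=10_000):
    """Two-sided paired randomization test on per-query metric scores.

    Estimates a p-value for the null hypothesis that the two systems perform
    equally, by randomly flipping the sign of each per-query score difference.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = sum(
        abs(sum(d if random.random() < 0.5 else -d for d in diffs)) >= observed
        for _ in range(trials)
    )
    return extreme / trials


def discriminative_power(per_system_scores, alpha=0.05):
    """Reliability: fraction of system pairs judged significantly different."""
    systems = list(per_system_scores)
    pairs = [(a, b) for i, a in enumerate(systems) for b in systems[i + 1:]]
    significant = sum(
        paired_randomization_test(per_system_scores[a], per_system_scores[b]) < alpha
        for a, b in pairs
    )
    return significant / len(pairs)


def fidelity(metric_scores, user_preferences):
    """Fidelity: rank correlation between metric scores and user judgments."""
    tau, _ = kendalltau(metric_scores, user_preferences)
    return tau


if __name__ == "__main__":
    # Hypothetical per-query scores (e.g., from METEOR) for three systems.
    scores = {
        "sysA": [0.31, 0.28, 0.40, 0.25, 0.37],
        "sysB": [0.22, 0.30, 0.28, 0.21, 0.26],
        "sysC": [0.30, 0.27, 0.39, 0.26, 0.35],
    }
    print("discriminative power:", discriminative_power(scores))
    # Hypothetical graded user satisfaction labels for sysA's five responses.
    print("fidelity (Kendall tau):", fidelity(scores["sysA"], [3, 2, 5, 2, 4]))
```

A metric that yields high discriminative power but low tau against user labels would be reliable yet unfaithful, which is precisely the kind of trade-off the three-perspective analysis is designed to expose.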


Published In

ACM Transactions on Information Systems, Volume 39, Issue 4
October 2021
482 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/3477247

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 September 2021
Accepted: 01 December 2020
Revised: 01 December 2020
Received: 01 May 2020
Published in TOIS Volume 39, Issue 4

Author Tags

  1. Conversational search
  2. meta-evaluation
  3. metric
  4. discriminative power

Qualifiers

  • Research-article
  • Refereed
