
A Survey of Evaluation Metrics Used for NLG Systems

Published: 18 January 2022

Abstract

In the last few years, a large number of automatic evaluation metrics have been proposed for evaluating Natural Language Generation (NLG) systems. The rapid development and adoption of these metrics in a relatively short time has created the need for a survey of them. In this survey, we (i) highlight the challenges in automatically evaluating NLG systems, (ii) propose a coherent taxonomy for organising existing evaluation metrics, (iii) briefly describe different existing metrics, and finally (iv) discuss studies criticising the use of automatic evaluation metrics. We conclude the article by highlighting promising directions for future research.
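As a concrete illustration of what such an automatic evaluation metric computes, the sketch below implements the clipped n-gram precision idea at the core of BLEU, one of the earliest and most widely used reference-based NLG metrics. This is a minimal, single-reference simplification for illustration only, not the full metric; production implementations (e.g., sacreBLEU) add smoothing, multi-reference clipping, and corpus-level aggregation.

```python
from collections import Counter
import math

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a tokenised candidate against one reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    if not cand:
        return 0.0
    # Clip each candidate n-gram count by its count in the reference.
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())

def bleu_like(candidate, reference, max_n=4):
    """Geometric mean of clipped 1..max_n gram precisions, times a brevity penalty."""
    precisions = [ngram_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # real implementations smooth zero counts instead
    # Penalise candidates shorter than the reference.
    brevity = min(1.0, math.exp(1.0 - len(reference) / len(candidate)))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

candidate = "the cat sat on a mat".split()
reference = "the cat sat on the mat".split()
print(f"BLEU-like score: {bleu_like(candidate, reference):.3f}")  # ≈ 0.537
```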



Published In

ACM Computing Surveys, Volume 55, Issue 2
February 2023
803 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/3505209

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 January 2022
Accepted: 01 September 2021
Revised: 01 May 2021
Received: 01 September 2020
Published in CSUR Volume 55, Issue 2


Author Tags

  1. Automatic evaluation metrics
  2. abstractive summarization
  3. image captioning
  4. question answering
  5. question generation
  6. data-to-text generation
  7. correlations

Qualifiers

  • Survey
  • Refereed

Funding Sources

  • Department of Computer Science and Engineering
  • Robert Bosch Center for Data Science and Artificial Intelligence

Article Metrics

  • Downloads (last 12 months): 1,738
  • Downloads (last 6 weeks): 142
Reflects downloads up to 10 Oct 2024


Cited By

  • (2024) A Case Study on Tools and Techniques of Machine Translation of Indian Low Resource Languages. In Empowering Low-Resource Languages With NLP Solutions, 51–85. DOI: 10.4018/979-8-3693-0728-1.ch004. Online publication date: 27-Feb-2024.
  • (2024) CFE2: Counterfactual Editing for Search Result Explanation. In Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval, 145–155. DOI: 10.1145/3664190.3672508. Online publication date: 2-Aug-2024.
  • (2024) MTAS: A Reference-Free Approach for Evaluating Abstractive Summarization Systems. Proceedings of the ACM on Software Engineering 1, FSE, 2561–2583. DOI: 10.1145/3660820. Online publication date: 12-Jul-2024.
  • (2024) Understanding Human-AI Workflows for Generating Personas. In Proceedings of the 2024 ACM Designing Interactive Systems Conference, 757–781. DOI: 10.1145/3643834.3660729. Online publication date: 1-Jul-2024.
  • (2024) BNoteHelper: A Note-based Outline Generation Tool for Structured Learning on Video-sharing Platforms. ACM Transactions on the Web 18, 2, 1–30. DOI: 10.1145/3638775. Online publication date: 12-Mar-2024.
  • (2024) AQuA: Automated Question-Answering in Software Tutorial Videos with Visual Anchors. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1–19. DOI: 10.1145/3613904.3642752. Online publication date: 11-May-2024.
  • (2024) Exploring transformer models for sentiment classification: A comparison of BERT, RoBERTa, ALBERT, DistilBERT, and XLNet. Expert Systems 41, 11. DOI: 10.1111/exsy.13701. Online publication date: 14-Aug-2024.
  • (2024) AgTech: Building Smart Aquaculture Assistant System Integrated IoT and Big Data Analysis. IEEE Transactions on AgriFood Electronics 2, 2, 471–482. DOI: 10.1109/TAFE.2024.3416415. Online publication date: Sep-2024.
  • (2024) Toward Optimal Psychological Functioning in AI-Driven Software Engineering Tasks: The Software Evaluation for Well-Being and Optimal Psychological Functioning in a Context-Aware Environment Assessment Framework. IEEE Software 41, 4, 105–114. DOI: 10.1109/MS.2024.3382364. Online publication date: 1-Apr-2024.
  • (2024) A Tale of Two Methods: Unveiling the Limitations of GAN and the Rise of Bayesian Networks for Synthetic Network Traffic Generation. In 2024 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), 273–286. DOI: 10.1109/EuroSPW61312.2024.00036. Online publication date: 8-Jul-2024.
