Abstract
Neural machine translation (NMT) has recently gained substantial popularity not only in academia but also in industry. For its acceptance in industry, it is important to investigate how NMT performs in comparison to the phrase-based statistical MT (PBSMT) model, which until recently was the dominant MT paradigm. In the present work, we compare the quality of the PBSMT and NMT solutions of KantanMT—a commercial platform for custom MT—that are tailored to accommodate large-scale translation production, where there is a limited amount of time to train an end-to-end system (NMT or PBSMT). In order to satisfy the time requirements of our production line, we restrict the NMT training time to 4 days; training a PBSMT system typically requires no more than one day with the current KantanMT training pipeline. To train the NMT and PBSMT engines for each language pair, we use exactly the same parallel corpora and the same pre- and post-processing steps (when applicable). Our results show that, even with training restricted to 4 days, NMT quality substantially surpasses that of PBSMT. Furthermore, we challenge the reliability of automatic quality evaluation metrics based on n-gram comparison (in particular F-measure, BLEU and TER) for NMT quality evaluation, supporting our hypothesis with both analytical and empirical evidence, and we investigate how suitable these metrics are for comparing the two paradigms.
Notes
F-measure, BLEU and TER are algorithms for quality evaluation of MT systems, typically used to estimate fluency, adequacy and extent of translation errors (cf. Way 2018b for more details).
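To make the n-gram comparison behind these metrics concrete, the following minimal, self-contained Python sketch computes sentence-level BLEU (Papineni et al. 2002): clipped n-gram precisions up to order 4, combined by a geometric mean and scaled by a brevity penalty. It is purely illustrative, not the evaluation code used in this work; the add-one smoothing is a simplification in the spirit of Chen and Cherry (2014).

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """Contiguous n-grams of a token list, with multiplicities."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU; returns a score in [0, 1] (multiply by 100
    for the 0-100 range used in this paper)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = ngrams(hypothesis, n)
        ref_ngrams = ngrams(reference, n)
        # Modified (clipped) precision: each n-gram is matched at most
        # as often as it occurs in the reference.
        overlap = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
        total = sum(hyp_ngrams.values())
        # Add-one smoothing so one empty n-gram order does not
        # zero out the whole geometric mean.
        log_precisions.append(log((overlap + 1) / (total + 1)))
    # Brevity penalty: penalise hypotheses shorter than the reference.
    bp = min(1.0, exp(1 - len(reference) / max(len(hypothesis), 1)))
    return bp * exp(sum(log_precisions) / max_n)
```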
We consider tokenization and cleaning as general preprocessing steps; word segmentation (e.g. Byte Pair Encoding (BPE): Sennrich et al. 2016) is an NMT-specific pre-processing step.
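As background, BPE learns a subword vocabulary by iteratively merging the most frequent pair of adjacent symbols. The toy sketch below follows the reference algorithm published in Sennrich et al. (2016); the vocabulary and frequencies are invented for illustration.

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Frequency of each adjacent symbol pair across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[a, b] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into a single new symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy vocabulary: each word is a sequence of characters plus an
# end-of-word marker '</w>'.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

for _ in range(10):  # the merge count sets the subword vocabulary size
    pairs = get_pair_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best)  # first merges: ('e', 's'), ('es', 't'), ('est', '</w>'), ...
```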
BLEU scores are presented in the range of 0–100.
The author of http://kv-emptypages.blogspot.ie/2016/09/the-google-neural-machine-translation.html argues against the generalizability of the results and the appropriateness of the evaluations performed.
Translations rarely attain a score of 1 unless they are identical to a reference translation; even translations produced by professional translators will not necessarily obtain a BLEU score of 1.
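As a concrete illustration of this point, reusing the sentence_bleu function from the sketch above with invented example sentences:

```python
hyp = 'the dog was sitting on the rug'.split()
ref = 'the dog sat on the rug'.split()
print(sentence_bleu(hyp, hyp))  # 1.0: only an identical pair scores perfectly
print(sentence_bleu(hyp, ref))  # ~0.41: far below 1, despite being an adequate paraphrase
```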
Character-based F-measure was also shown to correlate well with human judgment (Popović 2015).
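A minimal sketch of such a character-based F-measure, in the spirit of chrF (Popović 2015), is shown below; the maximum n-gram order and beta are illustrative defaults, not necessarily those used in the cited work.

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams with multiplicities; spaces are removed first,
    as is common in chrF implementations."""
    chars = text.replace(' ', '')
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Character n-gram F-score: average precision and recall over
    orders 1..max_n, combined with an F-beta that gives recall beta
    times as much importance as precision."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in hyp.items())
        precisions.append(overlap / max(sum(hyp.values()), 1))
        recalls.append(overlap / max(sum(ref.values()), 1))
    p = sum(precisions) / max_n
    r = sum(recalls) / max_n
    return 0.0 if p + r == 0 else (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```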
See e.g. https://deeplearning4j.org/word2vec.html.
An MT engine refers to the package of models (translation, language and recasing models for PBSMT, and an encoder–decoder model for NMT) as well as to the required rules and dictionaries for pre- and post-processing.
KantanMT provides both cloud-based and on-premise solutions.
Typically additional monolingual data is employed to train the language model of a PBSMT engine. While we acknowledge that monolingual data, among other optimisations, may improve a PBSMT engine, within the scope of this work we keep the same-data assumption. That is, we do not employ any other data except for the parallel corpus provided. This requirement is vital when trying to answer the question: “When a user has at their disposal a set of parallel data, which of the two paradigms is preferable to train: PBSMT or NMT?”.
Expressed in terms of increases in BLEU and F-measure and decreases in TER, as well as according to internal human evaluation.
By Chinese, we mean Simplified Mandarin Chinese.
Training, testing and tuning data was normalised prior to building the MT engines.
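The paper does not specify which normalisation operations the pipeline applies; the sketch below shows typical steps (Unicode canonicalisation, quote unification, whitespace collapsing) purely as assumed illustrations.

```python
import re
import unicodedata

def normalise(line: str) -> str:
    """Hypothetical corpus normalisation; every step here is an
    illustrative assumption, not the documented KantanMT pipeline."""
    line = unicodedata.normalize('NFC', line)           # canonical Unicode form
    line = line.replace('\u00a0', ' ')                  # non-breaking space -> space
    line = re.sub(r'[\u201c\u201d\u201e]', '"', line)   # unify double quotes
    line = re.sub(r'[\u2018\u2019\u201a]', "'", line)   # unify single quotes
    line = re.sub(r'\s+', ' ', line).strip()            # collapse whitespace
    return line

# Applied uniformly to training, tuning and test data before engine building.
```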
References
Agarwal A, Lavie A (2008) METEOR, M-BLEU and M-TER: evaluation metrics for high-correlation with human rankings of machine translation output. In: Proceedings of the third workshop on statistical machine translation, Columbus, Ohio, pp 115–118
Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: Proceedings of the 3rd international conference on learning representations (ICLR 2015), San Diego, CA, USA
Bentivogli L, Bisazza A, Cettolo M, Federico M (2016) Neural versus phrase-based machine translation quality: a case study. In: Proceedings of the 2016 conference on empirical methods in natural language processing, Austin, Texas, pp 257–267
Callison-Burch C, Osborne M, Koehn P (2006) Re-evaluating the role of BLEU in machine translation research. In: Proceedings of the eleventh conference of the European chapter of the association for computational linguistics, Trento, Italy, pp 249–256
Castilho S, Moorkens J, Gaspari F, Calixto I, Tinsley J, Way A (2017) Is neural machine translation the new state of the art? Prague Bull Math Linguist 108(1):109–120
Cer D, Manning CD, Jurafsky D (2010) The best lexical metric for phrase-based statistical MT system optimization. In: Human language technologies: the 2010 annual conference of the North American chapter of the association for computational linguistics, Los Angeles, California, pp 555–563
Cettolo M, Niehues J, Stüker S, Bentivogli L, Cattoni R, Federico M (2015) The IWSLT 2015 evaluation campaign. In: Proceedings of the 12th international workshop on spoken language translation, Da Nang, Vietnam, pp 2–14
Chen B, Cherry C (2014) A systematic comparison of smoothing techniques for sentence-level BLEU. In: Proceedings of the ninth workshop on statistical machine translation (WMT@ACL 2014), Baltimore, Maryland, USA, pp 362–367
Chiang D (2005) A hierarchical phrase-based model for statistical machine translation. In: Proceedings of the 43rd annual meeting of the association for computational linguistics (ACL’05), Ann Arbor, Michigan, pp 263–270
Chiang D, DeNeefe S, Chan YS, Ng HT (2008) Decomposability of translation metrics for improved evaluation and efficient algorithms. In: Proceedings of the conference on empirical methods in natural language processing, Honolulu, Hawaii, USA, pp 610–619
Cho K, van Merriënboer B, Gülçehre Ç, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 conference on empirical methods in natural language processing, Doha, Qatar, pp 1724–1734
Chung J, Cho K, Bengio Y (2016) A character-level decoder without explicit segmentation for neural machine translation. In: Proceedings of the 54th annual meeting of the association for computational linguistics, ACL 2016, vol 1, long papers, Berlin, Germany, pp 1693–1703
Costa-Jussà MR, Farrús M, Mariño JB, Fonollosa JAR (2012) Study and comparison of rule-based and statistical Catalan-Spanish machine translation systems. Comput Inform 31(2):245–270
Crego JM, Kim J, Klein G, Rebollo A, Yang K, Senellart J, Akhanov E, Brunelle P, Coquard A, Deng Y, Enoue S, Geiss C, Johanson J, Khalsa A, Khiari R, Ko B, Kobus C, Lorieux J, Martins L, Nguyen D, Priori A, Riccardi T, Segal N, Servan C, Tiquet C, Wang B, Yang J, Zhang D, Zhou J, Zoldan P (2016) SYSTRAN's pure neural machine translation systems. CoRR arXiv:1610.05540
Daems J, Vandepitte S, Hartsuiker RJ, Macken L (2017) Identifying the machine translation error types with the greatest impact on post-editing effort. Front Psychol 8:1282
Dyer C, Chahuneau V, Smith NA (2013) A simple, fast, and effective reparameterization of IBM model 2. In: Proceedings of the 2013 conference of the North American chapter of the association for computational linguistics: human language technologies (NAACL), Atlanta, USA, pp 644–649
Farrús M, Costa-jussà MR, Popović M (2012) Study and correlation analysis of linguistic, perceptual, and automatic machine translation evaluations. J Assoc Inf Sci Technol 63(1):174–184
Fleiss JL (1971) Measuring nominal scale agreement among many raters. Psychol Bull 76(5):378–382
Ha TL, Niehues J, Cho E, Mediani M, Waibel A (2015) The KIT translation systems for IWSLT 2015. In: Proceedings of the 12th international workshop on spoken language translation, Da Nang, Vietnam, pp 62–69
Junczys-Dowmunt M, Dwojak T, Hoang H (2016) Is neural machine translation ready for deployment? A case study on 30 translation directions. In: Proceedings of the 13th international workshop on spoken language translation, Seattle, WA
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. CoRR arXiv:1412.6980
Klein G, Kim Y, Deng Y, Senellart J, Rush AM (2017) OpenNMT: open-source toolkit for neural machine translation. In: Proceedings of the 55th annual meeting of the association for computational linguistics, ACL 2017, System Demonstrations, Vancouver, Canada, pp 67–72
Klubička F, Toral A, Sánchez-Cartagena VM (2017) Fine-grained human evaluation of neural versus phrase-based machine translation. Prague Bull Math Linguist 108(1):121–132
Koehn P (2010) Statistical machine translation, 1st edn. Cambridge University Press, New York, NY
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the association for computational linguistics companion volume proceedings of the demo and poster sessions, Prague, Czech Republic, pp 177–180
Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33(1):159–174
Luong MT, Manning CD (2015) Stanford neural machine translation systems for spoken language domains. In: Proceedings of the 12th international workshop on spoken language translation (IWSLT), Da Nang, Vietnam, pp 76–79
Melamed ID, Green R, Turian JP (2003) Precision and recall of machine translation. In: Human language technology conference of the North American chapter of the association for computational linguistics, HLT-NAACL 2003, Edmonton, Canada, pp 61–63
Och F, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29(1):19–51
Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting on association for computational linguistics, Philadelphia, Pennsylvania, USA, pp 311–318
Popović M (2015) chrF: character n-gram F-score for automatic MT evaluation. In: Proceedings of the tenth workshop on statistical machine translation (WMT@EMNLP 2015), Lisbon, Portugal, pp 392–395
Sennrich R, Haddow B, Birch A (2016) Neural machine translation of rare words with subword units. In: Proceedings of the 54th annual meeting of the association for computational linguistics, ACL 2016, vol 1, Long Papers, Berlin, Germany, pp 1715–1725
Shterionov D, Du J, Palminteri MA, Casanellas L, O’Dowd T, Way A (2016) Improving KantanMT training efficiency with FastAlign. In: Proceedings of AMTA 2016, the twelfth conference of the Association for Machine Translation in the Americas, vol 2, MT Users’ Track, Austin, TX, USA, pp 222–231
Shterionov D, Nagle P, Casanellas L, Superbo R, O'Dowd T (2017) Empirical evaluation of NMT and PBSMT quality for large-scale translation production. In: Proceedings of the user track of the 20th annual conference of the European Association for Machine Translation (EAMT), Prague, Czech Republic, pp 74–79
Smith A, Hardmeier C, Tiedemann J (2016) Climbing Mont BLEU: the strange world of reachable high-BLEU translations. In: Proceedings of the 19th annual conference of the European Association for Machine Translation, EAMT 2016, Riga, Latvia, pp 269–281
Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: AMTA 2006. Proceedings of the 7th conference of the association for machine translation of the Americas. Visions for the future of machine translation, Cambridge, Massachusetts, USA, pp 223–231
Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Proceedings of advances in neural information processing systems 27: annual conference on neural information processing systems, Montreal, Quebec, Canada, pp 3104–3112
Vanmassenhove E, Du J, Way A (2016) Improving subject-verb agreement in SMT. In: Proceedings of the fifth workshop on hybrid approaches to translation, Riga, Latvia
Way A (2018a) Machine translation: where are we at today? In: Angelone E, Massey G, Ehrensberger-Dow M (eds) The Bloomsbury Companion to language industry studies. Bloomsbury, London
Way A (2018b) Quality expectations of machine translation. In: Moorkens J, Castilho S, Gaspari F, Doherty S (eds) Translation quality assessment: from principles to practice. Springer, Berlin
Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K, Klingner J, Shah A, Johnson M, Liu X, Kaiser L, Gouws S, Kato Y, Kudo T, Kazawa H, Stevens K, Kurian G, Patil N, Wang W, Young C, Smith J, Riesa J, Rudnick A, Vinyals O, Corrado G, Hughes M, Dean J (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. CoRR arXiv:1609.08144
Ziemski M, Junczys-Dowmunt M, Pouliquen B (2016) The United Nations Parallel Corpus v1.0. In: Proceedings of the tenth international conference on language resources and evaluation, Portorož, Slovenia, pp 3530–3534
Acknowledgements
We would like to thank our external evaluators: Xiyi Fan, Ruopu Wang, Wan Nie, Ayumi Tanaka, Maki Iwamoto, Risako Hayakawa, Silvia Doehner, Daniela Naumann, Moritz Philipp, Annabella Ferola, Anna Ricciardelli, Paola Gentile, Celia Ruiz Arca, Clara Beltrá, Giulia Mattoni, as well as University College London, Dublin City University, KU Leuven, University of Strasbourg, and University of Stuttgart.
Cite this article
Shterionov, D., Superbo, R., Nagle, P. et al. Human versus automatic quality evaluation of NMT and PBSMT. Machine Translation 32, 217–235 (2018). https://doi.org/10.1007/s10590-018-9220-z