Abstract
Most evaluation metrics for machine translation (MT) require reference translations for each sentence in order to produce a score reflecting certain aspects of its quality. The de facto standard metrics, BLEU and NIST, are known to correlate well with human evaluation at the corpus level, but this is not the case at the segment level. As an attempt to overcome these two limitations (the dependence on references and the poor segment-level correlation), we address the evaluation of MT quality as a prediction task: reference-independent features are extracted from the source sentences and their translations, and a quality score is obtained from models produced from training data. We show that this approach yields better correlation with human evaluation than commonly used metrics, even with models trained on different MT systems, language pairs and text domains.
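The approach sketched in the abstract reduces to ordinary supervised regression: featurize each (source, translation) pair without consulting a reference, fit a model on human quality scores, and predict scores for unseen pairs. The paper itself uses SVM regression (LIBSVM) over a rich feature set; the sketch below substitutes plain least squares and three toy length-based features so it runs with no external libraries. All feature choices, sentences and scores are invented for illustration only.

```python
# Reference-free MT quality estimation as a regression task (illustrative sketch).
# The paper uses SVM regression with many features; here we use least squares
# and three toy features. Data and scores below are invented examples.

def features(source: str, translation: str) -> list:
    """Reference-independent features: token counts and their ratio (+ bias)."""
    src_len = len(source.split())
    tgt_len = len(translation.split())
    return [1.0, float(src_len), float(tgt_len), tgt_len / max(src_len, 1)]

def fit(X, y):
    """Ordinary least squares via normal equations and Gaussian elimination."""
    n = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(n)]
    for col in range(n):                      # forward elimination with pivoting
        piv = max(range(col, n), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, n):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, n):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    w = [0.0] * n                             # back substitution
    for r in range(n - 1, -1, -1):
        w[r] = (xty[r] - sum(xtx[r][c] * w[c] for c in range(r + 1, n))) / xtx[r][r]
    return w

def predict(w, source: str, translation: str) -> float:
    """Quality score for a new (source, translation) pair, no reference needed."""
    return sum(wi * fi for wi, fi in zip(w, features(source, translation)))

# Toy training data: (source, MT output, human quality score on a 1-5 scale).
train = [
    ("the cat sat on the mat", "le chat était assis sur le tapis", 4.5),
    ("hello world", "bonjour le", 2.0),
    ("a quick test", "un test rapide", 4.0),
    ("this is a longer sentence to translate", "ceci est", 1.5),
]
w = fit([features(s, t) for s, t, _ in train], [q for _, _, q in train])
score = predict(w, "good morning", "bonjour")
```

Because no reference translation enters `features`, the trained model can score output from any MT system at the segment level, which is the key contrast with BLEU and NIST drawn in the abstract.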
References
Albrecht J, Hwa R (2007a) A re-examination of machine learning approaches for sentence-level MT evaluation. In: 45th meeting of the association for computational linguistics, Prague, pp 880–887
Albrecht J, Hwa R (2007b) Regression for sentence-level MT evaluation with pseudo references. In: 45th meeting of the association for computational linguistics, Prague, pp 296–303
Blatz J, Fitzgerald E, Foster G, Gandrabur S, Goutte C, Kulesza A, Sanchis A, Ueffing N (2003) Confidence estimation for machine translation. Technical report. Johns Hopkins University, Baltimore
Blatz J, Fitzgerald E, Foster G, Gandrabur S, Goutte C, Kulesza A, Sanchis A, Ueffing N (2004) Confidence estimation for machine translation. In: 20th coling, Geneva, pp 315–321
Callison-Burch C, Fordyce C, Koehn P, Monz C, Schroeder J (2008) Further meta-evaluation of machine translation. In: 3rd workshop on statistical machine translation, Columbus, pp 70–106
Callison-Burch C, Koehn P, Monz C, Schroeder J (2009) Findings of the 2009 workshop on statistical machine translation. In: 4th workshop on statistical machine translation, Athens, pp 1–28
Chang C, Lin C (2001) LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
Cohen J (1960) A coefficient of agreement for nominal scales. Educ Psychol Meas 20(1): 37–46
Doddington G (2002) Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In: Conference on human language technology, San Diego, pp 138–145
Gamon M, Aue A, Smets M (2005) Sentence-level MT evaluation without reference translations: beyond language modeling. In: 10th meeting of the European association for machine translation, Budapest
Gandrabur S, Foster G (2003) Confidence estimation for translation prediction. In: 7th conference on natural language learning, Edmonton, pp 95–102
Gimenez J, Marquez L (2008) A smorgasbord of features for automatic MT evaluation. In: 3rd workshop on statistical machine translation, Columbus, OH, pp 195–198
Joachims T (1999) Making large-scale SVM learning practical. In: Schoelkopf B, Burges CJC, Smola AJ (eds) Advances in Kernel methods—support vector learning. MIT Press, Cambridge
Johnson H, Sadat F, Foster G, Kuhn R, Simard M, Joanis E, Larkin S (2006) Portage: with smoothed phrase tables and segment choice models. In: Workshop on statistical machine translation, New York, pp 134–137
Kääriäinen M (2009) Sinuhe—statistical machine translation using a globally trained conditional exponential family translation model. In: Conference on empirical methods in natural language processing, Singapore, pp 1027–1036
Kadri Y, Nie JY (2006) Improving query translation with confidence estimation for cross language information retrieval. In: 15th ACM international conference on information and knowledge management, Arlington, pp 818–819
Koehn P (2004) Statistical significance tests for machine translation evaluation. In: Conference on empirical methods in natural language processing, Barcelona
Lavie A, Agarwal A (2007) METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In: 2nd workshop on statistical machine translation, Prague, Czech Republic, pp 228–231
Lin CY, Och FJ (2004) ORANGE: a method for evaluating automatic evaluation metrics for machine translation. In: Coling-2004, Geneva, pp 501–507
Pado S, Galley M, Jurafsky D, Manning CD (2009) Textual entailment features for machine translation evaluation. In: 4th workshop on statistical machine translation, Athens, pp 37–41
Papineni K, Roukos S, Ward T, Zhu W (2002) BLEU: a method for automatic evaluation of machine translation. In: 40th meeting of the association for computational linguistics, Philadelphia, pp 311–318
Quirk CB (2004) Training a sentence-level machine translation confidence measure. In: 4th language resources and evaluation conference, Lisbon, pp 825–828
Saunders C (2008) Application of Markov approaches to SMT. Technical report. SMART Project Deliverable 2.2
Simard M, Cancedda N, Cavestro B, Dymetman M, Gaussier E, Goutte C, Yamada K (2005) Translating with non-contiguous phrases. In: Conference on empirical methods in natural language processing, Vancouver, pp 755–762
Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: 7th conference of the association for machine translation in the Americas, Cambridge, MA, pp 223–231
Specia L, Turchi M, Cancedda N, Dymetman M, Cristianini N (2009) Estimating the sentence-level quality of machine translation systems. In: 13th meeting of the European association for machine translation, Barcelona
Ueffing N, Ney H (2005) Application of word-level confidence measures in interactive statistical machine translation. In: 10th meeting of the European association for machine translation, Budapest, pp 262–270
Additional information
Lucia Specia—Work developed while working at the Xerox Research Centre Europe, France.
Dhwaj Raj—Work developed during an internship at the Xerox Research Centre Europe, France.
Marco Turchi—Work developed while working at the Department of Engineering Mathematics, University of Bristol, UK.
Cite this article
Specia, L., Raj, D. & Turchi, M. Machine translation evaluation versus quality estimation. Machine Translation 24, 39–50 (2010). https://doi.org/10.1007/s10590-010-9077-2