Abstract
Domains that lack sufficient existing resources require a manually created, high-quality parallel corpus to achieve good machine translation. Although asking professional translators to do the translation improves corpus quality to some extent, mistakes cannot be avoided completely. In this paper, we propose a framework for cleaning an existing professionally translated parallel corpus quickly and cheaply. The proposed method uses a 3-step crowdsourcing procedure to efficiently detect and edit translation flaws, while also guaranteeing the reliability of the edits. Experiments on a fashion-domain e-commerce-site (EC-site) parallel corpus show the effectiveness of the proposed method for parallel corpus cleaning.
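The abstract does not spell out the three steps, so the following is only a minimal sketch of how such a procedure could be organized. It assumes the steps are (1) flaw detection by vote, (2) editing of flagged translations, and (3) verification of the proposed edits by other workers. The class `SentencePair`, the function names, the majority-vote rule, and the 0.5 threshold are all hypothetical choices for illustration, not the method described in the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SentencePair:
    """One source sentence and its professional translation (hypothetical type)."""
    source: str
    translation: str
    flagged: bool = False
    candidate_edits: List[str] = field(default_factory=list)
    accepted_edit: Optional[str] = None


def majority(votes: List[bool], threshold: float = 0.5) -> bool:
    """Return True when the share of True votes exceeds the threshold."""
    if not votes:
        return False
    return sum(votes) / len(votes) > threshold


def step1_detect(pair: SentencePair, flaw_votes: List[bool]) -> None:
    # Assumed step 1: workers vote on whether the translation looks flawed.
    pair.flagged = majority(flaw_votes)


def step2_edit(pair: SentencePair, worker_edits: List[str]) -> None:
    # Assumed step 2: only flagged translations receive edit proposals;
    # proposals identical to the original translation are ignored.
    if pair.flagged:
        pair.candidate_edits = [e for e in worker_edits if e != pair.translation]


def step3_verify(pair: SentencePair, approvals: Dict[str, List[bool]]) -> None:
    # Assumed step 3: a separate group of workers approves or rejects each
    # proposed edit; the first edit with majority approval is accepted.
    for edit in pair.candidate_edits:
        if majority(approvals.get(edit, [])):
            pair.accepted_edit = edit
            break


def clean(pair: SentencePair,
          flaw_votes: List[bool],
          worker_edits: List[str],
          approvals: Dict[str, List[bool]]) -> str:
    """Run the three assumed steps and return the cleaned translation."""
    step1_detect(pair, flaw_votes)
    step2_edit(pair, worker_edits)
    step3_verify(pair, approvals)
    # Fall back to the original translation when no edit is accepted.
    return pair.accepted_edit or pair.translation
```

In this sketch, an edit replaces the original translation only if it passes the verification vote, which is one simple way to make accepted edits more reliable than any single worker's proposal.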
Notes
- 1. Unfortunately, this service has been closed now.
- 2. Pants for children without the inside of a thigh being sewn up.
- 3. In our experiments, we showed both source and translated sentences.
- 4.
- 5. We excluded some sentences which are garbled.
- 6.
Acknowledgments
This work was supported by Yahoo Japan Corporation. We thank the anonymous reviewers for their many useful comments.
Copyright information
© 2016 Springer Science+Business Media Singapore
About this paper
Cite this paper
Nakazawa, T., Kurohashi, S., Kobayashi, H., Ishikawa, H., Sassano, M. (2016). 3-Step Parallel Corpus Cleaning Using Monolingual Crowd Workers. In: Hasida, K., Purwarianti, A. (eds) Computational Linguistics. PACLING 2015. Communications in Computer and Information Science, vol 593. Springer, Singapore. https://doi.org/10.1007/978-981-10-0515-2_6
DOI: https://doi.org/10.1007/978-981-10-0515-2_6
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-0514-5
Online ISBN: 978-981-10-0515-2
eBook Packages: Computer Science, Computer Science (R0)