3-Step Parallel Corpus Cleaning Using Monolingual Crowd Workers

Nakazawa, Toshiaki; Kurohashi, Sadao; Kobayashi, Hayato; Ishikawa, Hiroki; Sassano, Manabu

doi:10.1007/978-981-10-0515-2_6

Part of the book series: Communications in Computer and Information Science (CCIS, volume 593)

Included in the following conference series:

Conference of the Pacific Association for Computational Linguistics

Abstract

Achieving good machine translation in domains that lack sufficient existing resources requires a manually created, high-quality parallel corpus. Although commissioning professional translators improves corpus quality to some extent, mistakes cannot be avoided completely. In this paper, we propose a framework for cleaning an existing professionally translated parallel corpus quickly and cheaply. The proposed method uses a 3-step crowdsourcing procedure to efficiently detect and edit translation flaws while also guaranteeing the reliability of the edits. Experiments on a fashion-domain e-commerce-site (EC-site) parallel corpus show the effectiveness of the proposed method for parallel corpus cleaning.
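
The abstract does not spell out the three steps, so the following is only a minimal sketch of how a detect, edit, and verify pipeline with worker voting might be organized. The step names, the `SentencePair` class, the `majority` helper, and the simple majority-vote rule are all assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SentencePair:
    """Hypothetical container for one source/translation pair."""
    source: str
    translation: str
    flagged: bool = False
    candidate_edits: List[str] = field(default_factory=list)
    accepted_edit: Optional[str] = None


def majority(votes: List[bool]) -> bool:
    """True when strictly more than half of the votes are True."""
    return sum(votes) > len(votes) / 2


def clean_pair(
    pair: SentencePair,
    detect_votes: List[bool],
    edits: List[str],
    verify_votes: Dict[str, List[bool]],
) -> SentencePair:
    # Step 1 (assumed): workers vote on whether the translation looks flawed.
    pair.flagged = majority(detect_votes)
    if not pair.flagged:
        return pair

    # Step 2 (assumed): workers propose corrected versions of the translation.
    pair.candidate_edits = list(edits)

    # Step 3 (assumed): other workers vote on each proposed edit; keep the
    # edit with the most approvals among those that pass a majority vote.
    best_support = 0
    for edit in pair.candidate_edits:
        votes = verify_votes.get(edit, [])
        if majority(votes) and sum(votes) > best_support:
            best_support = sum(votes)
            pair.accepted_edit = edit
    return pair


if __name__ == "__main__":
    result = clean_pair(
        SentencePair("source sentence", "a red shrit"),
        detect_votes=[True, True, False],
        edits=["a red shirt", "a red skirt"],
        verify_votes={
            "a red shirt": [True, True, True],
            "a red skirt": [True, False, False],
        },
    )
    print(result.flagged, result.accepted_edit)
```

Separating verification from editing in this sketch means that no single worker's edit is accepted without agreement from others, which is one plausible way to support the reliability of the edits.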

Notes

1. Unfortunately, this service has been closed now.
2. Pants for children without the inside of a thigh being sewn up.
3. In our experiments, we showed both source and translated sentences.
4. http://crowdsourcing.yahoo.co.jp.
5. We excluded some sentences which are garbled.
6. http://www.editage.com.

Acknowledgments

This work is supported by the Yahoo Japan Corporation. We thank the anonymous reviewers for their many useful comments.

Author information

Authors and Affiliations

Graduate School of Informatics, Kyoto University, Yoshida-honmachi, Sakyo-ku, Kyoto, 606-8501, Japan
Toshiaki Nakazawa & Sadao Kurohashi
Yahoo Japan Corporation, Midtown Tower, 9-7-1 Akasaka, Minato-ku, Tokyo, 107-6211, Japan
Hayato Kobayashi, Hiroki Ishikawa & Manabu Sassano

Corresponding author

Correspondence to Toshiaki Nakazawa.

Editor information

Editors and Affiliations

Graduate School of Information Science, The University of Tokyo, Bunkyo-ku, Tokyo, Japan
Kôiti Hasida
School of Electrical Eng and Informatics, Bandung Institute of Technology, Bandung, Indonesia
Ayu Purwarianti

About this paper

Cite this paper

Nakazawa, T., Kurohashi, S., Kobayashi, H., Ishikawa, H., Sassano, M. (2016). 3-Step Parallel Corpus Cleaning Using Monolingual Crowd Workers. In: Hasida, K., Purwarianti, A. (eds) Computational Linguistics. PACLING 2015. Communications in Computer and Information Science, vol 593. Springer, Singapore. https://doi.org/10.1007/978-981-10-0515-2_6

DOI: https://doi.org/10.1007/978-981-10-0515-2_6
Published: 20 February 2016
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-0514-5
Online ISBN: 978-981-10-0515-2
eBook Packages: Computer Science, Computer Science (R0)
