
Word segmentation for dialect translation

Authors:

Eiichiro Sumita

CICLing'11: Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part II

Pages 55 - 67

Published: 20 February 2011

Abstract

This paper proposes an unsupervised word segmentation algorithm that identifies word boundaries in continuous source-language text. The goal is to improve the quality of statistical machine translation (SMT) of local dialects by exploiting linguistic information from the standard language. The method iteratively learns multiple segmentation schemes that are consistent with (1) the standard dialect segmentations and (2) the phrasal segmentations of an SMT system trained on the resegmented bitext of the local dialect. In a second step, the multiple segmentation schemes are integrated into a single SMT system by characterizing the source language side and merging identical translation pairs of the differently segmented SMT models. Experiments translating three Japanese local dialects (Kumamoto, Kyoto, Osaka) into three Indo-European languages (English, German, Russian) show that the proposed system outperforms SMT engines trained on character-based as well as standard dialect segmentation schemes for the majority of the investigated translation tasks and automatic evaluation metrics.
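The second step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "characterizing" means re-splitting each source phrase into single characters, that a phrase table can be modeled as a plain dictionary, and that identical pairs are combined by averaging their scores. The function names `characterize` and `merge_phrase_tables` and the toy entries are hypothetical.

```python
from collections import defaultdict


def characterize(source_tokens):
    """Re-split a segmented source phrase into single characters (assumed meaning)."""
    return tuple("".join(source_tokens))


def merge_phrase_tables(tables):
    """Merge phrase tables whose source sides use different segmentations.

    Each table maps (source_tokens, target_phrase) -> score.
    Averaging the scores of identical pairs is an assumption of this sketch,
    not a detail taken from the paper.
    """
    collected = defaultdict(list)
    for table in tables:
        for (source_tokens, target), score in table.items():
            # After characterization, differently segmented sources coincide.
            collected[(characterize(source_tokens), target)].append(score)
    return {pair: sum(scores) / len(scores) for pair, scores in collected.items()}


# Two toy tables that segment the same source string differently.
table_a = {(("おお", "きに"), "thank you"): 0.6}
table_b = {(("おおきに",), "thank you"): 0.8, (("おおきに",), "thanks"): 0.2}
merged = merge_phrase_tables([table_a, table_b])
```

Here the two entries for "thank you" collapse into one pair keyed by the character sequence, while "thanks" survives as a separate pair. A real system would also have to handle the several feature scores a phrase table typically carries per pair.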

Published In

CICLing'11: Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part II

February 2011

520 pages

ISBN: 9783642194368

Editor:
Alexander Gelbukh
Instituto Politécnico Nacional, Centro de Investigación en Computación, Mexico D.F., Mexico

Publisher

Springer-Verlag

Berlin, Heidelberg
