DOI: 10.5555/1964750.1964756
Article

Word segmentation for dialect translation

Published: 20 February 2011

Abstract

This paper proposes an unsupervised word segmentation algorithm that identifies word boundaries in continuous source-language text. The goal is to improve the quality of statistical machine translation (SMT) for local dialects by exploiting linguistic information from the standard language. The method iteratively learns multiple segmentation schemes that are consistent with (1) the standard dialect segmentations and (2) the phrasal segmentations of an SMT system trained on the resegmented bitext of the local dialect. In a second step, the multiple segmentation schemes are integrated into a single SMT system by characterizing the source-language side and merging identical translation pairs of the differently segmented SMT models. Experiments translating three Japanese local dialects (Kumamoto, Kyoto, Osaka) into three Indo-European languages (English, German, Russian) show that the proposed system outperforms SMT engines trained on character-based and on standard dialect segmentation schemes for the majority of the investigated translation tasks and automatic evaluation metrics.
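The integration step described in the abstract can be illustrated with a minimal sketch. It assumes that "characterizing" means re-segmenting each source phrase into single characters, so that phrase pairs from differently segmented models become directly comparable. The table layout, the function names, the toy entries, and the choice to keep the maximum score for identical pairs are all assumptions made for illustration; the abstract does not say how scores are combined.

```python
def characterize(source_phrase: str) -> str:
    """Drop word boundaries and re-segment the phrase into single characters."""
    return " ".join(source_phrase.replace(" ", ""))


def merge_phrase_tables(tables):
    """Merge phrase tables whose source sides were segmented differently.

    Each table maps (source_phrase, target_phrase) -> score.
    Keeping the maximum score for identical pairs is an assumption of this
    sketch, not a detail taken from the paper.
    """
    merged = {}
    for table in tables:
        for (source, target), score in table.items():
            key = (characterize(source), target)
            merged[key] = max(merged.get(key, 0.0), score)
    return merged


# Hypothetical toy tables: the same dialect phrase under two segmentations.
table_a = {("おお きに", "thank you"): 0.4}
table_b = {("おおきに", "thank you"): 0.7, ("おおきに", "thanks"): 0.2}

merged = merge_phrase_tables([table_a, table_b])
for (source, target), score in sorted(merged.items()):
    print(source, "|||", target, "|||", score)
```

After characterization, both segmentations of the source phrase map to the same key, so the two "thank you" entries collapse into one while the "thanks" entry is kept as a separate pair.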

Cited By

  • (2011) Dialect translation. Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties, pp. 1-9. DOI: 10.5555/2140533.2140534. Online publication date: 31-Jul-2011

Published In

CICLing'11: Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part II
February 2011
520 pages
ISBN: 9783642194368
  • Editor: Alexander Gelbukh

Publisher

Springer-Verlag

Berlin, Heidelberg