
Machine translation by projecting text into the same phonetic-orthographic space using a common encoding


Abstract

The use of subword embeddings has proved to be a major innovation in neural machine translation (NMT). It helps NMT models learn better context vectors for low-resource languages (LRLs), predicting target words by better modelling the morphologies of the two languages as well as the morphosyntactic transfer between them. Some of the NMT models that achieve state-of-the-art improvements on LRLs, such as the Transformer, BERT, BART, and mBART, can all use subword embeddings. Even so, their performance on Indian-language-to-Indian-language translation is still not as good as on resource-rich languages. One reason for this is the relative morphological richness of Indian languages; another is that most of them fall into the extremely low-resource or zero-shot categories. Since most major Indian languages use Indic or Brahmi-origin scripts, text written in them is highly phonetic in nature and phonetically similar in terms of abstract letters and their arrangements. We use these characteristics of Indian languages and their scripts to propose an approach based on a common multilingual Latin-based encoding (WX notation) that takes advantage of language similarity while addressing the morphological complexity issue in NMT. Such a multilingual Latin-based encoding, together with Byte Pair Encoding (BPE), allows us to better exploit the phonetic, orthographic, and lexical similarities of these languages and to improve translation quality by projecting different but similar languages onto the same orthographic-phonetic character space. We verify the proposed approach through experiments on similar language pairs (Gujarati\(\leftrightarrow \)Hindi, Marathi\(\leftrightarrow \)Hindi, Nepali\(\leftrightarrow \)Hindi, Maithili\(\leftrightarrow \)Hindi, Punjabi\(\leftrightarrow \)Hindi, and Urdu\(\leftrightarrow \)Hindi) under low-resource conditions. The proposed approach shows an improvement in a majority of cases, in one case by as much as \(\sim \)10 BLEU points over baseline techniques for similar language pairs. We also get up to \(\sim \)1 BLEU point improvement on distant and zero-shot language pairs.
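To make the projection step concrete, the sketch below converts Hindi and Gujarati text into the shared WX space and then learns a single joint BPE segmentation over the pooled romanized corpus. This is a minimal sketch, not the authors' exact pipeline: it assumes the WXC interface of the wxconv package (see note 1) and the standard sentencepiece Python API (note 4); the file names, language codes, and vocabulary size are illustrative.

```python
# Minimal sketch: project Hindi and Gujarati into the common WX (Latin)
# space, then learn one joint BPE model over the pooled romanized text.
# Assumes the wxconv WXC interface and the sentencepiece Python API;
# file names and language codes are illustrative, not from the paper.
from wxconv import WXC
import sentencepiece as spm

# 1. Native script (UTF-8) -> WX roman encoding, one converter per language.
hin2wx = WXC(order='utf2wx', lang='hin')
guj2wx = WXC(order='utf2wx', lang='guj')

print(hin2wx.convert('मेरा नाम राम है'))  # e.g. 'merA nAma rAma hE'

# 2. Pool WX-encoded text from both languages into one training file so
#    that cognates across the two languages can share subwords.
with open('pooled.wx.txt', 'w', encoding='utf-8') as out:
    for path, conv in [('train.hi', hin2wx), ('train.gu', guj2wx)]:
        with open(path, encoding='utf-8') as f:
            for line in f:
                out.write(conv.convert(line.strip()) + '\n')

# 3. Train a joint BPE model on the shared orthographic-phonetic space.
spm.SentencePieceTrainer.train(
    input='pooled.wx.txt', model_prefix='wx_bpe',
    vocab_size=8000, model_type='bpe')

# 4. Segment WX text into subwords before feeding it to the NMT system;
#    after decoding, a wx2utf converter maps output back to native script.
sp = spm.SentencePieceProcessor(model_file='wx_bpe.model')
print(sp.encode(hin2wx.convert('मेरा नाम राम है'), out_type=str))
```

Because both languages now share one Latin character inventory, the BPE learner sees phonetically similar words from Hindi and Gujarati as overlapping character sequences, which is what lets similar languages share subword vocabulary.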


Notes

  1. Using encoding converters, such as  https://pypi.org/project/wxconv/

  2. Punjabi is written in two scripts, Gurmukhi and Shahmukhi; the former is a Brahmi-derived script, while the latter is a variant of the Perso-Arabic script. Urdu is written in a similar Perso-Arabic variant, also called Nastaliq.

  3. https://pypi.org/project/wxconv/, https://github.com/irshadbhat/indic-wx-converter

  4. https://github.com/google/sentencepiece

  5. https://en.wikipedia.org/wiki/International_Phonetic_Alphabet_chart

  6. https://www.internationalphoneticassociation.org/

  7. http://www.statmt.org/wmt20/similar.html

  8. http://www.statmt.org/wmt19/index.html

  9. https://opus.nlpl.eu/

  10. http://www.tdil-dc.in/index.php?lang=en

  11. https://pypi.org/project/wxconv/

  12. https://github.com/facebookresearch/fairseq

  13. http://www2.statmt.org/moses/

  14. https://github.com/mjpost/sacrebleu (a scoring sketch follows these notes)

  15. There are, however, communities in South Asia, and probably in other countries, that speak a refined register of Urdu much closer to its written or literary form.

  16. We do have a parallel corpus, which is currently being cleaned and sentence-aligned and will be made available in the near future.

  17. https://sites.google.com/view/loresmt
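Since the reported gains are measured in BLEU, the following sketch shows how corpus-level scores can be computed with sacrebleu (note 14) on detokenized system output after it has been converted back from WX to the native script. It is a hedged illustration using the library's public corpus_bleu API; the file names are assumptions, not from the paper.

```python
# Hedged sketch: score detokenized system output (already converted back
# from WX to native script) against references using sacrebleu.
# File names are illustrative.
import sacrebleu

with open('hyp.detok.gu', encoding='utf-8') as f:
    hyps = [line.strip() for line in f]
with open('ref.gu', encoding='utf-8') as f:
    refs = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])  # one reference stream
print(f'BLEU = {bleu.score:.2f}')
```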


Author information

Correspondence to Amit Kumar.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Kumar, A., Parida, S., Pratap, A. et al. Machine translation by projecting text into the same phonetic-orthographic space using a common encoding. Sādhanā 48, 238 (2023). https://doi.org/10.1007/s12046-023-02275-0
