Abstract
The use of subword embeddings has proved to be a major innovation in neural machine translation (NMT). It helps NMT learn better context vectors for low-resource languages (LRLs), and hence to predict target words, by better modelling the morphologies of the two languages and the morphosyntactic transfer between them. Some of the NMT models that achieve state-of-the-art results on LRLs, such as the Transformer, BERT, BART, and mBART, can all use subword embeddings. Even so, their performance on Indian-language-to-Indian-language translation is still not as good as for resource-rich languages. One reason is the relative morphological richness of Indian languages; another is that most of them fall into the extremely low-resource or zero-shot categories. Since most major Indian languages use Indic (Brahmi-origin) scripts, text written in them is highly phonetic in nature and phonetically similar in terms of abstract letters and their arrangement. We use these characteristics of Indian languages and their scripts to propose an approach based on a common multilingual Latin-based encoding (WX notation) that takes advantage of language similarity while addressing the morphological complexity issue in NMT. Such a common Latin-based encoding, together with Byte Pair Encoding, allows us to better exploit the phonetic, orthographic, and lexical similarities of related languages, improving translation quality by projecting different but similar languages onto the same orthographic-phonetic character space. We verify the proposed approach through experiments on similar language pairs (Gujarati\(\leftrightarrow \)Hindi, Marathi\(\leftrightarrow \)Hindi, Nepali\(\leftrightarrow \)Hindi, Maithili\(\leftrightarrow \)Hindi, Punjabi\(\leftrightarrow \)Hindi, and Urdu\(\leftrightarrow \)Hindi) under low-resource conditions.
The proposed approach shows an improvement in a majority of cases, in one case by as much as \(\sim \)10 BLEU points over baseline techniques for similar language pairs. We also obtain up to \(\sim \)1 BLEU point improvement on distant and zero-shot language pairs.
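The core of the proposed projection step is converting Brahmi-derived scripts into the shared WX (Latin) character space. The following is a toy sketch of such a converter, covering only a handful of Devanagari characters for illustration; a real converter (e.g. the wxconv package mentioned in the notes) handles the full character repertoire and all the Indic scripts. The mapping tables below are partial by design.

```python
# Toy Devanagari-to-WX converter (illustrative subset only).
# WX is orthographic: a consonant carries an inherent "a" unless it is
# followed by a dependent vowel sign (matra) or a virama.
CONSONANTS = {"क": "k", "ख": "K", "ग": "g", "म": "m", "ल": "l",
              "न": "n", "द": "x", "ह": "h", "र": "r", "प": "p"}
MATRAS = {"ा": "A", "ि": "i", "ी": "I", "ु": "u", "ू": "U",
          "े": "e", "ै": "E", "ो": "o", "ौ": "O"}
VOWELS = {"अ": "a", "आ": "A", "इ": "i", "ई": "I", "उ": "u", "ऊ": "U"}
VIRAMA = "\u094d"  # ्, suppresses the inherent vowel

def to_wx(text):
    out = []
    chars = list(text)
    for i, ch in enumerate(chars):
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = chars[i + 1] if i + 1 < len(chars) else ""
            if nxt not in MATRAS and nxt != VIRAMA:
                out.append("a")  # inherent schwa
        elif ch in MATRAS:
            out.append(MATRAS[ch])
        elif ch in VOWELS:
            out.append(VOWELS[ch])
        elif ch != VIRAMA:
            out.append(ch)  # pass unmapped characters through unchanged
    return "".join(out)

print(to_wx("कमल"))   # kamala
print(to_wx("पानी"))  # pAnI
```

Once Hindi, Gujarati, Marathi, etc. are all rendered in this shared Latin space, cognates become string-similar and subword vocabularies can be shared across languages.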
Notes
Using encoding converters, such as https://pypi.org/project/wxconv/
Punjabi is written in two scripts, Gurmukhi and Shahmukhi, of which the former is a Brahmi-derived script, while the latter is a variant of Perso-Arabic. Urdu is written in a similar variant of Perso-Arabic, also called Nastaliq.
https://github.com/google/sentencepiece
http://www.statmt.org/wmt19/index.html
https://opus.nlpl.eu/
http://www.tdil-dc.in/index.php?lang=en
https://github.com/facebookresearch/fairseq
http://www2.statmt.org/moses/
Although there are communities in South Asia, and probably in other countries, who actually speak the refined register of Urdu, which is much closer to its written or literary form.
We do have a parallel corpus, which is currently being cleaned and sentence-aligned and will be made available in the near future.
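The subword segmentation applied on top of the common encoding (via SentencePiece, linked above) is based on byte-pair merges. A minimal stdlib-only sketch of BPE merge learning over WX-encoded words shows why the shared Latin space helps: frequent character pairs common to related languages (here the hypothetical toy words pAnI, rAnI, nIlA) surface as shared subwords. This is an illustration of the technique, not the SentencePiece implementation.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge operations from a list of WX-encoded words."""
    # Represent each word as space-separated symbols, with frequencies.
    vocab = Counter(" ".join(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs across the vocabulary.
        pairs = Counter()
        for word, freq in vocab.items():
            syms = word.split()
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = "".join(best)
        # Apply the merge everywhere it occurs.
        new_vocab = Counter()
        for word, freq in vocab.items():
            syms, out, i = word.split(), [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_vocab[" ".join(out)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe(["pAnI", "rAnI", "nIlA"], 1))  # [('n', 'I')]
```

Because the merge statistics are computed over the pooled multilingual text, a subword like "nI" learned from one language is immediately reusable for its neighbours once both are in WX.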
Cite this article
Kumar, A., Parida, S., Pratap, A. et al. Machine translation by projecting text into the same phonetic-orthographic space using a common encoding. Sādhanā 48, 238 (2023). https://doi.org/10.1007/s12046-023-02275-0