research-article

Linguistic Resources for Bhojpuri, Magahi, and Maithili: Statistics about Them, Their Similarity Estimates, and Baselines for Three Applications

Published: 13 September 2021
    Abstract

    Preparing corpora for low-resource languages, and developing human language technology to analyze or computationally process them, is a laborious task, primarily because of the unavailability of expert linguists who are native speakers of these languages and because of the time and resources required. Bhojpuri, Magahi, and Maithili, languages of the Purvanchal region of India (in the north-eastern parts), are low-resource languages belonging to the Indo-Aryan (or Indic) family. They are closely related to Hindi, a relatively high-resource language, which is why we compare them with Hindi. We collected corpora for these three languages from various sources and cleaned them to the extent possible without changing the data in them. The text belongs to different domains and genres. We calculated some basic statistical measures for these corpora at the character, word, syllable, and morpheme levels. These measures, both absolute and relative, were expected to indicate linguistic properties such as morphological, lexical, phonological, and syntactic complexity (or richness). The results were compared with a standard Hindi corpus. For most of the measures, we tried to match the corpus size across the languages to avoid the effect of corpus size, but in some cases using the full corpus turned out to be better, even though the sizes were very different. Although the results are not very clear, we tried to draw some conclusions about the languages and the corpora. The corpora were also manually annotated with part-of-speech (POS) and chunk tags, using the BIS tagset. The POS-tagged data sizes are 16,067, 14,669, and 12,310 sentences, respectively, for Bhojpuri, Magahi, and Maithili. The sizes for chunking are 9,695 and 1,954 sentences for Bhojpuri and Maithili, respectively.
    The inter-annotator agreement for these annotations, measured with Cohen's kappa, was 0.92, 0.64, and 0.74, respectively, for the three languages. These annotated corpora have been used to develop preliminary automated tools: a POS tagger, a chunker, and a language identifier. We have also developed a bilingual dictionary (Purvanchal languages to Hindi) and a synset (that can later be integrated into the Indo-WordNet) as additional resources. The main contribution of the work is the creation of basic resources to facilitate further language processing research for these languages, together with some quantitative measures about them and their similarities among themselves and with Hindi. For the similarities, we use a somewhat novel measure of language similarity based on an n-gram-based language identification algorithm. An additional contribution is a set of baselines for three basic NLP applications (POS tagging, chunking, and language identification) for these closely related languages.
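
    The agreement figures above are Cohen's kappa values, which correct the raw agreement between two annotators for the agreement expected by chance. The following is a minimal sketch of that calculation, not the authors' evaluation script. The function name `cohen_kappa` and the two toy tag sequences are hypothetical; the tag names merely follow the style of the BIS tagset and are not taken from the annotated corpora.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    own label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two marginal label probabilities,
    # summed over labels (Counter returns 0 for labels one annotator never used).
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: six tokens tagged by two annotators.
annotator_1 = ["N_NN", "V_VM", "N_NN", "PSP", "N_NN", "V_VM"]
annotator_2 = ["N_NN", "V_VM", "JJ", "PSP", "N_NN", "V_VM"]
print(round(cohen_kappa(annotator_1, annotator_2), 2))  # prints 0.76
```

    In this toy case the annotators agree on five of six tokens (raw agreement 0.83), but kappa is lower (0.76) because some of that agreement would be expected by chance given how often each annotator uses each tag.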


            Published In

            ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 20, Issue 6
            November 2021
            439 pages
            ISSN:2375-4699
            EISSN:2375-4702
            DOI:10.1145/3476127
            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            Published: 13 September 2021
            Accepted: 01 March 2021
            Revised: 01 November 2020
            Received: 01 May 2020
            Published in TALLIP Volume 20, Issue 6

            Author Tags

            1. Corpus
            2. syntactic annotation
            3. low resource language
            4. inter-annotator agreement
            5. POS tagging
            6. chunking
            7. language identification
            8. language similarity

            Qualifiers

            • Research-article
            • Refereed

            Cited By

            • (2023) A Study on the Performance of Recurrent Neural Network based Models in Maithili Part of Speech Tagging. ACM Transactions on Asian and Low-Resource Language Information Processing 22:2 (1-16). DOI:10.1145/3540260. Online publication date: 21-Feb-2023
            • (2023) Machine translation by projecting text into the same phonetic-orthographic space using a common encoding. Sādhanā 48:4. DOI:10.1007/s12046-023-02275-0. Online publication date: 4-Nov-2023
            • (2023) Automatic language identification: a case study of Pahari languages. Language Resources and Evaluation 57:3 (1361-1387). DOI:10.1007/s10579-023-09651-6. Online publication date: 12-May-2023
            • (2023) Deep Learning-Based Similar Languages’ POS Tagging: Experiments on Bhojpuri, Maithili, and Magahi. Soft Computing: Theories and Applications (845-855). DOI:10.1007/978-981-19-9858-4_72. Online publication date: 25-Apr-2023
            • (2022) TLSPG. Journal of King Saud University - Computer and Information Sciences 34:9 (6552-6563). DOI:10.1016/j.jksuci.2022.03.008. Online publication date: 1-Oct-2022
            • (2022) Hierarchical self attention based sequential labelling model for Bhojpuri, Maithili and Magahi languages. Journal of King Saud University - Computer and Information Sciences 34:10 (8739-8749). DOI:10.1016/j.jksuci.2021.09.022. Online publication date: Nov-2022
            • (2021) Low Resource Neural Machine Translation: Assamese to/from Other Indo-Aryan (Indic) Languages. ACM Transactions on Asian and Low-Resource Language Information Processing 21:1 (1-32). DOI:10.1145/3469721. Online publication date: 16-Nov-2021
