A Survey on NLP Resources, Tools, and Techniques for Marathi Language Processing

Authors:

Girdhari Singh

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 2

Article No.: 47, Pages 1 - 34

https://doi.org/10.1145/3548457

Published: 27 December 2022

Abstract

Natural Language Processing (NLP) has been practiced for the past couple of decades, and extensive work has been done on Western languages, particularly English. Eastern languages, especially those of the Indian subcontinent, need attention because comparatively little language processing work has been done on them. Western languages are rich in dictionaries, WordNets, and associated tools, while Indian languages lag behind in these resources. Marathi is the third most spoken language in India and the 15th most spoken language worldwide. A lack of resources, complex linguistic features, and the influence of prevalent dialects of neighboring languages have resulted in limited work on Marathi. The aim of this study is to provide insight into the linguistic resources, tools, and state-of-the-art techniques applied to the processing of the Marathi language. First, morphological descriptions of Marathi are provided, followed by a discussion of the characteristics of the language. Next, the corpora, tools, and techniques available for developing NLP tasks for Marathi are reviewed. Finally, a gap analysis of current research is presented, and future directions for this new and dynamic area are listed to benefit the Marathi language processing research community.

Index Terms

A Survey on NLP Resources, Tools, and Techniques for Marathi Language Processing
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
2. General and reference
  1. Document types
    1. Surveys and overviews

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 22, Issue 2

February 2023

624 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3572719

Editor:
Imed Zitouni
Google, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 December 2022

Online AM: 13 July 2022

Accepted: 04 July 2022

Revised: 08 June 2022

Received: 14 May 2021

Published in TALLIP Volume 22, Issue 2

Qualifiers

Research-article
Refereed
