
BenLem (A Bengali Lemmatizer) and Its Role in WSD

Published: 26 February 2016

Abstract

A lemmatization algorithm for Bengali has been developed and evaluated, and its effectiveness for word sense disambiguation (WSD) has been investigated. One of the key challenges in computer processing of highly inflected languages is dealing with the frequent morphological variations of root words in text. A lemmatizer is therefore essential for developing natural language processing (NLP) tools for such languages. In this experiment, Bengali, the national language of Bangladesh and the second most popular language in the Indian subcontinent, is taken as a reference. To design the Bengali lemmatizer (named BenLem), the possible transformations through which surface words are formed from lemmas are studied, so that the appropriate reverse transformations can be applied to a surface word to recover the corresponding lemma. BenLem is found to be capable of handling both inflectional and derivational morphology in Bengali. It is evaluated on a set of 18 news articles taken from the FIRE Bengali News Corpus, consisting of 3,342 surface words (excluding proper nouns), and is found to be 81.95% accurate. The role of the lemmatizer is then investigated for Bengali WSD. Ten highly polysemous Bengali words are considered for sense disambiguation, and the WSD dataset is created from the FIRE corpus and a collection of Tagore’s short stories. Several WSD systems are considered in this experiment; BenLem improves the performance of all of them, and the improvements are statistically significant.
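
The idea of applying reverse transformations to a surface word can be illustrated with a minimal sketch. The abstract does not give BenLem's rules, so everything below is an assumption made for illustration: the rule table `REVERSE_RULES`, the tiny `LEXICON`, and the function `lemmatize` are hypothetical names, and the handful of Bengali suffixes shown are common textbook examples rather than the paper's actual transformation set.

```python
# Toy reverse-transformation lemmatizer (illustrative only; NOT BenLem's rules).
# Each rule undoes one assumed surface transformation:
# strip `suffix` from the end of the word, then append `restore`.
REVERSE_RULES = [
    ("গুলো", ""),  # plural marker, e.g. বইগুলো -> বই
    ("দের", ""),   # plural genitive/objective, e.g. ছেলেদের -> ছেলে
    ("রা", ""),    # plural nominative, e.g. ছেলেরা -> ছেলে
    ("কে", ""),    # objective case marker
    ("র", ""),     # genitive case marker
]

# Hypothetical lexicon of known lemmas used to validate candidates.
LEXICON = {"বই", "ছেলে"}


def lemmatize(word, rules=REVERSE_RULES, lexicon=LEXICON):
    """Return the lemma of `word`, or `word` itself if no rule yields a known lemma."""
    if word in lexicon:
        return word
    # Try longer suffixes first so that "দের" is preferred over "র".
    for suffix, restore in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            candidate = word[: len(word) - len(suffix)] + restore
            if candidate in lexicon:
                return candidate
    return word


if __name__ == "__main__":
    for surface in ["ছেলেরা", "ছেলেদের", "বইগুলো", "বই"]:
        print(surface, "->", lemmatize(surface))
```

Validating each candidate against a lexicon is one simple way to avoid over-stripping; a real system would need a much larger rule set, handling of derivational affixes, and consistent Unicode normalization of the input.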


Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 15, Issue 3
March 2016
220 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/2876004
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 February 2016
Accepted: 01 October 2015
Revised: 01 October 2015
Received: 01 August 2014
Published in TALLIP Volume 15, Issue 3


Author Tags

  1. Bengali
  2. Indic languages
  3. evaluation
  4. lemmatizer
  5. word sense disambiguation (WSD)

Qualifiers

  • Note
  • Research
  • Refereed


Cited By

  • (2024) KULemma: Towards a Comprehensive Bangla Lemmatizer. 2024 6th International Conference on Electrical Engineering and Information & Communication Technology (ICEEICT), 1089-1094. DOI: 10.1109/ICEEICT62016.2024.10534443. Online publication date: 2-May-2024.
  • (2023) Modeling Topics in DFA-Based Lemmatized Gujarati Text. Sensors 23:5 (2708). DOI: 10.3390/s23052708. Online publication date: 1-Mar-2023.
  • (2023) Contextual Urdu Lemmatization Using Recurrent Neural Network Models. Mathematics 11:2 (435). DOI: 10.3390/math11020435. Online publication date: 13-Jan-2023.
  • (2023) Low-resource Multilingual Neural Translation Using Linguistic Feature-based Relevance Mechanisms. ACM Transactions on Asian and Low-Resource Language Information Processing 22:7 (1-36). DOI: 10.1145/3594631. Online publication date: 18-May-2023.
  • (2023) MorphBen: A Neural Morphological Analyzer for Bengali Language. Computational Linguistics and Intelligent Text Processing, 595-607. DOI: 10.1007/978-3-031-24337-0_42. Online publication date: 26-Feb-2023.
  • (2022) An Enhanced Neural Word Embedding Model for Transfer Learning. Applied Sciences 12:6 (2848). DOI: 10.3390/app12062848. Online publication date: 10-Mar-2022.
  • (2022) A Lemmatizer for Low-resource Languages: WSD and Its Role in the Assamese Language. ACM Transactions on Asian and Low-Resource Language Information Processing 21:4 (1-22). DOI: 10.1145/3502157. Online publication date: 31-Jul-2022.
  • (2020) Classification Benchmarks for Under-resourced Bengali Language based on Multichannel Convolutional-LSTM Network. 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), 390-399. DOI: 10.1109/DSAA49011.2020.00053. Online publication date: Oct-2020.
  • (2019) Vector Representation of Bengali Word Using Various Word Embedding Model. 2019 8th International Conference System Modeling and Advancement in Research Trends (SMART), 27-30. DOI: 10.1109/SMART46866.2019.9117386. Online publication date: Nov-2019.
  • (2018) Processing Texts in a Corpus. Utility and Application of Language Corpora, 73-90. DOI: 10.1007/978-981-13-1801-6_5. Online publication date: 14-Aug-2018.
