AMRITA_CEN@FIRE-2014: Morpheme Extraction and Lemmatization for Tamil using Machine Learning

Authors: Anand M. Kumar and K. P. Soman

FIRE '14: Proceedings of the 6th Annual Meeting of the Forum for Information Retrieval Evaluation

December 2014

Pages 112 - 120

https://doi.org/10.1145/2824864.2824883

Published: 05 December 2014

Abstract

This paper presents a method for morpheme extraction and lemmatization for the Tamil language, developed for the Morpheme Extraction Task (MET) of FIRE-2014. Tamil is a morphologically rich, agglutinative language. Such a language needs deeper analysis at the word level to capture the meaning of a word from its morphemes and their categories. The methodology used here to extract Tamil morphemes and lemmas is based on a supervised machine learning algorithm for nouns and verbs, and on simple suffix stripping for pronouns and proper nouns. Morphemes for the other part-of-speech categories are extracted using a Tamil part-of-speech tagger. In the supervised setting, morphological analysis is recast as a classification problem. We decompose noun and verb morpheme extraction into two sub-problems: learning to identify the morphemes of words in a text, and learning to tag those morphemes. In addition to the results of the FIRE-2014 Morpheme Extraction Task, we report further experiments that show the effectiveness of the proposed method.
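The abstract does not give implementation details, so the following is only a minimal sketch of how morpheme identification might be framed as per-character classification, and of what simple suffix stripping could look like. The B/I labelling scheme, the function names, the romanized example word, and the suffix list are all assumptions made for illustration; they are not taken from the paper, and no trained classifier is included.

```python
from typing import List, Tuple


def to_boundary_labels(morphemes: List[str]) -> List[Tuple[str, str]]:
    """Turn a segmented word into (character, label) training pairs.

    'B' marks the first character of a morpheme, 'I' any other character.
    This labelling scheme is an assumption, not the paper's own encoding.
    """
    pairs = []
    for morpheme in morphemes:
        for i, ch in enumerate(morpheme):
            pairs.append((ch, "B" if i == 0 else "I"))
    return pairs


def from_boundary_labels(chars: str, labels: List[str]) -> List[str]:
    """Rebuild morphemes from predicted per-character B/I labels."""
    morphemes: List[str] = []
    for ch, label in zip(chars, labels):
        if label == "B" or not morphemes:
            morphemes.append(ch)      # start a new morpheme
        else:
            morphemes[-1] += ch       # extend the current morpheme
    return morphemes


def strip_suffix(word: str, suffixes: List[str]) -> Tuple[str, str]:
    """Longest-match suffix stripping; returns (stem, suffix)."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], suffix
    return word, ""


if __name__ == "__main__":
    # Hypothetical romanized verb form split into root + tense + ending.
    segments = ["padi", "thth", "aan"]
    pairs = to_boundary_labels(segments)
    labels = [label for _, label in pairs]
    print(labels)
    print(from_boundary_labels("".join(segments), labels))
    # Hypothetical pronoun form and suffix list.
    print(strip_suffix("avanai", ["ai", "ukku"]))
```

In a setup like this, a supervised classifier would predict the B/I labels for unseen words, and a second classifier would assign a grammatical tag to each recovered morpheme, mirroring the two sub-problems named in the abstract.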

Published In

FIRE '14: Proceedings of the 6th Annual Meeting of the Forum for Information Retrieval Evaluation

December 2014

151 pages

ISBN:9781450337557

DOI:10.1145/2824864

Editors:
Prasenjit Majumder, Dhirubhai Ambani Institute of Information and Communication Technology, Gujarat, India
Mandar Mitra, Indian Statistical Institute, Kolkata, India
Sukomal Pal, Indian School of Mines, Dhanbad
Madhulika Agrawal, Dhirubhai Ambani Institute of Information and Communication Technology, Gujarat, India
Parth Mehta, Dhirubhai Ambani Institute of Information and Communication Technology, Gujarat, India

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

FIRE '14

FIRE '14: Forum for Information Retrieval Evaluation

December 5 - 7, 2014

Bangalore, India

Acceptance Rates

Overall Acceptance Rate 19 of 64 submissions, 30%
