DOI: 10.1145/2824864.2824883

AMRITA_CEN@FIRE-2014: Morpheme Extraction and Lemmatization for Tamil using Machine Learning

Published: 05 December 2014

    Abstract

    This paper presents a method for morpheme extraction and lemmatization of Tamil, submitted to the Morpheme Extraction Task (MET) of FIRE-2014. Tamil is a morphologically rich, agglutinative language. Such a language needs deeper analysis at the word level to capture the meaning of a word from its morphemes and its categories. The methodology used here to extract Tamil morphemes and lemmas is based on a supervised machine learning algorithm for nouns and verbs and on simple suffix stripping for pronouns and proper nouns. Morphemes for the other part-of-speech categories are extracted using a Tamil part-of-speech tagger. In the supervised setting, morphological analysis is redefined as a classification problem. We decompose noun and verb morpheme extraction into two sub-problems: learning to identify the morphemes of the words in a text, and learning to tag those morphemes. In addition to the results on the FIRE-2014 morpheme extraction task, we report further experiments that show the effectiveness of the proposed method.
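
    The two ideas named in the abstract, suffix stripping and segmentation recast as classification, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the romanized suffix list, the B/I label scheme, and every function name are hypothetical, and the abstract does not specify the features or label inventory actually used. A classifier (the author tags mention support vector machines) would be trained to predict one label per character; only the data encoding is shown here.

```python
from typing import List, Tuple

# Hypothetical, romanized suffix list for illustration only; the paper's
# actual suffix inventory is not given in the abstract.
SUFFIXES = ["kaL", "ai", "ukku", "il"]


def strip_suffix(word: str, suffixes: List[str] = SUFFIXES) -> Tuple[str, str]:
    """Longest-match suffix stripping: return (stem, suffix); suffix may be ''."""
    for suf in sorted(suffixes, key=len, reverse=True):
        # Require a non-empty stem so the whole word is never stripped away.
        if word.endswith(suf) and len(word) > len(suf):
            return word[: -len(suf)], suf
    return word, ""


def to_boundary_labels(morphemes: List[str]) -> List[Tuple[str, str]]:
    """Encode a segmented word as (character, label) pairs.

    'B' marks the first character of a morpheme, 'I' a continuation.
    Predicting these labels turns segmentation into classification.
    """
    pairs = []
    for morpheme in morphemes:
        for i, ch in enumerate(morpheme):
            pairs.append((ch, "B" if i == 0 else "I"))
    return pairs


def from_boundary_labels(pairs: List[Tuple[str, str]]) -> List[str]:
    """Decode (character, label) pairs back into a list of morphemes."""
    morphemes: List[str] = []
    for ch, label in pairs:
        if label == "B" or not morphemes:
            morphemes.append(ch)
        else:
            morphemes[-1] += ch
    return morphemes
```

    Under these assumptions, strip_suffix("avanukku") splits the word into the stem "avan" and the suffix "ukku", and from_boundary_labels(to_boundary_labels(m)) returns m unchanged for any list of non-empty morphemes. Morpheme tagging, the second sub-problem, would attach a category label to each recovered morpheme in a further classification step.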

    Published In

    FIRE '14: Proceedings of the 6th Annual Meeting of the Forum for Information Retrieval Evaluation
    December 2014
    151 pages
    ISBN: 9781450337557
    DOI: 10.1145/2824864
    Editors: Prasenjit Majumder, Mandar Mitra, Sukomal Pal, Madhulika Agrawal, Parth Mehta

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Lemmatization
    2. Machine Learning
    3. Morpheme Extraction
    4. Natural Language Processing
    5. Support Vector Machines

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    FIRE '14: Forum for Information Retrieval Evaluation
    December 5 - 7, 2014
    Bangalore, India

    Acceptance Rates

    Overall Acceptance Rate 19 of 64 submissions, 30%

    Cited By

    • (2023) Modeling Topics in DFA-Based Lemmatized Gujarati Text. Sensors 23:5 (2708). DOI: 10.3390/s23052708. Online publication date: 1-Mar-2023
    • (2023) Tamil NLP Technologies: Challenges, State of the Art, Trends and Future Scope. Speech and Language Technologies for Low-Resource Languages (73-98). DOI: 10.1007/978-3-031-33231-9_6. Online publication date: 29-May-2023
    • (2021) Transfer learning based code-mixed part-of-speech tagging using character level representations for Indian languages. Journal of Ambient Intelligence and Humanized Computing 14:6 (7207-7218). DOI: 10.1007/s12652-021-03573-3. Online publication date: 3-Nov-2021
    • (2016) Text Stemming. ACM Computing Surveys 49:3 (1-46). DOI: 10.1145/2975608. Online publication date: 16-Sep-2016
