Abstract
Part-of-speech (POS) tagging is a basic preprocessing step for most text-processing NLP applications. It is a difficult task for morphologically rich languages with partially free word order. This paper describes a POS tagger for one such language, Tamil. The main difficulty in POS tagging is ambiguity: words with different POS tags can carry the same inflections and must be disambiguated using context. This paper presents a pattern-based bootstrapping approach that starts from only a small set of POS-labeled suffix context patterns. Each pattern consists of a stem and a sequence of suffixes, obtained by segmenting a word with a suffix list. The bootstrapping technique generates new patterns by iteratively masking suffixes that have a low probability of occurrence in the suffix context and replacing them with other co-occurring suffixes. We tested our system on a corpus of 20,000 Tamil documents containing 271,933 unique words. Our system achieves a precision of 87.74%.
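The pattern construction described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the romanized suffix list, the function names segment and mask_rarest, the greedy longest-match stripping, and the use of raw frequency counts with a "*" wildcard in place of the paper's suffix-context probabilities and co-occurring-suffix replacement. It is not the authors' implementation.

```python
from collections import Counter

# Hypothetical romanized suffix list, for illustration only (not the paper's list).
SUFFIXES = ["kal", "ai", "il", "ukku"]


def segment(word, suffixes=SUFFIXES, min_stem=2):
    """Split a word into (stem, [suffixes]) by repeatedly stripping the
    longest known suffix from the end, keeping at least min_stem characters."""
    found = []
    ordered = sorted(suffixes, key=len, reverse=True)
    stripped = True
    while stripped:
        stripped = False
        for suf in ordered:
            if word.endswith(suf) and len(word) - len(suf) >= min_stem:
                found.insert(0, suf)          # keep suffixes in surface order
                word = word[: -len(suf)]
                stripped = True
                break
    return word, found


def mask_rarest(suffix_seq, counts):
    """Replace the least frequent suffix in the sequence with a wildcard.
    Assumption: frequency counts stand in for suffix-context probabilities."""
    if not suffix_seq:
        return []
    idx = min(range(len(suffix_seq)), key=lambda i: counts[suffix_seq[i]])
    return ["*" if i == idx else s for i, s in enumerate(suffix_seq)]


if __name__ == "__main__":
    stem, sufs = segment("maramkalai")
    print(stem, sufs)
    print(mask_rarest(sufs, Counter({"kal": 5, "ai": 1})))
```

In a fuller bootstrapping loop, the wildcard position would be filled with other suffixes observed in the same context, and each resulting pattern would inherit the POS label of the seed pattern it came from.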
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Ganesh, J., Parthasarathi, R., Geetha, T.V., Balaji, J. (2014). Pattern Based Bootstrapping Technique for Tamil POS Tagging. In: Prasath, R., O’Reilly, P., Kathirvalavakumar, T. (eds) Mining Intelligence and Knowledge Exploration. Lecture Notes in Computer Science, vol 8891. Springer, Cham. https://doi.org/10.1007/978-3-319-13817-6_25
DOI: https://doi.org/10.1007/978-3-319-13817-6_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13816-9
Online ISBN: 978-3-319-13817-6
eBook Packages: Computer Science, Computer Science (R0)