research-article

NeuMorph: Neural Morphological Tagging for Low-Resource Languages—An Experimental Study for Indic Languages

Authors:

Abhisek Chakrabarty,

Akshay Chaturvedi,

Utpal Garain

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 19, Issue 1

Article No.: 16, Pages 1 - 19

https://doi.org/10.1145/3342354

Published: 10 August 2019

Abstract

This article deals with morphological tagging for low-resource languages. For this purpose, five Indic languages are taken as reference. In addition, two severely resource-poor languages, Coptic and Kurmanji, are also considered. The task entails predicting the morphological tag (case, degree, gender, etc.) of an in-context word. We hypothesize that, to predict the tag of a word, considering its longer context, such as the entire sentence, is not always necessary. In this light, the usefulness of the convolution operation is studied, resulting in a convolutional neural network (CNN) based morphological tagger. Our proposed model (BLSTM-CNN) achieves insightful results in comparison to the present state of the art. Following the recent trend, the task is carried out under three different settings: single language, across languages, and across keys. Whereas the previous models used only character-level features, we show that adding word vectors to the character-level embedding significantly improves the performance of all the models. Since obtaining high-quality word vectors for resource-poor languages remains a challenge, the proposed character-level BLSTM-CNN proves to be the most effective in that scenario.
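
The locality hypothesis in the abstract can be illustrated with a small sketch. The code below is not the BLSTM-CNN model of the article; it is a minimal, pure-Python one-dimensional convolution, written under the assumption that each word is already represented by a feature vector. The function name `conv1d_same`, the toy feature values, and the window width of 3 are all hypothetical choices made for illustration. It shows that the activation computed for a word depends only on the words inside the convolution window, not on the whole sentence.

```python
from typing import List

Vector = List[float]


def conv1d_same(
    features: List[Vector], kernel: List[Vector], bias: float = 0.0
) -> List[float]:
    """Slide one convolution filter over a sentence with zero padding.

    features: one feature vector per word (hypothetical embeddings).
    kernel:   one weight vector per window position; the window width
              must be odd so that it is centred on the current word.
    Returns one activation per word.
    """
    width = len(kernel)
    if width % 2 != 1:
        raise ValueError("window width must be odd")
    half = width // 2
    outputs = []
    for i in range(len(features)):
        total = bias
        for k in range(width):
            j = i + k - half  # index of the neighbouring word
            if 0 <= j < len(features):  # positions outside the sentence act as zeros
                total += sum(w * x for w, x in zip(kernel[k], features[j]))
        outputs.append(total)
    return outputs


# Toy sentence of five words with 2-dimensional features (made-up numbers).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 2.0]]
# Window of width 3: previous word, current word, next word.
kernel = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

activations = conv1d_same(sentence, kernel)

# Perturb the last word: only words whose window contains it are affected.
perturbed = [row[:] for row in sentence]
perturbed[4] = [9.0, 9.0]
perturbed_activations = conv1d_same(perturbed, kernel)
```

In this sketch, replacing the fifth word changes the activations of the fourth and fifth words only; the first three are untouched because the fifth word lies outside their windows. A recurrent layer over the full sentence would, in contrast, let every position depend on every word.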

Cited By

Vilares Ferro M., Darriba Bilbao V., Ribadas Pena F., and Graña Gil J. 2022. Surfing the Modeling of pos Taggers in Low-Resource Scenarios. Mathematics 10:19 (3526). Online publication date: 27-Sep-2022.
https://doi.org/10.3390/math10193526

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Phonology / morphology

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 19, Issue 1

January 2020

345 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/3338846

Editor:
Imed Zitouni
Microsoft, USA

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 August 2019

Accepted: 01 May 2019

Revised: 01 March 2019

Received: 01 September 2018

Published in TALLIP Volume 19, Issue 1

Qualifiers

Research-article
Research
Refereed
