NeuMorph: Neural Morphological Tagging for Low-Resource Languages—An Experimental Study for Indic Languages

Published: 10 August 2019

    Abstract

    This article addresses morphological tagging for low-resource languages. Five Indic languages serve as the reference set, and two severely resource-poor languages, Coptic and Kurmanji, are also considered. The task is to predict the morphological tag (case, degree, gender, etc.) of a word in context. We hypothesize that predicting the tag of a word does not always require its longer context, such as the entire sentence. In this light, we study the usefulness of the convolution operation, which results in a convolutional neural network (CNN) based morphological tagger. Our proposed model (BLSTM-CNN) achieves insightful results in comparison to the present state of the art. Following the recent trend, the task is carried out under three different settings: single language, across languages, and across keys. Whereas previous models used only character-level features, we show that adding word vectors to the character-level embeddings significantly improves the performance of all the models. Because obtaining high-quality word vectors for resource-poor languages remains a challenge, the proposed character-level BLSTM-CNN proves to be the most effective in that scenario.
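
    The locality hypothesis in the abstract can be illustrated with a small sketch. This is not the authors' BLSTM-CNN; it is a single hand-written 1D convolution filter in plain Python, and the toy vectors, the filter weights, and the helper name `conv1d_same` are all assumptions made up for illustration. It shows only that the output at a given position depends on a fixed window of neighbouring tokens, so changing a distant token leaves it untouched.

```python
from typing import List

Vector = List[float]


def conv1d_same(tokens: List[Vector], kernel: List[Vector]) -> List[float]:
    """Slide one filter of width len(kernel) over a sentence.

    Positions outside the sentence are treated as zero vectors, so the
    output has one value per token. Output i depends only on tokens
    within len(kernel) // 2 positions of i.
    """
    width = len(kernel)
    half = width // 2
    dim = len(kernel[0])
    out = []
    for i in range(len(tokens)):
        total = 0.0
        for k in range(width):
            j = i + k - half
            if 0 <= j < len(tokens):  # skip zero padding
                total += sum(tokens[j][d] * kernel[k][d] for d in range(dim))
        out.append(max(0.0, total))  # ReLU non-linearity
    return out


# Hypothetical 2-dimensional token vectors for a 5-token sentence.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 2.0]]
# Hypothetical filter spanning 3 tokens (left neighbour, centre, right neighbour).
kernel = [[0.5, -0.5], [1.0, 1.0], [-0.5, 0.5]]

base = conv1d_same(sentence, kernel)

# Perturb only the last token and recompute.
changed = [list(v) for v in sentence]
changed[4] = [9.0, -9.0]
perturbed = conv1d_same(changed, kernel)

# Positions 0..2 are more than one token away from the change and stay equal;
# positions 3 and 4 have the changed token inside their window.
unchanged_positions = [i for i in range(len(base)) if base[i] == perturbed[i]]
print(unchanged_positions)
```

    A recurrent layer run over the whole sentence has no such fixed window, which is the contrast the hypothesis draws; how the two are combined in the proposed model is described in the full article, not here.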

    Cited By

    • (2022) Surfing the Modeling of pos Taggers in Low-Resource Scenarios. Mathematics 10:19 (3526). DOI: 10.3390/math10193526. Online publication date: 27-Sep-2022

      Published In

      ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 19, Issue 1
      January 2020
      345 pages
      ISSN:2375-4699
      EISSN:2375-4702
      DOI:10.1145/3338846
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 10 August 2019
      Accepted: 01 May 2019
      Revised: 01 March 2019
      Received: 01 September 2018
      Published in TALLIP Volume 19, Issue 1

      Author Tags

      1. Indic languages
      2. Tagging
      3. convolutional neural network
      4. multitask learning
      5. recurrent neural network

      Qualifiers

      • Research-article
      • Research
      • Refereed
