
Named Entity Recognition with Word Embeddings and Wikipedia Categories for a Low-Resource Language

Published: 20 January 2017

Abstract

In this article, we propose a word embedding-based named entity recognition (NER) approach. NER is commonly treated as a sequence labeling task and addressed with methods such as conditional random fields (CRFs). However, for low-resource languages that lack sufficiently large training data, methods such as CRFs do not perform well. In our work, we use the proximity of the vector embeddings of words to approach the NER problem. The hypothesis is that the vectors of words belonging to the same name category, such as persons' names, lie close to one another in the abstract vector space of the embedded words. Assuming that this clustering hypothesis holds, we apply a standard classification approach to the word vectors to learn a decision boundary between the NER classes. Our NER experiments are conducted on a morphologically rich, low-resource language, namely Bengali. Our approach significantly outperforms standard baseline CRF approaches that use cluster labels of word embeddings and gazetteers constructed from Wikipedia. Further, we propose an unsupervised approach that, in the absence of training data, uses a named entity (NE) gazetteer created automatically from Wikipedia. For a low-resource language, the word vectors obtained from Wikipedia are not sufficient to train a classifier. We therefore use the distance between the vector embeddings of words to expand the set of Wikipedia training examples with additional NEs extracted from a monolingual corpus, which yields a significant improvement in unsupervised NER performance. In fact, our expansion method performs better than the traditional supervised CRF-based approach (F-score of 65.4% vs. 64.2%). Finally, we compare our proposed approach with the official submissions to the IJCNLP-2008 Bengali NER shared task and achieve an overall F-score improvement of 11.26% over the best official system.
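
The two ideas in the abstract, expanding a small seed gazetteer through embedding proximity and then classifying word vectors, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two-dimensional vectors and the word list are invented toy data, the function names (`expand_seeds`, `centroids`, `classify`) and the similarity threshold are hypothetical, cosine similarity stands in for the unspecified distance measure, and a nearest-centroid rule stands in for the "standard classification approach", which the abstract does not name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_seeds(seeds, embeddings, threshold):
    """Give each unlabeled word the class of its most similar seed word,
    but only if that similarity reaches the threshold."""
    expanded = dict(seeds)
    for word, vec in embeddings.items():
        if word in seeds:
            continue
        best_seed = max(seeds, key=lambda s: cosine(vec, embeddings[s]))
        if cosine(vec, embeddings[best_seed]) >= threshold:
            expanded[word] = seeds[best_seed]
    return expanded

def centroids(labeled, embeddings):
    """Compute the mean embedding of the words in each class."""
    sums, counts = {}, {}
    for word, label in labeled.items():
        vec = embeddings[word]
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}

def classify(vec, class_centroids):
    """Assign the class whose centroid is most similar to the vector."""
    return max(class_centroids, key=lambda c: cosine(vec, class_centroids[c]))

# Invented toy embeddings; real ones would come from a trained embedding model.
EMBEDDINGS = {
    "rabindranath": [0.9, 0.1],
    "nazrul": [0.85, 0.2],
    "kolkata": [0.1, 0.9],
    "dhaka": [0.2, 0.95],
    "river": [0.6, 0.6],
}
# Seed gazetteer, standing in for entries harvested from Wikipedia categories.
SEEDS = {"rabindranath": "PERSON", "kolkata": "LOCATION"}

expanded = expand_seeds(SEEDS, EMBEDDINGS, threshold=0.95)
model = centroids(expanded, EMBEDDINGS)
print(expanded)
print(classify([0.8, 0.3], model))
```

The threshold controls how aggressively the seed set grows: a lower value admits more candidate entities at the risk of adding words, such as the toy `river` vector, that sit between classes.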

    Information

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 16, Issue 3
    September 2017
    167 pages
    ISSN: 2375-4699
    EISSN: 2375-4702
    DOI: 10.1145/3041821
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 20 January 2017
    Accepted: 01 November 2016
    Revised: 01 October 2016
    Received: 01 September 2015
    Published in TALLIP Volume 16, Issue 3

    Author Tags

    1. CRF-based NER
    2. Wikipedia-based NER
    3. Word embedding
    4. classifier
    5. language-independent NER
    6. unsupervised NER

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • ADAPT Centre at DCU
    • Science Foundation Ireland (SFI)
    • Indian Statistical Institute, Kolkata, India
