Abstract
In this paper, we propose a classifier ensemble technique based on a genetic algorithm (GA) for named entity recognition (NER). We assume that classifiers built on different feature representations can be effectively combined using a GA to achieve better performance than any of them alone. The proposed approach also determines the appropriate ensemble scheme, i.e., either majority voting or weighted voting. A maximum entropy (ME) model serves as the base learner, and a number of different classifiers are generated from it by varying the representation of the available features. The proposed approach is evaluated for three leading Indian languages, namely Bengali, Hindi and Telugu. The evaluation yields recall, precision and F-measure values of 88.12, 93.99 and 90.96%, respectively, for Bengali; 80.26, 92.70 and 86.03%, respectively, for Hindi; and 74.79, 85.38 and 79.73%, respectively, for Telugu. We also evaluate the proposed approach on the CoNLL-2003 benchmark English datasets, where it achieves recall, precision and F-measure values of 83.05, 85.52 and 84.27%, respectively. For all the languages, the GA-based ensemble outperforms every individual classifier as well as two conventional baseline ensembles.
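The idea of searching for voting weights with a GA can be illustrated with a minimal Python sketch. It is not the authors' implementation: the real-valued chromosome (one weight per classifier), the token-accuracy fitness, the truncation selection, and the names `vote`, `accuracy` and `evolve_weights` are all assumptions made for illustration, and the toy labels are invented. Equal weights reduce to majority voting, so the sketch places that vector in the initial population and keeps the best individuals in every generation.

```python
import random
from collections import defaultdict


def vote(predictions, weights):
    """Combine per-classifier label sequences by weighted voting.

    predictions: one label sequence per classifier, all of equal length.
    weights: one non-negative weight per classifier; equal weights
             reduce to plain majority voting.
    """
    combined = []
    for token_labels in zip(*predictions):
        scores = defaultdict(float)
        for label, weight in zip(token_labels, weights):
            scores[label] += weight
        combined.append(max(scores, key=scores.get))
    return combined


def accuracy(predicted, gold):
    """Token-level accuracy (an assumed, simplified fitness measure)."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def evolve_weights(predictions, gold, pop_size=20, generations=30, seed=0):
    """Hypothetical GA over weight vectors; needs at least two classifiers."""
    rng = random.Random(seed)
    n = len(predictions)

    def fitness(weights):
        return accuracy(vote(predictions, weights), gold)

    # Start from the majority-voting baseline plus random weight vectors.
    population = [[1.0] * n]
    population += [[rng.random() for _ in range(n)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # truncation selection with elitism
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # single-point crossover
            child = mother[:cut] + father[cut:]
            child[rng.randrange(n)] = rng.random()  # mutate one gene
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


# Toy development data (invented): the first classifier is the strongest.
gold = ["B-PER", "O", "O", "B-LOC", "O", "B-PER"]
predictions = [
    ["B-PER", "O", "O", "B-LOC", "O", "B-PER"],
    ["O", "O", "O", "O", "O", "B-PER"],
    ["O", "O", "B-LOC", "O", "O", "B-PER"],
]
best = evolve_weights(predictions, gold)
print("majority voting:", accuracy(vote(predictions, [1.0] * 3), gold))
print("evolved weights:", accuracy(vote(predictions, best), gold))
```

Because the best individuals survive each generation and the equal-weight vector is in the starting population, the evolved weights can never score below majority voting on the data used for fitness. Whether that advantage carries over to unseen text depends on the held-out evaluation, which this sketch does not model.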
Additional information
Asif Ekbal and Sriparna Saha have equally contributed to this article.
About this article
Cite this article
Ekbal, A., Saha, S. Classifier Ensemble Selection Using Genetic Algorithm for Named Entity Recognition. Res on Lang and Comput 8, 73–99 (2010). https://doi.org/10.1007/s11168-010-9071-0
DOI: https://doi.org/10.1007/s11168-010-9071-0