Abstract
In this paper we present several optimizations introduced to a Conditional Random Fields-based model for proper name recognition in Polish running text. The proposed optimizations address word-level segmentation problems, gazetteer incompleteness, the problem of unambiguous generalization features, feature construction and selection, and finally the recognition of common proper names on the basis of external knowledge sources. The problem of proper name recognition is limited to the recognition of person first names and surnames and the names of countries, cities and roads. The evaluation is performed in two ways: a single-domain evaluation using 10-fold cross-validation on a Corpus of Stock Exchange Reports, and a cross-domain evaluation on a Corpus of Economic News. An additional corpus of Wikipedia articles, namely InfiKorp, is used in the feature selection. Finally, we evaluate three configurations of the proposed modifications. The top configuration improved the F-measure from 94.53% to 95.65% in the single-domain evaluation and from 70.86% to 79.63% in the cross-domain evaluation.
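The abstract reports results as F-measure without restating how it is computed. As a minimal sketch only, the following shows one common way to compute precision, recall and F-measure over recognized name spans. It assumes exact matching on span boundaries and category, which the abstract does not specify; the `Span` representation, the `f_measure` function and the category labels are hypothetical and not taken from the paper.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start_token, end_token, category).
Span = Tuple[int, int, str]


def f_measure(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) under exact span-and-category matching.

    Assumption: a predicted name counts as correct only if its boundaries
    and its category both match a gold annotation exactly.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # Balanced F-measure: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data, invented for illustration only.
gold = {(0, 1, "first_name"), (1, 2, "surname"), (5, 6, "city")}
predicted = {(0, 1, "first_name"), (1, 2, "surname"), (7, 8, "country")}
precision, recall, f1 = f_measure(gold, predicted)
```

In a 10-fold cross-validation setting, such counts would typically be gathered per fold and then aggregated; whether the reported figures are micro- or macro-averaged across folds is not stated in the abstract.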
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Marcińczuk, M., Janicki, M. (2012). Optimizing CRF-Based Model for Proper Name Recognition in Polish Texts. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science, vol 7181. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28604-9_22
DOI: https://doi.org/10.1007/978-3-642-28604-9_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28603-2
Online ISBN: 978-3-642-28604-9
eBook Packages: Computer Science, Computer Science (R0)