Abstract
We present Lemmald, a new mixed-method lemmatizer for Icelandic, which achieves good performance by relying on IceTagger [1] for tagging and on the Icelandic Frequency Dictionary [2] corpus for training. We combine the advantages of data-driven machine learning with linguistic insights to maximize performance. To achieve this, we use a novel approach, the Hierarchy of Linguistic Identities (HOLI), which organizes the features and feature structures used for machine learning according to linguistic knowledge. The accuracy of the lemmatization is further improved by an add-on that connects to the Database of Modern Icelandic Inflections [3]. Given correct tagging, our system lemmatizes Icelandic text with an accuracy of 99.55%. We believe our method can be fruitfully adapted to other morphologically rich languages.
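The abstract does not specify how HOLI is realized, so the following Python fragment is only a rough sketch of the general idea of combining learned suffix rules with a dictionary lookup. It assumes, purely for illustration, that the "linguistic identities" are feature keys ordered from most specific to least specific and that the lemmatizer backs off along that order. The function names, the choice of features, the tag strings (written in the style of the IFD tagset), and the toy training data are all hypothetical and are not taken from Lemmald.

```python
from collections import Counter, defaultdict


def suffix_rule(form, lemma):
    """Return (strip, add): remove `strip` from the end of the form, then append `add`."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return form[i:], lemma[i:]


def identities(form, tag):
    """Feature keys from most to least specific (an assumed ordering, not the paper's)."""
    return [
        ("suffix3+tag", form[-3:], tag),
        ("suffix2+tag", form[-2:], tag),
        ("suffix1+class", form[-1:], tag[0]),
        ("class", "", tag[0]),
    ]


def train(examples):
    """Learn the most frequent suffix rule for every feature key seen in training."""
    counts = defaultdict(Counter)
    for form, tag, lemma in examples:
        rule = suffix_rule(form, lemma)
        for key in identities(form, tag):
            counts[key][rule] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def lemmatize(form, tag, model, lexicon=None):
    """Try the lexicon first (hypothetical stand-in for the inflection database add-on),
    then back off through the feature keys until an applicable rule is found."""
    if lexicon and (form, tag) in lexicon:
        return lexicon[(form, tag)]
    for key in identities(form, tag):
        rule = model.get(key)
        if rule is not None:
            strip, add = rule
            if form.endswith(strip):
                return form[: len(form) - len(strip)] + add
    return form  # no applicable rule: leave the word form unchanged


# Toy training triples (word form, tag, lemma); illustrative only.
examples = [
    ("hestar", "nkfn", "hestur"),
    ("bátar", "nkfn", "bátur"),
    ("hesti", "nkeþ", "hestur"),
]
model = train(examples)
lexicon = {("menn", "nkfn"): "maður"}  # irregular form handled by lookup

print(lemmatize("garðar", "nkfn", model))         # unseen word, handled by a learned rule
print(lemmatize("menn", "nkfn", model, lexicon))  # irregular word, handled by the lexicon
```

In this sketch the unseen form "garðar" is not covered by the most specific key, so the lookup falls back to the two-character suffix combined with the tag. The irregular form "menn" cannot be handled by any learned suffix rule and is resolved only when the lexicon is supplied, which mirrors in a simplified way the role the abstract assigns to the inflection database add-on.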
References
1. Loftsson, H.: Tagging Icelandic text: A linguistic rule-based approach. Nordic Journal of Linguistics 31(1), 47–72 (2008)
2. Pind, J., Magnússon, F., Briem, S.: Íslensk orðtíðnibók [The Icelandic Frequency Dictionary]. The Institute of Lexicography, University of Iceland, Reykjavik (1991)
3. Bjarnadóttir, K.: Modern Icelandic Inflections. In: Holmboe, H. (ed.) Nordisk Sprogteknologi 2005. Museum Tusculanums Forlag, Copenhagen (2005)
4. Korenius, T., Laurikkala, J., Järvelin, K., Juhola, M.: Stemming and lemmatization in the clustering of Finnish text documents. In: CIKM 2004: Proceedings of the thirteenth ACM international conference on Information and knowledge management, pp. 625–633. ACM, New York (2004)
5. Braschler, B., Ripplinger, B.: How Effective is Stemming and Decompounding for German Text Retrieval? Information Retrieval 7(3-4), 291–316 (2004)
6. Airio, E.: Word normalization and decompounding in mono- and bilingual IR. Information Retrieval 9(3), 249–271 (2006)
7. Krauwer, S.: The Basic Language Resource Kit (BLARK) as the First Milestone for the Language Resources Roadmap. SPECOM-2003, Moscow, Russia (2003), http://www.elsnet.org/dox/krauwer-specom2003.pdf (accessed 01.04.2008)
8. Cassata, F.: Automatic thesaurus extraction for Icelandic. BSc Final Project, Department of Computer Science, Reykjavik University (2007)
9. Loftsson, H., Rögnvaldsson, E.: IceNLP: A Natural Language Processing Toolkit for Icelandic. In: Proceedings of Interspeech 2007, Special Session: Speech and language technology for less-resourced languages, Antwerp, Belgium (2007)
10. Porter, M.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
11. Jongejan, B., Haltrup, D.: The CST Lemmatiser. Center for Sprogteknologi, University of Copenhagen, version 2.9 (2005)
12. Carlberger, J., Dalianis, H., Hassel, M., Knutsson, O.: Improving precision in information retrieval for Swedish using stemming. In: Proceedings of NODALIDA 2001 – 13th Nordic conference on computational linguistics (2001)
13. Dalianis, H., Jongejan, B.: Hand-crafted versus Machine-learned Inflectional Rules: The Euroling-SiteSeeker Stemmer and CST’s Lemmatiser. In: LREC 2006: Proceedings of the International Conference on Language Resources and Evaluation (2006)
14. Helgadóttir, S.: Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic. In: Holmboe, H. (ed.) Nordisk Sprogteknologi 2004. Museum Tusculanums Forlag, Copenhagen (2005)
15. Manning, C.: Focusing on Linguistic Representations [abstract]. In: The Natural Language and Speech Processing Colloquium, Stanford, January 19 (2005)
16. Kenstowicz, M.: Phonology in Generative Grammar (Blackwell Textbooks in Linguistics). Blackwell Publishers, Malden (1993)
17. Prince, A., Smolensky, P.: Optimality Theory: Constraint Interaction in Generative Grammar. Manuscript, Rutgers University and University of Colorado at Boulder. ROA [ROA #537] (1993/2002), http://roa.rutgers.edu/
18. Lezius, W., Rapp, R., Wettler, M.: A freely available Morphological Analyzer, Disambiguator, and Context Sensitive Lemmatizer for German. In: Proceedings of the COLING-ACL, pp. 743–747 (1998)
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ingason, A.K., Helgadóttir, S., Loftsson, H., Rögnvaldsson, E. (2008). A Mixed Method Lemmatization Algorithm Using a Hierarchy of Linguistic Identities (HOLI). In: Nordström, B., Ranta, A. (eds) Advances in Natural Language Processing. GoTAL 2008. Lecture Notes in Computer Science, vol 5221. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85287-2_20
DOI: https://doi.org/10.1007/978-3-540-85287-2_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85286-5
Online ISBN: 978-3-540-85287-2
eBook Packages: Computer Science, Computer Science (R0)