A Mixed Method Lemmatization Algorithm Using a Hierarchy of Linguistic Identities (HOLI)

  • Conference paper
Advances in Natural Language Processing (GoTAL 2008)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5221)


We present Lemmald, a new mixed method lemmatizer for Icelandic, which achieves good performance by relying on IceTagger [1] for tagging and on the Icelandic Frequency Dictionary [2] corpus for training. We combine the advantages of data-driven machine learning with linguistic insights to maximize performance. To achieve this, we use a novel approach, the Hierarchy of Linguistic Identities (HOLI), which organizes the features and feature structures used for machine learning according to linguistic knowledge. Lemmatization accuracy is further improved by an add-on that connects to the Database of Modern Icelandic Inflections [3]. Given correct tagging, our system lemmatizes Icelandic text with an accuracy of 99.55%. We believe our method can be fruitfully adapted to other morphologically rich languages.
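
The abstract does not spell out how the hierarchy is built, so the following is only a minimal sketch of the general idea, not the Lemmald algorithm itself. It assumes that a token can be described by several "identities" ordered from most to least specific, that each identity votes for a suffix-substitution rule learned from tagged training data, and that a lexicon lookup (standing in for the inflection database add-on) takes priority when it has an entry. The class `ToyLemmatizer`, the function `identities`, the particular hierarchy levels, and the simplified tags such as `N-M-PL-NOM` are all hypothetical and are not taken from the paper or from the IceTagger tagset.

```python
from collections import Counter, defaultdict


def suffix_rule(form, lemma):
    """Return (strip, add): drop `strip` from the end of form, then append `add`."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return form[i:], lemma[i:]


def apply_rule(form, rule):
    strip, add = rule
    if strip and not form.endswith(strip):
        return form  # the rule does not fit this form; leave it unchanged
    return form[: len(form) - len(strip)] + add


def identities(form, tag):
    """Hypothetical hierarchy, ordered from most specific to least specific."""
    return [
        ("form+tag", form, tag),
        ("suffix3+tag", form[-3:], tag),
        ("suffix2+tag", form[-2:], tag),
        ("suffix1+tag", form[-1:], tag),
        ("tag", "", tag),
    ]


class ToyLemmatizer:
    def __init__(self, lexicon=None):
        self.rules = defaultdict(Counter)  # identity -> counts of observed rules
        self.lexicon = lexicon or {}       # (form, tag) -> lemma, assumed add-on

    def train(self, examples):
        for form, tag, lemma in examples:
            rule = suffix_rule(form, lemma)
            for ident in identities(form, tag):
                self.rules[ident][rule] += 1

    def lemmatize(self, form, tag):
        # Assumption: a lexicon entry, when present, overrides learned rules.
        if (form, tag) in self.lexicon:
            return self.lexicon[(form, tag)]
        # Back off through the hierarchy until some identity was seen in training.
        for ident in identities(form, tag):
            if ident in self.rules:
                rule, _count = self.rules[ident].most_common(1)[0]
                return apply_rule(form, rule)
        return form


# Toy training data with simplified, made-up tags.
training = [
    ("hestar", "N-M-PL-NOM", "hestur"),
    ("hundar", "N-M-PL-NOM", "hundur"),
    ("hestum", "N-M-PL-DAT", "hestur"),
]
lemmatizer = ToyLemmatizer(lexicon={("menn", "N-M-PL-NOM"): "maður"})
lemmatizer.train(training)

print(lemmatizer.lemmatize("fiskar", "N-M-PL-NOM"))  # -> fiskur
print(lemmatizer.lemmatize("hundum", "N-M-PL-DAT"))  # -> hundur
print(lemmatizer.lemmatize("menn", "N-M-PL-NOM"))    # -> maður (from the lexicon)
```

In this sketch the unseen form "fiskar" is resolved at the two-character suffix level, because no more specific identity was observed in training. The irregular plural "menn" is only handled correctly because the lexicon supplies its lemma, which illustrates why a lookup resource can help a rule learner on irregular forms.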

References


  1. Loftsson, H.: Tagging Icelandic text: A linguistic rule-based approach. Nordic Journal of Linguistics 31(1), 47–72 (2008)

  2. Pind, J., Magnússon, F., Briem, S.: Íslensk orðtíðnibók [The Icelandic Frequency Dictionary]. The Institute of Lexicography, University of Iceland, Reykjavik (1991)

  3. Bjarnadóttir, K.: Modern Icelandic Inflections. In: Holmboe, H. (ed.) Nordisk Sprogteknologi 2005. Museum Tusculanums Forlag, Copenhagen (2005)

  4. Korenius, T., Laurikkala, J., Järvelin, K., Juhola, M.: Stemming and lemmatization in the clustering of Finnish text documents. In: CIKM 2004: Proceedings of the thirteenth ACM international conference on Information and knowledge management, pp. 625–633. ACM, New York (2004)

  5. Braschler, B., Ripplinger, B.: How Effective is Stemming and Decompounding for German Text Retrieval? Information Retrieval 7(3-4), 291–316 (2004)

  6. Airio, E.: Word normalization and decompounding in mono- and bilingual IR. Information Retrieval 9(3), 249–271 (2006)

  7. Krauwer, S.: The Basic Language Resource Kit (BLARK) as the First Milestone for the Language Resources Roadmap. SPECOM-2003, Moscow, Russia, Accessed 01.04.2008 (2003), http://www.elsnet.org/dox/krauwer-specom2003.pdf

  8. Cassata, F.: Automatic thesaurus extraction for Icelandic. BSc Final Project, Department of Computer Science, Reykjavik University (2007)

  9. Loftsson, H., Rögnvaldsson, E.: IceNLP: A Natural Language Processing Toolkit for Icelandic. In: Proceedings of Interspeech 2007, Special Session: Speech and language technology for less-resourced languages, Antwerp, Belgium (2007)

  10. Porter, M.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)

  11. Jongejan, B., Haltrup, D.: The CST Lemmatiser. Center for Sprogteknologi, University of Copenhagen version 2.9 (2005)

  12. Carlberger, J., Dalianis, H., Hassel, M., Knutsson, O.: Improving precision in information retrieval for Swedish using stemming. In: Proceedings of NODALIDA 2001 – 13th Nordic conference on computational linguistics (2001)

  13. Dalianis, H., Jongejan, B.: Hand-crafted versus Machine-learned Inflectional Rules: The Euroling-SiteSeeker Stemmer and CST’s Lemmatiser. In: LREC 2006: Proceedings of the International Conference on Language Resources and Evaluation (2006)

  14. Helgadóttir, S.: Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic. In: Holmboe, H. (ed.) Nordisk Sprogteknologi 2004. Museum Tusculanums Forlag, Copenhagen (2005)

  15. Manning, C.: Focusing on Linguistic Representations [abstract]. In: The Natural Language and Speech Processing Colloquium, Stanford, January 19 (2005)

  16. Kenstowicz, M.: Phonology in Generative Grammar (Blackwell Textbooks in Linguistics). Blackwell Publishers, Malden (1993)

  17. Prince, A., Smolensky, P.: Optimality Theory: Constraint Interaction in Generative Grammar. Manuscript, Rutgers University and University of Colorado at Boulder. ROA [ROA #537] (1993/2002), http://roa.rutgers.edu/

  18. Lezius, W., Rapp, R., Wettler, M.: A freely available Morphological Analyzer, Disambiguator, and Context Sensitive Lemmatizer for German. In: Proceedings of the COLING-ACL, pp. 743–747 (1998)


Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ingason, A.K., Helgadóttir, S., Loftsson, H., Rögnvaldsson, E. (2008). A Mixed Method Lemmatization Algorithm Using a Hierarchy of Linguistic Identities (HOLI). In: Nordström, B., Ranta, A. (eds) Advances in Natural Language Processing. GoTAL 2008. Lecture Notes in Computer Science, vol 5221. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85287-2_20

  • DOI: https://doi.org/10.1007/978-3-540-85287-2_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-85286-5

  • Online ISBN: 978-3-540-85287-2

  • eBook Packages: Computer Science, Computer Science (R0)
