The topic of this presentation is a rule-based pipeline for converting constituency treebanks based on the Penn Treebank format to Universal Dependencies (UD). We describe the conversion process, the methods used to deliver a fully automated UD corpus, and the complications involved. An Icelandic constituency treebank is converted to a UD corpus, and the converter is extended to convert a Faroese constituency treebank. The result is an open-source conversion tool, published under an Apache 2.0 license and applicable to any Penn-style treebank, along with two new UD corpora, one Icelandic and one Faroese. Both are included in version 2.7 of UD.
Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC), 2020
In this paper, we present a character-based BiLSTM model for splitting Icelandic compound words, and show how varying amounts of training data affect the performance of the model. Compounding is highly productive in Icelandic, and new compounds are constantly being created. This results in a large number of out-of-vocabulary (OOV) words, negatively impacting the performance of many NLP tools. Our model is trained on a dataset of 2.9 million unique word forms and their constituent structures from the Database of Icelandic Morphology. The model learns how to split compound words into two parts and can be used to derive the constituent structure of any word form. Knowing the constituent structure of a word form makes it possible to generate the optimal split for a given task, e.g., a full split for subword tokenization, or, in the case of part-of-speech tagging, splitting an OOV word until the largest known morphological head is found. The model outperforms other previously published methods when evaluated on a corpus of manually split word forms. This method has been integrated into Kvistur, an Icelandic compound word analyzer.
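The step from a two-way split to a full constituent structure can be illustrated with a minimal sketch. It is not the Kvistur implementation: the toy lookup table `TOY_SPLITS` stands in for the trained BiLSTM model, the function names are hypothetical, and the assumption that the morphological head is the right-hand constituent is made here only for illustration.

```python
from typing import List, Optional, Set, Tuple, Union

# Toy stand-in for the trained splitter (assumption): each entry maps a
# word form to its binary split. A real system would query the model.
TOY_SPLITS = {
    "skólabókasafn": ("skóla", "bókasafn"),
    "bókasafn": ("bóka", "safn"),
}

Tree = Union[str, Tuple["Tree", "Tree"]]


def split_once(word: str) -> Optional[Tuple[str, str]]:
    """Return the binary split of a word, or None if it is not a compound."""
    return TOY_SPLITS.get(word)


def constituent_tree(word: str) -> Tree:
    """Apply the binary split recursively to build a constituent tree."""
    parts = split_once(word)
    if parts is None:
        return word
    left, right = parts
    return (constituent_tree(left), constituent_tree(right))


def full_split(word: str) -> List[str]:
    """Flatten the structure into its smallest parts (subword tokenization)."""
    parts = split_once(word)
    if parts is None:
        return [word]
    return full_split(parts[0]) + full_split(parts[1])


def largest_known_head(word: str, vocabulary: Set[str]) -> Optional[str]:
    """Split until the right-hand constituent (assumed head) is in vocabulary."""
    current = word
    while current not in vocabulary:
        parts = split_once(current)
        if parts is None:
            return None  # no known head can be reached
        current = parts[1]
    return current
```

With these toy entries, `full_split("skólabókasafn")` yields the three smallest parts, while `largest_known_head` stops at `bókasafn` when that form is in the vocabulary, which is the coarser split the abstract describes for part-of-speech tagging.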
Since 2016, the Tour de CLARIN initiative has been periodically highlighting prominent user involvement activities in the CLARIN network in order to increase the visibility of its members, reveal the richness of the CLARIN landscape, and display the full range of activities that show what CLARIN has to offer to researchers, teachers, students, professionals and the general public interested in using and processing language data in various forms. In 2019, we expanded the initiative to also feature the work of CLARIN Knowledge Centres, which provide knowledge and expertise in specific areas to researchers, educators and developers alike. Initially conceived as a series of blog posts published on the CLARIN website, Tour de CLARIN soon proved to be one of our flagship outreach initiatives, which has been released in the form of two printed volumes. This third volume of Tour de CLARIN is organized into two parts. In Part 1, we present the six countries which have been featured sin...
This article addresses the problems involved in lemmatizing compounds in a bilingual dictionary in which Icelandic is the source language. Because some words can show varying forms as the first element of a compound, lemma selection will not exclusively reflect semantic lexicalization. It must also be taken into account that lexicalization is in many cases restricted to one particular form variant. The situation is further complicated by the fact that compounds following a productive word-formation pattern may contain polysemous constituents, or that the constituents may stand in an ambiguous relation to one another.
In Icelandic, as in many other languages, phrasal compounds are an interface phenomenon of the different components of grammar. The rules of syntax seem to be preserved in the phrasal component of Icelandic compounds, as they show full internal case assignment and agreement. Phrasal compounds in Icelandic can be divided into two distinct groups. The first group contains common words which are part of the core vocabulary irrespective of genre, and these are not stylistically marked in any way. Examples of these structures can be found in texts from the 13th century onwards. The second group contains more complex compounds, mainly found in informal writing, as in blogs, and in speech. These seem to be 20th century phenomena. Phrasal compounds of both types are relatively rare in Icelandic, but other types of compounding are extremely productive. Traditionally, Icelandic compounds are divided into two groups, i.e., compounds containing stems and compounds containing inflected word form...
This article deals with the description of verbs in an electronic edition of the monolingual Icelandic standard dictionary, Íslensk orðabók (2000), and how it can be used in a bilingual context. The main difference between a printed and an electronic dictionary lies in their modes of presentation: in a book format, the extent of the text determines the design of the description, whereas in an electronic presentation one must consider how much text can be displayed on the screen. The text must be divided into suitably sized units. To achieve this in the electronic edition of Íslensk orðabók, verbal constructions are listed as sublemmas under the individual verbs. This means that the syntagmatic description becomes more detailed and systematic than in the older editions of the dictionary. The article assesses the value of this presentation, especially with regard to bilingual dictionary description.
In connection with the Nordic Council of Ministers' grant to launch a Nordic language technology research programme, it was stated that it was important for the programme to present its results and otherwise draw attention to itself as a useful contribution to Nordic cooperation, both in professional circles and to a wider group of interested parties. The present yearbook covers the language technology programme's activities in the latter part of 2004 and the first part of 2005; it is an attempt to ...
The topic of this paper is The Database of Icelandic Morphology (DIM), a multipurpose linguistic resource, created for use in language technology, as a reference for the general public in Iceland, and for use in research on the Icelandic language. DIM contains inflectional paradigms and analysis of word formation, with a vocabulary of approx. 285,000 lemmas. DIM is based on The Database of Modern Icelandic Inflection, which has been in use since 2004.
This collection of papers on phrasal compounding is part of a bigger project whose aims are twofold: First, it seeks to broaden the typological perspective by providing data for as many different languages as possible to gain a better understanding of the phenomenon itself. Second, based on these data, which clearly show interaction between syntax and morphology, it aims to discuss theoretical models which deal with this kind of interaction in different ways. Models like Generative Grammar assume components of grammar and a clear-cut distinction between the lexicon (often including morphology) and grammar. Other models, like construction grammar, do not assume such components and are rather based on a lexicon including constructs. A comparison of these models on the basis of this phenomenon on the morphology-syntax interface makes it possible to assess their descriptive and explanatory power.
Lemmatization, finding the basic morphological form of a word in a corpus, is an important step in many natural language processing tasks when working with morphologically rich languages. We describe and evaluate Nefnir, a new open source lemmatizer for Icelandic. Nefnir uses suffix substitution rules, derived from a large morphological database, to lemmatize tagged text. Evaluation shows that for correctly tagged text, Nefnir obtains an accuracy of 99.55%, and for text tagged with a PoS tagger, the accuracy obtained is 96.88%.
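The general idea of tag-conditioned suffix substitution can be sketched as follows. This is an illustration only, not Nefnir's code or rule format: the two rules, the tag names `N-DAT-PL` and `N-NOM-PL`, and the longest-suffix-first lookup are all assumptions made for the example.

```python
from typing import Dict, Tuple

# Hypothetical rules: (PoS tag, word-form suffix) -> lemma suffix.
# A real rule set would be derived from a morphological database.
RULES: Dict[Tuple[str, str], str] = {
    ("N-DAT-PL", "um"): "ur",  # hestum -> hestur
    ("N-NOM-PL", "ar"): "ur",  # hestar -> hestur
}


def lemmatize(form: str, tag: str, rules: Dict[Tuple[str, str], str]) -> str:
    """Replace the longest suffix that has a rule for this tag."""
    # i = 0 tries the whole form as a suffix; i = len(form) tries the
    # empty suffix, so the loop runs from longest to shortest suffix.
    for i in range(len(form) + 1):
        replacement = rules.get((tag, form[i:]))
        if replacement is not None:
            return form[:i] + replacement
    return form  # no rule applies: fall back to the form itself
```

Because the lookup is keyed on the tag as well as the suffix, the quality of the input tagging directly limits the result, which is consistent with the gap the abstract reports between correctly tagged text and text from a PoS tagger.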
Using a Computer Corpus to Supplement a Citation Collection. Jörgen Pind, Kristín Bjarnadóttir, Jón Hilmar Jónsson, Guðrún Kvaran, Friðrik Magnússon, Ásta Svavarsdóttir, Institute of Lexicography, University of Iceland, IS-107 Reykjavík, Iceland ...
Compounding is extremely productive in Icelandic and multi-word compounds are common. The likelihood of finding previously unseen compounds in texts is thus very high, which makes out-of-vocabulary words a problem in the use of NLP tools. Kvistur, the decompounder described in this paper, splits Icelandic compounds and shows their binary constituent structure. The probability of a constituent in an unknown (or unanalysed) compound forming a combined constituent with either of its neighbours is estimated, with the use of data on the constituent structure of over 240 thousand compounds from the Database of Modern Icelandic Inflection (Kristín Bjarnadóttir 2012), and word frequencies from Íslenskur orðasjóður, a corpus of approx. 550 million words. Thus, the structure of an unknown compound is derived by comparison with compounds with partially the same constituents and similar structure in the training data. The granularity of the split returned by the decompounder is important in t...
Papers by Kristín Bjarnadóttir