This project is in two parts: on the one hand, an Icelandic translation from Spanish of the novel Ardiente paciencia by the Chilean author Antonio Skármeta, given the title Brennandi þolinmæði, and on the other, a study of the translation of metaphors with reference to that translation. Theories of metaphor are of various kinds, and metaphors have received particular attention within translation studies because they inevitably cause problems in translation. Figurative language is usually closely bound to the culture of a given place, and when a text is translated between different cultural worlds, the source and target languages often do not use the same imagery. To analyse this cultural difference better, it helps to be familiar with the cognitive science account of conceptual metaphor. Within cognitive science, metaphors are described as a far more important phenomenon than has generally been assumed, since they shape human thought and our view of the world. If cognitive theories of metaphor...
We train several language models for Icelandic, including IceBERT, that achieve state-of-the-art performance in a variety of downstream tasks, including part-of-speech tagging, named entity recognition, grammatical error detection and constituency parsing. To train the models we introduce a new corpus of Icelandic text, the Icelandic Common Crawl Corpus (IC3), a collection of high-quality texts found online by targeting the Icelandic top-level domain (TLD). Several other public data sources are also collected for a total of 16GB of Icelandic text. To enhance the evaluation of model performance and to raise the bar in baselines for Icelandic, we translate and adapt the WinoGrande dataset for co-reference resolution. Through these efforts we demonstrate that a properly cleaned crawled corpus is sufficient to achieve state-of-the-art results in NLP applications for low to medium resource languages, by comparison with models trained on a curated corpus. We further show that initializing...
Named entity recognition (NER) can be a challenging task, especially in highly inflected languages where each entity can have many different surface forms. We have created the first NER corpus for Icelandic by annotating 48,371 named entities (NEs) using eight NE types, in a text corpus of 1 million tokens. Furthermore, we have used the corpus to train three machine learning models: first, a CRF model that makes use of shallow word features and a gazetteer function; second, a perceptron model with shallow word features and externally trained word clusters; and third, a BiLSTM model with external word embeddings. Finally, we applied simple voting to combine the model outputs. The voting method obtains an \(F_{1}\) score of 85.79, gaining 1.89 points compared to the best performing individual model. The corpus and the models are publicly available.
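As a rough illustration of the simple voting step mentioned in this abstract, the sketch below combines per-token labels from several taggers by majority vote. The function name `vote`, the label names, and the tie-breaking rule (fall back to one designated model) are assumptions made for this example; the abstract does not specify how ties were resolved.

```python
from collections import Counter


def vote(predictions, tie_break_index=0):
    """Combine per-token label sequences from several models by majority vote.

    predictions: list of label sequences, one per model, all of equal length.
    tie_break_index: which model's label to use when no single label has the
    most votes (an assumption of this sketch, not taken from the paper).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all models must label the same tokens")

    combined = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top_count = max(counts.values())
        winners = [label for label, count in counts.items() if count == top_count]
        # Use the majority label if it is unique, otherwise the fallback model.
        combined.append(winners[0] if len(winners) == 1 else labels[tie_break_index])
    return combined


# Hypothetical outputs of three taggers for the same three tokens.
crf = ["B-Person", "O", "B-Location"]
perceptron = ["B-Person", "O", "O"]
bilstm = ["O", "B-Organization", "B-Location"]

combined = vote([crf, perceptron, bilstm])
```

With three voters, a tie only occurs when all three disagree, so the choice of fallback model matters only for those tokens.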
We report on work in progress which consists of annotating an Icelandic corpus for named entities (NEs) and using it for training a named entity recognizer based on a Bidirectional Long Short-Term Memory model. Currently, we have annotated 7,538 NEs appearing in the first 200,000 tokens of a 1 million token corpus, MIM-GOLD, originally developed for serving as a gold standard for part-of-speech tagging. Our best performing model, trained on this subset of MIM-GOLD, and enriched with external word embeddings, obtains an overall F1 score of 81.3% when categorizing NEs into the following four categories: persons, locations, organizations and miscellaneous. Our preliminary results are promising, especially given the fact that 80% of MIM-GOLD has not yet been used for training.
A named entity recogniser finds named entities (proper nouns) in a text, and labels them by category. It is a fundamental tool in natural language processing, in particular in the development of information extraction systems. In this paper, we present a prototype of a named entity recogniser for Icelandic, based on artificial neural networks. The training of such networks requires a textual corpus where named entities have been labelled. As no such corpus exists for Icelandic, its creation is a subject of this project. The recogniser was built using NeuroNER, a software package designed for named entity recognition. The results indicate that this is a viable approach towards recognition of named entities in Icelandic (F1=81.3%), especially considering the moderate size of the training corpus. Word embeddings, created from a much larger unlabelled corpus, turned out to improve the results greatly, warranting further study.
Lemmatization, finding the basic morphological form of a word in a corpus, is an important step in many natural language processing tasks when working with morphologically rich languages. We describe and evaluate Nefnir, a new open source lemmatizer for Icelandic. Nefnir uses suffix substitution rules, derived from a large morphological database, to lemmatize tagged text. Evaluation shows that for correctly tagged text, Nefnir obtains an accuracy of 99.55%, and for text tagged with a PoS tagger, the accuracy obtained is 96.88%.
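To make the idea of suffix substitution rules concrete, here is a minimal sketch of how a tagged word form might be mapped to its lemma. It is not Nefnir's implementation: the `lemmatize` function, the rule table, the tag strings, and the choice to prefer the longest matching suffix are all assumptions for illustration.

```python
def lemmatize(word, tag, rules):
    """Return a lemma for `word` using suffix substitution rules for its tag.

    rules: dict mapping a PoS tag to a list of (suffix, replacement) pairs.
    The longest matching suffix wins (an assumption of this sketch); if no
    rule matches, the word form is returned unchanged.
    """
    best = None
    for suffix, replacement in rules.get(tag, []):
        if word.endswith(suffix) and (best is None or len(suffix) > len(best[0])):
            best = (suffix, replacement)
    if best is None:
        return word
    suffix, replacement = best
    return word[: len(word) - len(suffix)] + replacement


# Hypothetical rule table with made-up tag names; a real system would derive
# such rules from a morphological database rather than write them by hand.
RULES = {
    "noun-masc-sg-dat-def": [("inum", "ur")],
    "noun-masc-pl-dat-def": [("unum", "ur")],
}

lemma_sg = lemmatize("hestinum", "noun-masc-sg-dat-def", RULES)
lemma_pl = lemmatize("hestunum", "noun-masc-pl-dat-def", RULES)
```

Keying the rules on the tag is what lets the same surface ending be handled differently for different word classes, which is why the accuracy of the upstream tagger affects the lemmatizer's accuracy.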
Proceedings of the 22nd Nordic Conference of Computational Linguistics (NoDaLiDa), 2019
Papers by Svanhvít Lilja Ingólfsdóttir