Skip to main content

Hrafn Loftsson

Reykjavik University, School of Computer Science, Faculty Member

Followers

28

Following

27

Co-authors

19

Mentions

1

Public Views

I completed a B.Sc. degree in Computer Science from the University of Iceland in 1989 and an M.Sc. in Computer Science and Operations Research from Pennsylvania State University, USA, in 1992. I joined Reykjavik University in 2000 after having spent about 8 years working for trading firms and investments banks in the USA, England and Iceland. In 2007, I completed my Ph.D. in Computer Science (NLP) from the University of Sheffield, England.
Supervisors: Yorick Wilks

less

Uploads

Papers by Hrafn Loftsson

Evaluating Icelandic Sentiment Analysis Models Trained on Translated Data

by Ólafur Aron Jóhannsson and Hrafn Loftsson

Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024, 2024

We experiment with sentiment classification models for Icelandic that leverage machine-translated... more We experiment with sentiment classification models for Icelandic that leverage machine-translated data for training. Since no large sentiment dataset exists for Icelandic, we translate 50,000 English IMDb reviews, classified either as positive or negative, into Icelandic using two services: Google Translate and GreynirTranslate. After machine translation, we assess whether the sentiment of the source language text is retained in the target language. Moreover, we evaluate the accuracy of the sentiment classifiers on non-translated Icelandic text. The performance of three types of baseline classifiers is compared, i.e., Support Vector Machines, Logistic Regression and Naive Bayes, when trained on translated data generated by either translation service. Furthermore, we fine-tune and evaluate three pre-trained transformer-based models, RoBERTa, IceBERT and ELECTRA, on both the original English texts and the translated texts. Our results indicate that the transformer models perform better than the baseline classifiers on all datasets. Furthermore, our evaluation shows that the transformer models trained on data translated from English reviews can be used to effectively classify sentiment on non-translated Icelandic movie reviews.

Unsupervised Outlier Detection for Language-Independent Text Quality Filtering

Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024, 2024

Web-crawled corpora offer an abundant source of training data for language models. However, they ... more Web-crawled corpora offer an abundant source of training data for language models. However, they are generally noisy and are typically filtered using heuristic rules or classifiers. These methods require careful tuning or labeling by fluent speakers. In this paper, we assess the effectiveness of commonly applied rules on TQ-IS, a manually labeled text quality dataset for Icelandic. Additionally, we advocate for the utilization of unsupervised clustering and outlier detection algorithms for filtering. These algorithms are language-independent, computationally efficient and do not require language expertise. Using grid search, we find the optimal configuration for every combination of rules, optimizing for F1 score on TQ-IS. For a rule-based approach, we discover that optimal results can be achieved with only a small subset of the full ruleset. Using five rules, we obtain an F1 score of 98.2%. We then evaluate three unsupervised algorithms, i.e., Gaussian Mixture Models (GMMs), Isolation Forests and One-Class SVMs. Our findings reveal that unsupervised algorithms perform well on the TQ-IS dataset, with GMMs obtaining the best results, comparable to those obtained with the rule-based approach. Finally, we show that unsupervised methods appear to be equally suitable for languages other than Icelandic, including Estonian and Basque.

Text Filtering Classifiers for Medium-Resource Languages

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024

Web-crawled corpora are essential resources for linguistic and NLP research, offering far more da... more Web-crawled corpora are essential resources for linguistic and NLP research, offering far more data than is available from curated corpora. However, they often contain a great deal of low-quality texts which can complicate research and degrade the quality of pre-trained language models. Therefore, they are typically filtered, e.g. by applying rules or classifiers. In this paper, we compare the effectiveness of various text filtering classifiers and measure their impact on language model performance for three medium-resource languages. We present TQ-IS, an Icelandic text quality dataset consisting of 2,000 web-crawled documents, in which spans of low-quality text have been manually identified and labeled. We then evaluate a perplexity-based classifier, a supervised classifier trained on TQ-IS, and a self-supervised classifier trained to discern between documents from curated and web-crawled corpora on Icelandic, Estonian and Basque. We find that these classifiers obtain F1 scores of 94.48%, 99.01% and 93.40%, respectively, when evaluated on the TQ-IS dataset. Furthermore, our results show that while adding filtered web-crawled text to a pre-training corpus can improve downstream performance for pre-trained language models, any improvement is likely to remain modest unless the web-crawled corpus is significantly larger in size.

SentAlign: Accurate and Scalable Sentence Alignment

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2023

We present SentAlign, an accurate sentence alignment tool designed to handle very large parallel ... more We present SentAlign, an accurate sentence alignment tool designed to handle very large parallel document pairs. Given user-defined parameters, the alignment algorithm evaluates all possible alignment paths in fairly large documents of thousands of sentences and uses a divide-and-conquer approach to align documents containing tens of thousands of sentences. The scoring function is based on LaBSE bilingual sentence representations. SentAlign outperforms five other sentence alignment tools when evaluated on two different evaluation sets, German-French and English-Icelandic, and on a downstream machine translation task.

Effective Entity Disambiguation in Low-Resource Languages: A Study of Icelandic.

Proceedings of the 22nd IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology, 2023

Entity disambiguation (ED) is integral to the task of entity linking (EL), the task of identifyin... more Entity disambiguation (ED) is integral to the task of entity linking (EL), the task of identifying named entities in text and linking them to their corresponding entries in a knowledge base (KB). In this paper, we present an effective and efficient ED system for Icelandic, using the Icelandic Wikipedia as a KB. We focus on disambiguation, the linking aspect of EL, assuming that an entity mention has already been located. We perform candidate generation using an alias table and Wikipedia search, and achieve candidate ranking through fine-grained entity typing and the use of an entity-aware Icelandic language model, IceLUKE. We study and compare the effects of different variations of candidate generation and candidate ranking, with the best approach reaching an accuracy of 95.2%. Our results highlight the importance of using an entity-aware language model in the candidate ranking step and show a minor improvement in using fine-grained entity typing to decrease the size of the candidate set before ranking.

Do Not Discard -Extracting Useful Fragments from Low-Quality Parallel Data to Improve Machine Translation

Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation (CoCo4MT) at MT Summit, 2023

When parallel corpora are preprocessed for machine translation (MT) training, a part of the paral... more When parallel corpora are preprocessed for machine translation (MT) training, a part of the parallel data is commonly discarded and deemed non-parallel due to odd-length ratio, overlapping text in source and target sentences or failing some other form of a semantic equivalency test. For language pairs with limited parallel resources, this can be costly as in such cases modest amounts of acceptable data may be useful to help build MT systems that generate higher quality translations. In this paper, we refine parallel corpora for two language pairs, English-Bengali and English-Icelandic, by extracting sub-sentence fragments from sentence pairs that would otherwise have been discarded, in order to increase recall when compiling training data. We find that by including the fragments, translation quality of NMT systems trained on the data improves significantly when translating from English to Bengali and from English to Icelandic.

Do Not Discard - Extracting Useful Fragments from Low-Quality Parallel Data to Improve Machine Translation

Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation (CoCo4MT) at MT Summit, 2023

When parallel corpora are preprocessed for machine translation (MT) training, a part of the paral... more When parallel corpora are preprocessed for machine translation (MT) training, a part of the parallel data is commonly discarded and deemed non-parallel due to odd-length ratio, overlapping text in source and target sentences or failing some other form of a semantic equivalency test. For language pairs with limited parallel resources, this can be costly as in such cases modest amounts of acceptable data may be useful to help build MT systems that generate higher quality translations. In this paper, we refine parallel corpora for two language pairs, English-Bengali and English-Icelandic, by extracting sub-sentence fragments from sentence pairs that would otherwise have been discarded, in order to increase recall when compiling training data. We find that by including the fragments, translation quality of NMT systems trained on the data improves significantly when translating from English to Bengali and from English to Icelandic.

Can triaging patients with respiratory symptoms in primary care using artificial intelligence improve patient outcomes? A retrospective diagnostic accuracy study.

Annals of Family Medicine, 2023

PURPOSE Respiratory symptoms are the most common presenting complaint in primary care. Often the... more PURPOSE
Respiratory symptoms are the most common presenting complaint in primary care. Often these symptoms are self resolving, but they can indicate a severe illness. With increasing physician workload and health care costs, triaging patients before in-person consultations would be helpful, possibly offering low-risk patients other means of communication. The objective of this study was to train a machine learning model to triage patients with respiratory symptoms before visiting a primary care clinic and examine patient outcomes in the context of the triage.

METHODS
We trained a machine learning model, using clinical features only available
before a medical visit. Clinical text notes were extracted from 1,500 records for patients that received 1 of 7 International Classification of Diseases 10th Revision codes (J00, J10, JII, J15, J20, J44, J45). All primary care clinics in the Reykjavík area of Iceland were included. The model scored patients in 2 extrinsic data sets and divided them into 10 risk groups (higher values having greater risk). We analyzed selected outcomes in each group.

RESULTS
Risk groups 1 through 5 consisted of younger patients with lower C-reactive
protein values, re-evaluation rates in primary and emergency care, antibiotic prescription rates, chest x-ray (CXR) referrals, and CXRs with signs of pneumonia, compared with groups 6 through 10. Groups 1 through 5 had no CXRs with signs of pneumonia or diagnosis of pneumonia by a physician.

CONCLUSIONS
The model triaged patients in line with expected outcomes. The model can
reduce the number of CXR referrals by eliminating them in risk groups 1 through 5, thus decreasing clinically insignificant incidentaloma findings without input from clinicians.

Microservices at Your Service: Bridging the Gap between NLP Research and Industry

by Hrafn Loftsson, Sebastian Andersson, and Jarmo Hemminki

24th Nordic Conference on Computational Linguistics (NoDaLiDa), 2023

This paper describes a collaborative European project whose aim was to gather open source Natural... more This paper describes a collaborative European project whose aim was to gather open source Natural Language Processing (NLP) tools and make them accessible as running services and easy to try out in the European Language Grid (ELG). The motivation of the project was to increase accessibility for more European languages and make it easier for developers to use the underlying tools in their own applications. The project resulted in the containerization of 60 existing NLP tools for 16 languages, all of which are now currently running as easily testable services in the ELG platform.

Filtering Matters: Experiments in Filtering Training Sets for Machine Translation

24th Nordic Conference on Computational Linguistics (NoDaLiDa), 2023

We explore different approaches for filtering parallel data for MT training, whether the same fil... more We explore different approaches for filtering parallel data for MT training, whether the same filtering approaches suit different datasets, and if separate filters should be applied to a dataset depending on the translation direction. We evaluate the results of different approaches, both manually and on a downstream NMT task. We find that, first, it is beneficial to inspect how well different filtering approaches suit different datasets and, second, that while MT systems trained on data prepared using different filters do not differ substantially in quality, there is indeed a statistically significant difference. Finally, we find that the same training sets do not seem to suit different translation directions.

Is Part-of-Speech Tagging a Solved Problem for Icelandic

24th Nordic Conference on Computational Linguistics (NoDaLiDa), 2023

We train and evaluate four Part-of-Speech tagging models for Icelandic. Three are older models th... more We train and evaluate four Part-of-Speech tagging models for Icelandic. Three are older models that obtained the highest accuracy for Icelandic when they were introduced. The fourth model is of a type that currently reaches state-of-the-art accuracy. We use the most recent version of the MIM-GOLD training/testing corpus, its newest tagset, and augmentation data to obtain results that are comparable between the various models. We examine the accuracy improvements with each model and analyse the errors produced by our transformer model, which is based on a previously published ConvBERT model. For the set of errors that all the models make, and for which they predict the same tag, we extract a random subset for manual inspection. Extrapolating from this subset, we obtain a lower bound estimate on annotation errors in the corpus as well as on some unsolvable tagging errors. We argue that further tagging accuracy gains for Icelandic can still be obtained by fixing the errors in MIM-GOLD and, furthermore, that it should still be possible to squeeze out some small gains from our transformer model.

Semi-supervised Automated Clinical Coding Using International Classification of Diseases

by Hrafn Loftsson and Emil L. Sigurðsson

5th International Conference on Natural Language and Speech Processing (ICNLSP 2022)., 2022

Clinical Text Notes (CTNs) contain physicians' reasoning process, written in an unstructured free... more Clinical Text Notes (CTNs) contain physicians' reasoning process, written in an unstructured free text format, as they examine and interview patients. In recent years, several studies have been published that provide evidence for the utility of machine learning for predicting doctors' diagnoses from CTNs, a task known as ICD coding. Data annotation is time consuming, particularly when a degree of specialization is needed, as is the case for medical data. This paper presents a method of augmenting a sparsely annotated dataset of Icelandic CTNs with a machine-learned data imputation in a semi-supervised manner. We train a neural network on a small set of annotated CTNs and use it to extract clinical features from a set of un-annotated CTNs. These clinical features consist of answers to about a thousand potential questions that a physician might find the answers to during a consultation with a patient. The features are then used to train a classifier for the diagnosis of certain types of diseases. We report the results of an evaluation of this data augmentation method over three tiers of information that are available to a physician. Our data augmentation method shows a significant positive effect, which is diminished when an increasing number of clinical features, from the examination of the patient and diagnostics, are made available. Our method may be used for augmenting scarce datasets for systems that take decisions based on clinical features that do not include examinations or tests.

GameQA: Gamified Mobile App Platform for Building Multiple-Domain Question-Answering Datasets

by Hrafn Loftsson, Steinar Smári, and Hafsteinn Einarsson

17th Conference of the European Chapter of the Association for Computational Linguistics (EACL): System Demonstrations, 2023

The methods used to create many of the wellknown Question-Answering (QA) datasets are hard to rep... more The methods used to create many of the wellknown Question-Answering (QA) datasets are hard to replicate for low-resource languages. A commonality amongst these methods is hiring annotators to source answers from the internet by querying a single answer source, such as Wikipedia. Applying these methods for lowresource languages can be problematic since there is no single large answer source for these languages. Consequently, this can result in a high ratio of unanswered questions, since the amount of information in any single source is limited. To address this problem, we developed a novel crowd-sourcing platform to gather multiple-domain QA data for low-resource languages. Our platform, which consists of a mobile app and a web API, gamifies the data collection process. We successfully released the app for Icelandic (a low-resource language with about 350,000 native speakers) to build a dataset which rivals large QA datasets for highresource languages both in terms of size and ratio of answered questions. We have made the platform open source with instructions on how to localize and deploy it to gather data for other low-resource languages.

PivotAlign: Leveraging High-Precision Word Alignments for Bilingual Dictionary Inference

Proceedings of the 4th Translation Inference Across Dictionaries (TIAD) shared task, 2021

This paper describes our contribution to the TIAD 2021 shared task for Translation Inference Acro... more This paper describes our contribution to the TIAD 2021 shared task for Translation Inference Across Dictionaries. Our system, PivotAlign, approaches the problem from two directions. First, we collect translation candidates by pivoting through intermediary dictionaries, made available by the task organizers. Second, we decide which candidates to keep by applying scores to the candidate list, obtained by running an ensemble of word alignment tools on parallel corpora and comparing frequency of alignments to frequency of word co-occurrence in the parallel texts. Our approach outperforms all other participating systems with respect to F1 measure and recall, as well as having a very competitive precision score, showing the usefulness of a scoring mechanism based on highly accurate word alignments for this kind of task.

Pre-training and Evaluating Transformer-based Language Models for Icelandic

Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 2022

In this paper, we evaluate several Transformer-based language models for Icelandic on four downst... more In this paper, we evaluate several Transformer-based language models for Icelandic on four downstream tasks: Part-of-Speech tagging, Named Entity Recognition. Dependency Parsing, and Automatic Text Summarization. We pre-train four types of monolingual ELECTRA and ConvBERT models and compare our results to a previously trained monolingual RoBERTa model and the multilingual mBERT model. We find that the Transformer models obtain better results, often by a large margin, compared to previous state-of-the-art models. Furthermore, our results indicate that pre-training larger language models results in a significant reduction in error rates in comparison to smaller models. Finally, our results show that the monolingual models for Icelandic outperform a comparably sized multilingual model.

National Language Technology Platform for Public Administration

by Matea Filko, Hrafn Loftsson, Judie Attard, and Donatienne Spiteri

This article presents the work in progress on the collaborative project of several European count... more This article presents the work in progress on the collaborative project of several European countries to develop National Language Technology Platform (NLTP). The project aims at combining the most advanced Language Technology tools and solutions in a new, state-of-the-art, Artificial Intelligence driven, National Language Technology Platform for five EU/EEA official and lower-resourced languages.

Compiling a Highly Accurate Bilingual Lexicon by Combining Different Approaches

by Luke O'brien, Finnur Ingimundarson, and Hrafn Loftsson

Proceedings of the Globalex Workshop on Linked Lexicography @LREC2022,, 2022

Bilingual lexicons can be generated automatically using a wide variety of approaches. We perform ... more Bilingual lexicons can be generated automatically using a wide variety of approaches. We perform a rigorous manual evaluation of four different methods: word alignments on different types of bilingual data, pivoting, machine translation and cross-lingual word embeddings. We investigate how the different setups perform using publicly available data for the English-Icelandic language pair, doing separate evaluations for each method, dataset and confidence class where it can be calculated. The results are validated by human experts, working with a random sample from all our experiments. By combining the most promising approaches and data sets, using confidence scores calculated from the data and the results of manually evaluating samples from our manual evaluation as indicators, we are able to induce lists of translations with a very high acceptance rate. We show how multiple different combinations generate lists with well over 90% acceptance rate, substantially exceeding the results for each individual approach, while still generating reasonably large candidate lists. All manually evaluated equivalence pairs are published in a new lexicon of over 232,000 pairs under an open license.

Building an Icelandic Entity Linking Corpus

by Steinunn Rut Friðriksdóttir, Valdimar Eggertsson, Benedikt Geir Jóhannesson, and Hrafn Loftsson

Proceedings of the 1st Workshop on Dataset Creation for Lower-Resourced Languages (DCLRL) @LREC2022, 2022

In this paper, we present the first Entity Linking corpus for Icelandic. We describe our approach... more In this paper, we present the first Entity Linking corpus for Icelandic. We describe our approach of using a multilingual entity linking model (mGENRE) in combination with Wikipedia API Search (WAPIS) to label our data and compare it to an approach using WAPIS only. We find that our combined method reaches 53.9% coverage on our corpus, compared to 30.9% using only WAPIS. We analyze our results and explain the value of using a multilingual system when working with Icelandic. Additionally, we analyze the data that remain unlabeled, identify patterns and discuss why they may be more difficult to annotate.

Help Yourself from the Buffet: National Language Technology Infrastructure Initiative on CLARIN-IS

by Steinþór Steingrímsson and Hrafn Loftsson

Selected Papers from the CLARIN 2021 Annual Conference, 2022

In this paper we describe how a fairly new CLARIN member is building a broad collection of nation... more In this paper we describe how a fairly new CLARIN member is building a broad collection of national language resources for use in language technology (LT). As a CLARIN C-centre, CLARIN-IS is hosting metadata for various text and speech corpora, lexical resources, software packages and models. The providers of the resources are universities, institutions and private companies working on a national LT infrastructure initiative, Language Technology Programme for Icelandic. All deliverables of the programme are published under open licences and are freely accessible for research as well as commercial use. We provide a broad overview of the available repositories and the core publishing guidelines.

DMS: A System for Delivering Dynamic Multitask NLP Tools

Proceedings of the 14th International Conference on Agents and Artificial Intelligence, 2022

Most NLP frameworks focus on state-of-the-art models which solve a single task. As an alternative... more Most NLP frameworks focus on state-of-the-art models which solve a single task. As an alternative to these frameworks, we present the Dynamic Multitask System (DMS), based on native PyTorch. The DMS has a simple interface, can be combined with other frameworks, is easily extendable, and bundles model downloading with an API and a terminal client for end-users. The DMS is flexible towards different tasks and enables
quick experimentation with different architectures and hyperparameters. Components of the system are split into two categories with their respective interfaces: encoders and decoders. The DMS targets researchers and practitioners who want to develop state-of-the-art multitask NLP tools and easily supply them to end-users. In this paper, we, first, describe the core components of the DMS and how it can be used to deliver a trained
system. Second, we demonstrate how we used the DMS for developing a state-of-the-art PoS tagger and a lemmatizer for Icelandic.

Evaluating Icelandic Sentiment Analysis Models Trained on Translated Data

by Ólafur Aron Jóhannsson and Hrafn Loftsson

Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024, 2024

We experiment with sentiment classification models for Icelandic that leverage machine-translated... more We experiment with sentiment classification models for Icelandic that leverage machine-translated data for training. Since no large sentiment dataset exists for Icelandic, we translate 50,000 English IMDb reviews, classified either as positive or negative, into Icelandic using two services: Google Translate and GreynirTranslate. After machine translation, we assess whether the sentiment of the source language text is retained in the target language. Moreover, we evaluate the accuracy of the sentiment classifiers on non-translated Icelandic text. The performance of three types of baseline classifiers is compared, i.e., Support Vector Machines, Logistic Regression and Naive Bayes, when trained on translated data generated by either translation service. Furthermore, we fine-tune and evaluate three pre-trained transformer-based models, RoBERTa, IceBERT and ELECTRA, on both the original English texts and the translated texts. Our results indicate that the transformer models perform better than the baseline classifiers on all datasets. Furthermore, our evaluation shows that the transformer models trained on data translated from English reviews can be used to effectively classify sentiment on non-translated Icelandic movie reviews.

Unsupervised Outlier Detection for Language-Independent Text Quality Filtering

Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024, 2024

Web-crawled corpora offer an abundant source of training data for language models. However, they ... more Web-crawled corpora offer an abundant source of training data for language models. However, they are generally noisy and are typically filtered using heuristic rules or classifiers. These methods require careful tuning or labeling by fluent speakers. In this paper, we assess the effectiveness of commonly applied rules on TQ-IS, a manually labeled text quality dataset for Icelandic. Additionally, we advocate for the utilization of unsupervised clustering and outlier detection algorithms for filtering. These algorithms are language-independent, computationally efficient and do not require language expertise. Using grid search, we find the optimal configuration for every combination of rules, optimizing for F1 score on TQ-IS. For a rule-based approach, we discover that optimal results can be achieved with only a small subset of the full ruleset. Using five rules, we obtain an F1 score of 98.2%. We then evaluate three unsupervised algorithms, i.e., Gaussian Mixture Models (GMMs), Isolation Forests and One-Class SVMs. Our findings reveal that unsupervised algorithms perform well on the TQ-IS dataset, with GMMs obtaining the best results, comparable to those obtained with the rule-based approach. Finally, we show that unsupervised methods appear to be equally suitable for languages other than Icelandic, including Estonian and Basque.

Text Filtering Classifiers for Medium-Resource Languages

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024

Web-crawled corpora are essential resources for linguistic and NLP research, offering far more da... more Web-crawled corpora are essential resources for linguistic and NLP research, offering far more data than is available from curated corpora. However, they often contain a great deal of low-quality texts which can complicate research and degrade the quality of pre-trained language models. Therefore, they are typically filtered, e.g. by applying rules or classifiers. In this paper, we compare the effectiveness of various text filtering classifiers and measure their impact on language model performance for three medium-resource languages. We present TQ-IS, an Icelandic text quality dataset consisting of 2,000 web-crawled documents, in which spans of low-quality text have been manually identified and labeled. We then evaluate a perplexity-based classifier, a supervised classifier trained on TQ-IS, and a self-supervised classifier trained to discern between documents from curated and web-crawled corpora on Icelandic, Estonian and Basque. We find that these classifiers obtain F1 scores of 94.48%, 99.01% and 93.40%, respectively, when evaluated on the TQ-IS dataset. Furthermore, our results show that while adding filtered web-crawled text to a pre-training corpus can improve downstream performance for pre-trained language models, any improvement is likely to remain modest unless the web-crawled corpus is significantly larger in size.

SentAlign: Accurate and Scalable Sentence Alignment

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2023

We present SentAlign, an accurate sentence alignment tool designed to handle very large parallel ... more We present SentAlign, an accurate sentence alignment tool designed to handle very large parallel document pairs. Given user-defined parameters, the alignment algorithm evaluates all possible alignment paths in fairly large documents of thousands of sentences and uses a divide-and-conquer approach to align documents containing tens of thousands of sentences. The scoring function is based on LaBSE bilingual sentence representations. SentAlign outperforms five other sentence alignment tools when evaluated on two different evaluation sets, German-French and English-Icelandic, and on a downstream machine translation task.

Effective Entity Disambiguation in Low-Resource Languages: A Study of Icelandic.

Proceedings of the 22nd IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology, 2023

Entity disambiguation (ED) is integral to the task of entity linking (EL), the task of identifyin... more Entity disambiguation (ED) is integral to the task of entity linking (EL), the task of identifying named entities in text and linking them to their corresponding entries in a knowledge base (KB). In this paper, we present an effective and efficient ED system for Icelandic, using the Icelandic Wikipedia as a KB. We focus on disambiguation, the linking aspect of EL, assuming that an entity mention has already been located. We perform candidate generation using an alias table and Wikipedia search, and achieve candidate ranking through fine-grained entity typing and the use of an entity-aware Icelandic language model, IceLUKE. We study and compare the effects of different variations of candidate generation and candidate ranking, with the best approach reaching an accuracy of 95.2%. Our results highlight the importance of using an entity-aware language model in the candidate ranking step and show a minor improvement in using fine-grained entity typing to decrease the size of the candidate set before ranking.

Do Not Discard -Extracting Useful Fragments from Low-Quality Parallel Data to Improve Machine Translation

Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation (CoCo4MT) at MT Summit, 2023

When parallel corpora are preprocessed for machine translation (MT) training, a part of the paral... more When parallel corpora are preprocessed for machine translation (MT) training, a part of the parallel data is commonly discarded and deemed non-parallel due to odd-length ratio, overlapping text in source and target sentences or failing some other form of a semantic equivalency test. For language pairs with limited parallel resources, this can be costly as in such cases modest amounts of acceptable data may be useful to help build MT systems that generate higher quality translations. In this paper, we refine parallel corpora for two language pairs, English-Bengali and English-Icelandic, by extracting sub-sentence fragments from sentence pairs that would otherwise have been discarded, in order to increase recall when compiling training data. We find that by including the fragments, translation quality of NMT systems trained on the data improves significantly when translating from English to Bengali and from English to Icelandic.

Do Not Discard - Extracting Useful Fragments from Low-Quality Parallel Data to Improve Machine Translation

Second Workshop on Corpus Generation and Corpus Augmentation for Machine Translation (CoCo4MT) at MT Summit, 2023

When parallel corpora are preprocessed for machine translation (MT) training, a part of the paral... more When parallel corpora are preprocessed for machine translation (MT) training, a part of the parallel data is commonly discarded and deemed non-parallel due to odd-length ratio, overlapping text in source and target sentences or failing some other form of a semantic equivalency test. For language pairs with limited parallel resources, this can be costly as in such cases modest amounts of acceptable data may be useful to help build MT systems that generate higher quality translations. In this paper, we refine parallel corpora for two language pairs, English-Bengali and English-Icelandic, by extracting sub-sentence fragments from sentence pairs that would otherwise have been discarded, in order to increase recall when compiling training data. We find that by including the fragments, translation quality of NMT systems trained on the data improves significantly when translating from English to Bengali and from English to Icelandic.

Can triaging patients with respiratory symptoms in primary care using artificial intelligence improve patient outcomes? A retrospective diagnostic accuracy study.

Annals of Family Medicine, 2023

PURPOSE Respiratory symptoms are the most common presenting complaint in primary care. Often the... more PURPOSE
Respiratory symptoms are the most common presenting complaint in primary care. Often these symptoms are self resolving, but they can indicate a severe illness. With increasing physician workload and health care costs, triaging patients before in-person consultations would be helpful, possibly offering low-risk patients other means of communication. The objective of this study was to train a machine learning model to triage patients with respiratory symptoms before visiting a primary care clinic and examine patient outcomes in the context of the triage.

METHODS
We trained a machine learning model, using clinical features only available
before a medical visit. Clinical text notes were extracted from 1,500 records for patients that received 1 of 7 International Classification of Diseases 10th Revision codes (J00, J10, JII, J15, J20, J44, J45). All primary care clinics in the Reykjavík area of Iceland were included. The model scored patients in 2 extrinsic data sets and divided them into 10 risk groups (higher values having greater risk). We analyzed selected outcomes in each group.

RESULTS
Risk groups 1 through 5 consisted of younger patients with lower C-reactive
protein values, re-evaluation rates in primary and emergency care, antibiotic prescription rates, chest x-ray (CXR) referrals, and CXRs with signs of pneumonia, compared with groups 6 through 10. Groups 1 through 5 had no CXRs with signs of pneumonia or diagnosis of pneumonia by a physician.

CONCLUSIONS
The model triaged patients in line with expected outcomes. The model can
reduce the number of CXR referrals by eliminating them in risk groups 1 through 5, thus decreasing clinically insignificant incidentaloma findings without input from clinicians.

Microservices at Your Service: Bridging the Gap between NLP Research and Industry

by Hrafn Loftsson, Sebastian Andersson, and Jarmo Hemminki

24th Nordic Conference on Computational Linguistics (NoDaLiDa), 2023

This paper describes a collaborative European project whose aim was to gather open source Natural... more This paper describes a collaborative European project whose aim was to gather open source Natural Language Processing (NLP) tools and make them accessible as running services and easy to try out in the European Language Grid (ELG). The motivation of the project was to increase accessibility for more European languages and make it easier for developers to use the underlying tools in their own applications. The project resulted in the containerization of 60 existing NLP tools for 16 languages, all of which are now currently running as easily testable services in the ELG platform.

Filtering Matters: Experiments in Filtering Training Sets for Machine Translation

24th Nordic Conference on Computational Linguistics (NoDaLiDa), 2023

We explore different approaches for filtering parallel data for MT training, whether the same fil... more We explore different approaches for filtering parallel data for MT training, whether the same filtering approaches suit different datasets, and if separate filters should be applied to a dataset depending on the translation direction. We evaluate the results of different approaches, both manually and on a downstream NMT task. We find that, first, it is beneficial to inspect how well different filtering approaches suit different datasets and, second, that while MT systems trained on data prepared using different filters do not differ substantially in quality, there is indeed a statistically significant difference. Finally, we find that the same training sets do not seem to suit different translation directions.

Is Part-of-Speech Tagging a Solved Problem for Icelandic

24th Nordic Conference on Computational Linguistics (NoDaLiDa), 2023

We train and evaluate four Part-of-Speech tagging models for Icelandic. Three are older models th... more We train and evaluate four Part-of-Speech tagging models for Icelandic. Three are older models that obtained the highest accuracy for Icelandic when they were introduced. The fourth model is of a type that currently reaches state-of-the-art accuracy. We use the most recent version of the MIM-GOLD training/testing corpus, its newest tagset, and augmentation data to obtain results that are comparable between the various models. We examine the accuracy improvements with each model and analyse the errors produced by our transformer model, which is based on a previously published ConvBERT model. For the set of errors that all the models make, and for which they predict the same tag, we extract a random subset for manual inspection. Extrapolating from this subset, we obtain a lower bound estimate on annotation errors in the corpus as well as on some unsolvable tagging errors. We argue that further tagging accuracy gains for Icelandic can still be obtained by fixing the errors in MIM-GOLD and, furthermore, that it should still be possible to squeeze out some small gains from our transformer model.

Semi-supervised Automated Clinical Coding Using International Classification of Diseases

by Hrafn Loftsson and Emil L. Sigurðsson

5th International Conference on Natural Language and Speech Processing (ICNLSP 2022)., 2022

Clinical Text Notes (CTNs) contain physicians' reasoning process, written in an unstructured free... more Clinical Text Notes (CTNs) contain physicians' reasoning process, written in an unstructured free text format, as they examine and interview patients. In recent years, several studies have been published that provide evidence for the utility of machine learning for predicting doctors' diagnoses from CTNs, a task known as ICD coding. Data annotation is time consuming, particularly when a degree of specialization is needed, as is the case for medical data. This paper presents a method of augmenting a sparsely annotated dataset of Icelandic CTNs with a machine-learned data imputation in a semi-supervised manner. We train a neural network on a small set of annotated CTNs and use it to extract clinical features from a set of un-annotated CTNs. These clinical features consist of answers to about a thousand potential questions that a physician might find the answers to during a consultation with a patient. The features are then used to train a classifier for the diagnosis of certain types of diseases. We report the results of an evaluation of this data augmentation method over three tiers of information that are available to a physician. Our data augmentation method shows a significant positive effect, which is diminished when an increasing number of clinical features, from the examination of the patient and diagnostics, are made available. Our method may be used for augmenting scarce datasets for systems that take decisions based on clinical features that do not include examinations or tests.

GameQA: Gamified Mobile App Platform for Building Multiple-Domain Question-Answering Datasets

by Hrafn Loftsson, Steinar Smári, and Hafsteinn Einarsson

17th Conference of the European Chapter of the Association for Computational Linguistics (EACL): System Demonstrations, 2023

The methods used to create many of the wellknown Question-Answering (QA) datasets are hard to rep... more The methods used to create many of the wellknown Question-Answering (QA) datasets are hard to replicate for low-resource languages. A commonality amongst these methods is hiring annotators to source answers from the internet by querying a single answer source, such as Wikipedia. Applying these methods for lowresource languages can be problematic since there is no single large answer source for these languages. Consequently, this can result in a high ratio of unanswered questions, since the amount of information in any single source is limited. To address this problem, we developed a novel crowd-sourcing platform to gather multiple-domain QA data for low-resource languages. Our platform, which consists of a mobile app and a web API, gamifies the data collection process. We successfully released the app for Icelandic (a low-resource language with about 350,000 native speakers) to build a dataset which rivals large QA datasets for highresource languages both in terms of size and ratio of answered questions. We have made the platform open source with instructions on how to localize and deploy it to gather data for other low-resource languages.

PivotAlign: Leveraging High-Precision Word Alignments for Bilingual Dictionary Inference

Proceedings of the 4th Translation Inference Across Dictionaries (TIAD) shared task, 2021

This paper describes our contribution to the TIAD 2021 shared task for Translation Inference Acro... more This paper describes our contribution to the TIAD 2021 shared task for Translation Inference Across Dictionaries. Our system, PivotAlign, approaches the problem from two directions. First, we collect translation candidates by pivoting through intermediary dictionaries, made available by the task organizers. Second, we decide which candidates to keep by applying scores to the candidate list, obtained by running an ensemble of word alignment tools on parallel corpora and comparing frequency of alignments to frequency of word co-occurrence in the parallel texts. Our approach outperforms all other participating systems with respect to F1 measure and recall, as well as having a very competitive precision score, showing the usefulness of a scoring mechanism based on highly accurate word alignments for this kind of task.

Pre-training and Evaluating Transformer-based Language Models for Icelandic

Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 2022

In this paper, we evaluate several Transformer-based language models for Icelandic on four downst... more In this paper, we evaluate several Transformer-based language models for Icelandic on four downstream tasks: Part-of-Speech tagging, Named Entity Recognition. Dependency Parsing, and Automatic Text Summarization. We pre-train four types of monolingual ELECTRA and ConvBERT models and compare our results to a previously trained monolingual RoBERTa model and the multilingual mBERT model. We find that the Transformer models obtain better results, often by a large margin, compared to previous state-of-the-art models. Furthermore, our results indicate that pre-training larger language models results in a significant reduction in error rates in comparison to smaller models. Finally, our results show that the monolingual models for Icelandic outperform a comparably sized multilingual model.

National Language Technology Platform for Public Administration

by Matea Filko, Hrafn Loftsson, Judie Attard, and Donatienne Spiteri

This article presents the work in progress on the collaborative project of several European count... more This article presents the work in progress on the collaborative project of several European countries to develop National Language Technology Platform (NLTP). The project aims at combining the most advanced Language Technology tools and solutions in a new, state-of-the-art, Artificial Intelligence driven, National Language Technology Platform for five EU/EEA official and lower-resourced languages.

Compiling a Highly Accurate Bilingual Lexicon by Combining Different Approaches

by Luke O'brien, Finnur Ingimundarson, and Hrafn Loftsson

Proceedings of the Globalex Workshop on Linked Lexicography @LREC2022,, 2022

Bilingual lexicons can be generated automatically using a wide variety of approaches. We perform ... more Bilingual lexicons can be generated automatically using a wide variety of approaches. We perform a rigorous manual evaluation of four different methods: word alignments on different types of bilingual data, pivoting, machine translation and cross-lingual word embeddings. We investigate how the different setups perform using publicly available data for the English-Icelandic language pair, doing separate evaluations for each method, dataset and confidence class where it can be calculated. The results are validated by human experts, working with a random sample from all our experiments. By combining the most promising approaches and data sets, using confidence scores calculated from the data and the results of manually evaluating samples from our manual evaluation as indicators, we are able to induce lists of translations with a very high acceptance rate. We show how multiple different combinations generate lists with well over 90% acceptance rate, substantially exceeding the results for each individual approach, while still generating reasonably large candidate lists. All manually evaluated equivalence pairs are published in a new lexicon of over 232,000 pairs under an open license.

Building an Icelandic Entity Linking Corpus

by Steinunn Rut Friðriksdóttir, Valdimar Eggertsson, Benedikt Geir Jóhannesson, and Hrafn Loftsson

Proceedings of the 1st Workshop on Dataset Creation for Lower-Resourced Languages (DCLRL) @LREC2022, 2022

In this paper, we present the first Entity Linking corpus for Icelandic. We describe our approach... more In this paper, we present the first Entity Linking corpus for Icelandic. We describe our approach of using a multilingual entity linking model (mGENRE) in combination with Wikipedia API Search (WAPIS) to label our data and compare it to an approach using WAPIS only. We find that our combined method reaches 53.9% coverage on our corpus, compared to 30.9% using only WAPIS. We analyze our results and explain the value of using a multilingual system when working with Icelandic. Additionally, we analyze the data that remain unlabeled, identify patterns and discuss why they may be more difficult to annotate.

Help Yourself from the Buffet: National Language Technology Infrastructure Initiative on CLARIN-IS

by Steinþór Steingrímsson and Hrafn Loftsson

Selected Papers from the CLARIN 2021 Annual Conference, 2022

In this paper we describe how a fairly new CLARIN member is building a broad collection of nation... more In this paper we describe how a fairly new CLARIN member is building a broad collection of national language resources for use in language technology (LT). As a CLARIN C-centre, CLARIN-IS is hosting metadata for various text and speech corpora, lexical resources, software packages and models. The providers of the resources are universities, institutions and private companies working on a national LT infrastructure initiative, Language Technology Programme for Icelandic. All deliverables of the programme are published under open licences and are freely accessible for research as well as commercial use. We provide a broad overview of the available repositories and the core publishing guidelines.

DMS: A System for Delivering Dynamic Multitask NLP Tools

Proceedings of the 14th International Conference on Agents and Artificial Intelligence, 2022

Most NLP frameworks focus on state-of-the-art models which solve a single task. As an alternative... more Most NLP frameworks focus on state-of-the-art models which solve a single task. As an alternative to these frameworks, we present the Dynamic Multitask System (DMS), based on native PyTorch. The DMS has a simple interface, can be combined with other frameworks, is easily extendable, and bundles model downloading with an API and a terminal client for end-users. The DMS is flexible towards different tasks and enables
quick experimentation with different architectures and hyperparameters. Components of the system are split into two categories with their respective interfaces: encoders and decoders. The DMS targets researchers and practitioners who want to develop state-of-the-art multitask NLP tools and easily supply them to end-users. In this paper, we, first, describe the core components of the DMS and how it can be used to deliver a trained
system. Second, we demonstrate how we used the DMS for developing a state-of-the-art PoS tagger and a lemmatizer for Icelandic.