Björn Gambäck

Norwegian University of Science and Technology, Computer and Information Science, Faculty Member

Papers by Björn Gambäck

Proceedings of the 2010 Workshop on Companionable Dialogue Systems

Proceedings of the 2010 Workshop on Companionable Dialogue Systems, 2010

@Book{CDS:2010, editor = {Yorick Wilks and Bj\"{o}rn Gamb\"{a}ck and Morena Danieli}, title = {Proceedings of the 2010 Workshop on Companionable Dialogue Systems}, month = {July}, year = {2010}, address = {Uppsala, Sweden}, publisher = {Association for Computational ...

Cross-Lingual Speaker Identification for Indian Languages

The paper introduces a cross-lingual speaker identification system for Indian languages, utilising a Long Short-Term Memory dense neural network (LSTM-DNN). The system was trained on audio recordings in English and evaluated on data from Hindi, Kannada, Malayalam, Tamil, and Telugu, with a view to how factors such as phonetic similarity and native accent affect performance. The model was fed with MFCC (mel-frequency cepstral coefficient) features extracted from the audio file. For comparison, the corresponding mel-spectrogram images were also used as input to a ResNet-50 model, while the raw audio was used to train a Siamese network. The LSTM-DNN model outperformed the other two models as well as two more traditional baseline speaker identification models, showing that deep learning models are superior to probabilistic models for capturing low-level speech features and learning speaker characteristics.

A Multilingual Adaptive Spoken Dialogue System for the E-mail Domain

AthosMail is a multilingual spoken dialogue system for reading e-mail messages. The key features of the application are adaptivity and the integration of different approaches to spoken interaction. The application has a flexible system structure that supports multiple components, both for different purposes and for the same purpose. The AthosMail system includes components for input interpretation, dialogue management, output generation, user modelling and text processing. Suitable components are selected dynamically to make the interaction adaptive, and various adaptive techniques are experimented with. For example, the system responses are tailored according to the user's actions so as to adapt to the user's observed skill levels. The paper presents the AthosMail functionality and its system components.

Named Entity Recognition on Code-Switched Data Using Conditional Random Fields

Named Entity Recognition is an important information extraction task that identifies proper names in unstructured texts and classifies them into some pre-defined categories. Identification of named entities in code-mixed social media texts is a more difficult and challenging task as the contexts are short, ambiguous and often noisy. This work proposes a Conditional Random Fields based named entity recognition system to identify proper names in code-switched data and classify them into nine categories. The system ranked fifth among nine participant systems and achieved a 59.25% F1-score.

NTNU-TRH system at the MultiGED-2023 Shared Task on Multilingual Grammatical Error Detection

Linköping electronic conference proceedings, May 16, 2023

The paper presents a monolithic approach to grammatical error detection, which uses one model for all languages, in contrast to the individual approach, which creates separate models for each language. For both approaches, pre-trained embeddings are the only external knowledge sources. Two sets of embeddings (Flair and BERT) are compared, as well as the two approaches to multilingual grammatical error detection: building individual and monolithic systems. The system submitted to the test phase of the MultiGED-2023 shared task ranked 5th of 6 systems. In the subsequent open phase, more experiments were conducted, improving results. These results show the individual models to perform better than the monolithic ones and BERT embeddings working better than Flair embeddings for the individual models, while the picture is more mixed for the monolithic models.

On Implementing Swedish Tense and Aspect

The paper addresses the problems encountered when implementing a system for the treatment of Swedish tense, mood and aspect. The underlying theory suffered from the same shortcomings as do most implementable linguistic theories: it was designed for English. To extend it to Swedish, some aspects of the theory, but also of the implementation, had to be generalized to allow for a system which treats Swedish verb-phrase syntax and semantics in a uniform way. The paper concentrates on how this treatment has actually been implemented in a large-scale natural-language processing system.

Twitter Named Entity Extraction and Linking Using Differential Evolution

International conference natural language processing, Dec 1, 2016

Systems that simultaneously identify and classify named entities in Twitter typically show poor recall. To remedy this, the task is here divided into two parts: i) named entity identification using Conditional Random Fields in a multi-objective framework built on Differential Evolution, and ii) named entity classification using Vector Space Modelling and edit distance techniques. Differential Evolution is an evolutionary algorithm, which not only optimises the features, but also identifies the proper context window for each selected feature. The approach obtains F-scores of 70.7% for Twitter named entity extraction and 66.0% for entity linking to the DBpedia database.
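The abstract mentions edit distance techniques without saying which variant was used. As a rough illustration only, the sketch below assumes the standard Levenshtein distance with unit costs and uses it to pick the closest name from a candidate list; the function names and the candidate list are hypothetical and not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance with unit costs (assumed variant, not confirmed by the paper)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def closest_candidate(mention: str, candidates: list[str]) -> str:
    """Return the candidate with the smallest case-insensitive distance to the mention."""
    # Hypothetical helper: the candidate list stands in for whatever
    # entity names a real linker would retrieve.
    return min(candidates, key=lambda c: levenshtein(mention.lower(), c.lower()))
```

A misspelled mention such as "Trondhiem" would be matched to "Trondheim" rather than "Stockholm" under this measure, since the former differs by only two substitutions.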

Generative Solid Modelling Employing Natural Language Understanding and 3D Data

Lecture Notes in Computer Science, 2018

The paper describes an experimental system for generating 3D-printable models inspired by arbitrary textual input. Utilizing a transliteration pipeline, the system pivots on Natural Language Understanding technologies and 3D data available via online repositories to result in a bag of retrieved 3D models that are then concatenated in order to produce original designs. Such artefacts celebrate a post-digital kind of objecthood, as they are concretely physical while, at the same time, incorporating the cybernetic encodings of their own making. Twelve individuals were asked to reflect on some of the 3D-printed, physical artefacts. Their responses suggest that the created artefacts succeed in triggering imagination, and in accelerating moods and narratives of various sorts.

Semantic-head based resolution of scopal ambiguities

Proceedings of the 17th international conference on Computational linguistics, 1998

We introduce an algorithm for scope resolution in underspecified semantic representations. Scope preferences are suggested on the basis of semantic argument structure. The major novelty of this approach is that, while maintaining a (scopally) underspecified semantic representation, we at the same time suggest a resolution possibility. The algorithm has been implemented and tested in a large-scale system and fared quite well: 28% of the utterances were ambiguous, 80% of these were correctly interpreted, leaving errors in only 5.7% of the utterance set.
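The overall error rate follows from the two other figures. A quick check, using only the rounded percentages quoted in the abstract:

```latex
\[
  \underbrace{0.28}_{\text{ambiguous}} \times
  \underbrace{(1 - 0.80)}_{\text{misinterpreted}}
  = 0.056 \approx 5.6\%
\]
```

This is close to the reported 5.7%; the small gap is presumably due to the 28% and 80% figures themselves being rounded.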

Sentimental Poetry Generation

International Conference on Networks, 2020

The paper investigates how well poetry can be generated to contain a specific sentiment, and whether readers of the poetry experience the intended sentiment. The poetry generator consists of a bi-directional Long Short-Term Memory (LSTM) model, combined with rhyme pair generation, rule-based word prediction methods, and tree search for extending generation possibilities. The LSTM network was trained on a set of English poetry written and published by users on a public website. Human judges evaluated poems generated by the system, with both positive and negative sentiment. The results indicate that while there are some weaknesses in the system compared to other state-of-the-art solutions, it is fully capable of generating poetry with an inherent sentiment that is perceived by readers.

Flytxt_NTNU at SemEval-2018 Task 8: Identifying and Classifying Malware Text Using Conditional Random Fields and Naïve Bayes Classifiers

Cybersecurity risks such as malware threaten the personal safety of users, but identifying malware text is a major challenge. The paper proposes a supervised learning approach to identifying malware sentences given a document (subTask1 of SemEval 2018, Task 8), as well as to classifying malware tokens in the sentences (subTask2). The approach achieved good results, ranking second of twelve participants for both subtasks, with F-scores of 57% for subTask1 and 28% for subTask2.

Sentiment analysis

In this paper we address the Sentiment Analysis problem from the end user's perspective. An end user might desire an automated at-a-glance presentation of the main points made in a single review, or of how opinion changes over time across multiple documents. To meet this requirement we propose a relatively generic opinion 5Ws structurization, further used for textual and visual summary and tracking. The 5W task seeks to extract the semantic constituents in a natural language sentence by distilling it into the answers to the 5W questions: Who, What, When, Where and Why. The visualization system lets users generate sentiment tracking with a textual summary and a polarity-wise sentiment graph based on any dimension or combination of dimensions they want, i.e., "Who" are the actors and "What" are their sentiments regarding any topic, changes in sentiment during "When" and "Where", and the reasons for change in sentiment as "Why".

A speaker independent continuous speech recognizer for Amharic

The paper discusses an Amharic speaker independent continuous speech recognizer based on an HMM/ANN hybrid approach. The model was constructed at a context dependent phone part sub-word level with the help of the CSLU Toolkit. A promising result of 74.28% word and 39.70% sentence recognition rate was achieved. These are the best figures reported so far for speech recognition for the Amharic language.

Language Identification in Code-Switched Text Using Conditional Random Fields and Babelnet

The paper outlines a supervised approach to language identification in code-switched data, framing this as a sequence labeling task where the label of each token is identified using a classifier based on Conditional Random Fields and trained on a range of different features, extracted both from the training data and by using information from Babelnet and Babelfy. The method was tested on the development dataset provided by organizers of the shared task on language identification in code-switched data, obtaining tweet level monolingual, code-switched and weighted F1-scores of 94%, 85% and 91%, respectively, with a token level accuracy of 95.8%. When evaluated on the unseen test data, the system achieved 90%, 85% and 87.4% monolingual, code-switched and weighted tweet level F1-scores, and a token level accuracy of 95.7%.

Feature-Rich Twitter Named Entity Recognition and Classification

International Conference on Computational Linguistics, Dec 1, 2016

Twitter named entity recognition is the process of identifying proper names and classifying them into some predefined labels/categories. The paper introduces a Twitter named entity system using a supervised machine learning approach, namely Conditional Random Fields. A large set of different features was developed and the system was trained using these. The Twitter named entity task can be divided into two parts: i) Named entity extraction from tweets and ii) Twitter name classification into ten different types. For Twitter named entity recognition on unseen test data, our system obtained the second highest F1 score in the shared task: 63.22%. The system performance on the classification task was worse, with an F1 measure of 40.06% on unseen test data, which was the fourth best of the ten systems participating in the shared task.

Swedish Language Processing in the Spoken Language Translator

The paper describes the Swedish language components used in the Spoken Language Translator (SLT) system. SLT is a multi-component system for translation of spoken English into spoken Swedish. The language processing parts of the system are the English Core Language Engine (CLE) and its Swedish counterpart, the S-CLE. The S-CLE is a general purpose natural language processing system for Swedish which in the SLT project was tuned towards the register of the air travel information (ATIS) domain. The peculiarities and the coverage of the resulting Swedish grammar are the main topics of the paper, even though the overall SLT system also is briefly described.

Tagging Experiments Using Neural Networks

The paper outlines a method for automatic part-of-speech tagging using artificial neural networks. Several experiments have been carried out where the performance of different network architectures has been compared on two tasks: classification by overall part-of-speech (noun, adjective or verb) and by a set of 13 possible output categories. The best classification rates were 93.6% for the simple and 96.4% for the complex task. These results are rather promising and the paper compares them to the performance reported by other methods; a comparison that shows the neural network completely compatible with pure statistical approaches.

Evolvable Media Repositories: An Evolutionary System to Retrieve and Ever-Renovate Related Media Web Content

Advances in Intelligent Systems and Computing, 2019

The paper tackles the question of evolvable media repositories, i.e., local pools of media files that are retrieved over the Internet and that are ever-renovated with new, related files in an evolutionary fashion. The herein proposed method encodes genotypic space by virtue of simple undirected graphs of natural language tokens that represent web queries without employing fitness functions or other evaluation/selection schemata. Once a first population is seeded, a series of modular crawlers query the particular World Wide Web repositories of interest for both media content and assorted meta-data. Then, a series of attached intelligent comprehenders analyse the retrieved content in order to eventually generate new genetic representations, and the cycle is repeated. Such a method is generic, scalable and modular, and can be made to fit the purposes of a wide array of applications in all sorts of disparate contextual and functional scenarios. The paper features a formal description of the method, gives implementation guidelines, and presents example usages.

NTNU at SemEval-2018 Task 7: Classifier Ensembling for Semantic Relation Identification and Classification in Scientific Papers

Proceedings of The 12th International Workshop on Semantic Evaluation, 2018

The paper presents NTNU's contribution to SemEval-2018 Task 7 on relation identification and classification. The class weights and parameters of five alternative supervised classifiers were optimized through grid search and cross-validation. The outputs of the classifiers were combined through voting for the final prediction. A wide variety of features were explored, with the most informative identified by feature selection. The best setting achieved F1 scores of 47.4% and 66.0% in the relation classification subtasks 1.1 and 1.2. For relation identification and classification in subtask 2, it achieved F1 scores of 33.9% and 17.0%.
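The abstract says the classifier outputs were combined through voting but does not state the scheme. The sketch below assumes plain (unweighted) majority voting over hard labels, with ties broken in favour of the earliest classifier in the list; both choices, and the function name, are assumptions made for illustration rather than details from the paper.

```python
from collections import Counter


def majority_vote(predictions: list[list[str]]) -> list[str]:
    """Combine per-classifier label lists; predictions[k][i] is classifier k's label for instance i."""
    if not predictions:
        return []
    n_instances = len(predictions[0])
    if any(len(p) != n_instances for p in predictions):
        raise ValueError("all classifiers must label the same number of instances")

    combined = []
    for votes in zip(*predictions):  # one tuple of votes per instance
        counts = Counter(votes)
        top = max(counts.values())
        # Assumed tie-break: the first classifier (in list order) whose
        # label reaches the top count wins.
        combined.append(next(label for label in votes if counts[label] == top))
    return combined
```

With three classifiers voting `A, A, B` on one instance the combined label is `A`; with two classifiers disagreeing (`X` versus `Y`) the first classifier's label `X` is kept under the assumed tie-break.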

Deep Learning Based Sentiment Analysis in a Code-Mixed English-Hindi and English-Bengali Social Media Corpus

International Journal on Artificial Intelligence Tools, 2020

Sentiment analysis is a circumstantial analysis of text, identifying the social sentiment to better understand the source material. The article addresses sentiment analysis of an English-Hindi and English-Bengali code-mixed textual corpus collected from social media. Code-mixing is an amalgamation of multiple languages, which was previously associated mainly with spoken language. However, social media users also deploy it to communicate in ways that tend to be somewhat casual. The coarse nature of social media text poses challenges for many language processing applications. Here, the focus is on the low predictive nature of traditional machine learners when compared to Deep Learning counterparts, including the contextual language representation model BERT (Bidirectional Encoder Representations from Transformers), on the task of extracting user sentiment from code-mixed texts. Three deep learners (a BiLSTM CNN, a Double BiLSTM and an Attention-based model) attained accuracy 20–60% gr...

Proceedings of the 2010 Workshop on Companionable Dialogue Systems}

Proceedings of the 2010 Workshop on Companionable Dialogue Systems}, 2010

@Book{CDS:2010, editor = {Yorick Wilks and Bj\&amp;amp;amp;amp;amp;amp;quot;{o}rn Gamb\&a... more @Book{CDS:2010, editor = {Yorick Wilks and Bj\&amp;amp;amp;amp;amp;amp;quot;{o}rn Gamb\&amp;amp;amp;amp;amp;amp;quot;{a}ck and Morena Danieli}, title = {Proceedings of the 2010 Workshop on Companionable Dialogue Systems}, month = {July}, year = {2010}, address = {Uppsala, Sweden}, publisher = {Association for Computational ...

Cross-Lingual Speaker Identification for Indian Languages

The paper introduces a cross-lingual speaker identification system for Indian languages, utilisin... more The paper introduces a cross-lingual speaker identification system for Indian languages, utilising a Long Short-Term Memory dense neural network (LSTM-DNN). The system was trained on audio recordings in English and evaluated on data from Hindi, Kannada, Malayalam, Tamil, and Telugu, with a view to how factors such as phonetic similarity and native accent affect performance. The model was fed with MFCC (mel-frequency cepstral coefficient) features extracted from the audio file. For comparison, the corresponding melspectrogram images were also used as input to a ResNet-50 model, while the raw audio was used to train a Siamese network. The LSTM-DNN model outperformed the other two models as well as two more traditional baseline speaker identification models, showing that deep learning models are superior to probabilistic models for capturing low-level speech features and learning speaker characteristics.

A Multilingual Adaptive Spoken Dialogue System for the E-mail Domain

AthosMail is a multilingual spoken dialogue system for reading of e-mail messages. The key featur... more AthosMail is a multilingual spoken dialogue system for reading of e-mail messages. The key features of the application are adaptivity and the integration of different approaches for spoken interaction. The application has flexible system structure supporting multiple components for both different and same purposes. The AthosMail system includes components for input interpretation, dialogue management, output generation, user modelling and text processing. Suitable components are selected dynamically to make the interaction adaptive, and various adaptive techniques are experimented. For example, the system responses are tailored according to the user's actions so as to adapt to the user's observed skill levels. In this paper AthosMail functionality and its system components are presented.

Named Entity Recognition on Code-Switched Data Using Conditional Random Fields

Named Entity Recognition is an important information extraction task that identifies proper names... more Named Entity Recognition is an important information extraction task that identifies proper names in unstructured texts and classifies them into some pre-defined categories. Identification of named entities in code-mixed social media texts is a more difficult and challenging task as the contexts are short, ambiguous and often noisy. This work proposes a Conditional Random Fields based named entity recognition system to identify proper names in code-switched data and classify them into nine categories. The system ranked fifth among nine participant systems and achieved a 59.25% F1-score.

NTNU-TRH system at the MultiGED-2023 Shared on Multilingual Grammatical Error Detection

Linköping electronic conference proceedings, May 16, 2023

The paper presents a monolithic approach to grammatical error detection, which uses one model for... more The paper presents a monolithic approach to grammatical error detection, which uses one model for all languages, in contrast to the individual approach, which creates separate models for each language. For both approaches, pre-trained embeddings are the only external knowledge sources. Two sets of embeddings (Flair and BERT) are compared as well as two approaches to the problem of multilingual rammar detection, building individual and monolithic systems for multilingual grammar error detection. The system submitted to the test phase of the MultiGED-2023 shared task ranked 5th of 6 systems. In the subsequent open phase, more experiments were conducted, improving results. These results show the individual models to perform better than the monolithic ones and BERT embeddings working better than Flair embeddings for the individual models, while the picture is more mixed for the monolithic models.

On Implementing Swedish Tense and Aspect

B jö rn G am b äck S tock h olm A b stract The paper addresses the problems encountered when impl... more B jö rn G am b äck S tock h olm A b stract The paper addresses the problems encountered when implementing a system for the treatment of Swedish tense, mood and aspect. The underlying theory suffered from the same shortcomings as do most implementable linguistic theories: it was designed for English. To extend it to Swedish some aspects of the theory, but also the implementation had to be generalized to allow for a system which treats Swedish verb-phrase syntax and semantics in a uniform way. This paper is concentrated on how this treatment actually has been implemented in a large-sc^e natural-language processing system.

Twitter Named Entity Extraction and Linking Using Differential Evolution

International conference natural language processing, Dec 1, 2016

Systems that simultaneously identify and classify named entities in Twitter typically show poor r... more Systems that simultaneously identify and classify named entities in Twitter typically show poor recall. To remedy this, the task is here divided into two parts: i) named entity identification using Conditional Random Fields in a multi-objective framework built on Differential Evolution, and ii) named entity classification using Vector Space Modelling and edit distance techniques. Differential Evolution is an evolutionary algorithm, which not only optimises the features, but also identifies the proper context window for each selected feature. The approach obtains F-scores of 70.7% for Twitter named entity extraction and 66.0% for entity linking to the DBpedia database.

Generative Solid Modelling Employing Natural Language Understanding and 3D Data

Lecture Notes in Computer Science, 2018

The paper describes an experimental system for generating 3D-printable models inspired by arbitra... more The paper describes an experimental system for generating 3D-printable models inspired by arbitrary textual input. Utilizing a transliteration pipeline, the system pivots on Natural Language Understanding technologies and 3D data available via online repositories to result in a bag of retrieved 3D models that are then concatenated in order to produce original designs. Such artefacts celebrate a post-digital kind of objecthood, as they are concretely physical while, at the same time, incorporate the cybernetic encodings of their own making. Twelve individuals were asked to reflect on some of the 3D-printed, physical artefacts. Their responses suggest that the created artefacts succeed in triggering imagination, and in accelerating moods and narratives of various sorts.

Semantic-head based resolution of scopal ambiguities

Proceedings of the 17th international conference on Computational linguistics -, 1998

We introduce an algorithm for scope resolution in underspecified semantic representations. Scope ... more We introduce an algorithm for scope resolution in underspecified semantic representations. Scope preferences are suggested on the basis of semantic argument structure. The major novelty of this approach is that, while maintaining an (scopally) underspecified semantic representation, we at the same time suggest a resolution possibility. The algorithm has been implemented and tested in a large-scale system and fared quite well: 28% of the utterances were ambiguous, 80% of these were correctly interpreted, leaving errors in only 5.7% of the utterance set.

Sentimental Poetry Generation

International Conference on Networks, 2020

The paper investigates how well poetry can be generated to contain a specific sentiment, and whet... more The paper investigates how well poetry can be generated to contain a specific sentiment, and whether readers of the poetry experience the intended sentiment. The poetry generator consists of a bi-directional Long ShortTerm Memory (LSTM) model, combined with rhyme pair generation, rule-based word prediction methods, and tree search for extending generation possibilities. The LSTM network was trained on a set of English poetry written and published by users on a public website. Human judges evaluated poems generated by the system, both with a positive and negative sentiment. The results indicate that while there are some weaknesses in the system compared to other state-of-the-art solutions, it is fully capable of generating poetry with an inherent sentiment that is perceived by readers.

Flytxt_NTNU at SemEval-2018 Task 8: Identifying and Classifying Malware Text Using Conditional Random Fields and Naïve Bayes Classifiers

Cybersecurity risks such as malware threaten the personal safety of users, but to identify malwar... more Cybersecurity risks such as malware threaten the personal safety of users, but to identify malware text is a major challenge. The paper proposes a supervised learning approach to identifying malware sentences given a document (subTask1 of SemEval 2018, Task 8), as well as to classifying malware tokens in the sentences (subTask2). The approach achieved good results, ranking second of twelve participants for both subtasks, with F-scores of 57% for subTask1 and 28% for subTask2.

Sentiment analysis

In this paper we address the Sentiment Analysis problem from the end user's perspective. An end u... more In this paper we address the Sentiment Analysis problem from the end user's perspective. An end user might desire an automated at-a-glance presentation of the main points made in a single review or how opinion changes time to time over multiple documents. To meet the requirement we propose a relatively generic opinion 5Ws structurization, further used for textual and visual summary and tracking. The 5W task seeks to extract the semantic constituents in a natural language sentence by distilling it into the answers to the 5W questions: Who, What, When, Where and Why. The visualization system facilitates users to generate sentiment tracking with textual summary and sentiment polarity wise graph based on any dimension or combination of dimensions as they want i.e. "Who" are the actors and "What" are their sentiment regarding any topic, changes in sentiment during "When" and "Where" and the reasons for change in sentiment as "Why".

A speaker independent continuous speech recognizer for Amharic

The paper discusses an Amharic speaker independent continuous speech recognizer based on an HMM/A... more The paper discusses an Amharic speaker independent continuous speech recognizer based on an HMM/ANN hybrid approach. The model was constructed at a context dependent phone part sub-word level with the help of the CSLU Toolkit. A promising result of 74.28% word and 39.70% sentence recognition rate was achieved. These are the best figures reported so far for speech recognition for the Amharic language.

Language Identification in Code-Switched Text Using Conditional Random Fields and Babelnet

The paper outlines a supervised approach to language identification in code-switched data, framin... more The paper outlines a supervised approach to language identification in code-switched data, framing this as a sequence labeling task where the label of each token is identified using a classifier based on Conditional Random Fields and trained on a range of different features, extracted both from the training data and by using information from Babelnet and Babelfy. The method was tested on the development dataset provided by organizers of the shared task on language identification in codeswitched data, obtaining tweet level monolingual, code-switched and weighted F1-scores of 94%, 85% and 91%, respectively, with a token level accuracy of 95.8%. When evaluated on the unseen test data, the system achieved 90%, 85% and 87.4% monolingual, code-switched and weighted tweet level F1scores, and a token level accuracy of 95.7%.

Feature-Rich Twitter Named Entity Recognition and Classification

International Conference on Computational Linguistics, Dec 1, 2016

Twitter named entity recognition is the process of identifying proper names and classifying them into some predefined labels/categories. The paper introduces a Twitter named entity system using a supervised machine learning approach, namely Conditional Random Fields. A large set of different features was developed and the system was trained using these. The Twitter named entity task can be divided into two parts: i) named entity extraction from tweets and ii) Twitter name classification into ten different types. For Twitter named entity recognition on unseen test data, our system obtained the second highest F1 score in the shared task: 63.22%. The system performance on the classification task was worse, with an F1 measure of 40.06% on unseen test data, which was the fourth best of the ten systems participating in the shared task.

Swedish Language Processing in the Spoken Language Translator

The paper describes the Swedish language components used in the Spoken Language Translator (SLT) system. SLT is a multi-component system for translation of spoken English into spoken Swedish. The language processing parts of the system are the English Core Language Engine (CLE) and its Swedish counterpart, the S-CLE. The S-CLE is a general purpose natural language processing system for Swedish which in the SLT project was tuned towards the register of the air travel information (ATIS) domain. The peculiarities and the coverage of the resulting Swedish grammar are the main topics of the paper, even though the overall SLT system is also briefly described.

Tagging Experiments Using Neural Networks

The paper outlines a method for automatic part-of-speech tagging using artificial neural networks. Several experiments have been carried out where the performance of different network architectures has been compared on two tasks: classification by overall part-of-speech (noun, adjective or verb) and by a set of 13 possible output categories. The best classification rates were 93.6% for the simple and 96.4% for the complex task. These results are rather promising and the paper compares them to the performance reported by other methods; a comparison that shows the neural network completely compatible with pure statistical approaches.

Evolvable Media Repositories: An Evolutionary System to Retrieve and Ever-Renovate Related Media Web Content

Advances in Intelligent Systems and Computing, 2019

The paper tackles the question of evolvable media repositories, i.e., local pools of media files that are retrieved over the Internet and that are ever-renovated with new, related files in an evolutionary fashion. The herein proposed method encodes genotypic space by virtue of simple undirected graphs of natural language tokens that represent web queries without employing fitness functions or other evaluation/selection schemata. Once a first population is seeded, a series of modular crawlers query the particular World Wide Web repositories of interest for both media content and assorted meta-data. Then, a series of attached intelligent comprehenders analyse the retrieved content in order to eventually generate new genetic representations, and the cycle is repeated. Such a method is generic, scalable and modular, and can be made to fit the purposes of a wide array of applications in all sorts of disparate contextual and functional scenarios. The paper features a formal description of the method, gives implementation guidelines, and presents example usages.

NTNU at SemEval-2018 Task 7: Classifier Ensembling for Semantic Relation Identification and Classification in Scientific Papers

Proceedings of The 12th International Workshop on Semantic Evaluation, 2018

The paper presents NTNU's contribution to SemEval-2018 Task 7 on relation identification and classification. The class weights and parameters of five alternative supervised classifiers were optimized through grid search and cross-validation. The outputs of the classifiers were combined through voting for the final prediction. A wide variety of features were explored, with the most informative identified by feature selection. The best setting achieved F1 scores of 47.4% and 66.0% in the relation classification subtasks 1.1 and 1.2. For relation identification and classification in subtask 2, it achieved F1 scores of 33.9% and 17.0%.
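
The abstract says the classifier outputs were combined through voting but does not specify the scheme. The following is a minimal sketch of one common option, unweighted plurality voting, and should not be read as the authors' implementation. The function name `plurality_vote`, the tie-breaking rule (prefer the label from the earliest classifier in the list) and the label strings in the usage note are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence


def plurality_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine label sequences from several classifiers, one vote per classifier.

    predictions[i][j] is the label classifier i assigns to instance j.
    Ties are broken in favour of the earliest classifier in the list
    (an assumed rule, chosen only to make the result deterministic).
    """
    if not predictions:
        return []
    n_instances = len(predictions[0])
    if any(len(p) != n_instances for p in predictions):
        raise ValueError("all classifiers must label the same number of instances")

    combined: List[str] = []
    for votes in zip(*predictions):  # votes for one instance, in classifier order
        counts = Counter(votes)
        top = max(counts.values())
        # First vote (in classifier order) whose label reaches the top count.
        combined.append(next(v for v in votes if counts[v] == top))
    return combined
```

For example, with three classifiers voting on two instances, `plurality_vote([["USAGE", "RESULT"], ["USAGE", "COMPARE"], ["RESULT", "COMPARE"]])` returns `["USAGE", "COMPARE"]`. A weighted variant would multiply each vote by a per-classifier weight before counting.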

Deep Learning Based Sentiment Analysis in a Code-Mixed English-Hindi and English-Bengali Social Media Corpus

International Journal on Artificial Intelligence Tools, 2020

Sentiment analysis is a circumstantial analysis of text, identifying the social sentiment to better understand the source material. The article addresses sentiment analysis of an English-Hindi and English-Bengali code-mixed textual corpus collected from social media. Code-mixing is an amalgamation of multiple languages, which was previously associated mainly with spoken language. However, social media users also deploy it to communicate in ways that tend to be somewhat casual. The coarse nature of social media text poses challenges for many language processing applications. Here, the focus is on the low predictive nature of traditional machine learners when compared to Deep Learning counterparts, including the contextual language representation model BERT (Bidirectional Encoder Representations from Transformers), on the task of extracting user sentiment from code-mixed texts. Three deep learners (a BiLSTM CNN, a Double BiLSTM and an Attention-based model) attained accuracy 20–60% gr...