The paper presents a language-dependent model for the classification of statements into ironic and non-ironic. The model uses various language resources: morphological dictionaries, a sentiment lexicon, a lexicon of markers, and a WordNet-based ontology. The approach uses various features: antonymous pairs obtained using reasoning rules over the Serbian WordNet ontology (R), antonymous pairs in which one member has positive sentiment polarity (PPR), polarity of positive sentiment words (PSP), ordered sequences of sentiment tags (OSA), part-of-speech tags of words (POS), and irony markers (M). The evaluation was performed on two collections of tweets that had been manually annotated for irony. These collections, as well as the language resources used, are in Serbian (or one of the closely related languages Bosnian/Croatian/Montenegrin). The best accuracy of the developed classifier was achieved for irony with a set of 5 features (PPR, PSP, POS, OSA, M), acc = 86.1%, while for sarcasm the best results were achieved with the set (R, PSP, POS, OSA, M), acc = 72.8%.
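The lexicon-based features above can be illustrated with a minimal sketch. The tiny word lists below are illustrative stand-ins for the Serbian sentiment lexicon and irony-marker lexicon used in the paper, and the function names are hypothetical:

```python
# Toy stand-ins for the paper's lexicons (illustrative only).
POSITIVE = {"divno", "sjajno", "super"}   # hypothetical positive-sentiment words
NEGATIVE = {"uzasno", "lose"}             # hypothetical negative-sentiment words
MARKERS = {"!", "...", "baš"}             # hypothetical irony markers (M)

def extract_features(tokens):
    """Build a simple feature dict: PSP (count of positive-sentiment words),
    OSA (ordered sequence of sentiment tags) and M (marker count)."""
    osa = []
    psp = 0
    for t in tokens:
        if t in POSITIVE:
            osa.append("+")
            psp += 1
        elif t in NEGATIVE:
            osa.append("-")
    m = sum(1 for t in tokens if t in MARKERS)
    return {"PSP": psp, "OSA": "".join(osa), "M": m}
```

A feature dict like this would then be vectorized and fed to a standard classifier; the paper's actual feature set is richer (R, PPR, POS).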
There is evidence that specific segments of the population, e.g., people with a migration background, were hit particularly hard by the Covid-19 pandemic. In this context, the impact and role of online platforms in facilitating the integration or fragmentation of public debates and social groups is a recurring topic of discussion. This is where our study ties in; we ask: How is the topic of vaccination discussed and evaluated in different language communities in Germany on Twitter during the Covid-19 pandemic? We collected all tweets in German, Russian, Turkish, and Polish (i.e., the languages of the largest migrant groups in Germany) in March 2021 that included the most important keywords related to Covid-19 vaccination. All users were automatically geocoded, and the data was limited to tweets from Germany. Our results show that the multilingual debate on Covid-19 vaccination in Germany does not have many structural connections. However, in terms of actors, arguments, and positions towards Covid...
This paper presents our work on the refinement and improvement of the Serbian part of Hurtlex, a multilingual lexicon of words to hurt. We pay special attention to adding multi-word expressions (MWEs) that can be seen as abusive, as such lexical entries are very important for obtaining good results in a plethora of abusive language detection tasks. We use Serbian morphological dictionaries as a basis for data cleaning and MWE dictionary creation. A connection to other lexical and semantic resources in Serbian is outlined, and the building of abusive language detection systems based on that connection is foreseen.
In this paper, we introduce the architecture used for our PAN@CLEF2019 author profiling participation. In this task, we had to predict whether the author of 100 tweets was a bot, a female human, or a male human. The task is proposed from a multilingual perspective, for English and Spanish. We handled it in two steps, using different feature extraction techniques and machine learning algorithms. In the first step, we used a random forest classifier with different features to predict whether the users were bots or humans. In the second step, we recovered all the users predicted as humans and used a two-layer architecture to predict their gender.
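The two-step control flow described above can be sketched as follows. The two stage functions are hypothetical stand-ins (in the paper, step 1 is a random forest and step 2 a two-layer architecture); only the routing logic is the point here:

```python
def step1_bot_or_human(user_feats):
    # Toy stand-in for the random forest: a very high link ratio
    # suggests a bot (illustrative rule only).
    return "bot" if user_feats["link_ratio"] > 0.8 else "human"

def step2_gender(user_feats):
    # Toy stand-in for the second-stage gender model.
    return "female" if user_feats["score"] >= 0.5 else "male"

def predict(users):
    """Route each user through step 1; only users predicted as
    humans are passed to the gender classifier in step 2."""
    out = {}
    for uid, feats in users.items():
        label = step1_bot_or_human(feats)
        if label == "human":
            label = step2_gender(feats)
        out[uid] = label
    return out
```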
In the current world, individuals are faced with decision-making problems and opinion formation processes on a daily basis, for example, when debating or choosing between two similar products. However, answering a comparative question by retrieving documents based only on traditional measures (such as TF-IDF and BM25) does not always satisfy the information need. Thus, introducing the argumentation aspect into the information retrieval procedure has recently gained significant attention. In this paper, we present our participation in the CLEF 2021 Touché Lab for the second shared task, which tackles answering comparative questions based on arguments. We propose a novel multi-layer architecture in which the argument extraction task is the main engine. Our approach is thus a pipeline of query expansion, argument identification based on the DistilBERT model, and sorting of the documents by a combination of different ranking criteria.
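The traditional retrieval baseline mentioned above can be made concrete with a self-contained Okapi BM25 scorer (standard formula with the usual k1 and b parameters; this is a generic sketch, not the paper's pipeline, which adds argument-based reranking on top):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 scores of each tokenized document against a
    tokenized query; higher means more relevant."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                       # document frequency per term
    for d in docs:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs:
        tf = Counter(d)                  # term frequency in this document
        s = 0.0
        for q in query:
            if q not in tf:
                continue
            idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1)
            # length-normalized term-frequency saturation
            s += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```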
The Common European Framework of Reference (CEFR) provides generic guidelines for the evaluation of language proficiency. Nevertheless, for automated proficiency classification systems, different approaches are proposed for different languages. Our paper evaluates and extends the results of an approach to Automatic Essay Scoring proposed as part of the REPROLANG 2020 challenge. We provide a comparison between our results and those from the published paper, and we include a new corpus for the English language for further experiments. Our results are lower than the expected ones when using the same approach, and the system does not scale well with the added English corpus.
In this paper we present a set of new additions and functionalities to recently introduced software tools and techniques that will help researchers in the area of semantics, and especially developers of wordnets. The motivation lies in our wish to obtain an online, fully comprehensive, modular, multi-user, and safe system for further development of the Serbian WordNet (SWN). The most important functionality is the establishment of semantic relations between Princeton WordNet 3.0 (PWN) and the Serbian WordNet 3.0. Other functionalities of this set of tools are based on further semantic resources: SentiWordNet, a publicly available lexical resource for sentiment analysis, the Suggested Upper Merged Ontology (SUMO), and the morphological electronic dictionary of Serbian (SrpMD). They provide sophisticated search possibilities and procedures for easier and more comfortable growth of the WordNet. All of the functionalities were developed using publicly available resources: PWN mapping techniques, SUMO map...
In this paper, we present an efficient classifier that performs automatic filtering and detection of tweets with clear negative sentiment towards the COVID-19 vaccination process. We used a transformer-based architecture to build the classifier. A pre-trained transformer encoder trained in the ELECTRA fashion, BERTic, was selected and fine-tuned on a dataset we collected and manually annotated. Such an automatic filtering and detection algorithm is of utmost importance for exploring the reasons behind the negative sentiment of Twitter users towards a particular topic and for developing a communication strategy to educate them and provide them with accurate information regarding the specific beliefs that have been identified.
The COVID-19 pandemic has brought health problems that concern individuals, the state, and the whole world. The information available on social networks, which were used more frequently and intensively during the pandemic than before, may contain hidden knowledge that can help to better address some problems and apply protective measures more adequately. Since messages on Twitter are specific in their length, informal style, figurative speech, and frequent use of slang, this analysis requires slightly different techniques than those classically applied to long, formal documents. To determine which topics appear in tweets related to vaccination, we apply several state-of-the-art topic modeling techniques and assess which one is the most appropriate. This kind of research is meant to give us insight into the opinions of the Twitter community on the phenomenon of vaccination and all related aspects. Comparing the results of LDA with the topics obtained by manual annotation over the same set, we concluded that the LDA method provides a very good interpretation of the topics. Such data allow the analysis of sentiment, in this case pro- or anti-vaccination attitudes, and of specific groups of data and topics.
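A minimal, hypothetical illustration of LDA topic modeling on short texts, using scikit-learn on a tiny synthetic corpus (the paper works on a real Serbian tweet corpus, and the texts below are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Four invented tweet-like texts; two loosely about vaccination
# logistics, two about policy.
tweets = [
    "vaccine dose appointment clinic",
    "vaccine side effects fever",
    "government policy lockdown rules",
    "lockdown policy protest rules",
]

# Bag-of-words counts, then a 2-topic LDA model.
X = CountVectorizer().fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # each row is a per-tweet topic mixture summing to 1
```

The per-tweet topic mixtures in `doc_topics` are what a manual-annotation comparison like the one above would be measured against.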
Computational rhetoric (CR) is an area of Natural Language Processing (NLP) dealing with computational approaches to the modelling and detection of rhetorical figures, as well as rhetorical relations, which, in turn, might aid tasks such as sentiment and opinion mining and analysis, argument mining, argumentation modelling, analysis of political argumentation, etc. (Mitrović et al., 2017). In this poster, we present some recent advances in CR that use WordNet as a starting point and a valuable resource, as well as future directions in this regard relating to the paradigm of distributional semantics.
WordNet is largely used as a linguistic resource in a number of semantic tasks, such as Question Answering, Information Retrieval, Text Entailment, etc., but systems usually query only the links between terms, such as synonym, hypernym, or derivational-form relationships. The synsets' definitions are usually left aside, although they contain a large amount of relevant information. These natural language definitions can serve as a rich source of knowledge, but structuring them into a comprehensible semantic model is essential for making them useful in semantic interpretation tasks. In order to allow the use of WordNet's natural language definitions as a structured knowledge source in NLP tasks, we developed WordNetGraph, a graph knowledge base built according to the methodology described in [1]. WordNetGraph builds upon a conceptual model based on entity-centered semantic roles for definitions [2], that is, roles that express the part played by an expression in a definition, showing how it relates to the definiendum, i.e., the entity being defined. This model extends the classic Aristotelian genus-differentia definition pattern [3, 4, 5]: the genus concept is replaced by the supertype role (the definiendum's superclass, immediate or not); the essential properties represented by the differentia concept are split into the differentia quality and differentia event roles; and other roles, such as associated fact, purpose, or accessory quality, among others, represent the definiendum's non-essential attributes. For building the graph, a small sample of WordNet definitions was first automatically pre-annotated, using the syntactic patterns described in [2] to assign the suitable semantic roles to each segment in a definition, and then manually curated to create a training dataset.
This dataset was used to train a machine learning classifier [6], which was later used to label all WordNet noun and verb definitions. After a post-processing phase to fix minor errors in the sequences of labels, the classified data was serialized in RDF format. Figure 1 shows an example of a labeled definition (for the WordNet synset "lake poets"); the same labeled definition is depicted in the final graph format in Figure 2. WordNetGraph was primarily designed for, and successfully used in, an interpretable text entailment recognition approach that provides human-readable justifications for the entailment decision. Using an algorithm based on distributional semantics [7] to navigate the graph, we look for a path linking the entailing text T to the entailed hypothesis H. If we succeed, the entailment is confirmed, and the contents of the nodes in the retrieved path are used to build a natural language justification that explains why the entailment is true and what exactly the semantic relationship between T and H is. The complete description of the text entailment recognition approach, including evaluation results and justification examples, can be found in [8].
Figure 1. Example of role labeling for the definition of "lake poets"
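The path search from text T to hypothesis H can be sketched with a plain breadth-first search over a toy graph. The graph fragment below is hypothetical (a simplified stand-in for WordNetGraph, with only supertype-style edges), and the actual system navigates the graph with a distributional-semantics-guided algorithm rather than plain BFS:

```python
from collections import deque

# Hypothetical fragment: nodes are terms, edges point from a
# definiendum to its supertype (superclass).
GRAPH = {
    "lake_poets": ["poet"],
    "poet": ["writer"],
    "writer": ["person"],
    "person": [],
}

def find_path(graph, start, goal):
    """Breadth-first search; the returned node path could then be
    verbalized into a natural-language entailment justification."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path: entailment not confirmed
```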
10th International Conference on Advanced Computer Information Technologies (ACIT), IEEE, 2020
The main tool of a lawyer is their language. Legal prose is bound by writing styles, especially in Germany. These styles ensure, inter alia, that judgments are written in a structured and comprehensive way. The writing style used for German judgments is called Urteilsstil and consists of several subcomponents. These subcomponents should be classifiable with the help of argumentation mining techniques. However, this classification is currently not possible because an annotated corpus that considers the special structure of German legal text is not available. This paper explores possibilities for classifying two subcomponents of the Urteilsstil by utilising argumentation mining. Furthermore, the creation of a new corpus for legal classification is proposed.
In this paper, we introduce our submission for SemEval Task 12, sub-tasks A and B, for offensive language identification and categorization in English tweets. This year the dataset for sub-task A is significantly larger than in the previous year. We therefore adapted the BlazingText algorithm to extract word embeddings and classify texts after filtering and sanitizing the dataset according to the conventional text patterns on social media. We gained both the advantage of a speedy training process and an F1 score of 90.88% on the test set. For sub-task B, we opted to fine-tune a Bidirectional Encoder Representations from Transformers (BERT) model to accommodate the limited data for categorizing offensive tweets. We achieved an F1 score of only 56.86%, but after experimenting with various label assignment thresholds in the pre-processing steps, the F1 score improved to 64%.
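The threshold experimentation mentioned above can be sketched as a simple sweep: compute F1 for each candidate label-assignment threshold over predicted probabilities and keep the best. The data and candidate grid below are synthetic, and this is a generic sketch rather than the submission's actual procedure:

```python
def f1(y_true, y_pred):
    """Binary F1 from gold labels (0/1) and boolean predictions."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(y_true, probs, candidates=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Pick the candidate threshold that maximizes F1 on held-out data."""
    return max(candidates, key=lambda th: f1(y_true, [p >= th for p in probs]))
```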
This paper describes a neural network (NN) model that was used for participating in OffensEval, Task 12 of the SemEval 2020 workshop. The aim of this task is to identify offensive speech in social media, particularly in tweets. The model we used, C-BiGRU, is composed of a Convolutional Neural Network (CNN) combined with a bidirectional Recurrent Neural Network (RNN). A multi-dimensional numerical representation (embedding) for each of the words in the tweets was determined using fastText. This allowed a dataset of labeled tweets to be used to train the model to detect combinations of words that may convey an offensive meaning. The model was used in sub-task A of the English, Turkish, and Danish competitions of the workshop, achieving F1 scores of 90.88%, 76.76%, and 76.70%, respectively.
We introduce an approach to multilingual Offensive Language Detection based on the mBERT transformer model. We download extra training data from Twitter in English, Danish, and Turkish, and use it to retrain the model. We then fine-tune the model on the provided training data and, in some configurations, implement a transfer learning approach exploiting the typological relatedness of English and Danish. Our systems obtained good results across the three languages (0.9036 for EN, 0.7619 for DA, and 0.7789 for TR).
Papers by Jelena Mitrović