Automatic Term Extraction (ATE) systems have been studied for many decades as, among other things, one of the most important tools for tasks such as information retrieval, sentiment analysis, and named entity recognition. The interest in this topic has increased even further in recent years, given the support and improvement brought by the new neural approaches. In this article, we present a follow-up on the discussions about the pipeline that allows extracting key terms from medical reports, presented at MDTT 2022, and analyze the most recent papers about ATE in a systematic review fashion. We analyzed the journal and conference papers published in 2022 (and partially in 2023) about ATE and clustered them into subtopics according to the focus of the papers for a better presentation.
A survey published by Nature in 2016 revealed that more than 70% of researchers failed in their attempt to reproduce another researcher’s experiments, and over 50% failed to reproduce one of their own experiments; a state of affairs that has been termed the ‘reproducibility crisis’ in science. The purpose of this work is to contribute to the field by presenting a reproducibility study of a Natural Language Processing paper about “Language Representation Models for Fine-Grained Sentiment Classification”. A thorough analysis of the methodology, experimental setting, and experimental results is presented, leading to a discussion of the issues and the necessary steps involved in this kind of study.
The use of OCR recognition software to convert printed characters into digital text is a fundamental tool in the study of diachronic approaches to the analysis of political discourse through corpora (CADS studies). However, OCR software is not totally reliable, and its error rate can compromise the analysis. This article proposes a qualitative-quantitative approach to the detection and correction of post-scan OCR errors, with the aim of developing a methodology to improve the quality of corpora within historical studies. We applied the methodology to two case studies on early twentieth-century newspapers for the linguistic analysis of metaphorical representations of migrations and pandemics. The outcome of this project is a set of rules that are valid across different contexts, applicable to different corpora, and that can be reu...
Language interference is common in today's multilingual societies, where several languages are in contact and, as a global result, hybrid languages are created. These hybrid languages, together with the doubts about their right to be officially recognised, raise the problem of their automatic identification and further processing in the area of computational linguistics. In this paper, we propose a first attempt to identify the elements of a Ukrainian-Russian hybrid language, Surzhyk, through the adoption of example-based rules created with the instruments of the programming language R. Our example-based study consists of: 1) analysis of spoken samples of Surzhyk registered by Del Gaudio (2010) in the Kyiv area and creation of the written corpus; 2) production of specific rules for the identification of Surzhyk patterns and their implementation; 3) testing the code and analysing the effectiveness of the hybrid language classifier.
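As a rough illustration of what such example-based rules can look like in R, here is a minimal sketch; the patterns below are invented placeholders for hybrid features, not the actual rules derived from the Del Gaudio samples:

```r
library(stringr)
library(tibble)

# Illustrative placeholder rules: each pairs a regular expression
# with the hybrid feature it is meant to flag.
rules <- tribble(
  ~feature,                              ~pattern,
  "colloquial hybrid form 'шо'",         "\\bшо\\b",
  "Russian 'что' in Ukrainian context",  "\\bчто\\b"
)

# Count how many rules fire on an utterance and report the matched features.
classify_utterance <- function(text) {
  hits <- vapply(rules$pattern,
                 function(p) str_detect(text, regex(p, ignore_case = TRUE)),
                 logical(1))
  tibble(text = text,
         n_patterns = sum(hits),
         features = paste(rules$feature[hits], collapse = "; "))
}

classify_utterance("А шо ти робиш?")
```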
In this paper, we present an overview of some issues related to the use of Big Data in the area of Linguistics that have been debated in workshops and conferences in the last two years. We also consider some requirements that “big” linguistic databases should have in order to tackle some of these issues; finally, we discuss a set of possible interactive visualization approaches for large datasets that may have an impact on this research field.
In this paper, we describe a set of experiments that turn the machine learning classification task into a game, through gamification techniques, and let non-expert users perform text classification without even knowing the problem. The application is implemented in R using the Shiny package for interactive graphics. We present the outcome of three different experiments: a pilot experiment with PhD and post-doc students, and two experiments carried out with primary and secondary school students. The results show that the human-aided classifier performs similarly to, and sometimes even better than, state-of-the-art classifiers.
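To illustrate the kind of interactive setup Shiny enables, here is a minimal hypothetical sketch of a classification game on synthetic scores with hidden labels; it is not the application used in the experiments:

```r
library(shiny)

# Synthetic one-dimensional "document scores" with hidden labels.
set.seed(42)
docs <- data.frame(x = c(rnorm(50, -1), rnorm(50, 1)),
                   label = rep(c("A", "B"), each = 50))

ui <- fluidPage(
  sliderInput("cut", "Decision boundary", min = -3, max = 3, value = 0, step = 0.1),
  plotOutput("plot"),
  textOutput("score")
)

server <- function(input, output) {
  output$plot <- renderPlot({
    plot(docs$x, jitter(rep(0, nrow(docs))), col = factor(docs$label), pch = 19,
         xlab = "classifier score", ylab = "")
    abline(v = input$cut, lwd = 2)  # the player's current boundary
  })
  output$score <- renderText({
    pred <- ifelse(docs$x < input$cut, "A", "B")
    paste("Game score (accuracy):", mean(pred == docs$label))
  })
}

shinyApp(ui, server)
```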
Nowadays, the possibility to access big volumes of data is opening up new trends in social science research. Mass media and social media networks provide such a quantity of data that “a new era” of social science research can be imagined, as described by the American Sociological Association. As an example, aggregated newspaper articles can provide unprecedented accounts of the media’s agenda, just as online discussions can answer broad questions about how often the public talks about politics during daily life. This new “big data” paradigm requires the development of scalable, sustainable data infrastructures that facilitate an effective access to mass media and social media data by social scientists. In particular, we need to address questions related to the support for researchers in Social Science in the comparative investigation of relevant sociological analytical dimensions, time trends and evolution of the social representations of public debates featured in th...
Systematic review and e-Discovery have a common task in which the objective is to find most (if not all) of the relevant documents in a collection by means of a (semi-)manual screening of the potentially interesting documents [3]. However, the high cost of e-Discovery software and the management of the advanced e-Discovery mechanism are expected to affect the growth of this market (which is expected to reach 17.32 billion dollars by 2023). Moreover, the large and growing number of published studies makes the task of identifying relevant studies in systematic reviews in an unbiased way both complex and time consuming [4]. In this paper, we present an active learning system which combines different sampling approaches in order to estimate a 95% confidence interval of the number of relevant documents while taking into account the monetary costs of running the system itself.
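A minimal sketch of the estimation step, assuming a simple random sample of screened documents (the paper combines several sampling approaches, which this does not reproduce):

```r
# Estimate a 95% confidence interval for the total number of relevant documents
# in a collection of N documents from a simple random sample of size n.
estimate_relevant <- function(n_relevant_in_sample, n_sampled, N) {
  ci <- binom.test(n_relevant_in_sample, n_sampled)$conf.int  # exact Clopper-Pearson
  round(ci * N)  # scale the prevalence interval up to the collection size
}

estimate_relevant(n_relevant_in_sample = 37, n_sampled = 400, N = 50000)
```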
The use of OCR software to convert printed characters to digital text is a fundamental tool within diachronic approaches to Corpus-Assisted Discourse Studies, because it allows researchers to expand their interest by making many texts available and analysable through a computer. However, OCR software is not totally accurate, and the resulting error rate compromises its effectiveness. This paper proposes a mixed qualitative-quantitative approach to OCR error detection and correction in order to develop a methodology for compiling historical corpora. The proposed approach consists of three main steps: corpus creation, OCR detection and correction, and application of the automatic rules. The rules are implemented in R using a “tidyverse” approach for a better reproducibility of the experiments.
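A minimal sketch of what such automatic rules can look like in a tidyverse style; the patterns below are invented examples of common OCR confusions, not the rules developed in the paper:

```r
library(tibble)
library(stringr)

# Each rule maps an OCR confusion pattern to its correction.
ocr_rules <- tribble(
  ~pattern,       ~replacement,
  "\\brn",        "m",   # 'rn' at the start of a word misread for 'm'
  "1(?=[a-z])",   "l",   # digit one misread for letter 'l'
  "0(?=[a-z])",   "o"    # zero misread for letter 'o'
)

# Apply every rule to every line of the corpus.
correct_ocr <- function(lines) {
  str_replace_all(lines, setNames(ocr_rules$replacement, ocr_rules$pattern))
}

correct_ocr(c("the rnodern w0rld", "pub1ic hea1th"))
#> "the modern world" "public health"
```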
The present research is aimed at conducting a study, with regard to Russian-Italian medical translation, on the current development of two Machine Translation tools that feature prominently in today’s Neural Machine Translation framework, namely DeepL and Yandex. For the purpose of our research, we have selected a number of Russian medical articles: three highly specialized and three popular-science articles concerning the coronavirus pandemic. Such a choice is justified by the willingness not only to analyse recently published scientific documents but also to investigate the particular linguistic implications of the 2020 coronavirus pandemic outbreak. In fact, during the pandemic a set of terms was introduced and coined in everyday communication and entered the boundaries of scientific terminology. We have considered this existing linguistic phenomenon as a proper condition to test the performance of Machine Translation tools. In particular, we discuss the most relevant f...
The creation of a labelled dataset for machine learning purposes is a costly process. In recent works, it has been shown that a mix of crowdsourcing and active learning approaches can be used to annotate objects at an affordable cost. In this paper, we study the gamification of machine learning techniques; in particular, the problem of classification of objects. In this first pilot study, we designed a simple game, based on a visual interpretation of probabilistic classifiers, that consists in separating two sets of coloured points on a two-dimensional plane by means of a straight line. We present the current results of this first experiment, which we used to collect the requirements for the next version of the game and to analyze i) what the ‘price’ is to build a reasonably accurate classifier with a small amount of labelled objects, and ii) how the accuracy of the player compares to state-of-the-art classification algorithms.
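A minimal sketch of the game's core mechanic under simple assumptions (synthetic Gaussian point clouds, a player's line parameterised as y = a·x + b, accuracy as the score); the actual game interface is not reproduced here:

```r
# Two coloured point clouds on the plane, with class as the hidden label.
set.seed(1)
n <- 100
points <- data.frame(x = c(rnorm(n, -1), rnorm(n, 1)),
                     y = c(rnorm(n, -1), rnorm(n, 1)),
                     class = rep(c(0, 1), each = n))

# Score a player's straight line y = a*x + b by classification accuracy,
# regardless of which side the player assigned to which colour.
score_line <- function(a, b, data) {
  side <- as.integer(data$y > a * data$x + b)
  max(mean(side == data$class), mean(side != data$class))
}

score_line(a = -1, b = 0, points)  # one possible separating line
```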
This research will examine neural retrieval methods for patent prior art search. One research direction is the federated search approach, where we proposed two new methods that solve the results merging problem in federated patent search using machine learning models. The methods are based on a centralized index containing samples of documents from all potential resources, and they implement machine learning models to predict comparable scores for the documents retrieved by different resources. The other research direction is the adaptation of end-to-end neural retrieval approaches to the patent characteristics such that the retrieval effectiveness will be increased. Off-the-shelf neural methods like BERT have lower effectiveness for patent prior art search. So, we adapt the BERT model to patent characteristics in order to increase retrieval performance. We propose a new gate-based document retrieval method and examine it in patent prior art search. The method combines a first-stage ...
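A minimal sketch of the merging idea, assuming documents from the centralized sample index are scored by both a remote resource and the central model, so that local scores can be regressed onto comparable central scores; the paper's actual models and features are richer than this:

```r
# Hypothetical overlap data: scores from one remote resource vs. the central index.
sample_overlap <- data.frame(local_score   = c(2.1, 3.5, 1.2, 4.0, 2.9),
                             central_score = c(10.3, 14.8, 7.9, 16.2, 12.5))

# Learn a mapping from resource-local scores to comparable central scores.
fit <- lm(central_score ~ local_score, data = sample_overlap)

# Make new results from that resource comparable before merging the lists.
predict(fit, newdata = data.frame(local_score = c(2.8, 3.9)))
```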
In this paper, we describe the second participation of the Information Management Systems (IMS) group at CLEF eHealth 2018 Task 1. In this task, participants are required to extract causes of death from multilingual death reports (French, Hungarian and Italian) and label them with the correct International Classification of Diseases (ICD10) code. We tackled this task by focusing on the reproducible code that we published last year, which produces a clean dataset that can be used to implement more sophisticated approaches.
In this paper, we describe the third participation of the Information Management Systems (IMS) group at CLEF eHealth 2019 Task 1. In this task, participants are required to label health-related documents with ICD-10 codes, with the focus on the German language and on non-technical summaries (NTPs) of animal experiments. We tackled this task by focusing on reproducibility aspects, as we did in the previous years. This time, we tried three different probabilistic Naïve Bayes classifiers that use different hypotheses on the distribution of terms in the documents and the collection. The experimental evaluation showed a significantly different behavior of the classifiers during the training phase and the test phase. We are currently investigating possible sources of biases introduced in the training phase as well as out-of-vocabulary issues and changes in the terminology from the training set to the test set.
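For context, a minimal sketch of two standard term-distribution hypotheses that Naïve Bayes text classifiers can adopt, Bernoulli (term presence) versus multinomial (term counts), with add-one smoothing and toy numbers; the paper's three classifiers are not reproduced here:

```r
# Bernoulli NB: a class is described by per-term document frequencies.
bernoulli_loglik <- function(present, df_class, n_docs_class) {
  p <- (df_class + 1) / (n_docs_class + 2)            # add-one smoothing
  sum(ifelse(present, log(p), log(1 - p)))
}

# Multinomial NB: a class is described by pooled term frequencies.
multinomial_loglik <- function(tf_doc, tf_class, vocab_size) {
  p <- (tf_class + 1) / (sum(tf_class) + vocab_size)  # add-one smoothing
  sum(tf_doc * log(p))
}

# Toy example with a three-term vocabulary:
bernoulli_loglik(present = c(TRUE, FALSE, TRUE), df_class = c(8, 2, 5), n_docs_class = 10)
multinomial_loglik(tf_doc = c(2, 0, 1), tf_class = c(40, 3, 12), vocab_size = 3)
```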
In this paper, we describe the results of the participation of the Information Management Systems (IMS) group at AILA 2020 Task 1, precedents and statutes retrieval. In particular, we participated in both subtasks: precedents retrieval (task a) and statutes retrieval (task b). The goal of our work was to compare and evaluate the efficacy of a simple reproducible approach based on the use of either lemmas or stems with a tf-idf vector space model and a plain BM25 model. The results vary significantly from one subtask/evaluation measure to another. For the subtask of statutes retrieval, our approach performed well, being second only to a participant that used BERT to represent documents.
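For reference, a minimal sketch of a plain BM25 scorer of the kind used as a baseline here, with k1 and b at common default values; the exact configuration in the experiments may differ:

```r
# Score one document against a query given per-term statistics.
bm25_score <- function(tf, df, N, doc_len, avg_doc_len, k1 = 1.2, b = 0.75) {
  idf <- log((N - df + 0.5) / (df + 0.5) + 1)  # non-negative idf variant
  sum(idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len)))
}

# Hypothetical two-term query over a 10,000-document collection:
bm25_score(tf = c(3, 1), df = c(120, 800), N = 10000, doc_len = 250, avg_doc_len = 300)
```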
In this paper, we describe the first participation of the Information Management Systems (IMS) group at CLEF eHealth 2018 Task 3, Consumer Health Search Task. In particular, we participated in the subtask IRTask 1: Ad-hoc Search, which is a standard ad-hoc search task aiming at retrieving information relevant to people seeking health advice on the web. The goal of our work is to evaluate 1) different query expansion strategies based on the recognition of Medical Subject Headings (MeSH) terms present in the original query; 2) different approaches to combine multiple ranking lists given the query expansions. We used Elasticsearch as the search engine and the indexes provided by the organizers of this task.
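A minimal sketch of the expansion step, with a stand-in synonym table; the experiments used the actual MeSH thesaurus and the Elasticsearch indexes provided by the organizers:

```r
# Hypothetical mapping from recognised query phrases to MeSH entry terms.
mesh_synonyms <- list(
  "heart attack"        = c("myocardial infarction"),
  "high blood pressure" = c("hypertension")
)

# Append the entry terms of every phrase recognised in the original query.
expand_query <- function(query) {
  matched <- names(mesh_synonyms)[vapply(names(mesh_synonyms),
                                         function(t) grepl(t, query, fixed = TRUE),
                                         logical(1))]
  paste(c(query, unlist(mesh_synonyms[matched])), collapse = " ")
}

expand_query("treatment for heart attack")
#> "treatment for heart attack myocardial infarction"
```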
This paper describes the Atlante Sintattico d'Italia, Syntactic Atlas of Italy (ASIt) linguistic linked dataset. ASIt is a scientific project aiming to account for minimally different variants within a sample of closely related languages; it is part of the Edisyn network, the goal of which is to establish a European network of researchers in the area of language syntax that use similar standards with respect to methodology of data collection, data storage and annotation, data retrieval and cartography. In this context, ASIt is defined as a curated database which builds on dialectal data gathered during a twenty-year-long survey investigating the distribution of several grammatical phenomena across the dialects of Italy. Both the ASIt linguistic linked dataset and the Resource Description Framework Schema (RDF/S) on which it is based are publicly available and released with a Creative Commons license (CC BY-NC-SA 3.0). We report the characteristics of the data exposed by ASIt, th...
This is the third participation of the Information Management Systems (IMS) group at the CLEF eHealth Task of Technologically Assisted Reviews in Empirical Medicine. This task focuses on the problem of medical systematic reviews, a problem which requires a recall close (if not equal) to 100%. Semi-automated approaches are essential to support these types of searches when the amount of data exceeds the limits of users, i.e. in terms of attention or patience. We present a variation of the system we presented last year; in particular, not only do we set the maximum amount of documents that the physician is willing to read, but we also distribute the effort across the topics proportionally to the number of documents in the pool. We compare the results of this approach with the “frozen” system we used in 2018 and a BM25 baseline.
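A minimal sketch of the allocation rule described above, under the assumption that a single global reading budget is split proportionally to pool sizes:

```r
# Split a global budget of documents to read across topics,
# proportionally to the number of documents in each topic's pool.
allocate_budget <- function(total_budget, pool_sizes) {
  round(total_budget * pool_sizes / sum(pool_sizes))
}

allocate_budget(total_budget = 5000,
                pool_sizes = c(topic1 = 1200, topic2 = 300, topic3 = 4500))
#> topic1 topic2 topic3
#>   1000    250   3750
```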
How do we calculate how many relevant documents are in a collection? In this abstract, we discuss our line of research about total recall systems, such as interactive systems for systematic reviews based on an active learning framework [4–6]. In particular, we will present 1) the problem in mathematical terms, and 2) the experiments of an interactive system that continuously monitors the costs of reviewing additional documents and suggests to the user whether or not to continue the search based on the available remaining resources. We will discuss the results of this system on the ongoing CLEF 2019 eHealth task.
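A minimal sketch of the continue-or-stop suggestion, assuming a fixed cost per reviewed document and a known remaining budget; the actual system also models the expected number of still-missing relevant documents, which is omitted here:

```r
# Suggest whether another review round is affordable with the remaining resources.
suggest_continue <- function(remaining_budget, cost_per_doc, round_size) {
  round_cost <- cost_per_doc * round_size
  if (round_cost <= remaining_budget) {
    sprintf("continue: next round costs %.2f out of %.2f remaining",
            round_cost, remaining_budget)
  } else {
    "stop: the remaining resources do not cover another round"
  }
}

suggest_continue(remaining_budget = 120, cost_per_doc = 0.8, round_size = 100)
#> "continue: next round costs 80.00 out of 120.00 remaining"
```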
In this paper, we report the results of our participation in the CLEF eHealth 2020 Task on “Multilingual Information Extraction”. This task focuses on the coding of medical textual data using the International Statistical Classification of Diseases and Related Health Problems (ICD) in Spanish. The main objective of our participation in this task is the study of reproducible experiments that require minimal effort to be set up and run and that can be used as a baseline. The contribution of our experiments to this task can be summarized as follows: the implementation of a reproducible pipeline for text analysis that uses universal dependency parsing; an evaluation of simple classifiers based on perfect matches on different morphological levels together with a tf-idf approach.
In this paper, we present the initial findings about a possible geometric interpretation of the BM25 model and a comparison of the BM25 with the Binary Independence Model (BIM) on a two-dimensional space. A Web application was developed in R to show an example of this geometric view on a standard TREC collection. The application is accessible at the following link: http://gmdn.shinyapps.io/shinyRF04
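A minimal sketch of the two-dimensional view under one simple reading: a document's coordinates are its log-likelihoods under the relevant and non-relevant BIM term models (toy probabilities, not the TREC data behind the Web application):

```r
# Map a document to a point on the plane given which query terms it contains
# and the BIM term probabilities under the two hypotheses.
bim_coords <- function(present, p_rel, p_nonrel) {
  c(x = sum(ifelse(present, log(p_nonrel), log(1 - p_nonrel))),
    y = sum(ifelse(present, log(p_rel),    log(1 - p_rel))))
}

bim_coords(present  = c(TRUE, TRUE, FALSE),
           p_rel    = c(0.7, 0.5, 0.3),
           p_nonrel = c(0.2, 0.3, 0.4))
```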
Evidence-based healthcare integrates the best research evidence with clinical expertise in order to make decisions based on the best practices available. In this context, the task of collecting all the relevant information (a recall-oriented task) in order to make the right decision within a reasonable time frame has become an important issue. In this paper, we investigate the problem of building effective Consumer Health Search (CHS) systems that use query variations to achieve high recall and fulfill the information needs of health consumers. In particular, we study an intent-aware gain metric used to estimate the amount of missing information and make a prediction about the achievable recall for each query reformulation during a search session. We evaluate and propose alternative formulations of this metric using standard test collections of the CLEF 2018 eHealth Evaluation Lab CHS.
Terminology standardization reflects two different aspects involving the meaning of terms and the structure of terminological resources. In this paper, we focus on the structural aspect of standardization and we present the work of re-modeling TriMED, a multilingual terminological database conceived to support multi-register medical communication. In particular, we provide a general methodology to make the termbase compliant with three of the most recent ISO/TC 37 standards, focusing on the definition of (i) the structural meta-model of the resource, (ii) the provided data categories and its Data Category Repository, and (iii) the TBX format for its implementation.
In this work, we compare and analyze a variety of approaches to the task of medical publication retrieval and, in particular, to the Technology Assisted Review (TAR) task. This problem consists in the process of collecting articles that summarize all the evidence that has been published regarding a certain medical topic, a task that requires long search sessions by experts in the field of medicine. For this reason, semi-automatic approaches are essential for supporting these types of searches when the amount of data exceeds the limits of users. In this paper, we use state-of-the-art models and weighting schemes with different types of preprocessing as well as query expansion (QE) and relevance feedback (RF) approaches in order to study the best combination for this particular task. We also tested word-embedding representations of documents and queries, in addition to three different ranking fusion approaches, to see if the merged runs perform better than the single models. In order to make...
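As one concrete example from this family of techniques, a minimal sketch of reciprocal rank fusion (RRF); the three fusion approaches actually compared in the paper are not necessarily this one:

```r
# Fuse several ranked lists: each document scores sum(1 / (k + rank)) over the
# lists it appears in; k dampens the influence of the very top ranks.
rrf <- function(rankings, k = 60) {
  docs <- unique(unlist(rankings))
  scores <- sapply(docs, function(d)
    sum(sapply(rankings, function(r) {
      pos <- match(d, r)
      if (is.na(pos)) 0 else 1 / (k + pos)
    })))
  sort(scores, decreasing = TRUE)
}

rrf(list(bm25       = c("d1", "d2", "d3"),
         embeddings = c("d2", "d4", "d1")))
```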
Supervised machine learning algorithms require a set of labelled examples to be trained; however, the labelling process is a costly and time-consuming task which is carried out by experts of the domain, who label the dataset by means of an iterative process to filter out non-relevant objects. In this paper, we describe a set of experiments that use gamification techniques to transform this labelling task into an interactive learning process where users can cooperate in order to achieve a common goal. To this end, we first use a geometrical interpretation of Naïve Bayes (NB) classifiers in order to create an intuitive visualization of the current state of the system and let the user change some of the parameters directly as part of a game. We apply this visualization technique to the classification of newswire and report the results of the experiments conducted with different groups of people: PhD students, Master Degree students and the general public. Then, we present a preliminary experiment of query rewriting for systematic reviews in a medical scenario, which makes use of gamification techniques to collect different formulations of the same query. Both experiments show how the exploitation of gamification approaches helps to engage users in abstract tasks that might be hard to understand and/or boring to perform.
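A minimal sketch of the geometric interpretation behind the visualization, assuming each document is plotted by its log-likelihood under the two class models so that the NB decision becomes a movable straight line (synthetic coordinates, not the newswire data):

```r
# Plot documents on the plane of the two class log-likelihoods and draw the
# current decision line, whose parameters a player can change during the game.
plot_nb_plane <- function(loglik_pos, loglik_neg, labels, slope = 1, intercept = 0) {
  plot(loglik_neg, loglik_pos, col = factor(labels), pch = 19,
       xlab = "log P(d | negative class)", ylab = "log P(d | positive class)")
  abline(a = intercept, b = slope, lwd = 2)  # the adjustable decision line
}

set.seed(7)
plot_nb_plane(loglik_pos = c(rnorm(30, -10, 2), rnorm(30, -14, 2)),
              loglik_neg = c(rnorm(30, -14, 2), rnorm(30, -10, 2)),
              labels = rep(c("pos", "neg"), each = 30))
```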
Three specific categories of people are confronted with the complexity of medical language: physicians, patients and scientific translators. The purpose of this work is to develop a methodology for the implementation of a terminological tool that contributes to solving problems related to the opacity that characterizes communication in the medical field among its various actors. The main goals are to: i) satisfy peer-to-peer communication, ii) facilitate the comprehension of medical information by patients, and iii) provide a regularly updated resource for scientific translators. We illustrate our methodology and its application through the description of a multilingual terminological-phraseological resource named TriMED. This terminological database will consist of records designed to create a terminological bridge between the various registers (specialist, semi-specialist, non-specialist) as well as across the languages considered. In this initial analysis, we restricted to the fiel...
In this paper, we describe the participation of the Information Management Systems (IMS) group at CLEF eHealth 2017 Task 1. In this task, participants are required to extract causes of death from death reports (in French and in English) and label them with the correct International Classification of Diseases (ICD10) code. We tackled this task by focusing on the replicability and reproducibility of the experiments and, in particular, on building a basic compact system that produces a clean dataset that can be used to implement more sophisticated approaches.
In this paper, we describe the participation of the Information Management Systems (IMS) group at CLEF eHealth 2017 Task 2. This task focuses on the problem of systematic reviews, that is, articles that summarise all the evidence that has been published regarding a certain medical topic. This task, known in Information Retrieval as the total recall problem, requires long and tedious search sessions by experts in the field of medicine. Automatic (or semi-automatic) approaches are essential to support these types of searches when the amount of data exceeds the limits of users, i.e. in terms of attention or patience. We present the two-dimensional probabilistic version of BM25 with explicit relevance feedback together with a query aspect rewriting approach for both the simple evaluation and the cost-effective evaluation.
The process of standardization plays an important role in the management of terminological resources. In this context, we present the work of re-modeling an existing multilingual terminological database for the medical domain, named TriMED. This resource was conceived in order to tackle some problems related to the complexity of medical terminology and to respond to different users’ needs. We provide a methodology that should be followed in order to make a termbase compliant with the three most recent ISO/TC 37 standards. In particular, we focus on the definition of i) the structural meta-model of the resource, ii) the data categories provided, and iii) the TBX format for its implementation. In addition to the formal standardization of the resource, we describe the realization of a new data category repository for the management of the TriMED terminological data and a Web application that can be used to access the multilingual terminological records.
