ACM Transactions on Speech and Language Processing, 2007
... For example, given the sentence “Kennedy's assassin, Sirhan Bishara Sirhan, was immediately arrested.”, it is required that Sirhan Bishara Sirhan and Kennedy be identified as named entities of type person and the kill relation in which the former is the first argument (agent) of ...
We survey the evaluation methodology adopted in information extraction (IE), as defined in a few different efforts applying machine learning (ML) to IE. We identify a number of critical issues that hamper comparison of the results obtained by different researchers. Some of these issues are common to other NLP-related tasks: e.g., the difficulty of exactly identifying the effects on performance of the data (sample selection and sample size), of the domain theory (features selected), and of algorithm parameter settings. Some issues are specific to IE: how leniently to assess inexact identification of filler boundaries, the possibility of multiple fillers for a slot, and how the counting is performed. We argue that, when specifying an IE task, these issues should be explicitly addressed, and a number of methodological characteristics should be clearly defined. To empirically verify the practical impact of the issues mentioned above, we perform a survey of the results of different algorithms when applied to a few standard datasets. The survey shows a serious lack of consensus on these issues, which makes it difficult to draw firm conclusions on a comparative evaluation of the algorithms. Our aim is to elaborate a clear and detailed experimental methodology and propose it to the IE community. Widespread agreement on this proposal should lead to future IE comparative evaluations that are fair and reliable. To demonstrate the way the methodology is to be applied we have organized and run a comparative evaluation of ML-based IE systems (the Pascal Challenge on ML-based IE) where the principles described in this article are put into practice. In this article we describe the proposed methodology and its motivations. The Pascal evaluation is then described and its results presented.
Abstract. The paper describes an approach to cross-media knowledge acquisition which combines text and raw data. The approach has been applied in a real-world use case concerning wind tunnel reports within the EU-funded project X-Media. The goal is to ...
This report describes SIE (Simple Information Extraction), an information extraction system designed and developed in the context of the
This document reports on the annotation of Named Entities for the Italian Content Annotation Bank (I-CAB) being developed at ITC-irst in conjunction with CELCT. I-CAB is a corpus of Italian news annotated with semantic information at different levels. The first level is represented by Temporal Expressions, the second level is represented by different types of Entities (both Named and not-Named), and the third level is represented by Relations between Entities (e.g., the affiliation relation connecting a person to an organization).
We present an approach for semantic relation extraction between nominals that combines shallow and deep syntactic processing and semantic information using kernel methods. Two information sources are considered: (i) the whole sentence where the relation appears, and (ii) WordNet synsets and hypernymy relations of the candidate nominals. Each source of information is represented by kernel functions. In particular, five basic kernel functions are linearly combined and weighted under different conditions. The experiments were carried out using support vector machines as the classifier. The system achieves an overall F1 of 71.8% on the Classification of Semantic Relations between Nominals task at SemEval-2007.
We propose an approach for extracting relations between entities from biomedical literature based solely on shallow linguistic information. We use a combination of kernel functions to integrate two different information sources: (i) the whole sentence where the relation appears, and (ii) the local contexts around the interacting entities. We performed experiments on extracting gene and protein interactions from two different data sets. The results show that our approach outperforms most of the previous methods based on syntactic and semantic information.
We present a brief overview of the main challenges in the extraction of semantic relations from English text, and discuss the shortcomings of previous data sets and shared tasks. This leads us to introduce a new task, which will be part of SemEval-2010: multi-way classification of mutually exclusive semantic relations between pairs of common nominals. The task is designed to compare different approaches to the problem and to provide a standard testbed for future research, which can benefit many applications in Natural Language Processing.
This paper describes SIE (Simple Information Extraction), a modular information extraction system designed with the goal of being easily and quickly portable across tasks and domains. SIE is composed of a general-purpose machine learning algorithm (SVM) combined with several customizable modules. A crucial role in the architecture is played by Instance Filtering, which increases efficiency without reducing effectiveness. The results obtained by SIE on several standard data sets, representative of different tasks and domains, are reported. The experiments show that SIE achieves performance close to the best systems in all tasks, without using domain-specific knowledge.
Unsupervised paraphrase acquisition has been an active research field in recent years, but its effective coverage and performance have rarely been evaluated. We propose a generic paraphrase-based approach for Relation Extraction (RE), aiming at a dual goal: obtaining an applicative evaluation scheme for paraphrase acquisition and obtaining a generic and largely unsupervised configuration for RE. We analyze the potential of our approach and evaluate an implemented prototype of it using an RE dataset. Our findings reveal a high potential for unsupervised paraphrase acquisition. We also identify the need for novel robust models for matching paraphrases in texts, which should address syntactic complexity and variability.
Papers by Lorenza Romano