I am an Associate Professor in the TALN Group at the Department of Information and Communication Technologies, Universitat Pompeu Fabra. My main research interest is Computational Linguistics. I work in the areas of Text Summarization, Information Extraction and Semantic Analysis. I am the creator of the SUMMA system, and co-creator of the Simplext simplification system and the Dr Inventor scientific text mining library.
This study addresses the automatic simplification of texts in Spanish in order to make them more accessible to people with cognitive disabilities. A corpus analysis of original and manually simplified news articles was undertaken in order to identify and quantify relevant operations to be implemented in a text simplification system. The articles were further compared at sentence and text level by means of automatic feature extraction and various machine learning classification algorithms, using three different groups of features (POS frequencies, syntactic information, and text complexity measures) with the aim of identifying features that help separate original documents from their simple equivalents. Finally, it was investigated whether these features can be used to decide upon simplification operations to be carried out at the sentence level (split, delete, and reduce). Automatic classification of original sentences into those to be kept and those to be eliminated outperformed the classification that was previously conducted on the same corpus. Kept sentences were further classified into those to be split or significantly reduced in length and those to be left largely unchanged, with the overall F-measure up to 0.92. Both experiments were conducted and compared on two different sets of features: all features and the best subset returned by an attribute selection algorithm.
In this poster submission, we describe the current state of development of textual analysis and ontology-based information extraction in real-world applications, as defined in the context of the European R&D project “MUSING”, which deals with Business Intelligence. We present in some detail the current state of ontology development, including time and domain ontologies, which guide information extraction towards an ontology population task.
ITL - International Journal of Applied Linguistics, 2013
Are rounded numbers easier to understand than exact numbers? Information in newspapers often takes the form of numerical expressions which pose comprehension problems for many people, including people with disabilities, low literacy levels or lack of access to advanced technology. The purpose of this paper is to motivate and describe a rule-based lexical component that simplifies numerical expressions in Spanish texts. We propose a simplification approach that makes news articles more accessible to readers with special needs by rewriting difficult numerical expressions in a simpler way. We carried out a study that identifies powerful strategies for simplifying numerical information in a text by analysing a parallel corpus of original texts and their manual simplifications. The study is complemented with an analysis of simplifications obtained in response to a questionnaire where subjects were asked to produce simplifications of numerical expressions in context. Finally, we implemented and evaluated a simplification system that mimics the simplification strategies that were found to be effective.
Papers by Horacio Saggion