Tweets in Spanish language, annotated for lexical normalization purposes. Created for the tweet normalization challenge at Tweet-Norm 2013.
In this article, we present a factoid question-answering system, Sibyl, specifically tailored for question answering (QA) on spoken-word documents. This work explores, for the first time, which techniques can be robustly adapted from the usual QA on written documents to the more difficult spoken document scenario. More specifically, we study new information retrieval (IR) techniques designed for speech, and utilize several levels of linguistic information for the speech-based QA task. These include named-entity detection with phonetic information, syntactic parsing applied to speech transcripts, and the use of coreference resolution. Sibyl is largely based on supervised machine-learning techniques, with special focus on the answer extraction step, and makes little use of handcrafted knowledge. Consequently, it should be easily adaptable to other domains and languages. Sibyl and all its modules are extensively evaluated on the European Parliament Plenary Sessions English corpus, compa...
Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL), 2005
In this paper we present a semantic role labeling system submitted to the CoNLL-2005 shared task. The system makes use of partial and full syntactic information and converts the task into a sequential BIO-tagging problem. As a result, the labeling architecture is very simple. Building on a state-of-the-art set of features, a binary classifier for each label is trained using AdaBoost with fixed-depth decision trees. The final system, which combines the outputs of two base systems, achieved F1 = 76.59 on the official test set. Additionally, we provide results comparing the system when using partial vs. full parsing input information.
In the Question Answering (QA) task, search engines have to extract concise and precise fragments of text that contain an answer to a question posed by the user in natural language. This task is very close to what is usually considered automatic text understanding. The ...
Abstract. This is a preliminary report of the work carried out in order to introduce “spontaneous” questions into QAST at CLEF 2009. QAST (Question Answering in Speech Transcripts) is a track of the CLEF campaign. The aim of this report is to show how difficult it can be to generate “spontaneous” questions and the importance of taking into account the real information needs of users when evaluating question answering systems.
In this paper we introduce TweetNorm_es, an annotated corpus of Spanish-language tweets, which we make publicly available under the terms of the CC-BY license. This corpus is intended for the development and testing of microtext normalization systems. It was created for Tweet-Norm, a tweet normalization workshop and shared task, and is the result of a joint annotation effort by different research groups. In this paper we describe the methodology defined to build the corpus as well as the guidelines followed in the annotation process. We also present a brief overview of the Tweet-Norm shared task, the first evaluation environment in which the corpus was used.
Papers by Pere R. Comas