This paper describes the ongoing work carried out within the CoBiLiRo (Bimodal Corpus for Romanian Language) research project, part of ReTeRom (Resources and Technologies for Developing Human-Machine Interfaces in Romanian). Data annotation is increasingly used in speech recognition and synthesis to support learning processes. In this context, a variety of annotation systems for Speech and Text Processing environments have been presented. Although many designs for the data annotation workflow have emerged, the handling of metadata to manage complex user-defined annotations is not sufficiently covered. We propose a format designed to serve as an annotation standard for bimodal resources, which facilitates searching, editing and statistical analysis operations over them. The design and implementation of an infrastructure that houses the resources are also presented. The goal is widening the dissemination of bimodal corpora for resea...
Themes: current work on annotation tools, interaction of human and automatic annotation tools. We present an annotation tool, called GLOSS, with the following features: it accepts as input SGML source documents and/or their database images and produces as output source SGML documents as well as the associated database images; allows several documents to be open simultaneously; can collapse independent annotation views of the same original document, which also allows for a layer-by-layer annotation process in different annotation sessions and by different annotators, including automatic ones; offers an attractive interface to the user; permits discourse structure annotation by offering a pair of building operations (adjoining and substitution) and remaking operations (undo, delete parent-child link and tree dismember). Finally, we give an example that shows how GLOSS is employed to validate, using a corpus, a theory of global discourse. A demo can be offered on a PC platform run...
The paper argues in favour of an electronic form of the thesaurus dictionary of the Romanian language, the dictionary edited by the Romanian Academy in two editions since 1913. Preliminary steps such as scanning, optical character recognition, and pre-processing have already been completed. The paper presents a prototype for the correction of the digital form of the dictionary. The numerous advantages of the digital thesaurus dictionary are discussed, as a basis for future work in Romanian lexicography and, more generally, in language processing. Key words: resources.
Advances in Natural Language Processing and Applications
Editorial Board of the Volume (Comité editorial del volumen): Eneko Agirre, Christian Boitet, Nicoletta Calzolari, John Carroll, Kenneth Church, Dan Cristea, Walter Daelemans, Barbara Di Eugenio, Claire Gardent, Alexander Gelbukh, Gregory Grefenstette, Eva Hajicova, Yasunari Harada, Eduard Hovy, Nancy Ide, Diana Inkpen, Aravind Joshi, Dimitar Kazakov, Alma Kharrat, Adam Kilgarriff, Alexander Koller, Sandra Kuebler, Hugo Liu, Aurelio Lopez Lopez, Diana McCarthy, Igor Mel'cuk, Rada Mihalcea, Masaki Murata, Nicolas Nicolov, Kemal Oflazer, Constantin Orasan, Manuel Palomar, Ted ...
The present paper examines a variety of ways in which the Corpus of Contemporary Romanian Language (CoRoLa) can be used. Numerous examples highlight the wide range of query possibilities that CoRoLa opens for different types of users. The querying of CoRoLa shown here is supported by the KorAP frontend, through the query language Poliqarp. Queries address annotation layers, such as the lexical, morphological and, in the near future, the syntactic layer, as well as the metadata. Other issues discussed are how to build a virtual corpus, how to deal with errors, how to find expressions and how to identify expressions
This paper presents the nearly final results of a priority project of the Romanian Academy: the Corpus of Contemporary Romanian Language (CoRoLa). The Corpus includes data in both written and spoken forms of the language. The textual collection is made up of publications covering the period from the Second World War to the present day, while the spoken collection includes only recent recordings.
In this article we present a method for the automatic extraction of syntactic patterns that are used to develop a dependency parsing method. The patterns have been extracted from a corpus automatically annotated for tokens, sentence boundaries, parts of speech and noun phrases, and manually annotated for dependency relations between words. The evaluation shows promising results in the case of a free word order language.
The way in which discourse features express connections back to the previous discourse has been described in the literature in terms of adjoining at the right frontier of discourse structure. But this does not allow for discourse features that express expectations about what is to come in the subsequent discourse. After characterizing these expectations and their distribution in text, we show how an approach that makes use of substitution as well as adjoining on a suitably defined right frontier can be used both to process expectations and to constrain discourse processing in general.
The quality of discourse structure annotations is negatively influenced by the numerous difficulties that occur in the analysis process. In contrast, referential annotation resources are considerably more reliable, given the high precision of existing anaphora resolution systems. We present an approach based on the Veins Theory (Cristea, Ide, Romary, 1998), in which successful reference annotations of texts are exploited in order to improve arbitrary structural analyses; in this way, the large number of corpora annotated at reference level can be used for the acquisition of discourse structure annotation resources.
Preface. Big cultural heritage data present an unprecedented opportunity for the humanities that is reshaping conventional research methods. However, digital humanities have grown past the stage where the mere availability of digital data was enough as a demonstrator of possibilities. Knowledge resource modeling, development, enrichment and integration are crucial for associating relevant information in pools of digital material which are not only scattered across various archives, libraries and collections, but also often lack relevant metadata. Within this research framework, NLP approaches originally stemming from lexico-semantic information extraction and knowledge resource representation, modeling, development and reuse have a pivotal role to play. From the NLP perspective, applications of knowledge resources for the SocioEconomic Sciences and Humanities present numerous interesting research challenges that relate among others to the development of historical lexico-semantic...
In this paper we present the methodology employed in the creation of an aligned speech-to-text Romanian Corpus. The corpus uses recordings from the AMPER-ROM and AMPRom projects as well as ad-hoc recordings of continuous speech. The protocol for speech recording and labelling, as well as the manual annotation procedure, are described. The corpus is intended to be used for training a speech segmentation module and an automatic speech-to-text aligner module.
As is well known, on the political scene the success of a speech can be measured by the degree to which the speaker is able to change attitudes, opinions, feelings and political beliefs in the audience. We suggest a range of analysis tools, all belonging to semiotics, from lexical-semantic to syntactic and rhetorical, which, integrated into the exploratory panoply of discursive weapons of a political speaker, could influence the impact of her/his speeches on a sensitive audience. Our approach is based on the assumption that semiotics, as both methodology and meta-language, can support a situational analysis of political discourse. Such an analysis assumes establishing the communication situation, in our case the Parliament’s vote in favour of suspending the Romanian President, through which we can describe an action of communication. We describe a platform, the Discourse Analysis Tool (DAT), which integrates a range of natural language processing tools with the ...
Papers by Dan Cristea