English-Hindi parallel corpus collected from several sources. Tokenized and sentence-aligned. A part of the data is our patch for the Emille parallel corpus.
A Hindi corpus of texts downloaded mostly from news sites. Contains both the original raw texts and an extensively cleaned-up and tokenized version suitable for language modeling. 18M sentences, 308M tokens.
In this paper we focus on syntactic annotation consistency within Universal Dependencies (UD) treebanks for Russian: UD_Russian-SynTagRus, UD_Russian-GSD, UD_Russian-Taiga, and UD_Russian-PUD. We describe the four treebanks, their distinctive features and development. In order to test and improve consistency within the treebanks, we reconsidered the experiments by Martínez Alonso and Zeman; our parsing experiments were conducted using a state-of-the-art parser that took part in the CoNLL 2017 Shared Task. We analyze error classes in functional and content relations and discuss a method to separate the errors induced by annotation inconsistency and those caused by syntactic complexity and other factors.
The original SDP 2014 and 2015 data collections were made available under task-specific ‘evaluation’ licenses to registered SemEval participants. In mid-2016, all original data was bundled with system submissions, supporting software, an additional SDP-style collection of semantic dependency graphs, and additional background material (from which some of the SDP target representations were derived) for release through the Linguistic Data Consortium (with LDC catalogue number LDC2016T10). One of the four English target representations (viz. DM) and the entire Czech data (in the PSD target representation) are not derivative of LDC-licensed annotations and thus can be made available for direct download (Open SDP; version 1.2; January 2017) under a more permissive licensing scheme, viz. the Creative Commons Attribution-NonCommercial-ShareAlike license. This package also includes some ‘richer’ meaning representations from which the English bi-lexical DM graphs derive, viz. scope-un...
Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation consists of a linguistically motivated word segmentation; a morphological layer comprising lemmas, universal part-of-speech tags, and standardized morphological features; and a syntactic layer focusing on syntactic relations between predicates, arguments and modifiers. In this paper, we describe version 2 of the universal guidelines (UD v2), discuss the major changes from UD v1 to UD v2, and give an overview of the currently available treebanks for 90 languages.
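The layers described above are serialized in the CoNLL-U format, where each token occupies one tab-separated line with ten columns. As an illustrative sketch (the example sentence and token values are invented, not taken from any of the treebanks above), one such line can be unpacked like this:

```python
# The ten standard CoNLL-U columns, in order.
CONLLU_COLS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
               "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

# A hypothetical token line: the word "dogs", lemma "dog", tagged NOUN,
# with the morphological feature Number=Plur, attached to token 3 as nsubj.
line = "2\tdogs\tdog\tNOUN\t_\tNumber=Plur\t3\tnsubj\t_\t_"

# Zip column names with the tab-separated values to get a token record.
token = dict(zip(CONLLU_COLS, line.split("\t")))
print(token["UPOS"], token["DEPREL"])  # -> NOUN nsubj
```

The lemma, UPOS, and FEATS columns carry the morphological layer; HEAD and DEPREL carry the syntactic layer.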
Artificially created treebank of elliptical constructions (gapping), in the annotation style of Universal Dependencies. Data taken from the UD 2.1 release and from large web corpora parsed by two parsers. The input data are filtered, sentences where gapping could be applied are identified, and those sentences are then transformed by omitting one or more words, resulting in a sentence with gapping. Details in Droganova et al.: Parse Me if You Can: Artificial Treebanks for Parsing Experiments on Elliptical Constructions, LREC 2018, Miyazaki, Japan.
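The core transformation can be sketched very simply: a gapping construction arises when a predicate repeated in the second conjunct is elided. The following is a minimal illustration of that idea, not the authors' actual pipeline (which operates on parsed trees and applies linguistic filters):

```python
def apply_gapping(tokens, verb_form):
    """Remove every repeated occurrence of verb_form after the first,
    turning a full coordination into a gapped one. tokens: word forms."""
    out, seen = [], False
    for tok in tokens:
        if tok == verb_form and seen:
            continue  # elide the repeated predicate -> gapping
        if tok == verb_form:
            seen = True
        out.append(tok)
    return out

sentence = ["Mary", "won", "gold", "and", "Peter", "won", "silver"]
print(" ".join(apply_gapping(sentence, "won")))
# -> Mary won gold and Peter silver
```

In the real setting the elided material is identified via the dependency structure, and the gold-standard tree of the untransformed sentence supplies the annotation of the gapped result.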
Texts in 107 languages from the W2C corpus (http://hdl.handle.net/11858/00-097C-0000-0022-6133-9), first 1,000,000 tokens per language, tagged by the delexicalized tagger described in Yu et al. (2016, LREC, Portorož, Slovenia). Changes in version 1.1: 1. Universal Dependencies tagset instead of the older and smaller Google Universal POS tagset. 2. SVM classifier trained on Universal Dependencies 1.2 instead of HamleDT 2.0. 3. Balto-Slavic, Germanic, and Romance languages were tagged by a classifier trained only on the respective group of languages. Other languages were tagged by a classifier trained on all available languages. The "c7" combination from version 1.0 is no longer used.
We announce a new language resource for research on semantic parsing, a large, carefully curated collection of semantic dependency graphs representing multiple linguistic traditions. This resource is called SDP 2016 and provides an update and extension to previous versions used as Semantic Dependency Parsing target representations in the 2014 and 2015 Semantic Evaluation Exercises. For a common core of English text, this third edition comprises semantic dependency graphs from four distinct frameworks, packaged in a unified abstract format and aligned at the sentence and token levels. SDP 2016 is the first general release of this resource and is available for licensing from the Linguistic Data Consortium as of May 2016. The data is accompanied by an open-source SDP utility toolkit and system results from previous contrastive parsing evaluations against these target representations.
Three additional Czech reference translations of the whole WMT 2011 data set (http://www.statmt.org/wmt11/test.tgz), translated from the German originals. The original segmentation of the WMT 2011 data is preserved.
We describe a pilot experiment aimed at harmonizing diverse data resources that contain coreference-related annotations. We converted 17 existing datasets for 11 languages into a common annotation scheme based on Universal Dependencies, and released a subset of the resulting collection publicly under the name CorefUD 0.1 via the LINDAT-CLARIAH-CZ repository (http://hdl.handle.net/11234/1-3510). All the datasets included in CorefUD 0.1 are enriched with automatic morphological and syntactic annotations which are fully compliant with the standards of the Universal Dependencies project. The datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column. This report describes our current knowledge and the results of the harmonization procedure valid in March 2021; see https://ufal.mff.cuni.cz/corefud for future updates.
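The MISC column of a CoNLL-U line holds pipe-separated Key=Value pairs (or a plain underscore when empty), so the coreference attributes can be read with a few lines of generic parsing code. A minimal sketch; the specific attribute value shown is illustrative, not an exact CorefUD example:

```python
def parse_misc(misc):
    """Parse a CoNLL-U MISC field ('Key=Value|Key=Value' or '_') into a dict."""
    if misc == "_":
        return {}
    # Split on '|' into pairs, then on the first '=' into key and value.
    return dict(pair.split("=", 1) for pair in misc.split("|"))

# Hypothetical MISC value: an Entity attribute opening a mention,
# alongside the standard SpaceAfter attribute.
attrs = parse_misc("Entity=(e12-person|SpaceAfter=No")
print(attrs["Entity"])  # -> (e12-person
```

Because the coreference information rides in MISC rather than in new columns, any standard CoNLL-U reader can still process the files, ignoring attributes it does not understand.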
Papers by Daniel Zeman