I am a historical corpus linguist, working mainly with aspect and definiteness-related phenomena in the history of Russian and in Old Church Slavonic, often contrasting with Greek. I run and maintain the TOROT diachronic treebank.
This article explores the synchronic variation between the nominative-accusative (NA) and genitive-accusative (GA) in the oldest layer of canonical Old Church Slavonic (OCS), using parallel Greek and OCS data with principled information status annotation. Firstly, the data are used to clarify the claims made about the pragmatic properties of the alternation in the previous literature. There is a good case for claiming that OCS GA marking functions as a limited type of definiteness marking, i.e. that GA objects will nearly always be previously mentioned or contextually accessible. Secondly, the data are used to examine whether the GA-NA variation correlates with any other discourse properties known to be important in differential object marking systems. The NA is found to be a marker of referential persistence: a new referent will typically be NA-marked if it is an important participant in the further narrative. Thirdly, the focus is shifted to the relationship between subject and obje...
We test whether the functionality (non-redundancy) of morphological features can serve as a predictor of the survivability of those features in the course of language change. We apply to the Slavic language group a recently proposed method that measures the functionality of a feature by estimating its importance for the performance of an automatic parser. We find that the functionality of a Common Slavic grammeme, together with the functionality of its category, is a significant predictor of its survivability in modern Slavic languages. The least functional grammemes within the most functional categories are most likely to die out.
This paper gives an example of how enriched diachronic treebank data can shed new light on an old and conflicted topic, even when that topic is morphological and semantic in nature rather than syntactic. The topic is the rise of the Russian po delimitatives, a change seen as crucial in most accounts of the history of Russian aspect, since it represents a major step in generalising the derivational aspect system. Earlier accounts concur that the po delimitatives spread fairly recently, too recently for the development to be connected to the loss of the aorist tense, which also had delimitative readings with atelic verbs. Using treebank data from the Tromsø Old Russian and OCS Treebank, enriched with tags for derivational morphology and semantics, I show that the po delimitatives were not marginal even in the earliest Slavic sources, in terms of either frequency or semantics, and that they first complemented and then competed with the delimitative aorists. It can thus be claimed tha...
We describe the Tromsø Old Russian and Old Church Slavonic Treebank (TOROT) that spans from the earliest Old Church Slavonic to modern Russian texts, covering more than a thousand years of continuous language history. We focus on the latest additions to the treebank, first of all, the modern subcorpus that was created by a high-quality conversion of the existing treebank of contemporary standard Russian (SynTagRus).
Historical treebanks tend to be manually annotated, which is not surprising, since state-of-the-art parsers are not accurate enough to ensure high-quality annotation for historical texts. We test whether automatic parsing can be an efficient pre-annotation tool for Old East Slavic texts. We use the TOROT treebank from the PROIEL treebank family. We convert the PROIEL format to the CoNLL format and use MaltParser to create syntactic pre-annotation. Using the most conservative evaluation method, which takes into account PROIEL-specific features, MaltParser by itself yields 0.845 unlabelled attachment score, 0.779 labelled attachment score and 0.741 secondary dependency accuracy (note, though, that the test set comes from a relatively simple genre and contains rather short sentences). Experiments with human annotators show that preparsing, if limited to sentences where no changes to word or sentence boundaries are required, increases their annotation rate. For experienced annotators, t...
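As an illustration of the two attachment metrics named in this abstract, the sketch below computes them in their standard textbook form: the unlabelled attachment score is the share of tokens whose predicted head matches the gold head, and the labelled attachment score additionally requires the dependency label to match. This is not the paper's "most conservative" PROIEL-specific evaluation, and it does not cover secondary dependencies; the function name, the (head, label) token representation and the toy labels are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical representation: one (head_index, relation_label) pair per token.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for two aligned analyses of the same tokens."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # UAS: the head alone has to match.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: head and label both have to match.
    labelled_hits = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return head_hits / total, labelled_hits / total


# Toy four-token sentence with invented labels; 0 marks the root.
gold = [(2, "sub"), (0, "pred"), (2, "obj"), (3, "atr")]
predicted = [(2, "sub"), (0, "pred"), (2, "obl"), (2, "atr")]
uas, las = attachment_scores(gold, predicted)
```

In the toy sentence, three of four heads are correct and two of four tokens have both the correct head and the correct label.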
We describe automatic conversion of the SynTagRus dependency treebank of Russian to the PROIEL format (with the ultimate purpose of obtaining a single-format diachronic treebank spanning more than a thousand years), focusing on analysis of shared arguments in verbal coordinations. Whether arguments are shared or private is not marked in the SynTagRus native format, but the PROIEL format indicates sharing by means of secondary dependencies. In order to recover missing information and insert secondary dependencies into the converted SynTagRus, we create a simple guessing algorithm based on four probabilistic features: how likely a given argument type is to be shared; how likely an argument in a given position is to be shared; how likely a given verb is to have a given argument; how likely a given verb is to have a given argument frame. Boosted with a few deterministic rules and trained on a small manually annotated sample (346 sentences), the guesser very successfully inserts shared s...
In this paper we first briefly describe the design of a corpus containing the Koine Greek original text of the New Testament and its translations into Gothic, Latin, Old Church Slavic and Armenian. We then discuss extensively the annotation that we have applied in each layer of annotation: morphology and syntax, information structure, animacy, and token alignment. For each type of annotation we provide some preliminary results and applications that draw on it, often in combination with other layers of annotation.
... Harris and Campbell 1995, 25-9). Only Sprincak tries to use this "syntactic law", as he calls it, to explain the development of noun phrase syntax, claiming that agreeing forms are typical of a paratactic language structure, while governed forms are ...
We describe and compare two tools for processing Middle Russian texts. Both tools provide lemmatization, part-of-speech and morphological annotation. One (“RNC”) was developed for annotating texts in the Russian National Corpus and is rule-based. The other one (“TOROT”) is being used for annotating the eponymous corpus and is statistical. We apply the two analyzers to the same Middle Russian text and then compare their outputs with high-quality manual annotation. Since the analyzers use different annotation schemes and spelling principles, we have to harmonize their outputs before we can compare them. The comparison shows that TOROT performs considerably better than RNC (lemmatization 69.8% vs. 47.3%, part of speech 89.5% vs. 54.2%, morphology 81.5% vs. 16.7%). If, however, we limit the evaluation set only to those tokens for which the analyzers provide a guess and in addition consider the RNC response correct if one of the multiple guesses it provides is correct, the numbers become comparable (88.5% vs. 91.9%, 93.9% vs. 95.2%, 81.5% vs. 86.8%). We develop a simple procedure which boosts TOROT lemmatization accuracy by 8.7% by using RNC lemma guesses when TOROT fails to provide one and matching them against the existing TOROT lemma database. We conclude that a statistical analyzer (trained on a large body of material) can deal with non-standardised historical texts better than a rule-based one. Still, it is possible to make the analyzers collaborate, boosting the performance of the superior one.
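The fallback procedure described in this abstract can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `fallback_lemma`, its parameters and the placeholder lemma strings are all hypothetical, the lemma database is modelled as a plain set of strings, and the harmonization of spelling principles between the two analyzers, which the abstract says is required, is assumed to have been done beforehand.

```python
from typing import Optional, Sequence, Set


def fallback_lemma(
    primary_guess: Optional[str],
    secondary_guesses: Sequence[str],
    known_lemmas: Set[str],
) -> Optional[str]:
    """Prefer the primary analyzer; otherwise try the secondary one's guesses.

    A secondary guess is accepted only if it already exists in the primary
    analyzer's lemma database (here a hypothetical set of strings).
    """
    if primary_guess is not None:
        return primary_guess
    for guess in secondary_guesses:
        if guess in known_lemmas:
            return guess
    # Neither analyzer produced a usable lemma for this token.
    return None


# Placeholder data; these are not real lemmas from either corpus.
lemma_database = {"lemma_a", "lemma_b"}
kept = fallback_lemma("lemma_a", ["lemma_x"], lemma_database)
rescued = fallback_lemma(None, ["lemma_x", "lemma_b"], lemma_database)
unresolved = fallback_lemma(None, ["lemma_x"], lemma_database)
```

Accepting only guesses that are already in the database keeps the secondary analyzer from introducing lemmas that the primary annotation scheme does not know, at the cost of leaving some tokens without a lemma.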
Papers by Hanne Eckhoff