This chapter explores the potential of natural language corpora for grammatical research. It distinguishes three main types of data (evidence) that a corpus can provide: factual evidence, frequency evidence, and interaction evidence. The chapter makes the research case for parsing a corpus completely, with the annotation corrected by human linguists. It uses the cyclic ‘3A’ perspective (Nelson et al. 2002) to relate a series of exploratory algorithms and tools relevant to the grammatical researcher, including concordancing tools, grammatical exploration tools, and bottom-up generalization algorithms. The aim is not merely to describe what is found in a corpus but to perform systematic ‘natural experiments’. The rich grammatical analysis of a parsed corpus gains a new role: reliably obtaining examples of grammatical units within which research may be conducted. The chapter concludes with a discussion of some simple experiments and the methodological issues that arise in carrying them out.
This paper describes a series of novel statistical meta-tests for comparing experimental runs for significant difference, for conditions where experiments are carried out using Binomial or multinomial contingency statistics (χ², z and log-likelihood tests, etc.). The new tests permit us to evaluate whether experiments have failed to replicate on new data; whether a particular data source or subcorpus obtains a significantly different result from another; or whether changing experimental parameters obtains a stronger effect. Recognising when an experiment obtains a significantly different result and when it does not is an issue frequently overlooked in research publication. Papers are frequently published citing ‘p values’ or test scores as if they indicated a ‘stronger effect’, substituting for sound statistical reasoning. This paper sets out a series of tests which together illustrate the correct approach to this question, namely, to compute confidence intervals for differences between effect sizes.
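To make the logic of such a meta-test concrete, the following is a minimal Python sketch of the general approach of comparing effect sizes via confidence intervals. It uses the standard Wilson score interval and Newcombe's method for the difference between two independent proportions, plus a crude Gaussian comparison of two such differences. The function names and figures are illustrative assumptions; this is not a reproduction of the paper's specific tests.

```python
from math import sqrt

def wilson_interval(p, n, z=1.96):
    """Wilson score interval for an observed proportion p from n cases."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    spread = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - spread, centre + spread

def newcombe_wilson_difference(p1, n1, p2, n2, z=1.96):
    """Newcombe-Wilson interval for the difference d = p1 - p2
    between two independent observed proportions."""
    l1, u1 = wilson_interval(p1, n1, z)
    l2, u2 = wilson_interval(p2, n2, z)
    d = p1 - p2
    return (d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2),
            d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2))

def effect_sizes_differ(run_a, run_b, z=1.96):
    """Gaussian approximation: do two runs obtain significantly different
    effect sizes? Each run is ((p1, n1), (p2, n2)); the effect size is the
    simple difference of proportions d = p1 - p2."""
    (p1a, n1a), (p2a, n2a) = run_a
    (p1b, n1b), (p2b, n2b) = run_b
    d_a, d_b = p1a - p2a, p1b - p2b
    var_a = p1a * (1 - p1a) / n1a + p2a * (1 - p2a) / n2a
    var_b = p1b * (1 - p1b) / n1b + p2b * (1 - p2b) / n2b
    return abs(d_a - d_b) > z * sqrt(var_a + var_b)

# Hypothetical example: did a second run replicate the effect found in the first?
run_a = ((0.40, 200), (0.25, 180))
run_b = ((0.38, 150), (0.33, 160))
print(newcombe_wilson_difference(0.40, 200, 0.25, 180))
print(effect_sizes_differ(run_a, run_b))
```

The point is the shape of the reasoning: the question "is run B's effect significantly weaker than run A's?" is answered with an interval or test on the difference between effect sizes, not by comparing p values or test scores.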
An 800,000-word corpus of spontaneous spoken British English containing equal amounts of directly comparable material from 1960–1976 and from the early 1990s. The corpus is textually annotated (marking sentence boundaries, speakers, overlaps, etc.), as well as grammatically annotated (tagged and parsed), indexed, and fully searchable with ICECUP, using Fuzzy Tree Fragments and other query systems. The resource features a lexicon (a database of word-tag combinations in the corpus) and a grammaticon (a database of node combinations). These will enable users to contrast lexical and grammatical distributions in the LLC and ICE. The resource is an invaluable research tool for linguists interested in present-day English grammar, as well as for those interested in current changes in this domain.
Wallis (2013) provides an account of an empirical evaluation of Binomial confidence intervals and contingency test formulae. The main take-home message of that article was that it is possible to evaluate statistical methods objectively and to provide advice to researchers based on an objective computational assessment. In this article we develop that evaluation further by re-weighting estimates of error using Binomial and Fisher weighting, which is equivalent to an ‘exhaustive Monte-Carlo simulation’. We also develop an argument concerning key attributes of difference intervals: that we are not merely concerned with when differences are zero (conventionally equivalent to a significance test), but also with accurate estimation when the difference may be non-zero (necessary for plotting data and comparing differences).
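The ‘exhaustive Monte-Carlo simulation’ idea can be sketched in a few lines: for a chosen true proportion P and sample size n, enumerate every possible observation rather than sampling at random, and weight each outcome's contribution to the error rate by its Binomial probability. The sketch below is an illustration only, assuming the Wilson score interval (the helper defined in the earlier sketch) as the formula under evaluation; it does not reproduce the Fisher weighting or the specific methods compared in the article.

```python
from math import comb

def binomial_weighted_error(P, n, z=1.96):
    """Expected Type I error of an interval formula at true proportion P:
    enumerate every possible observation x = 0..n, test whether the interval
    computed from the observed p = x / n fails to cover P, and weight that
    failure by the Binomial probability of observing x. No random sampling
    is needed -- the enumeration is exhaustive."""
    error = 0.0
    for x in range(n + 1):
        lower, upper = wilson_interval(x / n, n, z)   # helper defined above
        if P < lower or P > upper:
            error += comb(n, x) * P ** x * (1 - P) ** (n - x)
    return error

# e.g. the realised error rate at P = 0.3, n = 40 for a nominal 95% interval
print(binomial_weighted_error(0.3, 40))
```

Summing these Binomially weighted failures gives the same answer as a Monte-Carlo simulation with an unlimited number of runs, which is why the procedure can fairly be described as ‘exhaustive’.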
A research blog discussing statistics for corpus linguistics, aimed at researchers of all levels of experience.
The blog approaches statistical questions in a fresh, mathematical manner, prioritising the design of experiments and the visualisation of results over traditional statistical testing.
By approaching statistics in this manner, the student is better able to understand what the statistical test is actually testing and therefore what a significant result means for their research question. We debunk a few myths along the way.