Lu, X. (2022). What can corpus software reveal about language development? In A.
O’Keeffe & M. McCarthy (Eds.), The Routledge handbook of corpus linguistics (2nd ed.)
(pp. 155-167). London: Routledge. https://doi.org/10.4324/9780367076399-12
What can corpus software reveal about language development?
Xiaofei Lu
The Pennsylvania State University
1. What is language development?
Language development refers to the process in which the language faculty develops in a
human being. First language (L1) development is concerned with how children acquire the
ability to use their native language, while second language (L2) development is concerned
with how children and adults acquire the ability to use an L2.
Theories of L1 development generally need to address at least the following three questions:
what children bring to the language learning task, what mechanisms drive language
acquisition, and what types of input support the language-learning system (Pence and
Justice 2017). Psychologists have taken drastically different approaches to answering these
questions, among which the rationalist, empiricist, and pragmatist paradigms have been the
most influential (Russell 2004). The rationalist approach, inspired by Chomskyan
linguistics, takes the view that the language faculty does not depend on external sources for
its content but is internal to each individual. For rationalists, children are born with innate
formal knowledge of a universal grammar, and they bring this domain-specific knowledge
to the task of acquiring the I-language (i.e. the internal and individual language) of their
native tongue. Language input is used to discover the parameters that their native language
uses to satisfy the universal grammar. The empiricist approach, upheld by connectionists,
believes that the content of the language faculty is not innate, but is derived from perceptual
experience. For empiricists, children employ domain-general mechanisms of associative
learning to acquire the rules and representations of their native language through experience
with sufficient speech input. The pragmatist or socio-cognitivist approach advocates that
children recruit their socio-cognitive capacity to actively construct their language faculty.
Within this paradigm, language is viewed as a socio-cultural action, and the language
development process is viewed as involving children constructing a series of models or
working theories of their mother tongue from the evidence that is available to them.
Theories of L2 development generally seek to explain a different set of questions, including
the nature of L2 knowledge, the nature of interlanguage, the contributions of knowledge of
the L1, the contributions of the linguistic environment, and the role of instruction (Ortega
2014). A total of 14 contemporary theories of or approaches to L2 development or
acquisition are presented in VanPatten and Williams (2014) and Atkinson (2011). These
theories take different stances with respect to the various aspects of L2 development. For
example, concerning the nature of L2 knowledge, the Chomskyan universal grammar
theory, which is committed to nativism (a theoretical perspective positing that children
have the innate ability to acquire language), argues that L2 learners cannot obtain
knowledge of ungrammaticality and ambiguity from linguistic input, but possess pre-
existing knowledge of the grammar that constrains their learning task (White 2014).
In contrast, the skill acquisition theory, which is committed to conscious processing,
claims that development happens from initial representation of knowledge through
proceduralization of knowledge to eventual automatization of knowledge (DeKeyser 2014).
Still different is the Vygotskian sociocultural theory, which views L2 knowledge as socially
distributed, ‘places mediation, either by other or self, at the core of development and use’
and understands development ‘not only in terms of target-like performance but also in
terms of the quality and quantity of external mediation required’ (Lantolf and Thorne 2011:
24).
2. How do we measure language development?
In addition to the theoretical question of how language development takes place, another
important and more practical question that is of interest to teachers, researchers, parents,
and/or clinicians is what stage of language development a particular child or L2 learner is
in, or in other words, how much a child or L2 learner knows about the language system and
its use at a particular point. Measurement of language development is especially important
for children with any delay or disorder in their language development. There are
multiple ways to answer this question, including naturalistic observation; production,
comprehension, and judgment tasks; formal testing; and language sample analysis, among
others. In this section, we focus on how language development can be measured through
analyzing spoken or written language samples produced by a child or L2 learner.
A number of measures of language development have been proposed and explored in the
child language development literature. Some measures are based on verbal output, e.g.
mean length of utterance (MLU) (Brown 1973) and number of different words (NDWs),
while others are based on structural analysis, e.g. Developmental Sentence Scoring (DSS)
(Lee 1974), Index of Productive Syntax (IPSyn) (Scarborough 1990), and Developmental
Level (D-Level) (Rosenberg and Abbeduto 1987; Covington et al. 2006). Both DSS and
IPSyn were developed to evaluate children’s grammatical development, although they work
in different ways. The DSS metric assigns a score to each sentence. It considers eight
different types of grammatical forms, including indefinite pronouns, personal pronouns,
main verbs, secondary or embedded verbs, conjunctions, negatives and two types of
questions. Variants of the same type of grammatical form are scored differently based on
the order in which children develop the ability to use them. The score of a sentence is the
sum of the points for each type plus one point if the sentence is fully grammatical. The
average DSS of a speaker can be computed using a representative language sample. The
IPSyn metric does not apply to individual sentences but examines the number of times 56
target grammatical structures are used in a sample produced by a speaker. These include
various types of noun phrases, verb phrases, questions, and some specific sentence
structures. Each occurrence of any of the target grammatical structures in the language
sample receives one point. However, a maximum of two occurrences of each structure are
counted, and the maximum score a language sample can receive is 112. The D-Level scale
classifies each sentence into one of eight increasingly complex categories (Levels
zero through seven), depending on the syntactic structures it contains. For example, a
sentence containing a finite clause as the object of the main verb is classified as Level three,
and a sentence containing an embedded clause serving as the subject of the main verb is
classified as Level six.
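The scoring rules described above for DSS and IPSyn can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the published scoring procedures: the structure labels (e.g. "N1") and the point values passed to `dss_sentence_score` are hypothetical placeholders, and identifying the grammatical forms in a transcript is assumed to have been done already.

```python
from collections import Counter

IPSYN_STRUCTURES = 56        # number of target structures, as stated in the text
IPSYN_MAX_PER_STRUCTURE = 2  # at most two occurrences of a structure are credited


def ipsyn_total(observed_structures):
    """Sum credited occurrences, capping each structure at two points."""
    counts = Counter(observed_structures)
    return sum(min(n, IPSYN_MAX_PER_STRUCTURE) for n in counts.values())


def dss_sentence_score(points_by_form, fully_grammatical):
    """Sum the points for each grammatical form, plus one sentence point."""
    return sum(points_by_form.values()) + (1 if fully_grammatical else 0)


def mean_dss(sentence_scores):
    """Average DSS over the sentences of a language sample."""
    return sum(sentence_scores) / len(sentence_scores)


# Hypothetical sample: structure "N1" occurs three times but earns two points.
sample = ["N1", "N1", "N1", "V2"]
total = ipsyn_total(sample)
```

The cap explains the ceiling mentioned above: 56 structures with at most two credited occurrences each give a maximum score of 112.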
In the L2 development literature, a large number of developmental index studies have
attempted to identify objective measures of complexity, accuracy, and fluency (CAF) of
production that can be used to index the learner’s level of development or overall
proficiency in the target language. This is generally achieved by assessing the development
of L2 learners at known proficiency levels in the target language using various measures.
Developmental measures identified in such a way allow teachers and researchers to
evaluate and describe the learner’s developmental level in a more precise way. In addition,
they can also be used to examine the effect of a particular pedagogical treatment on
language use. Wolfe-Quintero et al. (1998) provided a comprehensive review of the
measures explored in 39 second and foreign language writing studies and recommended
several measures that were consistently linear and significantly related to program or school
levels as the best measures of development or error. These include three measures of
fluency, i.e. mean length of T-unit, where a T-unit is a main clause plus any subordinate
clauses (Hunt 1965), mean length of clause, and mean length of error-free T-unit; two
measures of accuracy, i.e. error-free T-units per T-unit, and errors per T-unit; two measures
of grammatical complexity, i.e. clauses per T-unit, and dependent clauses per clause; and
two measures of lexical complexity, i.e. total number of word types divided by the square
root of twice the total number of word tokens, and total number of sophisticated word types
divided by total number of word types. In terms of syntactic complexity, recent research
has argued for the need to focus on more fine-grained measures (Kyle 2016), measures of
phrasal complexity, as well as co-occurrence patterns of lexico-grammatical features (e.g.
Biber et al. 2016). In a similar spirit, Hawkins and Buttery (2010) proposed and illustrated
the identification of a systematic set of criterial features for each proficiency level in the
Common European Framework of Reference for Languages (CEFR). These features
captured the emergence, frequency, accuracy, and usage distribution of relevant linguistic
properties (e.g. verb co-occurrence frames and relative clauses) characterizing each
proficiency level. They further argued for the need to consider L1-specific features, a view
supported by later research demonstrating the influence of L1 on the acquisition order
and/or usage of different linguistic properties (e.g. Lu and Ai 2015; Murakami and
Alexopoulou 2016).
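The measures recommended by Wolfe-Quintero et al. (1998) are all simple ratios once the underlying units have been counted. The following Python sketch computes them from raw counts; the function and parameter names are hypothetical, and the counts themselves (T-units, clauses, errors, sophisticated word types) are assumed to come from prior manual or automatic annotation.

```python
import math


def developmental_indices(words, t_units, clauses, dependent_clauses,
                          error_free_t_units, words_in_error_free_t_units,
                          errors, word_types, sophisticated_types):
    """Return the fluency, accuracy, and complexity ratios described above."""
    return {
        # Fluency
        "mean_length_of_t_unit": words / t_units,
        "mean_length_of_clause": words / clauses,
        "mean_length_of_error_free_t_unit":
            words_in_error_free_t_units / error_free_t_units,
        # Accuracy
        "error_free_t_units_per_t_unit": error_free_t_units / t_units,
        "errors_per_t_unit": errors / t_units,
        # Grammatical complexity
        "clauses_per_t_unit": clauses / t_units,
        "dependent_clauses_per_clause": dependent_clauses / clauses,
        # Lexical complexity
        "types_per_sqrt_two_tokens": word_types / math.sqrt(2 * words),
        "sophisticated_types_per_type": sophisticated_types / word_types,
    }


# Hypothetical counts for a single essay.
indices = developmental_indices(
    words=200, t_units=10, clauses=20, dependent_clauses=8,
    error_free_t_units=6, words_in_error_free_t_units=90,
    errors=7, word_types=100, sophisticated_types=15,
)
```

With these invented counts, the mean length of T-unit is 20 words and the type-based lexical measure is 100 divided by the square root of 400, i.e. 5.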
3. How can we use a corpus to find out more about first language development?
In this section, we discuss several ways in which a corpus of child language development
data may be used to find out more about L1 development. Some of these will be illustrated
using the following corpora and corpus analysis software: the Child Language Data
Exchange System (CHILDES) database, the Computerized Language Analysis (CLAN)
program (MacWhinney 2000), Computerized Profiling (Long 2019), and D-Level Analyzer
(Lu 2009). We briefly introduce each of these first.
The CHILDES database contains transcripts and media data collected from conversations
between young children of different ages and their parents, playmates, and caretakers.
These data are contributed by researchers from many different countries, following the
same data collection and transcription standards. Each file in the database contains a
transcript of a conversation and includes a header that encodes information about the target
child or children (e.g. age, native language, whether the child is typically developing in terms of
language development, etc.), other participants, the location and situation of the
conversation, the activities that are going on during the conversation, and the researchers
and coders collecting and transcribing the data. The conversation is transcribed in a one-
utterance-per-line format, with the producer of each utterance clearly marked in a prefix.
Each utterance is followed by another line that consists of a morphological analysis of the
utterance. Any physical actions accompanying the utterance are also provided in a separate
line. The CLAN program is a collection of computational tools designed to automatically
analyze data transcribed in the CHILDES format. Some of the automatic analyses that the
program can run on one or more files in the CHILDES database include word frequency,
type/token ratio, a measure of vocabulary diversity called D (Durán et al. 2004), mean
length of turn, mean length of utterance, and DSS score, among others.
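To make the transcript layout concrete, the sketch below parses a tiny invented transcript written in the one-utterance-per-line style described above and computes two of the simpler statistics. It is not a substitute for CLAN: the sample text is made up, the prefixes (`*CHI:` for a speaker line, `%mor:` for the morphological line, `@` for header lines) follow the CHAT convention as we understand it, and real transcripts contain additional markup that this code ignores. MLU is computed here in words rather than morphemes for simplicity.

```python
SAMPLE = (
    "@Participants:\tCHI Target_Child, MOT Mother\n"
    "*MOT:\twhat is that ?\n"
    "*CHI:\tdoggie .\n"
    "%mor:\tn|doggie .\n"
    "*CHI:\tdoggie go home .\n"
    "%mor:\tn|doggie v|go n|home .\n"
)

PUNCTUATION = {".", "?", "!"}


def parse_chat(text):
    """Collect utterances with their speaker and any dependent lines."""
    utterances = []
    for line in text.splitlines():
        if line.startswith("*"):
            speaker, _, content = line[1:].partition(":")
            words = [w for w in content.split() if w not in PUNCTUATION]
            utterances.append({"speaker": speaker, "words": words, "tiers": {}})
        elif line.startswith("%") and utterances:
            # A dependent line belongs to the utterance just above it.
            name, _, content = line[1:].partition(":")
            utterances[-1]["tiers"][name] = content.strip()
    return utterances


def mlu_in_words(utterances, speaker):
    """Mean length of utterance, counted in words, for one speaker."""
    lengths = [len(u["words"]) for u in utterances if u["speaker"] == speaker]
    return sum(lengths) / len(lengths)


def type_token_ratio(utterances, speaker):
    """Number of distinct word forms divided by number of word tokens."""
    tokens = [w.lower() for u in utterances
              if u["speaker"] == speaker for w in u["words"]]
    return len(set(tokens)) / len(tokens)


utterances = parse_chat(SAMPLE)
```

For the invented child utterances above, `mlu_in_words(utterances, "CHI")` averages one-word and three-word utterances, and `type_token_ratio(utterances, "CHI")` divides three word types by four tokens.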
Computerized Profiling is a set of programs designed to analyze both written language
samples and phonetically transcribed spoken language samples. Linguistic analysis at a
range of different levels can be performed, including simple corpus statistics, semantics,
grammar, phonology, pragmatics, and narratives. For example, at the grammar level, the
following four procedures can be run: IPSyn, DSS, Black English Sentence Scoring (BESS)
(Nelson 1998), which is an adaptation of DSS for use with speakers of African American
Vernacular English, and the Language Assessment, Remediation, and Screening Procedure
(LARSP) (Crystal et al. 1989), a system for profiling the syntactic and discourse
development of children that is related to both age and stage.
The D-Level Analyzer is a computer program designed to automate the measurement of
syntactic complexity using the revised D-Level scale (Covington et al. 2006). Given a raw
sentence as input, the analyzer assigns it to an appropriate level on the scale. The program
achieves an accuracy of 93.2 per cent on spoken child language acquisition data from the
CHILDES database.
First of all, a corpus can be used to describe the characteristics of language produced by
children in different age groups or different stages of development. Children may exhibit a
considerable amount of variability in terms of language development. However, it is useful
to understand the average capability as well as the range of capabilities exhibited by
children within the same age group. Researchers generally agree that there are certain
milestones in child language development, or approximate ages at which specific language
capabilities usually emerge or mature. For example, at approximately 12 months of age,
words start to emerge; at approximately 24 months of age, children possess more than 50
vocabulary items and begin to spontaneously join these items into self-created two-word
phrases; and at approximately 30 months of age, children produce utterances with at least
two words, and many with three or even five words (Lenneberg 1967). A large corpus
consisting of language samples produced by children of different age groups can be used to
complement or confirm naturalistic observation for establishing or revisiting such
milestones. The CHILDES database constitutes a good example of such a corpus. Given a
set of data that consists of transcripts of conversations involving targeted groups of children
in a particular age group, e.g. eighteen months, it is possible to use CLAN, Computerized
Profiling, and D-Level Analyzer to find out the average as well as the range of different
types of developmental metrics of interest exhibited by all the children in the group.
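Once a metric has been computed for each child, describing the average and the range within an age group is a matter of simple aggregation. The Python sketch below assumes the per-child scores are already available as (age in months, score) pairs; the data are invented for illustration and do not come from CHILDES.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_age(records):
    """Group (age_in_months, score) pairs and report mean and range per age."""
    groups = defaultdict(list)
    for age_months, score in records:
        groups[age_months].append(score)
    return {
        age: {
            "n": len(scores),
            "mean": mean(scores),
            "min": min(scores),
            "max": max(scores),
        }
        for age, scores in sorted(groups.items())
    }


# Hypothetical MLU values for children aged 18 and 24 months.
records = [(18, 1.0), (18, 2.0), (18, 3.0), (24, 2.5), (24, 3.5)]
summary = summarize_by_age(records)
```

The same function can be applied to any of the metrics discussed in this section, since it makes no assumption about what the score represents.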
Second, a corpus can be used to investigate the sequence or order in which children acquire
different aspects of the system of their native language as well as to track the development
of individual children over time. This type of investigation necessitates a corpus of
longitudinal data, i.e. data collected from the same child or group of children over an
extended period of time, e.g. one to five years. An early example of this type of research is
Ramer (1977), who conducted a longitudinal study to investigate the developmental
sequence of syntactic acquisition in seven children. Specifically, she aimed to find out
whether there is “a universal sequence of emergence of grammatical relations leading up to
the production of S + V + O constructions” (p. 144). She analyzed her corpus data using a
hypothesized simplicity-complexity dimension based on the number of grammatical
relations produced and their expansions. She reported that the sequence of acquisition
specified in the hypothesized dimension was observed in the data from all seven children.
In a recent study, Khaghaninejad et al. (2018) analyzed a corpus in the CHILDES database
containing the utterances produced by five L1 Farsi Iranian children over a year to examine
the order of acquisition of Farsi consonants. Their analysis generated a ‘timeline’ for L1
Farsi children to acquire idealized articulations of different consonants.
Third, a corpus can be used to assess the validity and adequacy of the various metrics
proposed for measuring child language development. This is an important enterprise as
such measures are often used for evaluating the level of language development of children
with developmental delays or disorders. One of the ways to approach this problem is
closely related to the descriptive and longitudinal research discussed above. Since these
metrics were proposed to measure language development, many of them were based on
observation of child language acquisition. Given a particular measure, it is sensible to
evaluate whether it reflects the developmental sequence or significantly differentiates the
developmental levels of children in different age groups. An example of this type of
research is Lu (2009), who analyzed data from the CHILDES database using the D-Level
Analyzer and reported a correlation of .648 (p < .001) between average D-Level scores and
speaker age as well as significant between-age differences in average D-Level scores. A
second way to approach this problem is to examine whether a proposed measure
significantly differentiates between the developmental levels of children with and without
developmental disorders within the same age group. A good example of this line of
research is Hewitt et al. (2005). They compared scores of kindergarten children with a
mean age of six years with and without specific language impairment (SLI) on three
commonly used measures, i.e. MLU in morphemes, IPSyn, and NDWs. They found that
children with SLI showed significantly lower mean scores for all of the three measures,
except for some subtests of the IPSyn. In relation to this line of research, a corpus can also
be used to provide normative information for valid and adequate measures. To improve the
feasibility of applying these measures in practical situations and to enable researchers and
clinicians to make sense of the analytical results using these measures, it is necessary to
have normative information for different age groups for benchmarking purposes. The
CHILDES database could again be used for providing such normative information.
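One way to check whether a measure tracks development, as in the correlation between D-Level scores and age reported above, is to compute a Pearson correlation coefficient between speaker age and the measure. The sketch below implements the standard formula directly; the ages and scores are invented, and testing the coefficient for significance is left out.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: age in months and an average complexity score.
ages = [24, 30, 36, 42, 48]
scores = [0.5, 0.9, 1.1, 1.8, 2.0]
r = pearson_r(ages, scores)
```

A coefficient close to 1 would indicate that the measure rises steadily with age in the sample; it says nothing by itself about whether the measure is clinically useful.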
Finally, a corpus can also be used to gain in-depth understanding of language development
disorders. Through comprehensive contrastive analyses, it is possible to qualitatively and
quantitatively describe the developmental differences between children with and without
language disorders, e.g. in terms of vocabulary size and range of syntactic structures. In
addition, longitudinal data can also be used to investigate the effect of a particular
therapeutic intervention. Early interventions play a critical role in optimizing the
developmental trajectory of children with language disorders during the best window of
opportunity (Pence and Justice 2017). By analyzing language samples produced before and
after a particular intervention, it is possible to evaluate whether targeted changes have
systematically occurred in a statistically significant way.
4. How can we use a corpus to find out more about second language development?
In this section, we discuss a number of ways that a corpus of learner language can be used
to find out more about L2 development. The Longitudinal Database of Learner English
(LONGDALE) (Meunier 2016) constitutes an excellent example of such a resource.
LONGDALE contains data from English learners from diverse L1 backgrounds, with all
learners contributing data at least once a year for three or more years. Various types of
spoken, written, and experimental data are included. The database also includes
comprehensive information about the learners and tasks, such as age, gender, language
background, proficiency level, and task type, among others. The International Corpus of
Learner English (ICLE; Version 2) (Granger et al. 2009), while initially designed for
comparing learner English among learners from different L1 backgrounds as well as against
L1 English, has good potential for L2 development research as well. This corpus contains
3.7 million words of academic writing, mostly argumentative, by intermediate to advanced
learners of English as a foreign language, mostly university students, representing 16
different mother tongue backgrounds. The following learner variables are recorded for each
written text: age, learning context, proficiency level, gender, mother tongue, region,
knowledge of other foreign languages, and L2 exposure. These variables allow for
cross-sectional or quasi-longitudinal analysis that can offer useful insight into learners’ L2
development (Meunier 2015; see also chapter 23).
Various corpus processing tools can be used to analyze learner corpora in the different
ways to be discussed below (e.g. Lu 2014; 2017). For example, Coh-Metrix (McNamara et
al. 2014) can be used to assess the coherence and cohesion of language samples using a
large set of linguistic features. The Biber Tagger (Biber 1988) can be used to analyze a
large number of lexico-grammatical features of language samples. The Lexical Complexity
Analyzer (LCA) (Lu 2012) and the Tool for the Automatic Analysis of Lexical
Sophistication (TAALES) (Kyle et al. 2018) can be used to assess the lexical density,
lexical diversity, and lexical sophistication of learner texts using a large number of metrics.
The L2 Syntactic Complexity Analyzer (L2SCA) (Lu 2010) and the Tool for the Automatic
Analysis of Syntactic Sophistication and Complexity (TAASSC) (Kyle 2016) are both
designed for L2 writing syntactic complexity analysis. In addition, computational systems
for automated grammatical error detection in learner writing are emerging (Leacock et al.
2014), such as Criterion Online Writing Evaluation Service developed by Educational
Testing Service (ETS) (available at www.ets.org/criterion), Cambridge English Write and
Improve (available at https://writeandimprove.com), and the Grammar and Mechanics
Error Tool (GAMET) (Crossley et al. 2019).
The first way a corpus can be used to reveal L2 development is as a database for describing
the characteristics of the interlanguage of learners at known proficiency levels. To this end,
it is necessary to have a learner corpus that encodes information about the learners’
proficiency levels. Proficiency level can be conceptualized in a number of different ways,
e.g. classroom grades, holistic ratings, program levels, school levels, and standardized test
scores (Wolfe-Quintero et al. 1998). The CEFR has also been increasingly used as a
calibration for proficiency level within learner corpora, such as the Cambridge Learner
Corpus (CLC), which is comprised of data from the Cambridge English Language
Assessment (Barker et al. 2015), and the EF-Cambridge Open Language Database
(EFCAMDAT), which consists of written samples from over 174,000 adult learners of
English as a second language (ESL) across the world (Huang et al. 2018). Linking to the
same framework of proficiency makes the results from different data sources more
comparable. Nevertheless, it should be noted that the use of any type of calibration for
proficiency, be it age, year of schooling, or the CEFR, is a workaround for not being able to
obtain large amounts of genuine longitudinal data (e.g. Meunier 2015), which, if available,
would be the preferred data source for investigating L2 development. In terms of analysis,
one may choose to focus on a particular aspect of the interlanguage, for example, the
degree to which informal, colloquial patterns or styles are used in formal, written language.
One may also attempt to provide a comprehensive description of the lexico-grammatical
system of the interlanguage. For example, in the English Grammar Profile project,
O’Keeffe and Mark (2017) examined the patterns of grammatical development across the
six levels of the CEFR using the CLC. Their project resulted in a database of over 1,200
empirically-derived statements that can be used to characterize the grammatical
competence of English learners at different CEFR levels.
This type of descriptive study can benefit both from error analysis and from contrastive
analysis of learner data and L1 speaker data. To conduct an error analysis, it is necessary to
first design an error annotation scheme, which should be consistently followed in
identifying and annotating errors in learner text. An early example of an error annotation
scheme can be found in Granger (2003), which assigns each error first to one of the
following nine major domains: form, morphology, grammar, lexis, syntax, register, style,
punctuation, and typo, and then to a specific category within the domain. Lüdeling and
Hirschmann (2015) offer a systematic review of issues surrounding error annotation and
existing error annotation systems in learner corpus research. An error-annotated learner
corpus enables one to easily identify the common errors that learners at a given proficiency
level tend to make.
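As an illustration of how an error-annotated corpus supports this kind of query, the Python sketch below tallies errors by domain for learners at one proficiency level. The nine domain names are taken from the scheme described above; the annotation records, their field names, and the level labels are hypothetical and do not reflect the format of any particular corpus.

```python
from collections import Counter

DOMAINS = ("form", "morphology", "grammar", "lexis", "syntax",
           "register", "style", "punctuation", "typo")


def tally_errors(annotations, level):
    """Count errors per domain for one proficiency level, most frequent first."""
    counts = Counter()
    for annotation in annotations:
        if annotation["level"] != level:
            continue
        if annotation["domain"] not in DOMAINS:
            raise ValueError(f"unknown domain: {annotation['domain']}")
        counts[annotation["domain"]] += 1
    return counts.most_common()


# Hypothetical annotations; "level" uses invented CEFR-style labels.
annotations = [
    {"level": "B1", "domain": "grammar"},
    {"level": "B1", "domain": "grammar"},
    {"level": "B1", "domain": "lexis"},
    {"level": "C1", "domain": "register"},
]
b1_errors = tally_errors(annotations, "B1")
```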
A contrastive study of learner data and L1 speaker data helps us to look at the
characteristics of the interlanguage from a different perspective, in particular, how it
converges to or differs from L1 speaker usage. For example, one may assess whether
learners tend to overuse or underuse certain words, phrases, collocations, grammatical
constructions, speech acts, etc. relative to L1 speakers (Granger 1998; De Cock 2000). It is
important, however, to ensure that the learner data and the L1 speaker data are of
comparable nature in terms of mode, genre, and field, etc. The Trinity Lancaster Corpus,
which contains 4.2 million words of interaction between English learners and L1 speakers
(Gablasova et al. 2019), constitutes an excellent source of data for this purpose. Importantly,
however, it should be noted that recent work in learner corpus research is starting to
move away from using L1 speaker data as a norm for comparing learner data and to focus on
L2 competence as an entity for analysis in its own right (e.g. Granger 2015).
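Overuse and underuse are usually assessed by comparing normalized frequencies across the two corpora, since raw counts are not comparable when corpus sizes differ. The sketch below normalizes counts per million words and, as one common choice that the text above does not itself prescribe, computes the log-likelihood statistic widely used in corpus linguistics for two-corpus frequency comparison. All counts are invented.

```python
import math


def per_million(freq, corpus_size):
    """Normalize a raw frequency to occurrences per million words."""
    return freq * 1_000_000 / corpus_size


def log_likelihood(freq_learner, size_learner, freq_reference, size_reference):
    """Log-likelihood for one item's frequency in two corpora."""
    total_freq = freq_learner + freq_reference
    total_size = size_learner + size_reference
    expected_learner = size_learner * total_freq / total_size
    expected_reference = size_reference * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq_learner, expected_learner),
                               (freq_reference, expected_reference)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def usage_direction(freq_learner, size_learner, freq_reference, size_reference):
    """Label the learner corpus as overusing or underusing the item."""
    learner = per_million(freq_learner, size_learner)
    reference = per_million(freq_reference, size_reference)
    if learner > reference:
        return "overuse"
    if learner < reference:
        return "underuse"
    return "same"


# Hypothetical counts of one phrase in a learner and a reference corpus.
ll = log_likelihood(30, 1000, 20, 2000)
direction = usage_direction(30, 1000, 20, 2000)
```

A log-likelihood value above 3.84 corresponds to p < 0.05 with one degree of freedom; the direction label indicates which corpus uses the item relatively more often.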
Second, a corpus may be used in developmental index studies to identify objective metrics
that can be used to index levels of L2 development or the learner’s overall language
proficiency. Earlier studies of CAF differences between different proficiency levels
contained substantial variability in terms of choice and definition of measures, writing task
used, sample size, corpus length, timing condition, etc., making it challenging to compare
the results reported (Wolfe-Quintero et al. 1998), as these factors have been found to affect
the CAF of learner language (e.g. Alexopoulou et al. 2017; Hsu 2019). To eliminate such
inconsistency and variability, recent research has evaluated or compared large sets of
measures on the same learner corpus or corpora. For example, Lu (2011) used L2SCA to
analyze large-scale L2 writing data from the Written English Corpus of Chinese Learners
(WECCL) (Wen et al. 2005). The corpus is a collection of over 3,000 essays written by
English majors in nine different colleges in China. Each essay in the corpus is annotated
with a header that includes the following information: mode (written or spoken), genre
(argumentation, narration, or exposition), school level (first, second, third, or fourth year in
college), year of admission (2000, 2001, 2002, or 2003), timing condition (timed with a 40-
minute limit or untimed), institution (a two- to four-letter code), and length (number of
words in the essay). Students in the same school level within the same institution wrote on
the same topics, but topics varied from institution to institution. Given the information that
is available in the corpus, proficiency level is conceptualized using school level. Through
the analysis, this study provided useful insights into how different syntactic complexity
measures perform as indices of college-level L2 writers’ language development, how they
relate to each other, and how their performances are affected by external factors.
Third, a corpus can be used to examine the contributions of knowledge of the L1 as well as
the effect of L1 transfer. On the one hand, knowledge of the L1 may prove helpful in
learning certain aspects of the L2, and learners with different L1 backgrounds may show
strengths in learning different aspects of the L2. On the other hand, the intrusion of L1 may
result in difficulty in acquiring certain lexico-grammatical aspects of the L2 and prevalence
of certain forms or grammatical patterns that deviate from the target language in the
interlanguage. Consequently, the interlanguages of learners at the same proficiency level
but with different L1 backgrounds may demonstrate some significantly different
characteristics. A contrastive study of such interlanguages may provide evidence of L1
influence, either positive or negative, on learner development and output (e.g. Granger et al.
2015; Murakami and Alexopoulou 2016). The ICLE corpus constitutes an excellent source
of data for this type of research, as students with diverse L1 backgrounds are represented. A
contrastive study of a learner’s L1 and interlanguage will provide further evidence on the
L1 influence. One example of this type of research is Lu and Ai (2015), who analyzed data
from the ICLE and the Louvain Corpus of Native English Essays (LOCNESS) (Granger
1996) with L2SCA to examine differences in the syntactic complexity in English writing
among college-level writers with eight different L1 backgrounds, including seven L2
groups and one L1 group. They reported that the seven L2 groups demonstrated drastically
different patterns of difference from the L1 group.
Fourth, longitudinal learner corpora can be used to examine the trajectories and patterns of
learner development and to provide evidence to validate or challenge the claims and
assumptions of different theories of or approaches to L2 development. For example, the
complex dynamic systems approach views language development as a dynamic process
characterized by changing patterns of variability and interactions among different
subsystems of the language (Verspoor et al. 2011). Multiple longitudinal studies from this
approach have reported evidence that L2 developmental trajectories as well as the patterns
of interaction among different CAF features are highly variable and that such variability
follows the principles of dynamic systems (e.g. Larsen-Freeman 2006; Caspi 2010). Usage-
based approaches to L2 development take the position that L2 learning is achieved by
learning constructions, understood as conventionalized form-meaning mappings at varied
levels of complexity that are entrenched as language knowledge in the speakers’ minds
(Goldberg 1995). Research within this framework posits that language acquisition is shaped
by exposure to and usage of language and has reported that a learner’s repertoire of
constructions starts with fixed sequences and becomes increasingly complex and
productive (Ellis et al. 2016). For example, Römer (2019) analyzed verb-argument
constructions (VACs) in a large-scale corpus of written texts produced by L2 learners at
varied levels of English proficiency and found that the learners’ inventory of VACs
developed from fixed sequences to more diverse, productive, and complex ones.
Finally, a corpus may be used to examine the role of instruction or the effect of a particular
pedagogical intervention on language development. For example, by examining corpus data
from different groups of learners at the same school level or program level who are exposed
to different instructional methods, materials, or linguistic environments, we may better
understand whether differences in instruction result in differences in L2 development. In
addition, by comparing learners’ production before and after a period of targeted
pedagogical intervention, we may assess whether the intervention is effective in helping
learners acquire particular aspects of the L2.
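A pre/post comparison of this kind can be sketched as a paired analysis of one measure computed on each learner's production before and after the intervention. The example below is a minimal illustration under stated assumptions: the helper `paired_t` is hypothetical, the scores are invented, and the measure is simply labeled mean length of T-unit for concreteness.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(pre, post):
    """Return (mean gain, paired t statistic) for per-learner pre/post scores."""
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("need two equal-length lists with at least two learners")
    gains = [after - before for before, after in zip(pre, post)]
    sd = stdev(gains)  # sample standard deviation of the gains
    if sd == 0:
        raise ValueError("all gains are identical; t is undefined")
    return mean(gains), mean(gains) / (sd / sqrt(len(gains)))


# Hypothetical mean-length-of-T-unit scores for five learners (invented data).
pre = [10.0, 11.0, 9.0, 12.0, 10.0]
post = [12.0, 12.0, 10.0, 15.0, 11.0]

gain, t = paired_t(pre, post)
print(f"mean gain = {gain:.2f}, t = {t:.2f}, df = {len(pre) - 1}")
```

The t statistic still has to be compared against a t distribution with n - 1 degrees of freedom. A design without a comparison group cannot by itself separate the effect of the intervention from ordinary development over the same period.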
5. Looking to the future
As a field, corpus-based language development research will benefit tremendously from the
following future developments. First, language samples produced by children and L2
learners often contain many errors and as such present a challenge to natural language
processing (NLP) technology, especially when it comes to measures that involve syntactic,
semantic, and discourse analysis. Therefore, continued enhancement of existing NLP
technology and development of robust new NLP technology will facilitate more accurate
and reliable automatic analysis of language samples using more diversified measures. A
second avenue for future development in the field lies in the systematic collection and
sharing of large-scale child and L2 development data that encodes richer information about
the children or learners producing the data. For child language development research, large-
scale longitudinal data and data of children with language disorders are particularly
valuable. The Growth in Grammar Corpus (Durrant et al. forthcoming), a large collection
of texts written by school children in England as part of their school work, constitutes an
excellent example of this avenue of development. For L2 development research,
systematic annotation of the learner’s proficiency level using as many conceptualizations
as possible will prove especially useful. These include
school levels, program levels, standardized test scores, holistic ratings, classroom grades,
etc. Large-scale data with richer information will make it easier to draw more reliable
conclusions for many of the types of research discussed above. Finally, analysis of L2
development data will benefit from consistent, standardized error annotation schemes
as well as improved automatic error detection techniques. L2 development researchers
have often devised their own annotation schemes for error analysis, which makes
comparison and sharing of research results problematic; a more consistent annotation
scheme would benefit the field as a whole. There has also been an
increasing stream of research in automatic error detection and correction (Crossley et al.
2019; Leacock et al. 2014). The maturation of such techniques will facilitate automatic error
analysis of large-scale L2 development data and enable researchers to gain more reliable
insights into L2 use.
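To make the idea of richer learner metadata and shared error annotation concrete, the following sketch shows one possible record structure. It is an illustration only: the class names, field names, and error category labels are hypothetical and do not correspond to any existing corpus or annotation scheme.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ErrorTag:
    start: int      # character offset where the error begins
    end: int        # character offset just past the error
    category: str   # label drawn from whatever shared scheme is adopted


@dataclass
class LearnerText:
    text: str
    l1: str
    # Several conceptualizations of proficiency, each optional.
    school_level: Optional[str] = None
    program_level: Optional[str] = None
    test_score: Optional[float] = None
    holistic_rating: Optional[float] = None
    classroom_grade: Optional[str] = None
    errors: List[ErrorTag] = field(default_factory=list)

    def errors_per_100_words(self) -> float:
        """Normalize the error count by whitespace-delimited text length."""
        words = len(self.text.split())
        return 100 * len(self.errors) / words if words else 0.0


# Invented example sentence and annotations.
sample = LearnerText(
    text="He go to school yesterday and buy two book",
    l1="Mandarin",
    program_level="intermediate",
    holistic_rating=3.0,
    errors=[
        ErrorTag(3, 5, "verb-tense"),
        ErrorTag(30, 33, "verb-tense"),
        ErrorTag(38, 42, "noun-number"),
    ],
)
print(round(sample.errors_per_100_words(), 1))
```

Storing each proficiency conceptualization in its own field lets different studies group the same texts by school level, test score, or holistic rating without re-annotating the data.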
6. Further reading
Atkinson, D. (ed.) (2011) Alternative Approaches to Second Language Acquisition.
Abingdon: Routledge. (This edited volume presents a comprehensive introduction to and
comparison of six non-cognitivist approaches to second language acquisition.)
Lu, X. (2014) Computational Methods for Corpus Annotation and Analysis. Dordrecht:
Springer. (This book provides a systematic and accessible introduction to diverse types of
computational tools that can be used for automatic or computer-assisted annotation and
analysis of text corpora at various linguistic levels).
MacWhinney, B. (2000) The CHILDES Project: Tools for Analyzing Talk, 3rd edn.
Mahwah: Lawrence Erlbaum Associates. (This book provides hands-on instruction on how
to transcribe naturalistic child language development data following the CHILDES format
and automatically analyze such data using CLAN. Readers are introduced to a set of
computational tools designed to improve the readability of transcripts, to automate the data
analysis process, and to facilitate the sharing of transcribed data).
Pence, L.K. and Justice, L.M. (2017) Language Development from Theory to Practice, 3rd
edn. New York: Pearson. (This book provides an extremely accessible introduction to the
theory and practice of child language development. The material presented in the book is
also highly relevant to clinical, educational, and research settings).
VanPatten, B. and Williams, J. (eds) (2014) Theories in Second Language Acquisition: An
Introduction, 2nd edn. New York: Routledge. (This edited volume presents a
comprehensive introduction to early and contemporary theories in second language
acquisition. It provides an excellent overview of each of these compelling theories).
References
Alexopoulou, T., Michel, M., Murakami, A. and Meurers, D. (2017) ‘Task Effects on
Linguistic Complexity and Accuracy: A Large-Scale Learner Corpus Analysis
Employing Natural Language Processing Techniques’, Language Learning 67(S1):
180–208.
Atkinson, D. (ed.) (2011) Alternative Approaches to Second Language Acquisition.
Abingdon: Routledge.
Barker, F., Salamoura, A. and Saville, N. (2015) ‘Learner Corpora and Language Testing’,
in S. Granger, G. Gilquin and F. Meunier (eds) The Cambridge Handbook of Learner
Corpus Research, Cambridge: Cambridge University Press, pp. 511–33.
Biber, D. (1988) Variation Across Speech and Writing. Cambridge: Cambridge University
Press.
Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. (1999) Longman Grammar
of Spoken and Written English. New York: Longman.
Biber, D., Gray, B. and Staples, S. (2016) ‘Predicting Patterns of Grammatical Complexity
Across Language Exam Task Types and Proficiency Levels’, Applied Linguistics 37(5):
639–68.
Brown, R. (1973) A First Language. Cambridge: Harvard University Press.
Caspi, T. (2010) A Dynamic Perspective on Second Language Development. PhD
dissertation, University of Groningen, Groningen, Netherlands.
Covington, M.A., He, C., Brown, C., Naçi, L. and Brown, J. (2006) How Complex is that
Sentence? A Proposed Revision of the Rosenberg and Abbeduto D-Level Scale. Atlanta:
The University of Georgia, Artificial Intelligence Center.
Crossley, S. A., Bradfield, F. and Bustamante, A. (2019) ‘Using Human Judgments to
Examine the Validity of Automated Grammar, Syntax, and Mechanical Errors in
Writing’, Journal of Writing Research 11(2): 251–70.
Crystal, D., Fletcher, P. and Garman, M. (1989) Grammatical Analysis of Language
Disability, 2nd edn. London: Cole & Whurr.
De Cock, S. (2000) ‘Repetitive Phrasal Chunkiness and Advanced EFL Speech and
Writing’, in C. Mair and M. Hundt (eds) Corpus Linguistics and Linguistic Theory,
Amsterdam: Rodopi, pp. 51–68.
DeKeyser, R. (2007) ‘Skill Acquisition Theory’, in B. VanPatten and J. Williams (eds)
Theories in Second Language Acquisition: An Introduction, Mahwah: Lawrence
Erlbaum Associates, pp. 97–114.
Durán, P., Malvern, D., Richards, B. and Chipere, N. (2004) ‘Developmental Trends in
Lexical Diversity’, Applied Linguistics 25(2): 220–42.
Durrant, P., Brenchley, M. and McCallum, L. (forthcoming) Understanding Development
and Proficiency in Writing: Quantitative Corpus Linguistic Approaches. Cambridge:
Cambridge University Press.
Ellis, N.C., Römer, U. and O’Donnell, M.B. (2016) Usage-based Approaches to Language
Acquisition and Processing: Cognitive and Corpus Investigations of Construction
Grammar. Hoboken: Wiley-Blackwell.
Gablasova, D., Brezina, V. and McEnery, T. (2019) ‘The Trinity Lancaster Corpus:
Development, description and application’, International Journal of Learner Corpus
Research 5(2): 126–58.
Goldberg, A.E. (1995) Constructions. A Construction Grammar Approach to Argument
Structure. Chicago: University of Chicago Press.
Granger, S. (1996) ‘From CA to CIA and Back: An integrated approach to computerized
bilingual and learner corpora’, in K. Aijmer, B. Altenberg and M. Johansson (eds)
Languages in Contrast: Paper from a Symposium on Text-based Cross-linguistic
Studies. Lund Studies in English, Vol. 88, Lund: Lund University Press, pp. 37–51.
Granger, S. (ed.) (1998) Learner English on Computer. Boston: Addison Wesley Longman.
Granger, S. (2003) ‘Error-Tagged Learner Corpora and CALL: A promising synergy’,
CALICO Journal 20(3): 465–80.
Granger, S. (2015) ‘Contrastive Interlanguage Analysis: A reappraisal’, International
Journal of Learner Corpus Research 1(1): 7–24.
Granger, S., Dagneaux, E., Meunier, F. and Paquot, M. (2009) International Corpus of
Learner English Version 2. Louvain: Presses Universitaires de Louvain.
Granger, S., Gilquin, G. and Meunier, F. (eds) (2015) The Cambridge Handbook of
Learner Corpus Research. Cambridge: Cambridge University Press.
Hawkins, J. and Buttery, P. (2010) ‘Criterial Features in Learner Corpora: Theory and
illustrations’, English Profile Journal 1: 1–23.
Hewitt, L.E., Hammer, C.S., Yont, K.M. and Tomblin, J.B. (2005) ‘Language Sampling
for Kindergarten Children With and Without SLI: Mean length of utterance, IPSYN,
and NDW’, Journal of Communication Disorders 38(3): 197–213.
Hsu, H.-C. (2019) ‘The Combined Effect of Task Repetition and Post-Task Transcribing on
L2 Speaking Complexity, Accuracy, and Fluency’, Language Learning Journal 47(2):
172–87.
Huang, Y., Murakami, A., Alexopoulou, T. and Korhonen, A. (2018) ‘Dependency Parsing
of Learner English’, International Journal of Corpus Linguistics 23(1): 28–54.
Hunt, K.W. (1965) Grammatical Structures Written at Three Grade Levels. Urbana:
National Council of Teachers of English.
Khaghaninejad, M.S., Moloodi, A. and Saadi, R.F. (2018) ‘A Timeline for Acquisition of
Farsi Consonants: A first language acquisition corpus-based analysis’, Theory and
Practice in Language Studies 8(12): 1711–24.
Kyle, K. (2016) Measuring Syntactic Development in L2 Writing: Fine Grained Indices of
Syntactic Complexity and Usage-based Indices of Syntactic Sophistication. PhD
Dissertation, Georgia State University, Atlanta.
Kyle, K., Crossley, S. and Berger, C. (2018) ‘The Tool for the Automatic Analysis of
Lexical Sophistication (TAALES): Version 2.0’, Behavior Research Methods 50(3):
1030–46.
Lantolf, J.P. and Thorne, S.L. (2011) ‘The Sociocultural Approach to Second Language
Acquisition’, in D. Atkinson (ed.) Alternative Approaches to Second Language
Acquisition, Abingdon: Routledge, pp. 24–47.
Larsen-Freeman, D. (2006) ‘The Emergence of Complexity, Fluency, and Accuracy in the
Oral and Written Production of Five Chinese Learners of English’, Applied Linguistics
27(4): 590–619.
Leacock, C., Chodorow, M., Gamon, M. and Tetreault, J. (2014) Automated Grammatical
Error Detection for Language Learners, 2nd edn. San Rafael: Morgan & Claypool
Publishers.
Lee, L. (1974) Developmental Sentence Analysis. Chicago: Northwestern University Press.
Lenneberg, E.H. (1967) Biological Foundations of Language. Hoboken: John Wiley &
Sons.
Long, S.H. (2019) Computerized Profiling (Version 10.0.0). Milwaukee: Marquette
University.
Lu, X. (2009) ‘Automatic Measurement of Syntactic Complexity in Child Language
Acquisition’, International Journal of Corpus Linguistics 14(1): 3–28.
Lu, X. (2010) ‘Automatic Analysis of Syntactic Complexity in Second Language Writing’,
International Journal of Corpus Linguistics 15(4): 474–96.
Lu, X. (2011) ‘A Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of
College-Level ESL Writers’ Language Proficiency’, TESOL Quarterly 45(1): 36–62.
Lu, X. (2012) ‘The Relationship of Lexical Richness to the Quality of ESL Learners’ Oral
Narratives’, The Modern Language Journal 96(2): 190–208.
Lu, X. (2014) Computational Methods for Corpus Annotation and Analysis. Dordrecht:
Springer.
Lu, X. and Ai, H. (2015) ‘Syntactic Complexity in College-Level English Writing:
Differences among writers with diverse L1 backgrounds’, Journal of Second Language
Writing 29: 16–27.
Lu, X. (2017) ‘Automated Measurement of Syntactic Complexity in Corpus-Based L2
Writing Research and Implications for Writing Assessment’, Language Testing 34(4):
493–511.
Lüdeling, A. and Hirschmann, H. (2015) ‘Error Annotation Systems’, in S. Granger, G.
Gilquin and F. Meunier (eds) The Cambridge Handbook of Learner Corpus Research, Cambridge:
Cambridge University Press, pp. 135–57.
Meunier, F. (2015) ‘Developmental Patterns in Learner Corpora’, in S. Granger, G. Gilquin
and F. Meunier (eds) The Cambridge Handbook of Learner Corpus Research,
Cambridge: Cambridge University Press, pp. 379–400.
Meunier, F. (2016) ‘Introduction to the LONGDALE Project’, in E. Castello, K. Ackerley
and F. Coccetta (eds) Studies in Learner Corpus Linguistics. Research and Applications
for Foreign Language Teaching and Assessment, Berlin: Peter Lang, pp. 123–26.
Murakami, A. and Alexopoulou, T. (2016) ‘L1 Influence on the Acquisition Order of
English Grammatical Morphemes: A learner corpus study’, Studies in Second Language
Acquisition 38(3): 365–401.
MacWhinney, B. (2000) The CHILDES Project: Tools for Analyzing Talk, 3rd edn.
Mahwah: Lawrence Erlbaum Associates.
Nelson, N.W. (1998) Childhood Language Disorders in Context: Infancy Through
Adolescence, 2nd edn. Boston: Allyn & Bacon.
O’Keeffe, A. and Mark, G. (2017) ‘The English Grammar Profile of Learner Competence:
Methodology and key findings’, International Journal of Corpus Linguistics 22(4):
457–89.
Ortega, L. (2014) ‘Second Language Learning Explained? SLA Across 10 Contemporary
Theories’, in B. VanPatten and J. Williams (eds) Theories in Second Language
Acquisition: An Introduction, 2nd edn, New York: Routledge, pp. 245–72.
Pence, L.K. and Justice, L.M. (2017) Language Development from Theory to Practice, 3rd
edn. New York: Pearson.
Ramer, A.L.H. (1977) ‘The Development of Syntactic Complexity’, Journal of
Psycholinguistic Research 6: 145–61.
Römer, U. (2019) ‘A Corpus Perspective on the Development of Verb Constructions in
Second Language Learners’, International Journal of Corpus Linguistics 24(3): 268–90.
Rosenberg, S. and Abbeduto, L. (1987) ‘Indicators of Linguistic Competence in the Peer
Group Conversational Behavior of Mildly Retarded Adults’, Applied Psycholinguistics
8(1): 19–32.
Russell, J. (2004) What is Language Development? Rationalist, Empiricist, and Pragmatist
Approaches to the Acquisition of Syntax. Oxford: Oxford University Press.
Scarborough, H.S. (1990) ‘Index of Productive Syntax’, Applied Psycholinguistics 11: 1–22.
VanPatten, B. and Williams, J. (eds) (2014) Theories in Second Language Acquisition: An
Introduction, 2nd edn. New York: Routledge.
Verspoor, M.H., De Bot, K. and Lowie, W. (eds) (2011) A Dynamic Approach to Second
Language Development: Methods and Techniques. Amsterdam: John Benjamins.
Wen, Q., Wang, L. and Liang, M. (2005) Spoken and Written English Corpus of Chinese
Learners. Beijing: Foreign Language Teaching and Research Press.
White, L. (2014) ‘Linguistic Theory, Universal Grammar, and Second Language
Acquisition’, in B. VanPatten and J. Williams (eds) Theories in Second Language
Acquisition: An Introduction, 2nd edn. New York: Routledge, pp. 34–53.
Wolfe-Quintero, K., Inagaki, S. and Kim, H.-Y. (1998) Second Language Development in
Writing: Measures of Fluency, Accuracy, and Complexity. Honolulu: University of
Hawai’i, Second Language Teaching and Curriculum Center.