LU 2021 - What Can Corpus Software Reveal
Xiaofei Lu
Pennsylvania State University
All content following this page was uploaded by Xiaofei Lu on 19 April 2022.
Language development refers to the process in which the language faculty develops in a
human being. First language (L1) development is concerned with how children acquire the
capability of their native language, while second language (L2) development is concerned
Theories of L1 development generally need to address at least the following three questions:
what children bring to the language learning task, what mechanisms drive language
acquisition, and what types of input support the language-learning system (Pence and
Justice 2017). Psychologists have taken drastically different approaches to answering these
questions, among which the rationalist, empiricist, and pragmatist paradigms have been the
linguistics, takes the view that the language faculty does not depend on external sources for
its content but is internal to each individual. For rationalists, children are born with innate
formal knowledge of a universal grammar, and they bring this domain-specific knowledge
to the task of acquiring the I-language (i.e. the internal and individual language) of their
native tongue. Language input is used to discover the parameters that their native language
uses to satisfy the universal grammar. The empiricist approach, upheld by connectionists,
holds that the content of the language faculty is not innate, but is derived from perceptual
learning to acquire the rules and representations of their native language through experience
with sufficient speech input. The pragmatist or socio-cognitivist approach maintains that
children recruit their socio-cognitive capacity to actively construct their language faculty.
Within this paradigm, language is viewed as a socio-cultural action, and the language
working theories of their mother tongue from the evidence that is available to them.
the L1, the contributions of the linguistic environment, and the role of instruction (Ortega
acquisition are presented in VanPatten and Williams (2014) and Atkinson (2011). These
theories take different stances with respect to the various aspects of L2 development. For
have the innate ability to acquire language), argues that L2 learners cannot obtain
knowledge of ungrammaticality and ambiguity from linguistic input, but possess pre-existing
knowledge of the grammar that constrains their learning task (White 2014).
Still different is the Vygotskian sociocultural theory, which views L2 knowledge as socially
distributed, ‘places mediation, either by other or self, at the core of development and use’
and understands development ‘not only in terms of target-like performance but also in
terms of the quality and quantity of external mediation required’ (Lantolf and Thorne 2011:
24).
In addition to the theoretical question of how language development takes place, another
important and more practical question that is of interest to teachers, researchers, parents,
in, or in other words, how much a child or L2 learner knows about the language system and
for children suffering any delay or disorder in their language development. There are
comprehension, and judgment tasks; formal testing; and language sample analysis, among
others. In this section, we focus on how language development can be measured through
A number of measures of language development have been proposed and explored in the
child language development literature. Some measures are based on verbal output, e.g.
mean length of utterance (MLU) (Brown 1973) and number of different words (NDWs),
while others are based on structural analysis, e.g. Developmental Sentence Scoring (DSS)
(Lee 1974), Index of Productive Syntax (IPSyn) (Scarborough 1990), and Developmental
Level (D-Level) (Rosenberg and Abbeduto 1987; Covington et al. 2006). Both DSS and
IPSyn were developed to evaluate children’s grammatical development, although they work
in different ways. The DSS metric assigns a score to each sentence. It considers eight
types of grammatical forms: indefinite pronouns, personal pronouns,
main verbs, secondary or embedded verbs, conjunctions, negatives and two types of
questions. Variants of the same type of grammatical form are scored differently based on
the order in which children develop the ability to use them. The score of a sentence is the
sum of the points for each type plus one point if the sentence is fully grammatical. The
average DSS of a speaker can be computed using a representative language sample. The
IPSyn metric does not apply to individual sentences but examines the number of times 56
target grammatical structures are used in a sample produced by a speaker. These include
various types of noun phrases, verb phrases, questions, and some specific sentence
structures. Each occurrence of any of the target grammatical structures in the language
sample receives one point. However, a maximum of two occurrences of each structure are
counted, and the maximum score a language sample can receive is 112. The D-Level scale
classifies each sentence into one of eight increasingly complex categories (Levels
zero through seven), depending on the syntactic structures it contains. For example, a
sentence containing a finite clause as the object of the main verb is classified as Level three,
and a sentence containing an embedded clause serving as the subject of the main verb is
production that can be used to index the learner’s level of development or overall
proficiency in the target language. This is generally achieved by assessing the development
of L2 learners at known proficiency levels in the target language using various measures.
evaluate and describe the learner’s developmental level in a more precise way. In addition,
they can also be used to examine the effect of a particular pedagogical treatment on
measures explored in 39 second and foreign language writing studies and recommended
several measures that were consistently linear and significantly related to program or school
levels as the best measures of development or error. These include three measures of
fluency, i.e. mean length of T-unit, where a T-unit is a main clause plus any subordinate
clauses (Hunt 1965), mean length of clause, and mean length of error-free T-unit; two
measures of accuracy, i.e. error-free T-units per T-unit, and errors per T-unit; two measures
of grammatical complexity, i.e. clauses per T-unit, and dependent clauses per clause; and
two measures of lexical complexity, i.e. total number of word types divided by the square
root of twice the total number of word tokens, and total number of sophisticated word types
divided by total number of word types. In terms of syntactic complexity, recent research
has argued for the need to focus on more fine-grained measures (Kyle 2016), measures of
phrasal complexity, as well as co-occurrence patterns of lexico-grammatical features (e.g.
Biber et al. 2016). In a similar spirit, Hawkins and Buttery (2010) proposed and illustrated
the identification of a systematic set of criterial features for each proficiency level in the
captured the emergence, frequency, accuracy, and usage distribution of relevant linguistic
properties (e.g., verb co-occurrence frames and relative clauses) characterizing each
proficiency level. They further argued for the need to consider L1-specific features, a view
and/or usage of different linguistic properties (e.g. Lu and Ai 2015; Murakami and
Alexopoulou 2016).
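To make the ratio-based measures listed above more concrete, the following minimal Python sketch computes most of them from counts that are assumed to be available already (for example, from manual coding or a parser). The function name, parameter names, and example sentence are illustrative assumptions and do not come from any of the studies or tools cited here. Mean length of error-free T-unit is left out because it requires word counts for the error-free T-units only.

```python
from math import sqrt


def l2_writing_measures(tokens, n_t_units, n_clauses, n_dependent_clauses,
                        n_error_free_t_units, n_errors, sophisticated_words):
    """Compute ratio-based fluency, accuracy, and complexity measures.

    All unit counts (T-units, clauses, errors) are assumed to have been
    obtained beforehand; this sketch only performs the arithmetic.
    """
    n_tokens = len(tokens)
    types = {t.lower() for t in tokens}
    # A word type counts as sophisticated if it appears in the supplied list
    # (a hypothetical stand-in for a frequency-band or academic word list).
    sophisticated_types = types & {w.lower() for w in sophisticated_words}
    return {
        # Fluency
        "mean_length_of_t_unit": n_tokens / n_t_units,
        "mean_length_of_clause": n_tokens / n_clauses,
        # Accuracy
        "error_free_t_units_per_t_unit": n_error_free_t_units / n_t_units,
        "errors_per_t_unit": n_errors / n_t_units,
        # Grammatical complexity
        "clauses_per_t_unit": n_clauses / n_t_units,
        "dependent_clauses_per_clause": n_dependent_clauses / n_clauses,
        # Lexical complexity: word types over sqrt(2 * word tokens)
        "types_per_sqrt_two_tokens": len(types) / sqrt(2 * n_tokens),
        "sophisticated_types_per_type": len(sophisticated_types) / len(types),
    }


if __name__ == "__main__":
    # One T-unit made up of a main clause and one dependent clause.
    sentence = "the cat sat on the mat because the dog barked".split()
    for name, value in l2_writing_measures(
            sentence, n_t_units=1, n_clauses=2, n_dependent_clauses=1,
            n_error_free_t_units=1, n_errors=0,
            sophisticated_words={"barked"}).items():
        print(f"{name}: {value:.3f}")
```

In practice the unit counts would come from a tool or from human annotation; the sketch only shows how the measures relate to those counts.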
3. How can we use a corpus to find out more about first language development?
In this section, we discuss several ways in which a corpus of child language development
data may be used to find out more about L1 development. Some of these will be illustrated
using the following corpora and corpus analysis software: the Child Language Data
program (MacWhinney 2000), Computerized Profiling (Long 2019), and D-Level Analyzer
The CHILDES database contains transcripts and media data collected from conversations
between young children of different ages and their parents, playmates, and caretakers.
These data are contributed by researchers from many different countries, following the
same data collection and transcription standards. Each file in the database contains a
transcript of a conversation and includes a header that encodes information about the target
child or children (e.g. age, native language, whether the child is normal in terms of
language development, etc.), other participants, the location and situation of the
conversation, the activities that are going on during the conversation, and the researchers
and coders collecting and transcribing the data. The conversation is transcribed in a one-
utterance-per-line format, with the producer of each utterance clearly marked in a prefix.
Each utterance is followed by another line that consists of a morphological analysis of the
utterance. Any physical actions accompanying the utterance are also provided in a separate
analyze data transcribed in the CHILDES format. Some of the automatic analyses that the
program can run on one or more files in the CHILDES database include word frequency,
type/token ratio, a measure of vocabulary diversity called D (Durán et al. 2004), mean
length of turn, mean length of utterance, and DSS score, among others.
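As a rough illustration of two of these analyses, the sketch below computes mean length of utterance and type/token ratio for one speaker from a transcript in which each utterance sits on its own line behind a speaker prefix. It is a simplified stand-in rather than a reimplementation of CLAN: it counts words instead of morphemes, ignores every line that does not carry the target speaker's prefix, and does not handle the special transcription codes of the real format. The sample transcript and all function names are invented for the example.

```python
def utterances_for(transcript, speaker="CHI"):
    """Return the word lists of all utterances produced by one speaker."""
    prefix = f"*{speaker}:"
    utterances = []
    for line in transcript.splitlines():
        if not line.startswith(prefix):
            continue  # other speakers, headers, and dependent tiers
        # Keep only items containing a letter, which drops bare punctuation.
        words = [w for w in line[len(prefix):].split()
                 if any(ch.isalpha() for ch in w)]
        if words:
            utterances.append(words)
    return utterances


def mlu_in_words(utterances):
    """Mean length of utterance, counted in words (not morphemes)."""
    if not utterances:
        return 0.0
    return sum(len(u) for u in utterances) / len(utterances)


def type_token_ratio(utterances):
    """Number of distinct word forms divided by number of word tokens."""
    tokens = [w.lower() for u in utterances for w in u]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Invented sample in a simplified one-utterance-per-line layout.
SAMPLE = (
    "*MOT:\twhat is that ?\n"
    "*CHI:\tdoggie .\n"
    "*CHI:\tmore doggie .\n"
    "*CHI:\tdoggie go home .\n"
)

if __name__ == "__main__":
    child = utterances_for(SAMPLE, "CHI")
    print("MLU (words):", mlu_in_words(child))
    print("TTR:", round(type_token_ratio(child), 3))
```

Counting morphemes, as the standard MLU measure does, would require reading the morphological analysis line that follows each utterance instead of the utterance line itself.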
range of different levels can be performed, including simple corpus statistics, semantics,
grammar, phonology, pragmatics, and narratives. For example, at the grammar level, the
following four procedures can be run: IPSyn, DSS, Black English Sentence Scoring (BESS)
(Nelson 1998), which is an adaptation of DSS for use with speakers of African American
Vernacular English, and the Language Assessment, Remediation, and Screening Procedure
(LARSP) (Crystal et al. 1989), a system for profiling the syntactic and discourse
syntactic complexity using the revised D-Level scale (Covington et al. 2006). Given a raw
sentence as input, the analyzer assigns it to an appropriate level on the scale. The program
achieves an accuracy of 93.2 per cent on spoken child language acquisition data from the
CHILDES database.
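Among the grammar-level procedures mentioned above, the IPSyn scoring rule described earlier is simple enough to sketch in a few lines: each of the 56 target structures earns one point per occurrence, capped at two, so that the maximum is 112. The Python sketch below assumes that the occurrences have already been identified and counted, which is the difficult part that tools such as Computerized Profiling address. The structure labels used in the example are hypothetical placeholders, not the official item names.

```python
from typing import Mapping

N_TARGET_STRUCTURES = 56        # number of target structures, as described above
MAX_CREDITED_OCCURRENCES = 2    # at most two occurrences per structure are counted


def ipsyn_style_score(occurrences: Mapping[str, int]) -> int:
    """Sum the credited occurrences of each target structure in a sample.

    `occurrences` maps a structure label to how often it was observed.
    Structures that are absent from the mapping contribute zero points.
    """
    if len(occurrences) > N_TARGET_STRUCTURES:
        raise ValueError("more structures supplied than the scale defines")
    if any(count < 0 for count in occurrences.values()):
        raise ValueError("occurrence counts cannot be negative")
    return sum(min(count, MAX_CREDITED_OCCURRENCES)
               for count in occurrences.values())


if __name__ == "__main__":
    # Hypothetical labels: one structure seen five times, one once, one never.
    print(ipsyn_style_score({"N1": 5, "V3": 1, "Q2": 0}))
```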
First of all, a corpus can be used to describe the characteristics of language produced by
children in different age groups or different stages of development. Children may exhibit a
children within the same age group. Researchers generally agree that there are certain
words start to emerge; at approximately 24 months of age, children possess more than 50
vocabulary items and begin to spontaneously join these items into self-created two-word
phrases; and at approximately 30 months of age, children produce utterances with at least
two words, and many with three or even five words (Lenneberg 1967). A large corpus
consisting of language samples produced by children of different age groups can be used to
milestones. The CHILDES database constitutes a good example of such a corpus. Given a
set of data that consists of transcripts of conversations involving targeted groups of children
in a particular age group, e.g. eighteen months, it is possible to use CLAN, Computerized
Profiling, and D-Level Analyzer to find out the average as well as the range of different
types of developmental metrics of interest exhibited by all the children in the group.
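Once a metric has been computed for every child, summarizing it by age group is a small aggregation step. The following Python sketch is one way this could look; it assumes that scores have already been produced by some tool and are available as (age in months, score) pairs. The numbers in the example are made up purely for illustration and are not drawn from CHILDES.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_age(records):
    """Group (age_in_months, score) pairs by age and report mean and range."""
    groups = defaultdict(list)
    for age, score in records:
        groups[age].append(score)
    return {
        age: {
            "n": len(scores),
            "mean": mean(scores),
            "min": min(scores),
            "max": max(scores),
        }
        for age, scores in sorted(groups.items())
    }


if __name__ == "__main__":
    # Invented scores (e.g. MLU values) for two age groups.
    records = [(18, 1.2), (18, 1.6), (24, 2.0), (24, 2.4), (24, 2.6)]
    for age, stats in summarize_by_age(records).items():
        print(age, stats)
```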
Second, a corpus can be used to investigate the sequence or order in which children acquire
different aspects of the system of their native language as well as to track the development
longitudinal data, i.e. data collected from the same child or group of children over an
extended period of time, e.g. one to five years. An early example of this type of research is
sequence of syntactic acquisition in seven children. Specifically, she aimed to find out
the production of S + V + O constructions” (p. 144). She analyzed her corpus data using a
relations produced and their expansions. She reported that the sequence of acquisition
specified in the hypothesized dimension was observed in the data from all seven children.
In a recent study, Khaghaninejad et al. (2018) analyzed a corpus in the CHILDES database
containing the utterances produced by five L1 Farsi Iranian children over a year to examine
the order of acquisition of Farsi consonants. Their analysis generated a ‘timeline’ for L1
Third, a corpus can be used to assess the validity and adequacy of the various metrics
such measures are often used for evaluating the level of language development of children
with developmental delays or disorders. One of the ways to approach this problem is
closely related to the descriptive and longitudinal research discussed above. Since these
metrics were proposed to measure language development, many of them were based on
research is Lu (2009), who analyzed data from the CHILDES database using the D-Level
Analyzer and reported a correlation of .648 (p < .001) between average D-Level scores and
significantly differentiates between the developmental levels of children with and without
developmental disorders within the same age group. A good example of this line of
research is Hewitt et al. (2005). They compared scores of kindergarten children with a
mean age of six years with and without specific language impairment (SLI) on three
commonly used measures, i.e. MLU in morphemes, IPSyn, and NDWs. They found that
children with SLI showed significantly lower mean scores for all of the three measures,
except for some subtests of the IPSyn. In relation to this line of research, a corpus can also
be used to provide normative information for valid and adequate measures. To improve the
feasibility of applying these measures in practical situations and to enable researchers and
clinicians to make sense of the analytical results using these measures, it is necessary to
have normative information for different age groups for benchmarking purposes. The
CHILDES database could again be used for providing such normative information.
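One simple way in which normative information might be used for benchmarking is to express an individual child's score as a standardized distance from the mean of the same-age norm sample. The Python sketch below illustrates this with a z-score. The choice of a z-score is an assumption made for the example rather than a procedure prescribed by the studies cited here, and the norm values are invented.

```python
from statistics import mean, stdev


def z_score(score, norm_scores):
    """Standardize a score against a same-age normative sample.

    Uses the sample standard deviation; at least two norm scores with
    some variation are required.
    """
    if len(norm_scores) < 2:
        raise ValueError("need at least two normative scores")
    spread = stdev(norm_scores)
    if spread == 0:
        raise ValueError("normative scores show no variation")
    return (score - mean(norm_scores)) / spread


if __name__ == "__main__":
    # Invented normative scores for one age group, e.g. MLU values.
    norms = [2.0, 2.5, 3.0, 3.5, 4.0]
    print(round(z_score(2.0, norms), 2))  # well below the group mean
    print(round(z_score(3.0, norms), 2))  # exactly at the group mean
```

How far below the mean a score must fall before it is treated as a sign of delay is a clinical decision that the sketch deliberately leaves open.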
Finally, a corpus can also be used to gain in-depth understanding of language development
quantitatively describe the developmental differences between children with and without
language disorders, e.g. in terms of vocabulary size and range of syntactic structures. In
addition, longitudinal data can also be used to investigate the effect of a particular
developmental trajectory of children with language disorders during the best window of
opportunity (Pence and Justice 2017). By analyzing language samples produced before and
4. How can we use a corpus to find out more about second language development?
In this section, we discuss a number of ways that a corpus of learner language can be used
to find out more about L2 development. The Longitudinal Database of Learner English
LONGDALE contains data from English learners from diverse L1 backgrounds, with all
learners contributing data at least once a year for three or more years. Various types of
spoken, written, and experimental data are included. The database also includes
comprehensive information about the learners and tasks, such as age, gender, language
background, proficiency level, and task type, among others. The International Corpus of
Learner English (ICLE; Version 2) (Granger et al. 2009), while initially designed for
comparing learner English among learners from different L1 backgrounds as well as against
L1 English, has good potential for L2 development research as well. This corpus contains
different mother tongue backgrounds. The following learner variables are recorded for each
written text: age, learning context, proficiency level, gender, mother tongue, region,
knowledge of other foreign languages, and L2 exposure. These variables allow for
cross-sectional or quasi-longitudinal analysis that can offer useful insight into learners’ L2
Various corpus processing tools can be used to analyze learner corpora in the different
ways to be discussed below (e.g. Lu 2014; 2017). For example, Coh-Metrix (McNamara et
al. 2014) can be used to assess the coherence and cohesion of language samples using a
large set of linguistic features. The Biber Tagger (Biber 1988) can be used to analyze a
Analyzer (LCA) (Lu 2012) and the Tool for the Automatic Analysis of Lexical
Sophistication (TAALES) (Kyle et al. 2018) can be used to assess the lexical density,
lexical diversity, and lexical sophistication of learner texts using a large number of metrics.
The L2 Syntactic Complexity Analyzer (L2SCA) (Lu 2010) and the Tool for the Automatic
Analysis of Syntactic Sophistication and Complexity (TAASSC) (Kyle 2016) are both
for automated grammatical error detection in learner writing are emerging (Leacock et al.
The first way a corpus can be used to reveal L2 development is as a database for describing
the characteristics of the interlanguage of learners at known proficiency levels. To this end,
it is necessary to have a learner corpus that encodes information about the learners’
e.g. classroom grades, holistic ratings, program levels, school levels, and standardized test
scores (Wolfe-Quintero et al. 1998). The CEFR has also been increasingly used as a
calibration for proficiency level within learner corpora, such as the Cambridge Learner
Corpus (CLC), which comprises data from the Cambridge English Language
Assessment (Barker et al. 2015), and the EF-Cambridge Open Language Database
(EFCAMDAT), which consists of written samples from over 174,000 adult learners of
English as a second language (ESL) across the world (Huang et al. 2018). Linking to the
same framework of proficiency makes the results from different data sources more
comparable. Nevertheless, it should be noted that the use of any type of calibration for
proficiency, be it age, year of schooling, or the CEFR, is a workaround for not being able to
obtain large amounts of genuine longitudinal data (e.g. Meunier 2015), which, if available,
would be the preferred data source for investigating L2 development. In terms of analysis,
one may choose to focus on a particular aspect of the interlanguage, for example, the
degree to which informal, colloquial patterns or styles are used in formal, written language.
system of the interlanguage. For example, in the English Grammar Profile project,
O’Keeffe and Mark (2017) examined the patterns of grammatical development across the
six levels of the CEFR using the CLC. Their project resulted in a database of over 1,200
This type of descriptive study can benefit both from error analysis and from contrastive
analysis of learner data and L1 speaker data. To conduct an error analysis, it is necessary to
identifying and annotating errors in learner text. An early example of an error annotation
scheme can be found in Granger (2003), which assigns each error first to one of the
following nine major domains: form, morphology, grammar, lexis, syntax, register, style,
punctuation, and typo, and then to a specific category within the domain. Lüdeling and
Hirschmann (2015) offer a systematic review of issues surrounding error annotation and
corpus enables one to easily identify the common errors that learners at a given proficiency
A contrastive study of learner data and L1 speaker data helps us to look at the
converges to or differs from L1 speaker usage. For example, one may assess whether
learners tend to overuse or underuse certain words, phrases, collocations, grammatical
constructions, speech acts, etc. relative to L1 speakers (Granger 1998; De Cock 2000). It is
important, however, to ensure that the learner data and the L1 speaker data are of
comparable nature in terms of mode, genre, field, etc. The Trinity Lancaster Corpus,
which contains 4.2 million words of interaction between English learners and L1 speakers
(Gablasova et al. 2019), constitutes an excellent source of data for this purpose. Importantly,
however, it should be noted that recent work in learner corpus research is starting to
move away from using L1 speaker data as a norm for comparing learner data and to focus on
L2 competence as an entity for analysis in its own right (e.g. Granger 2015).
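Overuse and underuse relative to a reference corpus are typically assessed by comparing frequencies that have been normalized for corpus size. The Python sketch below shows one possible way to do this, pairing a per-million-words frequency with the log-likelihood statistic that is widely used in corpus linguistics for frequency comparison. The choice of statistic, the function names, and the counts in the example are assumptions made for illustration and are not taken from the studies cited above.

```python
import math


def per_million(freq, corpus_size):
    """Frequency normalized to occurrences per million words."""
    return freq * 1_000_000 / corpus_size


def log_likelihood(freq_learner, size_learner, freq_ref, size_ref):
    """Log-likelihood (G2) for one item's frequency in two corpora."""
    total_freq = freq_learner + freq_ref
    total_size = size_learner + size_ref
    expected_learner = size_learner * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    g2 = 0.0
    for observed, expected in ((freq_learner, expected_learner),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def usage_direction(freq_learner, size_learner, freq_ref, size_ref):
    """Label the learner frequency relative to the reference corpus."""
    learner = per_million(freq_learner, size_learner)
    reference = per_million(freq_ref, size_ref)
    if learner > reference:
        return "overuse"
    if learner < reference:
        return "underuse"
    return "same"


if __name__ == "__main__":
    # Invented counts: an item occurring 150 times in 100,000 learner words
    # and 50 times in 100,000 reference words.
    print(usage_direction(150, 100_000, 50, 100_000))
    print(round(log_likelihood(150, 100_000, 50, 100_000), 2))
```

A larger log-likelihood value indicates a frequency difference that is less likely to be due to chance, but, as noted above, the comparison is only meaningful when the two corpora are comparable in mode, genre, and field.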
Second, a corpus may be used in developmental index studies to identify objective metrics
that can be used to index levels of L2 development or the learner’s overall language
contained substantial variability in terms of choice and definition of measures, writing task
used, sample size, corpus length, timing condition, etc., making it challenging to compare
the results reported (Wolfe-Quintero et al. 1998), as these factors have been found to affect
the CAF of learner language (e.g. Alexopoulou et al. 2017; Hsu 2019). To eliminate such
inconsistency and variability, recent research has evaluated or compared large sets of
measures on the same learner corpus or corpora. For example, Lu (2011) used L2SCA to
analyze large-scale L2 writing data from the Written English Corpus of Chinese Learners
(WECCL) (Wen et al. 2005). The corpus is a collection of over 3,000 essays written by
English majors in nine different colleges in China. Each essay in the corpus is annotated
with a header that includes the following information: mode (written or spoken), genre
(argumentation, narration, or exposition), school level (first, second, third, or fourth year in
college), year of admission (2000, 2001, 2002, or 2003), timing condition (timed with a 40-
minute limit or untimed), institution (a two- to four-letter code), and length (number of
words in the essay). Students in the same school level within the same institution wrote on
the same topics, but topics varied from institution to institution. Given the information that
is available in the corpus, proficiency level is conceptualized using school level. Through
the analysis, this study provided useful insights into how different syntactic complexity
relate to each other, and how their performances are affected by external factors.
Third, a corpus can be used to examine the contributions of knowledge of the L1 as well as
the effect of L1 transfer. On the one hand, knowledge of the L1 may prove helpful in
learning certain aspects of the L2, and learners with different L1 backgrounds may show
strengths in learning different aspects of the L2. On the other hand, the intrusion of L1 may
of certain forms or grammatical patterns that deviate from the target language in the
influence, either positive or negative, on learner development and output (e.g. Granger et al.
2015; Murakami and Alexopoulou 2016). The ICLE corpus constitutes an excellent source
of data for this type of research, as students with diverse L1 backgrounds are represented. A
contrastive study of a learner’s L1 and interlanguage will provide further evidence on the
L1 influence. One example of this type of research is Lu and Ai (2015), who analyzed data
from the ICLE and the Louvain Corpus of Native English Essays (LOCNESS) (Granger
1996) with L2SCA to examine differences in the syntactic complexity in English writing
groups and one L1 group. They reported that the seven L2 groups demonstrated drastically
Fourth, longitudinal learner corpora can be used to examine the trajectories and patterns of
learner development and to provide evidence to validate or challenge the claims and
subsystems of the language (Verspoor et al. 2011). Multiple longitudinal studies from this
approach have reported evidence that L2 developmental trajectories as well as the patterns
of interaction among different CAF features are highly variable and that such variability
follows the principles of dynamic systems (e.g. Larsen-Freeman 2006; Caspi 2010). Usage-
levels of complexity that are entrenched as language knowledge in the speakers’ mind
(Goldberg 1995). Research within this framework posits that language acquisition is shaped
by exposure to and usage of language and has reported that a learner’s repertoire of
constructions starts with fixed sequences and becomes increasingly complex and
productive (Ellis et al. 2016). For example, Römer (2019) analyzed verb-argument
constructions (VACs) in a large-scale corpus of written texts produced by L2 learners at
varied levels of English proficiency and found that the learners’ inventory of VACs
developed from fixed sequences to more diverse, productive, and complex ones.
Finally, a corpus may be used to examine the role of instruction or the effect of a particular
of different groups of learners at the same school level or program level that are exposed to
addition, by comparing the learner’s production prior to and after a period of targeted
pedagogical intervention, we may assess whether the intervention is effective in helping the
As a field, corpus-based language development research will benefit tremendously from the
learners often contain many errors and as such present a challenge to natural language
processing (NLP) technology, especially when it comes to measures that involve syntactic,
technology and development of robust new NLP technology will facilitate more accurate
and reliable automatic analysis of language samples using more diversified measures. A
second avenue for future development in the field lies in the systematic collection and
sharing of large-scale child and L2 development data that encodes richer information about
the children or learners producing the data. For child language development research, large-
scale longitudinal data and data of children with language disorders are particularly
valuable. The Growth in Grammar Corpus (Durrant et al. forthcoming), a large collection
of texts written by school children in England as part of their school work, constitutes an
school levels, program levels, standardized test scores, holistic ratings, classroom grades,
etc. Large-scale data with richer information will make it easier to draw more reliable
conclusions for many of the types of research discussed above. Finally, analysis of L2
development data will benefit from the development of consistent and standardized error
development researchers have often devised their own annotation schemes for error
analysis, which makes comparison and sharing of research results problematic. The field in
general will benefit from a more consistent annotation scheme. There has also been an
increasing stream of research in automatic error detection and correction (Crossley et al.
2019; Leacock et al. 2014). The maturation of such techniques will facilitate automatic error
analysis of large-scale L2 development data and enable researchers to gain more reliable
6. Further reading
Atkinson, D. (ed.) (2011) Alternative Approaches to Second Language Acquisition.
Lu, X. (2014) Computational Methods for Corpus Annotation and Analysis. Singapore:
Springer. (This book provides a systematic and accessible introduction to diverse types of
computational tools that can be used for automatic or computer-assisted annotation and
MacWhinney, B. (2000) The CHILDES Project: Tools for Analyzing Talk, 3rd edn.
Mahwah: Lawrence Erlbaum Associates. (This book provides hands-on instruction on how
to transcribe naturalistic child language development data following the CHILDES format
and automatically analyze such data using CLAN. Readers are introduced to a set of
computational tools designed to improve the readability of transcripts, to automate the data
Pence, L.K. and Justice, L.M. (2017) Language Development from Theory to Practice, 3rd
edn. New York: Pearson. (This book provides an extremely accessible introduction to the
theory and practice of child language development. The material presented in the book is
Introduction, 2nd edn. New York: Routledge. (This edited volume presents a
comprehensive introduction to early and contemporary theories in second language
References
Alexopoulou, T., Michel, M., Murakami, A. and Meurers, D. (2017) ‘Task Effects on
180–208.
Abingdon: Routledge.
Barker, F., Salamoura, A. and Saville, N. (2015) ‘Learner Corpora and Language Testing’,
Biber, D. (1988) Variation Across Speech and Writing. Cambridge: Cambridge University
Press.
Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. (1999) Longman Grammar
Biber, D., Gray, B. and Staples, S. (2016) ‘Predicting Patterns of Grammatical Complexity
Across Language Exam Task Types and Proficiency Levels’, Applied Linguistics 37(5):
639–68.
Covington, M.A., He, C., Brown, C., Naçi, L. and Brown, J. (2006) How Complex is that
Sentence? A Proposed Revision of the Rosenberg and Abbeduto D-Level Scale. Atlanta:
De Cock, S. (2000) ‘Repetitive Phrasal Chunkiness and Advanced EFL Speech and
Writing’, in C. Mair and M. Hundt (eds) Corpus Linguistics and Linguistic Theory,
Durán, P., Malvern, D., Richards, B. and Chipere, N. (2004) ‘Developmental Trends in
Ellis, N.C., Römer, U. and O’Donnell, M.B. (2016) Usage-based Approaches to Language
Gablasova, D., Brezina, V. and McEnery, T. (2019) ‘The Trinity Lancaster Corpus:
Studies. Lund Studies in English, Vol. 88, Lund: Lund University Press, pp. 37–51.
Granger, S. (ed.) (1998) Learner English on Computer. Boston: Addison Wesley Longman.
Granger, S., Dagneaux, E., Meunier, F. and Paquot, M. (2009) International Corpus of
Granger, S., Gilquin, G. and Meunier, F. (eds) (2015) The Cambridge Handbook of
Hawkins, J. and Buttery, P. (2010) ‘Criterial Features in Learner Corpora: Theory and
Hewitt, L.E., Scheffner, H.C., Yont, K.M. and Tomblin, J.B. (2005) ‘Language Sampling
for Kindergarten Children With and Without SLI: Mean length of utterance, IPSYN,
Hsu, H.-C. (2019) ‘The Combined Effect of Task Repetition and Post-Task Transcribing on
172–87.
Huang, Y., Murakami, A., Alexopoulou, T. and Korhonen, A. (2018) ‘Dependency Parsing
Hunt, K.W. (1965) Grammatical Structures Written at Three Grade Levels. Urbana:
Khaghaninejad, M.S., Moloodi, A. and Saadi, R.F. (2018) ‘A Timeline for Acquisition of
Kyle, K., Crossley, S. and Berger, C. (2018) ‘The Tool for the Automatic Analysis of
1030–46.
Lantolf, J.P. and Thorne, S.L. (2011) ‘The Sociocultural Approach to Second Language
Oral and Written Production of Five Chinese Learners of English’, Applied Linguistics
27(4): 590–619.
Leacock, C., Chodorow, M., Gamon, M. and Tetreault, J. (2014) Automated Grammatical
Error Detection for Language Learners, 2nd edn. San Rafael: Morgan & Claypool
Publishers.
Lenneberg, E.H. (1967) Biological Foundations of Language. Hoboken: John Wiley &
Sons.
University.
Lu, X. (2012) ‘The Relationship of Lexical Richness to the Quality of ESL Learners’ Oral
Lu, X. (2014) Computational Methods for Corpus Annotation and Analysis. Dordrecht:
Springer.
Lu, X. (2017) ‘Automated Measurement of Syntactic Complexity in Corpus-Based L2
Writing Research and Implications for Writing Assessment’, Language Testing 34(4):
493–511.
and F. Coccetta (eds) Studies in Learner Corpus Linguistics. Research and Applications
for Foreign Language Teaching and Assessment, Berlin: Peter Lang, pp. 123–26.
MacWhinney, B. (2000) The CHILDES Project: Tools for Analyzing Talk, 3rd edn.
O’Keeffe, A. and Mark, G. (2017) ‘The English Grammar Profile of Learner Competence:
457–89.
Ortega, L. (2014) ‘Second Language Learning Explained? SLA Across 10 Contemporary
Pence, L.K. and Justice, L.M. (2017) Language Development from Theory to Practice, 3rd
8(1): 19–32.
Scarborough, H.S. (1990) ‘Index of Productive Syntax’, Applied Psycholinguistics 11: 1–22.
Verspoor, M.H., De Bot, K. and Lowie, W. (eds) (2011) A Dynamic Approach to Second
Wen, Q., Wang, L. and Liang, M. (2005) Spoken and Written English Corpus of Chinese
White, L. (2014) ‘Linguistic Theory, Universal Grammar, and Second Language
Wolfe-Quintero, K., Inagaki, S. and Kim, H.-Y. (1998) Second Language Development in