Research Timeline: Assessing Second Language Speaking
Glenn Fulcher
School of Education, University of Leicester
Biodata: Glenn Fulcher is Professor of Education and Language Assessment at the University
of Leicester, and Head of the School of Education. He has published widely in the field of
language testing, from journals such as Language Testing, Language Assessment Quarterly,
Applied Linguistics and System, to monographs and edited volumes. His books include
Testing second language speaking (Longman 2003), Language testing and assessment: An
advanced resource book (Routledge 2007), Practical language testing (Hodder 2010), and
the Routledge handbook of language testing (Routledge 2012). He currently co-edits the Sage journal Language Testing.
Introduction
While the viva voce (oral) examination has always been used in content-based educational
assessment (Latham 1877, p. 132), the assessment of second language (L2) speaking in
performance tests is relatively recent. The impetus for the growth in testing speaking during
the 19th and 20th centuries is twofold. Firstly, in educational settings the development of
rating scales was driven by the need to improve achievement in public schools, and to
communicate that improvement to the outside world. Chadwick (1864, see timeline) implies
that the rating scales first devised in the 1830s served two purposes: providing information to
the classroom teacher on learner progress for formative use, and generating data for school
accountability. From the earliest days, such data were used by parents to select schools for
their children in order to ‘maximize the benefit of their investment’ (Chadwick 1858).
Secondly, in military settings it was imperative to be able to predict which soldiers were able
to undertake tasks in the field without risk to themselves or other personnel (Kaulfers 1944, see timeline). Many of the key developments in speaking test design and rating scales thus have their origins in these two domains.
The speaking assessment project is therefore primarily a practical one. The need for
speaking tests has expanded from the educational and military domains to decision making for
international mobility, entrance to higher education, and employment. But investigating how
we make sound decisions based on inferences from speaking test scores remains the central research concern. A model of speaking test performance helps focus attention on facets of the testing context under investigation. The first such model, developed by Kenyon (1992), was subsequently extended by McNamara (1995), Milanovic & Saville (1996), Skehan (2001), Bachman (2001), and most recently by Fulcher (2003, p. 115),
providing a framework within which research might be structured. The latter is reproduced
here to indicate the extensive range of factors that have been and continue to be investigated
in speaking assessment research, and these are reflected in my selection of themes and texts for this timeline.
Figure 1. Expanded model of speaking test performance (Fulcher 2003, p. 115): rater(s) (characteristics, training); local performance conditions; interlocutor(s); task (orientation, interactional relationship, goals, interlocutors, topics, situations, difficulty); additional task characteristics or conditions as required for specific contexts; test taker (individual variables, e.g. personality); performance; score and inferences about the test taker; decisions and consequences.
Overviews of the issues illustrated in figure 1 can be found in a number of texts devoted to assessing speaking that I have not included in the timeline (Lazaraton 2002; Fulcher 2003; Luoma 2004; Taylor (ed.) 2011). Rather, I have selected publications based on 12 themes that
arise from these texts, from figure 1, and from my analysis of the literature.
Themes that pervade the research literature are rating scale development, construct
definition, operationalisation, and validation. Scale development and construct definition are
inextricably bound together because it is the rating scale descriptors that define the construct.
Yet, rating scales are developed in a number of different ways. The data-based approach
requires detailed analysis of performance. Others are informed by the views of expert judges
using performance samples to describe levels. Some scales are a patchwork quilt created by
bundling descriptors from other scales together based on scaled teacher judgments. How we
define the speaking construct and how we design the rating scale descriptors are therefore central design decisions. Underlying these design decisions are research issues that are extremely contentious.
Perhaps these can be presented in a series of binary alternatives to show stark contrasts:
Specific purposes tests vs. Generalizability. Should the construct definition and task design be tied to a specific real-world domain, or should the aim be to generate scores that are relevant to any and every type of real-world decision that we may wish to make? This is critical not least because the more generalizable we wish scores to be, the more abstract the constructs and scoring criteria become.
Psycholinguistic criteria vs. Sociolinguistic criteria. Closely related to the specific purpose
issue is the selection of scoring criteria. Usually, the more abstract or psycholinguistic the
criteria used, the greater the claims made for generalizability. These criteria or ‘facilities’ are said to be part of the construct of speaking that is not context dependent. These may be the kinds of criteria familiar from rating scales, such as fluency and accuracy, or lower-level psycholinguistic measures of processing and delivery. The latter are required for the automated assessment of speaking. Yet, as the generalizability
claim grows, the relationship between score and any specific language use context is eroded.
This particular antithesis is not only a research issue, but one that impacts upon the
commercial viability of tests; it is therefore not surprising that from time to time the
arguments flare up, and research is called into the service of confirmatory defence (Chun 2006; Downey et al. 2008).
Normal conversation vs. Domain specific interaction. It is widely claimed that the ‘gold standard’ for spoken interaction is casual conversation between friends, in which there are no power differentials, so that all participants have equal speaking rights.
Other types of interaction are compared to this ‘norm’ and the validity of test formats such as
the interview is brought into question (e.g. Johnson 2001). But we must question whether
‘friends chatting’ is indeed the ‘norm’ in most spoken interaction. In higher education, for
example, this kind of talk is very rare, and scores from simulated ‘normal’ conversations are of limited value for decisions about prospective international teaching assistants. Research that describes the language used in specific communicative contexts to support test design is becoming more common, such as that in academic contexts to underpin tests of English for academic purposes (e.g. Biber 2006).
Rater cognition vs. Performance analysis. It has become increasingly common to look at
‘what raters pay attention to’. When we discover what is going on in their heads, should it be
treated as construct-irrelevant if it is at odds with the rating scale descriptors and/or an analysis of performance on test tasks? Or should it be used to define the construct and
populate the rating scale descriptors? Do all raters bring the same analysis of performance to
the task? Or are we merely incorporating variable degrees of perverseness that dilute the
construct? The most challenging question is perhaps: Are rater perceptions at odds with
reality?
Freedom vs. Control. Left to their own devices, raters tend to vary in how they score the same
performance. The variability decreases if they are trained; and it decreases over time through
the process of social moderation. With repeated practice raters start to interpret performances
in the same way as their peers. But when severed from the collective for a period of time,
judges begin to reassert their own individuality, and disagreement rises. How do we identify
and control this variability? This question now extends to interlocutor behaviour, as we know
that interlocutors provide differing levels of scaffolding and support to test takers. This
variability may lead to different scores for the same test taker depending on which
interlocutor they work with. Much work has been done on the co-construction of speech in
test contexts. And here comes the crunch. For some, this variation is part of a richer speaking
construct and should therefore be built into the test. For others, the variation removes the
principle of equality of experience and opportunity at the moment of testing, and therefore
the interlocutors should be controlled in what they say. In face-to-face speaking tests we have
seen the growth of the interlocutor frame to control speakers, and proponents of indirect
speaking tests claim that the removal of an interlocutor eliminates subjective variation.
Any selection of publications for a timeline is necessarily partial, and the list cannot be exhaustive. My selection avoids clustering in particular years or
decades, and attempts to show how the contrasts and themes identified play out historically.
You will notice that themes H and I are different from the others in that they are about methodologies used in speaking assessment research, and may help others to identify key discourse or multi-faceted Rasch measurement (MFRM) studies. What I have not been able to cover is the assessment of
pronunciation and intonation, or the detailed issues surrounding semi-direct (or simulated)
tests of speaking, both of which require separate timelines. Finally, I am very much aware
that the assessment of speaking was common in the United Kingdom from the early 20th century. Yet, there is sparse reference to research outside the United States in the early part of the timeline. The reason for this is that, apart from Roach (see timeline; reprinted as an appendix in Weir, Vidaković & Galaczi (eds.) 2013), there is very little published research from Europe (Fulcher 2003, p. 1). The requirement that research be in the public domain for
independent inspection and critique was a criterion for selection in this timeline. For a
retrospective interpretation of the early period in the United Kingdom with reference to
unpublished material and confidential internal examination board reports to which we do not
have access, see Weir & Milanovic (2003) and Vidaković & Galaczi (2013).
Themes
G. Washback
H. Discourse analysis
K. Rater cognition
L. Test-taker characteristics
References
presented at the meeting of the American Association of Applied Linguistics. St. Louis,
Missouri, February.
Biber, D. (2006). University language: A corpus-based study of spoken and written registers. Amsterdam: John Benjamins.
in the public service. Journal of the Statistical Society of London 21.1, 18–51.
Downey, R., H. Farhady, R. Present-Thomas, M. Suzuki & A. Van Moere (2008). Evaluation of the usefulness of the Versant for English Test: A response. Language Assessment Quarterly.
Oral Proficiency Interview. New Haven and London: Yale University Press.
scales in language testing. Paper delivered at the 14th Language Testing Research Colloquium.
Press.
Skehan, P. (2001). Tasks and language performance assessment. In M. Bygate, P. Skehan, &
M. Swain (eds.), Researching pedagogic tasks: Second language learning, teaching and testing. Harlow: Longman.
Taylor, L. (ed.) (2011). Examining speaking: Research and practice in assessing second language speaking. Cambridge: Cambridge University Press.
Weir, C. & M. Milanovic (eds.) (2003). Continuity and innovation: Revising the Cambridge Proficiency in English Examination 1913–2002. Cambridge: Cambridge University Press.
Weir, C. J., I. Vidaković & E. D. Galaczi (eds.) (2013). Measured constructs: A history of Cambridge English examinations 1913–2012. Cambridge: Cambridge University Press.
Vidaković, I. & E. D. Galaczi (2013). The measurement of speaking ability 1913–2012. In C. J. Weir, I. Vidaković & E. D. Galaczi (eds.), Measured constructs: A history of Cambridge English examinations 1913–2012. Cambridge: Cambridge University Press, 257–346.
products. The School Review Galton and Cattell towards the end of
language testing. London and the military domain that were piloted
support Criterion-referenced
assessment.
1944 Kaulfers, W. V. (1944). War- The interwar years saw a rapid A, B, D
war: the United States A.S.T. geared towards the military context
domain specific.
1945 Roach, J. O. (1945). Some Roach was among the first to E
over time.
1952/ Foreign Service Institute. Little progress was made in testing A, B,
this timeline.
of the FSI oral interview test. construct at the top band (level six).
(1979). Testing kit: French and new rating scales, the first testing E, G
educational purposes.
1980 Adams, M. L. (1980). Five co- Adams conducted the first structural B
(1981). The construct validity were carried out in the early 1980s,
OPI.
1983 Lowe, P. (1983). The ILR oral In the 1960s the FSI approach to A, C, D
diplomatic services
STANAG 6001.
1984 Liskin-Gasparro, J. E. (1984). Following the publication of A, B
models of communicative
strategies.
1985 Lantolf, J. P. & W. Frawley Lantolf & Frawley were among the A, B
Canale, M. & M. Swain (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics 1.1, 1–47.
understand under-specification in
audience in mind.
1992 Young, R. & M. Milanovic An early and significant use of B, C,
argument.
1992 Ross, S. & R. Berwick (1992). Reacting to critiques of the OPI from B, C,
LAZARATON (1996).
1992 Mislevy, R. J. (1992). Linking LOWE (1983) and others had argued E
speaking proficiency.
1995 Chalhoub-Deville, M. (1995). Chalhoub-Deville investigated the A, B, E
performances.
1995 Lumley, T. & T. McNamara Rater variability is studied across E, I
performance as recommended by
(2010).
1996 McNamara, T. (1996). McNamara described the A, B,
Issues in task design and the others, this study compared a group
takers.
1996 Fulcher, G. (1996). Does thick Based on work conducted since A, B,
project.
1996 Lazaraton, A. (1996). In the ROSS & BERWICK (1992) B, H, J
attention to. In M. Milanovic & how raters use rating scales, and
from the 15th Language Testing own conceptual baggage to the rating
research.
1998 Young, R. & A. W. He (1998) An important collection of research B, C, H
(2012).
2002 Young, R. (2002). Discourse A careful investigation of the ‘layers’ B, C, H
construct validation.
2003 Brown, A. (2003). Interviewer A much quoted study into variation B, H, I,
variation and the co- in the speech of the same test taker J
opportunity.
2003 Fulcher, G. & R. Marquez- An investigation into the effects of B, C, H
Reiter (2003). Task difficulty in task features (social power and level
group oral discussion task. categories. Test taker ability was the
behaviour.
2005 Cumming, A., L. Grant, P. An important prototyping study. Pre- B, C, K
Princeton, NJ: Educational the tasks and asked whether these are
controlled for.
2008 Galaczi, E. D. (2008). Peer- Galaczi presents a discourse analytic B, C, H
test: The case of the First which two candidates are required to
this area.
2010 Poonpon, K. (2010). A study that brings together the EBB A, B,
pragmatic constructs.
2011 Frost, K., C. Elder & G. Integrated task types have become A, B, C
interpretation: understanding
discourse is co-constructed.
2011 Nakatsuhara, F. (2011). Effects Building on BONK & OCKEY (2003) B, H
testing era.
2012 Tan, J., B. Mak & P. Zhou This paper applies fuzzy logic to our E, J
does fuzziness become exact? into account both rater decisions, and
those decisions.
2014 Nitta, R. & F. Nakatsuhara Nitta & Nakatsuhara investigate C, H
speakers.
Acknowledgements
I would like to thank Dr. Gary Ockey of Educational Testing Service for reviewing my first
draft, and providing valuable critical feedback. My thanks are also due to the three reviewers, whose very constructive criticism has considerably improved the coverage and coherence of the timeline. Finally, I thank the editor of Language Teaching for timely and helpful guidance.