
Research timeline: Assessing second language speaking

Glenn Fulcher

University of Leicester, United Kingdom

School of Education

Biodata: Glenn Fulcher is Professor of Education and Language Assessment at the University of Leicester, and Head of the School of Education. He has published widely in the field of language testing, from journals such as Language Testing, Language Assessment Quarterly, Applied Linguistics and System, to monographs and edited volumes. His books include Testing second language speaking (Longman 2003), Language testing and assessment: An advanced resource book (Routledge 2007), Practical language testing (Hodder 2010), and The Routledge handbook of language testing (Routledge 2012). He currently co-edits the Sage journal Language Testing.

Introduction

While the viva voce (oral) examination has always been used in content-based educational assessment (Latham 1877, p. 132), the assessment of second language (L2) speaking in performance tests is relatively recent. The impetus for the growth in testing speaking during the 19th and 20th centuries is twofold. Firstly, in educational settings the development of rating scales was driven by the need to improve achievement in public schools, and to communicate that improvement to the outside world. Chadwick (1864, see timeline) implies that the rating scales first devised in the 1830s served two purposes: providing information to the classroom teacher on learner progress for formative use, and generating data for school accountability. From the earliest days, such data were used by parents to select schools for their children in order to ‘maximize the benefit of their investment’ (Chadwick 1858). Secondly, in military settings it was imperative to be able to predict which soldiers were able to undertake tasks in the field without risk to themselves or other personnel (Kaulfers 1944, see timeline). Many of the key developments in speaking test design and rating scales are linked to military needs.

The speaking assessment project is therefore primarily a practical one. The need for speaking tests has expanded from the educational and military domains to decision making for international mobility, entrance to higher education, and employment. But investigating how we make sound decisions based on inferences from speaking test scores remains the central concern of research. A model of speaking test performance is essential in this context, as it helps focus attention on facets of the testing context under investigation. The first such model, developed by Kenyon (1992), was subsequently extended by McNamara (1995), Milanovic & Saville (1996), Skehan (2001), Bachman (2001), and most recently by Fulcher (2003, p. 115), providing a framework within which research might be structured. The latter is reproduced here to indicate the extensive range of factors that have been and continue to be investigated in speaking assessment research, and these are reflected in my selection of themes and associated papers for this timeline.


Figure 1. An expanded model of speaking test performance (Fulcher 2003, p. 115).

[Figure 1 is a box-and-arrow diagram linking: rater characteristics and training; orientation to the rating scale/band descriptors, scoring philosophy and focus; construct definition; local performance conditions; the performance itself; the interlocutor(s); task characteristics (orientation, interactional relationship, goals, interlocutors, topics, situations, difficulty), plus additional task characteristics or conditions as required for specific contexts; test-taker individual variables (e.g. personality), task-specific knowledge or skills, real-time processing capacity, and abilities/capacities on constructs; the score and inferences about the test taker; and decisions and consequences.]

Overviews of the issues illustrated in Figure 1 are discussed in a number of texts devoted to assessing speaking that I have not included in the timeline (Lazaraton 2002; Fulcher 2003; Luoma 2004; Taylor (ed.) 2011). Rather, I have selected publications based on 12 themes that arise from these texts, from Figure 1, and from my analysis of the literature.

Themes that pervade the research literature are rating scale development, construct definition, operationalisation, and validation. Scale development and construct definition are inextricably bound together because it is the rating scale descriptors that define the construct. Yet rating scales are developed in a number of different ways. The data-based approach requires detailed analysis of performance. Others are informed by the views of expert judges using performance samples to describe levels. Some scales are a patchwork quilt created by bundling descriptors from other scales together based on scaled teacher judgments. How we define the speaking construct and how we design the rating scale descriptors are therefore interconnected. Design decisions need to be informed by testing purpose and relevant theoretical frameworks.

Underlying these design decisions are research issues that are extremely contentious. Perhaps these can be presented in a series of binary alternatives to show stark contrasts, although in reality there are clines at work.

Specific purposes tests vs. Generalizability. Should the construct definition and task design be related to specific communicative purposes and domains? Or is it possible to produce test scores that are relevant to any and every type of real-world decision that we may wish to make? This is critical not least because the more generalizable we wish scores to be, the more difficult it becomes to select test content.

Psycholinguistic criteria vs. Sociolinguistic criteria. Closely related to the specific purpose issue is the selection of scoring criteria. Usually, the more abstract or psycholinguistic the criteria used, the greater the claims made for generalizability. These criteria or ‘facilities’ are said to be part of the construct of speaking that is not context dependent. These may be the more traditional constructs of ‘fluency’ or ‘accuracy’, or more basic observable variables related to automaticity of language processing, such as response latency or speed of delivery. The latter are required for the automated assessment of speaking. Yet, as the generalizability claim grows, the relationship between score and any specific language use context is eroded. This particular antithesis is not only a research issue, but one that impacts upon the commercial viability of tests; it is therefore not surprising that from time to time the arguments flare up, and research is called into the service of confirmatory defence (Chun 2006; Downey et al. 2008).
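To make the kind of observable, automaticity-related variables mentioned above concrete, the sketch below computes two of them (response latency and speech rate) from hypothetical word-level timestamps of the sort an automatic speech recogniser might output. It is purely illustrative: the Word class, the timestamp format, and the example values are invented for this sketch and are not taken from any operational scoring engine or from the studies cited here.

```python
# Illustrative sketch only: two automaticity-related measures (response latency
# and speech rate) computed from hypothetical word-level timestamps.
# The data format and example values are assumptions for demonstration.

from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float


def response_latency(prompt_end: float, words: List[Word]) -> float:
    """Seconds between the end of the prompt and the test taker's first word."""
    if not words:
        return float("inf")
    return max(0.0, words[0].start - prompt_end)


def speech_rate(words: List[Word]) -> float:
    """Words per minute over the span from first word onset to last word offset."""
    if not words:
        return 0.0
    duration = words[-1].end - words[0].start
    return 60.0 * len(words) / duration if duration > 0 else 0.0


if __name__ == "__main__":
    response = [Word("the", 2.4, 2.6), Word("library", 2.7, 3.2),
                Word("closes", 3.3, 3.8), Word("at", 3.9, 4.0), Word("nine", 4.1, 4.5)]
    print(f"latency: {response_latency(prompt_end=2.0, words=response):.2f} s")
    print(f"rate: {speech_rate(response):.1f} words/min")
```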

Normal conversation vs. Domain specific interaction. It is widely claimed that the ‘gold standard’ of spoken language is ‘normal’ conversation, loosely defined as interactions in which there are no power differentials, so that all participants have equal speaking rights. Other types of interaction are compared to this ‘norm’, and the validity of test formats such as the interview is brought into question (e.g. Johnson 2001). But we must question whether ‘friends chatting’ is indeed the ‘norm’ in most spoken interaction. In higher education, for example, this kind of talk is very rare, and scores from simulated ‘normal’ conversations are unlikely to be relevant to communication with a professor, accommodation staff, or library assistants. Research that describes the language used in specific communicative contexts to support test design is becoming more common, such as that in academic contexts to underpin task design (Biber 2006).

Rater cognition vs. Performance analysis. It has become increasingly common to look at ‘what raters pay attention to’. When we discover what is going on in their heads, should it be treated as construct irrelevant if it is at odds with the rating scale descriptors and/or an analysis of performance on test tasks? Or should it be used to define the construct and populate the rating scale descriptors? Do all raters bring the same analysis of performance to the task? Or are we merely incorporating variable degrees of perverseness that dilute the construct? The most challenging question is perhaps: are rater perceptions at odds with reality?

Freedom vs. Control. Left to their own devices, raters tend to vary in how they score the same performance. The variability decreases if they are trained, and it decreases over time through the process of social moderation. With repeated practice raters start to interpret performances in the same way as their peers. But when severed from the collective for a period of time, judges begin to reassert their own individuality, and disagreement rises. How do we identify and control this variability? This question now extends to interlocutor behaviour, as we know that interlocutors provide differing levels of scaffolding and support to test takers. This variability may lead to different scores for the same test taker depending on which interlocutor they work with. Much work has been done on the co-construction of speech in test contexts. And here comes the crunch. For some, this variation is part of a richer speaking construct and should therefore be built into the test. For others, the variation removes the principle of equality of experience and opportunity at the moment of testing, and therefore the interlocutors should be controlled in what they say. In face-to-face speaking tests we have seen the growth of the interlocutor frame to control speakers, and proponents of indirect speaking tests claim that the removal of an interlocutor eliminates subjective variation.

Publications selected to illustrate a timeline are inevitably subjective to some degree, and the list cannot be exhaustive. My selection avoids clustering in particular years or decades, and attempts to show how the contrasts and themes identified play out historically. You will notice that themes H and I are different from the others in that they are about particular methodologies. I have included these because of their pervasiveness in speaking assessment research, and because they may help others to identify key discourse or multi-faceted Rasch measurement (MFRM) studies. What I have not been able to cover is the assessment of pronunciation and intonation, or the detailed issues surrounding semi-direct (or simulated) tests of speaking, both of which require separate timelines. Finally, I am very much aware that the assessment of speaking was common in the United Kingdom from the early 20th century. Yet there is sparse reference to research outside the United States in the early part of the timeline. The reason for this is that, apart from Roach (see timeline; reprinted as an appendix in Weir, Vidaković & Galaczi (eds.) 2013), there is very little published research from Europe (Fulcher 2003, p. 1). The requirement that research is in the public domain for independent inspection and critique was a criterion for selection in this timeline. For a retrospective interpretation of the early period in the United Kingdom with reference to unpublished material and confidential internal examination board reports to which we do not have access, see Weir & Milanovic (2003) and Vidaković & Galaczi (2013).

Themes

A. Rating scale development
B. Construct definition and validation
C. Task design and format
D. Specific purposes testing and generalizability
E. Reliability and rater training
F. The native speaker criterion
G. Washback
H. Discourse analysis
I. Multi-faceted Rasch Measurement (MFRM)
J. Interlocutor behaviour and training
K. Rater cognition
L. Test-taker characteristics

References

Bachman, L. F. (2001). Speaking as a realization of communicative competence. Paper presented at the meeting of the American Association of Applied Linguistics, St. Louis, Missouri, February.

Biber, D. (2006). University language: A corpus-based study of spoken and written registers. Amsterdam: John Benjamins.

Chadwick, E. (1858). On the economical, social, educational, and political influences of competitive examinations, as tests of qualifications for admission to the junior appointments in the public service. Journal of the Statistical Society of London 21.1, 18–51.

Chun, C. W. (2006). Commentary: An analysis of a language test for employment: The authenticity of the PhonePass Test. Language Assessment Quarterly 3.3, 295–306.

Downey, R., H. Farhady, R. Present-Thomas, M. Suzuki & A. Van Moere (2008). Evaluation of the usefulness of the Versant for English Test: A response. Language Assessment Quarterly 5.2, 160–167.

Fulcher, G. (2003). Testing second language speaking. Harlow: Longman/Pearson Education.

Johnson, M. (2001). The art of non-conversation: A re-examination of the validity of the Oral Proficiency Interview. New Haven and London: Yale University Press.

Kenyon, D. (1992). Introductory remarks at symposium on development and use of rating scales in language testing. Paper delivered at the 14th Language Testing Research Colloquium, Vancouver, March.

Latham, H. (1877). On the action of examinations considered as a means of selection. Cambridge: Dighton, Bell and Company.

Lazaraton, A. (2002). A qualitative approach to the validation of oral language tests. Cambridge: Cambridge University Press.

Luoma, S. (2004). Assessing second language speaking. Cambridge: Cambridge University Press.

McNamara, T. F. (1995). Modelling performance: Opening Pandora’s Box. Applied Linguistics 16.2, 159–179.

Milanovic, M. & N. Saville (1996). Introduction. In M. Milanovic (ed.), Performance testing, cognition and assessment. Cambridge: Cambridge University Press, 1–17.

Skehan, P. (2001). Tasks and language performance assessment. In M. Bygate, P. Skehan & M. Swain (eds.), Researching pedagogic tasks: Second language learning, teaching and testing. London: Longman, 167–185.

Taylor, L. (ed.) (2011). Examining speaking: Research and practice in assessing second language speaking. Cambridge: Cambridge University Press.

Weir, C. & M. Milanovic (eds.) (2003). Continuity and innovation: Revising the Cambridge Proficiency in English Examination 1913–2002. Cambridge: Cambridge University Press.

Weir, C. J., I. Vidaković & E. D. Galaczi (eds.) (2013). Measured constructs: A history of Cambridge English language examinations 1913–2012. Cambridge: Cambridge University Press.

Vidaković, I. & E. D. Galaczi (2013). The measurement of speaking ability 1913–2012. In C. J. Weir, I. Vidaković & E. D. Galaczi (eds.), Measured constructs: A history of Cambridge English language examinations 1913–2012. Cambridge: Cambridge University Press, 257–346.

Timeline (Year, Reference, Annotation, Theme)

Note: Authors’ names are shown in small capitals when the study referred to appears in this timeline.


1864 (Theme: A)
Reference: Chadwick, E. (1864). Statistics of educational results. Museum: A Quarterly Magazine of Education, Literature and Science 3, 479–484. Also see discussion in: Cadenhead, K. & R. Robinson (1987). Fisher’s ‘Scale Book’: An early attempt at educational measurement. Educational Measurement: Issues and Practice 6.4, 15–18.
Annotation: The earliest record of an attempt to assess L2 speaking dates to the first few years after Rev. George Fisher became Headmaster of the Greenwich Royal Hospital School in 1834. In order to improve and record academic achievement, he instituted a ‘Scale Book’, which recorded performance on a scale of 1 to 5 with quarter intervals. A scale was created for French as a second language, with typical speaking prompts to which boys would be expected to respond at each level. The Scale Book has not survived.



1912 (Themes: A, B)
Reference: Thorndike, E. L. (1912). The measurement of educational products. The School Review 20.5, 289–299.
Annotation: Scales of various kinds were developed by social scientists like Galton and Cattell towards the end of the 19th Century, but it was not until the work of Thorndike in the early 20th Century that the definition of each point on an equal interval scale was revived. With reference to speaking German, he suggested that performance samples should be attached to each level of a scale, along with a descriptor that summarizes the ability being tested.



1920 (Themes: A, B, C, D)
Reference: Yerkes, R. M. (1920). What psychology contributed to the war. In R. M. Yerkes (ed.), The new world of science: Its development during the war. New York, NY: The Century Co, 364–389. Also see discussion in: Fulcher, G. (2012). Scoring performance tests. In G. Fulcher & F. Davidson (eds.), The Routledge handbook of language testing. London and New York: Routledge, 378–392.
Annotation: Yerkes describes the development of the first large-scale speaking test for military purposes in 1917. It was designed to place army recruits into language development battalions. It consisted of a verbal section and a performance section (following instructions), with tasks linked to scale level by difficulty. Although the development of the test is not described, the generic approach is outlined, and involved the identification of typical tasks from the military domain that were piloted in test conditions. It is arguably the case that this was the first English for Specific Purposes (ESP) test based on domain specific criteria. In addition, there was clearly an element of domain analysis to support criterion-referenced assessment.

1944 (Themes: A, B, D)
Reference: Kaulfers, W. V. (1944). War-time developments in modern language achievement tests. The Modern Language Journal 28, 136–150. Also see discussion in: Velleman, B. L. (2008). The ‘scientific linguist’ goes to war: the United States A.S.T. program in foreign languages. Historiographia Linguistica 35, 385–416.
Annotation: The interwar years saw a rapid growth in large-scale assessment that relied on the multiple-choice item for efficiency. In the Second World War Kaulfers quickly realized that these tests could not adequately predict ability to speak in potentially life-threatening contexts. Teaching and assessment of speaking was quickly geared towards the military context once again. Kaulfers presents scoring criteria according to the scope and quality of performance. However, all descriptors are generic and not domain specific.

1945 (Theme: E)
Reference: Roach, J. O. (1945). Some problems of oral examinations in modern languages. An experimental approach based on the Cambridge examinations in English for Foreign Students. University of Cambridge Examinations Syndicate: Internal report circulated to oral examiners and local representatives for these examinations. (Reprinted as facsimile in Weir et al. 2013.)
Annotation: Roach was among the first to investigate rater reliability in speaking tests. He was concerned primarily with maintaining ‘standards’, by which he meant that examiners would agree on which test takers were awarded a pass, a good pass, and a very good pass, on the Certificate of Proficiency in English. He was the first to recommend what we now call ‘social moderation’ (see MISLEVY 1992) – familiarization with the system through team work, which results in agreement evolving over time.

1952/1958 (Themes: A, B, C, D, F)
Reference: Foreign Service Institute (1952/1958). FSI Proficiency Ratings. Washington D.C.: Foreign Service Institute. Also see discussion in: Sollenberger, H. E. (1978). Development and current use of the FSI oral interview test. In J. L. D. Clark (ed.), Direct testing of speaking proficiency: Theory and application. Princeton, NJ: Educational Testing Service, 1–12.
Annotation: Little progress was made in testing L2 speaking until the outbreak of the Korean War in 1950. The Foreign Service Institute (FSI) was established, and the first widely used semantic-differential rating scale put into use in 1952. This operationalized the ‘native speaker’ construct at the top band (level six). With the Vietnam war on the horizon, a decision was taken to register the language skills of US diplomatic and military personnel. Work began to expand the FSI scale by adding verbal descriptors at each of the six levels from zero proficiency to native speaker, and to include multiple holistic traits. This went hand in hand with the creation of the Oral Proficiency Interview (OPI), which was a mix of interview, prepared dialogue, and simulation. The wording of the 1958 FSI scale and the tasks associated with the OPI have been copied into many other testing systems still in use.


1967 (Themes: E, G)
Reference: Carroll, J. B. (1967). The foreign language attainments of language majors in the senior year: A survey conducted in US colleges and universities. Foreign Language Annals 1.2, 131–151.
Annotation: Despite little validation evidence the FSI and Interagency Language Roundtable (ILR) approach became popular in education because of its face validity, inter-rater reliability through social moderation, and perceived coherence with new communicative teaching methods. Carroll showed that the military system was not sensitive to language acquisition in an educational context, and hence was demotivating. It would be over a decade before this research had an impact on policy.


1979
Reference: Strength Through Wisdom: A critique of U.S. capability. A report to the President from the President's Commission on Foreign Language and International Studies. Washington DC: US Government Printing Office.
Annotation: Further impetus to extend speaking assessment in educational settings came from a report submitted to President Carter on shortcomings in the US military because of lack of foreign language skills. It is not coincidental that in the same year attention was drawn to the study published by CARROLL (1967). The American Council on the Teaching of Foreign Languages (ACTFL) was given the task of revising the FSI/ILR scales for wider use.


1979 (Themes: A, C, E, G)
Reference: Adams, M. L. & J. R. Frith (1979). Testing kit: French and Spanish. Washington DC: Department of State and the Foreign Service Institute.
Annotation: As part of the ACTFL research into new rating scales, the first testing kits were developed for training and assessment purposes in US colleges. The articles and resources in Adams & Frith provided a comprehensive guide for raters of the OPI for educational purposes.

1980 (Theme: B)
Reference: Adams, M. L. (1980). Five co-occurring factors in speaking proficiency. In J. R. Frith (ed.), Measuring spoken language proficiency. Washington DC: Georgetown University Press, 1–6.
Annotation: Adams conducted the first structural validation study designed to investigate which of the five FSI subscales discriminated between learners at each proficiency level. The study was not theoretically motivated, and no patterns could be discerned in the data.


1980 (Theme: C)
Reference: Reves, T. (1980). The group-oral test: An experiment. English Teachers Journal 24, 19–21.
Annotation: Reves questioned whether the OPI could generate ‘real-life conversation’ and began experimenting with group tasks to generate richer speaking samples.


1981 (Theme: B)
Reference: Bachman, L. F. & A. S. Palmer (1981). The construct validity of the FSI oral interview. Language Learning 31.1, 67–86.
Annotation: The first construct validation studies were carried out in the early 1980s, using the multitrait-multimethod technique and confirmatory factor analysis. These demonstrated that the FSI OPI loaded most heavily on the speaking trait, and lowest of all methods on the method trait. These studies concluded that there was significant convergent and divergent evidence for construct validity in the OPI.

1983 (Themes: A, C, D)
Reference: Lowe, P. (1983). The ILR oral interview: origins, applications, pitfalls, and implications. Die Unterrichtspraxis 16, 230–244.
Annotation: In the 1960s the FSI approach to assessing speaking was adopted by the Defense Language Institute, the Central Intelligence Agency, and the Peace Corps. In 1968 the various adaptations were standardized as the Interagency Language Roundtable (ILR), which is still the accepted tool for the certification of L2 speaking proficiency throughout the United States military, intelligence and diplomatic services (http://www.govtilr.org/). Via the Peace Corps it spread to academia, and the assessment of speaking proficiency worldwide. It also provides the basis for the current NATO language standards, known as STANAG 6001.

1984 (Themes: A, B)
Reference: Liskin-Gasparro, J. E. (1984). The ACTFL Proficiency Guidelines: Gateway to testing and curriculum. Foreign Language Annals 17.5, 475–489.
Annotation: Following the publication of Strength Through Wisdom (see 1979, above) and the concerns raised by CARROLL (1967), the ACTFL Guidelines were developed throughout the 80s, with preliminary publications in 1982, and the final Guidelines issued in 1986 (revised 1999). Levels from 0 to 5 were broken down into subsections, with finer gradations at lower proficiency levels. Level descriptors provided longer prose definitions of what could be done at each level. New constructs were introduced at each level, drawing on new theoretical models of communicative competence of the time, particularly those of Canale & Swain (1980). These included discourse competence, interaction, and communicative strategies.
Footnote: Canale, M. & M. Swain (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics 1.1, 1–47.

1985 (Themes: A, B)
Reference: Lantolf, J. P. & W. Frawley (1985). Oral proficiency testing: A critical analysis. The Modern Language Journal 69.4, 337–345.
Annotation: Lantolf & Frawley were among the first to question the ACTFL approach. They claimed the scales were ‘analytical’ rather than ‘empirical’, depending on their own internal logic of non-contradiction between levels. The claim that the descriptors bear no relationship to how language is acquired or used set off a whole chain of research into scale analysis and development.


1986 (Theme: B)
Reference: Kramsch, C. J. (1986). From language proficiency to interactional competence. The Modern Language Journal 70.4, 366–372.
Annotation: Kramsch’s research into interactional competence spurred further research into task types that might elicit interaction, and the construction of ‘interaction’ descriptors for rating scales. This research had a particular impact on future discourse related studies by HE & YOUNG (1998).


1986 (Themes: B, D, F)
Reference: Bachman, L. F. & S. Savignon (1986). The evaluation of communicative language proficiency: a critique of the ACTFL Oral Interview. The Modern Language Journal 79, 380–390.
Annotation: This very influential paper questioned the use of the native speaker to define the top level of a rating scale, and the notion of zero proficiency at the bottom. Secondly, the researchers questioned reference to context within scales as confounding constructs with test method facets, unless the test is for a defined ESP setting. This paper therefore set the agenda for debates around score generalizability, which we still wrestle with today.


1987 (Themes: A, B, H)
Reference: Fulcher, G. (1987). Tests of oral performance: The need for data-based criteria. English Language Teaching Journal 41.4, 287–291.
Annotation: Using discourse analysis of native speaker interaction, this paper provided the first evidence that rating scales did not describe what typically happened in naturally occurring speech, and advocated a data-based approach to writing descriptors and constructing scales. This was the first use of discourse analysis to understand under-specification in rating scale descriptors, and was expanded into a larger research agenda (see FULCHER 1996).


1989 (Themes: B, H)
Reference: Van Lier, L. (1989). Reeling, writhing, drawling, stretching, and fainting in coils: Oral proficiency interviews as conversation. TESOL Quarterly 23.3, 489–508.
Annotation: In another discourse analysis study, Van Lier showed that interview language was not like ‘normal conversation’. Although the work of finding formats that encouraged ‘conversation’ had started with REVES (1980) and colleagues in Israel, this paper encouraged wider research in the area.


1991 (Themes: E, I)
Reference: Linacre, J. M. (1991). FACETS computer programme for Many-faceted Rasch Measurement. Chicago, IL: Mesa Press.
Annotation: Rater variation had been a concern since the work of ROACH (1945) during the war, but only with the publication of Linacre’s FACETS did it become possible to model rater harshness/leniency in relation to task difficulty and learner ability. MFRM remains the standard tool for studying rater behaviour and test facets today, as in the studies by LUMLEY & MCNAMARA (1995) and BONK & OCKEY (2003).


1991 (Theme: A)
Reference: Alderson, J. C. (1991). Bands and scores. In J. C. Alderson & B. North (eds.), Language testing in the 1990s. London: Modern English Publications and the British Council, 71–86.
Annotation: Based on research driving the IELTS revision project, Alderson categorized rating scales as user-oriented, rater-oriented, and constructor-oriented. These categories have been useful in guiding descriptor content with audience in mind.

1992 (Themes: B, C, H, L)
Reference: Young, R. & M. Milanovic (1992). Discourse variation in oral proficiency interviews. Studies in Second Language Acquisition 14.4, 403–424.
Annotation: An early and significant use of discourse analysis to characterize the interaction of test takers with interviewers in the First Certificate Test of English. Discourse structure was demonstrated to be related to examiner, task and gender variables.


1992 (Themes: A, B, D)
Reference: Douglas, D. & L. Selinker (1992). Analyzing Oral Proficiency Test performance in general and specific purpose contexts. System 20.3, 317–328.
Annotation: Douglas & Selinker show that a discipline specific test (chemistry) is a better predictor of domain specific performance than a general speaking test. In this, and a series of publications on ESP testing, they show that reducing generalizability by introducing context increases score usefulness. This is the other side of the coin to BACHMAN & SAVIGNON’S (1986) generalizability argument.

1992 (Themes: B, C, H, J)
Reference: Ross, S. & R. Berwick (1992). The discourse of accommodation in oral proficiency interviews. Studies in Second Language Acquisition 14.1, 159–176.
Annotation: Reacting to critiques of the OPI from VAN LIER (1989), LANTOLF & FRAWLEY (1985), and others, Ross & Berwick undertook discourse analysis of OPIs to study how interviewers accommodated to the discourse of candidates. They concluded that the OPI had features of both interview and conversation. However, the study also raised the question of how interlocutor variation might result in test takers being treated differentially. This sparked a chain of similar research by scholars such as LAZARATON (1996).

1992 (Theme: E)
Reference: Mislevy, R. J. (1992). Linking educational assessments: Concepts, issues, methods and prospects. Princeton, NJ: Educational Testing Service.
Annotation: LOWE (1983) and others had argued that the meaning of descriptors was socially acquired. In this publication the term ‘social moderation’ was formalized. NORTH & SCHNEIDER (1998) and the Council of Europe have taken this concept and made it central to the project of using the Common European Framework of Reference (CEFR) scales as a European-wide lens for viewing speaking proficiency.

1995 (Themes: A, B, E)
Reference: Chalhoub-Deville, M. (1995). Deriving oral assessment scales across different tests and rater groups. Language Testing 12.1, 16–33.
Annotation: Chalhoub-Deville investigated the inter-relationship of diverse tasks and raters using multidimensional scaling to identify the components of speaking proficiency that were being assessed. She found that these varied by task and rater group, and therefore called for the construct to be defined anew for each task/rater combination. The issue at stake is whether the construct has any psychological reality independent of context specific performances.

1995 (Themes: E, I)
Reference: Lumley, T. & T. McNamara (1995). Rater characteristics and rater bias: implications for training. Language Testing 12.10, 54–71.
Annotation: Rater variability is studied across time using FACETS, showing that there is considerable variation in harshness irrespective of training. The researchers question the use of single ratings in high-stakes speaking tests, and recommend the use of rater calibrations to provide training feedback or adjust scores.


1995 (Themes: A, B, C, D, K)
Reference: Upshur, J. & C. Turner (1995). Constructing rating scales for second language tests. English Language Teaching Journal 49.1, 3–12.
Annotation: Upshur & Turner introduce Empirically-derived Binary-choice Boundary-definition scales (EBB). These address the long-standing concern over a-priori scale development outlined by LANTOLF & FRAWLEY (1985), and start to tie decisions to specific examples of performance as recommended by FULCHER (1987). The scales are task specific rather than generic. The methodology has had a specific impact on later studies such as POONPON (2010).

1996 (Themes: A, B, C, D)
Reference: McNamara, T. (1996). Measuring second language performance. Harlow: Longman.
Annotation: McNamara described the development of the Occupational English Test (OET) for health professionals. This is a specific purpose test with a clearly specified audience, and scores from this instrument are shown to be more reliable and valid for decision making than generic English tests.


1996 (Themes: C, G)
Reference: Fulcher, G. (1996). Testing tasks: Issues in task design and the group oral. Language Testing 13.1, 23–51.
Annotation: Building on REVES (1980) and others, this study compared a group oral (3 participants) and two interview-type tasks. Discourse was more varied in the group task, and participants reported a preference for working in a group with other test-takers.

1996 (Themes: A, B, C, D, H)
Reference: Fulcher, G. (1996). Does thick description lead to smart tests? A data-based approach to rating scale construction. Language Testing 13.2, 208–238.
Annotation: Based on work conducted since FULCHER (1987), this paper describes the research underpinning the design of data-based rating scales. The methodology employs discourse analysis of speech samples to produce scale descriptors. The use of the resulting scale is compared with generic a-priori scales. Using discriminant analysis, the data-based scores are found to be more reliable, and using MFRM, rater variation is significantly decreased. The data-based approach therefore solves the problems identified by researchers like LUMLEY & MCNAMARA (1995). The study also generated the Fluency Rating Scale descriptors, which were used as anchor items in the CEFR project.

1996 (Themes: B, H, J)
Reference: Lazaraton, A. (1996). Interlocutor support in oral proficiency interviews: The case of CASE. Language Testing 13.2, 151–172.
Annotation: In the ROSS & BERWICK (1992) tradition, and inspired by VAN LIER (1989), Lazaraton identifies 8 kinds of support provided by a rater/interlocutor in an OPI. She concludes that the variation is problematic, and calls for additional rater training and possibly the use of an ‘interlocutor support scale’ as part of the rating procedure.


1996 (Themes: B, K)
Reference: Pollitt, A. & N. L. Murray (1996). What raters really pay attention to. In M. Milanovic & N. Saville (eds.), Performance testing, cognition and assessment. Selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem. Studies in Language Testing 3. Cambridge: Cambridge University Press, 74–91.
Annotation: Pollitt & Murray use two innovative techniques to investigate how raters use rating scales, and what they pay attention to when rating spoken performances. The research showed that raters bring their own conceptual baggage to the rating process, but also that they used constructs such as discourse, sociolinguistic, and grammatical competence, as well as fluency and ‘naturalness’.


1997 (Theme: B)
Reference: McNamara, T. (1997). Modelling performance: Opening Pandora’s Box. Applied Linguistics 18.4, 446–465.
Annotation: Speaking had generally been characterized in cognitive terms as traits resident in the speaker being assessed. Building on the work of KRAMSCH (1986) and others, McNamara showed that interaction implied the co-construction of speech, and argued that in social contexts there was shared responsibility for performance. The question of shared responsibility and the role of the interlocutor have since become active areas of research.

1998 (Themes: B, C, H)
Reference: Young, R. & A. W. He (eds.) (1998). Talking and testing: Discourse approaches to the assessment of oral proficiency. Amsterdam: John Benjamins.
Annotation: An important collection of research papers analysing the discourse of test-taker speech in speaking tests. The speaking test is characterized as an ‘interactive practice’ co-constructed by the participants.

1998 (Themes: A, I)
Reference: North, B. & G. Schneider (1998). Scaling descriptors for language proficiency scales. Language Testing 15.2, 217–262.
Annotation: North & Schneider describe the measurement-driven approach to scale development as embodied in the CEFR. Descriptors from existing speaking scales are extracted from context and scaled using MFRM and teacher judgments as data.


1999 (Themes: B, K)
Reference: Jacoby, S. & T. McNamara (1999). Locating competence. English for Specific Purposes 18.3, 213–241.
Annotation: In two studies, Jacoby & McNamara discovered that the linguistic criteria used by applied linguists to rate speaking performance did not capture the kind of communication valued by subject specialists. They recommended studying ‘indigenous criteria’ to expand what is valued in performances. This work has impacted on domain specific studies, such as FULCHER ET AL. (2011). It also raises serious questions about psycholinguistic approaches such as those advocated by VAN MOERE (2012).

2002 (Themes: B, C, H)
Reference: Young, R. (2002). Discourse approaches to oral language assessment. Annual Review of Applied Linguistics 22, 243–262.
Annotation: A careful investigation of the ‘layers’ of discourse in naturally occurring speech and test tasks. This is combined with a review of various approaches to testing speaking, with an indication of which test formats are likely to elicit the most useful speech samples for rating.


2002 (Themes: B, H)
Reference: O’Sullivan, B., C. J. Weir & N. Saville (2002). Using observation checklists to validate speaking-test tasks. Language Testing 19.1, 33–56.
Annotation: A methodological study to compare the ‘informational and interactional functions’ produced on speaking test tasks with those the test designer intended to elicit. The instrument proved to be unwieldy and impractical, but the study established the important principle for examination boards that evidence of congruence between intention and reality is an important aspect of construct validation.

2003 (Themes: B, H, I, J)
Reference: Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency. Language Testing 20.1, 1–25.
Annotation: A much quoted study into variation in the speech of the same test taker with two different interlocutors. Building on ROSS & BERWICK (1992), LAZARATON (1996) and MCNAMARA (1996), Brown demonstrated that scores also varied, although not by as much as one may have expected. The paper raises the critical issue of whether variation should be allowed because it is part of the construct, or controlled because it leads to inequality of opportunity.

2003 (Themes: B, C, H)
Reference: Fulcher, G. & R. Marquez-Reiter (2003). Task difficulty in speaking tests. Language Testing 20.3, 321–344.
Annotation: An investigation into the effects of task features (social power and level of imposition) and L1 cultural background on task difficulty and score variation. As in BROWN (2003), it was discovered that although significant variation occurred when extreme conditions were used, effect sizes were not substantial.


2003 (Themes: B, E, I)
Reference: Bonk, W. J. & G. J. Ockey (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing 20.1, 89–110.
Annotation: Using FACETS, the researchers investigated variability due to test taker, prompt, rater, and rating categories. Test taker ability was the largest facet. Although there was evidence of rater variability, this did not threaten validity, and indicated that raters became more stable in their judgments over time. This adds to the evidence that socialization over time has an impact on rater behaviour.

2005 (Themes: B, C, K)
Reference: Cumming, A., L. Grant, P. Mulcahy-Ernt & D. E. Powers (2005). A teacher-verification study of speaking and writing prototype tasks for a new TOEFL Test. TOEFL Monograph No. MS-26. Princeton, NJ: Educational Testing Service.
Annotation: An important prototyping study. Pre-operational tasks are shown to experts, who judge whether they represent the kinds of tasks that students would undertake at university. They are also presented with their own students’ responses to the tasks and asked whether these are ‘typical’ of their work. The study shows that test development is a research-led activity, and not merely a technical task. Design decisions and the evidence for those decisions are part of a validation narrative.


2007 (Themes: B, C, L)
Reference: Berry, V. (2007). Personality differences and oral test performance. Frankfurt: Peter Lang.
Annotation: Based on many years of research into personality and speaking test performance, Berry shows that levels of introversion and extroversion impact on contributions to conversation in paired and group formats, and result in differential score levels when ability is controlled for.

2008 (Themes: B, C, H)
Reference: Galaczi, E. D. (2008). Peer-peer interaction in a speaking test: The case of the First Certificate in English examination. Language Assessment Quarterly 5.2, 89–119.
Annotation: Galaczi presents a discourse analytic study of the paired test format, in which two candidates are required to converse with each other, as well as the examiner/interlocutor. The research identified three interactive patterns in the data: ‘collaborative’, ‘parallel’ and ‘asymmetric’. Tentative evidence is also presented to suggest that there is a relationship between these interactional patterns and scores on an ‘Interactive Communication’ rating scale.


2009 (Themes: B, C, L)
Reference: Ockey, G. (2009). The effects of group members’ personalities on a test taker’s L2 group oral discussion test scores. Language Testing 26.2, 161–186.
Annotation: Building on BERRY (2007), Ockey investigates the effect of levels of ‘assertiveness’ on speaking scores in a group oral test, using MANCOVA analyses. Assertive students are found to have lower scores when placed in all-assertive groups, and higher scores when placed with less assertive participants. The scores of non-assertive students did not change depending on group makeup. The results differ from BERRY, indicating that much more research is needed in this area.

2010 (Themes: A, B, H, K)
Reference: Poonpon, K. (2010). Expanding a second language speaking rating scale for instructional assessment purposes. Spaan Fellow Working Papers in Second or Foreign Language Assessment 8, 69–94.
Annotation: A study that brings together the EBB approach of UPSHUR & TURNER (1995) with the data-based approach of FULCHER (1996) to create a rich data-based EBB for use with TOEFL iBT tasks. In the process, the nature of the academic speaking construct is further explored and defined.


2011 (Themes: A, B, H)
Reference: Fulcher, G., F. Davidson & J. Kemp (2011). Effective rating scale development for speaking tests: Performance Decision Trees. Language Testing 28.1, 5–29.
Annotation: Like POONPON (2010), this study brings together UPSHUR & TURNER’s (1995) EBB and FULCHER’s (1996) data-based approach in the context of service encounters. It also incorporates indigenous insights following JACOBY & MCNAMARA (1999). It describes interaction in service encounters through a performance decision tree that focuses rater attention on observable criteria related to discourse and pragmatic constructs.

2011 (Themes: A, B, C)
Reference: Frost, K., C. Elder & G. Wigglesworth (2011). Investigating the validity of an integrated listening-speaking task: A discourse-based analysis of test takers’ oral performances. Language Testing 29.3, 345–369.
Annotation: Integrated task types have become widely used since their incorporation into TOEFL iBT. However, little research has been carried out into the use of source material in spoken responses, or how the integrated skill can be described in rating scale descriptors. The ‘integration’ remains elusive. In this study a discourse approach is adopted following ideas in DOUGLAS & SELINKER (1992) and FULCHER (1996) to define content related aspects of validity in integrated task types. The study provides evidence for the usefulness of integrated tasks in broadening construct definition.


2011 (Themes: B, C, K)
Reference: May, L. (2011). Interactional competence in a paired speaking test: Features salient to raters. Language Assessment Quarterly 8.2, 127–145.
Annotation: Following KRAMSCH (1986), MCNAMARA (1997) and YOUNG (2002), May problematizes the notion of the speaking construct in a paired speaking test. However, she attempts to deal with the problem of how to award scores to individuals by looking at how raters focus on features of the speech of individual participants. The three categories of interpretation (understanding the interlocutor’s message, responding appropriately, and using communicative strategies) are not as important as the attempt to disentangle the individual from the event, while recognizing that discourse is co-constructed.

2011 (Themes: B, H)
Reference: Nakatsuhara, F. (2011). Effects of test-taker characteristics and the number of participants in group oral tests. Language Testing 28.4, 483–508.
Annotation: Building on BONK & OCKEY (2003) and other research into the group speaking test, Nakatsuhara used conversation analysis to investigate group size in relation to proficiency level and personality type. She discovered that more proficient extroverts talked more and initiated more topics when in groups of 4 than in groups of 3. However, proficiency level resulted in more variation in groups of 3. With reference to GALACZI (2008), she concludes that groups of 3 are more collaborative.


2012 (Themes: B, C)
Reference: Van Moere, A. (2012). A psycholinguistic approach to oral language assessment. Language Testing 29.1, 325–344.
Annotation: Very much against the trend, Van Moere makes a case for a return to assessing psycholinguistic speech ‘facilitators’ related to processing automaticity. These include response latency, speed of speech, length of pauses, and the reproduction of syntactically accurate sequences, with appropriate pronunciation, intonation and stress. Task types are sentence repetition and sentence building. This approach is driven by an a-priori decision to use an automated scoring engine to rate speech samples. The validation argument stresses the objective nature of the decisions, compared with the unreliable and frequently irrelevant judgments of human raters. This is an exercise in reductionism par excellence, and is likely to reignite the debate on prediction to domain performance from ‘atomistic’ features that last raged in the early communicative language testing era.

2012 (Themes: E, J)
Reference: Tan, J., B. Mak & P. Zhou (2012). Confidence scoring of speaking performance: How does fuzziness become exact? Language Testing 29.1, 43–65.
Annotation: This paper applies fuzzy logic to our understanding of how raters score performances. This approach takes into account both rater decisions and the levels of uncertainty in arriving at those decisions.

2014 (Themes: C, H)
Reference: Nitta, R. & F. Nakatsuhara (2014). A multifaceted approach to investigating pre-task planning effects on paired oral performance. Language Testing 31.2, 147–175.
Annotation: Nitta & Nakatsuhara investigate providing test-takers with planning time prior to undertaking a paired speaking test. The unexpected findings are that planning time results in stilted prepared output, and reduced interaction between speakers.

Acknowledgements

I would like to thank Dr. Gary Ockey of Educational Testing Service for reviewing my first draft and providing valuable critical feedback. My thanks are also due to the three reviewers, whose very constructive criticism has considerably improved the coverage and coherence of the timeline. Finally, I thank the editor of Language Teaching for timely guidance and advice.
