
Research timeline: Assessing second language speaking

Glenn Fulcher

University of Leicester, United Kingdom

School of Education

Biodata: Glenn Fulcher is Professor of Education and Language Assessment at the University of Leicester, and Head of the School of Education. He has published widely in the field of language testing, from journals such as Language Testing, Language Assessment Quarterly, Applied Linguistics and System, to monographs and edited volumes. His books include Testing second language speaking (Longman 2003), Language testing and assessment: An advanced resource book (Routledge 2007), Practical language testing (Hodder 2010), and The Routledge handbook of language testing (Routledge 2012). He currently co-edits the Sage journal Language Testing.

Introduction

While the viva voce (oral) examination has always been used in content-based educational assessment (Latham 1877, p. 132), the assessment of second language (L2) speaking in performance tests is relatively recent. The impetus for the growth in testing speaking during the 19th and 20th centuries is twofold. Firstly, in educational settings the development of rating scales was driven by the need to improve achievement in public schools, and to communicate that improvement to the outside world. Chadwick (1864, see timeline) implies that the rating scales first devised in the 1830s served two purposes: providing information to the classroom teacher on learner progress for formative use, and generating data for school accountability. From the earliest days, such data were used by parents to select schools for their children in order to ‘maximize the benefit of their investment’ (Chadwick 1858). Secondly, in military settings it was imperative to be able to predict which soldiers were able to undertake tasks in the field without risk to themselves or other personnel (Kaulfers 1944, see timeline). Many of the key developments in speaking test design and rating scales are linked to military needs.

The speaking assessment project is therefore primarily a practical one. The need for speaking tests has expanded from the educational and military domains to decision making for international mobility, entrance to higher education, and employment. But investigating how we make sound decisions based on inferences from speaking test scores remains the central concern of research. A model of speaking test performance is essential in this context, as it helps focus attention on facets of the testing context under investigation. The first such model, developed by Kenyon (1992), was subsequently extended by McNamara (1995), Milanovic & Saville (1996), Skehan (2001), Bachman (2001), and most recently by Fulcher (2003, p. 115), providing a framework within which research might be structured. The latter is reproduced here to indicate the extensive range of factors that have been and continue to be investigated in speaking assessment research, and these are reflected in my selection of themes and associated papers for this timeline.


Figure 1. An expanded model of speaking test performance (Fulcher 2003, p. 115).

[Figure 1 is a box-and-arrow diagram linking: rater characteristics and training; orientation to the rating scale/band descriptors, scoring philosophy and focus; construct definition; local performance conditions; the performance itself; the interlocutor(s); task characteristics (orientation, interactional relationship, goals, interlocutors, topics, situations, difficulty), plus additional task characteristics or conditions as required for specific contexts; test-taker individual variables (e.g. personality), task-specific knowledge or skills, real-time processing capacity, and abilities/capacities on constructs; the score and inferences about the test taker; and decisions and consequences.]

Overviews of the issues illustrated in Figure 1 are discussed in a number of texts devoted to assessing speaking that I have not included in the timeline (Lazaraton 2002; Fulcher 2003; Luoma 2004; Taylor (ed.) 2011). Rather, I have selected publications based on 12 themes that arise from these texts, from Figure 1, and from my analysis of the literature.

Themes that pervade the research literature are rating scale development, construct definition, operationalisation, and validation. Scale development and construct definition are inextricably bound together because it is the rating scale descriptors that define the construct. Yet rating scales are developed in a number of different ways. The data-based approach requires detailed analysis of performance. Others are informed by the views of expert judges using performance samples to describe levels. Some scales are a patchwork quilt created by bundling descriptors from other scales together based on scaled teacher judgments. How we define the speaking construct and how we design the rating scale descriptors are therefore interconnected. Design decisions need to be informed by testing purpose and relevant theoretical frameworks.

Underlying these design decisions are research issues that are extremely contentious. Perhaps these can be presented in a series of binary alternatives to show stark contrasts, although in reality there are clines at work.

Specific purposes tests vs. Generalizability. Should the construct definition and task design be related to specific communicative purposes and domains? Or is it possible to produce test scores that are relevant to any and every type of real-world decision that we may wish to make? This is critical not least because the more generalizable we wish scores to be, the more difficult it becomes to select test content.

Psycholinguistic criteria vs. Sociolinguistic criteria. Closely related to the specific purpose issue is the selection of scoring criteria. Usually, the more abstract or psycholinguistic the criteria used, the greater the claims made for generalizability. These criteria or ‘facilities’ are said to be part of the construct of speaking that is not context dependent. These may be the more traditional constructs of ‘fluency’ or ‘accuracy’, or more basic observable variables related to automaticity of language processing, such as response latency or speed of delivery. The latter are required for the automated assessment of speaking. Yet, as the generalizability claim grows, the relationship between score and any specific language use context is eroded. This particular antithesis is not only a research issue, but one that impacts upon the commercial viability of tests; it is therefore not surprising that from time to time the arguments flare up, and research is called into the service of confirmatory defence (Chun 2006; Downey et al. 2008).
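To make the kind of observable, automaticity-related variables mentioned above concrete, the sketch below computes two of them (response latency and speech rate) from hypothetical word-level timestamps of the sort an automatic speech recogniser might output. It is purely illustrative: the Word class, the timestamp format, and the example values are invented for this sketch and are not taken from any operational scoring engine or from the studies cited here.

```python
# Illustrative sketch only: two automaticity-related measures (response latency
# and speech rate) computed from hypothetical word-level timestamps.
# The data format and example values are assumptions for demonstration.

from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float


def response_latency(prompt_end: float, words: List[Word]) -> float:
    """Seconds between the end of the prompt and the test taker's first word."""
    if not words:
        return float("inf")
    return max(0.0, words[0].start - prompt_end)


def speech_rate(words: List[Word]) -> float:
    """Words per minute over the span from first word onset to last word offset."""
    if not words:
        return 0.0
    duration = words[-1].end - words[0].start
    return 60.0 * len(words) / duration if duration > 0 else 0.0


if __name__ == "__main__":
    response = [Word("the", 2.4, 2.6), Word("library", 2.7, 3.2),
                Word("closes", 3.3, 3.8), Word("at", 3.9, 4.0), Word("nine", 4.1, 4.5)]
    print(f"latency: {response_latency(prompt_end=2.0, words=response):.2f} s")
    print(f"rate: {speech_rate(response):.1f} words/min")
```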

Normal conversation vs. Domain specific interaction. It is widely claimed that the ‘gold standard’ of spoken language is ‘normal’ conversation, loosely defined as interactions in which there are no power differentials, so that all participants have equal speaking rights. Other types of interaction are compared to this ‘norm’, and the validity of test formats such as the interview is brought into question (e.g. Johnson 2001). But we must question whether ‘friends chatting’ is indeed the ‘norm’ in most spoken interaction. In higher education, for example, this kind of talk is very rare, and scores from simulated ‘normal’ conversations are unlikely to be relevant to communication with a professor, accommodation staff, or library assistants. Research that describes the language used in specific communicative contexts to support test design is becoming more common, such as that in academic contexts to underpin task design (Biber 2006).

Rater cognition vs. Performance analysis. It has become increasingly common to look at ‘what raters pay attention to’. When we discover what is going on in their heads, should it be treated as construct irrelevant if it is at odds with the rating scale descriptors and/or an analysis of performance on test tasks? Or should it be used to define the construct and populate the rating scale descriptors? Do all raters bring the same analysis of performance to the task? Or are we merely incorporating variable degrees of perverseness that dilute the construct? The most challenging question is perhaps: are rater perceptions at odds with reality?

Freedom vs. Control. Left to their own devices, raters tend to vary in how they score the same performance. The variability decreases if they are trained, and it decreases over time through the process of social moderation. With repeated practice raters start to interpret performances in the same way as their peers. But when severed from the collective for a period of time, judges begin to reassert their own individuality, and disagreement rises. How do we identify and control this variability? This question now extends to interlocutor behaviour, as we know that interlocutors provide differing levels of scaffolding and support to test takers. This variability may lead to different scores for the same test taker depending on which interlocutor they work with. Much work has been done on the co-construction of speech in test contexts. And here comes the crunch. For some, this variation is part of a richer speaking construct and should therefore be built into the test. For others, the variation removes the principle of equality of experience and opportunity at the moment of testing, and therefore the interlocutors should be controlled in what they say. In face-to-face speaking tests we have seen the growth of the interlocutor frame to control speakers, and proponents of indirect speaking tests claim that the removal of an interlocutor eliminates subjective variation.

Publications selected to illustrate a timeline are inevitably subjective to some degree, and the list cannot be exhaustive. My selection avoids clustering in particular years or decades, and attempts to show how the contrasts and themes identified play out historically. You will notice that themes H and I are different from the others in that they are about particular methodologies. I have included these because of their pervasiveness in speaking assessment research, and because they may help others to identify key discourse or multi-faceted Rasch measurement (MFRM) studies. What I have not been able to cover is the assessment of pronunciation and intonation, or the detailed issues surrounding semi-direct (or simulated) tests of speaking, both of which require separate timelines. Finally, I am very much aware that the assessment of speaking was common in the United Kingdom from the early 20th century. Yet there is sparse reference to research outside the United States in the early part of the timeline. The reason for this is that, apart from Roach (see timeline; reprinted as an appendix in Weir, Vidaković & Galaczi (eds.) 2013), there is very little published research from Europe (Fulcher 2003, p. 1). The requirement that research is in the public domain for independent inspection and critique was a criterion for selection in this timeline. For a retrospective interpretation of the early period in the United Kingdom with reference to unpublished material and confidential internal examination board reports to which we do not have access, see Weir & Milanovic (2003) and Vidaković & Galaczi (2013).

Themes

A. Rating scale development
B. Construct definition and validation
C. Task design and format
D. Specific purposes testing and generalizability
E. Reliability and rater training
F. The native speaker criterion
G. Washback
H. Discourse analysis
I. Multi-faceted Rasch Measurement (MFRM)
J. Interlocutor behaviour and training
K. Rater cognition
L. Test-taker characteristics

References

Bachman, L. F. (2001). Speaking as a realization of communicative competence. Paper presented at the meeting of the American Association of Applied Linguistics, St. Louis, Missouri, February.

Biber, D. (2006). University language: A corpus-based study of spoken and written registers. Amsterdam: John Benjamins.

Chadwick, E. (1858). On the economical, social, educational, and political influences of competitive examinations, as tests of qualifications for admission to the junior appointments in the public service. Journal of the Statistical Society of London 21.1, 18–51.

Chun, C. W. (2006). Commentary: An analysis of a language test for employment: The authenticity of the PhonePass Test. Language Assessment Quarterly 3.3, 295–306.

Downey, R., H. Farhady, R. Present-Thomas, M. Suzuki & A. Van Moere (2008). Evaluation of the usefulness of the Versant for English Test: A response. Language Assessment Quarterly 5.2, 160–167.

Fulcher, G. (2003). Testing second language speaking. Harlow: Longman/Pearson Education.

Johnson, M. (2001). The art of non-conversation: A re-examination of the validity of the Oral Proficiency Interview. New Haven and London: Yale University Press.

Kenyon, D. (1992). Introductory remarks at symposium on development and use of rating scales in language testing. Paper delivered at the 14th Language Testing Research Colloquium, Vancouver, March.

Latham, H. (1877). On the action of examinations considered as a means of selection. Cambridge: Dighton, Bell and Company.

Lazaraton, A. (2002). A qualitative approach to the validation of oral language tests. Cambridge: Cambridge University Press.

Luoma, S. (2004). Assessing second language speaking. Cambridge: Cambridge University Press.

McNamara, T. F. (1995). Modelling performance: Opening Pandora’s Box. Applied Linguistics 16.2, 159–179.

Milanovic, M. & N. Saville (1996). Introduction. In M. Milanovic (ed.), Performance testing, cognition and assessment. Cambridge: Cambridge University Press, 1–17.

Skehan, P. (2001). Tasks and language performance assessment. In M. Bygate, P. Skehan & M. Swain (eds.), Researching pedagogic tasks: Second language learning, teaching and testing. London: Longman, 167–185.

Taylor, L. (ed.) (2011). Examining speaking: Research and practice in assessing second language speaking. Cambridge: Cambridge University Press.

Weir, C. & M. Milanovic (eds.) (2003). Continuity and innovation: Revising the Cambridge Proficiency in English Examination 1913–2002. Cambridge: Cambridge University Press.

Weir, C. J., I. Vidaković & E. D. Galaczi (eds.) (2013). Measured constructs: A history of Cambridge English language examinations 1913–2012. Cambridge: Cambridge University Press.

Vidaković, I. & E. D. Galaczi (2013). The measurement of speaking ability 1913–2012. In C. J. Weir, I. Vidaković & E. D. Galaczi (eds.), Measured constructs: A history of Cambridge English language examinations 1913–2012. Cambridge: Cambridge University Press, 257–346.

Timeline (Year, Reference, Annotation, Theme)

Note: Authors’ names are shown in small capitals when the study referred to appears in this timeline.


1864 (Theme: A)
Reference: Chadwick, E. (1864). Statistics of educational results. Museum: A Quarterly Magazine of Education, Literature and Science 3, 479–484. Also see discussion in: Cadenhead, K. & R. Robinson (1987). Fisher’s ‘Scale Book’: An early attempt at educational measurement. Educational Measurement: Issues and Practice 6.4, 15–18.
Annotation: The earliest record of an attempt to assess L2 speaking dates to the first few years after Rev. George Fisher became Headmaster of the Greenwich Royal Hospital School in 1834. In order to improve and record academic achievement, he instituted a ‘Scale Book’, which recorded performance on a scale of 1 to 5 with quarter intervals. A scale was created for French as a second language, with typical speaking prompts to which boys would be expected to respond at each level. The Scale Book has not survived.



1912 (Themes: A, B)
Reference: Thorndike, E. L. (1912). The measurement of educational products. The School Review 20.5, 289–299.
Annotation: Scales of various kinds were developed by social scientists like Galton and Cattell towards the end of the 19th Century, but it was not until the work of Thorndike in the early 20th Century that the definition of each point on an equal interval scale was revived. With reference to speaking German, he suggested that performance samples should be attached to each level of a scale, along with a descriptor that summarizes the ability being tested.



1920 (Themes: A, B, C, D)
Reference: Yerkes, R. M. (1920). What psychology contributed to the war. In R. M. Yerkes (ed.), The new world of science: Its development during the war. New York, NY: The Century Co, 364–389. Also see discussion in: Fulcher, G. (2012). Scoring performance tests. In G. Fulcher & F. Davidson (eds.), The Routledge handbook of language testing. London and New York: Routledge, 378–392.
Annotation: Yerkes describes the development of the first large-scale speaking test for military purposes in 1917. It was designed to place army recruits into language development battalions. It consisted of a verbal section and a performance section (following instructions), with tasks linked to scale level by difficulty. Although the development of the test is not described, the generic approach is outlined, and involved the identification of typical tasks from the military domain that were piloted in test conditions. It is arguably the case that this was the first English for Specific Purposes (ESP) test based on domain specific criteria. In addition, there was clearly an element of domain analysis to support criterion-referenced assessment.

1944 (Themes: A, B, D)
Reference: Kaulfers, W. V. (1944). War-time developments in modern language achievement tests. The Modern Language Journal 28, 136–150. Also see discussion in: Velleman, B. L. (2008). The ‘scientific linguist’ goes to war: the United States A.S.T. program in foreign languages. Historiographia Linguistica 35, 385–416.
Annotation: The interwar years saw a rapid growth in large-scale assessment that relied on the multiple-choice item for efficiency. In the Second World War Kaulfers quickly realized that these tests could not adequately predict ability to speak in potentially life-threatening contexts. Teaching and assessment of speaking was quickly geared towards the military context once again. Kaulfers presents scoring criteria according to the scope and quality of performance. However, all descriptors are generic and not domain specific.

1945 (Theme: E)
Reference: Roach, J. O. (1945). Some problems of oral examinations in modern languages. An experimental approach based on the Cambridge examinations in English for Foreign Students. University of Cambridge Examinations Syndicate: Internal report circulated to oral examiners and local representatives for these examinations. (Reprinted as facsimile in Weir et al. 2013.)
Annotation: Roach was among the first to investigate rater reliability in speaking tests. He was concerned primarily with maintaining ‘standards’, by which he meant that examiners would agree on which test takers were awarded a pass, a good pass, and a very good pass, on the Certificate of Proficiency in English. He was the first to recommend what we now call ‘social moderation’ (see MISLEVY 1992) – familiarization with the system through team work, which results in agreement evolving over time.

1952/1958 (Themes: A, B, C, D, F)
Reference: Foreign Service Institute (1952/1958). FSI Proficiency Ratings. Washington D.C.: Foreign Service Institute. Also see discussion in: Sollenberger, H. E. (1978). Development and current use of the FSI oral interview test. In J. L. D. Clark (ed.), Direct testing of speaking proficiency: Theory and application. Princeton, NJ: Educational Testing Service, 1–12.
Annotation: Little progress was made in testing L2 speaking until the outbreak of the Korean War in 1950. The Foreign Service Institute (FSI) was established, and the first widely used semantic-differential rating scale put into use in 1952. This operationalized the ‘native speaker’ construct at the top band (level six). With the Vietnam war on the horizon, a decision was taken to register the language skills of US diplomatic and military personnel. Work began to expand the FSI scale by adding verbal descriptors at each of the six levels from zero proficiency to native speaker, and to include multiple holistic traits. This went hand in hand with the creation of the Oral Proficiency Interview (OPI), which was a mix of interview, prepared dialogue, and simulation. The wording of the 1958 FSI scale and the tasks associated with the OPI have been copied into many other testing systems still in use.


1967 (Themes: E, G)
Reference: Carroll, J. B. (1967). The foreign language attainments of language majors in the senior year: A survey conducted in US colleges and universities. Foreign Language Annals 1.2, 131–151.
Annotation: Despite little validation evidence the FSI and Interagency Language Roundtable (ILR) approach became popular in education because of its face validity, inter-rater reliability through social moderation, and perceived coherence with new communicative teaching methods. Carroll showed that the military system was not sensitive to language acquisition in an educational context, and hence was demotivating. It would be over a decade before this research had an impact on policy.


1979
Reference: Strength Through Wisdom: A critique of U.S. capability. A report to the President from the President's Commission on Foreign Language and International Studies. Washington DC: US Government Printing Office.
Annotation: Further impetus to extend speaking assessment in educational settings came from a report submitted to President Carter on shortcomings in the US military because of lack of foreign language skills. It is not coincidental that in the same year attention was drawn to the study published by CARROLL (1967). The American Council on the Teaching of Foreign Languages (ACTFL) was given the task of revising the FSI/ILR scales for wider use.


1979 (Themes: A, C, E, G)
Reference: Adams, M. L. & J. R. Frith (1979). Testing kit: French and Spanish. Washington DC: Department of State and the Foreign Service Institute.
Annotation: As part of the ACTFL research into new rating scales, the first testing kits were developed for training and assessment purposes in US colleges. The articles and resources in Adams & Frith provided a comprehensive guide for raters of the OPI for educational purposes.

1980 (Theme: B)
Reference: Adams, M. L. (1980). Five co-occurring factors in speaking proficiency. In J. R. Frith (ed.), Measuring spoken language proficiency. Washington DC: Georgetown University Press, 1–6.
Annotation: Adams conducted the first structural validation study designed to investigate which of the five FSI subscales discriminated between learners at each proficiency level. The study was not theoretically motivated, and no patterns could be discerned in the data.


1980 (Theme: C)
Reference: Reves, T. (1980). The group-oral test: An experiment. English Teachers Journal 24, 19–21.
Annotation: Reves questioned whether the OPI could generate ‘real-life conversation’ and began experimenting with group tasks to generate richer speaking samples.


1981 (Theme: B)
Reference: Bachman, L. F. & A. S. Palmer (1981). The construct validity of the FSI oral interview. Language Learning 31.1, 67–86.
Annotation: The first construct validation studies were carried out in the early 1980s, using the multitrait-multimethod technique and confirmatory factor analysis. These demonstrated that the FSI OPI loaded most heavily on the speaking trait, and lowest of all methods on the method trait. These studies concluded that there was significant convergent and divergent evidence for construct validity in the OPI.

1983 (Themes: A, C, D)
Reference: Lowe, P. (1983). The ILR oral interview: origins, applications, pitfalls, and implications. Die Unterrichtspraxis 16, 230–244.
Annotation: In the 1960s the FSI approach to assessing speaking was adopted by the Defense Language Institute, the Central Intelligence Agency, and the Peace Corps. In 1968 the various adaptations were standardized as the Interagency Language Roundtable (ILR), which is still the accepted tool for the certification of L2 speaking proficiency throughout the United States military, intelligence and diplomatic services (http://www.govtilr.org/). Via the Peace Corps it spread to academia, and the assessment of speaking proficiency worldwide. It also provides the basis for the current NATO language standards, known as STANAG 6001.

1984 (Themes: A, B)
Reference: Liskin-Gasparro, J. E. (1984). The ACTFL Proficiency Guidelines: Gateway to testing and curriculum. Foreign Language Annals 17.5, 475–489.
Annotation: Following the publication of Strength Through Wisdom (see 1979, above) and the concerns raised by CARROLL (1967), the ACTFL Guidelines were developed throughout the 80s, with preliminary publications in 1982, and the final Guidelines issued in 1986 (revised 1999). Levels from 0 to 5 were broken down into subsections, with finer gradations at lower proficiency levels. Level descriptors provided longer prose definitions of what could be done at each level. New constructs were introduced at each level, drawing on new theoretical models of communicative competence of the time, particularly those of Canale & Swain (1980). These included discourse competence, interaction, and communicative strategies.
Footnote: Canale, M. & M. Swain (1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied Linguistics 1.1, 1–47.

1985 (Themes: A, B)
Reference: Lantolf, J. P. & W. Frawley (1985). Oral proficiency testing: A critical analysis. The Modern Language Journal 69.4, 337–345.
Annotation: Lantolf & Frawley were among the first to question the ACTFL approach. They claimed the scales were ‘analytical’ rather than ‘empirical’, depending on their own internal logic of non-contradiction between levels. The claim that the descriptors bear no relationship to how language is acquired or used set off a whole chain of research into scale analysis and development.


1986 (Theme: B)
Reference: Kramsch, C. J. (1986). From language proficiency to interactional competence. The Modern Language Journal 70.4, 366–372.
Annotation: Kramsch’s research into interactional competence spurred further research into task types that might elicit interaction, and the construction of ‘interaction’ descriptors for rating scales. This research had a particular impact on future discourse related studies by HE & YOUNG (1998).


1986 (Themes: B, D, F)
Reference: Bachman, L. F. & S. Savignon (1986). The evaluation of communicative language proficiency: a critique of the ACTFL Oral Interview. The Modern Language Journal 79, 380–390.
Annotation: This very influential paper questioned the use of the native speaker to define the top level of a rating scale, and the notion of zero proficiency at the bottom. Secondly, the researchers questioned reference to context within scales as confounding constructs with test method facets, unless the test is for a defined ESP setting. This paper therefore set the agenda for debates around score generalizability, which we still wrestle with today.


1987 (Themes: A, B, H)
Reference: Fulcher, G. (1987). Tests of oral performance: The need for data-based criteria. English Language Teaching Journal 41.4, 287–291.
Annotation: Using discourse analysis of native speaker interaction, this paper provided the first evidence that rating scales did not describe what typically happened in naturally occurring speech, and advocated a data-based approach to writing descriptors and constructing scales. This was the first use of discourse analysis to understand under-specification in rating scale descriptors, and was expanded into a larger research agenda (see FULCHER 1996).


1989 (Themes: B, H)
Reference: Van Lier, L. (1989). Reeling, writhing, drawling, stretching, and fainting in coils: Oral proficiency interviews as conversation. TESOL Quarterly 23.3, 489–508.
Annotation: In another discourse analysis study, Van Lier showed that interview language was not like ‘normal conversation’. Although the work of finding formats that encouraged ‘conversation’ had started with REVES (1980) and colleagues in Israel, this paper encouraged wider research in the area.


1991 (Themes: E, I)
Reference: Linacre, J. M. (1991). FACETS computer programme for Many-faceted Rasch Measurement. Chicago, IL: Mesa Press.
Annotation: Rater variation had been a concern since the work of ROACH (1945) during the war, but only with the publication of Linacre’s FACETS did it become possible to model rater harshness/leniency in relation to task difficulty and learner ability. MFRM remains the standard tool for studying rater behaviour and test facets today, as in the studies by LUMLEY & MCNAMARA (1995) and BONK & OCKEY (2003).


1991 (Theme: A)
Reference: Alderson, J. C. (1991). Bands and scores. In J. C. Alderson & B. North (eds.), Language testing in the 1990s. London: Modern English Publications and the British Council, 71–86.
Annotation: Based on research driving the IELTS revision project, Alderson categorized rating scales as user-oriented, rater-oriented, and constructor-oriented. These categories have been useful in guiding descriptor content with audience in mind.

1992 (Themes: B, C, H, L)
Reference: Young, R. & M. Milanovic (1992). Discourse variation in oral proficiency interviews. Studies in Second Language Acquisition 14.4, 403–424.
Annotation: An early and significant use of discourse analysis to characterize the interaction of test takers with interviewers in the First Certificate Test of English. Discourse structure was demonstrated to be related to examiner, task and gender variables.


1992 (Themes: A, B, D)
Reference: Douglas, D. & L. Selinker (1992). Analyzing Oral Proficiency Test performance in general and specific purpose contexts. System 20.3, 317–328.
Annotation: Douglas & Selinker show that a discipline specific test (chemistry) is a better predictor of domain specific performance than a general speaking test. In this, and a series of publications on ESP testing, they show that reducing generalizability by introducing context increases score usefulness. This is the other side of the coin to BACHMAN & SAVIGNON’S (1986) generalizability argument.

1992 (Themes: B, C, H, J)
Reference: Ross, S. & R. Berwick (1992). The discourse of accommodation in oral proficiency interviews. Studies in Second Language Acquisition 14.1, 159–176.
Annotation: Reacting to critiques of the OPI from VAN LIER (1989), LANTOLF & FRAWLEY (1985), and others, Ross & Berwick undertook discourse analysis of OPIs to study how interviewers accommodated to the discourse of candidates. They concluded that the OPI had features of both interview and conversation. However, the study also raised the question of how interlocutor variation might result in test takers being treated differentially. This sparked a chain of similar research by scholars such as LAZARATON (1996).

1992 (Theme: E)
Reference: Mislevy, R. J. (1992). Linking educational assessments: Concepts, issues, methods and prospects. Princeton, NJ: Educational Testing Service.
Annotation: LOWE (1983) and others had argued that the meaning of descriptors was socially acquired. In this publication the term ‘social moderation’ was formalized. NORTH & SCHNEIDER (1998) and the Council of Europe have taken this concept and made it central to the project of using the Common European Framework of Reference (CEFR) scales as a European-wide lens for viewing speaking proficiency.

1995 (Themes: A, B, E)
Reference: Chalhoub-Deville, M. (1995). Deriving oral assessment scales across different tests and rater groups. Language Testing 12.1, 16–33.
Annotation: Chalhoub-Deville investigated the inter-relationship of diverse tasks and raters using multidimensional scaling to identify the components of speaking proficiency that were being assessed. She found that these varied by task and rater group, and therefore called for the construct to be defined anew for each task/rater combination. The issue at stake is whether the construct has any psychological reality independent of context specific performances.

1995 (Themes: E, I)
Reference: Lumley, T. & T. McNamara (1995). Rater characteristics and rater bias: implications for training. Language Testing 12.10, 54–71.
Annotation: Rater variability is studied across time using FACETS, showing that there is considerable variation in harshness irrespective of training. The researchers question the use of single ratings in high-stakes speaking tests, and recommend the use of rater calibrations to provide training feedback or adjust scores.


1995 (Themes: A, B, C, D, K)
Reference: Upshur, J. & C. Turner (1995). Constructing rating scales for second language tests. English Language Teaching Journal 49.1, 3–12.
Annotation: Upshur & Turner introduce Empirically-derived Binary-choice Boundary-definition scales (EBB). These address the long-standing concern over a-priori scale development outlined by LANTOLF & FRAWLEY (1985), and start to tie decisions to specific examples of performance as recommended by FULCHER (1987). The scales are task specific rather than generic. The methodology has had a specific impact on later studies such as POONPON (2010).

1996 (Themes: A, B, C, D)
Reference: McNamara, T. (1996). Measuring second language performance. Harlow: Longman.
Annotation: McNamara described the development of the Occupational English Test (OET) for health professionals. This is a specific purpose test with a clearly specified audience, and scores from this instrument are shown to be more reliable and valid for decision making than generic English tests.


1996 (Themes: C, G)
Reference: Fulcher, G. (1996). Testing tasks: Issues in task design and the group oral. Language Testing 13.1, 23–51.
Annotation: Building on REVES (1980) and others, this study compared a group oral (3 participants) and two interview-type tasks. Discourse was more varied in the group task, and participants reported a preference for working in a group with other test-takers.

1996 (Themes: A, B, C, D, H)
Reference: Fulcher, G. (1996). Does thick description lead to smart tests? A data-based approach to rating scale construction. Language Testing 13.2, 208–238.
Annotation: Based on work conducted since FULCHER (1987), this paper describes the research underpinning the design of data-based rating scales. The methodology employs discourse analysis of speech samples to produce scale descriptors. The use of the resulting scale is compared with generic a-priori scales. Using discriminant analysis, the data-based scores are found to be more reliable, and using MFRM, rater variation is significantly decreased. The data-based approach therefore solves the problems identified by researchers like LUMLEY & MCNAMARA (1995). The study also generated the Fluency Rating Scale descriptors, which were used as anchor items in the CEFR project.

1996 (Themes: B, H, J)
Reference: Lazaraton, A. (1996). Interlocutor support in oral proficiency interviews: The case of CASE. Language Testing 13.2, 151–172.
Annotation: In the ROSS & BERWICK (1992) tradition, and inspired by VAN LIER (1989), Lazaraton identifies 8 kinds of support provided by a rater/interlocutor in an OPI. She concludes that the variation is problematic, and calls for additional rater training and possibly the use of an ‘interlocutor support scale’ as part of the rating procedure.


1996 (Themes: B, K)
Reference: Pollitt, A. & N. L. Murray (1996). What raters really pay attention to. In M. Milanovic & N. Saville (eds.), Performance testing, cognition and assessment. Selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem. Studies in Language Testing 3. Cambridge: Cambridge University Press, 74–91.
Annotation: Pollitt & Murray use two innovative techniques to investigate how raters use rating scales, and what they pay attention to when rating spoken performances. The research showed that raters bring their own conceptual baggage to the rating process, but also that they used constructs such as discourse, sociolinguistic, and grammatical competence, as well as fluency and ‘naturalness’.


1997 (Theme: B)
Reference: McNamara, T. (1997). Modelling performance: Opening Pandora’s Box. Applied Linguistics 18.4, 446–465.
Annotation: Speaking had generally been characterized in cognitive terms as traits resident in the speaker being assessed. Building on the work of KRAMSCH (1986) and others, McNamara showed that interaction implied the co-construction of speech, and argued that in social contexts there was shared responsibility for performance. The question of shared responsibility and the role of the interlocutor have since become active areas of research.

1998 (Themes: B, C, H)
Reference: Young, R. & A. W. He (eds.) (1998). Talking and testing: Discourse approaches to the assessment of oral proficiency. Amsterdam: John Benjamins.
Annotation: An important collection of research papers analysing the discourse of test-taker speech in speaking tests. The speaking test is characterized as an ‘interactive practice’ co-constructed by the participants.

1998 (Themes: A, I)
Reference: North, B. & G. Schneider (1998). Scaling descriptors for language proficiency scales. Language Testing 15.2, 217–262.
Annotation: North & Schneider describe the measurement-driven approach to scale development as embodied in the CEFR. Descriptors from existing speaking scales are extracted from context and scaled using MFRM and teacher judgments as data.


1999 (Themes: B, K)
Reference: Jacoby, S. & T. McNamara (1999). Locating competence. English for Specific Purposes 18.3, 213–241.
Annotation: In two studies, Jacoby & McNamara discovered that the linguistic criteria used by applied linguists to rate speaking performance did not capture the kind of communication valued by subject specialists. They recommended studying ‘indigenous criteria’ to expand what is valued in performances. This work has impacted on domain specific studies, such as FULCHER ET AL. (2011). It also raises serious questions about psycholinguistic approaches such as those advocated by VAN MOERE (2012).

2002 (Themes: B, C, H)
Reference: Young, R. (2002). Discourse approaches to oral language assessment. Annual Review of Applied Linguistics 22, 243–262.
Annotation: A careful investigation of the ‘layers’ of discourse in naturally occurring speech and test tasks. This is combined with a review of various approaches to testing speaking, with an indication of which test formats are likely to elicit the most useful speech samples for rating.


2002 (Themes: B, H)
Reference: O’Sullivan, B., C. J. Weir & N. Saville (2002). Using observation checklists to validate speaking-test tasks. Language Testing 19.1, 33–56.
Annotation: A methodological study to compare the ‘informational and interactional functions’ produced on speaking test tasks with those the test designer intended to elicit. The instrument proved to be unwieldy and impractical, but the study established the important principle for examination boards that evidence of congruence between intention and reality is an important aspect of construct validation.

2003 (Themes: B, H, I, J)
Reference: Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency. Language Testing 20.1, 1–25.
Annotation: A much quoted study into variation in the speech of the same test taker with two different interlocutors. Building on ROSS & BERWICK (1992), LAZARATON (1996) and MCNAMARA (1996), Brown demonstrated that scores also varied, although not by as much as one may have expected. The paper raises the critical issue of whether variation should be allowed because it is part of the construct, or controlled because it leads to inequality of opportunity.

2003 (Themes: B, C, H)
Reference: Fulcher, G. & R. Marquez-Reiter (2003). Task difficulty in speaking tests. Language Testing 20.3, 321–344.
Annotation: An investigation into the effects of task features (social power and level of imposition) and L1 cultural background on task difficulty and score variation. As in BROWN (2003), it was discovered that although significant variation occurred when extreme conditions were used, effect sizes were not substantial.


2003 (Themes: B, E, I)
Reference: Bonk, W. J. & G. J. Ockey (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing 20.1, 89–110.
Annotation: Using FACETS, the researchers investigated variability due to test taker, prompt, rater, and rating categories. Test taker ability was the largest facet. Although there was evidence of rater variability, this did not threaten validity, and indicated that raters became more stable in their judgments over time. This adds to the evidence that socialization over time has an impact on rater behaviour.

2005 (Themes: B, C, K)
Reference: Cumming, A., L. Grant, P. Mulcahy-Ernt & D. E. Powers (2005). A teacher-verification study of speaking and writing prototype tasks for a new TOEFL Test. TOEFL Monograph No. MS-26. Princeton, NJ: Educational Testing Service.
Annotation: An important prototyping study. Pre-operational tasks are shown to experts, who judge whether they represent the kinds of tasks that students would undertake at university. They are also presented with their own students’ responses to the tasks and asked whether these are ‘typical’ of their work. The study shows that test development is a research-led activity, and not merely a technical task. Design decisions and the evidence for those decisions are part of a validation narrative.


2007 (Themes: B, C, L)
Reference: Berry, V. (2007). Personality differences and oral test performance. Frankfurt: Peter Lang.
Annotation: Based on many years of research into personality and speaking test performance, Berry shows that levels of introversion and extroversion impact on contributions to conversation in paired and group formats, and result in differential score levels when ability is controlled for.

2008 (Themes: B, C, H)
Reference: Galaczi, E. D. (2008). Peer-peer interaction in a speaking test: The case of the First Certificate in English examination. Language Assessment Quarterly 5.2, 89–119.
Annotation: Galaczi presents a discourse analytic study of the paired test format, in which two candidates are required to converse with each other, as well as the examiner/interlocutor. The research identified three interactive patterns in the data: ‘collaborative’, ‘parallel’ and ‘asymmetric’. Tentative evidence is also presented to suggest that there is a relationship between these interactional patterns and scores on an ‘Interactive Communication’ rating scale.


2009 (Themes: B, C, L)
Reference: Ockey, G. (2009). The effects of group members’ personalities on a test taker’s L2 group oral discussion test scores. Language Testing 26.2, 161–186.
Annotation: Building on BERRY (2007), Ockey investigates the effect of levels of ‘assertiveness’ on speaking scores in a group oral test, using MANCOVA analyses. Assertive students are found to have lower scores when placed in all-assertive groups, and higher scores when placed with less assertive participants. The scores of non-assertive students did not change depending on group makeup. The results differ from BERRY, indicating that much more research is needed in this area.

2010 (Themes: A, B, H, K)
Reference: Poonpon, K. (2010). Expanding a second language speaking rating scale for instructional assessment purposes. Spaan Fellow Working Papers in Second or Foreign Language Assessment 8, 69–94.
Annotation: A study that brings together the EBB approach of UPSHUR & TURNER (1995) with the data-based approach of FULCHER (1996) to create a rich data-based EBB for use with TOEFL iBT tasks. In the process, the nature of the academic speaking construct is further explored and defined.


2011 (Themes: A, B, H)
Reference: Fulcher, G., F. Davidson & J. Kemp (2011). Effective rating scale development for speaking tests: Performance Decision Trees. Language Testing 28.1, 5–29.
Annotation: Like POONPON (2010), this study brings together UPSHUR & TURNER’s (1995) EBB and FULCHER’s (1996) data-based approach in the context of service encounters. It also incorporates indigenous insights following JACOBY & MCNAMARA (1999). It describes interaction in service encounters through a performance decision tree that focuses rater attention on observable criteria related to discourse and pragmatic constructs.

2011 (Themes: A, B, C)
Reference: Frost, K., C. Elder & G. Wigglesworth (2011). Investigating the validity of an integrated listening-speaking task: A discourse-based analysis of test takers’ oral performances. Language Testing 29.3, 345–369.
Annotation: Integrated task types have become widely used since their incorporation into TOEFL iBT. However, little research has been carried out into the use of source material in spoken responses, or how the integrated skill can be described in rating scale descriptors. The ‘integration’ remains elusive. In this study a discourse approach is adopted following ideas in DOUGLAS & SELINKER (1992) and FULCHER (1996) to define content related aspects of validity in integrated task types. The study provides evidence for the usefulness of integrated tasks in broadening construct definition.


2011 (Themes: B, C, K)
Reference: May, L. (2011). Interactional competence in a paired speaking test: Features salient to raters. Language Assessment Quarterly 8.2, 127–145.
Annotation: Following KRAMSCH (1986), MCNAMARA (1997) and YOUNG (2002), May problematizes the notion of the speaking construct in a paired speaking test. However, she attempts to deal with the problem of how to award scores to individuals by looking at how raters focus on features of the speech of individual participants. The three categories of interpretation (understanding the interlocutor’s message, responding appropriately, and using communicative strategies) are not as important as the attempt to disentangle the individual from the event, while recognizing that discourse is co-constructed.

2011 (Themes: B, H)
Reference: Nakatsuhara, F. (2011). Effects of test-taker characteristics and the number of participants in group oral tests. Language Testing 28.4, 483–508.
Annotation: Building on BONK & OCKEY (2003) and other research into the group speaking test, Nakatsuhara used conversation analysis to investigate group size in relation to proficiency level and personality type. She discovered that more proficient extroverts talked more and initiated more topics when in groups of 4 than in groups of 3. However, proficiency level resulted in more variation in groups of 3. With reference to GALACZI (2008), she concludes that groups of 3 are more collaborative.


2012 (Themes: B, C)
Reference: Van Moere, A. (2012). A psycholinguistic approach to oral language assessment. Language Testing 29.1, 325–344.
Annotation: Very much against the trend, Van Moere makes a case for a return to assessing psycholinguistic speech ‘facilitators’ related to processing automaticity. These include response latency, speed of speech, length of pauses, and the reproduction of syntactically accurate sequences, with appropriate pronunciation, intonation and stress. Task types are sentence repetition and sentence building. This approach is driven by an a-priori decision to use an automated scoring engine to rate speech samples. The validation argument stresses the objective nature of the decisions, compared with the unreliable and frequently irrelevant judgments of human raters. This is an exercise in reductionism par excellence, and is likely to reignite the debate on prediction to domain performance from ‘atomistic’ features that last raged in the early communicative language testing era.

2012 (Themes: E, J)
Reference: Tan, J., B. Mak & P. Zhou (2012). Confidence scoring of speaking performance: How does fuzziness become exact? Language Testing 29.1, 43–65.
Annotation: This paper applies fuzzy logic to our understanding of how raters score performances. This approach takes into account both rater decisions and the levels of uncertainty in arriving at those decisions.

2014 (Themes: C, H)
Reference: Nitta, R. & F. Nakatsuhara (2014). A multifaceted approach to investigating pre-task planning effects on paired oral performance. Language Testing 31.2, 147–175.
Annotation: Nitta & Nakatsuhara investigate providing test-takers with planning time prior to undertaking a paired speaking test. The unexpected findings are that planning time results in stilted prepared output, and reduced interaction between speakers.

Acknowledgements

I would like to thank Dr. Gary Ockey of Educational Testing Service for reviewing my first draft and providing valuable critical feedback. My thanks are also due to the three reviewers, whose very constructive criticism has considerably improved the coverage and coherence of the timeline. Finally, I thank the editor of Language Teaching for timely guidance and advice.
