assessing reading
from theories to classrooms
An international multi-disciplinary
investigation of the theory of reading
assessment and its practical implications
at the beginning of the 21st century
Edited by
Marian Sainsbury
Colin Harrison
Andrew Watts
First published in July 2006, this edition
published online only in July 2008
National Foundation for
Educational Research
The Mere, Upton Park
Slough, Berkshire SL1 2DQ
www.nfer.ac.uk
© NFER 2006
The views contained in this document are the authors’ own and do not necessarily reflect those of the NFER.
Every attempt has been made to contact copyright holders. Please contact the NFER if you have any concerns.
Registered Charity No. 313392
ISBN 978 1 906792 04 6 (online)
Design by Stuart Gordon at NFER
Layout by Helen Crawley at NFER
Index by Indexing Specialists (UK) Ltd, www.indexing.co.uk
Contents
[Part 1]
Competing paradigms:
theories of reading and theories of assessment
[Part 2]
Historical insights as drivers of theory
9 Lessons of the GCSE English ‘100 per cent coursework’ option, 1986–1993 (Paul Thompson)
[Part 3]
Theory into practice: current issues
[Part 4]
Theory into practice: national initiatives
Index
Contributors
John R. Beech
John Beech is a senior lecturer in the School of Psychology, University of Leicester. He
was Editor of the Journal of Research in Reading for volumes 24–27 and is currently a
co-editor of the journal. He is author or editor of a dozen books including Learning to Read,
Cognitive Approaches to Reading and Psychological Assessment of Reading (co-edited with
Chris Singleton). His research interests are in cognitive, biological and educational
approaches to reading and reading development.
Patricia Donahue
Patricia Donahue is a Senior Program Administrator in the Assessment Division at
Educational Testing Service, Princeton, New Jersey. She is the coordinator of the
National Assessment of Educational Progress (NAEP) reading assessment and serves
on the reading task force for the PIRLS study.
Ros Fisher
Ros Fisher has taught in primary schools in the north-west of England and the USA.
She is now Senior Lecturer in Education at the University of Exeter. She writes widely
about the teaching of literacy and has researched the role of the teacher and teacher
change in current large-scale initiatives to change the teaching of literacy in England.
She is currently researching the relationship between talk and writing. She has recently
written Inside the Literacy Hour (Routledge, 2002) and edited the collection of papers
from an ESRC-funded research seminar series Raising Standards in Literacy (Falmer,
2002).
Colin Harrison
Colin Harrison is Professor of Literacy Studies in Education at the University of Not-
tingham. He has been at various times a secondary school teacher of English, a GCSE
examiner, a full-time researcher into the place of reading in school, a teacher educator
and a director of national projects evaluating the place of ICT in learning. His current
research projects include the use of digital video for teacher development.
Louise Hayward
Louise Hayward is a Senior Lecturer in the Faculty of Education, University of Glas-
gow. Over the past 15 years she has worked to bring research, policy and practice in
assessment into closer alignment in Scotland. Currently, she chairs the Research and
Development group for the National Assessment for Learning programme. Her research
interests are in inclusion, assessment and transformational change.
Colin Higgins
Colin Higgins is Head of the Learning Technology Research Group in the Computer
Science department at the University of Nottingham. He is interested in automatic
assessment of essays, and particularly the development of the CourseMarker (formerly
CEILIDH) system. His other current projects are in handwriting recognition, construct-
ing metrics for object-oriented designs and programs, and writing programs to measure
the quality of logic programs written in the Prolog language.
Claudia Leacock
Claudia Leacock is a senior member of technical staff at Pearson Knowledge Tech-
nologies. Since receiving her PhD in Linguistics at the CUNY Graduate Center, she
has specialised in the automated understanding of human languages. Her primary
focus during the past 10 years has been the automated assessment of constructed
responses – for both content and grammar – and she has published many articles in this
area. Most recently, she was a guest co-editor of the Journal of Natural Language
Engineering’s special issue on building educational applications using natural lan-
guage processing.
Nasiroh Omar
Nasiroh Omar came from a post in a technological university in Malaysia to the Univer-
sity of Nottingham’s Learning Sciences Research Institute to work on a doctorate in the
field of artificial intelligence and human learning. She has been the lead programmer
and researcher on the Online Reading Internet Research Support System.
Roger Palmer
Roger Palmer was educated at Canton High School, Cardiff and University College,
Cardiff where he read English and also completed a PGCE course. He taught for 20 years
in schools in the Cardiff area before joining the Curriculum Council for Wales in 1989.
When the Council was superseded by the Qualifications, Curriculum and Assessment
Authority for Wales, Roger worked for the new body as an English Subject Officer. In
September 2004, he became the Authority’s Assistant Chief Executive (Curriculum and
Assessment: 3–14).
P. David Pearson
P. David Pearson, a frequent writer on issues of assessment and policy, serves as Profes-
sor and Dean in the Graduate School of Education at the University of California,
Berkeley. Additionally, Pearson has an active programme of research on issues of
instruction and reform in high-poverty schools.
Lorna Pepper
Lorna Pepper is a Project Officer at Oxford Cambridge and RSA (OCR), developing key
stage 3 English test papers for the National Assessment Agency (NAA) and managing key
stage 3 English test development for the Council for the Curriculum, Examinations and
Assessment (CCEA). Previously, she worked in secondary education, first as an English
teacher and then in various middle management roles, before taking up positions in inter-
national schools abroad and in the UK, as International Baccalaureate Coordinator and
school Vice Principal.
Alastair Pollitt
Alastair Pollitt is currently a visiting fellow at the Research Centre for English and
Applied Linguistics in the University of Cambridge, where he was a Senior Research
Officer from 1990 to 1994. In the intervening years he was Director of the Research and
Evaluation Division at the University of Cambridge Local Examinations Syndicate
(now Cambridge Assessment). In 1989, while a lecturer in the University of Edinburgh,
he co-directed the national survey of standards of English in Scotland’s primary and
secondary schools.
Martine Rémond
Martine Rémond is maître de conférences in Cognitive Psychology at the IUFM de
Créteil, France. Her research interests (pursued at the Institut National de la Recherche
Pédagogique and the University of Paris 8) are in reading comprehension and its
assessment, the role of metacognition, the improvement of reading comprehension and
the effects of strategy instruction training. Nationally and internationally recognised for
her expertise in assessment, she has been a reading expert for PISA (2000, 2003, 2006)
and PIRLS (2001), for the French High Committee for Education Assessment (2000–05)
and for the French High Committee for Reading (since 1996). Her expertise in reading
processes has led to her involvement in a large number of educational and psychological
research studies on assessment and in large-scale studies in France and in Europe.
Marian Sainsbury
Marian Sainsbury is Head of Literacy Assessment Research in the Department of
Assessment and Measurement at the NFER. She is director of the projects developing
the national tests in English at key stages 1 and 2 in England and key stage 2 in Wales
and international reading coordinator for the PIRLS study. Her research interests are in
a variety of aspects of literacy and its assessment.
Rifat Siddiqui
Rifat Siddiqui is a freelance education consultant with a range of experience in literacy
assessment. Formerly a primary teacher, she has worked for a number of education
organisations including the NFER, the Qualifications and Curriculum Authority and the
University of Cambridge.
Ernie Spencer
Ernie Spencer is Honorary Research Fellow in the Faculty of Education, University
of Glasgow. In previous roles as Senior Research Officer at the Scottish Council for
Research in Education and HMIE National Specialist in Assessment and in English he
made significant contributions to the development of formative and summative
assessment in Scotland.
Gordon Stobart
Gordon Stobart is Reader in Education at the University of London Institute of Educa-
tion. After teaching English in secondary schools he worked as an educational
psychologist in London. He then studied in the USA as a Fulbright Scholar. After work-
ing as Head of Research at London Examinations he became Principal Research Officer
for the National Council for Vocational Qualifications and then for the Qualifications
and Curriculum Authority. He is a member of the Assessment Reform Group, which
campaigns for better use of formative assessment in teaching and learning, and has pro-
duced a series of influential pamphlets – Assessment for Learning (1998); Testing,
Motivation and Learning (2002); The Role of Teachers in the Assessment of Learning
(2006). He is also editor of the international journal Assessment in Education: Princi-
ples, Policy and Practice.
Lynda Taylor
Lynda Taylor is currently Assistant Director of the Research and Validation Group at the
University of Cambridge ESOL Examinations (part of Cambridge Assessment – a non-
teaching department of the university). She is responsible for coordinating the research
and validation programme to support Cambridge ESOL’s wide range of language tests
and teaching awards. She has extensive experience of the theoretical and practical
issues involved in second language testing and assessment, and a special interest in the
theory and practice of assessing reading comprehension ability.
Paul Thompson
Paul Thompson has been working as a lecturer in the School of Education at the Univer-
sity of Nottingham since 2001. For many years previously, he was a Head of English in
City of Nottingham comprehensive schools. His main research interests centre on
the relationship between oracy and literacy. He is particularly interested in theories of
collaborative learning and classroom talk.
Andrew Watts
Andrew Watts began his career as a teacher of English in secondary schools in Surrey,
Coventry and Northampton, UK. After 11 years he moved to Singapore where he taught
in a Junior College for over four years. He then worked for five years as a ‘Specialist
Inspector’ for English in the Ministry of Education in Singapore, focusing on curricu-
lum development in English teaching and in-service teacher development. In 1990 he
returned to England and has been working with Cambridge Assessment since the sum-
mer of 1992. For most of that time he looked after teams that were developing national
tests in English, Maths and Science for 14-year-olds in England, Northern Ireland and
Wales. He is now working on the setting up of the Cambridge Assessment Network,
whose purpose is to promote online and face-to-face professional development opportu-
nities for assessment professionals internationally.
Chris Whetton
Chris Whetton is an Assistant Director of NFER and also Head of its Department of
Assessment and Measurement. He is the author of over a dozen published tests span-
ning both educational and psychological uses. He has directed several large-scale
national projects including the development of National Curriculum tests for seven-
year-olds as these were introduced, and other National Curriculum projects including
key stage 2 English development.
1 Introduction and overview
Marian Sainsbury
The nature of ‘reading’ is something usually taken for granted. In contemporary soci-
eties, the use of literacy for a vast range of social and personal purposes is so
widespread that it is rarely questioned. Within the education system, reading becomes
an explicit focus of attention, with substantial resources devoted to the teaching and
learning of literacy. Even here, however, the definition of ‘reading’ is usually not dis-
cussed, although one can be inferred from the kinds of teaching and learning activities
adopted. It is in the research community that the nature of reading becomes a defined
area of study, and here, as will become apparent, there are major disagreements between
different academic traditions over what is included and implied by the term ‘reading’.
Essentially, this book sets out to explore some of the theories, practices and conflicts
that surround the idea of reading at the beginning of the 21st century. In order to do this,
it adopts a particular perspective: that of assessment. Researchers, educationalists and
the population at large have questions about how well people read. Often, though not
exclusively, the readers in question are children who are still in the process of mastering
reading. This need to assess leads inevitably to the question ‘What exactly are the skills and
understandings that we want to know about, in order to gauge reading ability?’ Thus a
particular definition of reading is made concrete in an assessment. By
scrutinising tests and other instruments, it is possible to study the definition of reading –
the construct – specified or assumed in each one. It is the existence of this concrete evi-
dence in the form of tests and other instruments that makes assessment a promising
springboard for investigating the nature of reading.
In 2003–4, a series of seminars was held in England, supported by the research fund-
ing body the Economic and Social Research Council, with the purpose of exploring the
construct of reading. The participants were selected with the deliberate intention of
allowing interaction between different disciplines, and consisted of a group of special-
ists in assessment and reading from the United Kingdom, France and the United States.
There were cognitive psychologists with research interests in reading; educationalists
with a range of research backgrounds in the teaching and learning of literacy and litera-
ture; and assessment specialists. Unusually in such gatherings, there was a strong
representation of test developers, whose day-to-day research activities included the
practical processes of devising, trialling and refining actual reading tests.
This group set out to bring together their varying experiences of and perspectives on
the construct of reading. The seminars were open-ended and built in generous time for
discussion, in recognition of the complexity of the subject matter. Each individual chap-
ter in this volume explicates its reasoning and rationale, with references that situate it
within its own research background. However, some ‘fault lines’ in the arguments can
be set out in general terms, and these apply both to ideas about reading and to ideas
about assessment.
When we read, we consciously or unconsciously recognise written symbols as words
with meaning. The act of reading includes deciphering, or decoding, written words and
letters, transforming them into recognisable language, and understanding their mean-
ing. Meaning is intricately tied up with communication, and communication of many
kinds of meanings occupies a central role in human social intercourse. There is a funda-
mental divide between researchers who focus primarily on the decoding of words and
those who focus primarily upon reading as an act of meaning-communication. For the
former group, ‘reading’ proper is recognising the words; the uses of those words to
communicate meaning and support social interaction are interesting, but not essential to
the construct of reading. For the latter group, by contrast, it is not possible to make
sense of the notion of ‘reading’ without communicating meanings; the communicative
act is primary, and the specific skills involved in decoding written words cannot logically
be separated from this.
These two perspectives can be thought of as competing paradigms: theory-systems
that shape experience. The paradigm determines what counts as evidence, what obser-
vations are relevant, and even what is observed. Because the difference between them is
logical and definitional rather than empirical, no observation can prove that one is right
rather than the other. But this rather bleak view of paradigm competition does not rule
out an understanding of both, nor a rapprochement between them in practice. In the real
world of a literate society and an education system preparing children to participate in
it, the stark differences outlined above are masked. All agree that children need to
acquire the ability to recognise words fluently and to use this ability to facilitate and
enrich their everyday lives.
At the same time, there are equally fundamental disagreements about what – and
whom – assessment is for, and once again these can be seen as competing paradigms.
One kind of assessment purpose is to pinpoint strengths and weaknesses in reading
development and to diagnose barriers to that development. Such assessments give rise
to indications that guide teaching or prescribe remedial action. These formative and
diagnostic assessments can be seen as broadly for the benefit of the learner, but also of
teachers and other professionals, whose effectiveness is enhanced by this information.
Formative assessment stresses the value of informative feedback in the course of
ongoing teaching. The information obtained from informal, day-to-day assessment is
used by the teacher to provide better-focused teaching. It can also be used by the learn-
er as a powerful tool for improvement. If pupils are able to monitor their own learning
in this way, rather than relying on the teacher and other outsiders, they can play an
active part in planning their own learning experiences. In this classroom use, the assess-
ment is very informal. The evidence can be entirely ephemeral, such as a pupil’s answer
to a teacher’s question, or take the form of feedback comments on a pupil’s written
work. It is also possible to use more formal assessments in this formative way. Rather
than focus on the numerical score obtained in a test, it is possible to make an analysis of
the strengths and weaknesses demonstrated at individual, group or class level, and to
use this information to plan the curriculum.
Diagnostic assessment is used when a child is experiencing difficulties in learning to
read, in order to pinpoint the perceptual or cognitive problems that underlie the lack of
progress. For this purpose, carefully designed batteries of subtests are devised, and are
administered on an individual basis by an educational psychologist.
An entirely different purpose for assessment is certification. Final examinations and
tests assess the reading curriculum covered in the course of schooling. The certificates
awarded on the basis of these assessments serve to attest to the competence and under-
standing of the student. They thus benefit the student, in providing a recognised
measure of attainment, but also society, where they fulfil the purpose of selecting candi-
dates for further study or for employment. They are high-stakes assessments, because
the individual student’s life chances are affected by them.
In some countries in recent years, however, notably the UK and the USA, the pre-
dominant purpose for assessment is political accountability. Governments have a
legitimate interest in improving educational standards. Better national attainment in lit-
eracy benefits individuals as well as enhancing the economic performance of a country.
In this context, tests have the role of providing the performance outcomes that are used
by government and the public to evaluate progress towards defined targets. As a result,
the tests acquire high stakes for the local authorities, schools and teachers who are being
held to account for their pupils’ performance.
The participants in the seminars represented a range of interests in and allegiances to
these differing views on the nature of reading and on the purpose of assessment. They
were brought together with the aim of understanding more about one another’s perspec-
tives, and perhaps finding an overarching position that brought them closer. In the latter
aspiration, it is fair to say that the seminars had only limited success, as the incompatibility
of the underpinning theories became if anything more evident. This will be discussed in
more detail in the concluding comments of the book. But in the aim of fostering mutual
understanding and respect, the seminar series can be counted a success.
This book is the outcome of those seminars, and the themes outlined above are
worked out in a variety of cross-cutting ways in the following chapters. Part 1 is devot-
ed to explicating in more depth some of the theoretical underpinnings of reading and of
assessment. Each of the authors in this section sets out a single perspective; it is only
later in the book that the links between them become apparent.
Marian Sainsbury starts this process by outlining the evolving theory of construct
validity in assessment and suggesting an overall shape for the construct of reading that
attempts to integrate competing points of view.
In John Beech’s chapter, a psychological perspective is advanced that can broadly
be situated in the tradition focusing primarily on the decoding of the written word. In
highlighting the contribution of psychology to the study of reading, he stresses the
value of soundly researched psychological theories, arguing that these are rarely if
ever embodied in reading tests. His stance on assessment is to highlight its diagnostic
function.
The second part of the book approaches the construct of reading through the historical
context. P. David Pearson surveys some key moments in the history of
reading tests in the USA, pointing up the evolution in underlying theory that gave rise to
each new development. Similarly, Chris Whetton highlights significant points in read-
ing testing in the UK, but argues that political and social influences are at least as
important as theoretical developments in determining key outcomes. Paul Thompson’s
chapter chronicles the course of one influential innovation in the UK, demonstrating in
its own way the jostling of literary and educational theories with political imperatives in
determining the shape of reading assessment at one moment in time.
Leading on from these historical insights, the third section of the book looks at the
cutting edge of current work and finds two apparently contradictory ideas occupying
equally prominent positions. On the one hand, developments in information and com-
munication technology have led to a burgeoning interest in computer-based assessment.
Colin Harrison introduces some fundamental issues and principles for consideration.
Once again juxtaposing theory with practice, Claudia Leacock’s chapter describes an
innovative computer program, already operational, that makes possible the assessment
of open written responses, releasing computer-based reading tests from the limitations
of the multiple-choice question.
On the other hand, a contrasting area of interest and innovation that can be discerned
in current thought is the use of informal assessment by both teacher and pupils to support
learning – known broadly as the ‘assessment for learning’ movement. Gordon Stobart
sets out the principles of formative classroom assessment and applies them to reading. To
complement this, Lorna Pepper, Rifat Siddiqui and Andrew Watts describe a research
project investigating the value of giving feedback in a specific form to students who have
taken a reading test.
In implementing national assessment systems, governments make decisions about
the nature of reading and the purpose of assessment that set the agenda for national dis-
course. The seminar group included participants who were directly involved in devising
the national assessments in England, Wales, Scotland, France and the USA. The
insights from these very different systems make up the fourth and final part of the book.
Marian Sainsbury and Andrew Watts describe a system of national testing in England
that attempts to combine a complex, meaning-centred, literary construct of reading with
the constraints of a high-stakes testing regime. Roger Palmer and David Watcyn Jones
describe a similar construct of reading operating in Wales, but their chapter traces the
evolution of a similar high-stakes accountability assessment system into one that sup-
ports teachers in assessing for formative and diagnostic purposes. This perspective is
further amplified by Louise Hayward and Ernie Spencer, writing about Scotland. Here,
there is an established commitment to formative assessment that is worked out in all
aspects of the national assessment system.
The national evaluations in France take a distinctive view of both the construct of
reading and of the nature and purpose of national assessment. Martine Rémond
describes a set of formal national tests that are entirely formative in purpose, and that
embody a definition of reading which accords more importance to grammatical knowl-
edge than is usual in the Anglo-Saxon world. Finally, Patricia Donahue sets out yet
another national response to the set of questions posed by the definition and
purpose of reading assessment. The National Assessment of Educational Progress in the
USA is a national survey of reading attainment that yields indications of performance
that are crucial to political decision-making but low stakes for individual students. The
construct of reading can broadly be situated within a ‘responsive reading’ paradigm.
These five national assessment systems therefore represent a variety of ways in which
contemporary societies obtain their evidence about reading, and demonstrate how these
governmental decisions are both reflections and determinants of national values.
Each of these chapters is capable of standing alone, giving a summary and overview
of a particular perspective. The book can be used for reference, bringing together a col-
lection of theoretical and practical information about the assessment of reading in its
political, educational, geographical, historical and contemporary contexts. Reading the
entire book brings out the interaction between these factors, as principles are juxta-
posed with concrete examples, political demands with academic, social influences with
individual, theories with classrooms.
[Part 1]
Competing paradigms:
theories of reading and theories of assessment
2 Validity and the construct of reading
Marian Sainsbury
This book sets out to examine the nature of reading by way of its assessment. The cen-
tral question running through it is the apparently simple one of how reading should be
defined. The question can be investigated in many different ways, but in this book the
main focus is upon the constructs embodied in different reading tests and assessments.
The need for careful definition of constructs is a central element of validity theory, the
branch of psychometrics that raises philosophical questions about what is assessed and
how. This chapter will lay some foundations for later discussions by exploring the con-
cept of validity itself, its recent evolution into a more flexible and powerful form and
the implications of this for the assessment of reading.
Validation evidence
In most cases, a test user wishes to generalise beyond the scope of test items them-
selves, to draw inferences about valued attributes that go beyond the test. As Haertel
(1985) puts it:
Tests are settings for structured observations designed to provide an efficient source of
information about attributes of examinees. Often, these are attributes that cannot be
observed directly. The necessity of making inferences to a broader domain than the
test directly samples brings a need for some deeper theoretical basis for linking test
and criterion. This is a need for construct validation.
(Haertel, 1985, p.25)
The validation of a test consists of defining the underlying construct of interest and
establishing the theoretical and empirical links between this and test performance.
A complex but clear theory of validity emerges from the mainstream of recent
scholarship. Of particular note in articulating this broadly accepted view are Messick’s
(1989) comprehensive overview of the field and the latest version of the Standards for
Educational and Psychological Testing (AERA/APA/NCME, 1999). On this account,
the validation of an assessment consists of establishing lines of argument that demon-
strate that inferences about the construct can validly be drawn from performances on
the test. This process of validation is not related abstractly to the test in itself, but is
specific to the purposes for which the test is used. The test developer has a responsibil-
ity to work from a definition of the information the test is intended to provide and the
purposes for which it is required to provide it. From these definitions, appropriate lines
of argument can be determined and the evidence to support the validation argument
collected and reported.
Any discussion of the validation of an assessment requires consideration of the pur-
poses of its use and the kinds of evidence that can support its appropriateness for that
purpose. A reading test may be intended for the diagnosis and remediation of difficulties
for individual students; examinations in reading aim to certify competence in relation to
employment or further study; national and international tests and surveys in reading are
designed for the evaluation of the overall functioning of schools and education systems
within or between countries. These purposes for assessment form an essential part of
the foundations upon which validation arguments are built.
Five broad types of evidential argument are identified in the Standards
(AERA/APA/NCME, 1999), and validation for specific purposes takes the form of col-
lecting evidence and building arguments of one or more of these types. The first type of
evidence listed is that based on test content. The definition of the construct will set out a
range of knowledge, skills and understandings. Content evidence is a consideration of
how well the range of performance elicited by the test represents the range described in
the construct. On Messick’s (1989) analysis, construct under-representation is one of
the major threats to test validity. As an example, it is important for a curriculum-based
assessment to represent adequately the range of knowledge and understanding included
in the curriculum guidelines. Only with such adequate representation can test scores be
taken as a valid indication of how well the student has mastered the curriculum taught.
Validation evidence for test content representation is largely judgemental. Test items
may be mapped on to elements of the construct in a systematic way and the strength of
the relationship between the two reviewed by experts. In the case of many types of read-
ing tests, this evaluation of content must be applied both to passages of text and to the
questions asked.
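To make this mapping exercise concrete, the sketch below shows one minimal way of tabulating expert judgements about which element of the construct each item addresses, and of flagging elements with thin coverage. It is an illustration only: the construct elements, item labels and tags are invented and do not come from any particular test.

    from collections import Counter

    # Hypothetical construct elements and expert tagging of items; purely illustrative.
    construct_elements = ["decoding", "literal comprehension", "inference", "authorial technique"]

    item_map = {
        "Q1": "decoding",
        "Q2": "literal comprehension",
        "Q3": "literal comprehension",
        "Q4": "inference",
        "Q5": "inference",
        "Q6": "inference",
    }

    coverage = Counter(item_map.values())
    for element in construct_elements:
        n = coverage.get(element, 0)
        note = " <- possible construct under-representation" if n == 0 else ""
        print(f"{element}: {n} item(s){note}")

In practice the judgement about whether coverage is adequate remains an expert one; a tabulation like this simply makes gaps visible for review.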
A second type of validation evidence is that based on response processes and this too
relates to the adequate representation of the construct. In answering a test question, the
thought processes engaged in by candidates should replicate as closely as possible the
ways of thinking described in the construct. For example, if a test of reading is based on
reading and understanding an extract which is intended to be unseen, candidates with
prior knowledge of the text from which the extract is drawn are likely to give answers
based on that prior knowledge, rather than their abilities to read and understand a pas-
sage presented unseen. Any rote learning of model answers would also fail to elicit the
desired thought processes in test takers. In this case, the aim is to provide evidence of
the students’ mental processes. These are by their very nature invisible, but can be
accessed to some extent through words or actions. For example, an open response ques-
tion in a reading test may ask a pupil to draw an inference about character or motivation
in a literary reading passage and present some textual evidence for the answer. Analysis
of written answers can yield a range of indications of students’ reasoning. Test develop-
ment processes may also include a trial stage, where test-takers are asked to think aloud
as they read, or to explain how they obtain their answers to comprehension questions,
and this evidence can also be presented in a validation argument.
In both of the above lines of validation argument, the differential performance of dif-
ferent groups – boys and girls, for example, or different ethnic groups – may also
provide evidence. Alongside construct under-representation, the second threat to validi-
ty identified by Messick (1989) is construct-irrelevant variance. For example, if a
reading passage contains subject matter that is more familiar to one group than others,
the responses to test items could reflect that familiarity, rather than the knowledge and
thought processes defined by the construct. Correspondingly, scores on the test would
provide information about cultural familiarity rather than about the construct of interest
and inferences about the construct would be invalid.
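One very simple screen for this kind of construct-irrelevant variance is to compare item facility (the proportion answering correctly) across groups; a full analysis of differential item functioning would also condition on overall ability. The sketch below is hypothetical and uses invented data and group labels.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 400
    group = rng.choice(["A", "B"], size=n)   # e.g. pupils more or less familiar with the passage topic

    # Simulate an item whose subject matter advantages group A irrespective of reading ability.
    p_correct = np.where(group == "A", 0.75, 0.55)
    item_correct = rng.random(n) < p_correct

    for g in ("A", "B"):
        facility = item_correct[group == g].mean()
        print(f"group {g}: facility = {facility:.2f}")
    # A large, unexplained gap would prompt expert review of the item and passage content.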
A third type of validation evidence is provided by statistical analyses of the internal
structure of the test. The definition of the construct should indicate whether the domain
being assessed is homogeneous or consists of a variety of distinguishable but related
elements. A curriculum-related reading test, for example, could address aspects such as
literal understanding of content, grammatical analysis of sentence structures and
authorial techniques such as the use of imagery. Validation evidence might look for greater
consistency between items within each of these elements, rather than across different
ones. Other reading tests, for example those devised to assess word recognition, might
have a more unidimensional structure.
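As a sketch of how such an analysis might proceed, the following code computes an internal-consistency coefficient (Cronbach’s alpha) separately for two hypothesised elements of a test, using simulated item scores. The data, element names and sample size are assumptions made purely for illustration, not output from any real instrument.

    import numpy as np

    def cronbach_alpha(item_scores):
        # item_scores: rows are candidates, columns are items within one element.
        k = item_scores.shape[1]
        item_vars = item_scores.var(axis=0, ddof=1)
        total_var = item_scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    rng = np.random.default_rng(0)
    ability = rng.normal(size=(200, 1))  # a shared 'comprehension' component, simulated

    # Simulated item scores: each item reflects the shared component plus noise.
    literal_items = ability + rng.normal(scale=1.0, size=(200, 6))
    analysis_items = ability + rng.normal(scale=1.5, size=(200, 4))

    print("alpha, literal comprehension items:", round(float(cronbach_alpha(literal_items)), 2))
    print("alpha, analysis items:", round(float(cronbach_alpha(analysis_items)), 2))

Higher consistency within each element than across elements, or a factor analysis supporting the hypothesised dimensions, would contribute to this line of validation argument.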
There is a strong tradition of validation studies based on correlation with other vari-
ables and this type of evidence constitutes a further potential line of argument. The
scores on a given test should relate strongly to variables that represent the same or sim-
ilar construct, providing ‘convergent’ evidence and should relate weakly to variables
defined as irrelevant to the construct – ‘discriminant’ evidence. In these investigations,
new tests under development may be correlated with a criterion measure. In the devel-
opment of standardised reading tests, evidence is often obtained from a sample who
take the new test together with an established test of the same construct; a high correla-
tion is desirable in these cases. In curriculum-based test development, a strong
correlation between test scores and teachers’ ratings of the attainment of the same pupils
would stand as validation evidence.
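A minimal sketch of this kind of correlational evidence is given below, using simulated scores rather than real test data; the instruments named (a new reading test, an established reading test and an unrelated mathematics test) are hypothetical.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 500
    reading_ability = rng.normal(size=n)            # simulated underlying construct

    new_test = reading_ability + rng.normal(scale=0.5, size=n)          # new reading test
    established_test = reading_ability + rng.normal(scale=0.5, size=n)  # same construct
    maths_test = rng.normal(size=n)                                     # unrelated construct

    convergent = np.corrcoef(new_test, established_test)[0, 1]
    discriminant = np.corrcoef(new_test, maths_test)[0, 1]
    print(f"convergent r = {convergent:.2f}   (expected to be high)")
    print(f"discriminant r = {discriminant:.2f}   (expected to be low)")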
The final category of validation evidence listed in the Standards
(AERA/APA/NCME, 1999) relates to the consequences of test use. This area has been
the subject of lively debate in recent years, for example filling an entire issue of the
journal Educational Measurement: Issues and Practice in 1997 (Volume 16, number 2).
Since the validation of a test provides evidence that it is suitable to be used for a specific
purpose, the consequences that ensue when it is used for that purpose become relevant
to its validity. A clear example of this is found in high-stakes curriculum related tests.
Where there are important consequences of test scores in terms of targets or penalties,
teachers and others are likely to find ways of maximising performance on the specific
content and format of the test’s items. An over-emphasis on this could lead to a distor-
tion of the curriculum, where only what is tested is taught. The consequences of testing
are therefore a legitimate part of validation evidence. However, test results can be used
for many different purposes, some of them unintended and unforeseeable and the debate
has raged over the scope of the concept of validity and the extent of the test developer’s
responsibilities in these cases. In the example, it is evident that a test can only have this
effect of narrowing the curriculum if it under-represents the construct in significant
ways. The mainstream view of consequential evidence, as expressed in the current Stan-
dards (AERA/APA/NCME, 1999), is that it is only relevant to validity where it can be
traced back to construct under-representation or construct-irrelevant variance. Broader
consequences may be investigated and are of interest in informing policy decisions, but
are not strictly a part of validity.
Thus contemporary theorists regard validity as a unitary concept in which the test’s
relationship to the construct is central. Different lines of argument and different types of
evidence can be brought to bear in validating a test, but in all cases these arguments and
evidence serve to illuminate and justify the relationship between test and construct. This
can be depicted in a simple diagram, shown in Figure 2.1, where the lines of argument
provide evidence about the relationship between concrete performances on a specific
test instrument and the abstract ideas that make up the construct of interest.
This unitary concept differs from earlier accounts of validity in the psychometric lit-
erature (for example, AERA/APA/NCME, 1966). Previously, distinctly different types
of validity were identified. ‘Content validity’ corresponded to the first type of evidence
described above. Then, ‘criterion-related validity’ looked at the relationship between
the test and an external criterion. This was divided into ‘concurrent validity’ where the
criterion measure was administered or collected at the same time as the administration
of the test and ‘predictive validity’, where a criterion measure was collected later, in the
form, for example, of subsequent examination success. Correlational studies were
prominent in the validation literature as a result. Finally on this traditional account,
‘construct validity’ was evaluated by investigating the qualities measured by the test
and relating this to theoretical constructs.
The contemporary conception of validity is more complex and flexible and corre-
spondingly more powerful, including all the ideas covered by the previous subdivisions
but going beyond them and uniting them.
Figure 2.1 Diagrammatic representation of construct validation: framed by the purpose of assessment, lines of validation argument link the construct (abstract, general, complex) to test performance (concrete, specific, simple).
It is the responsibility of the test developer, in
conjunction (where appropriate) with the test user, to define the purpose and circumstances
of the use of the test or assessment and to delineate the construct being addressed. In the
light of this, appropriate lines of validation argument must be identified and relevant
evidence collected. This unified concept means that a very broad range of types of
assessment can be encompassed, going beyond traditional paper-based short-item tests
to include assessments of extended performances, practical skills, computer-based
interactions and portfolios of performance, for example. The types of evidence relevant
to a diagnostic individual test for use by educational psychologists are likely to be very
different from those for a high-stakes curriculum related test and different again from an
observational checklist of the behaviours of young children. Yet this over-arching
analysis of construct validity can be applied to all of these.
Published writing on validity in the UK, the location of the seminars on which this
book is based, is generally more variable and lacks the spirit of lively debate found in
the USA, where all the writing described above took place. There is some work investi-
gating particular instances of validity argument (see, for example, Stobart, 2001) and
definitions of validity can sometimes be found in the literature. However, there is little
discussion addressing the definition itself, or developing it beyond what has already
been established. Gipps (1994) gives a good overview of validity as a unified concept
and of the debate over consequences. A recognition of recent ideas about validity is also
apparent in the work of Wiliam (see, for example, 1993, 1996). Other authors, however,
show little acknowledgement of the recent evolution of ideas. Harlen (2004) gives a
supposedly authoritative review of research into the validity of assessment by teachers
for summative purposes. Yet her introductory definition largely relies on traditional con-
cepts of different types of validity: ‘content’, ‘face’, ‘concurrent’ and ‘construct’
validity. Messick’s ideas are mentioned briefly in this review, in the context of conse-
quential validity, but the mainstream of scholarship on the unified concept of construct
validity is referred to in just half a sentence as ‘a view held by many’ (Harlen, 2004,
p.13). Similarly, Stuart and Stainthorp (2004) offer a review of 17 currently available
British reading tests, but refer only to the traditional notions of ‘construct validity’ and
‘content validity’ and to correlational studies in their discussion. It is clear that thinking
on validity elsewhere in the world offers a more comprehensive and dynamic background
than that in the UK, both with respect to reading and more widely.
Recent work in the USA does not represent a complete consensus, however. For some
commentators, the range of lines of argument and types of evidence within this main-
stream definition of validation remains too narrow. This challenge is part of the
far-reaching debate about the nature of social science: the psychometric tradition with
its emphasis on quantifiable evidence versus the interpretative tradition in which human
meaning takes on central importance. Although this debate may seem abstruse, it has
important implications for the range of reading assessments covered by this book.
A key exponent of the case for the broadening of validity theory is Moss (1992,
1994, 1995, 1996, 1998, 2003), who argues strongly for the inclusion of evidence based
in interpretative research. Essentially, this approach recognises human individual and
social interpretation of contextualised meaning as a central and valid source of research
evidence, rather than relying on a notion of scientific objectivity. She does not wish to
reject the mainstream account of validation evidence, but argues that, in ignoring
human meaning and judgement, it is too narrow.
One consequence of this view is to place Moss amongst those arguing for a stronger
role for consequential evidence in validation. When test scores are communicated to
different stakeholders – teachers, students, parents – mainstream theories assume a neu-
tral, authoritative, shared understanding of their meaning. But when this
communication is analysed in terms of the interpretative theories of discourse analysis
and social psychology, the actual ways in which scores are interpreted by different peo-
ple become relevant. Moss’s argument is that these individual meanings should be
regarded as part of the consequences of test use and part of the validation evidence
(Moss, 1995, 1998).
Moss explains how research in the hermeneutic tradition can lead to the inclusion of
a wider range of judgements in arriving at an evaluation of validity. She characterises
this hermeneutic approach as follows:
… a holistic and integrative approach to interpretation of human phenomena that
seeks to understand the whole in the light of its parts, repeatedly testing interpreta-
tions against the available evidence until each of the parts can be accounted for in a
coherent interpretation of the whole.
(Moss, 1994, p.7)
Included in this evidence might be a teacher’s specialised knowledge of the context in
which a portfolio of work was produced, personal acquaintance with the student and
expert knowledge of the subject matter. The expert dialogue that takes place between ref-
erees for academic journals, or examiners of doctoral theses, is better characterised in this
way than by drawing upon the standardised and de-personalised approach of mainstream
psychometric theories. Moss argues, not for an abandonment of these theories, but for an
open-minded examination of various lines of evidence, from a variety of theoretical per-
spectives, in order better to account for the complexity of the validation process.
Moss’s constructive questioning of established approaches to validation is particular-
ly important for the consideration of assessments that do not take the form of traditional
tests. These ideas have found an application in the area of formative classroom assess-
ment (for example, Black and Wiliam, 1998; Moss, 2003; Stiggins, 2001). This refers to
the ongoing judgements made by teachers, with stress on the value of informative feed-
back in the course of ongoing teaching. The information obtained from informal,
day-to-day assessment is used by the teacher to provide better-focused teaching. It can
also be used by the learner as a powerful tool for improvement. If pupils are able to
monitor their own learning in this way, rather than relying on the teacher and other out-
siders, they can play an active part in planning their own learning experiences. Black
(2003) argues that significant learning gains arise when pupils become autonomous,
self-monitoring learners in this way. In this classroom use, the assessment is very infor-
mal. The evidence can be entirely ephemeral, such as a pupil’s answer to a teacher’s
question, or take the form of feedback comments on a pupil’s written work.
Moss’s argument is that mainstream validation studies are inadequate for the analy-
sis of such assessment, for a number of reasons. Unlike a formal test, classroom
assessment is not a discrete activity; it is integral to the continuing interaction within the
learning environment, so that it is difficult to isolate a ‘performance’ that can be related
to the construct. Similarly, there are problems when one attempts to define a given set of
inferences that should be derived from the test performance; instead, interpretations are
fleeting and provisional and not always recorded. Classroom assessment is individual
rather than standardised across groups and the contextual factors make each individ-
ual’s response different. Consequences of assessment in this fluid context take on
central importance; both intended and unintended effects result from individual inter-
pretations and shape future learning. She argues that little sense can be made of the idea
of aggregating these individual judgements into an overarching ‘score’ to summarise
attainment; rather, the informal judgements and comments themselves constitute the
validity of this assessment approach. In these circumstances, Moss argues, the
hermeneutic approach demonstrates its worth, whereas the mainstream approach fails.
Similar arguments are developed by Wiliam (2004) and Stiggins (2004).
These ideas apply to the assessment of reading in many different ways, some of which
are worked out in more detail in other chapters of this book. Reading itself is an act of
interpretation and individuals may make different but valid interpretations of what they
read. Assessments of reading include some item-based tests that fit quite comfortably
into the mainstream of validation argument. But in other cases, there is an evident ten-
sion between the openness necessary for recognising individual interpretations and the
standardisation required by the mainstream tradition. In the study of literature by older
students, postmodernism rules out any single correct interpretation. And much reading
assessment is informal, taking place from day to day in the classroom in spontaneous
exchanges between teacher and pupil. In all of these contexts, the interpretative tradition
of research may have more to contribute than mainstream psychometrics.
It is not currently clear whether these challenges to the mainstream view of validity in
the USA will result in a consensus on a broader definition or lead to fragmentation with-
in the field. For the purposes of this chapter, however, it is possible to see how the
differing perspectives can be brought together into a unified, though multifaceted, view
of the issue.
For this book is mainly concerned with a construct – the construct of reading – and its
reflection in different assessments. Because the focus is on the construct rather than the
assessment, no approach should be arbitrarily excluded and this book includes chapters
on computer-based assessment and formative classroom assessment alongside more tra-
ditional item-based tests. Centring the discussion on the construct itself, however, makes
it possible to see how the two sides of the debate described above can be integrated.
Many chapters of this book describe reading tests on a fairly traditional model:
whether the focus is on word recognition, comprehension or literary understanding, the
test consists of a standard set of requirements, with marks and scores awarded for per-
formance. For all of these instruments, the mainstream account of validity can be
brought to bear. A construct can be delineated and lines of evidence established to sup-
port the use of the test in giving information about that construct. These constructs exist
as complex, abstract and general notions in answer to the question ‘What does it mean
to be a reader?’ Sometimes they are described exhaustively in words; in other cases they
are inferred from curriculum documents or from briefer descriptions in the test docu-
mentation. But constructs exist only as ideas and it is this quality that makes possible a
rapprochement between formal and informal assessments.
The construct embodied in a test is an idea frozen in time, set out in written form as
a fixed point. But ideas can also be viewed as reflected in human consciousness, in a
more dynamic, interactive form. And in this form, a construct of reading can be found in
the consciousness of every teacher. It is this individual, personal construct that allows
teachers to plan literacy lessons. It is also the teacher’s personal construct that gauges
questioning and feedback to make the ephemeral judgements that constitute classroom
assessment. When teachers help students to understand the criteria for successful per-
formance in their work for formative purposes, they are attempting to make available in
the learner’s consciousness a fuller, more specific and detailed version of that same con-
struct. Wiliam (2004) and Marshall (2004) make a similar point when both identify a
need for a community of practice to underpin a shared construct of quality in a subject,
as a necessary requirement for formative assessment.
The arguments of the validity debate can be seen as two sides of a single coin. The
relationship between the static construct of reading in a test and dynamic constructs of
reading in people’s heads is a two-way one, in which neither could logically exist with-
out the other. The construct of a test has itself been derived interactively, through
reviewing related research and discussion with colleagues and experts. The individual
teacher’s construct has been built up through contact with established, published, theo-
ries. Indeed, the entire world of human understanding must be seen in this way, both
contributing to and derived from individual instances of interaction (Sainsbury, 1992).
This review of validity theory reinforces the importance of defining the relevant con-
struct, whether embodied in a formal test or informing an exchange between teacher and
pupil. It is this centrality of the need for construct definition that gave rise to a seminar
series focused entirely on the construct of reading. All of the chapters of this book
explore aspects of this topic in depth, but it is possible to sketch out here some of the
broad features that can be distinguished in the construct of reading. Haertel (1985) gives
a helpful analysis of what he calls ‘achievement constructs’, which he distinguishes from
the constructs of psychology. Achievement constructs are grounded in curricular theory
on the one hand, and educational psychology on the other.
Assessments, in an educational context, aim to give information about valued educa-
tional outcomes. They aim to tell us how well the pupils have learned what they have
been taught. They are typically about cognitive outcomes, with understanding, knowl-
edge and skill as central elements. Defining an educational construct is likely to involve,
at the very minimum, ideas about the nature of the subject itself, what the pupils have
been taught and what is known about how children learn in that curriculum area. Edu-
cational constructs, as Haertel pointed out, will inevitably be complex constructs, with
both logical and psychological aspects.
Reading is a fundamental educational construct and it is unsurprising that its defi-
nition is difficult. It is a flexible skill rather than a body of knowledge. In outline, it
can be seen to involve, at least, knowledge of language, knowledge of the written
code, the ways in which children learn to read and the difficulties they may encounter.
A consideration of the purposes that are intrinsic to the act of reading brings in aes-
thetic and emotional as well as pragmatic factors for the individual. The social,
philosophical and political context can be seen in the functions fulfilled by reading in
society and the role of literature in cultural life. Like much knowledge, skill and
understanding, the act of reading itself is mostly invisible, consisting of mental
changes that cannot be directly observed, so that evidence about reading has to be
evinced through observable performances of one kind or another.
The diagram in Figure 2.2 depicts a schematic structure for conceptualising this val-
ued educational attribute – an overall construct of reading as it emerges from a variety
of reading tests and assessments.
This envisages four main reading processes: decoding, comprehending, responding
and analysing. The four are nested in the diagram, as there is a substantial overlap
between them. Each of the four ‘layers’ of the construct manifests itself in research, in
teaching and in assessment.
The outer ring, ‘decoding’, recognises that the ability to translate written words into
their spoken form underlies all other reading processes, which are therefore represented
within it. In the outer ring alone are theories addressing the ways in which children
learn to decode text, investigating such areas as phonological awareness, visual memo-
ry and the use of analogy (for example, Adams, 1990; Goswami and Bryant, 1990;
Garrod and Pickering, 1999). The teaching implications of these theories find their form
in phonics, the systematic teaching of phoneme-grapheme correspondences, which
forms a part of most early literacy programmes. In assessment terms, the area of decod-
ing is represented by numerous specialist tests such as the Phonological Assessment
Battery (Fredrickson, 1996) and also general word-reading tests such as Schonell
(1945).
Figure 2.2 The construct of reading: four nested rings, from ‘decoding’ at the outside, through ‘comprehending’ and ‘responding’, to ‘analysing’ at the centre.
Within the outer ring is the second layer, ‘comprehending’. The area that lies within
this ring alone has a relatively small representation in recent research. Here, lexical and
grammatical knowledge is combined with recognising the written form of the word, so
that meaning is attached to the word, sentence or passage. In teaching terms, too, it is
difficult to point to many relevant practices, although the teaching of word and sentence
recognition to beginner readers and the old-fashioned ‘comprehension exercise’ can be
seen as examples. Assessment, by contrast, is strongly represented in this area. There
are many tests of sentence completion and straightforward literal comprehension, for
example the Suffolk Reading Scale (Hagley, 1987) or the Neale Analysis of Reading
Ability (Neale, 1997).
The third of the rings is labelled ‘responding’. This is the process by which the read-
er engages purposefully with the text to make meaning and it underpins most recent
theories of comprehension in cognitive psychology as well as literary theories. The dis-
course comprehension theory of Kintsch (1988), the reader response theories of Iser
(1978) and Rosenblatt (1978) and the constructively responsive reading of Pressley and
Afflerbach (1995) all envisage an active reader, bringing individual world knowledge to
build a personal understanding of the text. Literary theory offers many elaborations of
this process and the postmodern view set out in Harrison’s chapter (chapter 5) in this
book makes it clear that the interpretation of texts is an infinitely varied and flexible
process. In current teaching terms, this is text-level and literary knowledge. The early
stages are taught by shared and guided reading, in which the teacher models the
processes of making sense of ideas, themes, plot and character. Later, in secondary
school and beyond, it becomes the study of literature.
The fourth ring, ‘analysing’, is related to the same research and theories as respond-
ing. In analysing, the reader steps back from the meaning of the text, and considers it in
relation to the authorial techniques adopted and the literary traditions within which it
was produced. In this activity, by contrast with responding, the literary theories are
explicit and a conscious part of the reader’s understanding.
The reading tests that attempt to assess responsive reading are those that present in
their entirety texts that were written for real purposes and ask thought-provoking ques-
tions for which more than one answer may be acceptable. The key stage reading tests in
England and Wales (QCA, 2004; ACCAC, 2004) and the tests used in the PIRLS
(Campbell et al., 2001) and PISA (OECD, 2003) international surveys are examples of
this approach. For older students, public examinations, coursework and essays assess
this understanding. These assessments also include the ability to analyse, though this is
much more fully worked out for older students than younger ones.
The diagram in Figure 2.2, while representing the overall shape of the construct in sim-
plified form, also illuminates the differing emphases in the ways reading is taught and
tested. One way of seeing the rings is as cumulative, with decoding preceding comprehension, and with response and analysis following later. The alternative view is a holistic one, with
teaching and testing addressing all four layers at once, and it is this latter view that is
embodied in the England National Curriculum and Literacy Strategy. Considering reading
tests for young children, the Suffolk Reading Scale addresses decoding and simple com-
prehension, whereas the national reading tests for children of the same age ask questions
requiring some response to text and the recognition of some obvious authorial devices.
Informal classroom assessment is not located within one section of the diagram, but
may focus upon any of the skills and understandings in any of its rings. Because it is
dynamic and responsive, teachers may shift their attention from moment to moment: a
teacher of young children is likely to be supporting and assessing decoding skills one
minute and response to literature the next. With older students, as well as checking basic
understanding, there may be open-ended discussions exploring different interpretations
and analysing techniques.
Reading by an experienced adult is an activity that is normally characterised by the
‘responding’ category. Word recognition and understanding of vocabulary and grammar
are taken for granted, as the experienced reader reads for a purpose. This may be practical, as in following a recipe; it may be for interest and enjoyment; or it may serve any number of professional and social functions. The reader brings knowledge and experience to the
text and this interaction brings about the meaning that the reader is seeking. These var-
ied meanings are embedded in the personal and cultural experiences of the individual,
so that reading is woven into the very fabric of social life. It is because of the variety of
meanings and purposes that the construct is so complex: reading a bus timetable is dif-
ferent from appreciating War and Peace, but the scope of the construct of reading
encompasses them both.
The assessment of reading unites two intertwined strands of human activity, each of
which has purpose and meaning for individuals and for society. Reading itself is not (or,
at most, not for long) the purely mechanical activity of decoding written signs into spo-
ken words. Its nature is essentially bound up with the fulfilment of purposes,
relationships and actions. The construct of reading is defined by these purposes as much
as by the related skills and understandings.
From validity theory it is clear that assessment, too, has the notion of purpose at its
heart. Validation consists of gathering evidence that an assessment provides the infor-
mation necessary for a purpose, within the classroom, the education system or in society
more broadly. An individual assessment of reading is based on decisions about the
range of reading skills and purposes, drawn from within the overall construct, that it
should include in order to fulfil its assessment purpose.
In the other chapters of this book, the issues surrounding the definition of the con-
struct of reading are worked out in many different ways. At first sight, it may seem that
there is little commonality between them and it is for this reason that an appreciation of
the complexity of the theoretical background is so important. The last quarter-century
has seen a dynamic evolution of the theory of validity, from its positivist roots into a
broad and flexible approach that can be adapted to apply to all forms of assessment,
however formal or informal. Over the same period, discussion of the nature of reading
has not stood still, as cognitive, social, linguistic and literary theories have continued to
challenge one another. Defining the construct of reading draws upon the full range of
ideas from both of these well-developed theoretical contexts.
References
Adams, M.J. (1990). Beginning to Read: Thinking and Learning about Print. Cam-
bridge, MA: MIT Press.
American Educational Research Association, American Psychological Association and
National Council on Measurement in Education (1966). Standards for Educational
and Psychological Testing. Washington, DC: AERA.
American Educational Research Association, American Psychological Association and
National Council on Measurement in Education (1999). Standards for Educational
and Psychological Testing. Washington, DC: AERA.
Black, P. and Wiliam, D. (1998). Inside the Black Box: Raising Standards through
Classroom Assessment. London: School of Education, King’s College.
Black, P. (2003). ‘The nature and value of formative assessment for learning’, Improv-
ing Schools, 6, 3, 7–22.
Campbell, J., Kelly, D., Mullis, I., Martin, M. and Sainsbury, M. (2001). Framework
and Specifications for PIRLS Assessment. 2nd Edition. Boston: International Study
Center.
Frederickson, N. (1996). Phonological Assessment Battery. Windsor: nferNelson.
Garrod, S. and Pickering, M. (Eds) (1999). Language Processing. Hove: Psychology Press.
Psychologists are not the only professionals concerned with reading assessment, so let us
first look at the bigger picture. Figure 3.1 illustrates many of the potential influences that
affect a young person's development in reading. The outer part illustrates the interests of
different disciplines, professionals, parents and others, with some much more connected
to reading than others. The inclusion of some professions, such as politicians, may raise
eyebrows, but politicians are involved at some level because they ultimately control
resources for assessment and have to make choices based on the evidence given to them.
The relationship between assessment and the allocation of resources is an interesting one. It could be argued that if a particular assessment is unlikely to produce much overall improvement then it is not worth consuming resources to undertake it. For example, there appear to be relatively few readers with surface dyslexia, but if assessing for it leads to improvement for even a minority of readers, it may be worth it for their quality of life. This problem is very much like a medical issue in which a rela-
tively large amount of money might be spent on expensive surgery for the benefit of a
relative few. Effect size is an important consideration when looking at the potential
advantages of an assessment. This, too, is an important factor in medicine when the
effect size of a medication can be relatively small. Take for example aspirin, which has an effect size of a low 0.3 of a standard deviation in reducing myocardial infarction; but although small, this can translate into the saving of thousands, if not hundreds of thousands, of lives.
Figure 3.1 A schematic representation of two spheres of influence on the child learning to read. The outer circle represents various agencies, professionals and other influences who have different degrees of effect on reading and the inner part represents internal and external influences on reading. The relative positions of items in the inner and outer circles are not intended to be connected. [Outer-circle labels: psychologists, education, psychometrics, English literature, teaching assistants, parents, politicians, doctors, criminologists, sociologists, economists, media/TV. Inner-circle labels around the child: language characteristics, behavioural, cognition, motivation, gender, personality/attitude, biological, medical, emotional, mood, practice, physical, relations of reading, social/cultural, culture, well-being.]
while E represents environmental factors (such as the influence of the teacher). This
may not get us much further on, but it is at least a summarising principle that implies
that if we know one element we can predict the unknown variable.
One example illustrating the interaction of some of these elements would be the gender
of a child, which does have an influence on reading performance. Girls are more positive
in their beliefs about reading in Grades 1 to 4 (Eccles et al., 1993). In addition, Smart et al.
(1996) found a much stronger connection between reading problems and behavioural
problems in boys compared to girls in Grades 2 and 4 in Australia. Gender illustrates that
several possibilities could be operating, such as the interplay between sex role and culture,
the biological or hormonal basis for sex differences, effects on personality and even the preponderance of female teachers of reading in the classroom.
Despite all these potential influences, psychologists might be forgiven for believing
that psychological influences are the most pervasive, especially when the variance that
can be accounted for in predicting reading development appears mainly to derive from
psychological factors. To give just one example, Awaida and Beech (1995), in a cross-
sectional longitudinal study, tested children and then examined their performance one
year later. Mainly cognitive tests of 5- and 6-year-olds accounted for 79 per cent and 78
per cent of the variance one year later in predicting reading quotient (reading age as a
proportion of chronological age). It was interesting that reading performance itself one
year earlier accounted for most of the variance: a case of the better readers getting better, a point to which we will return later.
We shall now look at what in my view should be good principles of reading assessment
from the perspective of a psychologist. No particular principle is to be recommended in
isolation; it helps if an assessment can satisfy most if not all of the following points.
Without getting too far ahead, it is all very well having an assessment based on a solid
theoretical basis, but if that assessment does not ultimately lead to an improvement of
some aspect of reading skill, then it may be good for ‘blue skies’ research, but not much
use for helping children to improve their reading.
I predict that in most cases it will be found that the construction of many reading
tests was not guided by an explicit translation of reading theory into testing practice,
but in fact the instruments will have more of an atheoretical, craft-like feel.
(Engelhard, 2001, p.13)
So why is it so necessary to construct theories? The problem with just collecting sub-
stantial quantities of data is that these data in themselves do not provide a complete
picture. We need theories to integrate this information and to give it meaning. Theories
should serve the function of filling gaps in our knowledge and should enable us to make
predictions. A key purpose of an assessment should be to provide an instrument that can
adequately test participants to examine their position within this theoretical framework.
For example, Eysenck (1947) found the orthogonal dimensions of extraversion-
introversion and neuroticism. Individuals can take his subsequent questionnaire (for
example the Eysenck Personality Questionnaire) and it provides a measure of their
relative position within a plot of these axes.
As a brief illustration of how a reading theory can be the basis for an assessment, one
might be interested in constructing an assessment based on traditional reading stage
models (for example, Frith, 1985; Marsh et al., 1981). This illustration should not be
seen as an endorsement of the model, which has its critics (for example, Goswami,
1998; Stuart and Coltheart, 1988). Frith’s well-known model divides reading develop-
ment into three main stages of logographic, alphabetic and orthographic processing.
One could develop a test instrument that measures the level of attainment in each of
these stages. The result might be that in a group of early readers the instrument would
show that the great majority of children were at the logographic stage, and a few were
already within the alphabetic stage. As one moves through increasingly older groups of
readers these proportions should change, so that in a much older group the majority
were at the orthographic stage with a possible few still predominantly in an alphabetic or
even a logographic phase. This is a hypothetical example, but serves to illustrate how
assessment could be integrated with theory.
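To make this hypothetical instrument concrete, a minimal Python sketch is given below; the subtest names, the 0–1 scoring scale and the 'highest score wins' rule are all invented for illustration and are not drawn from Frith's model or from any published test.

# Hypothetical sketch: locating a reader within Frith's three stages.
# Subtest names, scores and the decision rule are illustrative assumptions only.
def classify_stage(logographic: float, alphabetic: float, orthographic: float) -> str:
    """Return the stage with the strongest performance (scores on a 0-1 scale)."""
    scores = {
        'logographic': logographic,    # e.g. whole-word visual recognition items
        'alphabetic': alphabetic,      # e.g. letter-sound decoding of simple nonwords
        'orthographic': orthographic,  # e.g. rapid recognition of larger spelling units
    }
    return max(scores, key=scores.get)

# A group of early readers would be expected to cluster in the logographic stage:
print(classify_stage(logographic=0.8, alphabetic=0.3, orthographic=0.1))  # 'logographic'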
Some might point to dyslexia as an example of an integration of theory and assess-
ment. The term ‘dyslexia’, although its definition varies, basically refers to a deficit in reading relative to underlying intelligence. The model relies on the
intelligence test, which can measure general intelligence, or ‘g’ as Spearman (1904)
called it. General intelligence is significantly correlated with reading vocabulary level,
such that bright children generally are better readers (for example Rutter and Yule,
1975). The argument runs that because of individual differences in the development of
g, a better way of assessing reading is to look at reading development as a function of
general ability rather than by chronological age.
At present the Wechsler Objective Reading Dimensions (WORD) (Rust et al., 1993)
test in conjunction with the WISC-IV can be used not only to provide age-related norms but also to assess reading (and spelling) performance in relation to general intelligence. The
WISC-IV comprises four subcomponents: verbal comprehension, perceptual reasoning,
working memory and processing speed, from which a full scale IQ measure can be com-
puted. The WORD test provides statistical tests to allow the assessor to find out if a child
is reading at a level that is significantly below that predicted from IQ. The same test can
be undertaken for reading comprehension. This reading comprehension test involves the
child silently reading a passage. (Although the instructions allow children to read aloud
if they wish, in my experience they read silently.) They are asked a single question after
each passage which is scored according to a template of correct possibilities.
One of the major justifications of using the IQ test is that it is an important compo-
nent of determining dyslexia. Children with dyslexia are reading at a level below that
expected based on their IQ. By contrast, some children will be poor readers but reading
at a level that would be expected on the basis of their IQ. This is because there is con-
sidered to be a level of low intelligence below which most children will perform below
average reading level. Such children have general learning difficulties and would be
described as reading backward, compared to those above this benchmark level. Reading at a level significantly below that predicted from IQ is referred to by Yule and colleagues as specific reading retardation (Yule et al., 1974); using the criterion of 2 standard errors below the expected reading score, they found that 3.1 per cent of 10-year-olds on the Isle of Wight were in this category.
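As a minimal sketch of how such a discrepancy criterion operates (the regression intercept, slope and standard error below are invented for illustration and are not taken from the WORD manual or from Yule et al.), predicted reading is derived from IQ and a reader is flagged when the observed score falls more than the chosen number of standard errors below that prediction.

# Illustrative sketch of an IQ-discrepancy criterion; all numbers are invented.
INTERCEPT = 20.0   # hypothetical regression intercept
SLOPE = 0.8        # hypothetical regression slope (reading score on IQ)
STD_ERROR = 6.0    # hypothetical standard error of estimate

def reading_discrepancy(iq: float, reading_score: float, criterion_se: float = 2.0) -> bool:
    """Flag a reader whose score falls more than criterion_se standard errors
    below the level predicted from IQ."""
    predicted = INTERCEPT + SLOPE * iq
    return reading_score < predicted - criterion_se * STD_ERROR

print(reading_discrepancy(iq=110, reading_score=100))  # False: within 2 standard errors of prediction
print(reading_discrepancy(iq=110, reading_score=90))   # True: more than 2 standard errors below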
Unfortunately there are some problems with this particular approach. One problem is
that it seems that when defining boundaries for dyslexia there is a certain degree of insta-
bility. Shaywitz et al. (1992) used the criterion of 1.5 standard errors and found that by
third grade, of those who had been defined as having dyslexia at the end of their first
grade, only 28 per cent were left. This sort of figure does not inspire confidence in using
dyslexia for the purposes of assessment. Another problem is that it takes a long time to
test children (each one is tested individually) using these tasks. Also, it can lead to appar-
ently bizarre outcomes, such as a very intelligent child being found to be significantly
impaired in reading, but having the equivalent level of reading as a child of average IQ
and average level of reading. Using these criteria the very intelligent child qualifies for
extra training in reading. Because of the difficulties in measuring intelligence (for exam-
ple, the length of time involved in testing, the historical controversies connected with IQ
testing, conceptual difficulties defining intelligence, and so on) this is not something that
is normally undertaken in the classroom. Nevertheless, the assessment of general ability
is an important tool currently used by educational psychologists. But is it justified? I
would suggest that it does not add much that is useful and might be considered to be low
in terms of information gained for the amount of time spent in assessment, at least in the
context of reading assessment. This is because current experimental work indicates little
justification for arguing that dyslexia (defined as looking at the discrepancy between
reading level and reading predicted by full scale IQ) is a special condition (for example,
Share et al., 1987; Fletcher et al., 1994; Stanovich and Siegel, 1994).
Perhaps a more useful model of reading is dual route theory (for example, Castles
and Coltheart, 1993; Coltheart, 1978). This is not exactly a developmental model of
reading, but proposes that we have a lexical and a phonological route for reading. The
lexical route is a direct visual route and evokes a dictionary entry of the word, whereas
the phonological route involves coding letters (or graphemes) that correspond to
phonemes. These phonemes are blended to provide a pronunciation. This phonological
route enables decoding of nonwords and regularly spelled words, but irregularly spelled
words are more likely to be decoded by the lexical route. Dual route theory is particular-
ly useful in providing a basis for the interpretation of acquired dyslexias (for example,
Patterson et al., 1985).
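A deliberately crude sketch of the two routes is given below; the toy lexicon and grapheme-phoneme correspondences are invented, and real computational implementations of dual route theory are far more elaborate than this.

# Toy illustration of dual route reading aloud; lexicon and rules are invented.
LEXICON = {'yacht': 'yot', 'have': 'hav', 'cat': 'kat'}                  # lexical (direct visual) route
GPC_RULES = {'c': 'k', 'a': 'a', 't': 't', 'v': 'v', 'h': 'h', 'e': ''}  # phonological route

def read_aloud(word: str) -> str:
    """Known words are retrieved whole from the lexicon; unknown words and
    nonwords are assembled letter by letter via grapheme-phoneme conversion."""
    if word in LEXICON:
        return LEXICON[word]                                              # lexical route
    return ''.join(GPC_RULES.get(letter, letter) for letter in word)      # phonological route

print(read_aloud('yacht'))  # 'yot' - irregular word, read by the lexical route
print(read_aloud('tav'))    # 'tav' - nonword, assembled by the phonological route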
Dual route theory affords a means of assessing developmental phonological and sur-
face dyslexia. For example, Bailey et al. (2004) tested 5th Grade dyslexic readers and
split them into subgroups of surface and phonological dyslexics, based on their per-
formance reading nonwords and exception words. These scores were converted to z
scores and non-word scores were subtracted from exception word scores. For example,
if a child was better at reading non-words than exception words, this would be the profile of a surface dyslexic, since such a child has greater proficiency in using the phono-
logical route. This formula provided a gradation in performance from extreme surface
dyslexia at one end of the scale to extreme phonological dyslexia at the other. Bailey et
al. then chose children above and below the top and bottom 25th percentiles, respective-
ly, to produce a group of surface dyslexics and a group of phonological dyslexics. The
children in these two groups were matched on their level of word identification. The
children then undertook a training study to examine how well they learned nonwords
with either regular pronunciations or irregular pronunciations. Bailey et al. found that
phonological dyslexics have a specific phonological deficit. Such children develop
differently from normal younger readers of equivalent reading age.
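A minimal sketch of this subgrouping step is given below; the scores are invented and the exact norming and cut-offs used by Bailey et al. are not reproduced.

# Illustrative sketch of the surface-phonological continuum: z(nonwords) - z(exception words).
from statistics import mean, stdev

nonword = [12, 18, 25, 30, 8, 22, 15, 27]      # invented nonword reading scores
exception = [28, 20, 14, 10, 30, 16, 24, 12]   # invented exception-word reading scores

def zscores(xs):
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

# Positive differences indicate relatively better nonword reading (a surface profile);
# negative differences indicate relatively better exception-word reading (a phonological profile).
diff = [zn - ze for zn, ze in zip(zscores(nonword), zscores(exception))]
ranked = sorted(range(len(diff)), key=lambda i: diff[i])
quarter = len(diff) // 4
phonological_group = ranked[:quarter]   # most negative differences
surface_group = ranked[-quarter:]       # most positive differences
print(surface_group, phonological_group)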
Conversely, surface dyslexics were much more similar to normal younger readers
and seem to have a different type of deficit. Studies such as these are valuable in show-
ing that it is not useful to assume that those with dyslexia form a homogeneous group. But we must be clear that, although this study shows important differences between the two types of reading deficit and helps us to understand the underlying mechanisms better, it does not as yet illuminate how these two groups might
benefit from differential training programmes. We will examine one study that attempts
to undertake a specific programme for surface dyslexics in the next section.
The Perry Preschool Project in the USA – later called High/Scope – was an early inter-
vention study that included heavy parental participation. Key features were that it tracked
from pre-school to nearly 30 years later and it had random allocation to treatment and
control groups. It has been shown from this particular project that for every $1000 invest-
ed in such an intervention, $7160 (after inflation) is returned in terms of fewer arrests,
higher earnings (and hence taxed income), reduced involvement with social services and
so on (Schweinhart and Weikart, 1993). The impact of training is clear, not just economi-
cally, but in terms of children fulfilling their potential (we included the involvement of
criminologists, sociologists, politicians and economists in Figure 3.1). The economic
costs of assessment of aspects of reading and then targeted subsequent training may well
have a similar abundant return not only for the children involved but for society as well.
This would be much more a ‘magic bullet’ approach than a blunderbuss one, and it just might put arguments across in a way that politicians can appreciate.
So where in particular can we point to an assessment followed by training making an
impact? Bradley and Bryant (1983) published what has become a classic
study, in which pre-reading children were assessed and selected on the basis of their poor
phonological skills. These children were put into various types of training programmes
and they found that training phonological awareness, and training in letter-sound connec-
tions in particular, had a significant impact on future reading performance relative to
controls who had been given the same amount of attention, dealt with the same materials,
but had a semantic type of task. According to Bradley (1987), the relative differences in
reading ability between their different training groups persisted even when the children
were ready for secondary school. The implication here is that assessing poor phonological
skills with a view to subsequent training can be an aid to improved reading. Subsequent
studies have refined and built on this work (for example, Wagner and Torgesen, 1987).
It turns out that training phonological awareness can be relatively rewarding in rela-
tion to the resources expended. Wagner et al. (1993) in a meta-analysis, showed an
effect size of 1.23 on phonological awareness after nine hours of training and Foster et
al. (1994) in a computerised training study of phonological awareness took only 4.5
hours to achieve an effect size of 1.05 standard deviations. Several studies have shown
that training phonemic awareness significantly improves subsequent reading (Ball and
Blachman, 1991; Barker and Torgesen, 1995; Cunningham, 1990; Kjeldsen and Abo
Akademi, 2003; Lundberg et al., 1988; Torgesen et al., 1992).
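The effect sizes quoted here are standardised mean differences; as a minimal sketch of the calculation (with invented group data rather than the data from the studies cited), Cohen's d can be computed as follows.

# Illustrative effect size calculation (Cohen's d with a pooled standard deviation).
from statistics import mean, stdev

trained = [14, 16, 18, 15, 17, 19, 16, 18]   # invented post-training phonological awareness scores
control = [12, 13, 15, 11, 14, 13, 12, 14]   # invented control-group scores

def cohens_d(a, b):
    na, nb = len(a), len(b)
    pooled_sd = (((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled_sd

print(round(cohens_d(trained, control), 2))  # a positive d favours the trained group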
Such programmes of training should normally be useful for phonological dyslexics
(although some find it very difficult to improve their phonological skills, for example
Torgesen and Burgess, 1998), but what about surface dyslexics? A recent Italian study
(Judica et al., 2002) compared two groups of surface dyslexics one of which was trained
in simply reading briefly presented words, the idea being to force processing away from
the serial processing of graphemes. This training had the desired effect in that eye fixa-
tions on words were shorter and word naming times were faster. However, there was no
improvement in comprehension relative to controls. This is perhaps a crude approach
for the treatment of surface dyslexia, but there do not seem to be many training pro-
grammes available for them at the moment and although we can assess the surface
dyslexic fairly accurately, this does not get us much further on.
Although there is now strong evidence for the efficacy of phonological training in improving reading, there are still many sceptics who believe that children need to discover for themselves the magic experience of getting lost and totally engrossed in reading. These are advocates of what is sometimes known
as a ‘top-down’ approach to reading. Here they emphasise the importance of compre-
hension rather than concentrating on learning how to decode words. A whole-language
theorist such as Reid (1993) is critical of instruction on how to decode words that is
taken out of context. It is fair to say that there has been a paradigm war between an
emphasis on decoding skills on the one hand and an emphasis on whole language on the other (see Stanovich and Stanovich, 1995 and Pressley, 1998 for further discussion).
Both paradigms would be able to explain evidence that children who read a lot tend
to be better readers. Stanovich (1986) attracted a lot of citations by coining the phrase
‘the Matthew effect’ to describe this. He and Cunningham (Cunningham and Stanovich,
1991; Stanovich and Cunningham, 1992) used recognition tests of authors and book
titles to show that better readers recognised more book authors, presumably because
they had read more books. Such tests could be useful as indicators of the extent to which
children are immersing themselves in reading.
It might be (mistakenly) believed from this that one step forward is to encourage chil-
dren to use context in reading to help them progress through a passage of text. It might
help them identify difficult words as well as encourage semantic processing. However,
context is not actually all that useful as the probability of guessing the correct word on the
basis of context is quite low (for example, Perfetti et al., 1979). Furthermore, good read-
ers do not use word context more than poor readers to help with their word recognition. In
reaction-time word recognition studies, where participants are primed with context beforehand, poor readers actually show greater context effects (for example, Briggs et al., 1984;
West and Stanovich, 1978). Stanovich and Stanovich (1995) argue that these and other
findings pose a problem for top-down theorists. Good readers have achieved automaticity
in their word recognition skills to the extent that they do not need context. By contrast,
poor readers try to use contextual information due to weak decoding skills, but this does
little to help their plight. An assessment test evaluating children’s skill in the use of
context while reading under these circumstances is not likely to be useful.
To conclude this section, it can be argued that designing a particular assessment for
reading before knowing what kinds of training are going to work is putting the cart before the horse. It would be better to
start with finding a type of training that works particularly well for a selected group and
then developing an assessment instrument that could be employed for larger-scale
usage, to find children from the general population who would benefit from this inter-
vention. Spear-Swerling and Sternberg (1998) have much useful and pragmatic advice
for those constructing intervention programmes. This includes first giving priority to the
child’s immediate success in reading and then building on this; second, the profession-
als involved must have the conviction that all children can learn to read if they are instructed appropriately; and third, there needs to be strong support for the teachers
involved in the programme and they should have plenty of opportunity for giving feed-
back. As far as the phonological awareness programmes are concerned, however,
Spear-Swerling and Sternberg note that a significant number of children do not respond
to such training (for example, 30 per cent in the Torgesen et al. study) and that there is a
need to explore the effects of training other types of related skill.
underlying statistical basis for making measurements or analysing data. Messick pro-
vides a more detailed definition: ‘Theories of measurement broadly conceived may be
viewed as loosely integrated conceptual frameworks within which are imbedded rigor-
ously formulated statistical models of estimation and inferences about the properties of
measurements or scores’ (Messick, 1983, p.498). Engelhard (2001) proposes a relation-
ship between measurement theory and reading theory, which is partly represented by
Figure 3.2. In an ideal world he believes that reading tests are affected by measurement
theory and reading theory and in turn, measurement theory and reading theory should
interact with each other for mutual development. In this section we are going to explore
his ideas further.
Charles Spearman (1904) – in the same paper referred to earlier – began what is now called classical test theory with the basic idea that an obtained score (X) is equal to the sum of a true score (Xt) and a variable error (e), which can be positive or negative, shown thus:
X = Xt + e
The error component can be derived from problems with the actual test, the partici-
pant’s physical condition, error in scoring, error due to time of day and so on.
Subsequent statistical derivations are based on this simple beginning. Estimating this
error has led to many different formulations of reliability, such as the Kuder-Richardson
method. Further information on test construction for reading using classical test theory,
such as the use of reliability (internal consistency, test-retest, etc), can be found in
Beech and Singleton (1997).
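As an illustration of the kind of internal-consistency estimate mentioned above, a minimal sketch of the KR-21 formula is given below; KR-21 assumes items of roughly similar difficulty, and the response matrix is invented.

# Illustrative sketch of Kuder-Richardson formula 21 for a dichotomously scored test.
from statistics import mean, pvariance

# Each row is one child's right/wrong (1/0) responses to the k items; data invented.
responses = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 0, 1, 1],
]

def kr21(matrix):
    k = len(matrix[0])                      # number of items
    totals = [sum(row) for row in matrix]   # each child's total score
    m, var = mean(totals), pvariance(totals)
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * var))

print(round(kr21(responses), 2))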
Unfortunately there are many problems with this approach (as outlined by Engelhard). One difficulty is the ‘attenuation paradox’ (Loevinger, 1954) in which, contrary to what one might expect, increasing test reliability eventually decreases test validity (Tucker, 1946). One way a tester can achieve higher reliability is by maximising test variance. Schumacker (2003), taking this to its extreme (and with tongue firmly in cheek), notes that maximum test variance is produced when half the participants score
zero and the other half are at ceiling.
Figure 3.2 Part of Engelhard's conceptual framework (2001) for the assessment of reading: measurement theory and reading theory each feed into the reading test and, ideally, interact with each other.
Perhaps a more understandable situation is where one has run a new test, which produces a certain level of reliability on the Kuder-
Richardson (KR21). One can then experiment by eliminating items to increase the
inter-item correlations. But the problem with this process as one proceeds is that one
potentially ends up with items so well correlated that in effect they reduce to one item.
This process obviously reduces the validity of the test. Kuhn (1970) proposed that confrontation with such a paradox marks the beginning of a theory's downfall, making way for another paradigm. In the view of Schumacker (2003), Engelhard (2001) and others, the paradigm that should replace classical measurement theory is the Rasch measurement model. Without going into the detail of the rationale of the Rasch model, this method maximises both reliability and validity by removing the extreme positively and negatively discriminating items, so that the same problem does not arise. Item elimination should be accompanied by careful qualitative scrutiny.
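For reference (the chapter does not spell the model out), the dichotomous Rasch model expresses the probability that person n answers item i correctly purely in terms of the difference between the person's ability and the item's difficulty:

P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}

When data fit this model, persons and items can be located on a single underlying scale, which is the basis of the extra construct information referred to below.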
Engelhard goes on to note the strong reliance of traditional tests of reading on reliability and on norm-referencing rather than on validity. One might conceivably have a highly reliable measure of reading that is nevertheless of doubtful validity. There are some Rasch-calibrated reading tests, such as the Woodcock Reading Mastery Test (1973, 1998), which Engelhard believes was the first diagnostic reading test to make use of Rasch measurement. He notes that Rasch measurement can provide extra informa-
tion on construct validity and hopes that future reading test developers will make use of
these advantages. One of his conclusions is that so far reading theories have not usually
determined the construction of reading tests and that to improve quality we need to ensure
a much closer association between the two. He also notes that, from an historical perspec-
tive, in the early days of the developments in measurement theory reading theorists had a
very close involvement. One outstanding example of this was the work of E.L. Thorndike
who was innovative both in measurement theory and reading theory. However, since then
the two fields seem to have drifted apart. This is not to say, however, that teams of experts from the two domains could not cooperate with each other in the future.
Conclusion
A huge amount of world-wide effort goes into the creation of reading assessments and
even more effort into the subsequent testing and evaluation of children and adults. But is
it all worth it? At this stage in the development of research in reading we have some well-
developed theories and have collected a great deal of experimental evidence. It seems
that there is a gap between this work and the development of properly constructed assess-
ment tests. Furthermore, despite all this effort, there does not seem to be enough energy
being put into undertaking well-constructed studies that select particular individuals
according to their profiles and then give them specialised training to improve their
skills. This selection would be based on psychometric assessments that are well con-
structed. The final figure (Figure 3.3) attempts to put this idealised process together.
Figure 3.3 An idealised conceptual framework in which the assessment of reading leads to
selection of a sub-population followed by training of that group. The results inform further
development or refinement of the theory in a continuous cycle.
[Figure 3.3 boxes and arrows: Reading theory and Measurement theory feed into Assessment of reading, which leads to Selection and then Targeted training, with the results feeding back into theory.]
This shows how reading theory and measurement theory produce an instrument for
assessment. Once the instrument is appropriately constructed it is used to select individ-
uals for specialised training. This training is monitored and the outcomes can be used to
inform theory leading to further refinements in a continuous process. It can be seen
from this view that assessment without the close involvement of theory and some kind
of training procedure for putative readers is considered to be less than useful.
We started with the role of the psychologist in reading research and its assessment.
Psychologists have a role in terms of their involvement in properly controlled experi-
mental studies in reading, in the construction of appropriate assessment tests, in helping
to conduct experimental studies of training and in the evaluation of outcomes. Psychol-
ogists are part of a wider community of professionals all concerned with helping children
to progress in reading and it is only by closely collaborating with this community that
real progress can be made.
References
Awaida, M. and Beech, J.R. (1995). ‘Children’s lexical and sublexical development
while learning to read’, Journal of Experimental Education, 63, 97–113.
Bailey, C.E., Manis, F.R., Pedersen, W.C. and Seidenberg, M.S. (2004). ‘Variation
among developmental dyslexics: Evidence from a printed-word-learning task’, Jour-
nal of Experimental Child Psychology, 87, 125–54.
Ball, E. and Blachman, B. (1991). ‘Does phoneme awareness training in kindergarten
make a difference in early word recognition and developmental spelling?’ Reading
Research Quarterly, 26, 49–66.
Share, D.L., McGee, R., McKenzie, D., Williams, S. and Silva, P.A. (1987). ‘Further
evidence relating to the distinction between specific reading retardation and general
reading backwardness’, British Journal of Developmental Psychology, 5, 35–44.
Shaywitz, S. et al. (1992). ‘Evidence that dyslexia may represent the lower tail of a nor-
mal distribution of reading ability’, New England Journal of Medicine, 326, 145–50.
Smart, D., Sanson, A. and Prior, M. (1996). ‘Connections between reading disability
and behaviour problems: testing temporal and causal hypotheses’, Journal of Abnor-
mal Child Psychology, 24, 363–83.
Spearman, C. (1904). ‘General intelligence objectively determined and measured’,
American Journal of Psychology, 15, 201–93.
Spear-Swerling, L. and Sternberg, R.J. (1998). Off Track: When Poor Readers Become
‘Learning Disabled’. Boulder, CO: Westview.
Stanovich, K. (1986). ‘Matthew effects in reading: some consequences of individual
differences in the acquisition of literacy’, Reading Research Quarterly, 21, 360–407.
Stanovich, K. and Cunningham, A. (1992). ‘Studying the consequences of literacy with-
in a literate society: the cognitive correlates of print exposure’, Memory and
Cognition, 20, 51–68.
Stanovich, K. and Siegel, L. (1994). ‘Phenotypic performance profile of children with
reading disabilities: a regression-based test of the phonological-core variable-differ-
ence model’, Journal of Educational Psychology, 86, 24–53.
Stanovich, K. and Stanovich, P. (1995). ‘How research might inform the debate about
early reading acquisition’, Journal of Research in Reading, 18, 87–105.
Stuart, M. and Coltheart, M. (1988). ‘Does reading develop in a sequence of stages?’
Cognition, 30, 139–81.
Steering Committee of the Physicians’ Health Study Research Group (1989). ‘Final
report on the aspirin component of the ongoing physicians’ health study’, New Eng-
land Journal of Medicine, 32, 129–35.
Torgerson, C.J., Porthouse, and Brooks, G. (2003). ‘A systematic review and meta-
analysis of randomized controlled trials evaluating interventions in adult literacy and
numeracy’, Journal of Research in Reading, 26, 234–55.
Torgesen, J. and Burgess, S. (1998). ‘Consistency of reading-related phonological
processes throughout early childhood: evidence from longitudinal-correlational and
instructional studies.’ In: Metsala, J.L. and Ehri, L.C. (Eds) Word Recognition in
Beginning Literacy. Mahwah, NJ: Erlbaum.
Torgesen, J.K, Morgan, S. and Davis, C. (1992). ‘The effects of two types of phonolog-
ical awareness training on word learning in kindergarten children’, Journal of
Educational Psychology, 84, 364–70.
Tucker, L. (1946). ‘Maximum validity of a test with equivalent items’, Psychometrika,
11, 1–13.
Wagner, R. and Torgesen, J. (1987). ‘The nature of phonological processing and its
causal role in the acquisition of reading skills’, Psychological Bulletin, 101,
192–212.
Wagner, R., Torgesen, J. and Rashotte, C. (1993). ‘The efficacy of phonological aware-
ness training for early reading development: A meta-analysis.’ Symposium presented
at annual meeting of the American Educational Research Association, Atlanta, GA,
April.
West, R. and Stanovich, K. (1978). ‘Automatic contextual facilitation in readers of three
ages’, Child Development, 49, 717–27.
WISC-IV. (2000). Wechsler Intelligence Scale for Children. Fourth edition. San Antonio, TX: Harcourt Assessment Inc.
Woodcock, R. (1973). Woodcock Reading Mastery Tests. Circle Pines, MN: American Guidance Service.
Woodcock, R. (1998). Woodcock Reading Mastery Tests – Revised. Circle Pines, MN: American Guidance Service.
Yule, W., Rutter, M., Berger, M. and Thompson, J. (1974). ‘Over and under achieve-
ment in reading: distribution in the general population’, British Journal of
Educational Psychology, 44, 1–12.
4 Cognitive psychology and reading
assessment
Alastair Pollitt and Lynda Taylor
The past forty years have seen a great expansion in the amount of empirical research
carried out in the field of reading assessment. Many of the various question formats
commonly used have been the subject of intense investigation with regard to issues of
reliability and validity; multiple-choice and cloze, in particular, have been the focus of
considerable attention with large numbers of studies devoted to analysing the efficiency
of multiple-choice items or the relative merits of one cloze format over another. Others
have studied the strategies adopted by test-takers during a reading test, the role of cul-
tural or background knowledge, or the relationship between reading and other language
skills.
There has been rapid expansion in all areas of both L1 and L2 reading research. The
late 1960s and the 1970s saw extensive advances in the development of theories and
models of reading, a trend which continues to this day. Considerable attention has been
directed towards trying to identify and describe the component processes of reading at a
level beyond basic decoding (the so-called higher order reading skills) and towards
finding an appropriate model to describe and explain the nature of comprehension.
Recent models for text comprehension have stressed the active and constructive nature
of the process in which meaning is generated by the cognitive processes of the reader;
using text together with pre-existing knowledge, the reader apparently builds a person-
al mental representation which may be modified by personal attitudinal characteristics
and intentions; this mental representation may be visual, or verbal, or both.
It is surely reasonable to suggest that reading comprehension theory and reading
assessment theory must overlap and that research developments in one field are likely to
be of relevance and value to the other. One might therefore expect there to exist between
these two fields a strong and reciprocal relationship, as a result of which advances in our
understanding of reading processes and products are directly reflected in developments
in reading assessment theory and practice. This has not always been the case and it is
easy to see a considerable gap between current theories of reading comprehension and
current practice in the testing of reading comprehension ability. The result of this mis-
match is that much of what is currently done in reading comprehension assessment is
undertaken without sufficient regard to what we now understand about the process of
reading comprehension.
Most researchers and practitioners working in the field of reading test design and use
probably assume that an underlying relationship automatically exists between, on the
one hand their own theory and practice and on the other, a theory of reading; various
comments in the literature, however, suggest that some researchers perceive a gap
between current theories of reading and current reading test design. Farr and Carey
(1986) and Anderson et al. (1991) conclude that reading tests have not changed signifi-
cantly in the last fifty years and have not therefore responded to changes in how
comprehension is understood:
While models of reading have evolved, changing our thinking about how the printed
word is understood, the tests that we use to measure that understanding have not
changed significantly.
It would thus appear that an examination of the construct validity of current reading
theories, is in order.
(Anderson et al., 1991, p.41)
We use Anderson’s (1983, 1993) ACT theories of cognition to analyse the reading
process, but most of what we say would differ very little if we chose a different
approach. Recent theories of cognitive processing argue that there is little real distinc-
tion to be made between the three familiar terms learning, remembering and thinking.
When a student reads a text and constructs a meaning representation because of it, this
is thinking, but it operates on structures recalled from memory and the resulting repre-
sentation is returned to memory, at least for some time; if it remains long enough in
memory we say that the understanding of the text has been learned. The close relation-
ships between these terms were first described by Anderson (1976) and later
developments are outlined in Anderson (2000).
At the start, you began building a representation using schemas your mind already con-
tained about warfare, modern high technology military aircraft and the thrill of combat.
As you read on, more details were added to that initial representation – until you
reached the italicised phrase. Its meaning – something to do with maintaining office
equipment – cannot immediately be fitted into the existing representation, so a new
structure is begun to accommodate it. Finally, when you reached the word ‘dream’ (if
not before then) and driven by the maxim of relevance to integrate everything in the
text, you created a new super-structure that resolved the conflict by representing ‘Mr’
Hong as a young man rather bored by his everyday duties and fantasising about a more
exciting life. And then you added extra layers or facets to the structure as you noted the
humour, the fact that it was an advertisement, your feelings about militarism and so on.
That’s an awful lot of brain activity provoked by just 39 words.
Many experiments could be chosen to illustrate some of the unconscious processing
that is part of the reading process. Consider this sentence:
The king died and the queen died of grief.
This seems an easy sentence stating two simple propositions, but in fact we must make
several crucial inferences in order to ‘understand’ it. Ask yourself the following questions:
• Who died first?
• Why did the queen die?
• Were the king and the queen related?
That we can readily answer these questions shows that our brains spontaneously make
many inferences, going beyond what is explicitly stated in the text in order to build a
coherent and satisfactory representation of the meanings this text seems to contain.
For our purpose of understanding how children read text, and so how we ought to try to
assess their reading ability, there are several lessons to be taken from this psychological
view of cognition and the reading process. The importance of schemas in the current
view should make us realise that meaning does not reside in the text, but in the interac-
tion between text and reader. We cannot ignore the importance of the reader’s
pre-existing schemas, since these are the very stuff with which the meaning structures
are built. To the extent that the past experiences of different readers vary, so the mean-
ings they construct will be different. At the very least, each student’s representation of
(i.e. provoked by) the text will reflect differences in their interests and their purposes for
reading it: their structure of meaning will represent what the text means for them. This
leads to a problem: if understanding is necessarily idiosyncratic, how can we ever hope
to measure it? Spolsky (1994) argues that we cannot and that we would do better to seek
rich descriptions of students’ understanding than to claim we are measuring it. We will
suggest a solution to this problem later.
Schemas are pre-fabricated structures that are activated in their entirety when pro-
voked, even though much of what each one contains will not necessarily be needed for
the text in question. They prepare us for whatever is likely to arise in a subsequent text,
in terms of what we would expect to encounter in a context capable of provoking them.
Their influence is both subtle and strong; we enhance our imagined representations with
stereotypic people, objects and events to fill out the ‘slots’ in the schema that are not
explicitly specified by the author, stereotypes that may not be consistent with the
author’s own imagination. A simple proof of this is the lack of surprise we feel if, in a
narrative about a crime or road accident, reference is made to the policeman, even
though no policeman has yet been mentioned, apparently contravening the general rule
that characters must first be introduced to an English narrative with the indefinite arti-
cle. It can be argued (Brown and Yule, 1983) that the policeman was introduced along
with the schema and had been present implicitly from that point.
Children, especially in the unnatural and rather stressful conditions of a test or exam-
ination, will be less able to distinguish what is their own idiosyncratic knowledge and
expectation from the meanings intended by the author and less able to avoid the pitfalls
of stereotypic thinking as they read. We must take care to ensure that the test does not
unfairly depend on our adult consensus of what ought to be in a reader’s schemas. Tay-
lor (1996) observed marked differences between the schemas of younger and older
teenagers (13–14-year-olds and 17–18-year-olds) which no doubt resulted from their
varying degrees of life experience.
We should recognise that the meaning of a text is not constructed in a simple linear
fashion. As the Sergeant Hong example showed, we can modify our meaning structures
retrospectively in the light of new information. Later parts of the text may change our
understanding of earlier parts; this is obvious when you consider that almost every mod-
ern novel begins in medias res, leaving the setting, the characters and even the scene to
be set as the narrative progresses. The study of literature assumes our ability to alter our
understanding of a text through study and reflection. Most pertinent to our concern is
the effect of comprehension test questions. Gordon and Hanauer (1993) reported empir-
ical evidence that multiple-choice questions can cause some readers to over-elaborate
their mental representations, incorporating misleading ‘information’ from the distrac-
tors, but a cognitive model of reading implies that any question will at least provoke
re-consideration and so cause changes in the meaning structure. We face a rather serious
‘Heisenberg’ problem: we need to use questions to find out about the representation the
student has made while reading the text but the questions are bound to change that
representation. Perhaps some kinds of questions may cause less change than others?
Finally, for the moment, we should question the distinction commonly made between
understanding and remembering. Is it reasonable to claim that you have understood a
text if you cannot remember the important things about it? In most current theories of
cognition, thinking operates on memory and what we think of as new information from a
text is actually a new structure in memory built from elements that were already present
in memory. There is no separate ‘place’ where thinking takes place, for thinking consists
of changes in memory and the only question is whether or not the new structures will
persist after the reading is finished. Barring some sort of short term memory pathology, it
seems that a coherent and meaningful representation that has no unresolved inconsistencies
will remain accessible in memory, at least for as long as it takes for us to test it.
If ‘reading’ equals ‘building models of meaning’, how can we test the quality of a
child’s comprehension? The first step seems obvious: we must investigate the
models that readers build. It has long been accepted that asking a reader to produce a
summary – oral or written – of what they have read is a reasonable way of accessing the
mental representation which they have built for themselves. Summary tasks enjoy a nat-
ural appeal among English teachers, since a summary is an attempt by the student to
express their mental representation in words. What we know from empirical research
into the cognitive operations involved in the summarisation process suggests that they
include skills such as identifying relevant information, distinguishing superordinate
from subordinate material and eliminating trivial or redundant material; such operations
are key to constructing a mental model and can thus provide insights into comprehen-
sion (Kintsch and van Dijk, 1978; Johnston, 1984).
We need to recognise, however, that summary production can be a very difficult and
cognitively demanding task for children and even for college students (Brown and Day,
1983); after all, the child’s mental representation is a multi-modal model of a world
rather than a literal one of the text (Garnham, 1987). A traditional comprehension test, by contrast, usually involves a set of questions to which the reader must give written answers after reading a particular text or set of texts. But there is a significant danger associated
with the testing method we usually use. If it is difficult to turn the multi-modal mental
structures into a string of words and the original text presents just that simple string of
words, children are far more likely to draw directly on the text to answer questions
rather than to try to answer from their mental representations. What will happen if we
remove the text following reading but before asking them the questions?
We believe that a valid test of text comprehension should consist of asking the reader to
provide some sort of summary of the text, because a summary will be the closest
approximation we can get to a description of the mental representation they have built
while reading. Even so, we must remember that there are several reasons why different
readers will, quite validly, construct different representations from reading the same
text. We have already noted that they will approach the task with different prior knowl-
edge and experiences and so will start their construction of meaning with different
schemas. We should also remember that there is more involved than just information.
Schemas contain affective components too, including memories of the emotions felt in
the experiences from which they derive; the meaning structures readers build will incor-
porate all the feelings of interest or boredom, pleasure or distaste that they feel while
reading. How they feel about the author, both in the sense of evaluating the quality of
the writing and in the sense of agreement (or not) with the author’s attitudes and
assumptions, may profoundly affect the meaning they make of the text.
All of these factors will cause readers to prioritise the content of the text differently,
leading to Spolsky’s problem with ‘measuring comprehension’ we referred to earlier.
There is one way that we can reduce the problem. In one sense measuring reading is
no more problematic than measuring writing or science or running. We don’t actually
want to measure ‘reading’ at all; we want to measure ‘the ability to read’. Just as in sport
or in any educational test, the results of a reading test are of no value – they are not gen-
eralisable to other contexts of interest – unless the reader agrees to ‘play by the rules’.
To measure children’s ability to read we need to ensure (a) that what they are doing is
what we mean by reading, (b) that they know what we are expecting of them and (c) that
they agree to try their best.
There is one final feature of ‘real’ language use, not yet discussed, that is crucial to meeting these demands: real language use is purposeful. In real life we read for a purpose. We may read a poem to appreciate the poet’s
feelings, or a novel for enjoyment, or a reference source for information. We may study
a journal article to grasp the writer’s opinions and argument, or a specialist book to deep-
en our understanding of a topic we already know quite well. These five categories of
purpose for using language, with their mnemonic Affect, Enjoyment, Information, Opin-
ion, Understanding, were developed for the AAP, the Scottish national survey of
standards in English (Pollitt and Hutchinson, 1990). In each case the purpose precedes
the reading and guides it. The different kinds of reading which are sometimes tested –
gist, skimming or scanning, criticism – have developed to serve these different purposes.
The best way to make reading purposeful in a test is not to try to explain to students what
purpose you would like them to adopt, but to make the purpose arise naturally from the
context in which the reading takes place. For example, if a student knows that they are
reading a text in order to judge the suitability of the story for a younger audience (as in Tay-
lor, 1996) or to explain the argument to fellow students, or to write a detailed critique, or to
find the best way to travel to Timbuktu, they will approach the task appropriately without
needing specific detailed instructions. Purpose is much easier to understand through context
than through instructions, which can place an extra set of linguistic demands on readers.
comprehension through purpose. If it is clear that the reading is meant to achieve a cer-
tain purpose and that the student is expected to show that they can achieve it, then they
will at least know the rules of the game. Of course, we cannot guarantee that all of them
will be motivated enough to play the game to the best of their ability.
6 Give the students room to tell you what the text means for
them
Remember that many different meaning structures can be defended as interpretations of
a single text. If possible, give the students an opportunity to tell you what they remember
most, what struck them as most interesting, in the text – even if their answers are not to
be scored in the usual way. We are encouraged to include a few questions of this open
kind in a survey questionnaire, to show that we respect the respondents’ right to hold per-
sonal and unpredictable opinions, so perhaps we should adopt a similar approach in
reading tests. After all, the students are human beings too.
References
Anderson, J.R. (1976). Language, Memory and Thought. Hillsdale, NJ: Lawrence Erlbaum Associates.
Anderson, J.R. (1983). The Architecture of Cognition. Cambridge, MA: Harvard University Press.
Anderson, J.R. (1993). Rules of the Mind. Hillsdale, NJ: Lawrence Erlbaum Associates.
Anderson, J.R. (2000). Learning and Memory: An Integrated Approach. New York: Wiley.
Anderson, N., Bachman, L., Perkins, K. and Cohen, A. (1991). ‘An exploratory study into the construct validity of a reading comprehension test: triangulation of data sources’, Language Testing, 8, 41–66.
Anderson, R.C. (1977). ‘The notion of schemata and the educational enterprise.’ In: Anderson, R.C., Spiro, R.J. and Montague, W.E. (Eds) Schooling and the Acquisition of Knowledge. Hillsdale, NJ: Lawrence Erlbaum Associates.
Bartlett, F.C. (1932). Remembering: A Study in Experimental and Social Psychology. Cambridge: Cambridge University Press.
Bear, M.F., Connors, B.W. and Paradiso, M.A. (2001). Neuroscience: Exploring the Brain. Baltimore, MD and Philadelphia, PA: Lippincott Williams and Wilkins.
Brown, A. and Day, J. (1983). ‘Macrorules for summarizing texts: the development of expertise’, Journal of Verbal Learning and Verbal Behavior, 22, 1–14.
Brown, G. and Yule, G. (1983). Discourse Analysis. Cambridge: Cambridge University Press.
Farr, R. and Carey, R.F. (1986). Reading: What Can be Measured? Newark, DE: International Reading Association.
Garnham, A. (1987). Mental Models as Representations of Discourse and Text. Chichester: Ellis Horwood.
Gernsbacher, M.A. (1990). Language Comprehension as Structure Building. Hillsdale, NJ: Lawrence Erlbaum Associates.
Gernsbacher, M.A. and Foertsch, J.A. (1999). ‘Three models of discourse comprehension.’ In: Garrod, S. and Pickering, M. (Eds) Language Processing. Hove: Psychology Press.
Gordon, C. and Hanauer, D. (1993). ‘Test answers as indicators of mental model construction.’ Paper presented at the 1993 Language Testing Research Colloquium, Cambridge/Arnhem.
Grice, P. (1975). ‘Logic and conversation.’ In: Cole, P. and Morgan, J. (Eds) Syntax and Semantics, Vol. 3: Speech Acts. New York: Academic Press.
Johnson-Laird, P.N. (1977). Mental Models. Cambridge: Cambridge University Press.
Johnston, P. (1984). ‘Prior knowledge and reading comprehension test bias’, Reading Research Quarterly, 19, 219–39.
Kintsch, W. and van Dijk, T.A. (1978). ‘Toward a model of text comprehension and production’, Psychological Review, 85, 363–94.
Messick, S. (1988). ‘Validity.’ In: Linn, R.L. (Ed) Educational Measurement. Third edn. New York: Macmillan.
Pollitt, A. and Ahmed, A. (2000). ‘Comprehension failures in educational assessment.’ Paper presented at the European Conference for Educational Research, Edinburgh.
Pollitt, A. and Hutchinson, C. (1990). The English Language Monitoring Project. Report of the Assessment of Achievement Programme, Second Round, 1989. Edinburgh: Scottish Education Department.
Spolsky, B. (1994). ‘Comprehension testing, or can understanding be measured?’ In: Brown, G., Malmkjaer, K., Pollitt, A. and Williams, J. (Eds) Language and Understanding. Oxford: Oxford University Press.
Taylor, L.B. (1996). An Investigation of Text-removed Summary Completion as a Means of Assessing Reading Comprehension Ability. Unpublished PhD thesis, University of Cambridge.
Weir, C.J. (2005). Language Testing and Validation. Basingstoke: Palgrave Macmillan.
5 Postmodernism and the assessment
of literature
Colin Harrison
How should we assess response to literature? It is worth considering this issue, since
to do so raises many of the problems that traditionally accompany assessment in the
arts, or indeed assessment in any area in which it is difficult to say whether a stu-
dent’s response is unequivocally right or wrong. I also want to argue that we need to
take a postmodern perspective on assessing response to literature and to suggest that
although postmodernism poses some major challenges for traditional assessment,
such problems, though inescapable, are also soluble.
Assessment systems are by their nature conservative. This is hardly surprising: as
guardians of standards, arbiters of achievement and gatekeepers whose role is to
determine in significant ways the life course of both students and their teachers,
those in control of assessment systems bear a heavy responsibility and it is under-
standable that they should in general adopt a historical model of definition and an
incremental (or even plate tectonic) model of change.
To anyone who works in an examinations board in the UK, however, the sugges-
tion that change proceeds at a lithospheric pace would seem risible, since – even if
it is structurally conservative – the examination system in all UK countries has been
through a period of unprecedented and unceasing change for the past thirty-five
years. The main precipitating agents of these changes have been external to the
boards themselves: external forces in the shape of government interventions have
speeded up the rate of change in the lithosphere of assessment and these changes in
turn have led to the production of new assessment life forms (not all of which have
had a healthy long-term prognosis).
When the context of assessment changes, everything changes and in the UK the
context began to change in the late 1970s. Interestingly, up to that point, gradual sys-
temic change had been taking place in the examination system of the UK and it had
been initiated in partnership with teachers, teachers’ organisations and teacher edu-
cators. If teacher autonomy in curriculum and assessment is gold, then yes, in
England and Wales there was once a golden age and that age was in the 1960s and
1970s, when teachers, teachers’ groups and unions held sway within the Schools
Council. Innovation in curriculum and assessment was encouraged through universi-
ty departments of education and was led at the grass roots by school-based consortia,
‘meaning’ is a social as well as a linguistic phenomenon. Bakhtin argued that not just
words but whole texts were ‘dialogic’. Dostoyevsky’s novels, for example, are not ‘mono-
logic’; they introduce a ‘polyphonic’ range of points of view, expressed through the
various characters and between which the author does not adjudicate. Instead, the reader
is faced with the difficult task of struggling to come to an active, personal and individual
interpretation of meaning and to engage in a personal search for unification.
This emphasis on the reader as determiner of meaning was also attractive to Wolf-
gang Iser (1978), who argued that the process of reading is a dynamic one, to which
readers bring personal experiences and social and cognitive schemata, and in which pre-
dictions, assumptions and inferences are constantly made, developed, challenged and
negated. Iser’s reception theory positions the reader as a central and active collaborator
in making meaning, one whose habits of interpretation are challenged and disconfirmed by reading, a process which leads to new insights and understandings, not only of the text but also of themselves. Iser’s theory goes further than Bakhtin’s in suggesting that the text
is unfinished without the reader’s contribution to making meaning: it is the reader who, in
partnership with the author, fills the ‘hermeneutic gap’ in the text, bringing to it his or her
own experience and understanding and resolving the conflicts and indeterminacies which
the author has left unresolved.
An even more extreme challenge to any notion of stability in meaning and interpre-
tation – a notion which is essential if we are to retain any hope that it is possible to
assess response to reading with any certainty – is that posed by the literary
theories of Jacques Derrida. Derrida’s Of Grammatology (1976) proposed a theory of
‘deconstruction’ of texts which was so radical that it seemed to imply not only the
‘death of the author’ as determiner of meaning, but to threaten the death of meaning
itself. According to Derrida, the reader’s role is not to discover meaning, but to produce
it: to dismantle (déconstruire) the text and rebuild it another way. Derrida uses the
metaphor of the bricoleur to describe the reader’s role. He denied that the search for mean-
ing could be so banal as a simple ‘logocentric’ transfer of consciousness from the
‘transcendental subject’ (the author) to the ‘subject’ (the reader). For Derrida, written
texts are the site of an endless series of possibilities, oppositions and indeterminacies.
Deciding on a text’s meaning under these circumstances is not possible – the reader can
do no more than look for traces of meaning and contemplate the text’s geological strata
during the unending fall into the abyss of possible deferred meanings.
Here, then, are three formal challenges to traditional assessment of response to literature:
From Bakhtin (1973): the meaning of individual words and whole texts is unstable.
From Iser (1978): it is the reader (not the literary critic, or the chief examiner) who
brings meaning to the text.
From Derrida (1976): any search for an agreed meaning is doomed to failure, since
a text is not so much an IKEA flat pack that the reader assembles (what we might
think of as the archetypal do-it-yourself task) as a set of power tools that can perform
an infinite number of jobs.
At this point, one might expect to encounter a complete philosophical stalemate, since
these postmodern perspectives appear to be fundamentally incompatible with the value
system that produced traditional English Literature examinations for GCSE and A-level
and more recently, end-of-key-stage tests in England and Wales. But in fact this has not
happened. Where there have been plate tectonic collisions, these have been over government policy rather than over incompatible theories, and the reason for this has been a general acceptance, if not of postmodernism, then of some of the implications of a post-
modern position. As I have argued elsewhere (Harrison and Salinger, 1998; Harrison,
2004), agreeing to adopt a postmodern view is not what is crucial; what is important is
to consider the practical implications of postmodernism and to determine whether or
not these can be accommodated.
The following are the six implications of a postmodern perspective that I have sug-
gested need to be considered and which could form a principled basis for assessment.
What is interesting is that these six implications (and I have presented them elsewhere
as moral imperatives) are in many respects far less contentious than the theories from
which they derive (Harrison, 2004).
The imperatives for responsive assessment (the first three derived from a scientific
perspective on postmodernism, the second three from a literary perspective) were these:
1. that we acknowledge the potential of local system solutions if global system solu-
tions are difficult or impossible to achieve
2. that we acknowledge the importance of the individual subject, given that the con-
cept of ‘objectivity’ has to be recognised as problematic
3. that we acknowledge the importance of accepting as valid a range of methodologies
4. that we acknowledge that we need to recognise a polysemic concept of meaning
5. that we acknowledge a privileging of the role of the reader
6. that we acknowledge a diminution of the authority of the author and of the text.
How should these imperatives drive a practical model of assessment of literature? Let’s
consider some examples, if only briefly.
understanding of how marks, levels and boundaries were applied in the coursework of hundreds of students, including their own, and the consortia provided a forum within which there was mutual respect for each other’s expertise and a shared sense of purpose. The advantages of these arrangements were bidirectional. Certainly most teachers felt that they gained immeasurably: from participating in the construction of new courses; from seeing students, many of whom hated exams, put heart and soul into their coursework; and from being treated as professionals, with unique expertise and a distinctive role to play in setting standards, rather than as technicians whose job was simply to ‘deliver’ a curriculum. But the exam boards gained too: teachers, particularly newer teachers, were inducted into the craft knowledge of assessment; they came to understand not only what grades and levels meant in reality but also how validation at whole-school level worked and what excellent (and shoddy) coursework looked like; and they learned that unreliable marking would be identified and, where necessary, regraded by experts.
response that are most congruent with a monosemic concept of meaning, but their place
in assessment may be diminishing.
In assessing response, exam boards prefer not to put too much emphasis on reproduc-
tion, since it may be independent of comprehension: a student could simply parrot an
answer, or give direct quotation without any paraphrase or commentary and this gives
very little evidence of understanding. Similarly, exam boards prefer not to put too much
emphasis on transformation, which may betoken comprehension, but may also be inde-
pendent of the reader’s having integrated meaning at the whole text level, or of that
reader’s having a rich associational response. However, while juxtaposition can offer
evidence of comprehension and creative response, here the problem is the opposite: the
more creative and original the intertextual links suggested by the reader, the more risky
for the student and the more problematic for the exam board in terms of interpretability
and inter-rater reliability (see any issue of The London Review of Books letters page for
evidence of this in the context of contemporary literary criticism).
But juxtaposition is the route that must be travelled: not only postmodern literary
theorists but also cognitive psychologists now accept that texts are best understood as
arrangements of propositions made up of complex semantic vectors that exist in loose-
ly structured relationships and that knowledge is best understood in the same way
(Kintsch, 1998). For Kintsch, comprehension does not involve the application of pre-
cise semantic and syntactic rules, but is rather a process of spreading activation, in which fuzzy mental representations are created in response to text and yield understandings that are approximate solutions, full, initially at least, of irrelevancies and
redundancies. The challenge of comprehension is to integrate those networks of propo-
sitions into coherent mental representations and the challenge of assessment is to offer
opportunities for readers to give evidence of that integration through sharing those rep-
resentations and constructing others that use juxtaposition to forge new associations that
give evidence of understanding, but are also novel, just and complex.
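Kintsch’s account is at root a computational one, and the flavour of spreading activation can be conveyed with a toy sketch. Everything below – the propositions, the link weights and the number of settling cycles – is an invented illustration of the general mechanism, not Kintsch’s published construction–integration algorithm.

```python
import numpy as np

# Four toy propositions from a short text, with hypothetical symmetric
# connection strengths standing in for semantic overlap between them.
props = ["ghost appears", "Hamlet grieves", "court celebrates", "Hamlet suspects murder"]
W = np.array([
    [0.0, 0.5, 0.0, 0.7],
    [0.5, 0.0, 0.2, 0.6],
    [0.0, 0.2, 0.0, 0.1],
    [0.7, 0.6, 0.1, 0.0],
])

activation = np.ones(len(props))    # every proposition starts weakly active
for _ in range(20):                 # let activation spread until the pattern settles
    activation = W @ activation     # each node receives activation from its neighbours
    activation /= activation.max()  # keep the values bounded

for strength, prop in sorted(zip(activation, props), reverse=True):
    print(f"{strength:.2f}  {prop}")
```

When the activation settles, the weakly connected proposition ends up with by far the lowest value – a crude analogue of the way irrelevancies and redundancies drop out as the network is integrated into a coherent representation.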
development, which in turn helps to avoid the problem that in less exam-oriented
assessment approaches, content knowledge is harder to test, because the text is not with-
held from the student. Finally, if new approaches to assessment are to be worthwhile,
one criterion they should meet is that of greater alignment with post-school and real-
world tasks and self-generated text activity, with text focus and task determined at least
in part by the student, would tend to approximate more closely to real world reading
than would formulaic exam questions on a set task judged against a common answer
template.
As Pearson and Hamm make clear (chapter 7), in the USA portfolio-based and student-centred approaches to assessment gained currency and even began to become fashionable during the 1990s, but their popularity began to wane as the calls for high-stakes testing became more insistent. This drop in popularity occurred in relation to two problems – authenticity and generalisability. ‘Whose work is it anyway?’ was the cry that went up, and the answer was ‘we don’t know’: in portfolio-based work the student clearly had plenty of input, but the teacher also certainly had some input, and so did peers. The generalisability issue was related to the fact that student performance is more variable in portfolio-based work than it is within the narrower boundaries of traditional examinations. This should hardly surprise us, because reliability goes down whenever the range of item types, content and media goes up; but if the impact on measured reliability is negative, there are certainly problems of acceptability for employers and others who would wish to use literature exam results as an indicator of overall competence in reading and writing at a high level. However, this is not an insurmountable problem. If, as Pearson and Hamm report, an open-book portfolio-type exam needed 13 hours of task activity to obtain a reliability score above .9, compared with 1.25 hours for some traditional exams, then teachers simply need to build exam tasks into the regular curriculum.
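The scale of that difference is roughly what classical test theory would lead us to expect. As a back-of-the-envelope illustration only – using the Spearman–Brown prophecy formula, treating the portfolio tasks as roughly parallel forms, and with the single-task reliability of about 0.46 assumed so as to fit the reported figures rather than taken from Pearson and Hamm – lengthening an assessment by a factor k changes its reliability from \(\rho\) to

\[
\rho_k = \frac{k\rho}{1 + (k-1)\rho}, \qquad k = \frac{13}{1.25} \approx 10.4, \qquad \rho \approx 0.46 \;\Rightarrow\; \rho_k \approx 0.90,
\]

so aggregating ten or so hours of open-ended tasks is simply the price of lifting an inherently noisy task type above the .9 threshold – which is why building such tasks into the regular curriculum is the practical way to pay it.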
regular curriculum. As Richard Kimbell found in a recent e-assessment project on
Design and Technology for the Qualifications and Curriculum Authority in England, it
proved possible to set a reliable open-ended six-hour design task that was executed over
two days by 500 students, using hand-held PDAs, with students uploading to the web an
e-portfolio containing diagrams, text, photographs and even audio files (Kenny, 2005).
Under the classroom exam conditions, the teacher withdrew from the role of mentor and guide, and peer support was minimal; each student produced an
individual portfolio and the fact that there were opportunities to discuss ongoing work
during breaks was not regarded as a problem, since it was the skills exhibited during the
class time that were being assessed, rather than content knowledge.
2000s. They developed for reasons that were largely or even wholly independent of the
theoretical earthquakes that were erupting in science and the world of critical theory.
Teachers did not embrace coursework and student-centred approaches to assessment because they questioned the grand narratives of science, or because they had been reading contemporary French philosophy. Broadly speaking, English teachers of the 1970s
still held to Enlightenment theories of knowledge and hierarchical, canonical, Leavisite
theories of literary criticism, which placed Shakespeare, Milton and Dickens at the top of
the Great Chain of Literary Being. The rationale for developing new approaches to
assessing response to literature had much more to do with the comprehensive school
movement and a pragmatic determination to offer an experience of great literature to a
broader range of pupils than with any postmodern theoretical position.
There is, nevertheless, a powerful alignment between postmodernism and these
newer approaches: when student engagement has to be courted rather than assumed,
motivation, choice and extensions to the concept of the role of the reader move up the
teacher’s agenda and these most certainly harmonise with postmodern principles. More
recently, technologies have also played a significant part in extending the concept of lit-
erature and of response: new technologies have spawned multiliteracies and these have
contributed to a major change in the authority of knowledge in relation to literature.
Postmodernism redefines the authority of the text and redistributes it, with a new and
much greater share in this zero sum game being won by the reader. But the internet has
also redefined authority: the author, the text and the teacher are no longer the only
sources in the quest for meaning or knowledge. The search term ‘Hamlet’ returned 14 million results in November 2005 and the term ‘Shakespeare’ 46 million. The internet has truly opened up the possibility of changing the authority structure of literary enquiry, and although concerns over plagiarism and authentication have led to a questioning of open-ended approaches to assessment, these floodgates cannot now be closed.
The question ‘who is the author?’ no longer applies only to issues of response to lit-
erature – it applies to the works themselves. Fan fiction sites (featuring full-length
novels based on the same characters as those from celebrated works of fiction) on the
internet offer many hundreds of novels echoing Jane Austen, and over two hundred
thousand (at least half of which are on a continuum from mild sexual fantasy to outright
pornography) based on the characters from the Harry Potter novels (FanFiction.net is
the most visited and best organised source). What is more, these novels are reviewed by
their readers: many Harry Potter fan fiction novels are more than 50,000 words in length
and many have over 100 reviews. No author receives royalties and no student receives
a grade for their review, but not only are there hundreds of thousands of fan fiction novels,
there are tens of millions of fan fiction reviews, which have generated their own discourse
and abbreviations (‘A/U’ = ‘alternative universe’; ‘UST’ = ‘unresolved sexual tension’;
‘PWP’ = ‘plot – what plot?’, usually referring to a story in which there is a conspicuous
absence of UST).
What one would want to emphasise here is that our notions of what counts as litera-
ture are evolving and our notions of what counts as response to literature are transmuting
as new technologies and new literacies also evolve. To most teachers, it would seem
absurd to classify fan fiction as literature and to regard internet reviews of fan fiction as
response to literature that might in principle be assessed, but I would want to suggest that
to compose good fan fiction does require sensitivity to character, historical and social context, tone, diction, figurative language and so on, and that to review that fiction calls in principle upon the same skill set that is used by professional literary critics.
Conclusion
It may be some time before exam boards include fan fiction as an assessed element in A-
level English Literature, but it is already the case that exam boards are dramatically
changing how they view electronic assessment, and are encouraging much greater align-
ment between curriculum and assessment, at a time when the curriculum is evolving at a
rapid pace and that evolution is being accelerated by new technologies (see, for example,
DiDA, the Diploma in Digital Applications, Edexcel, 2005). Maintaining trust in national
examination systems is a delicate challenge and in England as recently as 2002 (Tomlin-
son, 2002) the Qualifications and Curriculum Authority had its fingers seriously burned
when public confidence wavered for reasons that were more attributable to a press feeding
frenzy than to any fundamental instability. But behind a public discourse that emphasises steady and responsible development, the exam boards are changing, and in my view these changes – seeking local solutions, valuing the individual, extending the range of examination modes, attempting to take account of multiple perspectives on meaning, bringing back portfolio approaches and redefining authority – are essential. To attempt to ignore the
calls for such changes, or to attempt to hold them back, would be fruitless and ultimately
catastrophic, because these changes are fundamentally linked to movements in the plate
tectonics of our civilisation. Postmodernism is not a choice, it is the condition of our soci-
ety and represents the state of our somewhat unstable cultural lithosphere; what is needed
therefore are design solutions that can cope with stress, uncertainty and change; solutions
that are flexible, responsive and adaptable. Fortunately, there is emerging evidence of that
responsiveness in our assessment system, and it will be interesting in the coming years to
observe the ways in which the postmodern continues to impinge upon assessment in gen-
eral and the assessment of response to literature in particular.
References
Almasi, J.F. (1995). ‘The Nature of Fourth Graders’ Sociocognitive Conflicts in Peer-
Led and Teacher-Led Discussions of Literature’, Reading Research Quarterly, 29, 4, 304–6.
Bakhtin, M.M. (1973). Problems of Dostoevsky’s Poetics. Ann Arbor, MI: Ardis.
Callaghan, J. (1976). ‘Education: towards a national debate.’ Speech given at Ruskin
College, Oxford, 18 October.
6 Learning to read or learning to do tests?
Ros Fisher
Introduction
In this chapter I intend to consider the position of teachers and children amidst this dis-
cussion about the assessment of reading. Paul Black and Dylan Wiliam thoughtfully
entitled their booklet on assessment Inside the Black Box (Black and Wiliam, 1998).
They argue:
[P]resent policy seems to treat the classroom as a black box. Certain inputs from the
outside are fed in or make demands – pupils, teachers, other resources, management
rules and requirements, parental anxieties, tests with pressure to score highly and so
on. Some outputs follow: hopefully pupils who are more knowledgeable and compe-
tent, better test results, teachers who are more or less satisfied and more or less
exhausted. But what is happening inside? How can anyone be sure that a particular set
of new inputs will produce better outputs if we don’t at least study what is happening
inside?
(Black and Wiliam, 1998, p.1)
Teachers and pupils are centrally involved in assessment. They are involved not only as
people to whom the assessment is administered but also as those who are the agents of the assessment. In addition, regardless of the intentions of those who design the assessment
and judge its outcomes, it is the teachers and pupils who ultimately interpret the nature
of the construct.
This chapter is essentially a personal reflection by someone who is not an expert in assessment, as others in this book are. I write here as someone who has worked as and with teachers of language and literacy over many years, and as someone who is as concerned with the minutiae of seemingly inconsequential outcomes as with the big picture of raised scores. It is these minutiae that represent individual children and individual
lives. I start with a reflection on my own position as a primary school teacher in the 70s
and 80s and then move on to how I see the changes in relation to assessment in class-
rooms today. I shall then reflect, with the perspective of distance, on what teachers and
pupils may require from assessment. Finally I want to consider how these points could
relate to the construct of reading. Much of what I have to say relates to many subjects in
the curriculum. However, I shall argue that the case of reading may be significantly
different from other subjects in some ways.
Therefore, as the teacher I wanted those children who I felt needed help to score 80 or
less.
The other outcome from the test was the assignment of a reading age to each child. In
the absence of any other norm referenced measure, these were important to parents –
comparing reading books at the school gate was the only other way they had of know-
ing how their child was doing. Here again, as a teacher, I found the narrowness of the construct frustrating. It was so hard to explain to parents, without appearing to have low expectations of their child, that the fact that Nicola at five had a reading age of 11.6 years did not mean she should be reading books for eleven-year-olds. Trying to explain the gap between a child’s ability to read individual words or short sentences and their ability to follow a complicated plot or tackle conceptual complexity seemed to sound as though I wanted to hold their child back. Both the Schonell and the Young tests advantaged slow, careful, word-by-word reading. Yet fluency seemed important.
Fortunately, I had also read about miscue analysis (Goodman, 1967). This was a rev-
elation to me. The way I suddenly was able to judge what a child knew and could use
and what they didn’t know or couldn’t use was fantastic. All of a sudden I was able to
do something specific with individuals. The reading tests had told me what I knew
already – that someone was not making good progress in their reading or was doing
really well. Now I had something I could do about it. Not only did miscue give me a
way of judging what help my pupils needed, it also taught me about the small parts of decoding and how to recognise them in any reading context, not just when I was specifically assessing reading. The construct with miscue is clearly mostly about decoding and early
comprehension. In fact, it gave me little help with more able readers. However, as an
enthusiastic and critical reader who enjoyed all kinds of texts, I feel miscue helped me
develop my own understanding of the process of reading rather than limit my own
definition of the construct.
1998); Literacy Block in Australia (DEET, Vic, 1997, 1998, 1999); Success for All
(now Roots and Wings) (Slavin, 1996). The success of these programmes is judged by
the assessments that children take. Often for political reasons as much as educational ones, there is considerable pressure for children to do well on these assessments. Where teachers’ status and even pay are determined by the results, that pressure intensifies; and where government ministers’ political ambition is also involved, it becomes greater still. When learning becomes a political issue more than an educational one, there is a danger that learning loses out to attainment.
The team from the Ontario Institute for Studies in Education (OISE) in Canada that was charged with evaluating the implementation of the NLS comment favourably, in their final report, on the success of the programme. However, they go on to warn:
Targets or standards and high stakes testing are among the most contentious ele-
ments of large-scale reform. Most would agree that the move toward higher
standards is necessary and important. There is less agreement, however, about the
way that tests and targets are used in the process. … [We] see some evidence that the
high political profile of the 2002 national targets skewed efforts in the direction of
activities that would lead to increases in the one highly publicised score.
(Earl et al., 2003)
This importance afforded to national test results places a strong emphasis on perform-
ance as opposed to learning: on what children can do as opposed to what they can’t do;
on the product of the reading rather than the process.
But what is the product of reading? As Sainsbury (this volume) argues, the product of
reading for a test does not represent the full scope of the construct. Rather the test
should allow the tester to ‘draw inferences about valued attributes that go beyond the
test’ (p. 8). While this may be the case, these ‘valued attributes’ are unlikely to include
indicators for reading attitudes and practice in the long term. If the teaching of reading
is to go beyond teaching the performance of certain skills, then the application of those skills to a range of purposes, and attitudes to reading, must be important.
A further consequence of the focus on performance is the concern for what children
can do as opposed to what they cannot do. Many classrooms now use records in which
children record what they can do. For example, ‘I can divide a word into syllables’, ‘I can
find the main character in a story’. Hall and Myers (1998) argue that criterion referencing
and ‘can do’ statements accord status to the what rather than the how of learning. Hall
and Myers report on a study by O’Sullivan and Joy (1994), which showed that children,
when talking about reading problems, attributed these to lack of effort as opposed to abil-
ity. O’Sullivan and Joy conclude that teachers’ emphasis on practice and working hard
allows children to retain a naïve understanding of the reading process. Teachers’ practice
of focusing on what is to be achieved rather than how it is achieved can only reinforce
this. Dweck (1989) proposes two kinds of achievement goals (learning goals and per-
formance goals) and describes the sort of learners who favour these types of goals.
Learners who set themselves learning goals try to increase their competence. They
choose challenging tasks, persist despite the challenge and work out strategies for gain-
ing proficiency. On the other hand, learners who set themselves performance goals, in
which they strive to gain favourable judgements from others, tend to avoid challenge,
attribute difficulty to low ability and give up in the face of problems.
This proposition is interesting in the light of Moss’ (2000) study on gender and read-
ing. She explored the place of reading in the curriculum and children’s responses to this.
She found that boys and girls reacted differently to the judgements made about their
proficiency as readers and that this impacted on their progress in reading. Whereas girls
were likely to respond to unfavourable judgements about their reading by trying harder,
boys were more likely to avoid the challenge. She speculates that boys, contrary to pop-
ular belief, choose information texts not because they prefer them but because when
‘reading’ information books they are better able to conceal any difficulty with decoding
and enhance their status by showing expertise in the topic.
Non-fiction texts allowed weaker boy readers to escape others’ judgements about
how well they read, how competent they were. At the same time, they enabled them to
maintain self-esteem in the combative environment of their peer group relationships.
… They bought self-esteem at the expense of spending much time practising their
reading skills.
(Moss, 2000, p.103)
The climate in education at the moment is all about performance, about trying to do bet-
ter, to achieve better results, to be judged to be a level higher than last year. Teachers
under pressure to show increased performance, both on their own part and on the part of
their pupils, are not well placed to focus on learning itself and to encourage children to
seek difficulty and ways of overcoming it. The emphasis on performance over learning
makes quick-fix solutions attractive. Yet there is a danger that concentrating on helping children to perform well in the short term may not provide the foundations for a lifetime
of thinking and achievement.
For such foundations to be laid, teachers need to know a lot about what is involved in reading, and children have to be confident enough to accept challenges and be open about difficulties. Read-
ing is a complex process in which a range of knowledge and skills are orchestrated to
make meaning from text. We have long been aware that children who fail to make
progress in reading and writing find the putting together of the component parts more
difficult than learning each separately. For example, Clay (1979) found that the poorest
readers tended to do exactly and only what they had been taught and appeared to have
become instruction dependent, with the result that, although they knew letter sound cor-
respondences, they did not use them efficiently because they used them exclusively.
Similarly, Garner (1987) showed that whereas good readers monitor their comprehen-
sion of a text, poor or less experienced readers do not seem to recognise when the text
does not make sense. In the first case, an awareness of the need to select an appropriate decoding strategy before applying it would help these readers; in the second, an awareness of the need to check continuously on their understanding of the text.
This strategic knowledge is important to learners but difficult to test.
In reading assessment that examines the product of the reading, the strategic activity
will be largely unobservable. Teachers need to develop pupils’ strategic repertoires so
they can make conscious choices and take control over their reading. This implies a
construct of reading that is complex and not fixed. It also implies a model of teaching
and learning that values uncertainty and challenge over correctness and simplicity. Mis-
cue analysis recognised that a great deal of insight could be gained into how readers
were processing text from the errors or ‘miscues’ that they made. We need to encourage
children to face challenge and to welcome less-than-perfect outcomes, both as a way of helping them develop their own strategies and for the insight those errors give the teacher into
how children are tackling an activity that involves reading.
The question of what teachers and children require from reading assessment is inextrica-
bly bound up with the socio-cultural context of the assessment. In a climate where success
is judged by test scores, performance outweighs learning in importance. However, the per-
formance is surely meant to act as a marker for learning? Therefore in response to the
question ‘what do teachers and children want from reading assessment?’ I would propose:
Teachers want reading assessment to:
• tell them if their teaching is successful
• tell others that their teaching is successful
• tell them what children can already do
In all of these they want the assessment to be manageable and convincing. It is likely that
teachers cannot get all this from one form of assessment. Nisbet (1993) in a report on
assessment for the Organisation for Economic Co-operation and Development (OECD)
proposes that ‘there is an ideological divide between those who hope to raise standards by
more extensive testing and those who hope to improve the quality of learning by changing
assessment methods’ (p.28). It is arguable whether these two purposes of assessment can
operate together or whether they are mutually incompatible. Gipps (1994) argues ‘assess-
ment to support learning, offering detailed feedback to the teacher and pupil, is necessarily
different from assessment for monitoring and accountability purposes’ (p.3).
In response to the question of what children require from reading assessment, I
would argue that they want to know:
• that they are doing OK
• that what they are trying to do is achievable for them
• that what they can do (or are trying to do) matters
• how they can do better.
If I am right about these points, perhaps we have more of an indication of what assessment should do and what it can do. If these points matter, then reading has to be seen as something that others like them can do, and actually do. Thus reading is not just a construct of school and school-based literacy purposes but a part of lives before, during and after school. It is not just something difficult that other people can do well; it is something that all of us do to some extent or other as part of our everyday lives. Children need to understand that the best readers can be uncertain about the interpretation of texts, and that confident readers happily tackle difficult texts to find out something that they need.
it’ (p.10). However, in a subject such as reading, in which the overall aims relate to fluid
and complex outcomes such as interpreting text and developing critical judgement,
identifying learning goals can be problematic. The NLS ‘Framework of Objectives’
gives a clear menu of teaching objectives. It is when these teaching objectives are trans-
lated into learning outcomes that there is a danger of narrowing children’s (and
teachers’) perception of what being a reader means. Marshall (2004) argues that teach-
ers do need a clear idea of progression in English, but that this should be related to
‘horizons not goals’.
Peer- and self-assessment can help here by encouraging learners to focus on the
process and the learning, not just the performance. Sadler (1989) argued that criteria alone are not enough to ensure progression. He argued that it is important to know how
to interpret the criteria in specific cases by developing ‘guild knowledge’. In being
involved in self- and peer-assessment, learners can engage in discussion about reading
practices – about pleasures and difficulties as well as achievements. It seems to me that
in order to learn about reading, children should engage in writing and vice versa. When
readers assess other readers and writers, they are helped in identifying the goals of
learning to read. Here we are not talking about how many words can be read or which comprehension questions can be answered, but about questions such as: what sort of enjoyment do pupils derive from a text? How able are they to judge the validity of an information text? How
good are they at seeing through the rhetoric of a newspaper editorial? By working with
authentic texts from the child’s world there is a greater likelihood of engagement as
opposed to disaffection. For teachers too, being involved in open-ended discussion
around texts and writers will help develop their own guild knowledge – related perhaps
to their own understanding of being a reader rather than an understanding derived from
expected test outcomes.
What has to be accepted is that, if we agree that different assessments are needed for dif-
fering purposes, then we must also agree that one form of assessment must not outweigh
the other in the value that is attributed to it. Earl et al. (2003) report that high-stakes assessment has had a narrowing effect on the curriculum. Others go further and argue that the curriculum has had a narrowing effect on the model of literacy taught. Street (1984) argued that many schools adopt what he terms an ‘autonomous’ model of literacy in which literacy itself is an object to be studied and learned rather than a social practice which
is shaped by those who use it. It is also argued that this ‘autonomous’ view of literacy dis-
advantages some children for whom the relationship between literacy practice at home and
school literacy is not evident. As Luke (2003) argues:
the classroom is one of the few places where formal taxonomic categories (e.g. the
curriculum) and the official partitioning of time and space (e.g. the timetable) often
are used to discourage children from blending, mixing and matching knowledge
drawn from diverse textual sources and communications media.
(Luke, 2003, p.398)
Street and Luke’s criticisms of school versions of literacy are valid. Literacy as social practice stands at one end of a continuum of reading practices; the practices tested by most reading tests stand at the distant other end. Teachers need to keep the broad view of reading in mind as an end point and make sure that children too keep the big picture in mind. In this way children can be helped to see why they are learning to read and how worthwhile and relevant it is to their lives. It is possible to retain the big picture while smaller skills and strategies are taught: by working on authentic tasks, discussing reading practices and behaviours, and accepting different viewpoints. Assess-
ment of what children are doing when talking about their reading can provide teachers
with a greater understanding of what is involved in reading (in the way that miscue
analysis enabled me to identify how children were decoding). This then will enable
them to design teaching tasks that are sufficiently structured to scaffold learning but not
so tightly defined as to limit thinking.
The solution is to widen the narrow end of the funnel. As a society we do need feed-
back on the outcomes of our education system but we should not expect this monitoring
to do everything else as well. Analysis of national assessment results show that there is
not a blanket problem with attainment in literacy. There are specific problems in partic-
ular aspects of literacy or particular groups of learners. Specific problems require
specific solutions. Torrance and Pryor (1998) describe two types of assessment: conver-
gent and divergent. Convergent assessment is a closed system which judges if learners
know, understand or can do something, whereas divergent assessment is more open and
assesses what a learner knows, understands or can do. It seems that what is needed now,
as well as a national assessment system that gives a broad picture, is a bank of formative
assessments for particular purposes that are related to particular children’s needs. These
would allow the teacher to adapt their teaching to the needs of the learner – not to the
demands of the test.
It is not enough to say that formative and summative assessment are mutually incom-
patible. In the current climate, summative assessment will necessarily drive the
formative. Society needs to monitor the outcome of what happens in school, but this
should not result in the distorting of what happens in the classroom. It may well be that the instruments for these two purposes are not the same. For summative assessment, teachers and children need
a test that values what children are learning to do and that is valued by those who see
and use the results. They also want formative assessment procedures that can help them
judge what children can do, what they are trying to do and what they can’t yet do. From
their own life experience, teachers and children already have a picture of reading that
encompasses literacy as social practice as well as literacy for school purposes. They
need to be confident that the procedures for assessment and the uses to which the assessment
is put do not value one type of reading over another.
References
Black, P., Harrison, C., Lee, C., Marshall, B. and Wiliam, D. (2002). Working Inside the Black Box: Assessment for Learning in the Classroom. London: Department of Education and Professional Studies, King’s College London.
Black, P. and Wiliam, D. (1998). Inside the Black Box: Raising Standards through Classroom Assessment. London: Department of Education and Professional Studies, King’s College London.
Bourdieu, P. (1990). The Logic of Practice, trans. Nice, R. Cambridge: Polity Press. (Original work published in 1980.)
Clay, M.M. (1979). Reading: The Patterning of Complex Behaviour. London: Heinemann Educational Books.
DEET:Vic (1997). Teaching Readers in the Classroom Early Years Literacy Program Stage 1. S. Melbourne, Victoria: Addison Wesley Longman.
DEET:Vic (1998). Teaching Writers in the Classroom Early Years Literacy Program Stage 2. S. Melbourne, Victoria: Addison Wesley Longman.
DEET:Vic (1999). Teaching Speakers and Listeners in the Classroom Early Years Literacy Program Stage 3. S. Melbourne, Victoria: Addison Wesley Longman.
Department for Education and Employment (1998). The National Literacy Strategy. London: Department for Education and Employment.
Dweck, C. (1989). ‘Motivation.’ In: Lesgold, A. and Glaser, R. (Eds) Foundations for a Psychology of Education. Hillsdale, NJ: Erlbaum.
Earl, L., Watson, N., Levin, B., Leithwood, K., Fullan, M. and Torrance, N. (2003). Watching and Learning 3: Final Report of the External Evaluation of England’s National Literacy and Numeracy Strategies. Toronto: Ontario Institute for Studies in Education, University of Toronto, January 2003.
Garner, R. (1987). Metacognition and Reading Comprehension. Norwood, NJ: Ablex Publishing Corporation.
7 The assessment of reading comprehension
P. David Pearson and Diane Hamm
The purpose of this chapter is to build a rich and detailed historical account of reading
comprehension, both as a theoretical phenomenon and an operational construct that lives
and breathes in classrooms throughout the USA. We will review both basic research,
which deals with reading comprehension largely in its theoretical aspect and applied
research, which is much more concerned about how comprehension gets operationalised
in classrooms, reading materials and tests.
With a renewed professional interest in reading comprehension (e.g. Rand Study
Group, 2002), it is an optimal time to undertake a new initiative in the area of reading
comprehension assessment. Such an initiative needs our rapt and collective attention at this particular point in history, for three reasons. First, reading comprehension, both its instruction and its
assessment, is arguably the most important outcome of reform movements designed to
improve reading curriculum and instruction. Second, given the national thirst for
accountability, we must have better (i.e. conceptually and psychometrically more trust-
worthy) tools to drive the engines of accountability at the national, state and local level.
Third, and even more important, we need better assessments so that we can respond to
the pleas of teachers desperate for useful tools to assist them in meeting individual
needs. It is doubly appropriate that the assessment of reading comprehension receive as
much attention as the construct itself. In the final analysis, a construct is judged as much
by how it is operationalised as by how it is conceptualised.
The process of text comprehension has always provoked exasperated but nonetheless
enthusiastic inquiry within the research community. Comprehension, or ‘understand-
ing’, by its very nature, is a phenomenon that can only be assessed, examined, or
observed indirectly (Pearson and Johnson, 1978; Johnston, 1984a). We talk about the
‘click’ of comprehension that propels a reader through a text, yet we never see it direct-
ly. We can only rely on indirect symptoms and artifacts of its occurrence. People tell us
that they understood, or were puzzled by, or enjoyed, or were upset by a text. Or, more
commonly, we quiz them on ‘the text’ in some way – requiring them to recall its gist or
its major details, asking specific questions about its content and purpose, or insisting on
an interpretation and critique of its message. All of these tasks, however challenging or
engaging they might be, are little more than the residue of the comprehension process
itself. Like it or not, it is precisely this residue that scholars of comprehension and com-
prehension assessment must work with in order to improve our understanding of the
construct. We see little more of comprehension than Plato saw of the shadows in the
cave of reality.
Models of reading comprehension and how to assess it have evolved throughout the
century (see Johnston, 1984b). Many techniques of assessment have risen to prominence
and then fallen out of use, some to be reincarnated decades later, usually with new twists.
Our aim is to provide a thorough account of what we know about assessing reading com-
prehension. Where possible and appropriate, we will take detours into research and
theory about the comprehension process, on the grounds that conceptions of the process,
because they have influenced how it is assessed, will inform our understanding. We hope
to illuminate the patterns, cycles and trends in comprehension assessment. Through these
efforts, we hope to provide our readers with a means to evaluate the current state of read-
ing assessment, which we believe has reached a critical juncture, one that can be crossed
only by shaping a research agenda that will improve our capacity to create valid, fair and
informative assessments of this important phenomenon.
The beginning
It is well worth our effort to examine early trends in reading assessment, for they sug-
gest that nearly all of the tools we use to measure reading comprehension today made an
appearance in some way, shape or form before World War II. Granted, today’s formats
and approaches may look more sophisticated and complex, but, as our review will
demonstrate, those formats were there, at least in prototypic form, long ago.
The first systematic attempts to index reading ability by measuring comprehension
date back to the period just prior to World War I. Binet, as early as 1895 (cited in John-
ston, 1984a), used comprehension test items, ironically, to measure intelligence rather
than reading achievement. In 1916, Kelly brought us the first published comprehension
assessment, the Kansas Silent Reading Test (see Kelly, 1916). Thorndike, in his classic
1917 piece, ‘Reading as reasoning’, offered us our first professional glimpse ‘inside the
head’ as he tried to characterise what must have been going on in the minds of students
to produce the sorts of answers they came up with when answering questions about text
(Thorndike, 1917). As we indicated earlier, the quest to get as close as possible to the
‘phenomenological act of comprehension’ as it occurs has always driven researchers to
discover new and more direct indices of reading comprehension.
The scientific movement and the changing demographic patterns of schooling in
America were both forces that shaped the way reading was conceptualised and assessed
in the first third of the 20th century. Schools had to accommodate to rapid increases in
enrolment, due to waves of immigration, a rapidly industrialising society, the prohibi-
tion of child labour and mandatory school attendance laws. The spike in school
enrolment, coupled with a population of students with dubious literacy skills, dramati-
cally increased the need for a cheap, efficient screening device to determine students’
levels of literacy. During this same period, psychology struggled to gain the status of a
‘science’ by employing the methods that governed physical sciences and research. In
America, the behaviorist schools of thought, with their focus on measurable outcomes,
strongly influenced the field of psychology (Johnston, 1984a; Resnick, 1982; Pearson,
2000); quantification and objectivity were the two hallmarks to which educational ‘sci-
ence’ aspired. When psychologists with their newfound scientific lenses were put to
work creating cheap and efficient tests for beleaguered schools, the course of reading
assessment was set. Group administered, multiple-choice, standardised tests would be
the inevitable result.
The other strong influence in moving toward comprehension as a measure of reading
accomplishment was the curricular shift from oral to silent reading as the dominant
mode of reading activity in our classrooms. Although the first published reading assess-
ment, circa 1914, was an oral reading assessment created by William S. Gray (who
eventually became a pre-eminent scholar in the reading field and the senior author of the
country’s most widely used reading series), most reading assessments developed in the
first third of this century focused on the relatively new construct of silent reading (see
Gray, 1917; Pearson, 2000; Johnston, 1984a). Unlike oral reading, which had to be test-
ed individually and required that teachers judge the quality of responses, silent reading
comprehension (and rate) could be tested in group settings and scored without recourse
to professional judgement; only stop watches and multiple-choice questions were need-
ed. In modern parlance, we would say that they moved from a ‘high inference’
assessment tool (oral reading and retelling) to a ‘low inference’ tool (multiple-choice
tests or timed readings). It fit the demands for efficiency and scientific objectivity,
themes that were part of the emerging scientism of the period. The practice proved
remarkably persistent for at least another 40 or 50 years. Significant developments in
reading comprehension would occur in the second third of the 20th century, but assess-
ment would remain a psychometric rather than a cognitive activity until the cognitive
revolution of the early 1970s.
It is important to note that comprehension instruction and the curricular materials
teachers employed were driven by the same infrastructure of tasks used to create test
items – finding main ideas, noting important details, determining sequence of events,
cause-effect relations, comparing and contrasting and drawing conclusions.1 If these
new assessments had not found a comfortable match in school curricular schemes, one
wonders whether they would have survived and prospered to the degree that they did.
(e.g. Rumelhart, 1977). Even at this early stage, scholars recognized that recall is not
the same process as making or uncovering meaning (Kelly, 1916), but recall continued
to be used in research and later in practice, as a direct index of comprehension. This use
of recall would be revived in the 1970s as a retelling procedure, which would give us a
window on whether students were remembering important ideas in stories (Stein and
Glenn, 1977) or in the propositional data base of expository texts (Kintsch and van Dijk,
1978; Turner and Greene, 1977).
Consistent with the efficiency criterion in the new scientific education, speed was
often used as an important factor in assessing comprehension. Kelly, the author of the
Kansas Silent Reading Test (1916), required students to complete as many of a set of 16
diverse tasks as they could in the 5 minutes allotted. The tasks included some ‘fill in the
blanks’, some verbal logic problems and some procedural tasks (following directions).
Monroe (1918) also used a speeded task – asking students to underline the words that
answered specific questions.
We can even find foreshadowing of the error detection paradigms that were to be so
widely used by psychologists investigating metacognitive processes in the 1970s
through the 1990s (Markman, 1977; Winograd and Johnston, 1980). For example,
Chapman (1924) asked students to detect words that were erroneous or out of place in
the second half of each paragraph (presumably they did so by using, as the criterion for
rejection, the set or schema for paragraph meaning that became established as they read
the first half). In 1936, Eurich required students to detect ‘irrelevant clauses’ rather than
words.
Thorndike (1917) was probably the first educational psychologist to try to launch
inquiry into both the complex thought processes associated with comprehension and
assessment methods. He referred to reading ‘as reasoning’, suggesting there are many
factors that comprise it: ‘elements in a sentence, their organization … proper relations,
selection of certain connotations and the rejection of others and the cooperation of many
forces’. He proposed ideas about what should occur during ‘correct reading’, claiming
that a great many misreadings of questions and passages are produced because of under-
or over-potency of individual words, thus violating his ‘correct weighting’ principle:
Understanding a paragraph is like solving a problem in mathematics. It consists in
selecting the right elements in the situation and putting them together in the right
relations and also with the right amount of weight or influence or force of each
(Thorndike, 1917)
Of course, he assumed that there are such things as ‘correct’ readings. He argued further
that in the act of reading, the mind must organise and analyse ideas from the text. ‘The
vice of the poor reader is to say the words to himself without actively making judgements
concerning what they reveal’ (Thorndike, 1917). Clearly for Thorndike, reading was an
active and complex cognitive process. Although this perspective did not become domi-
nant in this early period, it certainly anticipated the highly active view of the reader that
would become prominent during the cognitive revolution of the 1970s.3
Paralleling an active line of inquiry in oral reading error analysis (see Allington,
1984, pp.829–64) during this period, some researchers followed Thorndike’s lead and
tried to develop taxonomies of the kinds of errors readers make either during decoding
or understanding. Touton and Berry (1931) classified errors into six categories based on
research on college students:
1. failure to understand the question
2. failure to isolate elements of ‘an involved statement’ read in context
3. failure to associate related elements in a context
4. failure to grasp and retain ideas essential to understanding concepts
5. failure to see setting of the context as a whole
6. other irrelevant answers.
Even though Goodman is rightfully credited with helping us understand that oral reading
errors, or ‘miscues’ to use his term, can reveal much about the comprehension processes
a student engages in, there were inklings of this perspective emerging in the 1930s. Gates
(1937), for example, was interested in how readers’ fluency may be an indicator of one’s
ability and understanding. He looked at readers’ ‘error of hesitation’, that is, whether a
reader stumbled over a word or phrase. Durrell (1937) and later Betts (1946) sought to
use these error patterns as indicators of the level of reading material students could han-
dle, both from a word recognition and comprehension perspective. These early scholars
determined that students who misread many words (they found that 2 per cent seemed to
be the outside limit – although modern scholars often go up to 5 per cent) will have dif-
ficulty comprehending a passage. These harbingers notwithstanding, it would be another
30 years before the Goodmans’ (Goodman, 1968, 1969; Goodman and Burke, 1970)
miscue analysis work prompted us to take oral reading miscues seriously as a lens that
would allow us to look into the windows of the mind at the comprehension process.
Two significant events in the history of assessment occurred during the 1930s and
1940s; both would have dramatic effects on reading comprehension assessment. First,
in 1935, IBM introduced the IBM 805 scanner, which had the potential to reduce the
cost of scoring dramatically (compared to hand scoring of multiple-choice, or ‘even
worse’, short answer and essay tests) by a factor of 10 (Johnston, 1984a). It is not
insignificant that the Scholastic Aptitude Test, which, in the 1920s and early 1930s, had
been mostly an essay test, was transformed into a machine-scorable multiple-choice test
shortly thereafter (Resnick, 1982). This development paved the way for a new genera-
tion of multiple-choice assessments for all fields in which testing is used; reading
comprehension assessment proved no exception.
Davis employed the most sophisticated factor analytic tools available (Kelly, 1935) in
his search for psychometric uniqueness to match the conceptual uniqueness of his cate-
gories. Acknowledging the unreliability of some of the subtests (due among other
factors to the small standard deviations and the fact that each passage had items from sever-
al cognitive categories attached to it), he was able to conclude that reading
comprehension consisted of two major factors, word knowledge and ‘reasoning about
reading’, that were sufficiently powerful and reliable to guide us in the construction of
tests and reading curriculum. He speculated that another three factors (comprehension
of explicitly stated ideas, understanding passage organisation and detecting literary
devices) had the potential, with better item development, to reveal themselves as
independent factors.
Between 1944 and the early 1970s, several scholars attempted to either replicate or
refute Davis’ findings. Harris (1948) found only one factor among the seven he tested.
Derrik (1953) found three and they were consistent across different levels of passage
length. Hunt (1957) used differential item analysis and correction formulae to adjust his
correlations, finding vocabulary (i.e. word knowledge) to be the single most important factor.
The unsettled question about cloze tests is whether they are measures of individual dif-
ferences in comprehension or measures of the linguistic predictability of the passages to
which they are applied. They have been widely criticised for this ambiguity. But perhaps
the most damaging evidence in their role as indices of reading comprehension is that
they are not sensitive to ‘intersentential’ comprehension, i.e. understanding that reaches
across sentences in a passage. In a classic study, Shanahan et al. (1982) created several
passage variations and assessed cloze fill-in rates. In one condition, sentence order was
scrambled by randomly ordering the sentences. In another condition, sentences from
different passages were intermingled and in a third condition, isolated sentences from
different passages were used. There were no differences in cloze fill-in rate across any of
these conditions, indicating that an individual’s ability to fill in cloze blanks does not
depend upon passage context; in short, when people fill in cloze blanks, they do not
think across sentence boundaries. In the period of the cognitive revolution of the 1980s,
in which comprehension was viewed as an integrative process, a measure that did not
require text integration did not fare well.
These findings notwithstanding, modified, multiple-choice versions of cloze are still
alive and well in standardised tests (i.e. the Degrees of Reading Power and the Stanford
Diagnostic Reading Test referred to earlier) and in ESL assessment for adults and college
students (Bachman, 2000).
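
Although each of the tests mentioned here has its own format, the underlying cloze procedure (Taylor, 1953) is easy to state: delete every nth word of a passage and score the reader’s ability to restore it. The Python sketch below is our illustration of that procedure with exact-word scoring; the deletion interval, function names and scoring rule are assumptions for exposition, not a description of any published test.

import re

def make_cloze(passage, n=5, blank="_____"):
    """Delete every nth word from a passage (classic cloze procedure).

    Returns the cloze text and the list of deleted words, which serves
    as the exact-word scoring key.
    """
    words = passage.split()
    cloze_words, key = [], []
    for i, word in enumerate(words, start=1):
        if i % n == 0:
            # Keep any trailing punctuation in the text; store the bare word as the key.
            core = re.sub(r"[^\w'-]+$", "", word)
            key.append(core)
            cloze_words.append(word.replace(core, blank, 1))
        else:
            cloze_words.append(word)
    return " ".join(cloze_words), key

def exact_word_score(responses, key):
    """Proportion of blanks filled with exactly the deleted word."""
    hits = sum(r.strip().lower() == k.lower() for r, k in zip(responses, key))
    return hits / len(key) if key else 0.0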
Passage dependency
Beginning in the late 1960s a new construct arose in reading assessment, one that, at the
time, had the impact of an ‘emperor has no clothes’ epiphany. Several scholars became
concerned about the fact that many of the questions on standardised tests of reading com-
prehension could be answered correctly without reading the passage (mainly because the
information assessed was likely to exist in examinees’ prior knowledge, as well as in the
text). This problem is particularly exacerbated in passages about everyday or common
academic topics (in comparison, for example, to fictional narratives). A number of
researchers (e.g. Tuinman, 1974, 1978, pp.165–73) conducted passage dependency stud-
ies in which some subjects took the test without the passage present. The difference
between the p-value of an item in the two conditions (with and without text) is an index
of an item’s passage dependency. The logic of this construct is simple and compelling: a
reader should have to read a passage in order to answer questions about it. The interest in
passage dependency, like the interest in cloze, waned considerably during the cognitive
revolution. In the new paradigm, prior knowledge would be embraced as one of the cor-
nerstones of comprehension and scholars would attempt to take prior knowledge into
account rather than trying to eliminate or encapsulate its impact on comprehension (see
Johnston, 1984b, for an account of these attempts during the early 1980s).
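
Stated formally, the index just described is a simple difference of item difficulties across the two administration conditions; the notation below is our own shorthand rather than Tuinman’s:

\[
PD_i \;=\; p_i^{\text{text}} \;-\; p_i^{\text{no text}},
\]

where \(p_i^{\text{text}}\) and \(p_i^{\text{no text}}\) are the proportions of examinees answering item \(i\) correctly with and without the passage present. An item answerable largely from prior knowledge yields a \(PD_i\) near zero, whereas a fully passage-dependent item leaves \(p_i^{\text{no text}}\) near the chance level of guessing.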
Cognitive psychology
In rejecting behaviorism, cognitive psychology allowed psychologists to extend con-
structs such as human purpose, intention and motivation to a greater range of
psychological phenomena, including perception, attention, comprehension, learning,
memory and executive control or ‘meta-cognition’ of all cognitive processes. All of these
would have important consequences in reading pedagogy and, to a lesser extent, reading
assessment.
The most notable change within psychology was that it became fashionable for psy-
chologists, for the first time since the early part of the century, to study complex
phenomena such as language and reading.4 And in the decade of the 1970s works by
psychologists flooded the literature on basic processes in reading. One group focused on
text comprehension, trying to explain how readers come to understand the underlying
structure of texts. We were offered story grammars – structural accounts of the nature of
narratives, complete with predictions about how those structures impede and enhance
story understanding and memory (Rumelhart, 1977; Mandler and Johnson, 1977; Stein
and Glenn, 1977). Others chose to focus on the expository tradition in text (e.g. Kintsch,
1974; Meyer, 1975). Like their colleagues interested in story comprehension, they
believed that structural accounts of the nature of expository (informational) texts would
provide valid and useful models for human text comprehension. And in a sense, both of
these efforts worked. Story grammars did provide explanations for story comprehen-
sion. Analyses of the structural relations among ideas in an informational piece also
provided explanations for expository text comprehension. But neither text-analysis tra-
dition really tackled the relationship between the knowledge of the world that readers
bring to text and comprehension of those texts. In other words, by focusing on the
structural rather than the ideational, or content, characteristics of texts, they failed to get to the
heart of comprehension. That task, as it turned out, fell to one of the most popular and
influential movements of the 70s, schema theory.
Schema theory (see Anderson and Pearson, 1984, pp.255–90; Rumelhart, 1981,
pp.3–26; Rumelhart and Ortony, 1977) is a theory about the structure of human knowl-
edge as it is represented in memory. In our memory, schemata are like little containers
into which we deposit the particular traces of particular experiences as well as the
‘ideas’ that derive from those experiences. So, if we see a chair, we store that visual
psychometric criteria in building new reading assessments. They emphasised the need
for assessments to reflect resources such as prior knowledge, environmental clues, the
text itself and the key players involved in the reading process. They emphasised
metacognition as a reflective face of comprehension. And they championed the position
that only a fresh start in assessments would give us tests to match our models of instruction.
Major changes
Changes included longer text passages, more challenging questions, different question
formats (such as the more than one right answer format and open-ended questions).
Reading scholars acknowledged that while all multiple-choice items include answers
that are plausible under certain conditions, they do not necessarily invite reflection or
interactive learning. Assessment efforts in Illinois and Michigan (see Valencia et al.,
1989) led the charge in trying to incorporate these new elements. In the spirit of authen-
ticity, they included longer and more naturally occurring or ‘authentic’ text selections in
tests. And both included test items that measured prior knowledge rather than trying to
neutralise its effects (i.e. the passage dependency phenomenon). They also included items
that were designed to measure students’ use of reading strategies and their dispositions
toward reading.
Classroom assessment
The most significant advances in classroom comprehension assessment tools during this
period also came from cognitive science. First was the spread of retellings as a tool for
assessing comprehension. Driven by the 1970s advances in our knowledge about the
structure of narrative and expository text (see Meyer and Rice, 1984), many scholars
(see Irwin and Mitchell, 1983; Morrow, 1988, pp.128–49) developed systems for eval-
uating the depth and breadth of students’ text understandings, based upon their attempts
to retell or recall what they had read. Like the formal efforts of this era, there was a con-
scious attempt to take into account reader, text and context factors in characterising
students’ retellings.
Second was the use of the think-aloud protocol as a measure of comprehension. Think-
alouds had become respectable research tools by virtue of the important work on
self-reports of cognitive processes popularised by Ericsson and Simon (1984). In
attempting to characterise the nature of expertise in complex activities, such as chess,
Ericsson and Simon learned that the most effective way inside the heads of experts was
to engage them in thinking aloud about the what, why and how of their thinking and
actions during the activity.
This led to the wider use of think-alouds. First it became a research tool to get at the
process, not just the product of student thinking (e.g. Olshavsky, 1977; Hartman, 1995).
Then, it became an instructional practice (Baumann et al., 1993) and finally, it was used
as an assessment practice (Farr and Greene, 1992; California Learning Assessment Sys-
tem, 1994). With the ostensible purpose of assessing metacognitive processes during
reading, Farr and Greene engaged students in write-along tasks (a kind of mandatory set
of marginal notes prompted by a red dot at key points in the text). Students were encour-
aged, as they are in think-alouds, to say (in this case make a few notes about) what they
thought at a given point. A similar practice was a standard part of the now defunct Cali-
fornia Learning Assessment System: marginal notes were allowed, even encouraged, in
the initial reading of the texts and those notes were fair game for review when the tasks
were scored. Unfortunately, with the exception of a very thorough account of the
research and theoretical background on verbal protocols by Pressley and Afflerbach
(1995), very little careful work of either a conceptual or psychometric nature on the use
of think-alouds as a viable assessment tool has emerged, although there was one effort to
evaluate different approaches to metacognitive assessment in the special studies of
NAEP in 1994.
We are not sure whether what happened next constitutes a second major shift or is bet-
ter thought of as an extension of the first shift. It came so fast on the heels of the
cognitive revolution that it is hard to pinpoint its precise beginning. But by the late
1980s and early 1990s, a new contextual force was at work in shaping our views of
comprehension assessment.
Sociolinguistics
In fact, harbingers of this socio-cultural revolution, emanating from sociolinguistic per-
spectives (see Bloom and Greene, 1984, pp.394–421) and the rediscovery of Vygotsky
(see Vygotsky, 1978; Wertsch, 1985), were around in the early to mid-1980s, even as the
cognitive revolution was exercising its muscle on assessment practices. For example, in
cognitively motivated teaching approaches such as reciprocal teaching, students took on
more responsibility for their own learning by teaching each other. In process writing,
revision and conversation around revision delved more deeply into the social nature of
reading, writing and understanding. Teachers used such practices to engage students to
reflect on their work as well as interact with others around it. The concept of ‘dynamic
assessment’ also emerged in this period. Dynamic assessment (Feuerstein et al., 1979)
allows the teacher to use student responses to a given task as a basis for determining
what sort of task, accompanied by what level of support and scaffolding from the
teacher, should come next. Here we see both cognitive and socio-cultural influences in
assessment.
These early developments notwithstanding, the next round of assessment reforms car-
ried more direct signs of the influence of these new social perspectives of learning,
including group activities for the construction of meaning and peer response for activities
requiring writing in response to reading.
Literary theory
The other influential trend was a renaissance in literary theory in the elementary class-
room. One cannot understand the changes in pedagogy and assessment that occurred in
the late 1980s and early 1990s without understanding the impact of literary theory, par-
ticularly reader response theory. In our secondary schools, the various traditions of
literary criticism have always had a voice in the curriculum, especially in guiding dis-
cussions of classic literary works. The ‘New Criticism’ (Richards, 1929), which began its
ascendancy in the depression era, dominated the interpretation of text for several decades,
until the middle 1980s. It had sent teachers and students on a search for the
one ‘true’ meaning in each text they encountered.9 With the emergence (some would
argue the re-emergence) of reader response theories, all of which gave as much author-
ity to the reader as to either the text or the author, theoretical perspectives, along with
classroom practices, changed dramatically. The basals that had been so skill-oriented in
the 1970s and so comprehension oriented in the 1980s, became decidedly literature-
based in the late 1980s and early 1990s. Comprehension gave way to readers’ response
to literature. Reader response emphasises affect and feeling that can either augment or
replace cognitive responses to the content. To use the terminology of the most influential
figure in the period, Louise Rosenblatt (1978), the field moved from efferent to aesthetic
response to literature. And a ‘transactive model’ replaced the ‘interactive model’ of
reading championed by the cognitive views of the 1980s. According to Rosenblatt,
meaning is created in the transaction between reader and text. This meaning, which she
refers to as the ‘poem’, is a new entity that resides above the reader–text interaction.
Meaning is therefore neither subject nor object nor the interaction of the two. Instead it
is transaction, something new and different from any of its inputs and influences.10
Task generalisability
Task generalisability, the degree to which performance on one task predicts perform-
ance on a second, is a major concern with these performance tasks. The data gathered
from the first scoring of New Standards tasks (Linn et al., 1995) indicate that indices of
generalisability for both math and reading tasks were quite low. That essentially means
that performance on any one task is not a good predictor of scores on other tasks.
Shavelson and his colleagues encountered the same lack of generalisability with science
tasks (Shavelson et al., 1992), as have other scholars (e.g. Linn, 1993) even on highly
respected enterprises such as the advanced placement tests sponsored by the College
Board. The findings in the College Board analysis are noteworthy for the incredible
variability in generalisability found as a function of subject matter. For example, in
order to achieve a generalisability coefficient of .90, estimates of testing time range
from a low of 1.25 hours for Physics to over 13 hours for European History. These find-
ings suggest that we need to measure students’ performance on a large number of tasks
before we can feel confident in having a stable estimate of their accomplishment in a
complex area such as reading, writing, or subject matter knowledge. Findings such as
these probably explain why standardised test developers have included many short pas-
sages on a wide array of topics in their comprehension assessments. They also point to
a bleak future for performance assessment in reading; one wonders whether we can
afford the time to administer and score the number of tasks required to achieve a stable
estimate of individuals’ achievement.
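
To see why the estimates of required testing time grow so quickly, it helps to recall the Spearman-Brown prophecy formula, a simpler relative of the generalisability analyses these authors actually conducted; the single-task figure used below is an assumption for illustration, not a value reported in any of the studies cited:

\[
k \;=\; \frac{\rho^{*}\,(1-\rho_{1})}{\rho_{1}\,(1-\rho^{*})},
\]

where \(\rho_{1}\) is the generalisability of a single task, \(\rho^{*}\) is the target coefficient and \(k\) is the number of comparable tasks required. If a single extended reading task had \(\rho_{1} = 0.30\), reaching \(\rho^{*} = 0.90\) would require \(k = (0.90 \times 0.70)/(0.30 \times 0.10) = 21\) tasks, which makes the lengthy testing times quoted above unsurprising.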
The legacy
Trends in the assessment marketplace and in state initiatives are usually a reliable
indicator of the direction assessment is taking, and examining them permits some
predictions. The revolution begun in the 1980s now appears to be over, or at least to be
inching along in a very quiet cycle. Granted, successful implementations of authentic wide-scale assessment have
been maintained in states like Maryland (Kapinus et al., 1994, pp.255–76), Kentucky
and Oregon (see Pearson et al., 2002). However, other states (e.g. California, Wisconsin,
Arizona and Indiana) have rejected performance assessment and returned to off-the-
shelf, multiple-choice, standardised reading assessments. Pearson et al. (2002) found a
definite trend among states in which performance assessment is still alive to include it in
a mixed model, not unlike NAEP, in which substantive, extended response items sit
alongside more conventional multiple-choice items. Both these item formats accompa-
ny relatively lengthy passages. Even the more modest reforms in Illinois (the
multiple-correct answer approach) were dropped in 1998 (interestingly, in favour of a
NAEP-like mixed model approach). And it is the NAEP model that, in our view, is most
likely to prevail. It is within the NAEP mixed model that the legacy of the reforms of the
early 1990s are likely to survive, albeit in a highly protracted form (see chapter 18 for an
account of the NAEP approach to reading assessment).
Concluding statement
References
Allington, R.L. (1984). ‘Oral reading.’ In: Pearson, P.D., Barr, R., Kamil, M. and
Mosenthal, P. (Eds) Handbook of Reading Research. New York: Longman.
Anderson, R.C. and Pearson, P.D. (1984). ‘A schema-theoretic view of basic processes
in reading comprehension.’ In: Pearson, P.D., Barr, R., Kamil, M. and Mosenthal, P.
(Eds), Handbook of Reading Research. New York: Longman.
Bachman, L.F. (1982). ‘The trait structure of cloze test scores’, TESOL Quarterly, 16, 1,
61–70.
Bachman, L.F. (2000). ‘Modern language testing at the turn of the century: assuring that
what we count counts’, Language Testing, 17, 1, 1–42.
Baumann, J., Jones L. and Seifert-Kessell, N. (1993). ‘Using think alouds to enhance
children’s comprehension monitoring abilities’, The Reading Teacher, 47, 184–93.
Betts, E. (1946). Foundations of Reading Instruction. New York: American Book.
Binet, A. (1895). Cited in Johnston, P.H. (1984a) ‘Assessment in reading.’ In: Pearson,
P.D., Barr, R., Kamil, M. and Mosenthal, P. (Eds) Handbook of Reading Research.
New York: Longman, 147–82.
Bloom, D. and Greene, J. (1984). ‘Directions in the sociolinguistic study of reading.’ In:
Pearson P.D., Barr, R., Kamil, M. and Mosenthal, P. (Eds) Handbook of Reading
Research. New York: Longman.
Bormuth, J.R. (1966). ‘Reading: a new approach’, Reading Research Quarterly, 1,
79–132.
California Learning Assessment System. (1994). Elementary Performance Assess-
ments: Integrated English-Language Arts Illustrative Material. Sacramento, CA:
California Department of Education.
California State Department of Education. (1987). English Language Arts Framework.
Sacramento, CA: California Department of Education.
Chapman, J.C. (1924). Chapman Unspeeded Reading-Comprehension Test. Minneapo-
lis: Educational Test Bureau.
Courtis, S.A. (1914). ‘Standard tests in English’, Elementary School Teacher, 14,
374–92.
Davis, F.B. (1944). ‘Fundamental factors of comprehension of reading’, Psychometrika,
9, 185–97.
Davis, F.B. (1968). ‘Research in comprehension in reading’, Reading Research Quar-
terly, 3, 499–545.
Derrik, C. (1953). ‘Three aspects of reading comprehension as measured by tests of dif-
ferent lengths’, Research Bulletin 53–8. Princeton, NJ: ETS.
Dewey, J. (1938). Experience and Education. New York: Collier Books.
Durrell, D.D. (1937). Durrell Analysis of Reading Difficulty. New York: Harcourt,
Brace and World.
Ericsson, K.A. and Simon, H.A. (1984). Protocol Analysis: Verbal Reports as Data.
Cambridge, MA: MIT Press.
Farr, R. and Greene, B. G. (1992). ‘Using verbal and written think-alongs to assess
metacognition in reading.’ Paper presented at the 15th annual conference of the East-
ern Education Research Association, Hilton Head, SC.
Feuerstein, R.R., Rand, Y. and Hoffman, M.B. (1979). The Dynamic Assessment of
Retarded Performance. Baltimore, MD: University Park Press.
Freeman, F.N. (1926). Mental Tests: Their History, Principles and Applications. Chica-
go: Houghton Mifflin.
Gates, A.I. (1937). ‘The measurement and evaluation of achievement in reading.’ In: The
Teaching of Reading: A Second Report (Forty-Sixth Yearbook of the National Society
for Studies in Education, Part 1). Bloomington, IL: Public School Publishing.
Gearhart, M., Herman, J., Baker, E. and Whittaker, A.K. (1993). Whose Work is It? A
Question for the Validity of Large-Scale Portfolio Assessment. CSE Technical report
363. Los Angeles: University of California, National Center for Research on Evalua-
tion, Standards and Student Testing.
Goodman, K.S. (1968). The Psycholinguistic Nature of the Reading Process. Detroit:
Wayne State University Press.
Goodman, K.S. (1969). ‘Analysis of oral reading miscues: applied psycholinguistics’,
Reading Research Quarterly, 5, 1.
Goodman, Y.M. and Burke, C.L. (1970). Reading Miscue Inventory Manual Procedure
for Diagnosis and Evaluation. New York: Macmillan.
Gray, W.S. (1917). Studies of Elementary School Reading through Standardized Tests
(Supplemental Educational Monographs No. 1). Chicago: University of Chicago
Press.
Harris, C.W. (1948). ‘Measurement of comprehension in literature’, The School Review,
56, 280–89 and 332–42.
Huey, E. (1908). The Psychology and Pedagogy of Reading. Cambridge, MA: MIT
Press.
Hunt, L.C. (1957). ‘Can we measure specific factors associated with reading compre-
hension?’, Journal of Educational Research, 51, 161–71.
Irwin, P.A. and Mitchell, J.N. (1983). ‘A procedure for assessing the richness of
retellings’, Journal of Reading, 26, 391–96.
Johnston, P. H. (1984a). ‘Assessment in reading.’ In: Pearson, P.D., Barr, R., Kamil, M.
and Mosenthal, P. (Eds) Handbook of Reading Research. New York: Longman.
Johnston, P.H. (1984b). Reading Comprehension Assessment: A Cognitive Basis.
Newark, DE: International Reading Association.
Kapinus, B., Collier, G.V. and Kruglanski, H. (1994). ‘The Maryland school perform-
ance assessment program: a new wave of assessment.’ In: Valencia, S., Hiebert, E.
and P. Afflerbach (Eds) Authentic Reading Assessment: Practices and Possibilities.
Newark, DE: International Reading Association.
Kelly, E.J. (1916). ‘The Kansas silent reading tests’, Journal of Educational Psycholo-
gy, 7, 63–80.
Kelly, T.L. (1935). Essential Traits of Mental Life. Cambridge, MA: Harvard University
Press.
Pearson, P.D. and Johnson, D.D. (1978). Teaching Reading Comprehension. New York:
Holt, Rinehart and Winston.
Pearson, P.D., DeStefano, L. and Garcia G.E. (1998). ‘Ten dilemmas of performance
assessment.’ In: Harrison, C. and Salinger, T. (Eds) Assessing Reading 1, Theory
and Practice. London: Routledge.
Pearson, P.D. and Stephens, D. (1993). ‘Learning about literacy: a 30-year journey.’ In:
Gordon, C.J., Labercane, G.D. and McEachern, W.R. (Eds) Elementary Reading:
Process and Practice. Boston: Ginn Press.
Pressley, M. and Afflerbach, P. (1995). Verbal Protocols of Reading: The Nature of
Constructively Responsive Reading. Hillsdale, NJ: Erlbaum.
RAND Reading Study Group. (2002). Reading for Understanding: Toward an R&D
Program in Reading Comprehension. Santa Monica, CA: RAND Corporation.
Rankin, E.F. (1965). ‘The cloze procedure: a survey of research.’ In: Thurston, E. and
Hafner, L. (Eds) Fourteenth Yearbook of the National Reading Conference. Clem-
son, SC: National Reading Conference.
Resnick, D.P. (1982). ‘History of educational testing.’ In: Wigdor, A.K. and Garner,
W.R. (Eds) Ability Testing: Uses, Consequences and Controversies (Part 2). Wash-
ington, DC: National Academy Press.
Richards, I.A. (1929). Practical Criticism. New York: Harcourt, Brace.
Rosenblatt, L.M. (1978). The Reader, the Text, the Poem: The Transactional Theory of
the Literary Work. Carbondale, IL: Southern Illinois University Press.
Rumelhart, D.E. (1977). ‘Understanding and summarizing brief stories.’ In: LaBerge,
D. and Samuels, J. (Eds) Basic Processes in Reading Perception and Comprehen-
sion. Hillsdale, NJ: Erlbaum.
Rumelhart, D.E. (1981). ‘Schemata: the building blocks of cognition.’ In: Guthrie,
J.T. (Ed) Comprehension in Teaching. Newark, DE: International Reading Associ-
ation.
Rumelhart, D.E. and Ortony, A. (1977). ‘The representation of knowledge in memory.’
In: Anderson, R.C., Spiro, R.J. and Montague, W.E. (Eds) Schooling and the Acquisi-
tion of Knowledge. Hillsdale, NJ: Erlbaum.
Shanahan, T., Kamil, M.L. and Tobin, A.W. (1982). ‘Cloze as a measure of intersenten-
tial comprehension’, Reading Research Quarterly, 17, 2, 229–55.
Shavelson, R.J., Baxter, G.P. and Pine, J. (1992). ‘Performance assessments: political
rhetoric and measurement reality’, Educational Researcher, 21, 4, 22–7.
Smith, N.B. (1966). American Reading Instruction. Newark, DE: International Reading
Association.
Spiro, R. and Jehng, J. (1990). ‘Cognitive flexibility and hypertext: theory and technol-
ogy for the linear and nonlinear multidimensional traversal of complex subject
matter.’ In: Nix, D. and Spiro, R. (Eds) Cognition, Education and Multimedia:
Exploring Ideas in High Technology. Hillsdale, NJ: Erlbaum.
Starch, D. (1915). ‘The measurement of efficiency in reading’, Journal of Educational
Psychology, 6, 1–24.
Stein, N.L. and Glenn, C.G. (1977). ‘An analysis of story comprehension in elementary
school children.’ In: Freedle, R.O. (Ed) Discourse Processing: Multidisciplinary
Perspective. Norwood, NJ: Ablex.
Taylor, W. (1953). ‘Cloze procedure: a new tool for measuring readability’, Journalism
Quarterly, 9, 206–223.
Thorndike, E.L. (1917). ‘Reading as reasoning: a study of mistakes in paragraph read-
ing’, Journal of Educational Psychology, 8, 323–32.
Touton, F.C. and Berry, B.T. (1931). ‘Reading comprehension at the junior college lev-
el’, California Quarterly of Secondary Education, 6, 245–51.
Tuinman, J.J. (1974). ‘Determining the passage-dependency of comprehension ques-
tions in 5 major tests’, Reading Research Quarterly, 9, 2, 207–23.
Tuinman, J.J. (1978). ‘Criterion referenced measurement in a norm referenced context.’
In: Samuels, J. (Ed) What Research has to Say about Reading Instruction. Newark,
DE: International Reading Association.
Turner, A. and Greene, E. (1977). The Construction of a Propositional Text Base (Tech-
nical Report No. 63). Boulder: University of Colorado Press.
Valencia, S. and Pearson, P.D. (1987). New Models for Reading Assessment (Read Ed.
Report No. 71). Urbana: University of Illinois, Center for the Study of Reading.
Valencia, S., Pearson, P.D., Peters, C.W. and Wixson K.K. (1989). ‘Theory and practice
in statewide reading assessment: closing the gap’, Educational Leadership, 47, 7,
57–63.
Valencia, S.V., Pearson, P.D., Reeve, R., Shanahan, T., Croll, V., Foertsch, D., Foertsch,
M. and Seda, I. (1986). Illinois Assessment of Educational Progress: Reading (for
grades 3, 6, 8, 10). Springfield, IL: Illinois State Board of Education.
Vygotsky, L. (1978). Mind in Society: The Development of Higher Psychological
Processes. Cambridge, MA: Harvard University Press.
Wertsch, J.V. (1985). Vygotsky and the Social Formation of Mind. Cambridge, MA:
Harvard University Press.
Winograd, P. and Johnston, P. (1980). Comprehension Monitoring and the Error Detec-
tion Paradigm (Tech. Rep. No. 153). Urbana: University of Illinois, Center for the
Study of Reading (ED 181 425).
Further reading
Buly, M. and Valencia, S.W. (in press) ‘Below the bar: profiles of students who fail state
reading assessments’, Educational Evaluation and Policy Analysis.
Campbell, J.R. (1999). Cognitive processes elicited by multiple-choice and constructed-
response questions on an assessment of reading comprehension. Unpublished
doctoral dissertation, Temple University.
Campbell, J.R., Voelkl, K.E. and Donahue, P.L. (1998). NAEP 1996 Trends in Academic
Progress: Achievement of U.S. Students in Science 1969 to 1996, Mathematics, 1973
to 1996, Reading, 1971 to 1996 and Writing, 1984 to 1996. NCES 97–985. Washing-
ton DC: U.S. Department of Education.
Carroll, J. (1963). ‘A model of school learning’, Teachers College Record, 64, 723–32.
Davis, F. B. (1972). ‘Psychometric research on comprehension in reading’, Reading
Research Quarterly, 7, 4, 628–78.
Destefano, L., Pearson, P.D. and Afflerbach, P. (1997). ‘Content validation of the 1994
NAEP in Reading: Assessing the relationship between the 1994 assessment and the
reading framework.’ In: Linn, R., Glaser, R. and Bohrnstedt G. (Eds) Assessment in
Transition: 1994 Trial State Assessment Report on Reading: Background Studies.
Stanford, CA: The National Academy of Education.
Durrell, D.D. (1955). Durrell Analysis of Reading Difficulty. New York: Harcourt,
Brace and World.
Eurich, A.C. (1936). Minnesota Speed of Reading Test for College Students. Minneapo-
lis: University of Minnesota Press.
Frederiksen, N. (1984). ‘The real test bias: influences of testing on teaching and learn-
ing’, American Psychologist, 39, 193–202.
Gagné, R.M. 1965. The Conditions of Learning. New York: Holt, Rinehart and Winston.
Garavalia D. (in press). The impact of item format on depth of cognitive engagement.
Unpublished doctoral dissertation, Michigan State University.
Gardner, H. (1985). The Mind’s New Science: A History of the Cognitive Revolution.
New York: Basic Books.
Ginn and Company. (1982). The Ginn Reading Program. Lexington, MA: Ginn and
Company.
Glaser, R., Linn, R. and Bohrnstedt, G. (1997). Assessment in Transition: Monitoring
the Nation’s Educational Progress. Stanford, CA: National Academy of Education.
Hartman, D.K. (1995). ‘Eight readers reading: the intertextual links of proficient read-
ers reading multiple passages’, Reading Research Quarterly, 30, 3.
Illinois Goal Assessment Program. (1991). The Illinois Reading Assessment: Classroom
Connections. Springfield, IL: Illinois State Board of Education.
Johnson, D.D. and Pearson, P.D. (1975). ‘Skills management systems: a critique’, The
Reading Teacher, 28, 757–64.
Jones L.V. (1996). ‘A history of the National Assessment of Educational Progress and
some questions about its future’, Educational Researcher, 25, 7, 1–8.
Klare, G.R. (1984). ‘Readability.’ In: Pearson, P.D., Barr, R., Kamil, M. and Mosenthal,
P. (Eds) Handbook of Reading Research. New York: Longman.
Royer, J.M., Kulhavy, R.W., Lee, J.B. and Peterson, S.E. (1984). ‘The sentence verifica-
tion technique as a measure of listening and reading comprehension’, Educational
and Psychological Research, 6, 299–314.
Royer, J.M., Lynch, D.J., Hambleton, R.K. and Bulgarelli, C. (1984). ‘Using the sentence
verification technique to assess the comprehension of technical text as a function of
subject matter expertise’, American Educational Research Journal, 21, 839–69.
Salinger, T. and Campbell, J. (1998). ‘The national assessment of reading in the USA.’
In: Harrison, C. and Salinger, T. (Eds) Assessing Reading: Theory and Practice. Lon-
don: Routledge, 96–109.
Sarroub, L. and Pearson, P.D. (1998). ‘Two steps forward, three steps back: the stormy
history of reading comprehension assessment’, The Clearing House, 72, 2, 97–105.
Schreiner, R.L., Hieronymus, A.N. and Forsyth, R. (1969). ‘Differential measurement of
reading abilities at the elementary school level’, Reading Research Quarterly, 5, 1.
Silver, Burdett, and Ginn. (1989). World of Reading. Needham, MA: Silver, Burdett,
and Ginn.
Spearitt, D. (1972). ‘Identification of subskills of reading comprehension by maximum
likelihood factor analysis’, Reading Research Quarterly, 8, 92–111.
Stenner, A.J. and Burdick, D.S. (1997). The Objective Measurement of Reading Com-
prehension. Durham, NC: MetaMetrics Inc.
Stenner, A.J., Smith, D.R., Horabin, I. and Smith, M. (1987). Fit of the Lexile Theory to
Item Difficulties on Fourteen Standardized Reading Comprehension Tests. Durham,
NC: MetaMetrics Inc.
Thurstone, L.L. (no date). Psychological examination (Test 4). Stoelting.
Touchstone Applied Science Associates. (1995). Degrees of Reading Power. Benbrook,
TX: Touchstone Applied Science Associates.
Valencia, S. and Pearson, P.D. (1987). ‘Reading assessment: time for a change’, The
Reading Teacher, 40, 726–33.
White, E.B. (1952). Charlotte’s Web. New York: Harper and Row.
Yepes-Bayara, M. (1996). ‘A cognitive study based on the National Assessment of Edu-
cational Progress (NAEP) Science Assessment.’ Paper presented at the annual
meeting of the National Council on Measurement in Education, New York.
Notes
1 This tradition of isomorphism between the infrastructure of tests and curriculum has been a persistent
issue throughout the century. See, for example, Johnson and Pearson (1975) and Resnick (1982). Also see
Smith (1966) for an account of the expansion of reading comprehension as a curricular phenomenon.
2 The use of more than one right answer predates the infamous a, b, c (a and b) multiple-choice format as
well as the systematic use of the ‘more than one right answer’ approach used in some state assessments in
the 1980s and 1990s (Pearson et al., 1990).
3 It is somewhat ironic that the sort of thinking exhibited in this piece did not become the dominant view in the
teens and twenties. Unquestionably, Thorndike was the pre-eminent educational psychologist of his time
(Thorndike, 1917). Further, his work in the psychology of learning (the law of effect and the law of conti-
guity) became the basis of the behaviorism that dominated educational psychology and pedagogy during
this period, and his work in assessment was highly influential in developing the components of classi-
cal measurement theory (reliability and validity). Somehow this more cognitively oriented side of his
work was less influential, at least in the period in which it was written.
4 During this period, great homage was paid to intellectual ancestors such as Edmund Burke Huey, who as
early as 1908 recognized the cognitive complexity of reading. Voices such as Huey’s, unfortunately, were
not heard during the period from 1915 to 1965 when behaviorism dominated psychology and education.
5 It is not altogether clear that schema theory is dead, especially in contexts of practice. Its role in psycho-
logical theory is undoubtedly diminished due to attacks on its efficacy as a model of memory and cognition.
See McNamara et al. (1991) or Spiro and Jehng (1990, pp.163–205).
6 Smagorinsky (in press) uses the phrase ‘inscribed’ in the text as a way of indicating that the author of the
text has some specific intentions when he or she set pen to paper, thereby avoiding the thorny question of
whether meaning exists ‘out there’ outside of the minds of readers. We use the term here to avoid the very
same question.
7 Most coherent model is defined as that model which provides the best account of the ‘facts’ of the text
uncovered at a given point in time by the reader in relation to the schemata instantiated at that same point
in time.
8 Note that this approach tends, on average, to favor those students who have high general verbal skills as
might be indexed by an intelligence test, for example. These will be the students who will possess at least
some knowledge on a wide array of topics (Johnston, 1984a, 1984b).
9 We find it most interesting that the ultimate psychometrician, Frederick Davis (e.g. 1968), was fond of ref-
erencing the New Criticism of I.A. Richards (1929) in his essays and investigations about comprehension.
10 Rosenblatt credits the idea of transaction to John Dewey, who discussed it in many texts, including Expe-
rience and Education (Dewey, 1938).
8 Significant moments in the history of
reading assessment in the UK
Chris Whetton
It was the initial hope and aim of the discussion series of seminars which gave rise to
this book that a unified theory of reading assessment would be produced. If successful,
this would give a relationship between the psychological approaches to reading and the
view of literacy as a process steeped in the understanding of all the information impart-
ed by text. As with many endeavours, this has had to be modified over time into a lesser
objective (see Marian Sainsbury’s introduction).
This chapter will take one view, based on a historical survey of some important tests
in use in the UK over the last eighty years. This view is that the prevailing definition of
reading is a reflection of the needs and values of the education system at the time, which
themselves arise from the prevailing attitudes and requirements of society in general.
This viewpoint is best illustrated by a personal anecdote. In 1990, the first National
Curriculum assessments for England were under development. The present author was
the Director of the NFER project which was creating ‘standard assessment tasks’
(SATs) for the assessment of seven-year-old pupils. The specification had said that these
were to be naturalistic tasks to be used by teachers in the classroom as a part of their
usual practice. Many aspects of the curriculum from science through to history were to
be assessed, but a particular bone of contention was the assessment of reading. The
political background of the time was that claims were being made that standards had
fallen during the 1980s, due to teachers’ reliance on children’s books to teach reading
using whole-word and context strategies (summarised as ‘real books’) as opposed to
the phonics methods advocated by the critics, educational psychologists and others
(Turner, 1990a and 1990b). This debate had developed such importance that the issue of
how to assess was to be resolved by the responsible minister himself – the Secretary of
State for Education. At the time, this was Kenneth Clarke, famous as a ‘big picture’
strategist with little wish to grapple with detail.
A meeting was arranged to discuss the issue of reading assessment. Representatives
of the Schools’ Examination and Assessment Council (SEAC) (the government agency
then responsible), Her Majesty’s Inspectors of Schools (HMI) (involved in steering the
process) and the education ministry, then called the Department for Education and Sci-
ence (DES) gathered in a meeting room and waited. The minister, Mr Clarke, swept in
surrounded by an entourage of (about six) political advisers. He listened for a short time
to the muddled arguments of the education professionals and then pronounced, with
words along the lines of:
It doesn’t seem so difficult to me to find if a child can read. You give them a book and
ask them to read. If they can read, they read the book, if they can’t read they don’t
read the book. It’s perfectly simple. Mind you, I’m not having anything to do with this
real books nonsense.
With that Delphic remark he and his entourage departed from the room, moving on to
their next decision.
Translated into policy by SEAC, HMI and DES, this meant that the reading assess-
ment was one in which children read from actual books, with a choice of book; reading
individually to the teacher who maintained a running record, including a miscue analy-
sis. The children also answered questions about the meaning of what they had read. The
assessment is described more fully in the last section of this chapter.
While there was a wealth of educational and assessment research behind this form of
assessment, ultimately it was the person empowered by society who determined the
nature of the assessment. This story is unusual in that the individual concerned was such
a high-ranking representative of society and the decision so abrupt. However, the con-
tention of this chapter is that this is always the case, but that the representative(s)
concerned varies according to time and circumstances, being in some cases an educational
psychologist, in others a curriculum developer and in others an examination officer.
The plan for this chapter has therefore been to select tests which have had an histori-
cal importance in Britain and then to use these to illustrate the view of the construct of
reading prevailing at the time. This is related as far as possible to the prevailing views
of education and society at that time. In determining which reading tests are considered
to be those with historical importance, a set of general principles has been followed.
First, the tests should have been influential, that is, they should have been widely used
in the education system and they should have been referred to in the research literature
and in books for teachers and students. They should also have served as models for
other tests, so that other authors and publishers produced tests of the same general type.
A second principle has been that the tests should have had ‘staying power’, that is they
should have been used over a long period of time. Finally, tests have been selected for
their importance in serving the educational priorities of their time.
A preliminary list of tests meeting these principles was proposed to the group attend-
ing the third ESRC seminar. This was largely accepted and one or two additions were
proposed. This led to the final list of ‘important’ tests which is:
• Burt Word Reading Test (1921)
• Schonell Reading Tests (1942)
• Reading Test AD (NFER, 1955) A.F. Watts
• Neale Analysis of Reading Ability (1958)
• Gap and Gapadol (1970, 1973)
• Edinburgh Reading tests (1975)
The Burt Word Reading Test was first published in 1921 and consists of a set of 110
words, printed with five on each line. The words are in (broadly) increasing order of dif-
ficulty and the child is asked to read them aloud, continuing until they fail to read any
word on two consecutive lines. The first line is:
to is up he at
and the final line is:
alienate phthisis poignancy ingratiating subtlety
The score is the number of words read aloud correctly, which can then be converted to a
reading age. The concept of reading is very rudimentary. Provided words are pro-
nounced aloud, success is assumed. The child may be unfamiliar with the word or its
meaning but still be regarded as able to read it.
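
To make the scoring rule concrete, the short Python sketch below applies the procedure as described: count the words read aloud correctly, stopping once the child has failed every word on two consecutive lines. It is our illustrative reading of that rule, not a reproduction of Burt’s manual, and the conversion of the raw score to a reading age (which relies on the published norms tables) is deliberately left out.

def burt_raw_score(correct_flags, words_per_line=5):
    """Raw score for a Burt-style graded word reading test.

    `correct_flags` holds one boolean per word, in test order, marking
    whether the child read that word aloud correctly.
    """
    lines = [correct_flags[i:i + words_per_line]
             for i in range(0, len(correct_flags), words_per_line)]
    score = 0
    failed_lines_in_a_row = 0
    for line in lines:
        score += sum(line)
        failed_lines_in_a_row = 0 if any(line) else failed_lines_in_a_row + 1
        if failed_lines_in_a_row == 2:
            break  # discontinue: every word failed on two consecutive lines
    return score

# The raw score would then be converted to a reading age using the
# published norms table, which is not reproduced here.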
The information in the following section is derived largely from Hearnshaw’s (1979)
biography of Burt.
The initial version of the test was produced by Cyril Burt as one of a battery of Men-
tal and Scholastic tests. They were devised for use in London, where Burt was working
as a psychologist for London County Council. The primary reason for his appointment
in 1913 was to assist with the examination of pupils in elementary schools nominated
for admission to schools for the mentally deficient. Previously this had been the
province of medical officers, who were suspected of sending many pupils to special
schools incorrectly and it was hoped that such errors could be avoided through the use
of psychological testing. Burt was the sole psychologist (and only part-time) but set up
a programme in which he spent two days a week testing in schools and a third on test
construction. He initially published a report on the Distribution and Relation of Educa-
tional Abilities, concluding that the educational system was failing to push the brightest
to the limits of their potentialities (plus ça change) and was also failing to make provi-
sion for the backward, defined as children who, though not defective, were unable to do
the work of even a class below their age. Burt utilised the tests he adapted (such as
Binet’s intelligence scale) or devised, to underpin this practical work on educational
attainment. In 1921 he published Mental and Scholastic Tests, which contained sections
on the validity of the tests, describing among other matters test development, item
analysis and other statistical methods. It also contained the series of scholastic tests for
reading, spelling, arithmetic, writing, drawing, handwork and composition. Among
these was the Word Reading Test. The book was also balanced in its advocacy of test-
ing, stressing that tests should be regarded as ‘but the beginning, never the end, of the
examination of the child … to appraise the results demands the tact, the experience, the
imaginative insight of the teacher born and trained’.
Always immodest, Burt claimed that many of the technical procedures used in devel-
oping the tests were used for the first time. These included item analysis, tetrachoric
correlations, factor scores, correlations of person’s scores, representative sampling and
practical regression equations. Some of these he devised, but others were adapted from
the work of others. However, what this did represent was the first application of test
development technology, as now understood, to educational assessment in Britain.
In terms of defining the construct of reading, Burt said that he had selected words,
sentences and passages from a large preliminary collection taken from reading books
used in schools, from children’s own talk and compositions and from books and maga-
zines which children read out of school. Teachers then assisted with the selection of the
texts, since a variety of methods of teaching reading was in vogue, resulting in different
vocabularies in each. The words selected were common to all. The final selection of
words was based on an item analysis and correlations of each word with teachers’ ratings
of children’s reading ability.
This account is of interest since it shows that two elements of the process of develop-
ing reading tests which we sometimes think of as modern concerns – the attempt to find
material of interest to children and common to all their experiences, and the relationship
of the test outcomes to teachers’ ratings of children – have been in place from the
start of the systematic testing of reading.
Mental and Scholastic Tests was an influential publication and second, third and fourth
editions followed, the final one in 1962. The Word Reading Test was re-standardised by
P.E. Vernon in 1938 using a Scottish sample, and the Scottish Council for Research in Edu-
cation (SCRE) undertook a slight revision (some words were moved to new positions) and
re-standardisation in 1974. By that time, the utility of the test was being questioned and a
1976 review says of this last exercise that ‘it seems a pity in view of the basically archaic
conception of reading underlying the tests’ (Vincent and Cresswell, 1976).
Nevertheless, this type of reading test provided a model for many other tests and
appears even in modern psychological batteries as a means of assessing children’s reading
ability (e.g. British Ability Scales, Elliott et al., 1997).
In terms of its vision of the concept of reading, it can be seen in its context of attempt-
ing to provide a rapid measure, usable with individuals, to help identify children who were
under-performing in an urban education system which was trying to cope with a large
number of pupils from impoverished backgrounds. It brought psychometrics into educa-
tional measurement for the first time and became a test with great longevity and influence.
Just as the First World War provided the background to the development of Burt’s Men-
tal and Scholastic Tests, the Second World War provided the context for the development
of the Schonell Reading Tests. These consist of four tests, shown in Table 8.1.
The Graded Word Reading Test follows Burt’s model almost exactly. There are 100
words, five to a line, with the lines in pairs. Children read the words aloud to the tester
until ten successive mistakes are made. The score is the number of words said aloud
correctly. This is converted to a Reading Age.
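To make the scoring rule concrete, the following minimal sketch (in Python, with invented names; not the published procedure) shows the discontinuation rule and raw-score calculation just described. Conversion of the raw score to a Reading Age would use the test’s own norm tables, which are not reproduced here.

# Illustrative sketch: stop after ten successive errors; the raw score is the
# number of words read aloud correctly up to that point.
def graded_word_raw_score(responses, stop_after_errors=10):
    """responses: booleans in test order, True = word read correctly."""
    score = 0
    run_of_errors = 0
    for correct in responses:
        if correct:
            score += 1
            run_of_errors = 0
        else:
            run_of_errors += 1
            if run_of_errors >= stop_after_errors:
                break  # discontinue testing, as the instructions require
    return score

# Example: 12 words read correctly, then ten successive failures end the test.
print(graded_word_raw_score([True] * 12 + [False] * 10 + [True] * 5))  # -> 12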
The Simple Prose Reading Test has four paragraphs of prose which are read aloud to
the tester. The time taken is recorded. The passage is then removed and several com-
prehension questions asked. Reading Ages can be derived for word recognition,
comprehension and speed.
In a review of the tests undertaken in 1984, Stibbs refers to them as ‘self-validat-
ing’. They had become such familiar instruments and were so widely used by teachers
that ‘to say they had content validity would be almost tautologous’. The norms were so
long established they were used as benchmarks against which to standardise other
tests.
The Graded Word Reading Test is criticised (by Stibbs) for the concept of reading
it espouses. This is inferred, rather than stated by the test authors, but seems to be that
the essence of reading is recognising words and that only words pronounced correct-
ly are recognised. Recognition is gained through phonic analysis and synthesis
together with guessing. This conception had hardly moved on from that of Burt thirty
years before.
There were, though, the other three tests, which did provide other measures – comprehension and speed. The conception of reading here is somewhat broader, though only one passage is used and it has relatively mundane subject matter (‘My Dog’).
The importance of the Schonell tests lies in their acceptance and use by the teaching
profession. The survey of test use for the Bullock Report A Language for Life (DES,
1975) found that they were by far the most popular reading tests in schools (thirty
years after publication!). The Graded Word Reading Test was the most used test in
both primary and secondary schools and the two silent reading tests were second and
third most popular in secondary schools and fourth and fifth in primary schools. This
popularity reflected the tests’ use for a wide range of purposes: screening for remedial
attention; streaming and setting; monitoring progress of individuals; informing parents
and providing transfer information either prior to the move to secondary school or on
entry to the secondary school.
With their different measures, there was some scope for the Schonell Reading Tests
to give some diagnostic information. However, this function was left to a separate set
of diagnostic tests.
Reading Test AD has been included in this list as a representative of a large number of
multiple-choice sentence-completion tests. These are important because during the
1950s and on into the 60s and 70s, they came to be the most widely used form of group
reading test. As such they defined a view of reading comprehension as operating at a
sentence level with understanding largely dependent on vocabulary and on word order
and the other structures of English. Larger texts and authenticity were regarded as less important than the virtues of reliability and ease of marking.
Reading Test AD is a 35-item multiple-choice sentence-completion test. It is speed-
ed, having a short administration time of 15 minutes. The target age group is
8–10-year-olds. An example item is:
The engine driver and the guard left the train on reaching the (door, hill, farm,
station, street)
The test was first published in 1955 and is very similar to a series of other NFER read-
ing tests of the period (Reading Test BD for 7–11-year-olds; Reading Test EH1 for
11–15-year-olds). However, it had formerly been known as Sentence Reading Test 1
and had been used in an enquiry into standards of reading conducted in Kent in 1954
(Morris, 1959).
The manual of the published test gave no indication of its intended purpose, but its first
use and the author give the clue to this. It was designed as a test for use in surveys and
monitoring of standards. As such its reliability and ease (cheapness) of marking were
important and its validity (again, no comment on this in the manual) reflects a behaviourist
view of reading, as in the USA at the time (Pearson and Hamm, this volume).
Watts was also an author of the Watts-Vernon test1 which was used in national sur-
veys of schoolchildren in England and Wales. It was first developed in 1938 and was
used in national surveys in 1948, 1952, 1956, 1961, 1964 and 1970–71. Brooks et al. (1995) summarise the results of these surveys as showing that the average reading score
rose slightly between 1948 and 1952 (the improvement being attributed to recovery of
the education system after the war years) followed by a period of little change. The sta-
bility, as measured by other tests, in fact continued until around 1987 and 1988, when a
decline was perceived. The successor test in use for national surveys, NS6, was also a
multiple-choice sentence completion test.
The failure of reading scores to increase, despite periods of increased expenditure
on education was one of the reasons for the introduction of a National Curriculum in
England and Wales and was still being cited by a government minister in 2003 as a rea-
son for continuing with a programme of National Curriculum Assessment (including
reading tests) (Miliband, 2003). It is interesting, though idle, to speculate that one reason for
the lack of an increase in scores may have been that the monitoring tests espoused a dif-
ferent view of reading attainment than that current in primary schools. A test used in this
way with no accountability for individual schools had little backwash effect on teaching
and consequently the efforts of schools may not have been directed to the skills required
for sentence completion tests.
Reading Test AD also differed from the earlier tests in the form of its results. Rather
than reading ages (easily interpretable by teachers but statistically suspect), the outcomes
were standardised scores, normalised and age-adjusted with a neat mean of 100 and stan-
dard deviation of 15. These have convenient statistical properties, useful for averaging and manipulating survey results, but are much less intuitive for teachers.
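The linear scaling behind such standardised scores can be illustrated with a minimal sketch (an illustration under simplifying assumptions, not the procedure used for Reading Test AD): real standardisation also normalises the score distribution and adjusts for age using separate norm tables, steps omitted here.

from statistics import mean, stdev

def standardised_scores(raw_scores, target_mean=100, target_sd=15):
    # Rescale raw scores linearly to the conventional mean of 100 and standard
    # deviation of 15; age adjustment and normalisation are left out.
    m, s = mean(raw_scores), stdev(raw_scores)
    return [round(target_mean + target_sd * (x - m) / s) for x in raw_scores]

print(standardised_scores([12, 18, 22, 25, 29, 34]))  # -> [78, 90, 97, 103, 111, 120]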
Successors to these sentence completion tests remain in use in the new millennium.
Two examples are the Suffolk Reading Test (Hagley, 1987) and The Group Reading
Test (Macmillan Test Unit, 1997) which is currently very popular.
An important successor test, which might have stood alone in this listing except that it offers little that is really new in terms of its conception of reading, is Young’s Group Reading Test. This is a test for the transition from infant to junior school (now years 2 and 3;
6–8-year-olds). It was published in 1968, restandardised in the late 1970s, with a second
edition in 1980. Its importance lies in that by 1979, 33 local education authorities
(LEAs) used this test for monitoring and record-keeping purposes (about a third of the
LEAs at that time). As such, and as a rapidly administered (13-minute) multiple-choice
sentence completion test it is clearly in the Watts-Vernon, NS6, Reading Test AD
mould. The innovation was the first section where pictures are given and the child must
circle the word matching the picture. Here the construct of reading therefore incorpo-
rates not only recognising the written word but identifying the object in the picture
which must be matched. The manual gives no evidence of construct or content validity
but does offer evidence of concurrent and predictive validity with impressive correlations
to a range of other reading tests.
The Neale Analysis of Reading Ability is an individual test of reading in which the child
must read aloud a complete narrative passage. This process provides measures of their rate
of reading, accuracy and comprehension. There are three parallel forms, each with six pas-
sages. The passages are ordered in terms of difficulty as indicated by length, vocabulary
and sentence structure. There are criteria for which passage to begin the test with and for
stopping the test, providing a measure of tailored testing. For each passage, the time taken
to read it is measured and recorded; scores from the passages being combined to give an
overall measure of rate of reading. As the child reads, the tester records errors on a record
sheet and codes these into six types of error in oral reading (mispronunciations, substitu-
tions, refusals, additions, omissions, reversals). Finally, a set of comprehension questions is asked and the number correct provides a raw score for comprehension.
All three measures can be converted, using tables, to give reading ages for accuracy,
rate and comprehension.
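Purely as an illustration of the record-keeping just described (the function, field names and figures are invented, and conversion to Reading Ages would use the published norm tables), a sketch of deriving the three Neale-style measures might look like this:

from collections import Counter

ERROR_TYPES = {"mispronunciation", "substitution", "refusal",
               "addition", "omission", "reversal"}

def neale_style_measures(words_read, errors, seconds_taken, comprehension_correct):
    """errors: error-type strings noted on the record sheet as the child reads aloud."""
    profile = Counter(e for e in errors if e in ERROR_TYPES)
    return {
        "error_profile": dict(profile),                      # diagnostic breakdown
        "accuracy_raw": words_read - sum(profile.values()),  # words read correctly
        "rate_wpm": round(60 * words_read / seconds_taken, 1),
        "comprehension_raw": comprehension_correct,
    }

print(neale_style_measures(120, ["omission", "substitution", "substitution"], 95, 6))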
The first edition of the Neale Analysis was published in 1958 and arose from Marie
Neale’s PhD thesis undertaken at Birmingham University in the early 1950s. (When the
present author met Neale during the 1990s as part of discussions about the second edi-
tion, she spoke of the work she did alone for the standardisation, cycling along snowy
country lanes to village schools to test their pupils.)
The manual gives a clear purpose for the test: to ‘provide a sympathetic and
stimulating situation in which (the child’s) difficulties, weaknesses, types of error,
persistence and attitudes could be assessed’. The miscue analysis in the accuracy
assessment, the provision of three measures and the inclusion in the package of a
further set of optional supplementary diagnostic tests all point clearly to its concep-
tion as a diagnostic instrument, intended to be of use to classroom teachers in a
clinical informative sense. Again, the test is reflecting its times with its post-war
concerns to raise levels of literacy following a long period of disruption of the edu-
cational process. The means of doing this is through improving the knowledge that
teachers had about the profile of skills of their pupils, helping them to decide on the
next steps.
With this diagnostic intent, the conception of reading has become broader than in any
of the previous tests described. The intention is that the reading experience should be
authentic. Children read whole passages from a reader which is designed to appear as a
children’s book with illustrations, large print and white space on the page. The manner
of testing, too, is authentically like classroom practice of the child reading to an adult.
The provision of parallel forms allows teachers to conduct investigations, even to use
the material for teaching after the test and yet still re-assess the child in the future in the
same way and with the same construct of reading.
For these reasons, the Neale Analysis remained popular with teachers through to the
1980s. By then, Marie Neale had emigrated to Australia where she produced a second
edition (Neale, 1988). This was re-imported, revised and restandardised for British use
(Neale et al., 1989).
In the 1960s and 1970s, a new form of assessment of reading comprehension began to
appear. This was the cloze procedure credited as the invention of W.L. Taylor.2 The pro-
cedure had originally been developed by Taylor (1953) as a means of estimating the
readability of children’s books. Words are removed from a passage in a systematic man-
ner (say, every fifth word) and the readability of the text is defined as the ease of supplying the missing words. This was soon extended into a method of assessing the
persons supplying the words rather than the readability of the passage. As Pearson (this
volume) describes, the classic cloze procedure has evolved in many directions in terms
of the manner of selecting the deleted words and the level of assistance in providing the
missing word. This can extend from no clues, through providing the first letter, to giving
several multiple-choice alternatives.
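A minimal sketch of classic cloze deletion, with the first-letter variant included as an option, is given below; the passage, names and exact-word scoring rule are illustrative assumptions rather than any published test’s procedure.

import re

def make_cloze(text, every_nth=5, first_letter_clue=False):
    # Delete every nth word; return the gapped passage and the deleted words.
    words = text.split()
    gapped, answers = [], []
    for i, word in enumerate(words, start=1):
        if i % every_nth == 0:
            core = re.sub(r"\W", "", word)
            answers.append(core)
            gapped.append(core[0] + "_" * (len(core) - 1) if first_letter_clue
                          else "_____")
        else:
            gapped.append(word)
    return " ".join(gapped), answers

def score_cloze(responses, answers):
    # Classic exact-word scoring; accepting plausible alternative words would
    # need a marking key or human judgement.
    return sum(r.strip().lower() == a.lower() for r, a in zip(responses, answers))

passage = "A man is coming this way now and he seems to be in a hurry"
gapped, answers = make_cloze(passage)
print(gapped)                                          # gaps at every fifth word
print(score_cloze(["this", "seems", "run"], answers))  # -> 2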
In the UK, the first influential tests to use cloze for native speakers were the Gap and
Gapadol tests. (The qualification of ‘native speakers’ is required since cloze was, and
remains, much more important in tests of English as an additional/foreign/second lan-
guage.) In fact, the two tests were Australian in origin, devised by John McLeod. Each
was adapted for British use by a second author.
The Gap Reading Comprehension Test consists of a series of short passages of
increasing difficulty. It has two forms: B with seven passages and R with eight. The
tests use a modified cloze procedure, having 44 gaps at irregular intervals. The child has
to write in the missing words. For scoring, there is one acceptable word for each gap.
This was determined during development as the response of ‘expert readers’ (university
undergraduates). Incorrect spelling is permitted but the word must be grammatically
correct. The test is aimed at primary school pupils, aged 8–12 years. British norms were
obtained from Scotland, Wales and England.
The Gapadol Reading Comprehension Test is very similar to the Gap test, but it cov-
ers a wider age range: from 7 to 17. There is no UK standardisation data. For this test
there are again two forms, but each with eight passages, including practice. The cloze
gaps are at irregular intervals with 83 in one form and 81 in the other. The student writes
a word into each gap and if this matches that of the ‘first class readers’ a mark is scored.
Details are given in McLeod and Anderson (1970).
These two tests never appear to have gained a wide body of users. Their impor-
tance lies in their introduction of a new technique (to teachers as well as testers) that
appeared to be a truer measure of reading comprehension than methods requiring a
mediating vehicle like an open-ended or multiple-choice question on the text. Certainly,
this is the claim made in the manual of the Gap test: ‘the only stimulus to which the
child must respond is the reading passage itself; there are no extraneous questions to
constitute an intervening variable’. The implication is that this is a more valid test of reading comprehension than answering questions based on a written passage.
The popularity of cloze techniques as assessments for second language learners aris-
es from the fact that the technique gives information on the extent to which grammatical
processes are understood. Taylor (1953) himself used the notion of ‘grammatical expec-
tation’ to justify the validity of cloze. Some combinations of words are more likely than
others: ‘Merry Christmas’ is more likely than ‘Merry Birthday’. It is not easy to explain
why (as anyone who tries to explain English to non-native speakers knows) but custom,
history and cliché contribute to certain combinations being more probable. A second
factor is redundancy. Taylor cites ‘A man is coming this way now’ as having a great deal
of redundancy. The singular ‘man’ is signalled three times (‘a’, ‘man’ and ‘is’), the pres-
ent tense twice (‘is’ and ‘now’), direction twice (‘coming’ and ‘this way’). He attributes
probability differences in grammatical expectation to this redundancy.
For many purposes, cloze tests have drawbacks. Classic cloze tests are tedious to
mark and, despite demonstrations of reliability, there is constant carping that words other than those produced by the expert or ‘first class’ readers are possible in the gaps.
Perhaps for these reasons, there are few (if any) pure classic cloze tests in use now.
However, variants of the cloze technique occur in many tests of reading comprehension,
for example key stage 2 National Curriculum tests in England which frequently incor-
porate a cloze retelling of a passage or text.
The construction of these tests was undertaken in the belief that instruments of this
kind, designed for use by teachers and requiring no special psychological expertise
in administration or in interpretation of results, were needed urgently to assist in the
teaching of reading.
In the light of the results she [the teacher] can adapt her methods and choose her
teaching material to remedy a weakness or satisfy a strength.
These quotations from the early manuals of the Edinburgh Reading Tests set out the
purpose of the tests and their attempt at a unique selling point. If earlier published tests
had aspired to be diagnostic, it was through looking at reading rate, accuracy and under-
standing separately. In contrast, the Edinburgh Reading Tests attempted to have many
sub-tests and hence to give detailed diagnostic information on the processes of reading.
Through this means, it was ‘hoped in particular that the tests will help the primary
school teacher to ensure that all her pupils, within reason, pass to the secondary school
with no outstanding reading disability’.
This was (and is) an extremely laudable aim. It reflects the commissioning of the
tests in the early 1970s by the Scottish Education Department and the Educational Insti-
tute of Scotland (a professional association of teachers).
The Edinburgh Reading Tests have four ‘stages’ each for a different age group and
each with a different set of sub-tests, as shown in Table 8.2.
Table 8.2 The Edinburgh Reading Tests
Each sub-test in each stage contains two or three different types of item and these also
develop across the stages. For some scales, this has a logic of sorts. In vocabulary, for
example, there is:
• recognition of pictures (Stage 1)
• sentence completion (Stages 1, 2 and 3)
• selection of a word to give the meaning of a phrase (Stage 2)
• selection of a word or phrase to fit a precis (Stage 3)
• synonyms (Stages 3 and 4).
However, most sub-tests do not have such a systematic development across the stages.
The summation of the various sub-tests leads to overall test scores which give rise to
‘deviation quotients’ (standardised scores or reading ages) giving an overall measure of
ability in reading. However, in line with the stated purposes of the tests, users can com-
pile a subtest profile and plot this on a chart. Teachers are then able to identify those
pupils’ results which are ‘sufficiently exceptional to demand special attention’. This can
be done for individuals or whole classes.
Through its wide range of sub-tests and its approach to profiling reading, the Edin-
burgh Tests attempted to provide rich information on the skills of each pupil and to
relate this to helpful diagnostic information, leading to improvements in children learning
to read.
The approach was essentially an atomised and psychometric one, which had the
intention of providing differentiated information. However, this was not achieved in a
statistical sense. The subtests tend to be highly correlated and therefore not to provide
distinctive or useful diagnostic information. This leads even the authors to conclude that
‘children in general do equally well or poorly on all the sub-tests and the various read-
ing tasks involve the same competencies to a high degree’ (e.g. Stage 3 manual, Moray House College of Education, 1981, p.28).
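The statistical point can be illustrated with a short sketch (the scores below are invented purely for illustration): when the sub-test scores correlate very highly with one another, a profile of sub-test results adds little diagnostic information beyond a single overall reading score.

import numpy as np

# rows = pupils, columns = sub-tests (e.g. vocabulary, syntax, comprehension)
scores = np.array([
    [12, 14, 11],
    [25, 27, 24],
    [18, 17, 19],
    [30, 29, 31],
    [ 9, 11, 10],
])

# Off-diagonal values close to 1 indicate that the sub-tests rank pupils in
# almost the same order, i.e. they measure much the same thing.
print(np.round(np.corrcoef(scores, rowvar=False), 2))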
The importance of the Edinburgh Reading Tests lies in their attempt to identify the
underlying processes of reading and to separate them out in a helpful diagnostic way.
This work seems to have been guided initially by agreements on the theoretical struc-
ture by the steering committee. However, it was not borne out by psychometric examination, leading the authors back to a view of reading
as a unified process.
The large role of a government department and a teachers’ association in commis-
sioning the tests and then steering their development illustrates the growing role of
society’s representatives in determining the nature of reading as a construct. However,
the failure to identify psychometrically rigorous and useful sub-tests making up read-
ing may also have been influential in causing later individuals and groups to regard
reading as a unified construct and define their tests accordingly.
In the mid 1970s a trend began in England (or perhaps the UK as a whole) to recognise
the importance of education in promoting the economic well-being of the country. A
debate on the need for improvements in education was begun by the Prime Minister,
James Callaghan, with an important speech at Ruskin College in 1976.
One strand in this concern was the establishment of a commission of enquiry into the
teaching of English. The Bullock committee published their report, A Language for Life, in 1975. This included a review of evidence from earlier national surveys of reading and
the conclusion that there was a need for a new type of test which would indicate the
extent to which pupils had developed a proficiency in reading, sufficient to serve their
personal and social needs. They proposed that new instruments should be developed
which would ‘embrace teaching objectives for the entire ability range’. Tests should
also draw on a variety of sources to ensure an extensive coverage, rather than focusing
narrowly on a single text.
In this they were echoing the worldwide shift in thinking about the nature of literacy.
La Fontaine (2004) describes this as the move from comprehension to literacy. In the
context of international comparative studies, definitions of reading broadened so that
the concept became one of a process which regards comprehension as the outcome of an
interaction between the text and the reader’s previous knowledge, a process of construc-
tion. However, outside this, the pragmatic requirements of society are also relevant.
This change reflected the influence of the ‘response to literature’ movement and the new
focus on literacy as relating to a wide range of text types. This type of definition became
the currency of international surveys of reading and, alongside this, the same movement
in definition seems to have taken place within the UK.
One of the recommendations of the Bullock report was to advocate a proper pro-
gramme of monitoring of standards of literacy. The government responded by
establishing monitoring in English, mathematics, science and, briefly, foreign lan-
guages. The surveys were to be undertaken by an Assessment of Performance Unit
(APU). For English the content of the survey was determined by a ‘Language Steering
Group’ which comprised teachers, members of Her Majesty’s Inspectorate of Schools
(HMIS), LEA advisers and academics (two professors of linguistics). This group
addressed theoretical and practical issues, including questions such as the following.
• How would the tests represent the range of reading activities that children might be
engaged in?
• What account, if any, should be taken of the many inconclusive attempts to differen-
tiate the so called ‘sub-skills’ in reading? (This phrasing gives a clue to what would
be decided in relation to this issue! It is probably a reference to the Edinburgh
Reading Tests.)
• To what extent is it appropriate in tests of reading to require pupils to provide extended
written answers?
The resolution of these issues was to reject reading ‘sub-skills’ and any attempt to iso-
late factors underlying reading. Comprehension was said to be as complex as thinking
itself and therefore no model of the reading process was possible. The previous use of
sentence completion tests was rejected in favour of tests which had coherence in content
and structure. Three categories of material were used in the surveys: works of reference,
works of literature, and everyday reading materials which pupils would encounter for
practical purposes in everyday life (comics, newspapers, forms, notices, brochures,
instructions etc.). The reading stimulus material consisted of booklets which were
intended to be naturalistic, with coherent organisation in terms of content and structure,
including contents pages, indexes, chapters etc. The works of literature used were com-
plete, rather than extracts, and covered a range of genres such as short stories and poems.
Attempts were made to include different types of response to the questions asked
about the stimulus material. Most questions required a written response, but pupils also
had to complete forms, fill in tables, label diagrams, make notes, prepare summaries and
design posters. The stated guiding principle was that the tasks should be similar to those
an experienced teacher would be likely to ask, taking into account the subject matter,
form and function of the reading material.
This expansive approach to the assessment of reading (and the definition of reading)
was possible because of the purpose of the exercise as a national survey with many tests
each being used with a defined, randomly selected group of pupils. Five surveys of reading were undertaken, one each year from 1979 to 1983, all involving both 11- and 15-
year-olds, chosen as the end points of primary and secondary schooling. A full account
of the APU Language Monitoring Programme can be found in Gorman et al. (1988). In
addition to reading, writing and speaking and listening were also surveyed.
In terms of a definition of reading to be assessed, the APU tests were important in a
number of ways. They gave an even greater emphasis to the use of actual or realistic
material, reflecting the notion of an actual purposeful activity, rather than an abstraction
or simulation as in most of the earlier tests. The range of genres was also important,
moving away from a constant reliance on story and narrative. Finally, the range of ques-
tioning was broadened and the styles of responding became more varied. To some
extent, this represented the introduction of a more literary approach in which pupils
were expected to understand the roles and meanings of text types and also to provide a
personal response.
The ‘official’ nature of the material, as a government-sponsored project, led to a
process for defining the construct of reading which was based on consensus among a
variety of representatives. This contrasts with the earlier national surveys where the test
used was devised by a single researcher. This democratisation and centralisation of the
process illustrates the view taken here that the constructs of reading assessed over the
years have reflected the needs of both the educative endeavour and wider society, their concerns at a given time and the prevailing views of the functions of the
assessment.
In historical terms, the APU monitoring programme was not long-lived. It functioned
from 1979 to about 1989, unlike its US cousin NAEP, which has continued to the present.
In 1989, the structure and responsibility for education in England and Wales received
its biggest change for forty years. A National Curriculum was introduced, passing
responsibility for what was taught from local control by teachers, schools and LEAs to
central government. This was intended to radically alter the nature of compulsory edu-
cation in the UK. The genesis of the reform lay in concerns over many aspects of the
process of schooling and its outcomes. There was a growing view that despite increas-
es in resources, standards of attainment had not improved since the Second World War.
Indeed, national surveys seemed to support such a view (Brooks et al., 1995). This
poor achievement was in contrast to that of students in other countries, and low-ability students were thought to be most at risk, giving rise to very wide ranges of attainment.
Prior to that time, teachers, schools and local education authorities had determined the
curriculum in each locality, leading to large variations in standards. During the 1970s,
some pedagogic practices received high levels of publicity and condemnation. In other
spheres of government, the ideology of the time had been based on the introduction of
market forces in order to raise standards and this philosophy was now to be applied to
education.
The National Curriculum approach to English was set out in the Cox proposals (DES
and WO, 1989) and continues to underlie the curriculum, even after several revisions.
This, though, gives rise to a conflict between validity and reliability. With the use of dif-
fering passages, a variety of answers, an interactive and relaxed format for the child and
reliance on teacher judgement, the key stage 1 reading tasks seemed not to be the type
of assessment which would survive in a high-stakes accountability regime (as became
the case in England). In the early years, they were attacked both in the press (‘a woolly-
minded fudge’ – TES, 1991) and by academic psychologists (Pumphrey and Elliott,
1991). Yet, they have survived from 1991 through to 2004. Over this period, there have
been changes. The initial tasks were tied closely to the criterion-referenced statements
of the first incarnation of the National Curriculum, but from 1996 the task was updated
to reflect a more holistic approach of the revised National Curriculum. The need for
standardisation, heightened by accountability, was met through the introduction of a formal written test, at first taken voluntarily at the choice of teachers, then becoming
compulsory. In 2004, a pilot of key stage 1 assessments allowed greater flexibility in the
timing and conduct of the tasks, taking them back to their origins. This is to become the
procedure for all schools from 2005.
In some ways, the task with its running records and miscue analysis was a continua-
tion of the approach in the Neale Analysis and other reading-aloud tests. Such a running
record approach was not new and was widely recommended before the introduction of
the National Curriculum. It was not, however, widely used. In the first pilots for the
National Curriculum tests, its use was emphasised and it received a great deal of atten-
tion at training sessions. As a result, most infant teachers became proficient at using
the running record and its diagnostic value was recognised by teachers and LEA
advisers. This helped the task to survive, rather than being replaced completely by a
simple written test.
The importance of the key stage 1 reading task for level 2 is in its authenticity for the
classroom environment. It allows choice and endeavours to allow children to show their
best attainment. It provides some diagnostic information, yet functions in a high-stakes
environment. All schools in England must use it, so it has had a wide impact in terms of
training teachers and forming their attitudes to the assessment of reading. For all these
reasons, it has to be seen as an important contribution to current understandings of the
concept of reading, for young children.
Conclusion
This listing of significant reading tests has spanned about ninety years of educational
assessment. It has included tests intended for screening, supporting teaching, diagnosis,
surveys and monitoring and, eventually, accountability. The tests themselves have
moved from word-level through sentence level to whole texts.3 In their scope and
demand on students they are very different. Yet some characteristics remain fairly con-
stant. Throughout, from Burt onwards through to the key stage tests there is an overt
desire to be fair to the children. Material, whether it is words, sentences or texts, was
selected to be familiar to the children and to be accessible to them. In all cases, the edu-
cational purpose of the tests was laudable, from Burt’s desire to prevent misplacement
of children in schools for the mentally deficient through desires to improve the teaching
of reading (Schonell, Neale, Edinburgh) to attempts to support the curriculum (APU
tests) and provide information to parents and schools (key stage tests). Where they dif-
fer is in the approach taken to reading as a construct. They move from attempts to
simplify and break up the reading process, a psychological approach, to attempts to
use naturalistic texts with a purpose for the reading, a constructivist and literary
approach. These reflect the needs of the education systems of the times and also the
view of those controlling the test construction. This gradually moved from individuals
(often psychologists) like Cyril Burt, Fred Schonell and Marie Neale to representatives
of government, committees and society’s representatives (Edinburgh, APU and key
stage tests).
The tests included in this account vary a great deal, in a sort of progression. Yet, they
are all called reading tests and considered to be so. In each case, their longevity and
wide use demonstrate that they have been accepted as measuring reading. In terms of
validity (as elaborated by Sainsbury in this volume), they all represent some construct
which is or was acceptable to society of the time. It could be argued that the essential
process of reading has not changed over this time. At each stage, it still demanded
decoding of symbols, recognition of words and extraction of meaning from sentences
and texts. But in different ages the emphasis has been on different parts of this process
leading the tests to change over this period. What was measured may well have
been related to the capabilities of the majority of children at the time, but this in itself
was a reflection of the established curriculum and its teaching. This chapter has attempt-
ed to show that the acceptability of the construct (in its own time) arises from its
adoption by individuals or groups empowered by society to devise or accept the prevail-
ing notion of reading. Such a notion can come from careful academic study, referenced
and scientific in nature, or from the spirit of the times, the current view among those
interested in the question. That seems to be the case to a greater extent with the later
tests, defined and controlled by committees established by governments.
The significant assessments of reading in the UK over the last century each reflect
a prevailing definition of reading that served the needs of society at the time, as per-
ceived by some empowered individual(s). This is the essence of their construct
validity.
References
Brooks, G., Foxman, D. and Gorman, T. (1995). Standards in Literacy and Numeracy:
1948–1994 (NCE Briefing New Series 7). London: National Commission on Education.
Burt, C. (1921). Mental and Scholastic Tests. London: King and Staples.
Department of Education and Science (1975). A Language for Life (Bullock Report).
London: HMSO.
Department of Education and Science and Welsh Office (1989). English in the National
Curriculum. London: HMSO.
Elliott, C. with Smith, P. and McCulloch, K. (1997). British Ability Scales II. Windsor:
nferNelson.
Gorman, T.P. et al. (1988). Language Performance in Schools: Review of APU Lan-
guage Monitoring, 1979–1983. London: HMSO.
Hagley, F. (1987). Suffolk Reading Scale. Windsor: nferNelson.
Hearnshaw, L. (1979). Cyril Burt: Psychologist. London: Hodder and Stoughton.
La Fontaine, D. (2004). ‘From comprehension to literacy: thirty years of reading assess-
ment.’ In: Moskowitz, J. and Stephens, M. (Eds) Comparing Learning Outcomes:
International Assessment and Education Policy. London: RoutledgeFalmer.
Macmillan Test Unit (1997). Group Reading Test II (6–14). Windsor: nferNelson.
McLeod, J. and Anderson, J. (1970). ‘An approach to the assessment of reading ability
through information transmissions’, Journal of Reading Behavior, 2, 116–43.
Miliband, D. (2003). ‘Don’t believe the NUT’s testing myths’, Times Educ. Suppl.,
4558, 14 November, 19.
Morris, J. (1959). Reading in the Primary School: an Investigation into Standards of
Reading and their Association with Primary School Characteristics. London:
Newnes Educational Publishing.
Pumphrey, P. and Elliott, C. (1991). ‘A house of cards?’ Times Educ. Suppl., 3905, 3 May.
Sainsbury, M. (1996). ‘Assessing English.’ In: Sainsbury, M. (Ed) SATS the Inside
Story: The Development of the First National Assessments for Seven-year-olds,
1989–1995. Slough: NFER.
Stibbs, A. (1984). ‘Review of Schonell Reading Tests.’ In: Levy, P. and Goldstein, H.
(Eds) Tests in Education: A Book of Critical Reviews. London: Academic Press.
Taylor, W.L. (1953). ‘Cloze procedure: a tool for measuring readability’, Journalism
Quarterly, 30, 415–33.
Turner, M. (1990a). ‘A closed book’, Times Educ. Suppl., 20 July.
Turner, M. (1990b). Sponsored Reading Failure. Warlingham: Warlingham Park
School, IPSET Education Unit.
Vincent, D. and Cresswell, M. (1976). Reading Tests in the Classroom. Slough: NFER.
Wood, R. (1986). ‘The agenda for educational measurement.’ In: Nuttall, D.L. (Ed)
Assessing Educational Achievement. London: Falmer Press.
Reading tests
Reading Test AD
Watts, A.F. (1955). Reading Test AD. Windsor: NFER.
Gorman, T.P. et al. (1982). Language Performance in Schools: Primary Survey Report
No.2 (APU Survey). London: HMSO.
Gorman, T.P. et al. (1983). Language Performance in Schools: Secondary Survey
Report No.1 (APU Survey). London: HMSO.
Gorman, T.P. et al. (1983). Language Performance in Schools: Secondary Survey
Report No.2 (APU Survey). London: HMSO.
Gorman, T.P. et al. (1988). Language Performance in Schools: Review of APU Lan-
guage Monitoring 1979–1983. London: HMSO.
Notes
1. Strangely, for such an important test, it is never referenced in full, and no publication details can be given.
2. During research for this paper, the author examined Cyril Burt’s 1921 edition of Mental and Scholastic
Tests. This includes many psychological and educational tests and techniques. Among these is a type of test called by Burt ‘Completion’, which required the entry of words into blanks in a piece of continuous narrative text. This is used as a measure of intelligence rather than reading. Unfortunately, without giving a reference, Burt attributes the style of test to Ebbinghaus, the German psychologist (1850–1909). It seems that
the basic technique of cloze tests was familiar at least forty years before its generally ascribed invention
and possibly long before that.
3. This survey describes the popular tests, that is, those most widely used. In fact many of the styles of test which
followed (Schonell, Neale and Edinburgh) were included in Burt’s Mental and Scholastic Tests, but were
not widely taken up. The progression referred to was not one of invention but of use.
9 Lessons of the GCSE English ‘100 per cent coursework’ option, 1986–1993
Paul Thompson
Many schools, particularly those with a poor academic track record, have opted for
100 per cent course work. The replacement of formal examinations by 100 per cent
course work is surely the heart of the corruption.
(Stoll, 1988, p.34)
In 1986, a new kind of public examination was launched for all 16-year-olds in England
and Wales. Neither the introduction of the General Certificate of Secondary Education
(GCSE), nor the abolition of its 100 per cent coursework options in 1993, escaped con-
troversy. Responding to allegations that GCSE 100 per cent coursework was subject to
abuse, exam boards in England and Wales were required in 1993 to introduce ‘a more
reliable system of assessment’. The 1988 Education Reform Act had introduced a high-
er degree of ‘public accountability’. Nationally published league tables were instated
for the purpose of school comparison. Those at the bottom would be publicly ‘named
and shamed’. In the particular field of GCSE assessment, policymakers in the early
nineties argued that the abolition of 100 per cent coursework would increase test relia-
bility through a higher quality of standardisation in terms of questions asked, time
allocated for answers and mark schemes.
Two decades later, the prospect for high stakes testing is far less positive. It is con-
ventional wisdom within the profession that ‘teaching to the test’ skews the curriculum.
Many educationalists consider that the tail of assessment is wagging the curriculum
dog. There is also a major crisis in the recruitment of markers: Edexcel, for example,
one of England’s biggest exam boards, admitted that it regularly used non-teaching staff
to mark papers where there was a shortage of practitioners (Curtis and Smithers, 2005).
On the website of English 21, which hosts a consultation by the Qualifications and
Curriculum Authority (QCA) on the possible shape of English in 2015, English Officer
Paul Wright comments:
The current high-stakes position of exams rests on the assumption that they deliver
the most reliable comparability of standards and are easily understood by the public.
Yet there remains a widespread perception that standards are falling. It may be time
to re-examine the idea that examinations are the only way to ensure credibility with
the public and the media.
(Wright, 2005)
Until fairly recently, it looked as though high-stakes testing was a fixture in English
education. Today the picture is changing. A QCA and DfES-funded project, ‘Monitor-
ing Pupils’ Progress in English at KS3’, has been working since 2003 to ‘improve the
quality, regularity and reliability of teacher assessment throughout KS3’ through the
development of diagnostic approaches which are based on the principle of portfolio
assessment (QCA, 2004). Enthusiasm for formative assessment has apparently never
been greater. In the first year of its introduction as a strand in the KS3 National Strategy
(2004–5), ‘Assessment for Learning’ was chosen by 80 per cent of schools as their key
whole school priority, suggesting considerable hunger for a more responsive assessment
approach, no doubt partly in order to counteract the negative, constraining backwash of
summative National Curriculum tests.
Reflecting the urgency of this debate, Paul Wright’s introductory statement to the
assessment section of the English 21 website calls for a reassessment of the teacher’s posi-
tion in the assessment process and seeks to re-evaluate the sharp prevailing distinction
between formative and summative assessment practices:
Are there ways of making formative and summative assessment more complementary,
rather than seeing them in opposition? We need to explore ways to harness the
detailed knowledge teachers have of their pupils as learners in ways that will help
them progress as well as being accountable to national standards.
Many older English teachers would argue that it was precisely this productive balance
which we had twenty years ago in GCSE English 100 per cent coursework assessment.
In fact, ‘100 per cent coursework’ was something of a misnomer: all such schemes actu-
ally included at least one assignment conducted under controlled conditions and all
syllabuses were subject to internal and external moderation procedures. Of course, the
system had weaknesses as well as strengths. The purpose of this chapter is to review
them in order to draw some lessons for the future of assessment at GCSE level in Eng-
lish and English Literature. What was the nature of this assessment approach and how
did it impact on candidates’ learning experiences? What lessons can be learned from a
system of assessment and a curriculum which were primarily in the hands of teachers?
GCSE English adopted a unitary approach. It replaced the General Certificate of Educa-
tion (GCE) ‘O’ level English and Certificate of Secondary Education (CSE) English
which had examined the top 60 per cent of the school population. The bottom 40 per
cent had not been entered for examinations at all. GCSE, by contrast, aimed to accom-
modate all candidates. The earliest GCSE English syllabuses offered both 100 per cent
coursework and 50–50 options (i.e. 50 per cent examination and 50 per cent course-
work). Neither emerged ‘out of the blue’ in 1986. Both represented the culmination of
developments which had been taking place in English throughout the 1970s and early
1980s, particularly in lower school English. The introduction of GCSE English allowed
teachers to extend some of the good practice current in the lower school syllabus into
examination work:
Methods of cooperative study, the focus on oral communication, drafting and editing
of written assignments and the emphasis on the individual selection of reading mat-
ter, for example, are common experiences for many pupils in the early years of the
secondary school. GCSE has encouraged teachers to experiment further with such
approaches with all age groups, with the result that the secondary English course
should become a coherent whole; GCSE will not be seen as a bolt-on necessity but a
natural development of what has preceded it.
(Wainhouse, 1989, p.80)
‘Whole language’ approaches to literacy education and a spirit of integration had been
paramount in the eighties (Newman, 1985). GCSE English and English Literature came
to be regarded as a single unified course which could nevertheless lead to a grade in two
separate subjects. The GCSE English Criteria explained that English should be regard-
ed as ‘a single unified course’ in which spoken and written work blended seamlessly.
Assessment objectives in English were considered to be interdependent and could be
tested both through speech and writing. (Although students could study the same texts
for both subjects, the same pieces of work could not be submitted for both ‘exams’.)
The aim was to establish an integrated course which did not artificially separate the four
language modes – speaking, listening, reading and writing. It was hoped that this would
enable a wider range of integrated, realistic and purposeful classroom activities. There
was a renewed emphasis on English as meaningful communication, carried out for
genuine purposes.
Candidates were expected to produce a folder of work over the five terms of the
course which involved both formal and informal speaking and listening, study of a wide
variety of reading (including complete non-literary as well as literary texts) and a wide
range of writing (including stories, reports and letters). ‘The writing must include writ-
ten response to reading and this must range from the closed response of … a report of
some kind to the more open-ended response of a piece of imaginative writing’ (Chilver
1987, p.9). All syllabuses included an element of wider reading. Although coverage of
Shakespeare was not compulsory, study of a minimum of five or six texts in three gen-
res was expected. Some response to unseen texts was required under controlled
conditions although structured tests could not account for more than 20 per cent of the
syllabus.
The approach was positive. Teachers were encouraged to differentiate by task where
necessary in order to give candidates every chance to display their ability. A range of
assignment choices was often offered to students who also had a certain amount of free-
dom in deciding which evidence would be submitted for their final ‘folder’. The notion
of ‘differentiation by outcome’ was used to explain how candidates of differing levels of
ability could respond to a common question in a variety of different ways, ranging from
the basic to the advanced. This seemed to be particularly suitable in English where,
given careful thought, it seemed relatively easy to design common tasks which gave
candidates across the whole range of ability the opportunity to do their best.
The end of set-text prescription and the broadening of the canon was a particular source
of anxiety for traditionalists who were concerned that GCSE English criteria required lit-
erature during the course to reflect ‘the linguistic and cultural diversity of society’. The
proposition in the GCSE General Criteria that the teacher must ‘make sure … that the
range of texts offered relates equally to the interests and experiences of girls and boys and
meets the requirements for ethnic and political balance’ (SEC, 1986, p.14) was seen by
some as a threat: ‘Education is the transmission of culture and the public has the right to
expect that British culture is being transmitted in British schools’ (Stoll, 1988, p.35).
There was also anxiety that the ‘best authors’ – especially Shakespeare – would
inevitably be ignored within GCSE’s mixed ability teaching approach. Worthen (1987,
p.32) feared that traditional literary study would soon be phased out by ‘the egalitarian-
s’ and replaced with a ‘mish-mash of other “ways of responding” to literature’ (p.32).
Indeed, prefiguring dual accreditation, he also predicted that English Literature could
soon be abolished as an area of study in its own right and diluted into a programme of
‘general English’. What arguably materialised under the Dual Award, however, was a
greater breadth of study within which the literary heritage remained largely intact.
According to an Assistant Chief Examiner, Shakespeare, Dickens, Hardy and Owen
were still studied in most centres. He stated that candidates
no longer have to pretend that they are incipient literary critics, at least in the
stylised literary form of the discursive essay and equipped with the appropriate tools
for the dissection of literary bodies. They can now be frankly and expressively what
literature requires them to be – readers.
(Thomas, 1989, pp.32–3)
The staple diet of reading development in the old GCE ‘O’ level and mode 1 CSE
exams had been the comprehension test in which a short extract was followed by a
sequence of questions requiring short written answers. Critics had argued that such
exercises often required little more than literal paraphrasing of words and phrases.
Many ‘progressive’ English teachers at this time treated ‘the extract’ with great caution
on the grounds that its de-contextualisation distorted authorial intention and weakened
the quality of the reading experience (e.g. Hamlin and Jackson, 1984). There was a par-
ticular emphasis on the need to read whole texts which were thought to serve the
purposes of active reading and critical engagement more effectively than short pas-
sages. More purposeful approaches to reading comprehension were now preferred.
Directed Activities Related to Texts (DARTs), for example, enabled the reader to active-
ly construct meaning, undermining the authoritarian notion which was implicit in
traditional comprehension testing – that it is the writer of the mark scheme who ‘owns’
the correct textual interpretation; in DARTs, the reader was empowered (Lunzer and
Gardner, 1984).
What 100 per cent coursework approaches aimed to do was move beyond the artifi-
ciality and mundanity of traditional approaches to reading and writing by asking
teachers themselves to design assessment tasks which embodied genuine purposes and
audiences. It was felt that GCSE assessment should reflect the wide range of purposes
for which language was used outside school. Teachers tried to establish a greater degree
of task authenticity, often through use of drama and role play in the setting up of assign-
ments. Part of the purpose was to motivate students but the deeper aim was to capture,
as closely as possible, the elusive quality of reading response. Authenticity was consid-
ered to be as much about the quality of response as the quality of the task. By allowing
teachers to use popular modern literature to help students to explore issues and ideas that
were important in their own lives, the new GCSE syllabuses aimed to release students’
own voices:
The possibilities opened up by the GCSE syllabuses mean that the authentic voice in
pupils’ writing ought to be much more prevalent than was ever possible under the old
system of timed essays in response to given and often sterile topics. It remains to be
seen how capable we are as teachers of releasing this voice.
(Walker, 1987, pp.45–54)
Central to this new approach was the idea that students should feel passionately about
their work. Particularly in the reading of literature, the National Criteria for English
(section B, para. 1.1.1) recognised that students’ own personal responses were valuable
and important. Whereas GCE ‘O’ level had been concerned overwhelmingly with liter-
ary knowledge and the development of comprehension skills, GCSE English objectives
repeatedly alluded to the need for candidates to be able to communicate sensitive and
informed personal response.
In GCSE English, candidates were expected ‘to understand and respond imaginative-
ly to what they read, hear and experience in a variety of media; enjoy and appreciate the
reading of literature’. In GCSE English Literature, candidates were required ‘to
communicate a sensitive and informed personal response to what is read’ (DES, 1985).
As well as encouraging more active reading, GCSE English demanded more imagi-
native frameworks for the study of literature. Whereas CSE English literature
coursework had basically required the repetitive production of assignments of essential-
ly the same discursive type, GCSE promoted a much wider range of writing and
teachers were encouraged to devote quality time to the discussion of drafts as well as to
finished pieces. Literature was no longer to be regarded as an object for memorising and
classification but rather as a medium for reflection and the expression of personal expe-
rience. Whereas GCE ‘O’ level had concentrated on the text as an object of study, 100
per cent coursework approaches in GCSE English foregrounded the reader and the
process of reading itself.
GCSE English aimed not to deliver the ‘canon’ but to teach students how to study lit-
erature. This ‘liberation’ from prescribed set texts meant that candidates could now be
offered opportunities both for detailed study and wide reading. The coursework folder
was expected to include both detailed study of a small number of texts and wider read-
ing of a broader range of texts. The GCSE syllabus recognised the diversity inherent in
language use, encouraging a much wider range of types of writing and reading
response. In so doing, it particularly aimed to support less able students who might have
struggled in the past with the narrower range of academic forms. The study of women
novelists, dramatists and poets was particularly recommended.
It was hoped that teachers’ selection of texts would include a balance of genres so
that students could acquire as varied a reading experience as possible. Both literary and
non-literary material was required for study. Some works of translation could be includ-
ed. There was also a greater stress on the range of purposes for reading. Candidates
were encouraged to read for pleasure, look for information, explore underlying assump-
tions, compare texts on the same topic and become familiar with new kinds of writing
such as biography, travel writing and scientific accounts. There was an enormous
widening of the subject matter of English lessons. Films were ‘read’ in class for the first
time. Film versions might be compared with the original or studied in their own right.
Media studies and theoretical ideas about how films could be analysed began to pervade
the English classroom.
For many English teachers, the new approach seemed to offer greater rigour and
opportunities for a higher level of literary appreciation than the earlier examinations. By
broadening the range and depth of study, the GCSE course created the potential for
increased challenge and higher standards. Additionally, the freedom given to English
teachers to choose their own assignments and texts was highly empowering. They were
offered greater responsibility than ever before for the structuring, assessment and mod-
eration of candidates’ work. Many consequently felt that 100 per cent coursework was
both fairer and more satisfying as a method of assessment, both for students and teach-
ers (Walker, 1987, p.45). By 1993, when the 100 per cent coursework option was
scrapped, only 20 per cent of candidates were following the alternative, more tradition-
al 50–50 (i.e. 50 per cent examination/ 50 per cent coursework) syllabuses (Leeming,
1994).
Activity theory
Havnes (2004) uses activity theory to theorise the ‘backwash’ of assessment on learn-
ing, aiming to understand how assessment affects education at the level of ‘system’
(Engestrom, 1987, 2000). He argues that any form of testing and examination must fun-
damentally influence how teachers teach and their students learn. Any particular form of
formative or summative assessment will also systemically influence other aspects of the
educational process such as the production of textbooks, learning materials and the
design of the learning environment.
Black and Wiliam (2005) also suggest that a helpful way of understanding the
impact of formative assessment approaches in subject classrooms is through the medi-
um of activity theory (Engestrom, 1987, 1993, 64–103). Using Engestrom’s theoretical
model, they show how the enhancement of the formative aspects of assessment inter-
action can radically change classroom relationships and the quality and culture of
learning. Speaking about the King’s, Medway and Oxfordshire Formative Assessment
Project, they comment:
Using this framework, the course of the project can be seen as beginning with tools
(in particular findings related to the nature of feedback and the importance of ques-
tions) which in turn prompted changes in the relationship between the subjects (i.e.
in the relationship between the teacher and the students) which in turn prompted
changes in the subjects themselves (i.e. changes in the teachers’ and students’ roles).
These changes then triggered changes in other tools such as the nature of the subject
and the view of learning. In particular, the changes prompted in the teachers’ class-
room practices involved moving from simple associationist views of learning to
embracing constructivism, taking responsibility for learning linked to self-regulation
of learning, metacognition and social learning.
(Black and Wiliam, 1998, pp.12–3)
They explain that the subjects of the activity system are teacher and students: the tools
(or cultural resources) which appeared to be significant in the development of formative
assessment in the classrooms which they researched were the views held by students
and teachers about the nature of the subject and the nature of learning; the object of each
classroom activity system was ‘better quality learning’ and improved test scores; its out-
come included changes in expectations and also changes ‘towards assessments that
could be formative for the teacher’.
Engestrom's theoretical model is shown in Figure 9.1.
Figure 9.1 Activity triangle (adapted from Engestrom, 1987)
[Figure: an activity triangle linking artifacts (tools and practices); values, rules and conventions; community; and division of labour.]
I would suggest that a similar framework of activity analysis can be used to retro-
spectively understand changes which took place in GCSE English classrooms in the late
eighties and early nineties when the 100 per cent coursework folder became a new tool
or instrument of classroom activity. Because the syllabus-driven rules of year 10 and 11
classroom activity systems changed so radically in 1986 with the introduction of GCSE,
so did their division of labour; teachers and students worked much more collaborative-
ly. Students began to share greater ownership of their work. Their roles and social
relationships became transformed. At the time, these changes were seen quite negatively
from the standpoint of the traditionalist wing of educational opinion:
Traditionally, individual pupils faced a teacher and listened. Now groups of pupils
face each other around tables and talk. The devaluing of the teacher as an authority
(both on his subject and as a controller of classroom discipline) is implicit in the new
classroom arrangement and in the whole GCSE style. Implicit also is the idea that
the GCSE is an examination which will make mixed-ability teaching a widespread
practice.
(Worthen, 1987, p.41)
Traditionalists had maintained since the abolition of grammar schools and the introduc-
tion of comprehensives that the ‘pseudo-egalitarianism’ of mixed ability teaching
would lead to an erosion of the authority of the teacher’s voice and an inevitable decline
in standards. Mixed ability organisation often did involve students being seated around
tables for the purposes of small group discussion. Within English teaching, since the early seventies, there had also been the growth of the 'oracy movement', through which talk had increasingly been regarded as an important medium of good learning;
the classroom activity systems instigated by the arrival of GCSE in 1986 would, in
many cases, have foregrounded small group discussion as a feature of their pedagogy.
In the view of most English specialists, however, this did not by any means devalue the
role and authority of the teacher: quite the opposite. The teacher in the typical mixed
ability, GCSE English classroom now became a resource – much more a source of sup-
port and scaffolding than an oracle of subject knowledge. Students, with the help of their
teachers, were now expected to actively construct their own understanding of literature
and language. The basic learning paradigm had changed.
Especially through classroom talk activity, the roles of students and teachers them-
selves could be transformed. In his account of the National Oracy Project (NOP) whose
timespan coincided almost identically with that of 100 per cent coursework, project
director, John Johnson, explained that 'children did good and important work in small-
group activities, particularly if they had or established fairly clear roles and purposes for
their individual contributions and for the whole-group activities’ (Johnson, 1994, p.35).
Teachers also gradually realised over the course of the NOP that ‘putting children and
students into new roles also gave them the opportunity to take on new roles’ (Johnson,
1994, p.37). Of course, the NOP covered the whole age continuum but it was equally
true of the 100 per cent coursework GCSE classroom that students were able to adopt a
much wider range of classroom roles through the medium of group talk and that the
teacher's role (e.g. as organiser, provider, collaborator, expert, listener) became much more diversified.
As for the idea that the introduction of GCSE subverted discipline, my own experi-
ence, as Head of English in what were regarded as two of the most challenging schools
in Nottingham, was that the new paradigm quite clearly enhanced the quality of learn-
ing and the quality of classroom relationships between 1986 and 1993; it was actually
during the mid nineties that classroom discipline began to seriously deteriorate as
league table anxiety and Ofsted oppression created a paradigm of mundane, objectives-
led English lessons, subverting teacher autonomy, student motivation and general
interest. By comparison, in the 100 per cent coursework years, the feeling of many col-
leagues was that classroom relationships had become much more intimate and
cooperative. Through imaginative assignments, students became inspired and engaged
and it was possible, for the first time, to work collaboratively and build trust.
One chief examiner argued that the new approach made it ‘possible to release Eng-
lish Literature from the clutches of an elitist concept of Literary Criticism and to make
the study of books, plays and poetry into an open engagement for all rather than a ster-
ile pursuit for the chosen few’ (Sweetman, 1987, p.62).
As in the classrooms studied by Black and Wiliam (2005), new theories of learning
developed from the changes in teachers’ classroom practices. Teachers and students
were able to move beyond transmission approaches towards a more constructive
methodology of literary study. Traditionalists had argued that the new GCSE practice
of allowing students access to their texts when writing under controlled conditions
minimised the importance of memory and textual knowledge:
The progressives, who now have in the GCSE the encapsulation of their doctrines,
would claim to lay stress on understanding rather than knowledge. But the fact is
that knowledge is the necessary prerequisite to understanding.
(Worthen, 1987, p.35)
Worthen believed that to know a text is an indispensable basis for understanding it
because memory trains the mind and allows students to internalise by committing to
heart an area of valued knowledge. Pupils consequently needed to be taught the funda-
mentals of literary study before they could branch out into speculation and independent
research. He maintained that, although coursework folders may have promoted ‘breadth of
study’, they did not make students think with any rigour, often demonstrating superficiality
and a lack of focus.
The experience of teachers in GCSE classrooms was quite different: one Assistant Chief
Examiner (writing towards the end of the 100 per cent coursework period) stated that he
had found more evidence of thinking about reading and enjoyment in reading during six
years of examining 100 per cent coursework than had ever been the case during GCE ‘O’
level study – and far greater evidence of ‘sustained application of intelligence to fiction’:
The question to me seems simple: Do we want people who think and feel passionate-
ly about Literature and its connection to life as it is lived around them? If we do, then
GCSE Coursework schemes offer enormous benefit. If not, then a restoration of the
norm-referenced, externally-set and assessed will soon put Literature back in its
place as a marginal subject useful in testing comprehension and memory.
(Thomas, 1989, p.41)
The restoration of the examination as the primary medium of GCSE assessment in 1993
did not quite marginalise literature but it did lead to the development of a newer, differ-
ent kind of activity system which continues to prevail: although the content and
emphasis of the examination has rarely remained static since 1993, the basic combina-
tion of 60 per cent examination and 40 per cent coursework (including 20 per cent
speaking and listening) has been the standard format across syllabuses for over ten
years. The classroom dynamic which has emerged from these more recent arrangements
seems to many older teachers to be much more transmissive and hierarchical than the
collaborative, mixed ability classrooms of the late eighties and early nineties.
Assessment issues
A great strength of the assessment system in GCSE English 100 per cent coursework lay
in the fact that it depended upon and directly involved classroom teachers who worked
in local consortia (which often included quite different types of school) to develop a
shared expertise in the marking of candidates’ writing. Although consensus marking is
time-consuming, it has great in-service education and training (INSET) potential and
the local consortia acted as INSET networks through which assignments could be
shared and good practice collaboratively developed. Through group marking and mod-
eration, a guild knowledge was rapidly accumulated of the key features of different
levels of performance and there was an emerging awareness of what was needed to
move on from one level to the next. Although assessment was made holistically and
there were few detailed mark schemes, teachers believed that they were able to offer
reliable summative judgements as well as ongoing formative advice which involved
students themselves in assessment and target setting.
Teachers worked hard in their consortia to create systems of continuous assessment
which monitored the development of coursework over the five terms of the course.
There was extensive debate about the possible shape of assessment proforma. Review
sheets were designed for use at the end of each unit. Student self-assessment sheets
were also developed. It was during this period that self-assessment approaches were
introduced for the first time into many English classrooms.
However, it was the assessment of GCSE 100 per cent coursework in particular
which many politicians and policymakers targeted for criticism. The central issue was
‘consistency’. At the root of perceived inconsistency lay a general weakness in mark
schemes and some fairly underdeveloped grade descriptions. For example, the Mid-
lands Examining Group (MEG) GCSE English syllabus grade criteria for reading
performance at Grade D required that ‘students will understand and convey information
at a straightforward and occasionally at a more complex level’; one grade higher at C
required that ‘students will understand and convey information at a straightforward and
at a more complex level’; at grade B ‘… at a straightforward and a quite complex lev-
el’; at grade A ‘… at both a straightforward and a complex level’ (MEG, 1986). Such
bland generalities did not encourage either public or professional confidence.
Although examination results did improve year on year, critics argued that this was
only as a result of an absence of proper checks and balances within the moderation sys-
tem. ‘Controlled conditions’ did not seem to be controlled in the way that examinations
had been; candidates could use prepared material and would often know the question in
advance. Work started at school under controlled conditions could be completed at
home without supervision or time limit. The problem of plagiarism and the danger of excessive parental or teacher support for particular assignments were also identified by some commentators.
There was a deep-seated political belief that 100 per cent coursework assessment
was beyond most teachers’ professional capacity because assessors needed training,
time and a distance from their students to enable objective comparison of different lev-
els of performance. It was felt to be inherently problematic to place such great
importance on teachers’ own assessments of their students. Proper and fair comparison
of a final grade based entirely on coursework with a grade based on the 50 per cent
coursework and 50 per cent exam option was considered to be inherently difficult. The
possibility of a legitimate comparison between work handed in at the beginning of year
10 with work completed at the end of the course in year 11 was also questioned. How
could a teacher assessor easily and fairly evaluate the standards appropriate at each
extreme? Another problem lay in the fact that students had opportunities to edit their
work in successive drafts. At what point in the process did assessment occur? To what
extent should formative evaluation be taken into consideration? What is the ideal bal-
ance between formative and summative assessment? These questions were certainly
broached both by politicians and by English educators at this time.
Reading assessment also presented particular opportunities for debate and improve-
ment. Both comprehension testing and responsiveness to literature were issues of
concern. MacLure (1986) argued that assessment of the language modes should be
holistic, opposing moves in one Examining Group to artificially distinguish between
expression, understanding and response to reading. Since the teaching of reading, writ-
ing, speaking and listening are coordinated, she maintained, assessment should equally
be conducted in the spirit of integration.
Of course, the reading process is particularly difficult to isolate. Although teachers
have several possible ways of discovering information about students’ reading perform-
ance (e.g. reading aloud, reading logs, library issue statistics), written and oral response
to text are usually the primary sources. Several arguments have recently been advanced
in favour of placing greater value on speaking and listening as a medium for assess-
ment, especially for the assessment of children’s reading responses (e.g. Coultas, 2005).
In fact, oral response to reading was encouraged in several 100 per cent coursework syl-
labuses. This aimed to address the problem of the ‘double transformation’, i.e. when the
first mental reading response undergoes further transformation into writing. Oral
responses were felt to be more authentic and closer to the original act of cognition.
Interviews and role plays based on candidates’ reading were consequently recorded in
many classrooms. This especially supported students who read well but wrote badly, and who were particularly disadvantaged when writing was the exclusive medium of reading response.
The traditional form of written response to text had been either the comprehension
test or the discursive essay. By encouraging a much wider variety of extended written
forms, GCSE English broadened the range of reading assessment information potential-
ly available to teachers. Students might be asked to write a newspaper account of an
incident from a novel. They might be asked to reformulate a textual incident for a
younger audience. They could be asked to create a script of a radio interview with the
novel’s characters. An incident might be reconstructed dramatically or in the form of a
television documentary. Such assignments required completely new forms of interpre-
tive and linguistic ability. This broader range of possible reading response made texts
much more accessible to the full ability range than the traditional discursive assignment
which had typically asked for an ‘account of an incident’ or the extent to which a candi-
date agreed or disagreed with a particular statement. However, this richness of potential
reading assessment information was not always realised by teachers because the task of
inferring reading ability from written and oral response is complex. The act of writing
or speaking about text inevitably modifies the nature of the response itself, making the
reading process correspondingly less accessible:
It could be argued that there has been a tendency for discussion and experimentation
to be focused on the more obviously assessable aspects of the course, that is on the
products of written and oral work rather than on the receptive processes. A great
deal more practical exploration is needed to find ways of encouraging an active and
involved pupil response to all aspects of reading.
(Wainhouse, 1989, p.68)
Particularly after 1990, examiners worked hard to encourage teachers to develop their
understanding of ‘reading response through writing’ by requiring, for all assignments,
the formulation of mark schemes which included a reading assessment framework.
There was, within the 100 per cent coursework movement, an advanced form of self
criticism which was beginning to address the inevitable problems and anomalies posed
by this new approach.
Another area of contention lay in the increasingly popular genre of ‘empathy assign-
ment’ through which candidates were asked to assume the viewpoint of a fictional
character in order to demonstrate textual comprehension at a variety of levels. Several
commentators were dubious about the notion of empathy as a medium of response (Worthen, 1987; Stoll, 1988). It was suggested that empathy was being used as a short cut
for the expression of basic textual knowledge which was held to be pointless, offering few
advantages over direct statement. In a very constructive article on this subject in Eng-
lish in Education, Thomas (1989) argued that, in writing an empathy assignment, the
candidate enters into a purposeful dialogue with the text, not merely as a passive recip-
ient but actively engaged in the construction of personal meaning. ‘The task does not
require a straightforward description of character as the reader understands it, but a par-
tial and possibly distorted one as the character sees it’ (Thomas, 1989, p.36). He argued
that a successful empathy assignment should replicate ‘perceptions consistent with the
perspective of the persona chosen for empathetic identification’ (p.36) and suggested
that a pair of descriptions from different character viewpoints, together with a brief
postscript explaining the key features of the text under focus, could help to deepen the
quality of assignments of this kind.
Validity or reliability?
The strongest argument that can be advanced in favour of any form of assessment is that
it raises standards at the same time as reporting efficiently on performance. What many
argue that we have today is a system of assessment that reports on performance without
raising standards (e.g. Black and Wiliam, 1998). Was the GCSE English 100 per cent
coursework option any better? There is no doubt that it had strengths. For example, the
early GCSE English syllabuses required candidates to read far more widely than had
been the case before 1986 and there are strong arguments for believing that candidates
at that time read more extensively than they have done since. Warner (2003) compares
the richness of GCSE candidates’ reading experiences in the late eighties and early
nineties with ‘the suffocating effect of genre theory and literacy strategy’ on students’
reading today. He maintains that ‘centralised attempts to widen the reading curriculum
have narrowed it’ (p.13). Despite some patchiness and inconsistency in curriculum cov-
erage, there was a breadth and range built into 100 per cent coursework syllabuses
certainly not evident today.
The system of teacher assessment itself has many advantages. Observations made by
a range of people over a range of authentic situations must be fairly reliable. There is an
inherent advantage for an assessment system when teachers themselves are involved in
its creation and development. When 100 per cent coursework was abolished in 1993,
improvements in external moderation procedures and teacher assessment techniques
were already under way. Many older English teachers would still maintain that regional
consortia are quite capable of establishing a reliable and valid form of national assess-
ment through improved training and the accrediting of appropriate personnel. External
moderation can validate teacher assessment while particular aspects of the course are
tested under strictly maintained controlled conditions. With the support of external
validation, it ought to be possible to mark formal tests fairly within schools.
Nevertheless, the difficulties of generalising from assessment data which have been
collected at a local level should not be underestimated either. It may well be that there is
Assessment here consists of checking whether the information has been received and
absorbed… By contrast, constructivist models see learning as requiring personal
knowledge construction and meaning making and as involving complex and diverse
processes. Such models therefore require assessment to be diverse … intense, even
interactive.
(Gipps, 2005)
It seems that a consensus is developing within the profession around a view that a
greater degree of flexibility is necessary and that there is room in the future for both par-
adigms. In order to match the type of assessment as closely as possible to its specific
learning objective, we need a wider range of assessment forms. External assessment should be used sparingly and thoughtfully – only where it is appropriate and valid, which is not likely to be often. Neither should assessment any longer be allowed to distort the cur-
riculum. It should be part of the process of planning, teaching and learning and ought to
be able to give students the best chance of performing well. Assessment in English
needs to be directly relevant to pupils themselves. It needs to reflect pupils’ authentic
experiences of reading, writing, speaking and listening.
Conclusion
years later, there remained a considerable outcry among many English teachers about the
abolition of 100 per cent coursework and the sacrifice of the principle of task authenticity.
Due to the proliferation of new educational initiatives and their associated workload over
the past decade, it could prove quite difficult in the future to reintroduce a national course-
work-based system of teacher assessment in English. Nevertheless, the pendulum, which
had seemed to be swinging away from the principle of validity towards a requirement for
greater manageability and test reliability, is now swinging back again. It is clear that sev-
eral aspects of the current examination arrangements have had a pernicious backwash
effect upon the English curriculum and that ‘something needs to change’. This chapter has
sought to inform the change process by reviewing significant features of the rise and fall
of 100 per cent coursework assessment between 1986 and 1993.
References
Black, P. and Wiliam, D. (1998). Inside the Black Box: Raising Standards through
Classroom Assessment. London: School of Education, King’s College.
Black, P. and Wiliam, D. (2005). ‘Developing a theory of formative assessment’. In:
Gardner, J. (Ed) Assessment for Learning: Practice, Theory and Policy. London:
Sage Books.
Chilver, P. (1987). GCSE Coursework: English and English Literature. A Teacher’s
Guide to Organisation and Assessment. Basingstoke: Macmillan Education.
Coultas, V. (2005). ‘Thinking out of the SATs box – Assessment through talk.’ Avail-
able: http://www.late.org.uk/English21OralAssessment.htm
Curtis, P. and Smithers, R. (2005). ‘GCSE papers marked by admin staff’, The
Guardian, 22 August, 1.
Department of Education and Science (DES) (1985). GCSE: The National Criteria:
English. London: HMSO.
Department of Education and Science (DES) and Welsh Office (1989). English for Ages
5 to 16. London: DES and Welsh Office.
Engestrom, Y. (1987). Learning by Expanding. An Activity-Theoretical Approach to
Developmental Research. Helsinki, Finland: Orienta-Konsultit Oy.
Engestrom, Y. (1993). ‘Developmental studies of work as a testbench of activity theory:
the case of primary care in medical education.’ In: Chaiklin, S. and Lave, J. (Eds)
Understanding Practice: Perspectives on Activity and Context. Cambridge: Cam-
bridge University Press.
Engestrom, Y. (2000). ‘Activity theory as a framework for analyzing and redesigning
work’, Ergonomics, 43.7, 960–74.
Gipps, C. (2005). ‘Assessing English 21: some frameworks’. Available:
http://www.qca.org.uk/13008.html
Hamlin, M. and Jackson, D. (1984). Making Sense of Comprehension. Basingstoke:
Macmillan Education.
Postmodern principles for responsive reading assessment
Currently, many commercial test development companies, states and governments are
moving into computer-based assessment of reading and this is hardly surprising: computers, and especially computers connected to the internet, offer the promise of instant data on reading achievement, based on centrally standardised and uniformly administered tests. Computers also offer commercial developers the promise of instant sales of
test instruments, with minimal printing and distribution costs. To make online assess-
ment appear more sensitive to the individual, an increasing number of states in the USA
are declaring that their tests are ‘adaptive’: the computer tailors the items to the achieve-
ment level of the child taking the test, thereby, it is argued, increasing validity and
reliability, while reducing stress, anxiety and a possible sense of failure.
‘Idaho to adopt “adaptive” online state testing’ ran the headline in Education Week
(Olson, 2002), over a story that saw the chair of the state board of education saying
‘We wanted an assessment system that would provide data first and foremost to
improve instruction, which in turn, would improve accountability.' But if assessment is to improve instruction, then the nature of the data is critical. In the case of most online reading assessment, the most common data source is multiple-choice test results, and it is by no means clear just what 'data to improve instruction' is available from these scores. The question of the ways in which the tests are 'adaptive'
is also somewhat problematic: broadly speaking, the computer takes multiple-choice
items from an item bank and if the reader gets the question wrong, offers an easier one
and if the reader gets the item correct, selects a harder one. The bonus for the testee and
test developer is shorter tests and fewer items (though drawn from a large item bank);
the bonus for the state is online access to statewide data on reading or maths achieve-
ment that is updated hourly. However, if we consider for a moment what is happening here, the gains for the individual student are negligible – if the student receives no useful feedback, an online test is no different from the old pencil-and-paper tests that
have been taken for decades. In most cases such tests provide no developmental pro-
file, no diagnosis of reading errors and no recommendations for the individual’s
future pedagogy.
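The item-selection logic described above ('easier after a wrong answer, harder after a right one') can be sketched in a few lines of Python. Everything here is hypothetical: the item bank, the 1–10 difficulty scale and the answer_item callback are illustrative placeholders, and a real adaptive engine would normally use an item-response-theory ability estimate rather than this simple up-and-down rule.

def run_adaptive_test(item_bank, answer_item, num_items=20):
    # item_bank: list of dicts with 'id' and 'difficulty' (1-10).
    # answer_item(item) presents the item and returns True if answered correctly.
    # Hypothetical sketch of the 'harder after right, easier after wrong' rule.
    difficulty = 5                                  # start in the middle of the scale
    administered, score = set(), 0

    for _ in range(num_items):
        candidates = [it for it in item_bank if it["id"] not in administered]
        if not candidates:
            break
        # choose the unused item whose difficulty is closest to the current target
        item = min(candidates, key=lambda it: abs(it["difficulty"] - difficulty))
        administered.add(item["id"])

        if answer_item(item):
            score += 1
            difficulty = min(difficulty + 1, 10)    # offer a harder item next time
        else:
            difficulty = max(difficulty - 1, 1)     # offer an easier item next time

    return score, sorted(administered)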
The question that drove the present study was whether it was possible to design a
reading test that was based on a more authentic and ecologically valid task than a mul-
tiple-choice comprehension test, one that also made better use of the massive increases
in computing power that are now available – advances so great that what was twenty-
five years ago a whole university’s computing power is now compressed into a single
desktop computer. The challenge we set ourselves was to consider, as a desk study ini-
tially, the extent to which it might be possible to evaluate a reader’s skilled behaviour in
carrying out a complex research task using the internet.
To begin with, let us consider for a moment how good readers behave when carrying
out such a complex research task. Good readers do more than just read: we suggest in
the list below that they carry out eight related sets of behaviours. Clearly there is a mas-
sive set of research studies that one might call upon in drawing up such a list and we
have necessarily been selective, but while the individual studies that we have cited
might not necessarily be the ones that other researchers (or indeed the authors them-
selves) would choose, we suggest that our eight areas themselves are less contentious.
In carrying out a complex reading task, good readers:
• set themselves purposeful reading and writing goals (O’Hara, 1996)
• decide where they need to look for multiple reading resources (McGinley, 1992)
• navigate effectively towards those resources (Wright and Lickorish, 1994)
• adjudicate thoughtfully between possible sources of information: rejecting, selecting,
prioritising (Pressley et al., 1992).
• decide which parts of the chosen sources will be useful: rejecting, selecting, prioritising
(Pearson and Camperell, 1994)
• decide how to use the sources: to edit, order, transform, critique (Duffy and Roehler,
1989; Stallard, 1974; Kintsch and van Dijk, 1978)
• produce a text that takes account of its audience (Hayes and Flower, 1980)
• evaluate the adequacy of their performance, revising and looping back to earlier
stages of the process as appropriate (Rudner & Boston, 1994).
The first point to make is that a reading task that offered an opportunity to demonstrate
these behaviours would also provide a close mapping onto our six practical imperatives
for postmodern assessment, in that evidence of a student’s performance on these tasks
would be valuable for both student and teacher (a ‘local systems solution’ – Imperative
1); the task could readily accommodate self-assessment (acknowledging ‘the impor-
tance of the subjective’ – Imperative 2); it would be based on a range of reading skills
and behaviours (accepting ‘a range of methodologies’ – Imperative 3); it would make
use of a massive range of potentially conflicting data sources in a highly authentic envi-
ronment (recognising a ‘polysemic concept of meaning’ – Imperative 4); it would invite
a dynamic, critical reading response (privileging ‘the role of the reader’ – Imperative 5)
and would do so in a context that clearly puts an emphasis on the authority and autono-
my of the reader (thereby diminishing ‘the authority of the author and of the text’ –
Imperative 6). Such an approach would therefore offer in principle a comprehensive
basis for a postmodern and responsive assessment of reading.
The second issue is to consider whether it is possible for a computer to assess such authentic reading behaviour automatically. Our answer is – yes, in principle, but it would be incredibly challenging. Not only would it be dauntingly difficult to write an intelligent adaptive program that could capture and evaluate the behaviours listed above; in reality, as Spiro et al. (1994) have reminded us, actual online behaviours are even more complex than the list indicates. For a good reader, goal-setting is provisional and the task being executed is therefore provisional; resource selection is provisional and evaluation at the local level is carried out incredibly rapidly, on the basis of partial information (fluent readers can, under certain conditions, evaluate and reject potential web sites at a rate faster than one per second); finally, a good reader assembles information from diverse sources and integrates it with what is already known, mapping it into a new, context-sensitive, situation-specific adaptive schema rather than calling up a precompiled schema.
But if we are interested in pushing the use of the computer forward into this area, then we would want to suggest that reading specialists and test developers need to work with cognitive scientists and artificial intelligence specialists to begin to take reading assessment into this new and exciting domain. The case study reported in
the remainder of this chapter represents the first fruits of such a collaboration, in a
project that we call Intelligent Online Reading Assessment (IORA). We are possibly a
decade away from having anything approaching online reading assessment of the sort
that is envisioned here, but we want to suggest that if we put online multiple-choice
tests at one end of the continuum and fully fledged IORA at the other, then we can at
least use the two as reference points and measure the progress in intelligent online
assessment against a challenging and more worthwhile target than that offered by
‘adaptive’ instruments that do little more than put multiple-choice reading tests onto
the internet.
The study that we report was preliminary and exploratory and no more than an indi-
cation of the direction that IORA might take. It makes use of Latent Semantic Analysis
(LSA) (Landauer, 2002), an interesting but somewhat contentious computer-based
approach to evaluating the semantic content of texts. We are currently embarking upon
a more extensive series of studies, using seventy readers, a new set of tasks and our own
(rather than the University of Colorado’s) LSA program which will be based on the 100-
million-word British National Corpus; we are also trialling a plagiarism detection tool
to give those taking the test constant information about the relative amounts of verbatim
and non-verbatim content in their essays.
The exploratory study that we report here aimed to explore an alternative to multiple-
choice online reading comprehension tests, which may have high construct validity, but
which also have low ecological validity and negative backwash effects at the system
level. Our aim was to investigate an approach that might capture and evaluate some of
the complex cognitive processes that are involved in authentic web-based research
tasks. Eight fluent readers participated and for each participant, files were generated
based on the search terms used, the URLs visited and the text of a final essay. Each par-
ticipant’s evidence (i.e. the search terms they entered into Google and the text of their
final essay) was evaluated using LSA to produce an indication of five factors:
1. the degree of match between participant’s search goal and the lexical items in the
given task
2. the degree of match between the participant’s Google search terms and an expert’s
Google search terms
3. the degree of match between the participant’s essay task output and lexical items in
the given task
4. the degree of match between the participant's essay task output and an expert's written task output
5. the overall coherence of the essay task output.
Emerging technologies have changed the ways we read and write. Readers are not only
often confronted with a potentially vast amount of online text but also with web tech-
nologies which introduce new research tools, while preserving much of the old ways of
reading in traditional print. In addition, online texts link a variety of media and sources
that challenge readers to locate the information they need, decide which pieces of text to
view and in what order and relate this information to other facts also found among the
thousands of millions of web pages available via the internet (Charney, 1987). Reading
online text involves handling multiple texts displayed concurrently, sorting, navigating,
responding (summarising or copying and pasting the text) and filing (Schilit et al., 1999).
Figure 10.1 is a flow chart based on actual reader behaviour and outlines the nature
of the observable online reading process, starting from when readers are given an online
reading task until they produce a written output.
Figure 10.1 What happens when a reader performs a complex search task? (Derived from Harrison et al., 2004)
[Flow chart: read task; put keyword(s) into Google; evaluate Google search results; edit, augment or change the Google search term, or navigate to a new URL; deal with new URL possibilities; copy/paste content; revise/transform task output; evaluate task completion; loop back to the task or stop.]
As Figure 10.2 shows, online readers are affected by various factors before and while online reading takes place. Readers' reading skills, vision skills, web skills, reading styles, goals and strategies, prior knowledge and beliefs will guide them while reading and comprehending the current online text.
Pilot work with adults in a higher education setting led us to the following reading
activity cycle. In reading online, readers will first determine a search goal. To determine
the search goal, parts of the Comprehending procedure are called up and readers will
check their current knowledge state against the potential areas for searching out new
knowledge and the search terms that might lead them to that knowledge.
Readers will then attempt to reach a search goal and search the web by using a search
engine. Then, readers will activate the Online Reading procedure, which is to navigate
the search list provided by the search engine. In the next step, readers will select a web
page and read. Upon reading, readers will call up the Reading procedure, in which read-
ers will integrate word, phrase and sentence while at the same time checking for local reading coherence. After reading the selected web page, readers will decide whether to accept the content (copying and pasting or summarising it in the working document) or to reject the content and activate the Comprehending procedure again, in order to check their current comprehension and come up with a revised search goal. The activation cycles of the Online Reading, Reading and Comprehending procedures will continue until the online readers are satisfied with their comprehension, producing the outputs of the reading activity: navigation search lists, selected web pages and copy, paste and rewrite activities.
Figure 10.2 The input, process and output of online reading comprehension (derived from Omar et al., 2004)
Figure 10.3 presents the key elements of a low-inference model of reading, one which focuses on the observable. It is this model (see Harrison et al. (2004) for details of how the reading and research data captured from the searching and writing tasks are related) whose usefulness we aim to explore in this paper.
[Figure 10.3: the low-inference model of the observable online reading process, showing processes P1 (start) to P7 (stop), non-task activities, and second, third and fourth loops.]
The cosine between two texts (treated as a score in the range 0.0–1.0) can tell us how similar one text is to the other. A
cosine of 0.6, for example, would suggest that the texts are very similar; a cosine of 0.2
would suggest that they have some similarities, but are not closely related. LSA has
been widely used in the evaluation of written answers (Kintsch et al., 2000; Lemaire
and Dessus, 2001; Kanijeya, Kumar and Prasad, 2003) and also has been found useful
in assessing online reading behaviour (Juvina et al., 2002; Juvina and van Oostendorp,
2004). Since online reading involves online reading comprehension and online web
navigation, our research used LSA in a two-pronged evaluation: of online reading com-
prehension and online web navigation outputs.
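The study used the LSA facilities at www.lsa.colorado.edu. Purely as an illustration of the underlying idea, the Python sketch below approximates an LSA comparison by building a reduced semantic space with TF-IDF weighting and truncated SVD (scikit-learn) and returning the cosine between two texts; the background corpus, the number of dimensions and the texts are all placeholders, not the study's materials.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def lsa_cosine(text_a, text_b, background_corpus, n_dims=100):
    # Build a term-document matrix over the background corpus plus the two texts of interest.
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_term = vectorizer.fit_transform(list(background_corpus) + [text_a, text_b])

    # Reduce to a smaller number of latent dimensions (the 'LSA' step).
    n_dims = max(1, min(n_dims, doc_term.shape[1] - 1))
    svd = TruncatedSVD(n_components=n_dims, random_state=0)
    reduced = svd.fit_transform(doc_term)

    # Cosine between the two texts in the reduced space (roughly 0.0-1.0 for related texts).
    vec_a, vec_b = reduced[-2], reduced[-1]
    return float(cosine_similarity([vec_a], [vec_b])[0, 0])

# e.g. lsa_cosine(participant_essay, expert_essay, background_corpus)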
The experiment
In this study, we were interested in what kind of data we might capture concerning read-
ers and their interaction with online text so that we could have a detailed understanding
of their online reading behaviour. Eight participants who were regular computer users
took part in the experiment. Six participants, who were also students or student teach-
ers, were paid to participate. All participants were given one hour to complete the task
of defining Socialism, using web resources, in not more than 500 words, of which up to
300 could be verbatim quotation from sources. The search task was defined as follows:
The web offers many definitions of the term Socialism. Perhaps this is because it is a
term that produces strong emotions and it is a word that is considered important by
groups as well as individuals. Your task is to spend an hour looking at the informa-
tion available to you using whatever internet resources you are able to locate, to
produce a document in Word that presents and comments on some definitions of
Socialism.
Every participant used the same hardware and software. The hardware comprised a Pentium 4 desktop computer with a 2.66GHz CPU, 40GB hard disk storage, a network connection, a sound system and a headset microphone; an external CD writer, an external hard disk and 700MB CDs (for backup purposes); and a camcorder, tripod and mini-DV tapes to capture observable behaviour. The software comprised Internet Explorer and Netscape Navigator, to make online reading possible; Camtasia (Camtasia), to record audio of the participant's verbal comments and interview and to capture screen and online activities; Squid-NT, to capture visited URLs with a time stamp; and a beeper, to beep every two minutes so that the participant was alerted to comment on his or her work. The experiment took place in a quiet area where only the online reader and the observer/technician were allowed to be present, and all the software and hardware were arranged as depicted in Figure 10.4.
Before the experiment took place, the technician turned on the computer, deleted the previous cache file written by Squid-NT, ran the batch file to activate Squid-NT, checked the internet connection, started Word and created a new document, started the browser, checked that the participant knew how to switch quickly between Word and the browser (using the Taskbar or Alt-Tab), did a sound check to ensure that the speech recording was working properly, turned on the camcorder, started the beeper, arranged the windows so that both the beeper and the browser would appear on the screen and, last, asked the participant not to move around too much, so that the video recorder would capture a clear view of the participant and his or her online activities. Once the participant was ready, he or she was given the written task and reminded to respond (explaining what they were doing) when the beeper beeped. Then the technician started recording audio, video and screen.
After the experiment, the technician would stop Camtasia; stop Squid-NT and move and rename its var directory; stop the beeper; turn off the camcorder and microphone; and copy the browser's history.dat to a file and the videotape from the camera and the .avi (Camtasia) file to CD. Then the observer would have a short interview with the participant, with the camcorder turned on once again. The following were the prompts for the interview:
Thank you very much indeed for helping us.
Please talk to us about how you felt you tackled this task.
Did you feel happy about any things you did?
Were you unhappy about anything?
How would you comment on your own searching and navigation strategies?
What can you tell us about how you selected the sources that you decided were useful?
Have you anything to say about how you carried out the written task?
Thank you very much indeed for all your help.
Once the observer finished the interview, the technician would turn off the camcorder
and copy the captured interview to videotape.
Initially, all the data captured by video, audio and screen recorder were transcribed
manually. However, the screen captures were the primary source of interest, as they
explicitly and implicitly revealed online reading progress. Collecting data from vari-
ous sources was valuable, despite the fact that transcribing the audio/video of the
screen capture file was very time consuming. Combining the different data sources allowed triangulation, which provided a fuller picture of users' reasoning and interactions while reading online.
Each participant produced two types of data file: a text file consisting of all the visited URLs and a document file consisting of an essay. The document file needed only a few modifications (deleting blank lines between paragraphs) in order to be used by the LSA system. However, the text file containing the visited URLs (these were identified and analysed in more detail, e.g. by examining the log file and the audio and screen recordings) needed to be processed by a purpose-written Java program to differentiate which URLs were to be considered navigation pages (referred to as P3 in Figure 10.3) and which websites (referred to as P4 in Figure 10.3), and to identify the navigational goals used. Figures 10.5, 10.6, 10.7 and 10.8 show the URLs collected from one participant, the search terms used, the URLs of the pages returned by the Google search term calls and, finally, the websites visited as a result of navigating away from the Google-supplied links.
Figure 10.5 Part of the participant’s collected URLs
http://www.google.com
http://www.google.co.uk/search?q=socialism&ie=UTF-8&oe=UTF-8&hl=en&btnG
=Google+Search&meta=
http://www.google.co.uk/search?q=socialism&ie=UTF-8&oe=UTF-8&hl=en&btnG
=Google+Search&meta=
http://home.vicnet.au/~dmcm/
http://www.socialism.com/
http://www.socialism.com/whatfsp.html
Figure 10.6 Part of the participant’s navigational goals
socialism
socialism
socialism
dictionaries
dictionaries
dictionaries
dictionaries
online charles kingsley
political dictionary
political dictionary
Figure 10.7 Part of the participant’s associated navigational search list
http://www.google.co.uk/search?q=socialism&ie=UTF-8&oe=UTF-8&hl=en&btnG
=Google+Search&meta=
http://www.google.co.uk/search?q=dictionaries&ie=UTF-8&oe=UTF-8&hl=en
&meta=
http://www.google.co.uk/search?q=dictionaries&ie=UTF-8&oe=UTF-8&hl=en
&meta=
Figure 10.8 Part of the participant's associated visited websites
http://home.vicnet.au/~dmcm/, http://home.vicnet.au/~dmcm/,
http://www.socialism.com/whatfsp.html, http://www.socialism.com/whatfsp.html,
http://www.socialism.com/, http://www.google.com
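The chapter refers to a purpose-written Java program for this step. A comparable sketch in Python is given below; the Google URL pattern and the plain-text log format are assumptions based on Figures 10.5–10.8, not a description of the project's actual program. It separates Google results pages (P3) from visited websites (P4) and recovers the search terms used.

from urllib.parse import urlparse, parse_qs

def classify_urls(url_log_lines):
    # Split a log of visited URLs into Google search-results pages (with their
    # search terms) and ordinary websites, in the spirit of Figures 10.5-10.8.
    navigation_pages, search_terms, websites = [], [], []

    for line in url_log_lines:
        url = line.strip()
        if not url:
            continue
        parsed = urlparse(url)
        query = parse_qs(parsed.query)

        if "google." in parsed.netloc and "q" in query:
            navigation_pages.append(url)          # a Google results page (P3)
            search_terms.append(query["q"][0])    # the navigational goal, as in Figure 10.6
        elif "google." in parsed.netloc:
            navigation_pages.append(url)          # e.g. the Google home page
        else:
            websites.append(url)                  # a site the reader navigated to (P4)

    return navigation_pages, search_terms, websites

# For the log in Figure 10.5 this yields search terms such as ['socialism', 'socialism']
# and websites such as http://www.socialism.com/.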
Once all the participants' files had been collected, the data were evaluated using LSA, with cosine data obtained from the www.lsa.colorado.edu website. Table 10.1 shows the
degree of similarity (i.e. cosine scores) for the seven participants based on the following
five factors:
1. the degree of match between participant’s search goal and the given task
2. participant’s search goal and an expert’s internet search goal
3. participant’s written task output with the given task
4. participant’s written task output with an expert’s written task output
5. the overall coherence of written task output.
The eighth participant was an expert historian and his responses were used as a basis for
comparison (see factors 2 and 4 in Table 10.1).
Table 10.1 Latent Semantic Analysis cosine scores of all participants
[Table: cosine scores for participants 1–7 (columns) on each of the five factors (rows).]
The result in row 1 of Table 10.1 shows that all the participants received an LSA cosine
score (indicating the degree of similarity with a source text) of less than 0.5 when their
search terms (as used in Google) were compared with the words in the research task.
This is because, while navigating the web, participants 1–6 tended at times to enter tan-
gential search terms (e.g. participant 1 put ‘Charles Kingsley’ and ‘political’, participant
2 put ‘emotion’ and ‘Nouveu Cristianity’ [sic], participant 3 put ‘Communism’ and
‘democracy’). However, participant 7 scored the highest degree of match when he just
used two search terms: ‘Socialism’ and ‘Oxford Dictionary’, which seemed to agree
more with the given task. However, the LSA values in row 2 (which compares each participant's Google search terms with the expert's) are higher: participants 1–6 scored more than 0.5. This showed that the participants used search terms similar to those of an expert, with those of participant 5 being the most similar.
Rows 1 and 3 show LSA similarity scores on the search terms used by participants
and on the lexical items in their essays, as compared to the lexical items in the research
task. Even though the scores in row 1 vary, the LSA score in row 3 does not follow the
same pattern. However, neither set of scores is high, which suggests that using the task
prompt as a basis for LSA comparisons is not particularly illuminating.
Rows 2 and 4 present LSA similarity scores on the Google search terms used by par-
ticipants and on their written output, the essay. Row 2 shows a good deal of variation;
participants 2 and 5 are very similar and participant 7 is quite distant. However, the LSA
scores based on the participant’s essay as compared with the expert’s essay were in the
range 0.93 to 0.96, which suggests that, for these participants at least, variation in search term did not lead to variation in overall content and that, in terms of what LSA can measure, there was a high degree of agreement between the non-experts and the expert.
Row 5, however, shows clear differences between participants. The coherence score
is based on the mean of cosine scores from pairs of adjacent sentences in the final essay
(i.e. an LSA score is computed for each pair of sentences in the essay: 1 v 2, 2 v 3, 3 v
4, 4 v 5, etc.; the mean of these cosine scores is taken as an indication of the overall
coherence of the text). Participants 1–5 had more writing experience than participants 6 and 7 and, while summarising, copying and pasting from the web pages into their documents, they attended to the coherence of their own writing. Participants 6 and 7, by contrast, were concerned with getting the materials pasted into their documents without thinking much about the coherence between the selected sources. Participant 5, however, had an exceptionally high coherence value because she copied or summarised from a single, well-organised web page.
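A hedged sketch of the row 5 calculation is given below. It splits sentences on full stops and uses a plain TF-IDF space as a stand-in for a trained LSA semantic space, so the numbers it produces would not match the study's; the point is only to show the 'mean cosine of adjacent sentence pairs' computation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def coherence_score(essay_text):
    # Mean cosine similarity between each pair of adjacent sentences
    # (sentence 1 v 2, 2 v 3, 3 v 4, ...), as described for row 5 of Table 10.1.
    sentences = [s.strip() for s in essay_text.split(".") if s.strip()]
    if len(sentences) < 2:
        return 0.0

    vectors = TfidfVectorizer().fit_transform(sentences)
    adjacent = [
        float(cosine_similarity(vectors[i], vectors[i + 1])[0, 0])
        for i in range(len(sentences) - 1)
    ]
    return sum(adjacent) / len(adjacent)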
Our current research is still at an early stage and will serve as the basis of the next stages of our investigation. There are still many angles to explore on the online reading assessment horizon, especially with regard to psychology (Juvina and van Oostendorp, 2004) and web navigation (McEaney, 1999a). A correlational analysis between the expert's marks for all the participants and the five factors in Table 10.1 showed positive correlations between the expert's scores and factors 2, 3 and 5. In our next experiment we will investigate these three factors further, increasing the number of participants and collecting our data automatically using Real-time Data Collection (RDC) (McEaney, 1999b), since this method of data collection significantly enhances the potential to contribute to investigations of reading.
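The correlational step could be carried out as in the sketch below; the factor names, scores and marks shown in the comment are invented placeholders, not the study's data.

from scipy.stats import pearsonr

def correlate_factors_with_marks(factor_scores, expert_marks):
    # Pearson correlation between each factor (a row of Table 10.1, one value
    # per participant) and the expert's marks for the same participants.
    results = {}
    for name, scores in factor_scores.items():
        r, p = pearsonr(scores, expert_marks)
        results[name] = (r, p)
    return results

# e.g. (hypothetical numbers, one value per participant):
# correlate_factors_with_marks(
#     {"factor_2": [0.60, 0.72, 0.55, 0.80, 0.90, 0.65, 0.30]},
#     [55, 60, 50, 70, 75, 58, 40],
# )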
The five factors used in the evaluation did not penalise those who copied more verbatim material from the source web pages than they were allowed to. Therefore, in the next stage of our experiment, a verbatim detection method will be included. Another matter that we have to think about is the number of web pages visited and referred to. Our thinking here is that if a participant visited and referred to only one or two web pages, they might score higher on factor 5 but defeat the main purpose of evaluating the whole perspective in online reading, that is, to gain as much information as possible from a variety of sources.
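One simple way such a verbatim check might work (a sketch only; the project's own plagiarism detection tool is not described here) is to count how many word n-grams in the essay also occur in the source web pages:

def verbatim_ratio(essay_text, source_texts, n=8):
    # Rough share of the essay made up of word n-grams that also appear
    # verbatim in one of the source web pages; n=8 is an arbitrary choice.
    def word_ngrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    essay_grams = word_ngrams(essay_text)
    if not essay_grams:
        return 0.0

    source_grams = set()
    for source in source_texts:
        source_grams |= word_ngrams(source)

    copied = len(essay_grams & source_grams)
    return copied / len(essay_grams)

# A value near 1.0 suggests the essay is largely verbatim quotation; the 300-word
# quotation allowance could then be checked separately against the word count.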
Conclusions
As we have already stated, our work is at an early stage, but we are clear about a num-
ber of issues: we want to research online reading using complex, purposeful and
authentic tasks; we want to continue to use the whole internet, rather than a limited sub-
set of web pages, as the basis for those tasks; we are interested in assessment using
artificial intelligence approaches and we feel that these may well continue to exploit
LSA approaches (though we also believe that our LSA scores will be more valid when
they are based on a much larger and UK-derived corpus).
As we have indicated, we now have our own LSA engine, which is capable of deliv-
ering online LSA text matching data and we also have a beta version of an online
plagiarism detection tool. We anticipate using both tools as part of a kit that would be
capable of not only assessing online reading activity, but of providing real-time online
support for readers tackling complex research tasks. Over the next two years, therefore, we expect to be developing programs that will deliver both assessment tools and online reader support tools.
References
Burniske, R.W. (2000). ‘Literacy in the cyber age’, Ubiquity – an ACM IT Magazine
and Forum.
Camtasia, ‘TechSmith Corporation’ [online]. Available: http://www.techsmith.com.
Charney, D. (1987). ‘Comprehending non-linear text: the role of discourse cues and
reading strategies.’ Paper presented at the Hypertext 87 Conference, New York.
Kanijeya, D., Kumar, A. and Prasad, S. (2003). ‘Automatic evaluation of students’ answers
using syntactically enhanced LSA.’ Paper presented at the HLT-NAACL 2003.
Dreyer C. and Nel, C. (2003). ‘Teaching reading strategies and reading comprehension
within a technology-enhanced learning environment’, System, 31, 3, 349–65.
Duffy, G.G. and Roehler, L.R. (1989). ‘Why strategy instruction is so difficult and what
we need to do about it.’ In: McCormick, C.B., Miller G. and Pressley M. (Eds) Cog-
nitive Strategy Research: From Basic Research to Education Applications. New
York: Springer-Verlag.
Eagleton, M.B. (2002). ‘Making text come to life on the computer: toward an under-
standing of hypermedia literacy’, Reading Online, 6, 1, 2002.
Goodman, K.S. (1994). ‘Reading, writing and written text: a transactional sociopsy-
cholinguistic view.’ In: Singer, H. and Ruddell, R.B. (Eds) Theoretical Models and
Processes of Reading. Fourth edn. Newark, DE: International Reading Association.
Harrison, C. (1995). ‘The assessment of response to reading: developing a post-modern
perspective.’ In: Goodwyn, A. (Ed) English and Ability. London: David Fulton.
Harrison, C. (2004). Understanding Reading Development. London: Sage.
Harrison, C., Omar N. and Higgins, C. (2004) ‘Intelligent Online Reading Assessment:
capturing and evaluating internet search activity.’ Paper presented at the Beyond the
Blackboard: Future Direction for Teaching, Robinson College, Cambridge, 3–4
November.
Hayes, J.R. and Flower, L.S. (1980). ‘Identifying the Organization of Writing Process-
es.' In: Gregg, L.W. and Steinberg, E.R. (Eds) Cognitive Processes in Writing.
Hillsdale, NJ: Lawrence Erlbaum Associates.
International Reading Association (1999). ‘High Stakes Assessments in Reading.’
Available: http://www.reading.org/pdf/high_stakes.pdf [9 July, 2003].
Juvina, I., Iosif, G., Marhan, A.M., Trausan-Matu, S., van der Veer, G. and Chisalita,
C. (2002). ‘Analysis of web browsing behavior – a great potential for psychology
research.’ Paper presented at the 1st International Workshop on Task Models and
Diagrams for User Interface Design (Tamodia 2002), Bucharest, 18–19 July.
Juvina, I. and Van Oostendorp, H. (2004). ‘Predicting user preferences – from semantic
to pragmatic metrics of web navigation behaviour.’ Paper presented at the Dutch
Directions in HCI, Amsterdam, June 10.
Kintsch, E., Steinhart, D., Stahl, G., Matthews C. and Lamb, R. (2000). ‘Developing
summarization skills through the use of LSA-backed feedback’, Interactive Learning
Environments, 8, 2, 87–109.
Kintsch, W. (1998). Comprehension: A Paradigm for Cognition. Cambridge: Cambridge
University Press.
Kintsch, W. and van Dijk, T.A. (1978). ‘Toward a model of text comprehension and pro-
duction’, Psychological Review, 85, 363–94.
Landauer, T. K. (2002). ‘On the computational basis of learning and cognition: Argu-
ments from LSA.’ In: Ross, B.H. (Ed) The Psychology of Learning and Motivation, 41,
43–84.
Landauer, T.K., Foltz, P.W. and Laham, D. (1998). ‘An introduction to latent semantic
analysis’, Discourse Processes, 25, 2 and 3, 259–84.
Lemaire, B. and Dessus, P. (2001). ‘A system to assess the semantic content of student
essays’, Journal of Educational Computing Research, 24, 3, 305–20.
Spiro, R.J., Coulson, R.L., Feltovich, P.J. and Anderson, D.K. (1994). ‘Cognitive
flexibility theory: advanced knowledge acquisition in ill-structured domains.’ In:
Singer, H. and Ruddell, R.B. (Eds) Theoretical Models and Processes of Reading.
Fourth edn. Newark, DE: International Reading Association.
Stallard, C. (1974). ‘An analysis of the writing behaviour of good student writers’,
Research in the Teaching of English, 8, 206–18.
Wright, P. and Lickorish, A. (1994). ‘Menus and memory load: navigation strategies in
interactive search tasks’, International Journal of Human-Computer Studies, 40,
965–1008.
Further reading
AEA (2003). ‘American Evaluation Association Position Statement on high stakes test-
ing in pre K-12 Education’ [online]. Available: http://www.eval.org/hst3.htm.
BBC (2003a). Tests ‘cause infants stress’. BBC News online, 25 April.
BBC (2003b). Teacher jailed for forging tests. BBC News online, 7 March. Available:
http://news.bbc.co.uk/1/hi/england/2829067.stm.
Cronbach, L.J., Linn, R.L., Brennan, R.L. and Haertel, E.H. (1997). ‘Generalizability
analysis for performance assessments of student achievement or school effective-
ness’, Educational and Psychological Measurement, 57, 3, 373–99.
Derrida, J. (1976). Of Grammatology. Trans. Spivak, G.C. Baltimore: The Johns Hop-
kins University Press.
Eagleton, T. (1983). Literary Theory. Oxford: Basil Blackwell.
Fullan, M. (2000). ‘The return of large scale reform,’ Journal of Educational Change, 1,
15–27.
Harrison, C. (1995). ‘The assessment of response to reading: developing a post-modern
perspective.’ In: Goodwyn, A. (Ed) English and Ability. London: David Fulton.
Harrison, C., Bailey, M. and Dewar, A. (1998). ‘Responsive reading assessment: is
postmodern assessment possible?’ In: Harrison, C. and Salinger, T. (Eds) Assessing
Reading 1: Theory and Practice. London: Routledge.
Harrison, C. and Salinger, T. (Eds) (1998). Assessing Reading 1: Theory and Practice.
London: Routledge.
Hayward, L. and Spencer, E. (1998). ‘Taking a closer look: a Scottish perspective on
reading assessment.’ In: Harrison, C. and Salinger, T. (Eds) Assessing Reading 1:
Theory and Practice. London: Routledge.
Hellman, C. (1992). Implicitness in Discourse. Uppsala: International Tryck AB.
Jacob, B.A. and Levitt, S.D. (2001). ‘Rotten apples: An investigation of the prevalence and
predictors of teacher cheating’ [online]. Available: http://economics.uchicago.edu/
download/teachercheat61.pdf
Kibby, M.W. and Scott, L. (2002). ‘Using computer simulations to teach decision making
in reading diagnostic assessment for re-mediation’, Reading Online, 6, 3. Available:
http://www.readingonline.org/articles/art_index.asp?HREF=kibby/index.html
Kintsch, W. (1998). Comprehension: A Paradigm for Cognition. Cambridge: Cambridge
University Press.
Lyotard, J.-F. (1984). The Postmodern Condition: A Report on Knowledge. Trans. Ben-
nington, G. and Massumi, B. Manchester: Manchester University Press.
Medvedev, P.N. and Bakhtin, M. (1978). The Formal Method in Literary Scholarship,
trans. Wehrle, A.J. Baltimore: Johns Hopkins University Press.
Pearson, P. D. and Hamm, D.H. (in press). The Assessment of Reading Comprehension:
A Review of Practices – Past, Present and Future.
Pumfrey, P.D. (1977). Measuring Reading Abilities: Concepts, Sources and Applica-
tions. London: Hodder and Stoughton.
Salinger, T. (1998). ‘Consequential validity of an early literacy portfolio: the “back-
wash” of reform.’ In: Harrison, C. and Salinger, T. (Eds) Assessing Reading 1:
Theory and Practice. London: Routledge.
Valencia, S. and Wixson, K. (2000). ‘Policy-oriented research on literacy standards and assessment.’ In: Kamil, M. et al. (Eds) Handbook of Reading Research: Volume III.
Mahwah, NJ: Lawrence Erlbaum Associates.
Vincent, D. and Harrison, C. (1998). ‘Curriculum-based assessment of reading in Eng-
land and Wales.’ In: Harrison, C. and Salinger, T. (Eds) Assessing Reading 1: Theory
and Practice. London: Routledge.
11 Automated marking of content-based
constructed responses
Claudia Leacock
Introduction
Educators and assessment specialists would like to increase test validity by replacing, as
much as possible, multiple-choice assessments with constructed responses. However,
scoring or marking these assessments is time consuming and therefore costly. As an
example, Vigilante reports that ‘25 percent of online faculty time [is] currently spent on
grading written assignments and examinations’ in the New York University Virtual Col-
lege (Vigilante, 1999, p.59). To date, most of the research in automated marking has
focused on grading essay-length responses (Burstein, 2003, pp.113–22; Landauer et al.,
2003, pp.87–112).
However, much of the teacher’s time is spent on grading short content-based responses, such as those that appear in in-class tests, quizzes and homework assignments like a textbook’s end-of-chapter review questions. There is
a growing interest in automated marking of fairly brief content-based responses (Pen-
stein-Rosé and Hall, 2004; Perez et al., 2004; Sukkarieh et al., 2004). An automated marking engine, c-rater™, is being developed at the Educational Testing Service (ETS) to measure a student’s understanding of specific content in free responses, assigning full, partial or no credit to each response. It uses automated natural language processing (NLP) techniques to determine whether a student response contains the specific linguistic information that is required as evidence that particular concepts have been learned.
If responses can be marked automatically, then a natural extension of c-rater would
be to provide automatically generated feedback for the student as to why a response
received the mark that it did. That is, c-rater could identify which part of the response
it recognized as being correct and also identify the concept(s) that it did not find in the
student response.
A question can be scored by c-rater if there is a fixed set of concepts that satisfy it. An
open-ended question asking for an opinion or for examples from personal experience is
not a question for c-rater. A sample of questions that c-rater has scored successfully
appears in Table 11.1. As can be seen, the questions are varied – over science, math and
reading comprehension – and open-ended enough to admit a variety of responses. How-
ever, there is a limited number of possible correct answers. Consider the 11th grade
reading comprehension question. This particular question requires the student to identify
four concepts, in order for a response to get full credit: (1) what is important to Walter,
(2) a supporting quote from Walter, (3) what is important to his mother and (4) a support-
ing quote from her. In this particular rubric, partial credit is assigned if the response
contains only one, two, or three of these concepts. Otherwise, no credit is assigned.
Table 11.1 Questions that c-rater has successfully marked

Grade 4, Reading Comprehension: National Assessment of Educational Progress (NAEP)
Give two reasons stated in the article why the hearth was the center of the home in colonial times.

Grade 8, Reading Comprehension: National Assessment of Educational Progress (NAEP)
How did ‘Oregon fever’ influence westward movement?

Grade 8, Science
Explain how you would design an experiment that would investigate the importance of light to plant growth. Include the type of organisms required, the control and the variable and the method of measuring results.

Grade 8, Math: Reading Comprehension: National Assessment of Educational Progress (NAEP) (This is an approximation of the prompt used in the study.)
A radio station wanted to determine the most popular type of music among those in the listening range of the station. Would sampling opinions at a Country Music Concert held in the listening area of the station be a good way to do this? Explain your answer.

Grade 11, Reading Comprehension: Indiana Core 40 End-of-Year Assessment
Compare and contrast what Mama and Walter in A Raisin in the Sun believe to be the most important thing in life or what they ‘dream’ of. Support your choice for each character with dialogue from the excerpt of the play.
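The credit rules described above for the grade 11 item can be pictured as a small data structure plus a scoring rule. The sketch below is a hypothetical re-statement of that rubric, not c-rater’s internal representation: it assumes that concept detection has already produced the set of required concepts found in a response, and it uses the 0/1/2 scale of the Indiana marking described later in the chapter.

```python
# Hypothetical re-statement of the grade 11 rubric (not c-rater's internal
# representation). Concept detection is assumed to have happened already; this
# simply maps the number of required concepts found onto full, partial or no credit.
REQUIRED_CONCEPTS = [
    "what is important to Walter",
    "a supporting quote from Walter",
    "what is important to his mother",
    "a supporting quote from his mother",
]

def score_response(found_concepts):
    """Return 2 (full credit), 1 (partial credit) or 0 (no credit)."""
    hits = sum(1 for concept in REQUIRED_CONCEPTS if concept in found_concepts)
    if hits == len(REQUIRED_CONCEPTS):
        return 2    # all four concepts present: full credit
    if hits >= 1:
        return 1    # one, two or three concepts: partial credit
    return 0        # none of the required concepts found: no credit

# A response covering Walter but not his mother earns partial credit.
print(score_response({"what is important to Walter", "a supporting quote from Walter"}))  # 1
```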
c-rater marks a response by matching it to a set of model answers that have been gener-
ated by a content expert using a graphical user interface that is described in Leacock and
Chodorow (2003). In order for these models to be robust, the person who is generating them needs to have access to about 50 human-scored responses for each score point.
The system recognises when a response is equivalent to a correct answer that is rep-
resented in the model or, in other words, c-rater recognises paraphrases of the correct
answer or answers. In order to recognise a paraphrase, c-rater normalises across the
kinds of variation that typically occur in paraphrases: syntactic and morphological vari-
ation, use of words that are synonymous or similar in meaning and the use of pronouns
in the place of nouns.
A single concept can be expressed in many different ways. One of the concepts that is
required in order to get full credit for the 11th grade reading comprehension item is that
‘money is important to Walter’. Below are some of the ways the students expressed this
concept:
1. Walter thinks that it is money.
2. … but to Walter money is almost everything.
3. He mostly believes that money is important.
4. Walter is concerned with money.
5. … he wants material things.
6. … the son tries to tell his mom that money is the most important thing.
Although these sentences and thousands of other variants differ in a number of ways,
they all convey the same concept. It is c-rater’s task to recognise that these responses are correct while distinguishing them from incorrect responses that contain very similar
language, such as ‘Walter cares about his dignity and his mother worries about money.’
Much of the variation among the students’ responses is due to surface syntactic differences. For example, in the question about photosynthesis, one student may write an active sentence such as ‘water them both’ while another may choose to use the passive voice, as in ‘they both should be watered’. To recover the underlying syntactic form, c-rater generates a syntactic analysis (Abney, 1996) from which it extracts each clause’s predicate argument structure – such as its subject, verb and object. When the predicate argument structure is extracted, much of the surface syntactic difference among the responses is eliminated.
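To illustrate the idea (rather than c-rater’s actual analysis), the toy sketch below takes hand-supplied dependency triples in place of a real parser’s output and reduces an active clause and its passive counterpart to the same verb-plus-patient structure.

```python
# Toy illustration of predicate-argument normalisation (not Abney's parser or
# c-rater's analysis). The "parses" are supplied by hand as (head, relation,
# dependent) triples; the point is only that an active clause and its passive
# counterpart reduce to the same verb-plus-patient structure once surface
# syntax is stripped away.
def predicate_argument(parse):
    verb = next(dep for _, rel, dep in parse if rel == "root")
    patient = None
    for _, rel, dep in parse:
        if rel in ("dobj", "nsubjpass"):   # object of active / subject of passive
            patient = dep
    return verb, patient

active  = [(None, "root", "water"), ("water", "dobj", "them")]
passive = [(None, "root", "watered"), ("watered", "nsubjpass", "they"),
           ("watered", "aux", "should"), ("watered", "auxpass", "be")]

print(predicate_argument(active))    # ('water', 'them')
print(predicate_argument(passive))   # ('watered', 'they')
# The verb and pronoun still differ in form; the morphological and pronoun
# resolution steps described below are what collapse those remaining differences.
```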
c-rater then normalises across word variation due to inflected and derived word forms using a morphological analyser that was developed at ETS. Inflectional normalisation removes the differences due to number, tense or agreement (dreamed and dreams are normalised to dream). Normalising across derived forms allows verbs like measure and the derived noun measurement to be collapsed when one student uses the noun in ‘Take their measurements’ while another uses the verb in ‘Measure the plants’ growth.’
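The ETS morphological analyser is not publicly available, so the sketch below uses a small hand-built lexicon to stand in for it; the point is only the mapping of inflected and derived forms onto a shared base form, not the analyser’s actual coverage or method.

```python
# Toy morphological normalisation (a small hand-built lexicon stands in for the
# ETS analyser). Inflected and derived forms are mapped to a shared base form so
# that, for example, 'measurements' in one response and 'measure' in another are
# treated as the same content word.
BASE_FORMS = {
    "dreams": "dream", "dreamed": "dream", "dreaming": "dream",
    "measurements": "measure", "measurement": "measure", "measures": "measure",
    "watered": "water", "watering": "water",
}

def normalise(tokens):
    return [BASE_FORMS.get(token.lower(), token.lower()) for token in tokens]

print(normalise("Take their measurements".split()))   # ['take', 'their', 'measure']
print(normalise("Measure the plants".split()))        # ['measure', 'the', 'plants']
```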
c-rater’s pronoun resolution module is a version of Morton (2000) that has been specifically trained on student responses to essays and short-answer questions. It identifies all of the noun phrases that a pronoun could refer to and automatically selects the most likely one. In the examples we have seen, he and she refer unambiguously to Walter or to his mother. However, when c-rater was used to mark a question that involved comparing three former US presidents, the pronoun he provided no clue as to which president was being discussed. The most frequently used pronoun in these responses is it, which similarly gives little clue as to its referent.
The system also needs to recognise lexical substitution – when students use words
that are synonyms of, or have a similar meaning to, words in the model answer. To this
end, the system uses a word similarity matrix (Lin, 1998) that was statistically generat-
ed after training on more than 300 million words of current American and British fiction,
nonfiction and textbooks. Table 11.2 shows the words that are closest to ‘important’,
‘humorous’ and ‘light’.
Table 11.2 Example word similarity matrices
When a word in a response is not recognised, c-rater uses a standard edit distance algorithm to count the number of keystrokes that separate the unrecognised word from the words in a dictionary. When the minimum edit distance is small, the unrecognised word is replaced with the closest word in the dictionary.
When marking content-based short answer questions, the domain of discourse is highly restricted – consisting of the test question, the model answer and the reading passage or the content on which an assessment is based. This restricted domain enables c-rater to perform highly accurate spelling correction because it uses only the discourse as its dictionary. For example, one of the questions that c-rater marked was looking to see whether President Reagan was identified in the response. We identified 67 distinct mis-spellings of ‘Reagan’ in the student responses, one of the most frequent being ‘Reagons’. My word processor suggests replacing ‘Reagons’ with ‘Reasons’. However, since ‘reasons’ does not appear in the domain of that question, it was not considered as a possible correction at all. Using this method, 84 per cent of the variants of ‘Reagan’ were correctly identified.
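A minimal sketch of this restricted-domain correction strategy is given below: a standard Levenshtein (edit) distance and a dictionary limited to words from the question’s domain, so that ‘Reagons’ is pulled towards ‘Reagan’ rather than ‘reasons’. The distance threshold of 2 is an assumption for the illustration, not a value reported in the chapter.

```python
# Minimal sketch of restricted-domain spelling correction with Levenshtein
# (edit) distance. The dictionary holds only words from the question, the model
# answer and the passage, so out-of-domain words such as 'reasons' are never
# candidates. The threshold of 2 edits is assumed for illustration.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def correct(word, domain_words, max_distance=2):
    best = min(domain_words, key=lambda w: edit_distance(word.lower(), w.lower()))
    return best if edit_distance(word.lower(), best.lower()) <= max_distance else word

domain = {"Reagan", "president", "policy", "economy"}   # hypothetical domain dictionary
print(correct("Reagons", domain))   # Reagan
```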
The c-rater answer models consist of a set of relations that represent the components
of a correct answer. Then, for each relation that is represented in that model, c-rater
attempts to identify a comparable relation in a student’s response by extracting and
normalising verbs and their arguments.
Case studies
In the Spring of 2003, the State of Indiana Commission for Higher Education, the Indi-
ana Department of Education, ETS and a subcontractor, Achievement Data, Inc.,
collaborated to develop and field test an administration of an online end-of-course test
for 11th grade students. The two courses selected for this pilot study were 11th grade
English and Algebra I. All of the prompts were marked on a three-point scale, with zero
for no credit, one for partial credit and two for full credit. There was a six-week period
from when testing began in May to reporting all of the scores. During these six weeks,
c-rater scoring models were deployed for 17 reading comprehension and five algebra
questions. By the middle of June, c-rater had scored about 170,000 11th grade student
responses to the reading comprehension and algebra questions.
In order to estimate how accurate the scoring was, 100 responses to each ques-
tion were selected from the pool and scored by two readers. These data were used to
calculate inter-reader agreement, as well as c-rater agreement with reader 1 and
with reader 2. The average agreement rates and kappa values for the five algebra
models and 17 reading comprehension models are shown in Table 11.3. On these
responses, c-rater agreed with the readers about 85 per cent of the time for both the
reading comprehension and algebra questions, whereas the readers agreed with
each other 93 per cent of the time for reading comprehension and 91.5 per cent of
the time for algebra.
A subsequent pilot test of the system took place in May 2004, in which eight of the c-rater models were reused – two of the algebra models and six of the reading comprehension models. Again, to estimate the scoring accuracy, 100 responses to each question were scored by two readers, and these data were used to calculate inter-reader agreement and c-rater agreement with reader 1 and with reader 2, as shown in Table 11.4. The c-rater models remained stable and the results were consistent with a previous c-rater study (Sandene et al., 2005).
Table 11.4 Results for the Indiana 2004 pilot
Sources of error
The more abstract or imprecise a required concept is, the more difficult it is to build a model. When the concept being looked for is imprecise,
then there are sure to be ways of stating it that are not found in the range-finding sets.
Usually when c-rater errs, it assigns a score that is too high rather than one that is
too low, thereby giving more credit than is deserved. This often occurs because a
response can contain the appropriate language even though its meaning differs from
the concept required by the model. As an example, a concept that c-rater tries to iden-
tify is ‘it is an old house’. One student wrote that ‘the author is telling you how old the
house is’, which was not credited by either reader. This becomes more problematic as
a model is adjusted to accept sentence fragments as being correct answers. In this
adjustment, c-rater imposes fewer requirements in order to allow syntactically incom-
plete forms that nonetheless embody the elements of the model. The problem seems
unavoidable because human readers consistently accept sentence fragments – even
very ungrammatical ones.
In general, if a distinction between partial or no credit is difficult for humans to
make, as shown by inter-rater agreement, then that distinction is even more difficult for
c-rater to make. The same holds for distinctions between partial and full credit.
Feedback to students
Since c-rater can recognise which concepts appear in a student response and which do
not, a natural extension for c-rater is to give feedback as well as a score. For example,
suppose a prompt asks for an explanation and an example in order to receive full credit.
If c-rater finds neither example nor explanation and zero is assigned to the response, it
is clear that neither was found. Similarly, if it assigns a 2 to the response, obviously both
concepts were identified. However, when c-rater assigns partial credit, it can return a score along with a feedback message such as ‘c-rater has identified your explanation but cannot find your example’ or, conversely, ‘c-rater has identified your example, but cannot find your explanation’.
Of course, when partial credit is assigned, c-rater may have assigned the correct
score for the wrong reason. For example, it is possible that the response contains an
explanation but no example and that c-rater has inappropriately recognized an example
but not the explanation. In this case, the score is correct but the feedback would be in
error. Table 11.5 shows the cross tabulation table for c-rater and one of the readers for the six reading comprehension questions. Table 11.6 shows the cross tabulation for
the two Algebra questions. The boldface cells show the number of responses where both
c-rater and the reader assigned partial credit. It is these responses where c-rater and the
reader agree – but the score may not have been assigned for the same reason.
Table 11.5 Cross tabulation of reader and c-rater scores for the reading comprehension questions

                          c-rater:     c-rater:         c-rater:
                          no credit    partial credit   full credit
Reader: no credit             95           30                2
Reader: partial credit        10          215               26
Reader: full credit            1           31              188

Table 11.6 Cross tabulation of reader and c-rater scores for the algebra questions

                          c-rater:     c-rater:         c-rater:
                          no credit    partial credit   full credit
Reader: no credit            114            8                4
Reader: partial credit         8           46                4
Reader: full credit            1            3               12
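Exact agreement and kappa can be computed directly from the cell counts of a cross tabulation like Table 11.5. The short sketch below does this for the reading comprehension counts above, assuming the columns run c-rater no credit, partial credit and full credit from left to right; it is a generic worked example of the calculation, not code or figures taken from the study.

```python
# Worked illustration of exact agreement and Cohen's kappa from a 3x3 cross
# tabulation, using the reading comprehension counts above. Column order
# (c-rater no / partial / full credit) is assumed; the formula is the standard one.
table = [
    [95, 30, 2],     # reader: no credit
    [10, 215, 26],   # reader: partial credit
    [1, 31, 188],    # reader: full credit
]

total = sum(sum(row) for row in table)
observed = sum(table[i][i] for i in range(3)) / total
row_totals = [sum(row) for row in table]
col_totals = [sum(table[i][j] for i in range(3)) for j in range(3)]
expected = sum(row_totals[k] * col_totals[k] for k in range(3)) / total ** 2
kappa = (observed - expected) / (1 - expected)

print(f"responses: {total}")               # 598
print(f"exact agreement: {observed:.2f}")  # about 0.83
print(f"kappa: {kappa:.2f}")               # about 0.74
```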
The scoring rubrics for the prompts in the 2004 pilot have three structures:
Structure 1. Full credit: 2 concepts
Partial credit: 1 concept
No credit: 0 concepts
Structure 2. Full credit: any 2 examples (out of 3 in the reading passage)
Partial credit: 1 example (out of 3 in the reading passage)
No credit: 0 examples
Structure 3. Full credit: 2 examples and 1 explanation
Partial credit: 2 examples and no explanation
1 or 2 examples and no explanation
0 or 1 example and 1 explanation
No credit: no examples and no explanation.
Six of the prompts have structure 1 (the two algebra items and four reading comprehen-
sion items), one reading comprehension prompt has structure 2 and another has
structure 3.
For the 100 examples from each prompt that were scored by two readers, we gener-
ated feedback of the sort:
Structure 1: C-rater has identified the concept that … you have not answered the
question completely.
Structure 2: C-rater has found only one example, that … you should give two exam-
ples.
Structure 3: C-rater has identified one example, that … you need to give another
example and an explanation.
… and so on.
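As a concrete illustration of how such messages could be assembled, the hypothetical function below generates score-plus-feedback pairs for a prompt requiring an explanation and an example (the case discussed in the ‘Feedback to students’ section). The wording and the function are invented for illustration and are not the study’s actual feedback templates.

```python
# Hypothetical feedback generator for a prompt requiring an explanation and an
# example; the messages paraphrase those above and are not the study's own code.
def feedback(found_explanation: bool, found_example: bool):
    score = int(found_explanation) + int(found_example)
    if score == 2:
        return 2, "Both your explanation and your example were identified."
    if score == 0:
        return 0, "Neither an explanation nor an example could be found."
    present = "explanation" if found_explanation else "example"
    missing = "example" if found_explanation else "explanation"
    return 1, f"Your {present} was identified, but your {missing} could not be found."

print(feedback(found_explanation=True, found_example=False))
# (1, 'Your explanation was identified, but your example could not be found.')
```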
These messages were inspected, along with the responses, to determine the accuracy of the feedback. As can be seen from the three structures, there are many ways to get partial credit. In structure 1, there are two ways, in structure 2 there are three ways and in structure 3 there are six. When c-rater and the reader both assigned partial credit, c-rater gave about 4 per cent inappropriate feedback on both the Algebra test (2 out of 46 responses) and the Reading Comprehension test (9 out of 215 responses). These cases represent only about 1 per cent and 1.5 per cent of all responses, respectively. As expected, the prompt with structure 3 had the highest level of inaccurate feedback. Often, disagreement between the readers and c-rater was due to spelling errors that c-rater could not correct (the readers are extraordinarily good at interpreting spelling errors) or to very ungrammatical responses. In many cases, c-rater/reader agreement would improve if the student were to revise the response by fixing these errors.
Conclusion
When c-rater agrees with readers on a partial-credit score, it may do so for the wrong
reasons. These reasons are revealed in its feedback comments so the accuracy of the
feedback is a higher standard of measure than simple score agreement. However, exper-
imental results show that only a small proportion of partial credit score agreement is for
the wrong reasons. If these cases are added to the score disagreements, the overall error rate increases by only 1–1.5 percentage points.
Author’s note
This chapter describes work that was done by the author when she was at Educational
Testing Service.
References
Abney, S. (1996). ‘Partial parsing via finite-state cascades.’ In: Proceedings of the ESS-
LLI ’96 Robust Parsing Workshop.
Burstein, J. (2003). ‘The E-rater scoring engine: automated essay scoring with natural
language processing.’ In: Shermis, M.D. and Burstein, J. (Eds) Automated Essay
Scoring: A Cross-Disciplinary Perspective. Mahwah, NJ: Lawrence Erlbaum Asso-
ciates.
Landauer, T.K., Laham, D. and Foltz, P.W. (2003). ‘Automated scoring and annotation
of essays with the Intelligent Essay Assessor.’ In: Shermis, M.D. and Burstein, J.
(Eds) Automated Essay Scoring: A Cross-Disciplinary Perspective. Mahwah, NJ:
Lawrence Erlbaum Associates.
Leacock, C. and Chodorow, M. (2003). ‘c-rater: Automated scoring of short-answer
questions’, Computers and the Humanities, 37, 4.
Lin, D. (1998). ‘Automatic retrieval and clustering of similar words.’ In: Proceedings of
the 35th Annual Meeting of the Association for Computational Linguistics, Montreal,
898–904.
Morton, T.S. (2000). ‘Coreference for NLP applications.’ In: Proceedings of the 38th
Annual Meeting of the Association for Computational Linguistics, Hong Kong.
Penstein-Rosé, C., and Hall, B.S. (2004). ‘A little goes a long way: quick authoring of
semantic knowledge sources for interpretation.’ Paper presented at the 2nd Interna-
tional Workshop on Scalable Natural Language Understanding (ScaNaLU 2004) at
HLT-NAACL 2004.
Perez, D., Alfonseca, E. and Rodriguez, P. (2004). ‘Upper Bounds of the Bleu algorithm
applied to assessing student essays.’ Paper presented at the 30th annual conference of
the International Association for Educational Assessment (IAEA), Philadelphia, PA.
Sandene, B., Horkay, N., Bennett, R.E., Allen, N., Braswell, J., Kaplan, B. and Oranje, A. (2005). Online Assessment in Mathematics and Writing: Reports From the NAEP
Technology-Based Assessment Project, Research and Development Series. Washing-
ton, DC: National Center for Education Statistics.
Sukkarieh, J.Z., Pulman, S.G., and Raikes, N. (2004). ‘Auto-marking 2: an update on
the UCLES-Oxford University research into using computational linguistics to score
short, free text responses.’ Paper presented at the 30th annual conference of the Inter-
national Association for Educational Assessment (IAEA), Philadelphia, PA.
Vigilante, R. (1999). ‘Online computer scoring of constructed-response questions’,
Journal of Information Technology Impact, 1, 2, 57–62.
12 The role of formative assessment
Gordon Stobart
Purpose in assessment
Two key questions of any assessment are:
• What is the purpose of this assessment?
• Is the form of assessment fit-for-purpose?
If the purpose of an assessment is not clear, or there are multiple, and sometimes com-
peting, purposes, then there is the potential for confusion about what inferences can be drawn from that assessment. One broad classification of purpose distinguishes between summative and formative assessments. Summative assessment incorporates assessments which ‘sum up’, at a given point in time, where students are in their learning. End-of-course examinations would be an obvious example of this, with the results used for managerial purposes such as selection or reporting progress.
Far more problematic is the definition of formative assessment. Formative has
proved so elastic in its usage that many have stopped using it and have switched to
‘assessment for learning’ (e.g. Assessment Reform Group, 1999, 2002). This switch
Timing
It is an oversimplification to claim that the difference is essentially one of timing. Timing simply offers more or less potential for learning from an assessment; it does not define which kind of assessment it will be. Robert Stake’s often quoted ‘when the cook tastes the soup it’s formative, when the customer tastes the soup, it’s summative’ is potentially misleading if the impression is given that it is the timing that makes the difference rather than the purpose and consequence of the tasting.
Timing is not the difference between formative and summative assessments, but is
often an expression of it. Summative assessment can be used formatively when there is
subsequently time to learn from the responses and act upon them. Thus summative
assessment used before the end of a course or phase has clearly more potential to be
formative for students. Would not classroom assessment undertaken two thirds of the
way through a course provide a better basis for learning?
This may sound a bit idealistic (how will their further learning be recognised if
they’ve already done the test?). However, the Daugherty Assessment Review Group in
Wales (2004) recommended something similar for the national tests. At present the
National Curriculum tests are taken at the end of primary school (year 6). The results have no impact on secondary school selection, which is decided prior to the tests, and secondary schools make little systematic use of the information. Daugherty has
recommended that the tests (or modified versions which look at broader skill areas)
should be taken at the end of year 5 so that teachers can use them formatively during the
next year. The end of primary school assessment will then be a summative teacher
assessment. What has made this possible for Wales is that primary school performance
tables have been abolished, so the distorting role of this form of accountability is
removed. Another variation of this is found in France where the pattern is to test stu-
dents at the beginning of secondary school, rather than the end of primary, so that the
teacher can use the information formatively.
Learning
One further definitional point needs to be made – that we also need a working definition of
learning. How we view learning impacts on how we approach assessment. The assump-
tion here is that learning is an active, meaning-making process (see chapter 4) which is
most effective when it builds on what is already known and makes connections with other
learning. I am adopting Eraut’s ‘strong’ definition of learning as ‘a significant change in
capability or understanding’ which excludes ‘the acquisition of further information when
it does not contribute to such changes’ (Coffield, 1997, p.5). The spirit of this is also cap-
tured in Harlen and James’ (1997) distinction between deep and surface learning (see
Table 12.1). They seek to reduce polarities with the introduction of the concept of strate-
gic learning in which the learner seeks optimal combinations of deep and surface learning
– the problem being that much current assessment encourages only surface learning.
While some of the polarities between learning and performance orientations have been
overplayed, I would not wish to identify learning directly in terms of test scores, espe-
cially where the tests may show limited construct validity. Good scores do not
necessarily mean that effective learning has taken place, as evidenced by Gordon and
Reese’s (1997) conclusions on their study of the Texas Assessment of Academic Skills
(TAAS) which students were passing
even though the students have never learned the concepts on which they are being
tested. As teachers become more adept at this process, they can even teach students
to correctly answer test items intended to measure students’ ability to apply, or syn-
thesise, even though the students have not developed application, analysis or
synthesis skills. (p.364)
That test scores may, in part, be the result of training in test-taking techniques independent of an understanding of the subject matter may be self-evident to practitioners, but this is not always how policy makers see it. For many of them the relationship is
unproblematic: test scores are a direct representation of learning standards. If more 11-
year-olds reach level 4 for reading on the National Curriculum assessment in England,
then reading standards have risen, with no acknowledgement that there may be an
‘improved test taking’ factor in this. Tymms (2004) has demonstrated this effect in
relation to National Curriculum assessment of literacy in England, where progress on
other independent assessments of literacy has been far more modest. Linn (2000) has
shown how there is a regular pattern of substantial early improvements in scores on a
new high-stakes test, as teachers improve their preparation of their students, followed
after about four years by a flattening out of scores (which then comes as a shock to
these same policy makers).
So if evidence from a reading assessment is used to identify misunderstandings and
leads to a clearer conceptual understanding, then this is formative. If it simply improves
test taking techniques, with no need for changes in mastery of the concept or skill, then
this learning does not meet the definition. While I recognise there will be some grey
zones between understanding and technique, I want to signal that improving scores is
not automatically a formative learning process.
The evidence from Andrew Watts’ research in chapter 13 illustrates just such concerns. While Assessment Focuses offer the potential for formative feedback from reading assessments, they also run the risk of being used as micro-teaching techniques which focus on gaining extra marks (‘knowledge in bits’) rather than on increasing understanding.
One of the central themes of this book is validity. Critical to any discussion of the
assessment of reading is making explicit the construct of reading that is being tested and
evaluating the extent to which an assessment represents this. This approach sits well
with recent formulations of validity in which it is no longer seen as a static property of
an assessment (e.g. predictive/concurrent/face validity) but is based on the inferences
drawn from the results of an assessment. It is treated as:
• an integrated concept, rather than a variety of approaches, organised around a broader
view of construct validity (Shepard, 1993, pp.405–50)
• a property of the test scores rather than the test itself. The 1985 version of the Amer-
ican Standards was explicit on this: ‘validity always refers to the degree to which …
evidence supports the inferences that are made from the scores’ (AERA, 1985, p.9).
be self-correcting. This, however, is problematic when attempts are made to use forma-
tive assessment as the basis for summative judgements since the formative is likely to
become a series of mini-summative assessments, with the emphasis on reliability rather
than further learning. For this reason I am very wary of schemes of continuous assessment which claim to be formative, since ‘formative’ is often meant in the sense of ‘forming’ the final mark rather than informing learning.
Where these difficulties are not appreciated we may see threats to the validity of both
formative and summative assessments. Harlen and James (1997) propose that:
The alternative to using the same results of assessment for both purposes is to use
relevant evidence gathered as part of teaching for formative purposes but to review
it, for summative purposes, in relation to the criteria that will be used for all pupils
… In other words, summative assessment should mean summing up the evidence, not
summing across a series of judgements or completed assessments. (p.375)
Given the complexity of the formative/summative relationship, how can assessments be
used to help learning? The rest of this chapter will focus on a central aspect of this
process: the quality of feedback and how this may encourage or discourage further
learning.
Effective feedback
Feedback is defined as ‘closing the gap’ between current and desired performance
(Sadler, 1989). Some of the key features of effective feedback are:
• it is clearly linked to the learning intention
• the learner understands the success criteria/standard
• it focuses on the task rather than the learner (self/ego)
• it gives cues at appropriate levels on how to bridge the gap:
– self-regulatory/metacognitive
– process/deep learning
– task /surface learning
• it challenges, requires action, and is achievable. (Stobart, 2003, pp.113–22)
If the gap is not closed (that is, learning has not taken place) then what has been done
does not qualify as feedback. In their meta-analysis of feedback studies Kluger and
DeNisi (1996) concluded:
In over one third of the cases Feedback Interventions reduced performance … we
believe that researchers and practitioners alike confuse their feelings that feed-
back is desirable with the question of whether Feedback Intervention benefits
performance. (pp.275, 277)
not necessarily follow’ (p.119). If students do not understand the purpose of what they
are doing and have no clear sense of the quality of performance needed, then feedback
will make little sense.
There are two main, and inter-related, strands to addressing this issue. The first
involves making the intended level of performance more explicit to the student. This
may involve making the ‘learning intention’ explicit as well as the ‘success criteria’
(Clarke, 2001). This is something that has been addressed in materials that accompany
National Curriculum reading assessment, though the critical issue is the extent to which
teachers have ‘downloaded’ them to their students. Two key teaching and learning
approaches are the use of exemplars and modelling, both of which are familiar to most
practitioners and part of the primary and secondary teaching and learning strategies in
England. It is through the active use of exemplars (‘what makes this response better than
that one’) that learners begin to appreciate the differences in the quality of performance
and may begin to articulate what would be needed to improve a piece of work (Sadler,
1989). Modelling involves demonstrating how to approach a question: ‘drawing inferences’ from a reading passage, for example, will mean nothing to, or be completely misunderstood by, some students until there is a practical demonstration of how to go about this task (made more active when students are able to improve on the teacher’s draft).
The second, and closely linked, strand involves students in peer- and self-assess-
ment. The rationale here is that if students are being asked to assess each other and
themselves they will have to actively develop a sense of the success criteria/standard
against which their work is to be evaluated. At the heart of this is the teacher handing
over the control and authority which assessment brings, something which teachers often find very anxiety-provoking (Black et al., 2003). This is also a student skill which has to
be developed rather than assumed (Black et al., 2003). This is a long way from the com-
mon ‘peer assessment’ practices of marking each other’s work using answers provided
by the teacher.
results, particularly with more authentic assessments. This is particularly the case where
reporting is in terms of a profile rather than just a single grade. The assumption is that
this provides increasingly rich detail about what needs to be done and so next steps can
be made explicit. I am not as optimistic about the potential from feedback on fixed
response/multiple-choice tests, where information is generally in terms of numbers of
correct responses.
Conclusion
Assessing reading provides complex information about where students are in their
learning. Formative assessment seeks to move this learning on by providing feedback
which will encourage such movement. This chapter has illustrated that this is not a simple teacher-led process. For learning to take place the student must also understand what
is being learned and what successful performance entails. Similarly feedback has to be
geared to the task or process and be realistic and informative to the learner. Formative
assessment in reading offers the potential to improve as well as monitor reading skills.
Author’s note
This chapter draws on the author’s paper, ‘The formative use of summative assessment’, presented at the 30th Annual IAEA Conference, June 2004, Philadelphia, USA.
References
Messick, S. (1989). ‘Validity.’ In: Linn, R.L. (Ed) Educational Measurement. Third edn.
New York: American Council on Education and Macmillan.
Sadler, R. (1989). ‘Formative assessment and the design of instructional systems’,
Instructional Science, 18, 119–44.
Shepard, L. (1993). ‘Evaluating test validity.’ In: Darling-Hammond, L. (Ed) Review of
Research in Education, Vol. 19.
Stobart, G. (2001). ‘The Validity of National Curriculum Assessment’, British Journal
of Educational Studies, 49, 1, 26–39.
Stobart, G. (2003). ‘Using assessment to improve learning: intentions, feedback and
motivation.’ In: Richardson, C. (Ed) Whither Assessment. London: QCA.
Torrance, H. and Pryor, J. (1998). Investigating Formative Assessment: Teaching,
Learning and Assessment in the Classroom. Buckingham: Open University Press.
Tymms, P. (2004). ‘Are standards rising in English primary schools?’ British Educa-
tional Research Journal, 30, 4, 477–94.
Wiliam, D. (2000). ‘Recent development in educational assessment in England: the
integration of formative and summative functions of assessment.’ Paper presented at
SweMaS, Umea, Sweden.
Wiliam, D. and Black, P. (1996). ‘Meanings and consequences: a basis for distinguish-
ing formative and summative functions of assessment?’ British Educational
Research Journal, 22, 5, 537–48.
13 Using assessment focuses to give
feedback from reading assessments
Lorna Pepper, Rifat Siddiqui and Andrew Watts
While it is generally acknowledged that reading is an holistic process, defining goals for
improving pupils’ reading often depends on the identification of discrete skills. Such cate-
gorisations are an integral part of the process of planning, teaching and assessment, as
teachers may target particular areas of reading by drawing on their assessments of pupils’
performances in the discrete skills.
In the mark schemes for the National Curriculum tests of reading in England, an explicit
statement about the focus of each test item is made. Six ‘assessment focus’ (AF) categories
are presented to make clearer to markers of test papers and to teachers the aspect of reading
being tested by each question. This information is not given explicitly to the pupils: it is the
questions which tell them what to focus on and the AFs are not printed on the test paper.
However, schools have been encouraged to undertake self-evaluative activities with pupils,
exploring their performance at question level and discussing targets for improvement. An article, ‘On Track’, in The Secondary English Magazine in April 2004 described such a use of
AFs with year 10 pupils. It concluded that for reading they produced useful information
from the assessment, but for writing they were ‘of very limited help in providing pupils with
specific targets for future improvement’ (Matthews, 2004, pp.12–14).
Drawing on these areas of current practice, a study was planned to explore the useful-
ness of the AFs in giving feedback to year 9 pupils from a reading test. The study aimed to
investigate pupils’ understanding of the AFs, and the degree to which knowledge of the
focus of each question might help them in successfully understanding and then answering
reading test questions. If it is successful, this process could be said to follow what
Brookhart says is the key to the formative process, which is ‘having a concept of the goal
or learning target, which originally is the teacher’s, but which ideally the student will
internalise’ (Brookhart, 2001, p.153). This echoes the Assessment Reform Group’s pre-
scription that if assessment is to lead to learning it should ‘help pupils to know and to
recognise the standards they are aiming for’ (Assessment Reform Group, 1999, p.7).
The assessment focuses that apply to reading tests at KS3 are:
AF2 understand, describe, select or retrieve information, events or ideas from texts and
use quotation and reference to text
AF3 deduce, infer or interpret information, events or ideas from texts
AF4 identify and comment on the structure and organisation of texts, including
grammatical and presentational features at text level
AF5 explain and comment on writers’ uses of language, including grammatical and
literary features at word and sentence level
AF6 identify and comment on writers’ purposes and viewpoints and the overall effect
of the text on the reader (QCA, 2003).
Procedure
An experienced English teacher, who was teaching several classes of year 9 pupils,
agreed to participate in the research. While a mixed-ability group was considered bene-
ficial for the research aims, the teacher felt that the procedure would prove challenging
for some pupils with lower attainment levels in English. It was agreed that as the pupils
were organised in ability groups for English, a class of average to above-average ability
would take part. This group consisted of 31 pupils (although not all pupils were present
during all research sessions).
A sequence of teaching sessions was agreed as part of the project, with the adminis-
tration of a KS3 reading test (the 2003 test, QCA 2003) being the starting point. The
research was completed during four teaching sessions, within a three-week period.
A number of questionnaires and tasks were developed to capture pupils’ reflections on
the process, and their understanding of assessment focuses. The following list summarises
the procedure and instruments.
1. Pupils completed a KS3 reading test paper (the 2003 paper).
2. The teacher marked the pupils’ test papers.
3. The teacher introduced pupils to the idea of assessment focuses and together they
discussed the application of AFs to questions in a different KS3 reading paper.
4. Pupils received their marked 2003 test papers along with a bar graph of their per-
formance on each AF, including comparisons with ‘pre-test’ average scores (taken
from a representative sample who trialled the test paper). This was designed to
focus their attention on potential for improvement in their own scores.
5. Pupils were asked to consider their results, and their areas of strength and weak-
ness, using prompts on Questionnaire 1.
6. Pupils then received different copies of the original test paper, on which the relevant
assessment focus was printed under each question. They were asked to amend or
rewrite their answers in order to improve their scores, using the AFs to help them.
7. After amending their papers, pupils were asked to complete Questionnaire 2 in order
to give their opinions about the helpfulness of AFs to the process of revision. They
were also asked to indicate the types of changes they had made to their answers.
8. The teacher marked pupils’ amended answers, and added these marks to their
original scores.
9. Pupils received a bar graph showing their original and revised marks.
10. Using Questionnaire 3, pupils were asked to rank the different factors that they
found helpful as they revised their answers, choosing from 5 options, with 1 the
highest and 5 the lowest. 0 was also offered as an option, in cases where a particu-
lar factor was not at all helpful.
11. In Questionnaire 4, pupils were asked to rewrite each assessment focus in their own
words, to make it as simple as possible for younger pupils to understand. This was
aimed at capturing those aspects of the AFs that pupils found particularly difficult
to interpret.
12. Finally, pupils considered three specific questions from the test again, in Ques-
tionnaire 5. They were asked to describe how they might use the assessment
focus to help them answer more effectively at the first attempt, or how the AF
might prove helpful in answering similar questions. This task was designed to
record the extent to which pupils could extract and apply aspects of the AFs to
questions.
13. The teacher’s views on the value and impact of assessment focuses on pupils’
performances were captured through a semi-structured interview.
Findings

When pupils were asked (in Questionnaire 4) to rewrite each assessment focus in their own words, their attempts for AF2 included:

Take bits out of writing and talk about it, and use parts of the writing to show you know about it.
Understand describe or take information, events or ideas from the text and use pieces of
the text to back up or explain things.
Where pupils were unsure about this AF, observations demonstrated that ‘retrieve’ and
‘reference to text’ proved difficult for them to comprehend, although all pupils’ attempts
were close to the meaning of the AF.
AF3 was challenging for pupils to interpret (and paraphrase), primarily because of the terms deduce and infer. During their writing, two pupils sought the teacher’s help
with the definition of these words. Many others wrote descriptors of this AF that
focused on understanding events or ideas, but did not touch on the meaning of deduce
or infer. In a few cases, pupils appeared to relate these terms to the need for personal
opinion in answers. Some partially accurate attempts included:
In depth describe the text, what is it saying in detail.
Use your own ideas to explain the text. Don’t just write about the text, write about what
the text makes you think.
Write what you see in the text in your own words and how you see it.
A few pupils seemed to grasp the focus on ‘reading between the lines’ in the AF:
To be able to understand the text and what is behind it.
Use the text and find the meanings hidden in it.
AF4 also prompted writing that demonstrated pupils’ partial understanding of the focus.
Structure and organisation were relatively well presented, while grammatical and pre-
sentational features at text level proved more problematic. A few pupils referred to
grammar or punctuation in their descriptions, but it is clear that the majority found these
difficult or unfamiliar concepts. Attempts included:
Find and talk about the way the writing is set out and look at the sort of punctuation used,
on your point of view.
Point out how grammar/composition is used to improve the text.
Look at and make a comment on the way the text is written and organised make sure you
include grammar at the same level as the text.
AF5 also posed difficulties for pupils in understanding grammatical and literary fea-
tures. Many pupils demonstrated understanding of the general focus on use of language,
and sometimes related this to word and sentence level, although the terms themselves
may not have been familiar.
Use the writer’s use of language and explain and comment on it at a higher level.
Explain and make comments on how the[y] written the text eg what language use in full
worded sentences.
Some successful renderings of this assessment focus showed good understanding of the
skills involved.
Explain and comment on writer’s use of language, including grammar, e.g. exclamation
marks and literary features, eg metaphors.
Write about what words the writer uses and why, what parts the words play in each
sentence.
AF6 gave rise to many partially successful definitions. While handling the writer’s views or intentions, pupils often failed to include the effect of the text on the reader.
Look at what the writers [are] saying and their opinions about the topic.
Explain parts of the story and why the writer has chosen to do that.
Find what point the writer is trying to get across.
Pupils’ responses provide rich data for further exploration and they suggest that the starting point for developing a more meaningful process of feedback could be the pupils’ own definitions of the assessment focuses.
When the class’s average performance on the first and second attempts was compared across the assessment focuses (that is, for the group of questions in each AF category), there were some noticeable differences (see Table 13.1).
Table 13.1 Changes in average class performance by AF

      First attempt   Second attempt   Change
AF2        91               98            +7
AF3        62               81           +19
AF4        71               74            +3
AF5        48               67           +19
AF6        50               80           +30
It is clear that performance on questions in the AF6 category rose most sharply but, in
common with the other categories that saw large improvements (AF3 and AF5), initial
scores for AF6 questions were among the lowest. There was, therefore, greater scope for
pupils to make gains in these categories. Questions addressing AF4 saw the smallest
improvement, which may confirm the difficulties noted during pupils’ attempts to interpret
and re-express this focus (discussed above).
While the data provide much information about performance on the test, changes in
the pupils’ scores cannot be attributed to any single factor. This research did not estab-
lish a control group with which comparisons could be drawn. The support and teaching
pupils received between their first and second attempts aimed to introduce and explain the
assessment focuses. They were encouraged to make an explicit connection between
questions in reading test papers and the AFs. This targeted intervention may have influ-
enced the success of pupils’ revised responses. However, other factors were also,
inevitably, influential in their performance.
During the process of revising their answers, the researchers noted the high degree to
which pupils sought help from others. In order to interpret the accuracy of their original
answers and understand which elements were lacking, pupils looked to their peers.
They also asked for clarification and help from the teacher and researchers, primarily
because they had tried unsuccessfully to use the assessment focus as a prompt, and were
genuinely nonplussed by the question concerned. For some pupils, a further unassisted
attempt to answer the questions seemed de-motivating.
Table 13.2 Pupil ratings of the helpfulness of AFs when revising answers
Response Frequency
Pupils who felt the AFs had been helpful gave reasons such as:

The assessment focuses gave interesting ‘educational’ words for us to put in our
answers. It also helped us to understand the questions and what type of answers we
should put.
It obviously didn’t tell me what to do, but I also basically knew what the questions
were looking for, but it helped distinguish between different ideas I had about
answering some questions.
Because I knew what the markers were looking for in the answer, and helped under-
stand how to put more detail in what areas.
Because I knew the outlines to put in the questions, I knew the outline for the answers
instead of more or less flailing for the answers and getting lucky guesses.
A few pupils, while they felt the AFs had helped them, noted other factors.
The fact that we got to do it again and knowing what we got wrong helped. Plus
knowing what area our answer should be in and seeing if the answer that was wrong
was in that category.
Among those pupils who stated that the AFs were unhelpful, comments were also
revealing.
I didn’t understand the description. I didn’t think that knowing what type of question
it is would help me answer it.
It doesn’t explain the answers.
Following their second attempts, pupils were also asked to consider the kinds of
changes they had made to their answers and to comment on whether they felt their
answers had improved. The vast majority of the group felt that they had increased their
score, and once again, pupils’ explanations referred to different factors. The benefits of
re-reading both the questions, and their original answers, were mentioned several
times.
Yes, because I have read them more carefully and I can see where I went wrong.
I think that I have put better answers to some difficult questions. I did this by reading
the questions well.
I think so, looking through it I made a few stupid mistakes. I added more information
in some particular answers and I saw my particular areas I was struggling on.
Some pupils felt the assessment focuses had been helpful and a few mentioned using the
language of the assessment focuses in their answers (the latter had been listed as a possible
strategy in a different part of the questionnaire, so may have influenced some responses).
I think I have improved my answers in particularly the 1 or 2 mark questions. After
reading the questions again I was able to look at the question from another angle.
The assessments [sic] focuses did help too, they told me what type of things I needed
to put in my answer, whether it be about language, structure or writer’s purposes.
I think I have improved my answer to Question 13 a lot because I wrote more about
the general language and why he chose words. I also improved question 10 by talk-
ing about the reasons for language and structure that the writer chose, and the effect
of them.
Yes, because in some of them [the first time] I had ‘alternative’ answers, but I didn’t
put them down. The sad thing is was [sic] that they were right! Also the assessment
focuses gave ‘educational’ words to include in my answers.
The process of revision, involving interaction with a number of different sources of
information (including assessment focuses), seemed to the pupils to have been useful.
In the next session, pupils were asked to rank a range of factors for their helpfulness
during the process of revision. Table 13.3 shows frequencies for the highest rating.
Table 13.3 The most helpful factors for pupils when revising answers
Three of the five factors from which pupils could choose were clearly viewed as being
more useful than others: Looking at how other pupils answered the questions; Dis-
cussing the questions; Having time to reread the questions. This is unsurprising in some
ways, but nevertheless revealing. For the majority of pupils, the process of reviewing
their performance was most fruitful when it involved interaction with others. As dis-
cussed above, in many cases pupils were uncertain about: why their initial answers were
inaccurate; the precise demands of the question (and mark scheme); and the meaning of
the AFs themselves. Comparing their responses with those of others offered immediate
insight into some of these unknowns. Similarly, discussing and comparing their inter-
pretations of questions supported their understanding of how an answer might be
refined.
Pupils had the opportunity to add comments to explain their rankings, providing
some useful insights into their thinking.
Being able to discuss different questions with people enabled me to understand the
things I didn’t when we done [sic] the first test. It also helped re-reading the question,
in case you missed anything. The AFs helped when you had no idea what to write.
I found that being able to look at how others answered some of the questions was very
helpful because if you had no idea about how to go about answering the question it
gave you a good idea starter.
As observed by the teacher and researchers during this session, the opportunity for
pupils simply to re-read the questions and their own answers was, in itself, an important
factor in improving responses. Pupil ratings confirmed this. However, referring to the
assessment focus for each question was not rated highly by pupils in this exercise (in
contrast to some of their positive ratings in the questionnaire discussed above).
Table 13.4 Pupil rankings of the usefulness of AFs when revising answers
As Table 13.4 shows, pupils predominantly ranked use of the AFs as ‘4’ and ‘5’, the lowest
points on the 5-point scale, with one assigning ‘0’, to signify the AFs were not at all useful.
An important point that emerged from the sessions was that pupils claimed that they
rarely received detailed information about their performance following summative
assessments. Black and Wiliam’s review of the literature on classroom assessment,
Assessment and Classroom Learning, reported that ‘there is little reflection on what is
being assessed’ even among teachers (Black and Wiliam, 1998), so the pupils’ percep-
tion above is perhaps not surprising. For them the study provided an unusual and
welcome opportunity to re-read, review and revise their answers, as part of a feedback
process. The role of the assessment focuses in this process was, however, not easy to
isolate.
Overall, the teacher commented positively on the assessment focuses and the experi-
ences of the research project. As an experienced marker of the KS3 reading tests, she was
already very familiar with the assessment focuses as a guide to marking, but this knowl-
edge had not automatically been transferred into her teaching practice. To support pupils’
skills in interpreting and answering reading questions, she would focus on the language of
the question itself, asking them to highlight key words; as she put it: ‘I would always start
with the question.’ She felt that, as the assessment focuses are drawn from the National
Curriculum for English, teachers often work in harmony with them, even if they are
unaware of this. The way in which the AFs are expressed is important in this respect. She
identified several terms that are not in common usage among teachers (in her experience):
We don’t tend to think in terms of ‘text-level’, ‘sentence-level’, ‘grammatical fea-
tures’. We think: ‘this is in exam-speak’. We tend to say [to the pupils]: ‘pick out the
words that suggest … to you’ [for ‘use of language’ questions].
Particular terms (for example, show, express, reveal) were highlighted for pupils (at
KS3 and GCSE level) as signifiers of specific requirements in a question. Teachers used
language that they knew would be familiar and comprehensible to pupils.
Looking at each AF, the teacher felt that AF2 was ‘not a problem’ for pupils to under-
stand, with describe, select and retrieve being familiar terms. AF3 was considered a
little more difficult, particularly as deduce and infer are ‘not really used’ by teachers and
she felt pupils would find it difficult to work out their meaning unaided. (This is consis-
tent with pupils’ own lack of familiarity with these terms, discussed above.) For AF4,
she identified structure and organisation as easier elements for pupils to understand
and to ‘do on their own’. AF5 was considered more challenging, particularly because of
the phrase, grammatical and literary features at text level. This ‘defeats’ pupils because
of the unfamiliarity of the language. While she felt that AF6 was generally well under-
stood by pupils, a weakness was their lack of awareness of the writer’s identity or
existence beyond the text. Some pupils read texts (fiction in particular) as ‘real’, rather
than as constructed works. Explicitly mentioning the author’s name in questions could
support them.
Focusing on the impact of the research project, the teacher felt that the process of
feedback and review had a positive impact on pupils. Seeing a graph of their perform-
ance on the test helped to give a ‘confidence boost’ to some pupils and to stimulate
those who were unfocused or careless. The teacher felt there were some differences
between the attitudes of girls and boys, and that the graphical data would prompt some
boys to put greater effort into their work. She felt that the opportunity to amend their
answers was valuable and that the pupils had gained much from their discussion of
answers and of the assessment focuses. Looking at other pupils’ answers and making
judgements about degrees of success in different answers was also helpful. She felt it
would be useful to formalise this task by providing pupils with a range of answers and
asking them to explain why each answer was awarded a particular mark.
The teacher described plans to use assessment focuses with the same group in the
following half term, before they complete their next reading test; she expected their
Commentary
The purpose of feedback from assessment should be further learning. Cowie and Bell
described formative assessment as 'The process used by teachers and students to recognise
and respond to student learning in order to enhance that learning, during the
learning' (Cowie and Bell, 1999). In the case reported above, however, one might easily
say that the pupils were merely trying to raise their marks
in the test, a task involving little worthwhile learning. One boy in the study captured
well a cynical view of what was going on: ‘During the time we were “rewriting” the
test, people copied answers from other people who got them right. This is why some
people got a higher level than they had done recently.’ There is certainly evidence that
some pupils saw what they were doing merely in terms of gaining more marks.
Some might say that the possibility of using such feedback for a more educational
purpose – for example, to enable pupils to learn more about what it means to read and
understand a text – was lost at the start. This was because the feedback was explained in
terms of asking pupils to say how they might improve their answers in a test. This
involved a consideration of the marks they had scored, which according to the work
cited by the Assessment Reform Group ‘encourages pupils to concentrate on getting
better grades rather than on deeper understanding’ (Assessment Reform Group, 2002,
p.6). Sadler also says that, ‘where a grade or a score assigned by the teacher constitutes
a one-way cipher for students, attention is diverted away from fundamental judgements
and the criteria for making them' (Sadler, 1989, p.121). Brookhart, however, argues that
our view of the implications of the use of summative assessment models need not be so
negative. Her study of students in English and Anatomy classes led her to suggest that
for those students ‘summative judgements were temporary stops along a learning path’
(Brookhart, 2001, p.157). She quotes Biggs in order to make the point that what matters
is how ‘deeply criterion-referenced’ the exercise is (Biggs, 1998). We would argue that
the teacher in the study reported here was, in her use of AFs, putting her emphasis on
the underlying criteria for more successful reading, except that instead of discussing cri-
teria the class discussed focuses.
We must look more closely at the concept of focuses, or focusing, in reading. There
are two kinds of focusing going on here. One is the focusing done by the pupil on the
text to be read. The second is the focus of the assessor. For the national tests the asses-
sors fall into two categories: the test developers, for whom the AFs serve to clarify the
writing of questions, and the markers, for whom the AFs guide decisions about what
credit to give. When the teacher is the assessor, he or she usually plays both these roles.
The focus that is relevant to the pupils, however, is the focus of the reader. It is
important to be clear, when we talk about AFs, that both kinds of focusing are caught up
in the phrase and that we are not asking the pupils to be assessors, but readers.
As readers, we can approach texts in a variety of ways, particularly when we reflect
on what we have read. We sometimes look closely at the text, at a particular word or
phrase; sometimes we stand back from it, and think about the structure. Sometimes we
think about the ideas in the text, sometimes about the feelings. Sometimes we think
about the writer and his or her point of view, sometimes we think about the text as it
appears to us as readers. Such a process in reading, which is part of our trying to make
sense of the text for ourselves, could be described as ‘focusing’. To become reflective
readers, pupils need to learn to do this.
It is more appropriate to talk about focusing in reading, as we have just done, than
focuses. To choose a focus is a legitimate thing to do, but it can distort the whole pic-
ture. If we take a close-up photograph of one flower, the picture of the garden becomes
blurred. To get the full picture we must be able to move forwards and backwards. This
illustration suggests why too hard and fast an interpretation of AFs can become prob-
lematic. Our definition of reading focuses must capture the fact that they are part of a
living experience of reading, and that they change and blend into a process. Focusing,
as a metaphor for what is happening when we read, reminds us of the possibility of
blurring what we see. A reading focus does not have a hard edge; it can be imprecise,
but that does not invalidate its use.
These points raise a fundamental question, which is: Can we validly describe and use
focuses, or assessment objectives, when we seek to assess and improve pupils’ reading?
We believe we can. Sadler describes a ‘progression in which criteria [in our case for
successful reading] are translated for the students’ benefit from latent to manifest and
back to latent again’ (Sadler, 1989, p.134). But we think that we have not been good at
describing why attempts to sharpen the focus, or make the criteria manifest
(for example, in mark schemes for tests of reading), are problematic and do not produce
clear-cut answers. The use of AFs is made problematic when the process of reading is
lost sight of. And the process of using AFs well is made problematic by their only
appearing in tests, especially tests which have public implications for the pupils, teach-
ers and schools involved. If the AFs were seen more in the context of learning, and
particularly of self-assessment, it would be the perspective of the readers that became
more prominent, above that of the assessors.
In this study, the aim was to find out what feedback might help pupils understand
more about the reading skills that they need to deploy in order to understand a text. They
wrote very positively about the benefits of interacting with others, but were they mere-
ly looking for ‘the correct answers’? And they tended not to value the work with AFs.
Many of the quotations from pupils do appear to be very test-bound, and one wonders
how far they would be able to generalise from working on the answers to these questions
to the underlying educational aim of the exercise.
QCA has proposed the above kind of use of AFs as 'diagnostic feedback'. It would,
however, be a mistake to see what is happening during such feedback on a reading exercise
as a series of diagnoses of strengths and weaknesses in the pupils' overall
performance in reading. To treat this as a diagnostic exercise would indeed push the
outcome towards a series of narrow hints about how to get more marks on one particular
test, and possibly about how to improve performance in a future test. What underlies
this misconception is the fact that ‘diagnostic’ feedback refers to the kind of focuses that
interest the assessor (the test marker, or the teacher who is possibly too focused on the
product), whereas it is the focusing of the reader that needs to be addressed.
From the perspective of the creative activity of reflecting on our reading, the use of
AFs with pupils does indeed appear to have some possibilities. Sadler claims that
'Knowledge of the criteria is "caught" through experience, not defined'
(Sadler, 1989, p.135). Pupils in their feedback for this study wrote of understanding
more what lay behind the questions, once they had thought about the AFs. They had
used in their activities some of the vocabulary and concepts which underlie the subject
‘English’. For example, at the very least they had identified concepts like ‘deduce and
infer’ and acknowledged that there was something about these that they needed to
understand better. Some of their comments on what they needed to learn next relied
heavily on repetition of the, perhaps, unfamiliar vocabulary of the AFs. But in using this
vocabulary, they found a language to go beyond this test and these questions to an
underlying purpose. Grasping that there was a bigger picture into
which their efforts could fit would have been empowering for them, as the
teacher described. In Sadler’s interesting term they were becoming part of ‘the guild’ of
readers.
As greater familiarity with the vocabulary of the subject's underlying aims leads
to an increasing understanding of its concepts, the pupils' pursuit of higher-level reading
skills becomes possible. The pursuit involves purpose and successes rather than
pessimistic cries of ‘I must try harder’, which imply little hope of knowing where to
make the effort. We could say that what the teacher was doing was, in Sadler’s words,
‘providing guided but direct and evaluative experience for students, which enables them
to develop their evaluative knowledge, thereby bringing them within the guild of people
who are able to determine quality using multiple criteria’ (Sadler, 1989, p.139). Further
study would be needed to assess whether this generally is the case for classes which are
given feedback in terms of the AFs. This was a small preliminary study which we hope
will lead to a clearer definition for a larger study in future. But the use of AFs in this
exercise appeared to create a positive experience for the pupils, to which the teacher
This preliminary study has yielded a number of insights into the usefulness of assess-
ment focuses in providing feedback to KS3 pupils on their reading performance. In their
present form, the assessment focuses are not wholly effective tools for communicating
skills and concepts to all year 9 pupils. Some words and phrases are unfamiliar, and
pose particular difficulties if pupils attempt to apply them (for example, to reading tasks
or questions). Pupils’ own suggestions of alternative ways to express the AFs could
prove very useful starting points for any work on re-framing the focuses.
In order to give pupils meaningful information about their performance, it is also
important to consider the process of feedback. This study suggests that activities and
contexts in which pupils interact with others (or with others’ ideas and written respons-
es) could be particularly successful. Sadler points out that students not only need to
know the criteria by which they are assessed but ‘an equally important need is for stu-
dents to know what the standards to be used are’ (Sadler, 1996). Thus, following a test,
pupils would benefit from the opportunity to read and assess a range of answers, along-
side their own, to determine how the standards have been applied in each different case.
Some interesting questions about the significance of assessment focuses in schools
have also been raised. Many teachers may be unfamiliar with the AFs, but existing
assessment and feedback processes draw on similar principles. Would pupils gain from
one particular framework for feedback, or are there a number of ways that the skills
within the assessment focuses could be structured and expressed?
To extend these initial findings, a number of areas could be investigated further.
Possible activities include:
• a larger study of the impact of using AFs to provide feedback on test performance
(this could look more closely at groups of pupils according to ability, sex and other
characteristics)
• a longitudinal study investigating the impact of using AFs across years, or across key
stages
• a larger study of teachers’ existing methods for providing feedback on reading per-
formance and their use of assessment focuses or the equivalent
• a re-framing of the existing assessment focuses to increase their accessibility for
pupils.
References
Assessment Reform Group (1999). Assessment for Learning: Beyond the Black Box.
Cambridge: Cambridge Institute of Education.
Assessment Reform Group (2002). Testing, Motivation and Learning. Cambridge:
Cambridge Institute of Education.
Biggs, J. (1998). 'Assessment and classroom learning: a role for summative assessment?',
Assessment in Education, 5, 1, 103–110.
Black, P. and Wiliam, D. (1998). ‘Assessment and classroom learning’, Assessment in
Education, 5, 1, 7–74.
Brookhart, S.M. (2001). 'Successful students' formative and summative uses of
assessment information', Assessment in Education, 8, 2, 153–169.
Cowie, B. and Bell, B. (1999). ‘A model of formative assessment in science education’,
Assessment in Education, 6, 101–116.
Matthews, S. (2004). 'On Track?', The Secondary English Magazine. London: NATE.
Qualifications and Curriculum Authority (2003). Key Stage 3 English: Teacher Pack 1
(QCA/03/988). London: QCA.
Sadler, D.R. (1989). 'Formative assessment and the design of instructional systems',
Instructional Science, 18, 2, 119–144.
Sadler, D.R. (1996). 'Criteria and standards in student assessment.' In: Different
Approaches: Theory and Practice in Higher Education. Proceedings of the HERDSA
Conference, Perth, Western Australia, 8–12 July.
[Part 4]
Theory into practice: national initiatives
14 Validity challenges in a high-stakes
context: National Curriculum tests in
England
Marian Sainsbury and Andrew Watts
Since 1991, England has had a formal and centralised national testing system whose
major purpose is accountability, in the public and political sense. Tests of reading have
a central place within this system. All children¹ are formally assessed at the ages of 7, 11
and 14, at the end of the phases of education known as key stages 1, 2 and 3 respective-
ly. Results are reported nationally, and are regarded in public discourse as evidence of
the success or failure of the education system, and of the education policies of the gov-
ernment of the day. Each year’s national results are the subject of news reports, media
comment and political debate.
The English National Curriculum testing system has a number of strengths,
but its high-stakes purpose also introduces tensions. Although there has been evolution
since 1991, the present-day system is the same, in its essential features, as the one intro-
duced then. At that time, both curriculum and assessment arrangements were introduced
together. The curriculum was defined by means of programmes of study, and associated
attainment targets laid out what was to be assessed. This alignment of curriculum and
assessment is taken for granted in England, but it should be noted that it is not
inevitable, and is in fact a major strength of the system. In the USA, by contrast, there
are numerous examples of states where a compulsory assessment has been introduced
independently of a pre-existing curriculum, and the task of alignment forms a major
research and administrative burden.
The nature of the construct of reading in the National Curriculum can therefore be
inferred from the attainment target, which sets out eight levels of attainment, each of
them defined by a description consisting of a short paragraph of prose. These eight lev-
els apply to all pupils from the ages of five to 14 years: children progress up the scale
from level to level. Most 7-year-olds are at level 2; most 11-year-olds at level 4 and
most 14-year-olds at level 5 or 6. As an example, level 4 is defined thus:
In responding to a range of texts, pupils show understanding of significant ideas,
themes, events and characters, beginning to use inference and deduction. They refer
to the text when explaining their views. They locate and use ideas and information.
The level descriptions are brief and for a fuller explanation of the construct of reading
they need to be read alongside the programmes of study. These are divided according to
the three key stages. The first category of requirements details the ‘knowledge, skills
and understanding’ to be taught. For example, for key stage 2, the knowledge, skills and
understanding section includes lists headed: reading strategies; understanding texts;
reading for information; literature; non-fiction and non-literary texts; language structure
and variation. The other main category within the programmes of study for reading is
entitled ‘breadth of study’. In this section is set out the range of literary and non-literary
texts that pupils should be taught to read. Thus the curriculum documents define in gen-
eral terms the range of reading material that children should encounter. In addition, at
key stage 3 only, there is a requirement to teach literature drawn from specific lists of
novelists and poets.
Since 1998, the teaching of reading and writing in England has been structured by
the National Literacy Strategy. This provides a much longer, more detailed and more
specific set of guidance. For each term of each year, text types and reading skills are
specified. The National Literacy Strategy is a well-resourced government initiative,
with a wide range of training and support materials, and sets out one way in which the
National Curriculum for reading and writing can be taught. However, these are guide-
lines rather than requirements: it is the programmes of study, much less detailed
documents, which constitute the legal requirement for schools, and the basis for test
development.
The brief description above should give an indication of the breadth and depth of the
construct of reading embodied in the National Curriculum. A very wide range of read-
ing material is required, both literary and non-literary. The skills and understandings go
far beyond the mere recognition of words to include appropriate strategies for informa-
tion handling and literary appreciation. The list of skills presented above as an
illustration can be seen to draw upon different perspectives on the nature of reading.
‘Reading strategies’ and ‘understanding of texts’ concern decoding and straightforward
comprehension and are related to a cognitive psychology perspective. ‘Reading for
information’ and ‘literature’ and the associated list of approaches to text describe pur-
poseful, responsive and analytic reading and can be traced to theories of literature and
information handling. The final element of the list, 'language structure and variation', is
an explicit requirement for a linguistic perspective in analysing what is read. All four of
the ‘layers’ of reading in the Sainsbury diagram (see page 17) can be discerned in the
programmes of study. As well as these skills and understandings, there is also the sug-
gestion of an attitudinal component. The general requirements include ‘interest and
pleasure’ and ‘enthusiastically’, in addition to the ability to read independently.
Although this list was drawn from the key stage 2 curriculum, the same is true even
for the youngest children: at key stage 1 there is a long list of reading strategies for
accurate decoding and understanding, but also the requirement to describe characters
and settings, and to respond imaginatively to literature. Children are taught to use refer-
ence skills in reading for information, and to begin to analyse texts linguistically,
distinguishing between the characteristics of fiction and non-fiction texts. For older stu-
dents at key stage 3, there is more emphasis on analytic reading, described here as
‘understanding the author’s craft’.
A further reference document in defining what is assessed in the key stage reading
tests is a list of assessment focuses (AFs) that form part of the test specification.
• Assessment focus 1: use a range of strategies, including accurate decoding of text, to
read for meaning.
• Assessment focus 2: understand, describe, select or retrieve information, events or
ideas from texts and use quotation and reference to text.
• Assessment focus 3: deduce, infer or interpret information, events or ideas from
texts.
• Assessment focus 4: identify and comment on the structure and organisation of texts,
including grammatical and presentational features at text level.
• Assessment focus 5: explain and comment on writers’ use of language, including
grammatical and literary features at word and sentence level.
• Assessment focus 6: identify and comment on writers’ purposes and viewpoints and
the effect of the text on the reader.
• Assessment focus 7: relate texts to their social, cultural and historical contexts and
literary traditions.
These apply across all three key stages, although the balance between them varies con-
siderably to reflect the age of the children. Most of them appear in most National
Curriculum reading tests, but there is no absolute requirement to cover all of them in
any one test; the exact coverage and balance reflects the nature of the texts. This list,
too, demonstrates the complex interplay between different theories and definitions of
reading in the National Curriculum.
The assessment of this broad and rich construct of reading in a high-stakes context
gives rise to significant challenges. Since teachers and schools are held accountable for
their test results, they could be motivated to teach only what is tested. Thus there is a
need to avoid construct under-representation as this could threaten validity, both in
terms of the inferences to be drawn from the results, and of any consequential narrow-
ing of the curriculum. The test developer has the dilemma of including a wide enough
range of literature in each year’s test to reflect the curriculum ‘breadth of study’ require-
ments adequately. At the same time, though, there is a curricular requirement for
responsive reading and analysis which rules out the use of several short passages to
cover the range. The texts need to be lengthy and weighty enough for pupils to engage
meaningfully with the content.
At the seminars that formed the basis for this book, the England national reading
tests for 2003 were examined both from an external perspective, by participants from
the USA, and in relation to the underlying theories that were being discussed. These
same tests will be described and discussed in the later parts of this chapter, to illustrate
in detail how the challenges are met in practice. A number of general principles can be
distinguished at all three key stages.
First, it is accepted that any one year’s test cannot cover the whole range of texts and
skills required by the curriculum. Instead, each year’s test includes a more limited range
of reading skills and of text types. But over the years, the tests gradually broaden the
range that has been included. To address the requirement for engaged, responsive and
analytic reading, there is a principle that full-length texts or extended extracts should be
used, and that these should be well written and should interest or entertain the students as readers.
Further, since students’ responses to texts will draw on their own experiences and
understandings, and will therefore be varied in their content and expression, most ques-
tions in any test are open, allowing the student to give his or her own view, explanation
or opinion in response to the text.
Key stage 1
Children at key stage 1 range from non-readers to those who are well advanced in read-
ing and understanding simple texts. The assessments to cover this range of ability are the
most complex in structure of any National Curriculum reading tests, with four separate
instruments addressing levels 1–3 of the attainment target. Teachers decide which of
these instruments are most appropriate for each individual child, based on their ongoing
assessments.
Beginner readers have either attained or are working towards level 1, and their
assessment consists of an individual read-aloud task in which their early understandings
of reading are demonstrated with support from the teacher. An overall judgement is
made of their ability to recognise words and letters, to use some decoding strategies and
to talk in simple terms about the content of their reading. A similar reading task applies
at level 2, where the emphasis is upon the teacher’s observations of the child’s inde-
pendence, accuracy, understanding and ability to apply strategies in reading unknown
words. This task is described in more detail in Whetton (see chapter 8). For both level 1
and level 2, the children read from real books, chosen from a list of books of comparable
difficulty which represent lively and engaging literature. Thus the vocabulary encountered by
the young reader is age-appropriate, but is not strictly controlled. These tasks combine
an assessment of decoding, comprehending, responding and some simple analysis in a
single informal interview in which the overall assessment outcome is decided by the
teacher on the basis of observation.
Children who attain level 2 may also take a written test, the form of which will be
illustrated by the 2003 test, Sunflowers, which is a typical example. Texts and questions
are presented in a full-colour booklet. There are three texts: a story, Billy’s Sunflower by
Nicola Moon, in which a little boy learns that flowers die in autumn but can grow again
from seeds; information about the artist Van Gogh; and instructions for making a paper
sunflower. The questions are presented on the lower half of each page and refer only to
the text on that page (see Figure 14.1).
Figure 14.1 Example reading test, Sunflowers (reproduced with permission from QCA)
There are 28 questions in this test, around half of them four-option multiple-choice and
the rest requiring simple constructed responses. Two of them are two-part questions for
which a partial credit is possible, so the total number of marks available is 30. In terms
of the assessment focuses listed above, retrieval of information (AF2) accounts for over
half of the marks. A typical question of this type is:
Which part of the sunflower had bent over at the top?
A further quarter of the marks are awarded for questions requiring simple inference
(AF3), for example:
How did Mum help Billy?
Write 2 things.
In this case, the text recounts how Billy’s Mum dried his tears and then presents in dia-
logue a conversation in which she explains about plants dying in winter. Typically
children find it easy to identify the first of these ways of helping Billy, but some inference
is necessary in order to understand that the conversation was also a way of offering help.
One two-part question explicitly asks for a personal response to what has been
read (AF6):
The story is both sad and happy. Explain why.
It is sad because …
It is happy because …
This requires an overview of the whole story, and some empathy with the character of
Billy, bringing together the ideas that plants die but that they grow again.
Three of the marks are available for answers that demonstrate the beginnings of an
ability to analyse presentational features of text (AF4) and the author’s choice of lan-
guage (AF5). Two questions ask about the presentation of the instructions: the list of
what is needed; and the reason for numbering each instruction. The language question
asks what the words brown, sad and drooping express about the sunflower.
In this test, unlike the level 2 task, the ability to decode text is not assessed directly but
by means of comprehension questions. Even at this very early level the insistence on a
highly complex construct of reading in the England National Curriculum assessments is
fully apparent.
The final instrument involved in testing reading at key stage 1 is a written test for
level 3, for those children who are above average for their age and already reading quite
fluently. Entitled Memories, the 2003 test includes two texts, presented as separate full-
colour documents. The first is a story, Grandfather’s Pencil by Michael Foreman. The
second, contrasting, text is a leaflet, invented but presented authentically, about a children’s
programme at a local history museum (see Figure 14.2).
For level 3, the questions are presented in a separate booklet so that the children can
more easily skim and scan the texts to find their answers. Question formats are more
varied than at level 2, and only eight of the 26 marks are for multiple-choice answers. In
this test, the number of retrieval (AF2) questions is almost the same as those requiring
inference and deduction (AF3). One of the latter type can be used to demonstrate the
variety of responses that can be judged acceptable.
Figure 14.2 Example written test for level 3 (reproduced with permission from QCA)
The story tells how the various objects in a boy’s bedroom – pencil, paper, door, floor-
boards – have made journeys from their original forest homes. Intertwined with this is
the story of the boy himself and his travels around the world as a sailor when he grows
up. This question requires children to show some understanding of this theme of journey-
ing and how it is worked out in the book. The acceptable answers are varied. Children
could show this understanding by making a generalisation about the objects in the story:
The things had all made journeys from the forest to the boy’s room.
The paper made a journey back to the forest.
This is an example of a question where children can be seen responding in different ways
to what they have read, using their own experience to select evidence from the text to
demonstrate their understanding of the theme. For questions like this, only a skeleton
mark scheme is devised before the question is trialled, and children’s actual responses
have a strong influence in formulating the limits of acceptable and unacceptable answers.
These tests and tasks for the youngest children already show the hallmarks of the
England National Curriculum reading tests. Although most of the questions require
children to show a basic understanding of what they have read using retrieval and simple
inference, there is nevertheless a role for responding to and analysing texts.
Key stage 2
At key stage 2, the national tests cover levels 3–5 and thus apply to all children who
have mastered the basics of reading – around 93 per cent of the population. Children
below these levels are not included in testing but are assessed by their teachers.
The 2003 reading test is a lively, magazine-style reading booklet entitled To the Rescue,
together with a separate question booklet. This booklet deals with the theme of heroes and
superheroes. It includes several texts of types that have appeared in previous tests: a
narrative extract from Katherine Paterson's Lyddie; information listing the characteristics of
superheroes; and information about special effects in films. Alongside these, there is a text
type that has not previously appeared in a key stage 2 test, a short cartoon strip presenting
a humorous subversion of the superhero genre (see Figure 14.3). This is an example of the
principle of broadening the range of text types year on year whenever possible.
The test has 30 questions yielding one, two or three marks each, and the total number
of marks available is 50. Most of the questions are open response, but a number of
Figure 14.3 Example cartoon strip used in reading test (reproduced with permission from QCA)
To gain full credit for this item, pupils have to bring together the information, stated in dif-
ferent parts of the double page, that computers are helpful for some special effects such as
‘morphing’ and some sound effects, but in cases such as flying and other sound effects
they are unhelpful. Partial credit is available for explaining just one of these points.
Other AF3 questions require an extended response that takes an overview of the text
and explains a view with some subtlety or complexity:
In what ways did Lyddie show herself to be a good leader?
Explain fully, using the text to help you.
For full credit of three marks, children need to give an answer that takes into account the
calm, bravery, intelligence and practicality of the character:
Even though Lyddie was just a child she did not panic and she knew what to do to control
the bear, she got the rest of the family out of harm’s way and only thought of saving her-
self when she knew the others were saved.
Answers that made some relevant points, without the completeness of this example, were
awarded partial credit. Here again, the different ways in which pupils interpret their read-
ing and apply to it their ideas about leadership are highly personal, and the mark scheme
could only be devised in the light of analysing many answers from children in the trials.
Questions requiring some appreciation of authorial techniques and choices, AFs 4–6, are
more frequent at key stage 2 than at key stage 1, though they still account for only around a
quarter of the marks. Some address organisational and presentational features (AF4):
Page 11 is clear because it is divided into questions and answers.
How does this layout help the reader?
Unusually, the use of the very specific superhero genre as the focus of the reading book-
let made it possible to ask some questions about this literary tradition (AF7). Such
questions appear only rarely in tests for this key stage.
Souperkid flies and wears a special costume.
In what other way is Souperkid like a superhero?
In what way is Souperkid not like the superheroes described on page 9?
In one of the final questions of the test, the pupils are asked to bring together their
understanding of all the texts they have read, and to give their own opinion, backed up
by textual evidence.
‘I think Lyddie is a real hero but not a superhero.’
Do you agree with this opinion?
Explain your own opinion fully, using the texts to help you.
In this question can be seen several of the typical characteristics of National Curriculum
reading tests as their authors attempt to blend an ambitious construct of reading with the
constraints of a high-stakes national system. Children can have any opinion on this
question; it is not necessary to agree with the quotation, and some children in the trials
showed themselves distinctly unimpressed by Lyddie’s achievement when set against
superhero exploits. The mark scheme allows a variety of views on this, and in this way
the test can genuinely be said to recognise the potential for personal response, and even,
to some extent, for different readings in the postmodern sense. But ultimately, this is a
test where marks have to be awarded consistently because its major purpose is to award
a nationally recognised level for reading. The variety of pupil opinions which are
allowed in response to this question must be backed up clearly by reference to textual
evidence in order to be creditworthy. The national tests are full of examples such as this
one, where many real pupil responses are carefully evaluated in order to devise a mark
scheme that respects individual response and opinion whilst at the same time identifying
the specific elements that will lead to consistent marking.
Key stage 3
The tests at the end of key stage 3 are taken by pupils at the end of the third year of sec-
ondary school, when they are on average 14 years old. About 90 per cent of the pupils
in a cohort take the tests, for which the average result is that pupils attain level 5 in Eng-
lish overall. Pupils whose teachers judge their performance to be below level 4 are
encouraged not to take the set tests, but to complete some classroom-based tasks
instead.
The construct of reading which is evident at this key stage shows the same character-
istics as have already been described. The test is based on the same Levels of
Attainment statements that are used to underpin the reading curriculum and therefore,
where levels of achievement overlap both key stages 2 and 3, at levels 4 and 5 for exam-
ple, the criteria for judgement are the same. In addition, the Assessment Focuses which
underlie the reading test are the same at each key stage, which again reflects a consistent
view of the construct of reading which is being assessed.
Though this is true, there is one significant addition to the tests at this stage: a paper
which tests the pupils’ study of a Shakespeare play. Thus at key stage 3 there are three
papers, with the Reading paper scoring 32 marks and the reading section of the Shake-
speare paper scoring 18. (The third paper is the Writing paper.) The inclusion of the
Shakespeare study reflects the cultural importance attributed to the study of literature,
and in particular to the study of Shakespeare. The proportion of reading marks allocat-
ed to Shakespeare study is high, but an attempt in 1992 by the Qualifications and
Curriculum Authority (QCA) to reduce it by a few marks was rejected by the govern-
ment after a press campaign about the ‘dumbing down’ of the English curriculum.
The Shakespeare paper thus adds the study of a work of literature to the construct of
reading already described. This study usually takes place in the six months before the
tests, with the guidance of the teacher. The pupils are supposed to read the whole play,
but they are tested on two short portions of it, selected because they are of dramatic sig-
nificance to the plot and also are of a manageable length for a test. The pupils can thus
get to know those scenes in detail and, in the test, they must answer one essay-length
question, which asks them to refer to both scenes in their answer.
The mark schemes for the tests claim that the Shakespeare paper ‘tests the same set
of skills as are assessed on the unseen texts in the Reading paper’. However, the assess-
ment focuses for reading do not underlie the assessment for this paper because 'The
emphasis is on pupils' ability to orchestrate those skills and demonstrate their under-
standing of and response to the Shakespeare text they have studied…'. Instead of the
assessment focuses, the tasks set in the paper are based on one of the following areas
related to the study of a Shakespeare play:
• character and motivation
• ideas, themes and issues
• the language of the text
• the text in performance.
(All references above taken from the KS3 English test mark scheme, QCA, 2004,
p.21).
The criteria in the mark scheme state that pupils can gain credit by demonstrating
understanding of the set scenes and showing the ability to interpret them in the light of
the specific task they have been set. In addition, pupils must show the ability to com-
ment on the language Shakespeare uses to create his dramatic effects. Finally, they must
demonstrate that they can use quotations and references to the scenes to back up the
points they are making.
The following example is taken from the mark scheme for a task on 'Twelfth Night' in the
2004 test. The criteria quoted below were used to mark responses to this task:
Explain whether you think Malvolio deserves sympathy in these extracts, and why.
The marking criteria for the second highest level of performance, gaining a mark of
between 13 and 15 out of 18, included the requirements to demonstrate:
Clear focus on whether Malvolio deserves sympathy at different points and why … Clear
understanding of the effects of some features of language which Malvolio and other char-
acters use … Well-chosen references to the text [to] justify comments as part of an
overall argument.
(Mark scheme for 2004, p.42).
In terms of the construct described at the beginning of this chapter, therefore, the addi-
tion of the Shakespeare paper at key stage 3 represents a significant increase in
emphasis on the reading of literary material, and on the skills of responding to literature.
It is also argued that the inclusion of the Shakespeare play in the tests can be justified as
a way of encouraging involvement in drama. In 2004 the task set on Henry V was one
in which the pupils had to discuss the play in performance:
What advice would you give to help the actor playing Henry to convey his different moods
before and after battle?
(Mark scheme for 2004, p.23)
This takes the assessment in the Shakespeare paper well beyond merely the ability to
read, since the study of drama requires in addition imaginative and practical skills
which have little to do with reading.
If we remember that the pupils know they will have to write on the two set scenes
and that teachers know the task could be one about performing the scenes, we can imag-
ine that pupils who have taken part in a classroom performance of the scenes would not
find the drama task above too onerous. However, the skills they would be demonstrating
would be a considerable distance from the reading skills we have been discussing so far.
In the national tests in England, the inclusion of the Shakespeare paper
puts pressure on the assessment of reading, which, for reasons of testing time
and style of questioning, has only 50 marks allocated to it. Since at least some of these
marks are awarded for skills other than those of reading, a great deal is being asked of
the remaining marks if they are expected to lead to reliable conclusions about an individual
pupil’s attainment in reading.
The format of the Reading paper at key stage 3 will be familiar from the descriptions
already given of the papers at key stage 2, and even at key stage 1. Three passages are
presented in an attractively produced, magazine-style booklet in colour. In 2004 the
booklet was entitled Save It and it focused on protecting the environment. The selection
of different types of text aimed to reflect the National Curriculum’s requirement for a
range of reading. The passages were:
• a newspaper article about waste collection and disposal
• a survey of the effects of tourism on a beauty spot
• pages from the web-site of the Eden Project, a visitor centre about plants in the
environment.
It is interesting to note that in this year there was no literary text on the Reading paper,
though this is not the case in every year. In this way the curriculum’s requirement that a
range of genres and styles should be covered is emphasised in the KS3 tests.
The same AFs, which have already been described, again underpin the construction
of the test questions. However, the differentiation between levels in the use of AFs,
which has been noted at KS1 and KS2 above, is also seen here. In comparison to the
KS2 paper, the proportions of marks awarded for questions based on AF2 (understanding and retrieving
information) and AF3 (deducing and inferring) were only 3 and 16 per cent respectively. Most of the
marks were awarded for commenting on the structure and organisation of texts, 22 per
cent; on the writers’ use of language, 31 per cent; and on the writers’ purposes and view-
points, 28 per cent. The construct of reading tested here, therefore, goes well beyond
simple decoding and comprehension. Over 80 per cent of the marks are awarded for
pupils’ ability to comment on the writers’ craft in constructing their texts.
A variety of question types is used in the key stage 3 tests, but generally they are
open-ended and require more than objective marking. Of the 13 questions in the 2004
test, only three, with five marks in total, could be marked objectively. The other question
types required the giving of reasons, comment on phrases, extended response
(for five marks) and explanation of the writers' style. Two of the
questions gave some support to the structuring of answers by providing a table which
had to be completed. Even if a first sub-question was a fairly straightforward demand
for a word or phrase, the next one required an explanation or comment.
We have noted above that the key stage 3 Reading paper puts a greater emphasis on
Assessment Focuses 4, 5 and 6. In 2004, for example, AF4 – comment on text struc-
ture – was tested by a question which asked pupils to identify whether the paragraphs
in the text about waste disposal were describing personal experiences or giving facts
and statistics. The following sub-question asked for an explanation of why the writer
had organised the paragraphs in the piece with this combination of experience and
statistics.
The questions based on AF5 – comment on the writers’ use of language – range from
short, direct questions about interpreting meaning, to open questions requiring more
extended responses for a mark of 5. Question 6 about the effect of tourism on the beauty
spot asked:
What does the phrase moving relentlessly suggest about the people?
The final, 5-mark question on the paper focused on the way the Eden web-site tried to
win support for the project:
How is language used to convey a positive image of the Eden Project?
It is worth noting here that as the longer, essay-style questions come at the end of a sec-
tion, after the pupils have focused more closely on the issues and the language in the
text, they can use their experience of answering the earlier questions to create their
longer, more discursive answers.
A similar variety is used to assess the pupils’ ability to comment on the writers’ pur-
poses and viewpoints (AF6). One of the questions on the Eden Project web-site asks:
How does paragraph 2 make the reader think the Eden Project is exciting but also has a
serious purpose?
Choose two different words or phrases and explain how they create this effect on the
reader.
The above question is one of those which gives the pupils some support in answering by
providing an answer frame. In comparison, the other 5-mark question on the paper, also
addressing AF6, asked:
How does the article try to make the reader feel some responsibility for the problem of
waste disposal?
The key stage 3 tests described above have, like those at key stages 1 and 2, attempted
to assess a broad definition of what reading is. The AFs lead well beyond mere compre-
hension, to questions requiring commentary, some of it extended, on the writers’
techniques. The curriculum requirement for breadth of reading has led to three texts of
different genres being presented in each test, and at key stage 3 a classic literary text is
also set for prior study. In addition, the tests are valuing personal responses to reading
by giving credit for pupils’ opinions and possibly idiosyncratic interpretations. The
introduction to the mark scheme (p.5) states that though the mark scheme is designed to
be comprehensive
… there will be times when markers need to use their professional judgement as to
whether a particular response matches one of the specified answers to the marking
key. In such cases, markers will check whether what a pupil has written:
• answers the question
• meets the assessment focus for the question
• is relevant in the context of the text it is referring to.
Such openness to what the pupils want to say is laudable, but it can be imagined that
there are disagreements about the markers' interpretation of their brief, and schools still
appeal more about their key stage 3 English marking than about any other subject or key
stage. It is also true that the model for the English test at key stage 3 is pushing the issue
of manageability of the test to its limits. The marking task is complex, and in 2004 an
attempt to alleviate this by having different markers for reading and for writing led to
an administrative breakdown: significant numbers of results were sent to schools
either incorrect or late.
We come back to the challenge of assessing a broad and rich construct of reading in
a high-stakes context. In reflecting on this we must ask what the implications are for a
time-limited test which aims to assess a construct as broad as reading. Clearly, valid
reading tests can be successfully constructed: the advantages of standardisation will
give them value, even if some of the breadth of the reading curriculum cannot be captured.
Perhaps, though, the national reading tests at key stage 3 in England are trying to do too
much, by attempting to assess the study of a literary text, commentary on writers’ lan-
guage, and the pupils’ personal responses, as well as their ability to comprehend and
explain what they have understood in their reading. We must ask whether some parts of
what is now assessed by the key stage 3 national reading test should not be assessed in
some other way.
Note
1 Children in private schools are not required to be assessed by the national tests, although quite large num-
bers of these schools choose to use them. Some children with special educational needs may be disapplied
from aspects of the National Curriculum and its assessment, but this is infrequent.
15 New perspectives on accountability:
statutory assessment of reading of
English in Wales
Roger Palmer and David Watcyn Jones
In Wales, the national assessment arrangements are in the process of change. A funda-
mental shift is taking place from formal testing to much greater reliance on teachers’
own assessments. This chapter will describe the background to the changes and illus-
trate the approach to assessing reading in the tests and in supporting teachers’
development as assessors.
Background
Pupils in Wales are currently statutorily assessed at the end of key stages 1, 2 and 3 – at
approximately the ages of 7, 11 and 14. The assessment outcomes are reported to parents
and a summary of the information is published by the Welsh Assembly Government.
Until recently, attainment has been assessed by both teacher assessment and the out-
comes of statutory tasks and tests taken by all pupils in Wales near the end of the final
year of the key stage. The tasks and tests provide a standard ‘snapshot’ of attainment at
the end of the key stage, while statutory teacher assessment covers the full range and
scope of the programmes of study.
It is the balance between the tasks and tests on the one hand and teacher assessment
on the other that is changing. Since 2002, at the end of key stage 1, pupils’ attainment
has been assessed only by teacher assessment – teachers’ professional judgement about
pupils’ attainment based on the programmes of study they have followed – with no
required tasks or tests. From 2005, the same will apply to key stage 2, and from 2006 to
key stage 3.
These changes are the result of policy decisions taken by the Welsh Assembly Gov-
ernment. In 2001, it published The Learning Country (National Assembly for Wales,
2001), a long term strategic statement setting out a vision for education and training in
Wales over ten years. This document initiated a process of change and also led to the
appointment of Professor Richard Daugherty to lead a group that reviewed assessment
arrangements in Wales for key stages 2 and 3, which reported early in 2004. The conclu-
sions of this group (Daugherty Assessment Review Group, 2004) proposed a staged
process of change over the years 2004–08, which has now been accepted as policy by the
Welsh Assembly Government. Testing is largely being replaced by teachers’ own judge-
ments, which are to be supported to ensure that they become more robust and consistent.
At the time of writing this chapter, the assessment of reading involves tests alongside
teacher assessment at key stages 2 and 3, but teacher assessment alone at key stage 1.
The detailed discussion of the assessment of reading that follows will consider the fea-
tures of both, in relation to the construct of reading.
English statutory tasks/tests have been Wales-only materials since 1998 for key stage 3
and since 2000 for key stage 2. Prior to this, assessment materials were developed joint-
ly for Wales and England. The tests have a rigorous development process, including
review by expert committees, informal trialling, and two large-scale formal pre-tests.
The assessments are required to meet a number of quality criteria. They should be moti-
vating and challenging for pupils, provide a valid assessment of the related programmes
of study and be manageable for teachers to administer. They must reflect the diversity of
pupils for whom they are intended, including those with special educational needs.
They should take account of the Welsh Assembly Government’s cross-cutting themes of
sustainable development, social inclusion and equal opportunities and be relevant to
pupils in Wales, taking account of the Curriculum Cymreig.
The construct of reading embodied in the tasks and tests is derived from the pro-
grammes of study for Reading as specified in English in the National Curriculum in
Wales (National Assembly for Wales and ACCAC, 2000). For each key stage, the pro-
gramme of study describes in some detail the range of texts to be read, the skills to be
taught and the aspects of language development to be included.
Overview
The assessment of reading at key stage 2 takes account of pupils’ developing ability to
read with understanding and to respond to literature of increasing complexity. Real texts
are used with questions ranging from literal comprehension to those which are more
thought provoking and for which more than one answer may be acceptable. The four
main processes of reading identified by Sainsbury (this volume), namely decoding,
comprehending, responding and analysing, are to be found in the tests, with a greater
emphasis on the first two at this key stage.
There is a Reading task for level 2 which is based on a range of texts which are man-
ageable by a level 2 reader and which are read aloud to the teacher. The range reflects
From 2005 onwards, teacher assessment will form the sole statutory end of the key
stage assessment. National Curriculum tests will be optional in 2005 and 2006, but will
not be produced beyond that date.
Overview
The key stage 3 reading test is designed to assess pupils aged 14 working within the
range of levels 3–7 inclusive, and, taken together with the writing test result, to give a
single level for each pupil at the end of the key stage. As the test targets the expected
range of levels for the key stage (3–7), no task is produced at key stage 3.
The reading test assesses aspects of the programme of study including: location of information; inference and deduction; identification of key themes and ideas; expression of responses supported by close reference to the texts; and comment on linguistic features. The processes of reading described earlier in this book, namely decoding, comprehending, responding and analysing, are evident in the test.
The test of reading is based upon a small number of stimulus texts which are accessible to level 3 readers yet present a challenge to level 7 readers. The texts
comprise a selection from the categories of text set out in the key stage 3 programme of
study. To date, the tests have utilised two passages of broadly comparable length, usually
one narrative and one non-narrative (often a media or information text).
Questions on each passage comprise a combination of two relatively low tariff (two
and four marks) questions that generally assess information retrieval, inference and
deduction, and a higher tariff (eleven marks) question that assesses overall understanding/
comprehension of the text, response and appreciation of the writer’s use of language,
technique and structure. Pupils are given a bullet point framework to guide their reading
and preparation of written answers to this high-tariff question on each text.
High tariff questions are marked using criterion referenced performance descriptions
(relating to National Curriculum level descriptions), each of which is given a particular
mark allocation. The low tariff questions are marked according to individually tailored
mark schemes that clearly indicate elements of answers for which marks should be
awarded.
The accountability principles and the purposes of testing are the same as outlined
above for key stage 2. However, for key stage 3, from 2006 onwards, teacher assess-
ment will form the sole statutory end of the key stage assessment. National Curriculum
tests will be optional in 2006 and 2007, but will not be produced beyond that date.
In terms of the construct of reading, the first process, decoding, is largely taken for granted: pupils appropriately entered for the test will be able to decode effectively. Pupils working below level 3 are statutorily assessed through teacher assessment alone, and pupils who have difficulty with decoding are likely to be excluded from taking the test.
The second process, comprehending, features prominently in the test, both in terms of basic understanding of written words, through questions that invite retrieval of information, and in terms of a far more sophisticated awareness of the overall meaning of the text and of the writer's purpose and methods.
The opening questions on each passage in the key stage 3 reading test (ACCAC,
2004b) ask pupils to ‘write down’ or ‘list’ two points or details from the specified short
section near the opening of the passage. Each distinct, appropriate point carries 1 mark.
The second question on each passage carries 4 marks and usually has, as its assess-
ment objectives, location of information and simple inference and deduction. An
element of responding is also apparent in this style of questioning, although that is more
fully developed in the high tariff questions described later.
An example of a 4 mark question is given below. The text is an article by the Welsh
athlete Tanni Grey-Thompson, entitled ‘Treat me as a person first, then ask about my
disability’. In the question, the term ‘list’ is again used as described above, while
‘explain briefly’ invites and encourages pupils to respond and infer, thus moving
beyond mere location. One mark is available for each appropriate location and one for
each linked explanation/inference. The section of the text to which this question refers
is reproduced below.
Look again at lines 19–28.
(a) List two things from these lines that shop assistants have said or done which have
annoyed Tanni while she was out shopping.
(b) In each case, explain briefly why Tanni was annoyed.
Many people who are disabled experience daily difficulties and discrimination because of
the way we are seen. A shop assistant once told me, loudly and slowly: ‘Put your change
carefully back in your purse, you might lose it.’ I have also been in a queue, credit card in
hand, waiting to pay for a skirt while the assistant looked everywhere but at me. After five
minutes she eventually asked if I would like some help or was waiting for someone. I
replied that I was waiting for a bus! With a blank stare, she asked if my carer was coming
back. I put the skirt on the counter, said that I no longer wished to purchase it, have never
shopped there again and will not be using any other store in the chain.
Marian Sainsbury (chapter 2) has described a holistic view of reading, the four ‘layers’
of decoding, comprehending, responding and analysing being addressed concurrently.
The high tariff reading questions (carrying 11 marks) are clear examples of this
approach to the assessment of reading, envisaging ‘an active reader, bringing individual
world knowledge to build a personal understanding of the text’ (Sainsbury, this volume,
p.17).
The example from the 2004 reading test given below is representative of this
approach, which allows a variety of acceptable responses and interpretations and
affords opportunities for pupils to explore the writer’s technique; the structure of the
text; devices that create interest and tension; choice and use of language and overall
impact. The question is based on an edited version of a short story, Neilly and the Fir Tree by John O'Connor, offering opportunities to demonstrate comprehension of a 'complete' narrative and to assess responsive reading of a text written for a real purpose.
This question is about the whole story.
How does the writer create interest and suspense in this story about Neilly?
In your answer you should comment on how the writer:
• makes you feel involved with Neilly at the beginning of the story (lines 1 to 20)
• shows Neilly’s changing feelings in lines 21 to 33
• builds suspense in lines 34 to 51
• uses words and phrases to describe Neilly’s feelings, and what he can see, at the top
of the tree (lines 52 to 60)
• creates interest and suspense in the ending of the story (lines 61 to 82).
Refer to words and phrases from the whole story to support your ideas.
Pupils’ responses are marked against a set of performance criteria related directly to the
National Curriculum level descriptions for levels 3–7, with two marks available at each
level to permit some ‘fine tuning’ by markers. An award of the full 11 marks available is
reserved for responses regarded by markers as being of ‘high level 7’ standard. The
mark scheme provides full exemplification of responses at each level, with commentaries, and markers are fully prepared through training and the standardisation of scripts.
Key stage 1
Since 2002 in Wales, teacher assessment has been the only statutory requirement at the end
of key stage 1. Teachers use their ongoing records and knowledge of pupils’ achievements
to make an overall, best fit judgement of the levels attained in oracy, reading and writing.
Optional activities and resources were published in 2003 (ACCAC, 2003b) to supplement ongoing records and to aid consistency in making these judgements.
The activities are based around a story book, Harry and the Bucketful of Dinosaurs, and an information sheet, Dinosaurs. Assessments are differentiated so that they can be
used with pupils working at Levels 1, 2 and 3. There are a number of activities at word,
sentence and text level. The materials can be used at any time of year and can fit into the
teacher’s own way of working.
Each activity links with a section which provides a range of examples of pupils’ work
as they engaged in the activity, with commentaries on how teachers used this evidence
in making their assessments. These are to be used in conjunction with appropriate
exemplar materials and mark schemes already available in school, notably the Exempli-
fication of Standards booklets, distributed to schools in 1995, and writing exemplars in
previous years’ Task handbooks. Occasional references to the level descriptions in the
Curriculum Orders have been included in the commentaries. These have been provided
in order to help teachers make judgements of pupils’ work. They have been used to indi-
cate characteristics of performance and are not intended to be used to level individual
pieces of work. Detailed case studies of two children provide additional opportunities
for staffroom discussion, intra-school and inter-school moderation and possibly inter-LEA
moderation too.
Key stage 2
In 1999, ACCAC developed eight optional assessment units to support the teacher
assessment of pupils’ Speaking and Listening, Reading and Writing performance from
years 3–6 (ACCAC, 1999).
Amongst other principles, the units were designed to:
• provide schools with standard materials that can give evidence of pupils’ attainment
in English to contribute towards, and enhance the consistency of, teacher assessment
throughout the key stage
• provide pupils’ responses that may be included in school portfolios designed to
exemplify standards in English
• provide examples of assessment criteria linked to the programmes of study and level
descriptions for English.
All units provide teaching contexts within which aspects of pupils’ performance can be
assessed, but they are not end-of-unit tests. Assessing pupils' work against the performance criteria for a unit draws out aspects of performance characteristic of a level, but does not produce a summative level. As at key stage 1, exemplification and commentary
are provided for each unit to identify responses which are characteristic of performance
at the different levels.
Whilst the units support the integration of oracy, reading and writing, two units have a specific assessment focus on reading. The nature of the activities and the flexible timescale allow the processes of decoding, comprehending and responding to be integrated more readily, with the objective of developing pupils as enthusiastic, independent and reflective readers.
Unit 3, entitled Finding out about Elephants, focuses on identifying features of information texts, contrasting them with the narrative genre. Unit 6, Fantastic Mr Dahl, focuses on characteristics of biography, contrasting it with other genres.
Key stage 3
In 2000 ACCAC distributed Optional Assessment Materials for Key Stage 3 English
(ACCAC, 2000), a pack comprising eight units for teachers’ optional use in supporting
their assessment of pupils’ work throughout key stage 3.
Design and preparation of the units was guided by a set of principles, including the
following, which are of particular relevance to the assessment of reading.
• Units should integrate the teaching of oracy, reading and writing.
• Assessment opportunities should be built in to as many activities as possible, not just
one end-of-unit activity.
• Developing reading skills which lead pupils to appreciate techniques used by writ-
ers to achieve effects should be linked explicitly to encouraging pupils to use these
techniques in their own writing.
Examination of these principles and of the summary of the units should clearly indicate how different the approach to the practice and assessment of reading through the units is from assessment by means of end of key stage tests. One of the fundamental differences here is the explicit intention to integrate work on the three attainment targets:
Oracy, Reading and Writing. The tests require written answers, though assessment
objectives relate solely to reading, and oral responses do not feature in the English tests
(although they are utilised in the Welsh test/task arrangements).
In terms of the construct of reading, the units that focus in part on reading adopt
again the holistic integration of the processes of decoding, comprehending, responding
and analysing. For example, Unit 2, which focuses on pre-1914 poetry, outlines activities
that provide opportunities to assess the following skills:
• talking and writing about a range of reading, articulating informed personal opinions
• responding imaginatively and intellectually to the substance and style of what is
being read
• reflecting on writers’ presentation of ideas, the motivation and behaviour of characters,
the development of plot and the overall impact of a text
The unit also contains guidance on annotating a text and on preparing for, and giving, dramatised readings, thus providing assessment opportunities for teachers that cannot
be offered in a timed test situation, and covering areas of the programme of study that
are not amenable to discrete testing.
Such activities and opportunities, allied to others developed by teachers adapting the
units and adopting their principles, provide ‘evidence about reading … evinced through
observable performances of one kind or another’ (Sainsbury, this volume, p.16). It is a
series of such ‘performances’ and opportunities that provide teachers with a range of
evidence, both tangible and more ephemeral, upon which they can base their assessment
of pupils’ reading (and writing and oral skills).
To help secure teacher judgements, the units provide examples of pupils’ work, with
commentaries that indicate qualities characteristic of particular levels of attainment,
while avoiding any encouragement to level individual pieces of work, or evidence of
attainment.
With such different contexts and purposes from the tests, the activities and their relat-
ed assessment opportunities are far less ‘high stakes’ and accountability exists primarily
in terms of the users: the students and the teachers whose learning and pedagogy they
should support. As the activities and outcomes should be used to contribute evidence to
end of key stage teacher assessments, there is some element of the more traditional
accountability to society and government. However, as the materials are intended for
flexible use and as models for adaptation, they should not be regarded as having the
same statutory character as the tests. Appropriately, they do not carry the accompanying
precise arrangements for timing and administration.
The assessments derived from the activities are, by design, more diagnostic and
formative than test outcomes. A key element of the optional assessment materials for
key stage 3 is the inclusion of pupil self-assessment sheets that help students recognise
what they have achieved and what skills they are demonstrating, but also require them
to consider how they will move on and make progress.
Looking ahead
With teacher assessment becoming the sole means of end of key stage assessment in
Wales, from 2005 for key stage 2 and 2006 for key stage 3, materials of this nature that
support teacher assessment are likely to attract increasing recognition. Concomitantly,
their influence is likely to promote, in all key stages but in key stage 3 in particular, an understanding of reading as an interwoven set of processes, in turn integrated with, and mutually supported by, speech and writing.
References
Scotland has an honourable history of approaches that have respected the idea of con-
struct validity in its national assessment systems at 5–14, 14–16 (Standard Grade) and
16–18 (formerly just Higher, now National Qualifications covering a range of levels
overlapping with Standard Grade and including Higher). The historical order of events
was the development of both Standard Grade and a revised version of Higher in the late
1970s and 1980s, 5–14 in the late 1980s and early 1990s and National Qualifications in
the late 1990s. However, despite some marked differences in summative assessment,
the validity of all three systems rests on the principle that effective assessment samples
a well defined curriculum, a clearly delineated body of knowledge and skills which
pupils are to be taught and are expected to learn. The culture has, therefore, given status
to clear specification by professional working groups of what matters in learning and
teaching a curricular area/subject. Assessment arrangements have been developed with
the curriculum firmly in mind.
The designers of the reading curriculum at all three stages were imbued with a phi-
losophy which dates back to Scottish Central Committee on the Curriculum (SCCC)
‘Bulletins’ of national guidance from the 1960s (see, for example, Scottish Education
Department, 1968). These highlighted the value of reading, particularly literature, to
growth as a person. The result has been that we have national curricular guidelines for
5–14 (Scottish Office Education Department, 1991) and advice to schools on Standard
Grade and National Qualifications that present the study of literature as engaging the
reader in thought, stimulating mind and emotions and inevitably involving interaction
with and exploration of text. These guidelines and advice promote teaching of all kinds
of reading. Reading, in the Scottish curriculum, involves activating pupils’ experience
of ideas and their imagination in responding to text, as well as their knowledge of how
language works. Pursuit of meaning in working out how and why the writer has used
language in particular ways is central. At every level, teachers are strongly encouraged
to give pupils activities which promote understanding of and engagement in reading as
there is no alternative … to trusting teachers 223
project (Scottish Council for Research in Education, 1977), commissioned by the Head-
teachers’ Association of Scotland, had piloted ways of describing pupils achievements
in various secondary subjects and had strongly influenced the Dunning Committee’s
report (Scottish Education Department, 1977), which proposed the Standard Grade
assessment system. This report argued that priority should be given to the introduction
of diagnostic assessment procedures into teaching and of criterion-referenced measures
into the certification of pupils. It recognised that teachers’ professional judgement, suit-
ably moderated, would be an essential component of assessment for certification of
abilities that could not be covered in an examination.
Another aspect of the debate about criterion-referencing also guided Scottish think-
ing. Those developing English courses and assessment rejected the arguments of (mainly
American) advocates of domain definition that every learning objective should be speci-
fied in detail. They accepted, rather, the viewpoint expressed by Ebel (1972a, 1972b) that
teachers need to think hard about their teaching, not atomistically, but rather taking flex-
ible account of many kinds of relationship among things being learned and of learners’
interaction with what they are learning. This conception of teaching is very consonant
with a rich definition of reading and with the idea that teachers know most about pupils’
success in dealing with the tasks and achieving the learning aims they set them.
The traditional Scottish philosophy of reading has much in common with themes
emerging from more contemporary research. Harrison (2004), exploring the implica-
tions of the post-modern state of our lives for the assessment of reading, argues for
approaches to both teaching and assessment that reflect the desirable practice advocated
by Scottish curricular policy. These include a local focus for planning the reading expe-
rience of pupils, taking account of needs and interests and engaging them in authentic,
purposeful tasks; teacher-, peer- and self-assessment; emphasis on the reader’s response
in interaction with the text and what may be known or guessed of the author’s intentions
and a portfolio of classwork as the evidence for assessing pupils’ success and progress.
Harrison presents these approaches as essential if we are to avoid fundamental prob-
lems arising from high-stakes testing, which does not validly assess reading
development and may actually hamper it. He quotes the American Evaluation Associa-
tion (AEA) task force’s review of all the available evidence on the effects of high-stakes
testing, which highlights teacher and administrator deprofessionalisation amongst other
unfortunate outcomes.
Teacher professionalism as a critical factor in effective teaching of reading also
emerges clearly from the USA National Reading Panel’s review of research on teaching
comprehension (2000). The review summary argues that teachers need to ‘respond flex-
ibly and opportunistically to students’ needs for instructive feedback as they read’ (p.47)
and that teacher education should give more emphasis to ensuring that teachers have the
relevant understanding of the nature of reading and of the strategies students need to
have explained and demonstrated to them. Similarly, Hoffman and Pearson (2000)
argue that teacher education for flexible planning, feedback and decision-making is critical,
as opposed to mere training to perform a series of routine procedures.
there is no alternative … to trusting teachers 225
the teacher’s judgement on the basis of a much more limited range of tasks than
could be observed in classwork.
Later publications in Scotland continued to promote formative and diagnostic assess-
ment, called, respectively, assessment as part of teaching and learning and taking a closer look in the National Guidelines. In particular, a series of booklets offering diagnostic procedures was made freely available to schools. These procedures were not tests, but ways of using classroom interaction to identify pupils' strengths and needs as learners in mathematics, science, reading and writing (Scottish Council for Research in Education,
1995; Hayward and Spencer, 1998). Essentially, Taking a Closer Look at Reading provides teachers with exploratory questions and issues to investigate with pupils and identifies possible next steps to help pupils grow as learners. There was a constant
emphasis on the importance of the teacher’s own knowledge of the pupil and on the
pupil’s own perspective as sources of evidence to inform an analysis of activities and
thinking. The exploratory questions and next steps were proposed for four areas of the reading process, which again reinforced the notion of reading as rich and complex: Attitude and Motivation; Decoding: Recognising Words and Sentences; Pursuit of Meaning; and Awareness of the Author's Use of Language. Central to this approach is the idea that a
learner is a whole person, whose success and progress in reading depend as much on attitude, motivation and interaction with others as on analysis of the construct of reading as decoding, comprehending and responding.
The historical survey of developments in this section shows that there have been sig-
nificant efforts for a long time in Scotland to promote assessment which reflects as fully as possible a rich construct of the process of reading – so rich that it cannot be encom-
passed within the format of traditional reading tests. It therefore requires a considerable
contribution from the teacher.
detailed analysis of writers’ craft to support responses and evaluations (S3 and S4, ages
14–16); and mature, independent responses to and analyses of personal reading (S5–S6
pupils, ages 16–18). However, a number of aspects of reading were identified as requir-
ing improvement (HMIE, 1999): in S1–S4 reading for information and appreciation of
the writer’s craft and, additionally in S1 and S2, reading for meaning and for pleasure.
HMI had some significant comments to make about the nature of reading experiences
being offered:
In almost all departments pupils’ essential command of English would be significant-
ly enhanced if, from S1, there were more frequent close reading for analysis of
meaning, language awareness and appreciation of the writer’s craft.
(HMIE, 1999, p.15)
The more recent Improving Achievement in English Language in Primary and Second-
ary Schools shows a similar picture across both the primary and the secondary sectors:
The evidence about attainment in English Language presents a very mixed and quite
complex picture. … Overall attainment is … in need of significant improvement from
P5–S2.
(HMIE, 2003, p.8)
The results of the Sixth Survey of English Language (SEED, 2003), undertaken as part
of Scotland’s national monitoring system, the Assessment of Achievement Programme,
are consistent with the findings from HMIE. The AAP reported that the nationally
defined 5–14 levels for reading were reached by 63 per cent in P4 (age 8) but by only 41
per cent in P7 (age 11) and 43 per cent in S2 (age 13/14).
It is, however, somewhat surprising to find that, according to National Assessment
results reported by schools, there has been a steady annual improvement in levels of
attainment in recent years. For 2002–03, these figures indicate that 80.8 per cent of P4,
72.4 per cent of P7 and 60.6 per cent of S2 had been recorded as having achieved the
appropriate national level of attainment (SEED, 2004). This evidence sits uneasily with
the HMIE reports and the AAP survey. It does, however, resonate with findings report-
ed by Peter Tymms (2004) in relation to the standards data emerging from English
primary schools. He argues that national testing has failed to monitor standards over
time and cites a number of reasons for this, including the high stakes use of the test data
and its impact on classroom practices, the changing curriculum and assessment context,
and the ways in which sub-areas of the curriculum have changed differently, e.g. ‘Writ-
ing has improved much more than reading’ (Tymms, 2004, p.492).
So what is really happening in reading in Scottish schools? Commenting on assess-
ment in Scotland, HMIE reports on individual primary and secondary schools suggest
that helpful oral or written feedback to pupils on English work, including reading tasks,
is fairly common. The Standards and Quality report referred to above (HMIE, 1999)
identified only 20 per cent of the schools inspected where such feedback was not pro-
vided on most aspects of English. This report also noted that many secondary English
departments summarised classwork achievement in individual profiles, sometimes also
there is no alternative … to trusting teachers 229
identifying strengths, development needs and next steps and incorporating an element
of pupil self-evaluation. Such feedback and profiles were often based, as far as reading
is concerned, on both close-reading/interpretation tasks and extended responses to read-
ing or critical essays. In primary schools’ inspection reports, HMIE sometimes note
effective formative assessment of reading (and other areas of work) but they do also
sometimes consider that judgements about levels of attainment have been made on the
basis of too little evidence from classwork (HMIE, 2003, p.26). This comment refers to
situations where pupils have been allowed to take a National Assessment before they
have had the range of successful experience of various aspects of reading (or writing)
necessary to give the teacher evidence of full attainment of the level.
Evidence from teachers in other research projects (Hayward and Hedge, 2005; Hay-
ward et al., 2004) indicates strongly that, despite policy advice to the contrary (Scottish
Office Education Department (SOED), 1991b), decisions about levels of attainment
have been dominated by National Assessment results, rather than by professional
judgement. Despite the positive reactions to the principles of 5–14 Assessment, both
originally and when teachers were asked to reconsider them in a national consultation
instigated by the Scottish Executive (Hayward et al., 2000), it is clear that many teach-
ers believe that the data from National Assessments are regarded as more important than
the evidence from teachers’ professional judgement. There is also much anecdotal evi-
dence, for example, from teachers attending Masters’ courses in reading in universities,
that in many primary schools and for S1/S2 in some secondaries assessment of reading
means only ‘taking a National Assessment’. Many of these highly-motivated teachers
are strongly committed to the principles of 5–14 Assessment, but have not been enabled
to live the ideas in practice. They often report that they or their colleagues do not con-
duct formative assessment of reading in a systematic way. They often say they tend to
make decisions about giving pupils a National Assessment on the basis of how far they
have progressed through a reading scheme, rather than on evidence of progress across
all the strands of the reading programme. Another common view is that a National
Assessment at a particular level will be taken if the teacher believes that a pupil has a
fighting chance of being successful, rather than when (s)he is sure that all aspects of the
level of work specified in the curriculum guidelines have been satisfactorily mastered.
National Assessment data were collected nationally in the 1990s and up to 2004.
Though there were no published league tables, it appears that teachers and schools
believe these test results to be an extremely important source of evidence about the
quality of their school and of the teachers who work in it.
Recent studies of relations between primary and secondary schools (Hayward et al.,
2004; Besley, 2004; Hayward et al., 2002; Baron et al., 2003) indicate a lack of trust in
primary assessment by secondary teachers and an awareness of this distrust on the part of
primary staff. Secondary teachers regard primary assessment as unreliable: they say that
pupils often cannot demonstrate the skills and understandings implied by the National
Assessment result provided by the primary school. Primary teachers tend to regard this
view as a slight on their professionalism. One reason for this lack of understanding across
sectors may be that National Assessment data are being used for essentially different pur-
poses. Primary teachers, in adopting the fighting chance approach to having pupils under-
take a National Assessment, may be trying to ensure that as many pupils as possible
achieve as high a level as possible. Secondary teachers, on the other hand, may expect that
if a pupil is recorded as having attained a particular level then that means that s/he will be
able to demonstrate achievement across all the areas of the curriculum at that level.
If, as it appears, we are in a scenario where the only significant assessment activity is
taking a National Assessment, perhaps we should cast a critical eye over the tests them-
selves. To what extent do they assess what they purport to assess and reflect the
desirable reading experience promoted by national policy and guidance and good teach-
ing? We address this question in the final section of this chapter, where we consider
what is now needed in the assessment of reading in Scotland. But before that, it is
important to try to understand the problem a little more deeply. Why, despite the signifi-
cant efforts to encourage professional judgement in assessment, both formative and
summative, are we in the position where many headteachers and teachers rely mainly or
solely on the formal National Assessments, about which, when introduced in their original
National Test version, there had been much controversy?
philosophy and principles of the 5–14 Assessment Guidelines identified in the SEED
consultation (Hayward et al., 2000) was, therefore, not enough to ensure its translation
into practice; nor was the cascade approach to staff development adequate. There was
evidence that testing had become dominant because teachers were often less aware of
the actual policy and national guidelines than they were of the context of ‘performativ-
ity’ within which they perceived the policy to have emerged (Hayward and Hedge,
2005). Even though there have never been published league tables of Scottish primary
schools, the fact that National Assessments data were collected centrally had led people
to believe that these results were what really mattered.
The role of HMIE in this process is worthy of consideration, or, more accurately, the
role of perceptions of what HMI ‘require’ and will expect if they visit a school. HMI
have in fact strongly supported 5–14 assessment policy on formative assessment and
teachers’ professional judgement as a key factor in summative assessment. This is clear
from their comments and advice in both individual school inspection reports and such
national publications as the reports in the ‘Standards and Quality’ and the ‘Improv-
ing…’ series referred to above. Nevertheless, there is a widespread belief among
headteachers (in particular in primary schools) that HMI insist on ‘hard evidence’,
which means a test result.
This belief appears to have been reinforced by the ‘target-setting’ initiative (now no
longer operating at a national level). All education authorities co-operated in this initia-
tive devised by HMI in the 1990s as a significant part of a school’s quality assurance. It
was based on the principle that effective school improvement action requires clear evi-
dence about many important aspects of a school’s life, including attainment and the
learning and teaching activities that affect it. Target-setting involved the specification
for each school of an agreed measure of improvement in performance – e.g., an addi-
tional proportion of pupils attaining level D in P7 or a Credit award at Standard Grade.
The intention was that schools, with the support of their education authorities, would
achieve the target by improving learning and teaching. In many cases, schools did not
interpret this initiative in the way that HMIE intended, with a focus on improving learn-
ing and teaching. Sliwka and Spencer (2005) report that many headteachers and
classroom teachers hold the view that target-setting encouraged them to focus on sum-
mative assessment and that some seem to regard action to develop really effective
learning and teaching as separate from, or even inimical to, their need to improve
results. Certainly, since both education authorities and HMI were monitoring schools' progress towards their targets, the targets inevitably became 'high stakes' in the minds of school staff.
HMI may have inadvertently facilitated the predominance of testing over other
approaches to assessment in another way, too. Throughout the 1990s, while certainly
commenting on and advocating ‘assessment as part of teaching’ in every inspection,
HMI also criticised schools where 5–14 National Assessments were not used ‘to con-
firm teachers’ judgements’. They were obliged to do this because it was (and still
remains for at least the immediate future) government policy that the tests should be
used in this way. The frequent appearance of this criticism in inspection reports proba-
bly contributed strongly to the perception that the tests are very important.
To sum up, the problem in Scotland has not been lack of impact from reading
research on the national policies and guidance offered to schools. Rather, there have
been strong influences on practice from other directions that have led to a narrow, test-
dominated view of assessment at 5–14. The critical issue for the future is how to create
circumstances where research, policy and practice can grow together to improve chil-
dren’s experiences as readers in schools. Tackling this issue requires reflection on other
kinds of research, on the development of collaborative learning communities.
Current developments
There are developments under way in Scotland that seek to improve the situation, and the policy context is currently favourable. In 2003, the partnership agreement between the
Scottish Labour Party and the Scottish Liberal Democrats (2003), the parties presently
sharing power in the Scottish Parliament, included the following policy statement:
We will provide more time for learning by simplifying and reducing assessment, end-
ing the current national tests for 5–14 year olds. We will promote assessment methods
that support learning and teaching. We will measure improvements in overall attain-
ment through broad surveys rather than relying on national tests. (2003, p.6)
participating in communities of readers; and reading for enjoyment. They identified four
‘processes of comprehension’:
1. focus on and retrieve explicitly stated information
2. make straightforward inferences
3. interpret and integrate ideas and information, including generalising from specifics
and drawing on real life experience
4. examine and evaluate content, language and textual elements.
The analysis revealed that at no level did the National Assessments involve any task
requiring a constructed response showing personal reaction or evaluation. All tasks were multiple-choice, true/false/can't tell, or cloze completion (of sentences at the lower levels and of summary paragraphs at the higher ones). The proportion of marks allo-
cated to tasks where the answer could be retrieved explicitly from the text was 88 per
cent at level A, 79 per cent at B, 80 per cent at C, 64 per cent at D and 86.5 per cent at
E. In each case all or almost all of the other tasks involved simple inference. There were
very few tasks of PIRLS types 3 and 4. At level D there were two multiple-choice ques-
tions requiring recognition of a generalisation based on the text and there were two at
level E, again in multiple-choice format, requiring pupils to make a judgement about genre.
A ‘pass’ on the Assessment at any level, widely regarded as guaranteeing attainment of that
level in reading, is achieved with 67 per cent of the marks.
By contrast, analysis of the 2003 Standard Grade and NQ Higher examinations
showed that, in the close reading (Interpretation) tasks, much larger proportions of
marks were obtained in responding to questions of types 3 and 4 in the PIRLS frame-
work: 34 per cent at Standard Grade Foundation (considered to be no more demanding
in principle than 5–14 Level E), 64 per cent at both Standard Grade General and Credit
and 92 per cent at Higher. In all the Standard Grade and Higher tasks constructed
answers are required. In addition to the close reading tasks, pupils produce, at each level
of Standard Grade, three extended ‘critical essays’ on literature studied in class, in
which they are expected to demonstrate PIRLS types 3 and 4 processes. At Higher they
must achieve criteria matching types 2, 3 and 4 processes in a critical essay and a
detailed textual analysis in internal assessment and in two critical essays in the external
examination. They also have to pass a close reading test internally (similar in style to
the external one analysed).
Standard Grade and the new National Qualifications Higher are not unproblematic.
There is, for instance, a known tendency for schools to allow pupils in Standard Grade
courses to ‘polish’ pieces of work in their folio for an inordinate amount of time at the
expense of a wider range of reading. Because the new National Qualifications at High-
er and all other levels include a pass/fail internal summative assessment task (or tasks)
for each of three course units, there is a danger of heavily assessment-led teaching.
There is a strong case for the view that the present Higher arrangements give too much
emphasis to reading. Writing other than critical essays on literature is assessed only
internally, on a pass/fail basis, and not in the external examination, which determines the overall grade. Talking is not assessed at all. Nevertheless, it is clear that the message to
teachers about the nature of reading from these two assessment systems is that it is
indeed rich and complex and requires personal engagement, analysis and evaluation.
Indeed, the questions in the external close-reading test at Higher are categorised on the
paper as U (Understanding), A (Analysis) or E (Evaluation), so that pupils are oriented
to the range of types of response expected.
In contrast, the 5–14 National Assessments do not convey this message about the rich-
ness of reading. The question types used were selected to reduce marker unreliability, but
the result is that the tests have low validity as assessments of reading as it is defined in the
5–14 curriculum. A recent addition to the writing test at each level has extended the range
of reading skills covered by requiring pupils to continue a story or other type of text in the
same style/genre as that of the opening paragraphs provided in the test. Nevertheless, over-
all, the National Assessments provide a poor basis for informing parents, other teachers and
those monitoring the schools system about the range and quality of pupils’ reading abilities.
Perhaps even more significantly, the narrow range of reading skills they cover reinforces a
tendency in many schools to give pupils a narrow range of reading experiences, often with-
out expectation of analysis and personal evaluation of ideas and writer’s craft. There is in
principle a strong case for implementing the policy set out in the political partnership
agreement in respect of national testing, that is, to remove it. However, while Circular No.2
to education authorities (SEED, 2005), outlining current government policy on assessment
and reporting 3–14, does emphasise the idea of sharing the standard through moderation, it
also indicates that National Assessments continue as a means of confirming professional
judgement. So our reservations about their validity remain significant.
Education authorities are currently engaging all their schools in a planned initiative to promote 'assess-
ment for learning’, ‘assessment as learning’ and ‘assessment of learning’, using the
knowledge about successful practice gained from the schools involved in the first year’s
work. The aim is to spread widely the strategy employed in the first phase, which was to
enable researchers, policy-makers, teachers and pupils to work together in partnership,
recognising that each of these groups brings essential yet different understandings to the
development of effective assessment and effective learning and teaching. This work is
supported by an online resource available on the LTScotland website: a toolkit with key principles for different aspects of assessment and exemplars of individual schools' successful practice in the first year of the programme.
The evaluation of the programme conducted by the London Institute of Education (Hallam et al., 2004) shows the positive response from teachers involved in the formative assessment programme. They had access through conferences to the work
and enthusiasm of researchers such as Dylan Wiliam and Paul Black (1998) and to the
experience of teachers elsewhere who had successfully developed formative assessment
in their classrooms. They very actively and keenly developed their own approaches
through reflection on the research and on the needs of their pupils. They also reported
that a crucial aspect of their successful engagement with the programme was the regu-
lar opportunity for networking with other teachers also engaged in it, in their own
school and more widely at national or regional conferences. The evidence emerging
suggests that there are very real, exciting, positive changes in classrooms across Scot-
land. The spaces for reflection and working out one’s own assessment strategy provided
within the project enabled teachers to make assessment for learning come alive in their
own contexts, sometimes in very different ways. They not only found their involvement
professionally fulfilling but they reported positive changes in the quality of pupils’ work
and in their commitment to learning. Many teachers expressed relief at once again being
able to focus on learning and teaching. Some reported that they now understood words
they had been using for years, that they were developing a real understanding of what
assessment as part of learning actually means. Only a very small number of those
involved remained sceptical about the approach (Hayward, Priestly and Young, 2004).
Assessment is for Learning is neither a top-down nor a bottom-up model of change. It recognises that no single community, whether research, policy or practice, can work without the other groups and still have any real hope of positive change. It also recognises that old approaches to growing new assessment policy and practice will not do. Desforges
(2000) argues that if we are to offer better educational opportunities for learners then we
need new approaches that transcend traditional practices in research. Influences on the
programme come therefore not only from research on assessment as an educational
issue but also from research on the development of collaborative learning communities.
The aim is to have researchers, policy makers and practitioners work together, taking
full account of the many practical factors affecting teaching. The methodology is partic-
ipative and is consistent with Fullan's (1993, p.60) admonition that any successful innovation will involve '… expanding professional development to include learning
while doing and learning from doing’. The Assessment is for Learning programme has
been heavily influenced by the work of Senge and Scharmer (2001, p.247), who offer a
model of collaborative research, arguing that ‘… knowledge creation is an intensely
human, messy process of imagination, invention and learning from mistakes, embedded
in a web of human relationships’. This is a model where all involved are learning,
pupils, teachers, researchers and policy makers, as they attempt to travel where Fullan
(2003) suggests there are not yet roads. Participants in this project must, ‘Travel and
make the road’ (2003, p.106).
The model has regard to what matters for people. Too often we have seen innovation
as somehow separate from the people who are an integral part of the process of change.
Or we have required change in what teachers and schools must do with little considera-
tion of the implications for change in the policy or research communities. The
Assessment is for Learning programme offers a more inclusive approach to change that
has potential advantage for all the communities involved. Teacher professionalism is
enhanced and research, policy and practice begin to move into a closer alignment.
However, if professional judgement is to flourish then the tensions between assess-
ment to support learning and assessment for wider purposes of accountability remain to
be addressed. The new approach proposed to monitor national standards will not
involve the use of National Assessment data. It will use sample surveys only, in the
Scottish Survey of Achievement (SSA), an expanded version of the former Assessment
of Achievement Programme (AAP). This is an approach similar to that advocated by
Tymms (2004). It is hoped that this approach, combined with improved support for
teachers’ summative assessment based on professional judgement, should continue to
supply the information necessary for schools, education authorities and HMI to monitor
the overall attainment of individual schools, without setting one against others in direct
competition. Most importantly, it is hoped that it will make it possible to enhance pupils' experiences of and achievements in reading.
We are, therefore, embarking in Scotland on a highly participative general approach
to all aspects of assessment, with real emphasis on developing teachers’ professional
judgement about formative and summative assessment for 5–14. Given this emphasis,
the strategy will involve complex resourcing and management challenges. What are the
implications for the assessment of reading?
In the first phase, teachers involved in the Assessment is for Learning programme
chose to focus on a wide range of areas of the primary curriculum and secondary sub-
jects. Few, if any, selected reading as their central concern. The Building Bridges
(across the primary-secondary interface) initiative was launched, linked to Assessment is for Learning and with a similar methodology, to focus particularly on literacy. In keeping
with the participative approach, participants were invited to identify their own impor-
tant areas for investigation. About a quarter of the 32 schools initially involved are
focused on reading. There is, however, no guarantee that reading will be a focus of
attention as education authorities and schools take forward either the general Assess-
ment is for Learning or the Building Bridges initiatives in the coming years. There is no
certainty that reading will feature systematically in the piloting of moderation and
there is no alternative … to trusting teachers 237
References
Baron, S., Hall, S., Martin, M., McCreath, D., Roebuck, M., Schad, D. and Wilkinson,
J.E. (2003). The Learning Communities in Glasgow – Phase 2 Evaluation. Glasgow:
Faculty of Education, University of Glasgow.
Besley, T. (2004). Quality Audit of Pupil Experience in Primary – Secondary School
Transfer in the Eight Integrated Learning Communities of Falkirk Council Educa-
tion Services. Glasgow: Faculty of Education, University of Glasgow.
Black, P. and Wiliam, D. (1998). ‘Assessment and Classroom Learning’, Assessment in
Education, 5, 7–36.
Campbell, J., Kelly, D.L., Mullis, I.V., Martin, M.O. and Sainsbury, M. (2001). Frame-
work and Specifications for the PIRLS Assessment. 2nd edn. Boston: International
Study Centre.
Desforges, C. (2000). ‘Familiar challenges and new approaches: necessary advances in
theory and methods in research on teaching and learning.’ The Desmond Nuttall
Memorial Lecture, BERA Annual Conference, Cardiff.
Ebel, R.L. (1972a). Essentials of Educational Measurement. New Jersey: Prentice-Hall.
Ebel, R.L. (1972b). ‘Some limitations of criterion-referenced measurement.’ In: Bracht,
G. (Ed) Perspectives in Educational and Psychological Measurement. New Jersey:
Prentice-Hall.
Evidence for Policy and Practice Information Co-ordinating Centre (EPPI-Centre)
(2004). ‘A systematic review of the evidence of reliability and validity in assessment
by teachers used for summative purposes’ [online]. Available: http://eppi.ioe.ac.uk
[28 August, 2004].
Fullan, M. (2003). Change Forces with a Vengeance. London: Routledge Falmer.
Hallam, S., Kirkton, A., Pfeffers, J., Robertson, P. and Stobart, G. (2004). Report of the
Evaluation of Programme One of the Assessment Development Programme: Support
for Professional Practice in Formative Assessment. London: Institute of Education,
University of London.
Harrison, C. (2004). ‘Postmodern Principles for Responsive Reading Assessment’,
Journal of Research in Reading, 27, 2, 163–73.
Hayward, L., Spencer, E., Hedge, N., Arthur, L. and Hollander, R. (2004). Closing the
Gap: Primary-Secondary Liaison in South Lanarkshire. Research Report, Glasgow:
Faculty of Education, University of Glasgow.
Hayward, L., Priestly, M. and Young, M. (2004). ‘Ruffling the calm of the ocean floor:
merging policy, practice and research in Scotland', Oxford Review of Education, 30,
3, 397–415.
Hayward, L. and Hedge, N. (2005). ‘Travelling towards change in assessment: policy,
practice and research in education’, Assessment in Education, 12, 1, 55–75.
Hayward, L., Hedge, N. and Bunney, L. (2002). ‘Consultation on the review of assessment
within New National Qualifications’ [online]. Available: at http://www.scotland.gov.uk/
library5/cnnqa 00.asp [15 August, 2004].
there is no alternative … to trusting teachers 239
Hayward, L., Kane, J. and Cogan, N. (2000). Improving Assessment in Scotland. Research
Report to the Scottish Executive Education Department. University of Glasgow.
Hayward, L. and Spencer, E. (2004). ‘The construct of reading in 5–14 National Assess-
ments and in examinations in Scotland’ (draft journal paper).
Her Majesty’s Inspectorate of Education (HMIE) (1999). Standards and Quality in Sec-
ondary Schools 1994–97. Edinburgh: Scottish Executive Education Department.
Her Majesty’s Inspectorate of Education (HMIE) (2003). Improving Achievement in
English Language in Primary and Secondary Schools. Edinburgh: Scottish Execu-
tive Education Department.
Hoffman, J. and Pearson, P.D. (2000). ‘Reading teacher education in the next millenni-
um: what your grandmother’s teacher didn’t know that your granddaughter’s teacher
should’, Reading Research Quarterly, 35, 1, 28–44.
National Reading Panel (2000). ‘Chapter 4, Comprehension.’ In: Reports of the Sub-groups
[online]. Available: http://www.nationalreadingpanel.org/Publications/publications.htm
[15 August, 2004].
Pearson, P.D. and Fielding, L. (1994). Balancing authenticity and strategy awareness in
comprehension instruction. Unpublished article adapted from the same authors’
‘Synthesis of reading research: reading comprehension: what works?’, Educational
Leadership, 51, 5, 62–7.
Scottish Council for Research in Education (1977). Pupils in Profile: Making the Most
of Teachers’ Knowledge of Pupils. Edinburgh: Hodder and Stoughton.
Scottish Council for Research in Education (1995). Taking a Closer Look: Key Ideas in
Diagnostic Assessment. Edinburgh: Scottish Council for Research in Education.
Scottish Education Department (1968). The Teaching of Literature. Edinburgh: Scottish
Education Department/HMSO.
Scottish Education Department (1977). Assessment for All (The Dunning Report). Edin-
burgh: Scottish Education Department/HMSO.
Scottish Executive Education Department (2003). The Report of the Sixth Survey of
English Language 2001; Assessment of Achievement Programme. Edinburgh: Scot-
tish Executive Education Department.
Scottish Executive Education Department (2004). 5–14 Attainment in Publicly Funded
Schools, 2002–03. Edinburgh: Scottish Executive Publications [online]. Available:
http:www.scotland.gov.uk/stats/bulletins/00305-03.asp [1 September, 2004].
Scottish Executive Education Department (2005). Circular No.02 June 2006. Assessment
and Reporting 3–14. Available: http://www.scotland.gov.uk/Resource/Doc/54357/
0013630.pdf
Scottish Office Education Department (1991a). Curriculum and Assessment in Scot-
land: National Guidelines: English Language 5–14. Edinburgh: Scottish Office
Education Department/HMSO.
Scottish Office Education Department (1991b). Curriculum and Assessment in Scot-
land: National Guidelines: Assessment 5–14 . Edinburgh: Scottish Office Education
Department/HMSO.
240 assessing reading: from theories to classrooms
Scottish Labour Party and Scottish Liberal Democrats (2003). ‘A Partnership for a Better
Scotland: Partnership Agreement’ [online]. Available: http://www.scotland.gov.uk/
library5/government/pfbs 00.asp [15 August, 2004].
Senge, P. and Scharmer, O. (2001). ‘Community action research: learning as a commu-
nity of practitioners.’ In: Reason, P. & Bradbury, H. (Eds) Handbook of Action
Research: Participative Inquiry and Practice. London, Sage: 238–49.
Sliwka, A. and Spencer, E. (2005). Scotland: ‘Developing a Coherent System of Assess-
ment’ in Formative Assessment – Improving Learning in Secondary Classrooms.
Paris: Organisation for Economic Co-operation and Development (OECD).
Spencer, E. (1979). ‘The assessment of English – what is needed?’ In: Jeffrey, A.W.
(Ed) Issues in Educational Assessment. Edinburgh: Scottish Education Depart-
ment/HMSO.
Spolsky, B. (1994). ‘Comprehension testing, or can understanding be measured?’ In:
Brown, G., Malmkjaer, K., Pollitt, A. and Williams, J. (Eds) Language and Under-
standing. Oxford: OUP.
Tymms, P. (2004). ‘Are standards rising in English primary schools?’, British Educa-
tional Research Journal, 30, 4, 477–94.
17 Low-stakes national assessment:
national evaluations in France
Martine Rémond
The assessment of reading in France differs in some distinct respects from the assessments of reading in the English-speaking world which make up most of this volume. These differences apply both to the definition of reading and to the purposes for testing. The national evaluations of reading in French, which are described in this chapter, are formal national tests intended for formative and diagnostic use.
Background
All schools in France, for children aged from two or three up to 18, are required to follow the National Curricula, which apply to primary schools (ages 3–11), collèges (ages 12–14) and lycées (ages 15–18). This chapter mainly focuses on primary schools, where
the National Curriculum currently includes: French, mathematics, science, history,
geography, art and design, music and physical education. The national evaluations on
entering Grades 3 and 6 (ages 8 and 11 respectively) are devoted to mathematics and
French.
The National Curriculum consists of two aspects: programmes of study, which define what is to be taught, and goals, which define expected performance at the end of the
cycle. French primary schooling is composed of three cycles, each cycle lasting three
years. Cycle 1 is entitled ‘First acquisitions’ and applies to children aged from three to
five. Cycle 2, ‘Basic acquisitions’ covers 5- to 8-year-olds and is followed by Cycle 3,
‘Consolidation of acquisitions’ for ages 8 to 11. The competences assessed by the
national evaluation at the end of Cycle 2 and at the end of Cycle 3 are determined
according to the National Curriculum and to the goals fixed for each cycle.
These tests take place at the beginning of the final year of the cycle. The objective is
not to summarise or certify the attainments of the cycle that is coming to an end, but to
provide a diagnostic and formative assessment. The part of the tests which is devoted to French assesses both reading and writing, but writing will not form part of this
discussion.
The close relationship between the National Curriculum and the goals provides the
basis for the definition of the construct of reading that underlies the national tests.
The programmes of study for French are called: Mastery of Language and French Lan-
guage. They are divided into the three cycles described above and each is presented as
a few pages of text. The first category of requirements indicates the objectives and time
which must be devoted to this activity. The second details the knowledge, skills and
understandings to be taught. Finally, ‘goals for performance at the end of the cycle’ are
stated. These are listed under the following headings: mastery of oral language, read-
ing and writing, which are divided into: reading comprehension, word recognition,
handwriting and spelling, production of text.
In 2002, a new National Curriculum was published. It gives increased importance to reading instruction in Cycle 3, and it states explicitly that teachers must teach literature. A document which is separate from the National Curriculum presents the literary approach and a list of 100 books from which teachers must choose. The aim of this is to establish a common literary heritage.
The goals for Cycle 3 consist of a long list which is divided into general competen-
cies and specific competencies (literature, grammar, spelling, vocabulary, conjugation).
The tests are national diagnostic evaluations with a formative purpose. Two perspec-
tives can be distinguished: one centred on pupils individually, the other centred on the
overall population to draw conclusions on the way the system operates.
Within this overall purpose, the test items are constructed to give information about
two levels of competence:
• attainments necessary for children to have full access to teaching and learning in the
phase of education they are beginning (understanding written instructions, locating
explicit information, etc.)
• attainments that are still being acquired (presentation of a text, following complex
instructions).
For the individual pupil, the results of the evaluations have an important function in
identifying attainment, strengths, weaknesses and difficulties in relation to objectives
they should master or to the goals of the phase they are entering. These evaluations are
intended to analyse each pupil’s performances and difficulties at a given moment of his
or her schooling, in order to provide a summary of attainment at that point. This is expected to feed back into the regulation of learning for that pupil, through adjustments to teaching practices and learning objectives.
For the education system, the national evaluations provide a ‘snapshot’ of performance each year, giving an indication of the percentages of students nationally who have reached the goals fixed for each cycle.
The national assessments of French have as their basis tables of assessed competences, which are built each year according to the National Curriculum and to the goals
fixed for each cycle. Any one year’s test cannot cover the whole range of knowledge,
skills and understandings included in the national references (curriculum and goals). Over
the years, the tests must include as wide a variety as possible of skills and text types.
The goals are presented as a table divided into two fields: reading and writing, but
only reading will be discussed here. Reading is itself divided into two skills: under-
standing a text (reading comprehension); and mastery of tools of language. This general
schema applies to both levels of evaluation (grades 3 and 6). In the first column of the
table, we find the list of competences which will be assessed, then the nature of the
assessment task. Table 17.1 presents as an example the structure of the 2002 assessment
for grade 3 students, and outlines the construct of reading assessed in that year. In other
years, the table is slightly different, reflecting the broader construct of reading represented
by the National Curriculum programmes of study and the associated goals.
Table 17.1 Structure of the 2002 assessment of reading for grade 3

Reading comprehension
Competence: Understand and apply written instructions for school work
Tasks: Apply a simple written instruction; apply a complex written instruction
Competence: Identify types and functions of texts using external indications
Task: Match a work to its cover

Mastery of tools of language
Competence: Recognise written words from familiar vocabulary
Tasks: Cross out the written word when it does not match the picture; decode unfamiliar words
Competence: Identify common typographical devices and follow the structure of the written page
Tasks: Understand the organisation of a printed page; understand typographical devices
Competence: Use context, word families and dictionaries to understand the words of a text
Task: Distinguish the particular sense of a word from its context
Competence: Identify agreement, time markers, pronouns and connectives in understanding text
Tasks: Identify agreement between subject and verb, noun and adjective, using number and gender; identify time markers that structure the coherence of a text
The tests consist of open and closed questions, for which marking guidance is provided
for teachers.
Because of the formative purpose of the national tests in France, the mark schemes
allow more differentiation than simply correct/incorrect: right answers; interesting mis-
takes (in terms of interpretation); others. This is structured as a scale rather than a long
mark scheme (see Figure 17.1). The means of recording errors gives the teacher a
detailed picture of pupils’ difficulties and the degree to which they have mastered the
skills assessed.
Figure 17.1 Example mark scheme
1 – Exact answer
2 – Other correct answer showing mastery of skill
3 – Partially correct answer, no incorrect element
4 – Partially correct, partially incorrect answer
5 – Answer showing misunderstanding of the question
9 – Incorrect answer
0 – No answer
The score obtained for each skill (i.e. Reading comprehension, Tools for Language) is the percentage of correct answers, and makes possible an overall estimation of national
performance. In recent years, around two-thirds of the national population at grade 3
were found to have reached the expected standard in reading.
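To make the calculation concrete, the sketch below tallies the percentage of correct answers per skill from responses coded with the Figure 17.1 scale. It is an illustration only, not the software supplied with the tests; it assumes that codes 1 and 2 count as 'correct', and the function and variable names are hypothetical.

```python
# Minimal sketch of the per-skill score described above (not the official software).
# Assumption: codes 1 and 2 from Figure 17.1 count as correct; other codes do not.
from collections import defaultdict

CORRECT_CODES = {1, 2}

def percent_correct_by_skill(coded_responses):
    """coded_responses: iterable of (skill, code) pairs, e.g. ('Reading comprehension', 1)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for skill, code in coded_responses:
        totals[skill] += 1
        if code in CORRECT_CODES:
            correct[skill] += 1
    return {skill: 100.0 * correct[skill] / totals[skill] for skill in totals}

# Example: one pupil's coded answers across the two skills
pupil = [('Reading comprehension', 1), ('Reading comprehension', 4),
         ('Tools for language', 2), ('Tools for language', 9)]
print(percent_correct_by_skill(pupil))
# {'Reading comprehension': 50.0, 'Tools for language': 50.0}
```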
Teachers are involved in the assessment procedures: they supervise the data collection, and process and code the data. They have at their disposal materials which enable
them to interpret their pupils’ results, from which they can draw pedagogical conclu-
sions. Their analyses are facilitated by easy-to-use software. Interpretation of the
national evaluation outcomes plays a part in the training of teachers.
The 2002 tests provide an example of how these principles are translated into prac-
tice. For the Reading field, the tests include 7 exercises and 41 items at grade 3 level; 10
exercises and 36 items at grade 6 level.
At grade 3 level, we shall consider the competence Demonstrate understanding of a
text. In 2002, three exercises were included for assessing this competence, with a time
limit of 23 minutes for completion. The texts were a poster, a tale and a short narrative.
The longest of these is about 220 words.
The total number of items was 16, ten of them multiple-choice and six constructed
response. The open questions required only short answers: one word, a few words (the names of three characters), a sentence or a phrase. Figure 17.2 shows the short tale entitled The Heron.
Questions 1 and 4, one of them multiple-choice and the other open, each require
retrieval of a single piece of information that is expressed as one sentence in the text.
The multiple-choice question 2 is an example of the distinction in the coding frame
between answers which are exact or which are partially correct, with or without incor-
rect elements. The correct answer to this question has two elements. Whilst the third,
correct, option includes both elements, the second is partially correct but omits one element, whereas the remaining options include incorrect elements.
As an example of the mark schemes for open responses, question 3 can be consid-
ered. For this question, it is necessary to explain: ‘Why did he never have coins to
pay?’. For a completely correct response, two pieces of information from different parts
of the text must be linked, and this link must be inferred. The right answer should ‘refer
to poverty’. In the text, it says ‘a poor young man’. If the answer refers to ‘Wan says
thank you with his drawings’, this mistake receives a special code.
Question 5 requires students to show that they have grasped the main point of the
text by selecting an appropriate title.
The second main reading skill assessed, Tools for Language, is more difficult to
exemplify as it addresses grammatical and vocabulary issues that are mostly specific to
French. For example, the exercise that requires distinguishing the sense of words in
context is not translatable into English, as some of the translated words would differ
according to the context. However, Figure 17.3 shows an example of an exercise that
assesses knowledge of the organisation of the printed page and understanding of
typographical devices.
This type of exercise complements and contrasts with the Reading comprehension
items, and provides information of a different kind for teachers to use.
Based on its content, each item is attributed to one of the two levels of competence
described above: those necessary for pupils to benefit fully from current teaching; or those
still being acquired. A large proportion of those items addressing ‘necessary’ competen-
cies belong to the categories Recognise written words from familiar vocabulary and
Understand and apply written instructions for school work. Software for analysing the
results of the tests allows each teacher to identify those pupils who have not acquired these
essential competencies. Pupils who fail 75 per cent of the items of this type are considered
to be in difficulty. This accounts for about 10 per cent of pupils, who are supported by
means of an individual education plan – the PPAP, or Programme Personnalisé d’Aide et
de Progrès. The specific difficulties that each child is experiencing are identified, and a
remedial programme that corresponds to these difficulties is offered to him or her.
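The screening rule described above can be pictured as a short routine. This is a hypothetical sketch of the 75 per cent rule, not the analysis software that accompanies the tests; the data layout and all names are assumptions.

```python
# Hypothetical sketch of the screening rule described above, not the actual analysis software.
# results: {pupil_id: {item_id: True if the item was passed, False if failed}}
# necessary_items: ids of the items testing 'necessary' competencies

def pupils_in_difficulty(results, necessary_items, threshold=0.75):
    flagged = []
    for pupil_id, item_scores in results.items():
        attempted = [i for i in necessary_items if i in item_scores]
        if not attempted:
            continue
        failed = sum(1 for i in attempted if not item_scores[i])
        if failed / len(attempted) >= threshold:
            flagged.append(pupil_id)  # candidate for an individual education plan (PPAP)
    return flagged
```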
The positive result is that nine out of ten pupils are in a position to benefit from the
teaching delivered at the beginning of Cycle 3. For the easiest tasks, those requiring
location and retrieval of information, the success rates are between 70 and 95 per cent.
For understanding the main point of a text (giving the general setting, identifying the
main character …) success rates are between 55 and 65 per cent. Reasoning about a text
and bringing information together gives rise to success rates of between 50 and 75 per
cent. Finally, questions that call for implicit understanding, such as inferential questions, are answered successfully in only 50 per cent of cases.
Figure 17.3 Example exercise
Wolves
Wolves are animals that are similar to dogs. They have gradually disappeared
from our region because of hunting. They are very good at communicating
amongst themselves by means of howls and body language.
In the spring, pairs of wolves look for a quiet place to raise their young. Most
often, they settle beneath a rock, a fallen tree or in an old fox’s den. The parents
are not the only ones responsible for the young: the whole family group shares in
looking after the cubs.
Going up-stream, wolves push fish into shallow water and there they catch
them in their jaws. They even know how to turn over stones with their paws to
look for shrimps. They also hunt mammals such as rabbits or mules. A wolf’s
stomach can contain four kilos of meat. What an appetite!
1. How many paragraphs are there in this text?
2. Circle the first paragraph.
3. Copy a sentence from the text – whichever you choose.
4. Circle the six commas in the text.
5. The punctuation mark used at the end of the text ! is called:
Circle the correct answer
– question mark
– semicolon
– exclamation mark
– speech marks
At the beginning of Cycle 3, and within the limitations of the test exercises, the elab-
oration of inferences remains difficult for a large number of pupils; amongst these
inferences is the understanding of anaphoric references, which is still a cause of diffi-
culty at 12–13 years of age. Spelling in the context of a dictation exercise remains
poorly done. Pupils are fairly good at understanding the chronology of a story, but they
have difficulty in maintaining a coherent thread of narration in stories that they write. In
exercises that require them to select connectives, those which represent logical relation-
ships are still difficult to find.
Thus there are a number of competencies that are quite well mastered by the end of
Cycle 2, ‘Basic acquisitions’, whilst others are still being learned. This reflects the com-
plexity of the operations at work in comprehending text, and the differing requirements
of the tasks.
Conclusion
The national tests say something about what children should read and sometimes about
how they should read it. They are not based upon predetermined proportions of ques-
tions or texts of each type. Texts are chosen for their merit and interest, and the
proportion of question types varies in direct relation to the features of the texts selected.
Each pupil’s acquisitions and failures can be revealed by an analysis of answers to
each item for each test.
Yet diagnostic assessment cannot be expected to deal with everything. It does not
allow the identification of all possible difficulties, because not all the competencies
related to the goals for the cycle are included in any one test. Results are not compara-
ble from one year to the other since they are not the outcome of strictly comparable
tests.
In some cases, it will be necessary to investigate why a particular child has performed poorly, and to keep exploring his or her difficulties with other tools. Nevertheless, this
national evaluation can be seen as a useful tool for detecting problems and gaps in the
pupils’ knowledge.
The great originality of the French procedure is that it takes place at the beginning of
the school year. A diagnostic assessment of every pupil is carried out, and it is hoped that teachers will adapt their pedagogy in the light of their pupils' identified difficulties.
18 The National Assessment of
Educational Progress in the USA
Patricia Donahue
General overview
The National Assessment of Educational Progress (NAEP) is an ongoing, sample-based survey of student achievement for the USA. Not all students in the country take the NAEP assessment, and the sampled populations take different parts of the test. Using a procedure called matrix item sampling to reduce individual student burden, no student at a particular grade takes the entire assessment for that grade. For example, thirteen distinct reading blocks are administered at grade 8, but each 8th grader participating in the assessment takes only two blocks. Individual scores are therefore not reported, as they would be based on different performances on different tests.
Instead, NAEP results are reported in aggregate for the country as a whole, for partici-
pating states and jurisdictions and for major demographic subgroups. Originally and for
the period from 1969 to 1990, NAEP assessments reported results solely for the nation
and for major demographic subgroups. Then, in 1990, NAEP’s goal was augmented to
include the tracking and reporting of academic achievement in individual states. The
first NAEP Trial State Assessment was mathematics in 1990. State by state reading
assessment followed close upon this in 1992. Most recently, in 2002 and 2003, NAEP
for the first time reported results at the district level for those localities that participated
in the Trial Urban District Assessment. In little more than a decade since the first Trial
State Assessment, the potential exists for a further expansion of NAEP’s role.
Results of the NAEP assessment are reported in terms of an average score on the
NAEP reading 0–500 scale and in terms of the percentages of students who attain each
of three achievement levels, Basic, Proficient and Advanced. The scale scores indicate
what students know and can do as measured by their performance on the assessment.
The achievement levels indicate the degree to which student performance meets the
standards set for what they should know and be able to do.
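Reporting against achievement levels amounts to comparing each scale score with a set of cut scores. The sketch below illustrates the idea only: the cut scores shown are placeholders rather than the values actually set for NAEP, and the function name is hypothetical.

```python
# Illustrative only: maps a NAEP-style 0-500 scale score to an achievement level.
# The cut scores below are placeholders, not the official NAEP values.
CUT_SCORES = [("Advanced", 270), ("Proficient", 240), ("Basic", 210)]  # hypothetical

def achievement_level(scale_score):
    for level, cut in CUT_SCORES:
        if scale_score >= cut:
            return level
    return "Below Basic"

print(achievement_level(245))  # 'Proficient' under these placeholder cut scores
```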
The frequency of NAEP assessments varies by subject area. With the exception of
the mathematics and reading assessments, which are administered every two years,
most of the subject areas are administered every four years.
Since 1992 the NAEP reading assessment has been developed under a framework
reflecting research that views reading as an interactive and constructive process involv-
ing the reader, the text and the context of the reading experience. The NAEP Reading
Framework views reading as a dynamic interplay in which readers bring prior knowl-
edge and previous reading experiences to the text and use various skills and strategies in
their transactions with texts. Recognising that readers vary their approach to reading
according to the demands of particular texts and situations, the framework specifies the
assessment of three ‘reading contexts’: Reading for Literary Experience, Reading for
Information and Reading to Perform a Task. All three purposes are assessed at grades 8
and 12; however, Reading to Perform a Task is not assessed at grade 4. The reading con-
texts, as presented in the Framework, attempt to codify the types of real world reading
situations to be represented in the assessment, as shown in Table 18.1.
Table 18.1 The three reading contexts

Reading for literary experience: Readers explore events, characters, themes, settings, plots, actions and the language of literary works by reading novels, short stories, poems, plays, legends, biographies, myths and folktales.
Reading for information: Readers gain information to understand the world by reading materials such as magazines, newspapers, textbooks, essays and speeches.
Reading to perform a task: Readers apply what they learn from reading materials such as bus or train schedules, directions for repairs or games, classroom procedures, tax forms (grade 12), maps and so on.
Source: National Assessment Governing Board (2002)
While it could be argued that in a large-scale assessment such as NAEP, the context is
essentially the assessment situation, the Contexts for Reading are an attempt to replicate
major types of real life reading experiences. The Reading for Literary Experience pas-
sages and items allow students to interpret and discuss such literary elements as theme,
character motivation, or the importance of the setting to the action of a story. The texts
within this context have been primarily narratives: short stories or folktales. Poetry, to a
much lesser degree, has been used at grades 8 and 12; however, the poems have had a
strong narrative element so as to assess reading comprehension and not poetic skills or
knowledge of rhyme and meter. The Reading for Information passages mainly comprise
expository texts such as articles about nature or biographical pieces from magazines
that students might encounter at school or at home, but speeches have been included at
grade 12. Items based on informative passages ask students to consider the major and
supporting ideas, the relations between them and the overall message or point of the
text. When Reading to Perform a Task, students are asked to read schedules, directions,
or documents and to respond to questions aimed at both the information itself and also
at how to apply or use the information in the text. Across the three contexts, the NAEP
assessment tries to reflect a variety of the reading experiences that students are likely to
encounter in real life.
Within these contexts, which broadly define the type of texts used in the assessment,
the framework also delineates four Reading Aspects to characterise ways readers may
construct meaning from the written word: forming a general understanding, developing
interpretation, examining content and structure and making reader–text connections.
The Reading Aspects are meant to reflect the different approaches that readers may take
in their engagement with a text. For test development purposes, they are meant to
ensure ways of tapping different features of the text by encouraging a variety of items
that elicit different ways of thinking about texts. In short, the Reading Contexts deter-
mine the type of texts used on the assessment and the Reading Aspects determine the
type of comprehension questions asked of students (see Table 18.2).
Table 18.2 The four reading aspects

Forming a general understanding: The reader considers the text in its entirety and provides a global understanding of it.
Developing interpretation: The reader extends initial impressions to develop a more complete understanding. This may involve focusing on specific parts of the text or linking information across the text.
Making reader–text connections: The reader connects information in the text with knowledge and experience. This may include applying ideas in the text to the real world.
Examining content and structure: The reader critically evaluates, compares and contrasts and considers the effects of language features or style in relation to author's purpose.
Source: National Assessment Governing Board (2002)
All items in the NAEP reading assessment are classified as one of these four aspects.
Items that ask students to identify the main topic of a passage or to summarise the main
events of a story are classified as ‘general understanding’. Items that ask students to
explain the relationship between two pieces of textual information or to provide evi-
dence from a story to explain a character’s action are classified as ‘developing
interpretation’. The items classified under the aspect ‘making reader–text connections’
ask the reader to connect information in the text with their prior knowledge or sense of
a real-world situation. These items do not demand or assume any prior knowledge about
the topic of the passage, nor do they ask about personal feelings. The emphasis is on the
making of a connection between text-based ideas and something outside the text. Items
that ask students to focus on not just what the text says but also how the text says it are
classified as ‘examining content and structure’.
The other main features of the NAEP reading assessment are the use of authentic,
full-length texts – that is, texts that were written and published for real world purposes,
not texts that were written or abridged for the purpose of assessment – and the use of
both multiple-choice and constructed-response questions. In the NAEP reading assess-
ment, constructed-response questions are those that require students to write out their
answer. Constructed-response questions are used when more than one possible answer
would be acceptable and thus allow for a range of interpretations.
The passages used in the NAEP reading assessment vary in length by grade level. As
prescribed by the NAEP Reading Framework, 4th grade passages may range from 250
to 800 words; 8th grade passages may range from 400 to 1000 words; and at 12th grade,
passages may range from 500 to 1500 words. In addition to length, passages used in the
assessment must meet criteria for student interest, developmental and topic appropriate-
ness, style and structure, as well as being considered fair for all groups of students
taking the assessment. Meeting all these criteria makes finding suitable passages for the
assessment a challenging task indeed.
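The word-length limits quoted above lend themselves to a simple check. The sketch below uses the ranges from the Framework as quoted in this paragraph; the whitespace-based word count and the function name are simplifications introduced here, not part of the Framework itself.

```python
# Sketch of a passage-length check using the ranges quoted above from the NAEP
# Reading Framework; splitting on whitespace is a simplification of word counting.
WORD_RANGES = {4: (250, 800), 8: (400, 1000), 12: (500, 1500)}

def passage_length_ok(passage_text, grade):
    lo, hi = WORD_RANGES[grade]
    n_words = len(passage_text.split())
    return lo <= n_words <= hi

# Example with a 300-word dummy passage
sample = "word " * 300
print(passage_length_ok(sample, 4))  # True: 300 words falls within 250-800
print(passage_length_ok(sample, 8))  # False: below the 400-word minimum for grade 8
```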
It is difficult from the vantage of 2004 to realize what an innovation the use of full-
length, authentic texts was for a large-scale assessment such as NAEP. Prior to 1992,
when the assessment framework that has served as the basis for the current NAEP read-
ing assessment was created, the NAEP test resembled most other tests of reading
comprehension. That is, the NAEP test asked students to read a number of short pas-
sages and to respond to a few questions about each passage. Even then, however, the
NAEP test did not rely solely on multiple-choice questions, but included some con-
structed-response questions – approximately 6 per cent of the total assessment items.
With the marked increase in the length of passages called for by the 1992 Reading
Framework came an increase in the number of constructed-response questions. Longer
texts allowed for a larger range of interpretations, and constructed response was consequently the appropriate format to accommodate that range. From 1992
onward, about 50 to 60 per cent of the NAEP reading assessment was composed of con-
structed-response questions. The preponderance of constructed-response questions was
a natural outgrowth of the view of reading as a dynamic interplay between reader and
text and of the framework’s recognition that readers bring various experiences and their
own schema to their engagement with text even in an assessment situation.
Development of constructed-response items cannot occur without simultaneously
considering how the item will be scored and composing the scoring guide. Every item
in the NAEP reading assessment has its own unique scoring guide. Constructed-
response items are of two types: short constructed-response items that are scored with
either a two-level or a three-level scoring guide and extended constructed-response
items that are scored with a four-level scoring guide. The initial iteration of the scoring
guide written in conjunction with the item by the test developer anticipates the possi-
ble levels of comprehension that the item might elicit. These levels are revisited in
light of student responses from the pilot test administration and the guides are revised
accordingly.
Multiple-choice and constructed response items are configured into blocks. A block
consists of a reading passage and the accompanying items about the passage. Typically,
each block has approximately ten items. Students taking the assessment receive a book-
let containing two blocks and have 25 minutes to complete each block in their booklet.
Total assessment time is 60 minutes, as students also answer some background ques-
tions about their educational experiences. From 1992 to 2002 the assessment comprised
23 blocks; it was expanded for 2003 and now comprises a total of 28 blocks.
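As a rough picture of how matrix sampling spreads the item pool across students, the sketch below assembles two-block booklets at random from the full pool. The real NAEP design balances which blocks are paired and in what order, so simple random pairing is only an approximation, and all names here are hypothetical.

```python
# Simplified illustration of matrix item sampling: each sampled student receives a
# booklet of two blocks drawn from the full pool, so no one takes the whole assessment.
# Random pairing is a sketch only; the operational design balances pairings and positions.
import random

N_BLOCKS = 28           # reading blocks in the pool from 2003 onwards
BLOCKS_PER_BOOKLET = 2  # each block is allotted 25 minutes

def assign_booklet(rng=random):
    return rng.sample(range(1, N_BLOCKS + 1), BLOCKS_PER_BOOKLET)

students = {f"student_{i}": assign_booklet() for i in range(5)}
print(students)  # e.g. {'student_0': [7, 19], 'student_1': [3, 28], ...}
```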
Inevitably, the scoring of the NAEP reading assessment is an intense occasion for all
involved. Test booklets are sent from the administration to the scoring contractor in
Iowa City, where the booklets are taken apart and all the items are scanned. While mul-
tiple-choice items are machine scored, student responses to the constructed-response
items are scanned and images of the responses are presented via computer to teams of
trained scorers. A team of ten to 12 individual scorers scores all the responses to a sin-
gle item, before they go on to the next item. This item-by-item method allows for a
more focused scoring than would occur if all the responses in a block or booklet
were presented and the scorers went back and forth among different items. In the 2003
Concluding remarks
Reading is an elusive construct, because the processes that readers’ minds engage in and
employ when interacting with texts are invisible – multi-faceted and complex certainly
and therefore difficult to capture. Outside of an assessment situation, this elusiveness
and complexity is no problem for the reader, but part of the joy of reading and getting
lost in a text – for in everyday life we are allowed to read without having to demonstrate
our understanding immediately, upon demand, within a time frame. Ideas can germinate
for hours, days and weeks; be discussed with others, reframed, reconfigured and
rethought at leisure.
It would be costly indeed to assess reading comprehension at the site of its most
active occurrence: the reader alone in a room or a public library, book in hand or eyes
fixed on the computer screen, at that moment when the self is lost in a fiction or finding
the facts that the mind has sought, those moments that transpire when the book rests in
the lap or the eyes from the computer and the gaze goes out the window, but doesn’t
totally register the sun or snow, for the eyes have turned inward to what the mind is
doing with the text just read. This is not to suggest that reading is a trance and that test-
ing situations are entirely artificial, for they resemble important types of reading that
occur in children's and adults' lives: reading situations that require a response in a lim-
ited amount of time, finishing the book for English class, the reviewing of books for
publication, the letter that must be written by a certain date. And it is well known that a
deadline can jolt the mind into activity and even incite it to perform.
These distinctions between real-life reading and reading in an assessment situation
must be acknowledged, so that this awareness consciously informs the making of tests that aspire to engage students – as much as is possible in a large-scale assessment
situation – in thoughtful reading activity. This is the challenge, the platonic ideal so to
speak, of developing a reading assessment that necessarily aims to elicit the best possible
performance from students during 60 minutes of testing time. For all those involved in
the development of the NAEP test, however, the 4th, 8th and 12th grade students who
participate are nameless and faceless; they sit in many different classrooms across the
country and they come from widely disparate cultural and economic backgrounds. To
capture such a nation of readers is a daunting task and it would be foolhardy to think that
we have truly engaged all these students and elicited their highest level of performance.
And yet, considering the constraints, I always marvel during scoring sessions at the
engagement evident in student responses – the way they use or exceed the five lines pro-
vided for short constructed-responses. These are students who have been told that their
performance will not affect their grade in any way, told that they are participating in a
national survey, told that their performance will contribute to a picture of what students
in their grade are capable of doing. There are no stakes for the sampled students them-
selves. Yet the depth of their responses suggests something about the power of texts.
Even in a constrained assessment time of 60 minutes, when they are given the latitude
allowed by constructed-response questions students respond to unfamiliar texts and
express their ideas; and while responses are scored only for reading comprehension, not
writing ability, it is gratifying when students express their ideas eloquently and one dares
to hope that an assessment passage has been an occasion for a student to discover an
author who they will go on to read or a topic that they will go on to explore. While not
within the purview and purpose of the NAEP reading assessment, which by its nature is
far from formative, it can be hoped that some of the students who participate are provided
with a reading experience not too unlike their actual encounters with texts.
References
Donahue, P.L., Daane, M.C. and Jin, Y. (2005). The Nation’s Report Card: Reading
2003 (NCES 2005–453). US Department of Education, Institute of Education Sci-
ences, National Center for Education Statistics. Washington, DC: US Government
Printing Office.
Jones, L.V. and Olkin, I. (Eds) (2004). The Nation’s Report Card: Evolution and Per-
spectives. Bloomington: Phi Delta Kappa Educational Foundation, American
Educational Research Association.
National Assessment Governing Board (2002). Reading Framework for the 2003
National Assessment of Educational Progress. Washington, DC: Author.
19 Concluding reflections: from theories
to classrooms
Marian Sainsbury
At the beginning of this book, I outlined the interaction between different views of read-
ing and different views of assessment which makes this such a complex subject. The
intervening chapters have done little to simplify this complexity. Somehow, competing
theories of reading and competing theories of assessment jostle for position, and out of
this jostling emerges classroom practice. The route from theory to classroom is neither
simple nor predictable.
This is because both reading and assessment have many different stakeholders. Chil-
dren, parents, researchers, teachers, employers, governments, ordinary citizens: all are
entitled to have expectations of what reading tests can and should deliver. In response to
this, the authors have brought together their diverse experiences, knowledge and
beliefs, constructing a partial map of an ever-changing territory.
Researchers have been given a clear field to set out their visions. Beech, Pollitt and
Taylor have offered perspectives on reading that highlight cognitive processes. Fisher
has mapped out the relationship between literacy as social practice, school literacy and
test literacy. Harrison has highlighted the uncertainty of meaning in the relationship
between writer, reader and text. Stobart has set out a vision of assessment that is fit for
a truly formative purpose.
In the later sections of the book, all of these theoretical positions remain evident. The
historical overview by Pearson and Hamm specifically links reading assessment to the
dominance of particular theoretical frameworks, as does Whetton’s account of influen-
tial assessments in the UK. In the early 20th century, the emergence of psychology as a
scientific discipline led to optimism about its potential both for explicating the reading
process and designing precise and accurate tests. This gave rise to examples of assess-
ments linked to specific theories of reading and some such tests are still current,
including published tests in the UK and the French national tests described by Rémond.
They are mainly characterised by a definition of reading as word recognition and literal
comprehension.
Not until the later 20th century did theories of social literacies and postmodern
approaches to text become influential, through the study of literature, in defining school
literacy. Thompson’s chapter describes how the reader’s active construction of text was
first integrated into high-stakes assessment in the UK. These theories survive, albeit
diluted, in many of the most recent tests – for example the national assessments
described by Watts and Sainsbury, Palmer and Watcyn Jones, and Donahue. All of these
admit of the possibility that differing responses may nevertheless be creditworthy,
NAEP see National Assessment of Educational Progress
NAGB see National Assessment Governing Board
National Assessment of Educational Progress (NAEP) 6, 92, 159, 250–7
National Assessment Governing Board (NAGB) 250
National Assessment (Scotland) 228–34, 236–7
national boards 54–5
National Curriculum (France) 241–9
National Curriculum (UK) 102–3, 107–8, 115–17, 196–209
  see also statutory assessment
national initiatives in practice 196–259
National Literacy Strategy (NLS) 66–7, 71, 197
National Oracy Project (NOP) 129–30
National Qualifications Higher (Scotland) 233
  see also Higher assessment (Scotland)
national surveys (APU) 113–15
native speakers qualification 110
natural language processing (NLP) 158
navigational goals online 150–2
Neale Analysis of Reading Ability 108–9
New Criticism 89
NLP see natural language processing
NLS see National Literacy Strategy
NOP see National Oracy Project
normalisation (c-rater) 159–60
objectivity 52, 55–6, 207
observed reading process 144–7
Omar, Nasiroh vi, 140–57
online reading assessment 140–57, 162–3
open-ended questions 158, 163–4, 199, 207
'oracy movement' 129
oral reading 77–8, 81, 108–9, 132–3
  see also reading-aloud tests
O'Sullivan, J. 67
output, online reading 144, 146, 147
Owen, Wilfred 57–8
Palmer, Roger vi–vii, 5, 210–21
paradigm competition 2–3, 7–75
passage dependency 84, 87
Paterson, Katherine 202–4
Pearson, P. David vii, 5, 60, 76–101
peer assessment 55, 56, 71, 175
peer support 184, 186–7
Pepper, Lorna vii, 5, 179–93
performance
  assessment focuses 184
  children 66–8
  comprehension assessment 91
  formative assessment feedback 173–6
  France 242–3
  Scotland 230–2
  Wales 218
performance goals 67–8
Perry Preschool Project (USA) 28
personal constructs 15
phonics 16
phonological dyslexia 27
phonological skills 26–7, 28–30
pilot tests, c-rater 163, 165
PIRLS see Progress in International Reading Literacy Study
plagiarism detection 143, 153
poetry 252
policy contexts 232, 234, 237, 259
Pollitt, Alastair vii, 4, 38–49, 225
polysemic concept of meaning 58–9
portfolio-based assessment 60
postmodernism 4, 50–63, 140–57, 258
PowerPoint presentations 57–8
practical applications 140–93, 196–259
predictive validity 11
preschool training 27–8
presentation of texts 201, 203
primary schools 170, 229–30, 241–9
  see also key stage 1; key stage 2
prior knowledge 87
priorities 22–37
process, online reading 144, 146
product of reading 67
professional development, teachers' 223–4, 226, 230, 236–7
programmes of study 196–7, 241–2
Progress in International Reading Literacy Study (PIRLS) 232–3
pronoun resolution (c-rater) 160
prose reading tests 106
psychology 3–4, 22–49, 78, 85–9, 104–5, 258–9