Handbook of Quantitative Methods for Educational Research

Edited by
Timothy Teo
University of Auckland, New Zealand

SensePublishers

ISBN 978-94-6209-402-4
A C.I.P. record for this book is available from the Library of Congress.
TABLE OF CONTENTS

Foreword vii

5. Cluster Analysis
Christine DiStefano & Diana Mindrila

6. Multivariate Analysis of Variance: With Discriminant Function Analysis Follow-up
Lisa L. Harlow & Sunny R. Duerr

7. Logistic Regression
Brian F. French, Jason C. Immekus & Hsiao-Ju Yen

11. Meta-Analysis
Spyros Konstantopoulos

15. Testing Measurement and Structural Invariance: Implications for Practice 315
Daniel A. Sass & Thomas A. Schmitt

16. Mixture Models in Education
George A. Marcoulides & Ronald H. Heck
FOREWORD
This is the age of evidence, and all around are claims about the need for all to make evidence-based decisions. Evidence, however, is not neutral and critically depends on appropriate interpretation and defensible actions in light of evidence. So often evidence is called for, collected, and then analysed with little impact. At other times we seem awash with data, soothed by advanced methods, and too easily impressed with the details that are extracted. Thus there seems to be a tension between the desire to make more meaning out of the abundant data and the need for interpretations that are defensible and have consequences.
This book illustrates this tension: there are many sophisticated methods now available, but they require an advanced set of understandings to interpret their meaning and can be technically complex. With more students being less prepared in basic mathematics and statistics and in courses in experimental design and survey methods, these methods often appear out of reach. This is notwithstanding the major advances in computer software. Not so long ago structural equation modelling required a knowledge of Greek, matrix calculus, and basic computer logic; now many programs require only the facility to distinguish between boxes and circles, manipulate arrows, and read pictures. This is not a plea that only those who did it the hard way can appreciate the meaning of these methods, as many of the chapters in this book show how these modern methods and computer programs can advance how users think about their data and make more defensible interpretations.
The sheer number of methods outlined in the book shows the advances that have been made, and too often we can forget that many of these can be traced to some fundamental principles. The generalised regression model and the non-linear factor model are two such general models: for example, many of the item response family are variants of the non-linear factor model, and understanding these relations can show the limitations and advantages of the various decisions the user has to make when using these methods. For example, would a user be satisfied with a model specifying a single factor with all items loading the same on this factor, as this is what the Rasch item response model demands?
Each chapter shows some of these basic assumptions and how the methods relate to other similar methods, but, most important, shows how the methods can be interpreted. That so many of the most commonly used methods are in one book is a major asset. The methods range from measurement models (CTT, IRT) and long-developed multivariate methods (regression, cluster analysis, MANOVA, factor analysis, SEM) to meta-analysis, as well as newer methods including agent-based modelling and latent growth and mixture modelling.
There are many types of readers of this book, and an aim is to speak to them
all. There are users who read educational literature that includes these methods, and they can dip into the book to find more background, best references, and more perspective on the place and meaning of the method. There are bridgers who will go
beyond the users and will become more adept at using these methods and will want
more detail, see how the method relates to others, and want to know how to derive
more meaning and alternative perspectives on the use of the method. Then there are
clones who will use this book to drill down into more depth about the method,
use it to educate others about the method, and become more expert in their field.
There are also lurkers, those from various disciplines who have been told to use
a particular method and want a reference to know more, get an overall perspective,
and begin to see how the method is meant to work. There is an art to providing just
enough for all users, to entice them to want more, seek more, and learn more about
the many aspects of the methods that can be put into a short chapter.
One of my favourite books when I was a graduate student was Amick and
Walberg (1975). This book included many of the same methods in the current
Handbook. I referred to it often and it became the book most often stolen by
colleagues and students. It became the go-to book, a first place to investigate the meaning of methods and begin to understand what to do next. This Handbook will similarly serve these purposes. The plea, however, is to go beyond the method, to emphasise the implications and consequences. Of course, these latter depend on the appropriateness of the choice of method, the correctness of critical decisions made when using these methods, the defensibility of interpretations drawn from these methods, and the quality of the data. Happy using, bridging, cloning and lurking.
John A. Hattie
University of Melbourne
REFERENCE
Amick, D., & Walberg, H. (1975). Introductory multivariate analysis: For educational, psychological, and social research. Berkeley, CA: McCutchan.
SECTION 1
MEASUREMENT THEORY
1. PSYCHOMETRICS
the resulting measures are indeed at those corresponding levels (Michell, 1990).
Instead, the level needs to be established by testing whether the measurement model
is appropriate (van der Linden, 1992).
Corresponding to the type of measurement model that holds, measurement can be
fundamental, derived, or implicit (van der Linden, 1992). Fundamental measurement
requires that the measure has the following properties: it has an order relation, unit
arbitrariness, and additivity (see Campbell, 1928). Derived measurement assumes that
products of fundamental measurement are mathematically manipulated to produce a
new measure (such as when density is calculated as the ratio of mass to volume). In
contrast, in the implicit measurement situations in which our measurer is involved, neither of these approaches is possible: our measurer is interested in measuring a hypothetical entity that is not directly observable, namely, the latent variable. Now, latent variables can only be measured indirectly via observable indicators, or manifest variables, generically called items. For example, in the context of educational testing, if we wanted to measure the latent variable of a student's knowledge of how to add fractions, then we could consider, say, the proportion correct by each student on a set of fraction-addition problems as a manifest variable indicating the student's knowledge. But note that the student's knowledge is measured relative to the difficulty of the set of items. Such instances of implicit measurement can also be
found in the physical sciences, such as the measure of the hardness of an object.
To illustrate how different fundamental measurement is from implicit measurement
of a latent variable, consider the following example. If the weight of the Golden Gate
Bridge is 890,000 tons and the weight of the Bay Bridge is 1,000,000 tons, then
their combined weight is estimated as the sum of the two, 1,890,000 tons. However,
the estimated ability of respondent A and respondent B working together on the
fractions test mentioned above would not be the sum of the performances of respondent
A and respondent B separately. Implicit measurement allows quantification of latent
variables provided variables are measured jointly (Luce & Tukey, 1964). For an
in-depth discussion, see Michell (1990) and van der Linden (1992).
THE CONSTRUCT
Planning and debating about the purpose(s) and intended use(s) of the measures
usually comes before the measurement development process itself. We will assume
that the measurer has an underlying latent phenomenon of interest, which we will call the construct (also called propensity, latent variable, person parameter, or random intercept, and often symbolized by θ).
It will be assumed in this section that there is a single and definite construct that is
being measured. In practice, a single test might be measuring multiple constructs. If
such is the case, we will (for the purposes of this chapter) assume that each of these
constructs is being considered separately. Constructs can be of various kinds: Abilities,
achievement levels, skills, cognitive processes, cognitive strategies, developmental
stages, motivations, attitudes, personality traits, emotional states, behavioural patterns
and inclinations are some examples of constructs. What makes it possible and attractive
to measure the construct is the belief and understanding on the part of the measurer that
the amount or degree of the construct varies among people. The belief should be based
on a theory. Respondents to the test can be people, schools, organizations, or institutions.
In some cases, subjects can be animals or other biological systems or even complex
physical systems. Note that the measurer does not measure these respondents; the measurer measures the construct these respondents are believed to have.
Depending on the substantive theory underlying the construct, and one's interpretational framework, a construct could be assumed to be dimension-like or category-like. In this chapter, we will be assuming the former, in which the variability in the construct implies some type of continuity, as that is the most common situation in educational testing. Much of the following development (in fact virtually all of it up to the part about the measurement model) can be readily applied to the latter situation also; for more information on the category-like situation see Magidson & Vermunt (2002). There are many situations where the construct is readily assumed
to be dimension-like: in an educational setting, we most often can see that there is a
span in ability and knowledge between two extremes; in attitude surveys, we can see
a span of agreement (or disagreement); in medicine, there are often different levels of
a health condition or of patient satisfaction, but also a span in between. Consider the following example for a better understanding of continuity: the variable "understanding of logarithms" can be present at many levels. In contrast, the variable "pregnancy" is clearly a dichotomy: one cannot be slightly pregnant or almost pregnant. It is possible that in some domains the construct, according to an underlying theory, has discrete categories or a set of unordered categories. A respondent might be a member of one of the latent classes rather than at a point on a continuous scale. These classes can
be ordered or unordered. Various models in psychometrics such as latent class models
are designed to deal with constructs of that type (see Magidson & Vermunt, 2002).
The type of measurement presented in this chapter can be understood as the process of locating a respondent on the continuum of the latent variable. As an example, imagine a situation where one wants to find out about a respondent's wealth but cannot ask directly about it. The measurer can only ask questions about whether the respondent is able to buy a particular thing, such as "Are you able to buy an average laptop?" Based on the obtained responses, the measurer tries to locate the respondent on the wealth continuum, such as claiming that the respondent is between "able to buy an average laptop" and "able to buy an average motorcycle."
A SURVEY OF TYPES AND PURPOSES OF MEASUREMENT
From the broadest perspective, we can distinguish two types of measurement (De
Boeck & Wilson, 2006). The first type is the accurate measurement of the underlying
latent variable on which the respondents are arrayed. This implies the use of the test
at the level of individual respondents. Inferences regarding the individual, or perhaps
groups of individuals, are of primary interest. This approach is intuitively named as
(Note that we avoid using the word construct in the latter case, as it is discrepant with our definition of the construct. The term index is often used in the formative case.)
CONSTRUCT MODELING: THE FOUR BUILDING BLOCKS APPROACH
the bottom of the construct map. Similarly, respondents who possess a high degree
of the construct (top left), and the responses that indicate this amount of the construct
(top right) are located at the top of the construct map. In between these extremes are
located qualitatively different locations of the construct, representing successively
higher intensities of the construct.
Depending on the hypothesis and the setting being applied, construct maps can be
connected or nested within each other and interpreted as learning progressions. (See
Wilson, 2009 for illustrations of this.)
The construct map approach advances a coherent definition of the construct and a
working assumption that it monotonically spans the range from one extreme to another, from low degree to high degree. There might be some complexities between the two
extremes. We are interested in locating the respondent on the construct map, the central
idea being that, between the two extremes, the respondent higher on the continuum
possesses more of that construct than the respondent lower on the continuum. Thus, a
respondent higher on the continuum has a better chance to be observed demonstrating
the higher levels of the responses. This is called the assumption of monotonicity.4
The idea of a construct map forces the measurer to take careful consideration
of the theory concerning the construct of interest. A clear definition of what is
being measured should be based on the body of literature related to the construct of
interest. The definition of the construct shouldn't be too vague, such as, for instance, the definition of intelligence given by Galton (1883) as "that faculty which the genius has and the idiot has not." It is best to support the hypothetical nature and
order of the locations in the construct map from a specific theory. The coherence of
the definition of the construct in the construct map requires that the hypothesized
locations be clearly distinguishable. Note that the existence of these locations does
not necessarily contradict the concept of an underlying continuum, as they can
readily represent distinct identifiable points along a continuous span.
The advantage of laying out the construct on the construct map is that it helps the
measurer make the construct explicit. Activities that are carried out in the construct
map phase can also be described as construct explication (Nunnally, 1978) a term
used to describe the process of making an abstract concept explicit in terms of
observable variables.
Note that each respondent has only one location on the hypothesized unidimensional
(i.e., one-trait, single-factor) construct. Of course, the construct of interest might
be multi-dimensional and thus the respondent might have multiple locations in the
multidimensional space of several construct maps. As was noted earlier, for simplicity,
we are assuming a one-dimensional construct, which is believed to be recognizably
distinct from other constructs. This is also called the assumption of unidimensionality.
Note that this assumption relates to the set of items. If the construct of interest is
multidimensional, such as achievement in chemistry, which can have multiple
dimensions (see Claesgens, Scalise, Wilson & Stacy, 2009), each strand needs to be
considered separately in this framework to avoid ambiguity, although the measurement
models can be multidimensional (e.g., see Adams, Wilson, & Wang, 1997). For
example, consider the following two variables: (a) the wealth of a person, and (b) the
cash readily available to a person. Although we would expect these two variables to be
highly correlated, nevertheless, each person would have two distinct locations.
A Concrete Example: Earth and the Solar System. This example is from a test
of science content, focusing in particular on earth science knowledge in the area
of Earth and the Solar System (ESS). The items in this test are distinctive, as
they are Ordered Multiple Choice (OMC) items, which attempt to make use of
the cognitive differences built into the options to make for more valid and reliable
measurement (Briggs, Alonzo, Schwab & Wilson, 2006). The standards and
benchmarks for Earth in the Solar System appear in Appendix A of the Briggs et al. (2006) article. According to these standards and the underlying research literature,
by the 8th grade, students are expected to understand three different phenomena
within the ESS domain: (1) the day/night cycle, (2) the phases of the Moon, and
(3) the seasons, in terms of the motion of objects in the Solar System. A complete scientific understanding of these three phenomena is the top location of our construct map. See Figure 2 for the ESS construct map.

Figure 2. Construct map for student understanding of Earth in the Solar System.

In order to define the lower locations of our construct map, the literature on student misconceptions with respect to ESS
was reviewed by Briggs and his colleagues. Documented explanations of student
misconceptions with respect to the day/night cycle, the phases of the Moon, and the
seasons are displayed in Appendix A of the Briggs et al. (2006) article.
The goal was to create a single continuum that could be used to describe typical students' understanding of three phenomena within the ESS domain. In contrast, much of the existing literature documents students' understandings about a particular ESS phenomenon without connecting each understanding to their understandings about other related ESS phenomena. By examining student conceptions across the
three phenomena and building on the progressions described by Vosniadou & Brewer
(1994) and Baxter (1995), Briggs et al. initially established a general outline of the
construct map for student understanding of ESS. This general description helped
them impose at least a partial order on the variety of student ideas represented in
the literature. However, the locations were not fully defined until typical student
thinking at each location could be specified. This typical student understanding is
represented in the ESS construct map shown in Figure 2, (a) by general descriptions
of what the student understands, and (b) by limitations to that thinking in the form
of misconceptions, labeled as common errors. For example, common errors used
to define category 1 include explanations for day/night and the phases of the Moon
involving something covering the Sun or Moon, respectively.
In addition to defining student understanding at each location of the continuum,
the notion of common errors helps to clarify the difference between locations.
Misconceptions, represented as common errors at one location, are resolved at
the next higher location of the construct map. For example, students at location 3 think that it gets dark at night because the Earth goes around the Sun once a day (a common error for location 3), while students at location 4 no longer believe that the Earth orbits the Sun daily but rather understand that this occurs on an annual basis.
The top location on the ESS construct map represents the understanding expected
of 8th graders in national standards documents. Because students' understanding of
ESS develops throughout their schooling, it was important that the same continuum
be used to describe the understandings of both 5th and 8th grade students. However,
the top location is not expected of 5th graders; equally, we do not expect many 8th grade students to fall at the lowest locations of the continuum.
The Items Design
Items are the basic building blocks of the test. Each item is a stimulus and each use
of it is an attempt to obtain an observation that usefully informs the construct. In
order to develop these items in an orderly way, there needs to be a procedure for designing these observations, which we call the items design. In a complementary sense, the construct may not be clearly and comprehensively defined until a set of items has been developed and tried out with respondents. Thus, the development of items, besides its primary purpose to obtain a useful set of items, plays an important role in establishing that a variable is measurable, and that the ordered locations of the construct map are discernible.
The primary purpose of the items is to prompt for responses from the respondents.
Items should be crafted with this in mind. Items with different purposes, such as the
ones that teach the content of the test, may be costly in terms of efficiency, but, of
course, may also play an important part in instruction. It is possible to see each item
as a mini-test, and we will see the usefulness of this type of thinking when talking
about the indicators of the instrument quality later in the chapter. Thus, a test can be
seen as a set of repeated measures, since more than one observation is made for the
respondent, or, put another way, a test can be considered an experiment with repeated observations; this perspective places models commonly used in psychometrics in a broader statistical framework (see, for example, De Boeck & Wilson, 2004).
Item formats. Any systematic form of observation that attempts to reveal particular
characteristics of a respondent can be considered as an item. Information about the
construct can be revealed in many ways, in, say, a conversation, a directly asked
question, or from observing respondents, in both formal and informal settings. As
was mentioned above, at early stages, information revealed in any of these ways
can be used to clarify the ordered locations of the construct. The item format should
be appropriate to the nature of the construct. For instance, if one is interested in respondents' public speaking skills, the most appropriate format is direct observation, where the respondent speaks in public, but this is just one end of a range of authenticity that extends all the way to self-report measures.
The open-ended item format is probably the most basic and the most unrestrictive
format. In this format the responses are not limited to predefined categories (e.g., True
or False), and there may be broad latitude in terms of modes of communication (e.g.,
written, figurative, or oral), and/or length. Open-ended items are the format most typically observed in informal and social settings, such as within classrooms. However, because they are simpler to evaluate, fixed-response items are the most common format used in formal instruments. Commonly, fixed-response items will start out in an open-ended format; the responses to these can be used to generate a list of the types of responses, and this in turn can be used to design multiple alternatives. A fixed-response format is also very common in attitude surveys, where respondents are asked to pick the amount of intensity of the construct (e.g., Strongly Agree/Agree/etc.). This item format is also
referred to as the Likert-type response format (Likert, 1932).
The list of alternative ways to give respondents a chance to reveal their place
on the construct has expanded with the advances in technology and computerized
testing. New types of observations such as simulations, interactive web-pages, and
online collaborations require more complex performances from the respondent and
allow the delineation of new locations on constructs, and sometimes new constructs
altogether (Scalise & Wilson, 2011). The potential of these innovative item formats
is that they might be capable of tapping constructs that were unreachable before.
The ESS Example Continued. Returning to the ESS example, the OMC items were
written as a function of the underlying construct map, which is central to both the
design and interpretation of the OMC items. Item prompts were determined by both
the domain as defined in the construct map and canonical questions (i.e., those which
are cited in standards documents and commonly used in research and assessment
contexts). The ESS construct map focuses on students' understanding of the motion
of objects in the Solar System and explanations for observable phenomena (e.g., the
day/night cycle, the phases of the Moon, and the seasons) in terms of this motion.
Therefore, the ESS OMC item prompts focused on students' understanding of the motion of objects in the Solar System and the associated observable phenomena.
Distractors were written to represent (a) different locations on the construct map,
based upon the description of both understandings and common errors expected of
a student at a given location and (b) student responses that were observed from an
open-ended version of the item. Each item response option is linked to a specific
location on the construct map, as shown in the example item in Figure 3. Thus,
instead of gathering information solely related to student understanding of the
specific context described in the question, OMC items allow us to link student
answers to the larger ESS domain represented in the construct map. Taken together,
a student's responses to a set of OMC items permit an estimate of the student's location on the ESS construct, as well as providing diagnostic information about specific misconceptions.
The Outcome Space
As has been pointed out above, an instrument can be seen as an experiment used
to collect qualitative data. However, in the behavioural and social sciences, the
measuring is not finished when data are collected; much needs to happen after the data are collected (van der Linden, 1992). The outcome space is the building block where the responses start to be transformed into measures. The main purpose
of the outcome space is to provide a standard procedure to categorize and order
observations in such a way that the observed categories are informative about the
locations on the construct.
The term outcome space was first used and described by Marton (1981). He used students' responses to open-ended items to discover qualitatively different
Figure 3. A sample OMC item based upon ESS construct map. (L indicates location
on construct map.)
X = T + E,    (1)

where X is the observed score, T is the true score, E is the error, and where the true score is understood as the average score the respondent would obtain over many hypothetical re-tests, assuming there are no carry-over effects.6 In contrast, the second measurement approach focuses on each item and its relationship to the construct, and is thus termed the item-focused approach. The most prominent example of the item-focused approach is the work
of Guttman (1944, 1950), who based his scalogram approach on the idea that tests
could be developed for which respondents would invariably respond according
(2)
The requirement for the person and item locations (person and item parameters)
is that both are unbounded (there can always be a higher respondent or a more difficult item), thus −∞ < θ < ∞ and −∞ < δ < ∞, but the probability is, of course, bounded
between 0 and 1. The two most common probabilistic models are based on the logistic
and cumulative normal functionsthe Rasch model uses the logistic formulation.
With a multiplicative constant of 1.7, the two are very similar, particularly in the range
of −3 and 3 (Bradlow, Wainer, & Wang, 1999). Specifically, the logistic expression for the probability of a correct response on an item (represented as: X = 1) is:

Probability(X = 1 | θ, δ) = exp(θ − δ) / [1 + exp(θ − δ)],    (3)
(4)
Figure 4. Item response function of the Rasch model (note, for this item, δ = 0.0).
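As a small numerical sketch of Equation 3 (the curve plotted in Figure 4), the following Python code computes the Rasch probability of a correct response for a few person locations and compares it with the cumulative normal curve using the 1.7 scaling constant mentioned above; the helper functions and all values are purely illustrative, not taken from any real calibration.

import math

def rasch_prob(theta, delta):
    # Probability of a correct response under the Rasch model (Equation 3).
    return math.exp(theta - delta) / (1.0 + math.exp(theta - delta))

def normal_ogive_prob(theta, delta):
    # Cumulative normal probability evaluated at (theta - delta)/1.7, which
    # closely approximates the logistic curve above.
    z = (theta - delta) / 1.7
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

delta = 0.0  # item difficulty, as in Figure 4
for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):  # illustrative person locations
    print(f"theta = {theta:+.1f}: logistic = {rasch_prob(theta, delta):.3f}, "
          f"scaled normal = {normal_ogive_prob(theta, delta):.3f}")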
In the Rasch model, the total score of the correct (endorsed) items is monotonically
(but not linearly) related to the estimated ability.8 This property of the Rasch model
will be elaborated and its implications will be described below. One fundamental
property that is associated with the Rasch model is what is referred to as the sufficient statistic: the total number of correct responses by the respondent is said to be sufficient for the person ability, which means that there is no more information available in the data that can inform the estimation of the person's ability beyond the number correct. This concept also applies to the items: the total number of
respondents responding correctly to the item is a sufficient statistic for the item
difficulty. Most measurement models do not have this property.
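A brief numerical illustration of this property, with made-up item difficulties: under Equation 3, two response patterns with the same total score have a likelihood ratio that does not change with θ, so the particular pattern carries no extra information about the person location. The sketch below defines its own helper functions for this purpose.

import math

def rasch_prob(theta, delta):
    return math.exp(theta - delta) / (1.0 + math.exp(theta - delta))

def pattern_likelihood(theta, deltas, pattern):
    # Likelihood of a full 0/1 response pattern under the Rasch model.
    likelihood = 1.0
    for delta, x in zip(deltas, pattern):
        p = rasch_prob(theta, delta)
        likelihood *= p if x == 1 else (1.0 - p)
    return likelihood

deltas = [-1.0, 0.0, 1.0]   # illustrative item difficulties
pattern_a = [1, 1, 0]       # total score 2
pattern_b = [1, 0, 1]       # total score 2, but a different pattern
for theta in (-2.0, 0.0, 2.0):
    ratio = (pattern_likelihood(theta, deltas, pattern_a)
             / pattern_likelihood(theta, deltas, pattern_b))
    print(f"theta = {theta:+.1f}: likelihood ratio A/B = {ratio:.3f}")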
One implication of this feature is that the Rasch model is simple to interpret and explain compared to more complicated models with more complex scoring and/or parameterization. Models of the latter type might make it difficult to justify the fairness of the test to the public, such as when a respondent with a higher total score is estimated at a lower location than a respondent with a lower total score.9
The second implication, stemming from the same argument, is that all items
provide the same amount of information (all items are assumed to be equally good
measures of the construct). Items differ only in difficulties. The higher the person
location relative to the item location, the more likely it is that the respondent will
answer correctly (endorse) the item. Thus, when this assumption is true, only two
parameters (person location and item location) are needed to model achievement on
the item.
A further manifestation of the uniqueness of the Rasch model is referred to as
specific objectivity (Rasch, 1960). This can be understood in the following way:
if the Rasch model holds true, then locations of two respondents on a test can be
compared with each other regardless of the difficulties of the items used to measure
them, and symmetrically, the locations of two items can be compared with each
other regardless of the locations of the respondents answering the items.
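A brief derivation makes this concrete. Writing the log-odds (logit) of a correct response under Equation 3 as logit P(X = 1 | θ, δ) = θ − δ, the comparison of two respondents on the same item is

logit P(X = 1 | θ1, δ) − logit P(X = 1 | θ2, δ) = (θ1 − δ) − (θ2 − δ) = θ1 − θ2,

which does not involve the item difficulty δ. Symmetrically, the comparison of two items for the same respondent reduces to δ2 − δ1, which does not involve θ.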
Choosing the measurement model. Of course, all models are less complex than
reality, and hence, all models are ultimately wrong; this applies to measurement
models as much as any others. Some models are more suitable than others, depending
on the hypothesized construct, ones beliefs, the nature of the instrument, the sample
size, and the item type. Nevertheless, in the process of modelling, one must posit a
sensible starting-point for model-building.
Among many criteria in choosing the model, one principle that guides the choice is the law of parsimony, also referred to as Occam's razor. As Occam put it:
It is vain to do with more what can be done with fewer10
Thus, among the models, generally the more parsimonious models (models
with fewer parameters and more degrees of freedom) will offer interpretational
advantages. For example, linear models are, in most instances, easier to interpret than
non-linear ones. A more parsimonious model should be (and will be) a consequence
18
PSYCHOMETRICS
of good design, and in this context, good design includes careful development and
selection of the items.
Models can be categorized according to various criteria. A model can be
deterministic vs. probabilistic, linear vs. nonlinear, static vs. dynamic, discrete vs.
continuous, to name several such categorizations. Some models can allow one to
incorporate subjective knowledge into the model (i.e., Bayesian models), although,
in truth, any assumption of the form of an equation is a subjective judgement. The
ideal measurement model should provide the best possible basis for interpretation from the data, the central idea being to approximate (fit) the real-world situation while at the same time not having so many parameters as to complicate the interpretation of the results. The evaluation of the model is based on checking whether the mathematical model provides an accurate description of the observed data. For this, model fit is an important test of whether our measurement procedure was successful (see De Ayala, 2009, and Baker & Kim, 2004).
For the Rasch model to fit, the data should meet the relevant fit criteria. One
measure of the fit of the items in the Rasch model, known as the item and respondent
fit (or misfit) statistic, is obtained by comparing the observed patterns of responses
to the predicted patterns of responses (See, e.g., Embretson & Reise, 2000). This
type of diagnostic is an important validation step and check of the model fit. Items
that are different in their measurement quality from other items (those with different
slopes) need to be reconsidered and investigated. The measurer should filter out
items that do not fit with the model. The idea of filtering due to model fit has been a source of debate for many years. The approach described here might be
considered a strict standard, but this standard provides for relatively straightforward
interpretation via the Wright map (as described below).
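As an illustration of how observed and predicted responses can be compared, the Python sketch below computes an unweighted mean-square fit statistic for a single item (in the spirit of the Rasch "outfit" statistic), using purely illustrative person and item estimates; values far from 1 would flag the item for closer inspection. This is only a simplified sketch of one such diagnostic, not the full set of fit statistics reported by operational software.

import math

def rasch_prob(theta, delta):
    return math.exp(theta - delta) / (1.0 + math.exp(theta - delta))

def outfit_mean_square(thetas, delta, responses):
    # Unweighted mean of squared standardized residuals for one item.
    total = 0.0
    for theta, x in zip(thetas, responses):
        p = rasch_prob(theta, delta)
        z = (x - p) / math.sqrt(p * (1.0 - p))  # standardized residual
        total += z ** 2
    return total / len(responses)

thetas = [-1.5, -0.5, 0.0, 0.8, 1.7]  # illustrative person estimates (logits)
delta = 0.2                           # illustrative item difficulty estimate
responses = [0, 1, 0, 1, 1]           # observed responses to this item
print(f"outfit mean square = {outfit_mean_square(thetas, delta, responses):.2f}")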
The Wright Map. The Wright map provides a visual representation of the
relationship between the respondent ability and the item difficulty estimates by
placing them on the same logit11 scale. This provides a comparison of respondents
and items that helps to visualize how appropriately the instrument measures across
the ability range. An example of a hypothetical Wright map for science literacy
(including the ESS items) is shown in Figure 5. The left side of the map shows
examinees and their locations on the construct: respondents estimated to have the
highest ability are represented at the top, and each X represents a particular number
of respondents (depending on the sample size). The items are represented on the
right side of the map and are distributed from the most difficult at the top to the least
difficult at the bottom. When the respondent and the item have the same logit (at the
same location), the respondent has approximately a 50% probability of answering
the item correctly (or endorsing the item). When the respondent is above the item,
the probability is higher, when the respondent is below, it is lower. In this way, it is
easy to see how specific items relate both to the scale itself and to the persons whose
abilities are measured on the scale. The placement of persons and items in this kind
of direct linear relationship has been the genesis of an extensive methodology for
interpreting the measures (Masters, Adams & Wilson, 1990; Wilson, 2005; Wright,
1968; Wright, 1977).
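The Python sketch below prints a rudimentary text-only Wright map from a handful of made-up person and item estimates on the logit scale, with persons (shown as X) on the left and items on the right of the shared scale; operational maps such as Figure 5 are produced by dedicated software, so this is intended only to convey the layout.

# Minimal text Wright map: persons (X) on the left, items on the right,
# both placed on the same logit scale. All values are illustrative.
person_logits = [-1.6, -0.9, -0.4, -0.2, 0.1, 0.3, 0.3, 0.9, 1.4]
item_logits = {"item 1": -1.2, "item 2": -0.3, "item 3": 0.4, "item 4": 1.1}

level, bottom, step = 2.0, -2.0, 0.5
while level >= bottom:
    lo, hi = level - step / 2, level + step / 2
    persons = sum(1 for p in person_logits if lo <= p < hi)
    items = [name for name, d in item_logits.items() if lo <= d < hi]
    print(f"{level:+.2f} | {'X' * persons:<6}| {', '.join(items)}")
    level -= step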
For example, segments of the line representing the measurement scale can be
defined in terms of particular item content and particular person proficiencies. This
allows the measurer to make specific descriptions of the progress of students or
other test-takers whose ability estimates place them in a given segment. The set of
such segments, illustrated in Figure 5 using Roman numerals II, IV and V, can be
interpreted as qualitatively distinct regions that characterize the successive ordered
locations on the outcome variable. Defining the boundaries of these criterion zones
is often referred to as standard setting. Wright Maps have proven extremely valuable
in supporting and informing the decisions of content experts in the standard setting
process. See Draney & Wilson (2009) and Wilson & Draney (2002) for descriptions
of standard setting techniques and sessions conducted with Wright Maps in a broad
range of testing contexts.
The two most fundamental concepts in psychometrics are test reliability and test
validity. Statistical procedures exist to estimate the level of test reliability, and
reasonably simple and general procedures are available to increase it to desirable
levels. But statistical procedures alone are not sufficient to ensure an acceptable
level of validity. Regardless of their separate consideration in much of the literature, the view of the authors is that the two concepts are closely related.
Reliability
The reliability of a test is an index of how consistently a test measures whatever it
is supposed to measure (i.e., the construct). It is an integral part of the validity of
the test. If the instrument is sufficiently reliable, then the measurer can assume that
measurement errors (as defined via Equation 1) are sufficiently small to justify using
the observed score.
Thus, one can see that the closer the observed scores are to the true scores, the
higher the reliability will be. Specifically, the reliability coefficient is defined as the
ratio of the variance of these true scores to the variance of the observed scores. When
a respondent provides an answer to the item, there are influences on the response
other than the true amount of the construct, and hence, the estimated ability will
differ from the true ability due to those influences. There are many potential sources
for measurement error in addition to the respondents themselves, such as item
ordering, the test administration conditions and the environment, or raters, to name
just a few. Error is an unavoidable part of the measurement process that the measurer
always tries to reduce.
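In the notation of Equation 1, and using the standard CTT assumption that true scores and errors are uncorrelated, this definition can be written as

reliability = Var(T) / Var(X) = Var(T) / [Var(T) + Var(E)],

so the reliability approaches 1 as the error variance shrinks relative to the true-score variance.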
The reliability coefficients described below can be seen as summaries of
measurement error. The logic of most of these summary indices of measurement
error is based on the logic of CTT, but this logic can readily be re-expressed in the
Rasch approach. Note that the values calculated using them will be dependent on the
qualities of the sample of respondents, and on the nature and number of the items
used.
Internal consistency coefficients. Internal consistency coefficients inform about
the proportion of variability accounted for by the estimated true ability of the
respondent. This is equivalent to the KR-20 and KR-21 coefficients (Kuder &
Richardson, 1937) for dichotomous responses and the coefficient alpha (Cronbach,
1951; Guttman, 1944) for polytomous responses. By treating the subsets of items
as repeated measures (i.e., each item thought of as a mini-test), these indices apply
the idea of replication to the instrument that consists of multiple items. There are
no absolute standards for what is considered an adequate level of the reliability
coefficient: standards should be context-specific. Internal consistency coefficients
count variation due to the item sampling as error, but do not count day-to-day
variation as error (Shavelson, Webb & Rowley, 1989). The IRT equivalent of these
coefficients is called the separation reliability (Wright & Stone, 1979).
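A minimal Python sketch of such a coefficient: it computes coefficient alpha from a small, made-up person-by-item matrix of 0/1 scores (for dichotomous items this formula reduces to KR-20); the function is defined here for illustration, and a real analysis would involve many more respondents and items.

def coefficient_alpha(scores):
    # Coefficient alpha for a person-by-item matrix of item scores.
    n_items = len(scores[0])

    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    item_variances = [variance([row[j] for row in scores]) for j in range(n_items)]
    total_variance = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)

# Illustrative responses: 6 respondents by 4 dichotomous items.
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
print(f"coefficient alpha = {coefficient_alpha(scores):.2f}")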
Test-retest reliability. Test-retest reliability is in some respects the complement
of the previous type of reliability in that it does count day-to-day variation in
performance as error (but not the variation due to the item sampling). The test-retest
index is simply the correlation between the two administrations. As the name of the
index implies, each respondent gives responses to the items twice, and the correlation
of the responses on the test and the retest is calculated. This type of index is more
appropriate when a relatively stable construct is of interest (in order to make sure
that no significant true change in the construct is influencing the responses in the re-administration of the instrument). In addition, it is important that the respondents are not simply remembering their previous responses when they take the test the second time: the so-called carry-over effect (mentioned above). When calculating test-retest reliability, the time between the two administrations should not be too long, in order to avoid true changes in the construct, and should not be too short, in order to avoid the carry-over effect.
Alternate-forms reliability. Alternate-forms reliability counts both variation due
to the item sampling and day-to-day variation as error. In calculating this index,
two alternate but equivalent forms of the test are created and administered and the
correlation between the results is calculated. Similarly, a single test can be split
into two different but similar halves and the correlation of the scores on these two halves can be computed; the resulting index is what is referred to as the split-halves reliability. In this case, the effect of reducing the effective number of items needs to be taken into account using the Spearman-Brown prophecy formula (Brown, 1910; Spearman, 1910). Using this formula, the measurer can estimate the reliability of the score that would be obtained by doubling the number of items, resulting in the hypothetical reliability (see Wilson, 2005, p. 149).
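A short Python sketch of the Spearman-Brown prophecy formula: given the reliability (here, the correlation between two half-tests) and a lengthening factor, it projects the reliability of the lengthened test; a factor of 2 gives the usual split-halves correction. The correlation value below is illustrative.

def spearman_brown(reliability, length_factor):
    # Projected reliability when the test length is multiplied by length_factor.
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

half_test_correlation = 0.70  # illustrative correlation between the two halves
print(f"split-halves reliability = {spearman_brown(half_test_correlation, 2):.2f}")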
Inter-rater reliability. The concept of reliability also applies to raters. Raters and
judges themselves are sources of uncertainty. Even knowledgeable and experienced
raters rarely are in perfect agreement, within themselves and with one another. There
are four different types of errors due to raters: (a) severity or leniency, (b) halo effect,
(c) central tendency, and (d) restriction of range (For more information, see Saal,
Downey, & Lahey, 1980).
Generalizability Theory. The concept of reliability is central to a branch of
psychometrics called generalizability theory (Cronbach, Gleser, Nanda, &
Rajaratnam, 1972). Generalizability theory focuses on (a) the study of types
of variation that contribute to the measurement error and (b) how accurately the
observed scores allow us to generalize about the respondents' behaviour in a defined
investigations of response processes are think-alouds and interviews. Reaction time
and eye movement studies have also been proposed as other methods to gather such
evidence (Ivie & Embretson, 2010; National Research Council, 2008). With the use
of computerized testing, recording the actions by the respondents such as movement
of the mouse cursor and log of used functions and symbols can also serve as useful
information for this strand of evidence (Cooke, 2006).
Evidence based on the internal structure. If the measurer follows the steps of
the four building blocks, a hypothesized internal structure of the construct will be
readily provided via the ordered locations. The agreement of the theoretical locations
on the construct map to the empirical findings in the Wright map provides direct
evidence of internal structure. The measurer needs to compare the hypothesized
order of the items from the construct map to the order observed from the Wright
maps: A Spearman rank-order correlation coefficient can be used to quantify this
agreement (see Wilson, 2005, p. 160). The higher the correlation, the better is the
match (note that there is no predetermined lowest acceptable valuethis will need
to be a matter of judgement). Because this analysis occurs after the procedures of the four building blocks have taken place, a negative finding implicates all four of the steps: a low correlation implies that at least one of the four building blocks needs to be re-examined.
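A minimal Python sketch of this check, with made-up rankings: the hypothesized item order from the construct map is compared with the empirical difficulty order from the Wright map using the Spearman rank-order correlation (the formula below assumes no tied ranks).

def spearman_rho(rank_x, rank_y):
    # Spearman rank-order correlation for two rankings without ties.
    n = len(rank_x)
    d_squared = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

hypothesized_order = [1, 2, 3, 4, 5, 6]  # item order predicted by the construct map
empirical_order = [1, 3, 2, 4, 6, 5]     # item order estimated on the Wright map
print(f"Spearman rho = {spearman_rho(hypothesized_order, empirical_order):.2f}")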
One should also examine whether the item locations adequately cover the person locations, in order to make sure that respondents are being measured adequately throughout the whole continuum. For example, a small range of item difficulties would be like an attempt to find the fastest runner over a distance of two meters.
A similar question can be asked at the item level: the behaviour of the items needs to be checked for consistency with the estimates from the test. Consistency here is
indexed by checking that respondents in each higher response category tend to score
higher on the test as a whole. This ensures that each item and the whole test are
acting in concordance.14
Evidence Based on Relations to Other Variables
One type of external variable is the set of results of a second instrument designed
to measure the same construct. A second type arises if there is established theory
that implies some type of relationship of the construct of interest with the external
variable (i.e., positive, negative, or null, as the theory suggests). Then the presence
or the lack of that relationship with the external variable can be used as one of
the pieces of evidence. Usually the correlation coefficient is adequate to index the
strength of the relationship, but, where a non-linear relationship is suspected, one
should always check using a scatterplot. Examples of external variables are scores
on other tests, teachers' or supervisors' ratings, the results of surveys and interviews,
product reviews, and self-reports.
Just as we could apply the logic of the internal structure evidence down at the item
level, the same applies to this strand of evidence. Here the evidence is referred to
as differential item functioning (DIF). DIF occurs when, controlling for respondent
overall ability, an item favours one group of respondents over another. Finding DIF
implies that there is another latent variable (i.e., other than the construct) that is
affecting the probability of responses by members of the different groups. Ideally,
items should be functioning similarly across different subgroups. Respondents' background variables such as gender or race should not influence the probability of
responding in different categories. One way to investigate DIF is to calibrate the data
separately for each subgroup and compare the item estimates for large differences
(Wilson, 2005), but another approach directly estimates DIF parameters (Meulders
& Xie, 2004). DIF is clearly a threat to the validity of the test in the sense of fairness.
Longford, Holland, & Thayer (1993), and Paek (2002) have recommended practical
values for the sizes of DIF effects that are large enough to be worthy of specific
attention.
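The Python sketch below gives a crude flavour of the first approach: item difficulties are estimated separately for two groups, here simply as the log-odds of an incorrect response (a rough stand-in for a proper Rasch calibration, and one that does not control for overall ability as a real DIF analysis would), and items whose estimates differ by more than an arbitrary threshold are flagged for review. The data, the helper function, and the threshold are entirely illustrative.

import math

def crude_difficulty(item_responses):
    # Rough item difficulty on a logit scale: log-odds of an incorrect response.
    p_correct = sum(item_responses) / len(item_responses)
    p_correct = min(max(p_correct, 0.01), 0.99)  # guard against log(0) in tiny samples
    return math.log((1 - p_correct) / p_correct)

# Illustrative responses (rows = respondents, columns = items) for two groups.
group_a = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 1, 0], [1, 1, 1]]
group_b = [[1, 0, 0], [1, 0, 1], [0, 0, 0], [1, 0, 1], [1, 0, 1]]

threshold = 0.64  # illustrative cut-off on the logit scale
for j in range(3):
    d_a = crude_difficulty([row[j] for row in group_a])
    d_b = crude_difficulty([row[j] for row in group_b])
    flag = "possible DIF" if abs(d_a - d_b) > threshold else "ok"
    print(f"item {j + 1}: group A = {d_a:+.2f}, group B = {d_b:+.2f} -> {flag}")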
Evidence based on the consequences of using an instrument. Since the use of the
instrument may have negative consequences, this type of evidence should have a
significant influence on whether to use the instrument or not. If there is a negative
consequence from using the instrument, alternative instruments should be used
instead, or developed if none exists. If any alternative instrument will also have the
negative consequence, then perhaps the issue lies with the construct itself. Note that
this issue arises when the instrument is used according to the recommendations of
the measurer. If the instrument is used in ways that go beyond the recommendations
of the original measurer, then there is a requirement that the new usage be validated,
just as was the original use. For instance, if the instrument was designed for the use
for placement purposes only, using it for selection or diagnosis will be considered
as a misuse of the test and should be avoided. The cautionary message from Messick (1994) below reflects this point well:
Validity, reliability, comparability, and fairness are not just measurement
issues, but social values that have meaning and force outside of measurement
wherever evaluative judgments and decisions are made (p. 2).
In thinking of test consequences, it is useful to think of the four-way classification
of intended versus unintended use and positive versus negative consequences
(Brennan, 2006). Intended use with a positive consequence is seldom an issue and is considered the ideal case. Similarly, for ethical and legal reasons, there is no question that intended uses with negative consequences must be avoided. The confusion is with unintended uses. Unintended use with a positive consequence is also a benefit. The major issue and confusion arise with unintended use with negative consequences. The measurer has limited responsibility and limited power to prevent this once a test is broadly available. However, it is the measurer's responsibility to document the intended uses of the test.
CONCLUSION
Each use of an instrument is an experiment and hence requires a very careful design.
There is no machinery or mass production for producing the instruments we need
in education each instrument and each construct requires a customized approach
within a more general framework, such as that outlined above. The amount of effort you put into the design of the instrument will determine the quality of the outcomes and the ease of interpretation based on the outcome data.
In order to model real-life situations better, there have been many developments in
psychometric theory that allow extensions and increased flexibility starting from the
simple probability-based model we have used here. Models that allow the incorporation
of item features (e.g. the linear logistic test model (Janssen, Schepers, & Peres, 2004))
and respondent characteristics (e.g., latent regression Rasch models (Adams, Wilson, & Wu, 1997)), and multidimensional Rasch models (Adams, Wilson, & Wang, 1997) have been developed and used extensively. Recently there have been important developments introducing more general modelling frameworks and thus recognizing previously distinct models as special cases of a general model (e.g., De Boeck & Wilson, 2004; Skrondal & Rabe-Hesketh, 2004). As a result, the range of tools that
psychometricians can use is expanding. However, one should always bear in mind that
no sophisticated statistical procedure will make up for weak design and/or poor items.
Psychometrics as a field, and particularly educational measurement, is growing and having an effect on every student's journey through their education. However,
as these developments proceed, we need principles that act as guarantors of social
values (Mislevy, Wilson, Ercikan & Chudowsky, 2003). Researchers should not
be concerned about valuing what can be measured, but rather stay focused on
measuring what is valued (Banta, Lund, Black & Oblander, 1996). Measurement in
the educational context should be aimed squarely at finding ways to help educators
and educational researchers to attain their goals (Black & Wilson, 2011).
This chapter is not an attempt to cover completely the whole range of knowledge and practice in psychometrics; rather, it is intended to outline where one might begin.
NOTES

1. Note, do not confuse this use of "formative" with its use in the previous paragraph.
2. These four building blocks are a close match to the three vertices of the NRC's Assessment Triangle (NRC, 2001), the difference being that the last two building blocks correspond to the third vertex of the triangle.
3. Borrowed from Wilson (2005).
4. The fundamental assumption in most modern measurement models is monotonicity: as the ability of the person increases, the probability of answering correctly increases as well (unfolding IRT models being an exception; see Takane, 2007).
5. That is, it should provide useful information about certain locations on the construct map.
6. The carry-over effect can be better understood with the brainwashing analogy. Assume that the respondent forgets his/her answers on the test items over repeated testings. Aggregating over a sufficiently large (perhaps infinite) number of hypothetical administrations gives the true location of the respondent (i.e., the True Score).
7. In the development below, we will assume that the items in question are dichotomous, but the arguments are readily generalized to polytomous items also.
8. Recall that the instrument-focused approach of CTT is also based on the number correct. There is an important sense in which the Rasch model can be seen as a continuation and completion of the CTT perspective (Holland & Hoskens, 2003).
9. Note that while some see this property as an advantage of the Rasch model, it has also been a point of critique of the Rasch model. The critique lies in the fact that the Rasch model ignores the possibility that there is information in the different respondent response patterns with the same total. In our view, the best resolution of the debate lies in the view that the instrument is an experiment that needs to be carefully designed with carefully-crafted items. This point will be elaborated later in the chapter.
10. Quote from Occam cited in Thorburn (1918).
11. The natural logarithm of the odds ratio.
12. Note that these strands should not be confused with categories from earlier editions of the Test Standards, such as construct validity, criterion validity, face validity, etc.
13. The simplest thing one can do is to examine the content of the items (this has also been intuitively referred to as face validity), though this is far from sufficient.
14. This information will also usually be reflected in the item fit statistics used in the Rasch model. Another indicator is the point-biserial correlation, the correlation of the binary score with the total score, also called the item-test or item-total correlation.
REFERENCES
Adams, R. J., Wilson, M., & Wang, W. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21(1), 1–23.
Adams, R. J., Wilson, M., & Wu, M. (1997). Multilevel item response models: An approach to errors in variables regression. Journal of Educational and Behavioral Statistics, 22(1), 47–76.
American Educational Research Association (AERA), American Psychological Association (APA), and
National Council for Measurement in Education (NCME). (1999). Standards for psychological and
educational tests. Washington D.C.: AERA, APA, and NCME.
Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.).
New York: Dekker.
Banta, T. W., Lund, J. P., Black, K. E., & Oblander, F. W. (1996). Assessment in practice: Putting
principles to work on college campuses. San Francisco: Jossey-Bass.
Baxter, J. (1995). Children's understanding of astronomy and the earth sciences. In S. M. Glynn & R. Duit (Eds.), Learning science in the schools: Research reforming practice (pp. 155–177). Mahwah, NJ: Lawrence Erlbaum Associates.
Black, P., Wilson, M., & Yao, S. (2011). Road maps for learning: A guide to the navigation of learning progressions. Measurement: Interdisciplinary Research and Perspectives, 9, 1–52.
Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153–168.
Brennan, R. L. (2006). Perspectives on the evolution and future of educational measurement. In R. L. Brennan (Ed.), Educational measurement (4th ed.). Westport, CT: Praeger.
Briggs, D., Alonzo, A., Schwab, C., & Wilson, M. (2006). Diagnostic assessment with ordered multiple-choice items. Educational Assessment, 11(1), 33–63.
Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296–322.
Campbell, N. R. (1928). An account of the principles of measurement and calculation. London:
Longmans, Green & Co.
Claesgens, J., Scalise, K., Wilson, M., & Stacy, A. (2009). Mapping student understanding in chemistry: The perspectives of chemists. Science Education, 93(1), 56–85.
Cooke, L. (2006). Is the mouse a poor man's eye tracker? Proceedings of the Society for Technical Communication Conference. Arlington, VA: STC, 252–255.
Michell, J. (1990). An introduction to the logic of psychological measurement. Hillsdale, NJ: Lawrence
Erlbaum Associates.
Mislevy, R. J., Wilson, M., Ercikan, K., & Chudowsky, N. (2003). Psychometric principles in student
assessment. In T. Kellaghan, & D. L. Stufflebeam (Eds.), International handbook of educational
evaluation. Dordrecht, The Netherlands: Kluwer Academic Press.
National Research Council. (2001). Knowing what students know: The science and design of educational
assessment (Committee on the Foundations of Assessment. J. Pellegrino, N. Chudowsky, & R.
Glaser, (Eds.), Division on behavioural and social sciences and education). Washington, DC: National
Academy Press.
National Research Council. (2008). Early childhood assessment: Why, what, and how? Committee on
Developmental Outcomes and Assessments for Young Children, Catherine E. Snow & Susan B. Van
Hemel, (Eds.), Board on children, youth and families, board on testing and assessment, division of
behavioral and social sciences and education. Washington, DC: The National Academies Press.
Nisbet, R. J., Elder, J., & Miner, G. D. (2009). Handbook of statistical analysis and data mining
applications. Academic Press.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Paek, I. (2002). Investigation of differential item functioning: Comparisons among approaches, and
extension to a multidimensional context. Unpublished doctoral dissertation, University of California,
Berkeley.
Ramsden, P., Masters, G., Stephanou, A., Walsh, E., Martin, E., Laurillard, D., & Marton, F. (1993). Phenomenographic research and the measurement of understanding: An investigation of students' conceptions of speed, distance, and time. International Journal of Educational Research, 19(3), 301–316.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Danmarks Paedagogiske Institut.
Saal, F. E., Downey, R. G., & Lahey, M. A. (1980). Rating the ratings: Assessing the psychometric quality of rating data. Psychological Bulletin, 88(2), 413–428.
Scalise, K., & Wilson, M. (2011). The nature of assessment systems to support effective use of evidence through technology. E-Learning and Digital Media, 8(2), 121–132.
Skrondal, A., & Rabe-Hesketh, S. (2004). Generalized latent variable modeling: Multilevel, longitudinal
and structural equation models. Boca Raton, FL: Chapman & Hall/CRC.
Shavelson, R. J., Webb, N. M., & Rowley, G. L. (1989). Generalizability theory. American Psychologist, 44, 922–932.
Spearman, C. C. (1904). The proof and measurement of association between two things. American Journal of Psychology, 15, 72–101.
Spearman, C. C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271–295.
Takane, Y. (2007). Applications of multidimensional scaling in psychometrics. In C. R. Rao, & S. Sinharay
(Eds.), Handbook of statistics, Vol. 26: Psychometrics. Amsterdam: Elsevier.
Thorburn, W. M. (1918). The myth of Occam's razor. Mind, 27(107), 345–353.
van der Linden, W. (1992). Fundamental measurement and the fundamentals of Rasch measurement. In M. Wilson (Ed.), Objective measurement: Theory into practice (Vol. 2). Norwood, NJ: Ablex Publishing Corp.
van der Linden, W. J., & Hambleton, R. K. (Eds.) (1997). Handbook of modern item response theory.
New York: Springer.
Vosniadou, S., & Brewer, W. F. (1994). Mental models of the day/night cycle. Cognitive Science, 18, 123–183.
Wang, W.-C., & Wilson, M. (2005). The Rasch testlet model. Applied Psychological Measurement, 29, 126–149.
Wiliam, D. (2011). Embedded formative assessment. Bloomington, IN: Solution Tree Press.
Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, NJ: Lawrence
Erlbaum Associates.
Wilson, M. (2009). Measuring progressions: Assessment structures underlying a learning progression. Journal of Research in Science Teaching, 46(6), 716–730.
GENERAL DESCRIPTION
For a person p measured with test form g, CTT defines the true score as the expected value of the observed score, the error score as the difference between the observed and true scores, and the observed score as the sum of the two:

τ_gp = E(X_gp),  (1)

E_gp = X_gp − τ_gp,  (2)

X_gp = τ_gp + E_gp.  (3)

It is important to remember that in equation (3) all three elements are random variables. In CTT they are called random variables, although in the more general probability theory they are classified as stochastic processes.
CTT as a theory requires very weak assumptions. These assumptions include: (a) the measurement is on an interval scale (note: there are other types of scales, such as classifications; those are not part of the CTT model, although with some score transformations they can be incorporated into CTT); (b) the variance of observed scores, σ²_X, is finite; and (c) repeated samplings of measurements are linearly experimentally independent. Under those assumptions, the following properties have been derived (Lord & Novick, 1968):
1. The expected error score is zero;
2. The correlation between true and error scores is zero;
3. The correlation between the error score on one measurement and the true score on
another measurement is zero;
4. The correlation between errors on linearly experimentally independent
measurements is zero;
5. The expected value of the observed score random variable over persons is equal
to the expected value of the true score random variable over persons;
6. The variance of the error score random variable over persons is equal to the expected value, over persons, of the error variance within person (i.e., σ²(X_gp));
7. Sampling over persons in the subpopulation of people with any fixed true score,
the expected value of the error score random variable is zero;
8. The variance of observed scores is the sum of the variance of true scores and the
variance of error scores; that is:
σ²_X = σ²_τ + σ²_E.  (4)
It is important to note that the above properties are not additional assumptions of CTT; rather, they can be mathematically derived from the weak assumptions and are easily met by most test data. Because of this, CTT is a test theory that provides a theoretical framework linking observable variables to unobservable variables; as a theory, it cannot be shown to be useful or useless (Hambleton & Jones, 1993). From this discussion it can be seen that, with additional assumptions, CTT can be stated as a model eligible for testing against data. This empiricism is pronounced in modern test theory, especially in IRT, where the model is tested against data in each new test application.
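Although these properties are theoretical consequences of the weak assumptions, they are easy to illustrate by simulation. The following sketch is not from the chapter; the distributions, variances, and sample size are assumptions chosen purely for illustration of properties 1, 2, and 8.

```python
# Simulation sketch: illustrative distributions and sample size are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_persons = 100_000

tau = rng.normal(loc=50.0, scale=8.0, size=n_persons)    # true scores
error = rng.normal(loc=0.0, scale=4.0, size=n_persons)   # error scores, independent of tau
x = tau + error                                           # observed scores: X = tau + E

print(round(error.mean(), 3))                             # property 1: expected error ~ 0
print(round(np.corrcoef(tau, error)[0, 1], 3))            # property 2: corr(tau, E) ~ 0
print(round(x.var(), 1), round(tau.var() + error.var(), 1))  # property 8: variances add
```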
RELIABILITY
One of the most important features in CTT is reliability. The term is concerned with
precision in measurement, and it is described as consistency of test scores over
repeated measurements (Brennan, 2001). This definition has remained largely intact
since the early days of modern measurement, although its emphasis has evolved to
focus more on standard errors of measurement (cf. Brennan, 2001; Osterlind, 2010).
The term's development can be traced through each subsequent edition of the Standards for Educational and Psychological Tests (cf. 1966, 1974, 1985, 1999).
The mathematics of reliability is quite straightforward. Working from the formulation of CTT given in formula (3) above (cf. X = τ + E), τ and E are uncorrelated:

ρ_τE = 0.  (5)

This leads directly to Lord and Novick's final assumption, given as the 8th property in the list above and expressed in equation (4): that is, variances are additive, and σ²_E ≤ σ²_X. Consider, then, the ratio of true-score variance to observed-score variance:

ρ_X = σ²_τ / σ²_X = σ²_τ / (σ²_τ + σ²_E).  (6)
This ratio quantifies the reliability of using observed scores to describe the traits of a population of individuals, and ρ_X is the reliability coefficient of the measurement. As such, it is foundational to CTT. It is also obvious from equation (6) that the reliability coefficient ranges from 0 to 1.
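A minimal sketch of equation (6), assuming the true-score and error variances are known; the numbers below are illustrative, not values from the chapter.

```python
def reliability(var_true: float, var_error: float) -> float:
    """Equation (6): rho_X = var_tau / (var_tau + var_E)."""
    return var_true / (var_true + var_error)

# Illustrative values: true-score variance 64, error variance 16.
print(reliability(64.0, 16.0))   # 0.8
```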
While this coefficient is easily derived, applying it to live data in a real-world testing scenario is challenging at best, due primarily to practical considerations. From the mathematical derivation we can see that reliability requires multiple measurements. Further, in theory the measurements are presumed to be independent; strictly, they constitute a very large number of replications of a stochastic process. Practically, this is difficult to achieve even when forms of a test are strictly parallel. Using a given form and splitting it into two halves does not obviate the problem. Another practical problem concerns the attributes themselves. Attributes for educational and psychological measurements are nearly always latent constructs or proficiencies. Here is where the problem arises: in humans such latent attributes are labile, changing in unpredictable and uneven ways. At some level, this makes multiple measurements even more suspect.
These two practical difficulties are not easily overcome; nonetheless, recognizing
these conditions, reliability can be determined to a sufficient degree that it is useful
for our purposes. Due to these problems, there is not a single, universally adopted expression for the reliability coefficient. Instead, the reliability coefficient has many expressions. Generally, they address either the internal consistency of a test or its temporal stability. Internal consistency examines the degree to which the individual elements of a test (i.e., items or exercises) are correlated; Cronbach's coefficient alpha (described more fully later on) is an example of a gauge of a test's internal consistency. Similarly, a coefficient of a test's temporal stability looks for a comparable correlational relationship between repeated measurements.
Although parallel forms are not necessary to describe relationships among
quantities of interest under CTT, it is usually easier to describe those statistics with
respect to parallel forms. Parallel forms are measures that have the same true score and identical propensity distribution, between the measures, for any person in the population. That is, for any given person p in the population, if forms f and g satisfy τ_fp = τ_gp and F_fp = F_gp, we say forms f and g are parallel. The requirements of parallel forms can be reduced to τ_fp = τ_gp and σ²(E_fp) = σ²(E_gp) for any given person p, if X_fp and X_gp are linearly experimentally independent, that is, if the expected value of X_fp does not depend on any given value of x_gp, and the expected value of X_gp does not depend on any given value of x_fp.
When two test forms are parallel, the distribution of any of the three random
variables, X, τ, and E, and any derived relationships (e.g., correlations, covariances)
involving those random variables are identical between the two forms. In other words,
the two forms are exchangeable. It matters not which test form is administered.
However, those random variables do not have to follow a particular distribution,
such as a normal distribution.
Then, too, there can be types of parallelism. Non-parallel forms, depending
on the degree to which they differ from parallelism, can be tau-equivalent forms,
essentially tau-equivalent forms, congeneric forms, and multi-factor congeneric
forms. Specifically, tau-equivalent forms relax the assumption of equal error variance while retaining the assumption of equal true scores; essentially tau-equivalent forms further relax the assumption of equal true scores by requiring only that the true scores for any given person on the two forms differ by a constant that depends on the forms but not on the individual; congeneric forms allow a shortening or lengthening factor of the measurement scale from one form to the other, after adjusting for the constant difference in true scores at the origin of one form; and multi-factor congeneric forms further break down the true score on either form into different components and allow each component to have a relationship similar to that which exists between congeneric forms. For mathematical representations of these types of non-parallelism, see Feldt and Brennan (1989).
If X and X′ are observed scores from two parallel forms for the same sample of people from the population, we have

ρ_XX′ = ρ_X = ρ²_Xτ,  (7)

where X and X′ are test scores obtained from the two parallel forms. That is, the reliability coefficient can be thought of as the correlation between two parallel forms, which is also the square of the correlation between observed scores and true scores.
Therefore, based on formula (7), if parallel forms are administered to the same sample, the reliability coefficient can be estimated by the correlation between the two sets of observed scores, which equals the squared correlation between observed and true scores. Sometimes the same test form is administered twice; assuming no learning has occurred between the two administrations, the reliability coefficient is then based on the correlation between the two administrations. This is referred to as test-retest reliability.
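The equivalence in equation (7) can be illustrated with a small simulation: two strictly parallel forms are generated for the same simulated examinees, and the correlation between the forms is compared with the reliability implied by the variance components. All numbers below are illustrative assumptions.

```python
# Simulation sketch of equation (7); the variances and sample size are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
tau = rng.normal(50.0, 8.0, n)              # same true scores on both forms
x_f = tau + rng.normal(0.0, 4.0, n)         # form f: independent error
x_g = tau + rng.normal(0.0, 4.0, n)         # form g: same error variance, so parallel

rho_from_variances = 8.0**2 / (8.0**2 + 4.0**2)    # equation (6): 0.8
rho_between_forms = np.corrcoef(x_f, x_g)[0, 1]    # equation (7)
print(round(rho_from_variances, 3), round(rho_between_forms, 3))
```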
Often, a single test form is administered once and only one total test score is
available for each individual. In this case, formula (6) has to be used. The challenge
is that this formula provides the definition, not the calculation of reliability. Like the
true scores, the variance of true scores in the population is unknown and has to be
estimated from the data. Ever since Spearman (1910) and Brown (1910), different
coefficients have been proposed to estimate test reliability defined in formula (6).
Those approaches are based on the thinking that each test score is a composite score
that consists of multiple parts. The Spearman-Brown split-half coefficient is calculated under the assumption that the full test score is the sum of two part-test scores and that the two parts are parallel:
ρ_X^SB = 2ρ_X₁X₂ / (1 + ρ_X₁X₂),  (8)
where ρ_X₁X₂ is the correlation between the two parts. If X₁ and X₂ are two parallel forms of the same test, the above equation also serves as a corrected estimate of the reliability coefficient of the test when the test length is doubled. For more information on the relationship between test length and test reliability, see Osterlind (2010, pp. 143–146).
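A sketch of equation (8), with the half-test correlation of .739 from the illustrative study later in this chapter plugged in to show that it reproduces the split-half reliability of about .850 reported there.

```python
def spearman_brown(r_half: float) -> float:
    """Equation (8): step up the correlation between two parallel half-tests."""
    return 2.0 * r_half / (1.0 + r_half)

print(round(spearman_brown(0.739), 3))   # about 0.850
```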
As parallelism between the two parts is relaxed, other formulas can be used. The
applications of those formulas with degrees of parallelism can be found in Feldt and
Brennan (1989). Reuterberg and Gustafsson (1992) show how confirmatory factor
analysis can be used to test the assumption of tau equivalence and essentially tau
equivalence.
The most popular reliability coefficient remains Cronbach's coefficient alpha (Cronbach, 1951). This coefficient is a measure of internal consistency between multiple parts of a test and is based on the assumption that part scores (often, item scores) are essentially tau-equivalent (i.e., equal true-score variance, but error-score variances can differ across parts). Under this assumption, coefficient alpha is:
ρ_α = [n / (n − 1)] × [(σ²_X − Σ σ²_Xf) / σ²_X],  (9)
where n is the number of parts, σ²_X is the variance of observed scores on the full test, and σ²_Xf is the variance of observed scores on part f, summed over the n parts in the numerator.
When the parts are not essentially tau-equivalent, Cronbach's alpha is a lower bound on the reliability coefficient. If the n parts are n items in a test that are scored dichotomously (0 or 1), Cronbach's coefficient alpha reduces to KR-20 (Kuder & Richardson, 1937):
ρ_X^KR20 = [n / (n − 1)] × [1 − Σ p_f (1 − p_f) / σ²_X],  (10)

where p_f is the proportion of examinees answering item f correctly.
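A sketch of equations (9) and (10). The score matrix is simulated (persons in rows, items in columns), and numpy's default population variances are used throughout so that, for 0/1 items, the two coefficients agree exactly; the data-generating choices are assumptions for illustration only.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Equation (9): alpha = n/(n-1) * (sigma2_X - sum of part variances) / sigma2_X."""
    n = scores.shape[1]
    sum_part_var = scores.var(axis=0).sum()       # sum of variances of the parts
    total_var = scores.sum(axis=1).var()          # variance of the total score
    return (n / (n - 1)) * (total_var - sum_part_var) / total_var

def kr20(items01: np.ndarray) -> float:
    """Equation (10): the special case of alpha for items scored 0 or 1."""
    n = items01.shape[1]
    p = items01.mean(axis=0)                      # item difficulties p_f
    total_var = items01.sum(axis=1).var()
    return (n / (n - 1)) * (1.0 - (p * (1.0 - p)).sum() / total_var)

# Simulated 0/1 responses for 500 persons on 20 items sharing a common factor.
rng = np.random.default_rng(2)
ability = rng.normal(size=(500, 1))
items = (ability + rng.normal(size=(500, 20)) > 0).astype(float)
print(round(cronbach_alpha(items), 3), round(kr20(items), 3))   # identical values
```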
Another index is closely related to the reliability of a test: the standard error of measurement (SEM). The SEM summarizes within-person inconsistency in score-scale units. It represents the standard deviation of a hypothetical set of repeated measurements on a single individual (i.e., the standard deviation of the distribution of the random variable E_gp in equation (2)). In CTT models, it is usually assumed that the standard error of measurement is constant across persons to facilitate further calculations. With this assumption,

SEM = σ_E = σ_X √(1 − ρ_X).  (11)
One purpose of CTT is to make statistical inferences about people's true scores so that individuals can be compared to each other, or to some predefined criteria. Under CTT, the true score of each person, τ_p, is fixed yet unknown; in statistics we call such a quantity a parameter. A natural next question is: can we find an estimate for that parameter? With only one test administration, the common practice is to estimate a person's true score by the observed score x_p. This is an unbiased estimate of τ_p, which is defined as the expected value of the random variable X_p, as long as the weak assumptions of CTT hold. Sometimes an additional distributional assumption is added to a CTT model to facilitate the construction of an interval estimate of an individual's true score. A commonly used assumption is that the error score E is normally distributed with constant variance σ²_E. With this additional assumption, the interval estimate of τ_p is x_p ± z σ_E, where z is the value from the standard normal distribution corresponding to the probability associated with the interval.
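A sketch of equation (11) and the interval x_p ± z·σ_E; the observed score, standard deviation, and reliability below are illustrative assumptions, not values from the chapter.

```python
import math

def sem(sd_x: float, rho_x: float) -> float:
    """Equation (11): SEM = sigma_X * sqrt(1 - rho_X)."""
    return sd_x * math.sqrt(1.0 - rho_x)

def true_score_interval(x_p: float, sd_x: float, rho_x: float, z: float = 1.96):
    """Interval estimate x_p +/- z * SEM, assuming normally distributed errors."""
    half = z * sem(sd_x, rho_x)
    return x_p - half, x_p + half

print(round(sem(6.0, 0.80), 2))                                    # about 2.68
print(tuple(round(v, 1) for v in true_score_interval(40.0, 6.0, 0.80)))
```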
Another, less commonly used, construction of a point estimate and interval estimate of τ_p depends on an additional assumption: that, with a random sample of multiple persons on whom test scores are observed, the random variables τ and X follow a bivariate normal distribution. With this assumption, a point estimate of an individual's true score is ρ_X(x_p − μ_X) + μ_X, where ρ_X is the reliability coefficient and μ_X is the population mean of observed scores, which can be replaced by the sample mean of X in practice. The corresponding interval estimate for τ_p is ρ_X(x_p − μ_X) + μ_X ± z σ_X √(ρ_X(1 − ρ_X)).
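A companion sketch of this regressed estimate and its interval under the bivariate-normal assumption; the mean, standard deviation, and reliability values are again illustrative assumptions.

```python
import math

def regressed_true_score(x_p: float, mean_x: float, rho_x: float) -> float:
    """Point estimate rho_X * (x_p - mu_X) + mu_X."""
    return rho_x * (x_p - mean_x) + mean_x

def regressed_interval(x_p: float, mean_x: float, sd_x: float, rho_x: float, z: float = 1.96):
    """Interval around the regressed estimate, half-width z * sigma_X * sqrt(rho_X*(1 - rho_X))."""
    centre = regressed_true_score(x_p, mean_x, rho_x)
    se_est = sd_x * math.sqrt(rho_x * (1.0 - rho_x))
    return centre - z * se_est, centre + z * se_est

print(round(regressed_true_score(40.0, 34.0, 0.80), 1))                    # 38.8
print(tuple(round(v, 1) for v in regressed_interval(40.0, 34.0, 6.0, 0.80)))
```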
VALIDITY
The idea that test scores are used to make inferences about people is directly related to another important concept in measurement, namely, validity. The past five decades have witnessed the evolution of the concept of validity in the measurement community,
documented particularly in the five editions of the Standards for Educational and
Psychological Testing published in 1954, 1966, 1974, 1985, and 1999, respectively
(referred to as the Standards since different titles are used in those editions). In the
first edition of the Standards (APA, 1954), validity is categorized into four types:
content, predictive, concurrent, and construct. In the second edition of the Standards
(AERA, APA, & NCME, 1966), validity is grouped into three aspects or concepts:
content, criterion, and construct. In the third edition of the Standards (AERA, APA,
& NCME, 1974), the three categories are called types of validity. In the fourth edition
of the Standards (AERA, APA, & NCME, 1985), the three categories are called
types of evidence and the central role of construct-related evidence is established.
In the fifth edition of the Standards (AERA, APA, & NCME, 1999), the content/
criterion/construct trinitarian model of validity is replaced by a discussion of sources
of validity evidence.
The description of sources of validity evidence in the Standards is consistent
with, and perhaps influenced by, Messick's treatment of validity as an integrated evaluative judgment. Messick (1989) wrote:

Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment ... Broadly speaking, then, validity is an inductive summary of both the existing evidence for and the potential consequences of score interpretation and use. Hence, what is to be validated is not the test or observation device as such but the inferences derived from test scores or other indicators – inferences about score meaning or interpretation and about the implications for action that the interpretation entails ... It is important to note that validity is a matter of degree, not all or none ... Inevitably, then, validity is an evolving property and validation is a continuing process. (p. 13)
The process of collecting validity evidence, that is, validation, can be carried out by examining the test content, its relationships with criteria, and the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment (Messick, 1989, p. 13). More recently, Kane (2006) considers validation as the process of evaluating the plausibility of proposed interpretations and uses, and validity as the extent to which the evidence supports or refutes those proposed interpretations and uses (p. 17). Importantly, he divides the validation process into the development of an interpretive argument and the appraisal of that argument through a validity argument.
Notably, CTT models have been related to other techniques as special cases, and most such relationships are based on some mathematical and statistical equivalence. Before discussing those equivalences, it is important to point out that CTT is a measurement theory that bears both semantic and syntactic definitions. Through the semantic definition, the more abstract constructs can be linked to observable behaviors; through the syntactic definition, those constructs and the relationships between them can be stated more broadly. These two aspects together are made possible through a particular, mathematically convenient and conceptually useful, definition of true score and certain basic assumptions concerning the relationships among true and error scores (Lord & Novick, 1968, p. 29).
CTT is also a theory of composite scores, with a focus on properties of intact tests. If multiple forms are available, observed scores obtained from those forms can be subjected to a one-factor confirmatory factor analysis, with the latent factor serving the role of the true score in CTT; parallel and non-parallel test forms then correspond to constraints on the parameters of the factor analysis model. On the other hand, when only one test form is available, treating the items (or test parts) on that test as multiple test forms allows us to assess the applicability of different reliability coefficients. For example, Reuterberg and Gustafsson (1992) have shown that Cronbach's coefficient alpha assumes equal factor loadings from the latent factor to the item scores but does not assume equal residual variances. In this sense, CTT is a special case of confirmatory factor analysis. However, this type of testing through factor analysis addresses assumptions that are later imposed to form different CTT models, not the weak assumptions of CTT themselves. In the case of Cronbach's coefficient alpha, for example, we can use factor analysis to test the applicability of this reliability coefficient for a particular test, but it would be incorrect to claim that CTT itself does not apply if the factor analysis results are not consistent with the data.
Although the focus of CTT is usually on total test scores, analyzing the items that make up the test is useful during the earlier stages of test development (e.g., field testing) and can be informative when examining item and test shifting. The two most important statistics for any item within the CTT framework are (a) item difficulty and (b) item discrimination. For a dichotomous item scored as correct or incorrect, item difficulty (usually denoted as p) is the proportion of individuals in the sample who answered the item correctly (that is, item difficulty actually measures the easiness of an item in the sample). For a dichotomous item, the correlation between item and total test scores is the point-biserial correlation. A large correlation suggests a large difference in total test scores between those who answered the item correctly and those who answered it incorrectly. That is, the correlation between item and total test score is a measure of item discrimination. When multiple score points are possible for one item, item difficulty is the average score on that item expressed as a proportion of the total possible points, and item discrimination is the Pearson product-moment correlation between item and total test scores. In practice, item discrimination is usually calculated as the correlation between the item scores and the total test scores excluding the item being evaluated. This corrected item discrimination eliminates the dependence of the total test score on the item being evaluated.
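A sketch of these two item statistics: item difficulty as the mean item score divided by the maximum possible points, and corrected item discrimination as the correlation between each item and the total score with that item removed. The simulated responses and the one-point maximum are assumptions for illustration only.

```python
import numpy as np

def item_analysis(scores: np.ndarray, max_points: np.ndarray):
    """Return (difficulty, corrected discrimination) for each column (item)."""
    difficulty = scores.mean(axis=0) / max_points
    total = scores.sum(axis=1)
    discrimination = np.array([
        np.corrcoef(scores[:, j], total - scores[:, j])[0, 1]   # item vs. rest of test
        for j in range(scores.shape[1])
    ])
    return difficulty, discrimination

# Simulated 0/1 responses for 200 examinees on 5 items driven by a common ability.
rng = np.random.default_rng(3)
ability = rng.normal(size=(200, 1))
items = (ability + rng.normal(size=(200, 5)) > 0).astype(float)
diff, disc = item_analysis(items, max_points=np.ones(5))
print(np.round(diff, 2), np.round(disc, 2))
```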
From the above, it is obvious that both item difficulty and item discrimination under CTT depend on the sample of individuals whose responses are used in the calculations. For example, the same item may have a larger p value when data come from a higher-ability group of individuals than from a lower-ability one. This dependency of item statistics on the sample is the most frequently criticized weakness of CTT, especially when it is compared to IRT.
AN ILLUSTRATIVE STUDY
Obviously, and logically, examining test items and exercises after a test has been administered to a group of examinees is the most frequent application of CTT. Such item analysis has several purposes, including interpreting the results of an assessment, understanding the functioning of an item as a whole, exploring parts of the item (i.e., the stem, the distractors), discovering its discriminating power, and much more. While many of the statistics used for these purposes can easily be calculated by hand, it is much more convenient to use a computer, and many computer programs, both home-grown and commercial, are available to do this. We explain the output from one program, called MERMAC, to illustrate typical statistical and graphical CTT output for item analysis. Figure 1 illustrates the output for one multiple-choice item, in this case Question 44.
Note in Figure 1 that the item analysis is presented in two types: tabular and
graphical. In the table (left side of the figure), the results are reported for each fifth
of the population, divided on the basis of their total test score (the most able group
is at the top 5th; the least able is the 1st group). Such fractile groupings are common
in item analysis. In addition to showing item discrimination between five ability
groups, they can also be used in reliability analyses. In the table, the raw number
of examinees who endorsed a given response alternative is shown. This is useful
because following down the ability groups (from the top 5th to the 1st) one observes
that more of the less able examinees endorsed incorrect responses, showing greater
discrimination for the item. Additionally, it is instructive, for both the interpretation of test results and for item improvement, to note which distractors were selected by which ability group. Below the table are two rows, labeled DIFF and RPBI, meaning difficulty and point-biserial correlation. The difficulty statistic is the percent of examinees who endorsed each response alternative (both correct and incorrect); for example, overall 71 percent of examinees responded correctly to this item. The point-biserial correlation treats a dichotomous test item (typically multiple-choice) as a true dichotomy between correct and anything not correct, scored 1 and 0. A correlation coefficient is then calculated between this dichotomous variable and the examinees' total test scores. This coefficient is interpreted as a measure of the item's discriminating power. A positive value for the coefficient indicates good discrimination; hence, one looks for a positive RPBI value for the correct alternative and negative values for the distractors, as is the case with the example item in Figure 1.
The right side of the MERMAC output is a graphical representation of the table,
showing an asterisk for each ability group. The horizontal axis is percent endorsing
the correct response; hence it is a graph of the Difficulty row.
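The MERMAC-style table can be approximated in a few lines of code: sort examinees by total score, split them into fifths, and count how many in each fifth endorsed each response alternative. The response data below are fabricated solely to illustrate the layout.

```python
import numpy as np

def fifths_table(responses: np.ndarray, totals: np.ndarray,
                 alternatives=("A", "B", "C", "D")) -> np.ndarray:
    """Rows run from the 1st (least able) to the 5th (most able) fifth; columns count endorsements."""
    order = np.argsort(totals)
    fifths = np.array_split(order, 5)
    return np.array([[(responses[g] == alt).sum() for alt in alternatives] for g in fifths])

rng = np.random.default_rng(4)
totals = rng.normal(50.0, 10.0, size=300)
# Higher-scoring examinees are more likely to endorse the keyed alternative "C".
p_correct = 1.0 / (1.0 + np.exp(-(totals - 50.0) / 5.0))
responses = np.where(rng.random(300) < p_correct, "C",
                     rng.choice(np.array(["A", "B", "D"]), size=300))
print(fifths_table(responses, totals))
```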
As an illustration, suppose the same test is administered to students taking the same statistics course in four semesters. The test consists of 32 items: 4 multiple-choice items that clearly state there is only one correct answer, 7 multiple-choice items that ask students to choose as many (or as few) answers as are correct, and 21 constructed-response items in which students are asked to conduct simple calculations or to explain and interpret results related to topics covered in the course. The 11 multiple-choice items are worth 1 point each, with partial points possible for those with multiple answers. Of the constructed-response items, 9 are worth 1 point each, 6 are worth 2 points each, 2 are worth 3 points each, and 4 are worth 4 points each. Partial credit is possible for all constructed-response items. The total possible score for the test is 54, and 54 students took the test across the four semesters.
The data for four students and each item are in Table 1. Assuming the 32 items are
essentially tau-equivalent, Cronbach's coefficient alpha calculated from formula
(9) is .803. The corresponding SEM, calculated from formula (11), is 1.47. The 32
items can also be split in half so that the number of items and the total possible scores
are the same in the two split halves. The correlation between the two split parts is
.739, which results in a split-half reliability coefficient of 0.850 using equation (8).
The corresponding SEM, calculated from formula (11), is 1.12.
Item difficulties and corrected item discriminations are also in Table 1. There are
several very easy items. In this example, everyone answered Item 10 correctly so
this item does not have any discriminating power. Item 9 is a dichotomously scored
item and 4 out of the 54 students answered this item incorrectly, which renders a
discrimination coefficient that rounds to zero. All but one student answered Item 3 correctly, and the resultant item difficulty is .99 while the item discrimination is −.22. This is a very
easy item. In fact, it is so easy that an incorrect response is more likely given by a
person with a higher total test score than one with a lower total test score. This item
should be deleted.
Table 1. Item difficulty and corrected item discrimination for the 32 items (the original table also lists each student's item scores, total score, and split-half scores, which are not reproduced here; discrimination values for Items 11 through 19 are likewise not shown).

Item            I1   I2   I3   I4   I5   I6   I7   I8   I9   I10  I11  I12  I13  I14  I15  I16  I17  I18  I19
Difficulty      .93  .89  .99  .79  .69  .78  .94  .91  .93  1.00 .80  .93  .80  .67  .59  .66  .69  .45  .38
Discrimination  .28  .29  -.22 .54  .68  .48  .05  .15  .00  .00

Item            I20  I21  I22  I23  I24  I25  I26  I27  I28  I29  I30  I31  I32
Difficulty      .98  .35  .57  .57  .61  .59  .86  .61  .68  .69  .34  .81  .74
Discrimination  .26  .14  .12  .15  .46  .46  .56  .32  .22  .13  .22  .14  .46
From the above, it is evident that the approach to mental measurement offered
by CTT is both powerful and useful. It represents an application of the theory of
true score, and it has several practical applications in real-world testing situations, including developing a test, reporting scores for examinees, conducting item analysis, and quantifying error in the measurement. For these reasons CTT remains one of the most popular approaches to measuring mental processes.
REFERENCES
American Educational Research Association, American Psychological Association, & National Council
on Measurement in Education. (1966). Standards for educational and psychological tests and
manuals. Washington, DC: American Psychological Association.
American Educational Research Association, American Psychological Association, & National Council
on Measurement in Education. (1974). Standards for educational and psychological testing.
Washington, DC: American Psychological Association.
American Educational Research Association, American Psychological Association, & National Council
on Measurement in Education. (1985). Standards for educational and psychological testing.
Washington, DC: American Psychological Association.
American Educational Research Association, American Psychological Association, & National Council
on Measurement in Education. (1999). Standards for educational and psychological testing.
Washington, DC: American Educational Research Association.
American Psychological Association (APA). (1954). Technical recommendations for psychological tests
and diagnostic techniques. Washington, DC: Author.