Proefschrift Robert Zwitser PDF

UvA-DARE (Digital Academic Repository)
Contributions to latent variable modeling in educational measurement

Zwitser, R.J.
Link to publication
Citation for published version (APA):

Zwitser, R. J. (2015). Contributions to latent variable modeling in educational measurement
General rights
It is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s),
other than for strictly personal, individual use, unless the work is under an open content license (like Creative Commons).
Disclaimer/Complaints regulations
If you believe that digital publication of certain material infringes any of your rights or (privacy) interests, please let the Library know, stating
your reasons. In case of a legitimate complaint, the Library will make the material inaccessible and/or remove it from the website. Please Ask
the Library: http://uba.uva.nl/en/contact, or a letter to: Library of the University of Amsterdam, Secretariat, Singel 425, 1012 WP Amsterdam,
The Netherlands. You will be contacted as soon as possible.
UvA-DARE is a service provided by the library of the University of Amsterdam (http://dare.uva.nl)
Download date: 16 Jun 2017

Modeling in Educational Measurement
Contributions to Latent Variable
Robert J. Zwitser
Contributions to Latent Variable Modeling in Educational Measurement Robert J. Zwitser
.
.
Contributions to Latent Variable Modeling in Educational Measurement

.
Printed by Ipskamp Drukkers, Enschede

Graphic design cover by Rachel van Esschoten, DivingDuck Design
Typeset with LATEX
ISBN: 978-94-6259-618-4

c 2015, Robert J. Zwitser. All rights reserved
This research was supported by Cito, Institute for Educational Measurement

.
Contributions to Latent Variable Modeling in

Educational Measurement
ACADEMISCH PROEFSCHRIFT
ter verkrijging van de graad van doctor

aan de Universiteit van Amsterdam
op gezag van de Rector Magnificus
prof. dr. D.C. van den Boom
ten overstaan van een door het college voor promoties ingestelde
commissie, in het openbaar te verdedigen in de Agnietenkapel
op woensdag 22 april 2015, te 14:00 uur
door
Robert Johannes Zwitser
geboren te Hasselt
Promotiecommissie:
Promotor: Prof. dr. G.K.J. Maris Universiteit van Amsterdam
Overige leden: Dr. L.A. van der Ark Universiteit van Amsterdam
Prof. dr. D. Borsboom Universiteit van Amsterdam
Prof. dr. C.A.W. Glas Universiteit Twente
Prof. dr. H. Kelderman Vrije Universiteit
Prof. dr. S. Kreiner University of Copenhagen
Prof. dr. F.J. Oort Universiteit van Amsterdam
Faculteit: Faculteit der Maatschappij- en Gedragswetenschappen

Contents
1 Introduction 1
1.1 The construct . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Latent variable models . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 This thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 CML Inference with MST Designs . . . . . . . . . . . . 3
1.3.2 The Nonparametric Rasch Model . . . . . . . . . . . . . 5
1.3.3 DIF in International Surveys . . . . . . . . . . . . . . . 5
1.4 Note about notation . . . . . . . . . . . . . . . . . . . . . . . . 6
2 CML Inference with MST Designs 7

2.1 Conditional likelihood estimation . . . . . . . . . . . . . . . . . 10
2.1.1 Estimation of item parameters . . . . . . . . . . . . . . 11
2.1.2 Comparison with alternative estimation procedures . . . 14
2.1.3 Estimation of person parameters . . . . . . . . . . . . . 16
2.2 Model fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 Model fit in adaptive testing . . . . . . . . . . . . . . . 17
2.2.2 Likelihood ratio test . . . . . . . . . . . . . . . . . . . . 19
2.2.3 Item fit test . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.1 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.2 Real data . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 The Nonparametric Rasch Model 37

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Some models under consideration . . . . . . . . . . . . . . . . . 39
3.2.1 Parametric IRT models . . . . . . . . . . . . . . . . . . 39
3.2.2 Nonparametric IRT models . . . . . . . . . . . . . . . . 40
3.3 Sufficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.1 The existence of a sufficient statistic . . . . . . . . . . . 42
3.3.2 Ordinal sufficiency . . . . . . . . . . . . . . . . . . . . . 45
3.3.3 Nonparametric Rasch model . . . . . . . . . . . . . . . . 51
3.4 Testable implications of ordinal sufficiency . . . . . . . . . . . . 52
3.4.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4 DIF in International Surveys 63

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.1.1 Remove DIF items and ignore DIF in the model . . . . 65
4.1.2 Add subpopulation-specific item parameters and
compare person parameter estimates . . . . . . . . . . . 66
4.1.3 Add subpopulation-specific item parameters and adjust
the observed total score . . . . . . . . . . . . . . . . . . 69
4.1.4 DIF as an interesting outcome . . . . . . . . . . . . . . 70
4.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2.1 The construct . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2.2 Purpose of the measurement model . . . . . . . . . . . . 71
4.2.3 Comparability . . . . . . . . . . . . . . . . . . . . . . . 72
4.2.4 Difference with existing methods . . . . . . . . . . . . . 72
4.2.5 Estimation process . . . . . . . . . . . . . . . . . . . . . 73
4.2.6 Plausible responses and plausible scores . . . . . . . . . 74
4.2.7 Model fit evaluation . . . . . . . . . . . . . . . . . . . . 74
4.3 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3.1 Data set 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3.2 Data set 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4 Illustrations and results . . . . . . . . . . . . . . . . . . . . . . 76
4.4.1 Exploring the model fit . . . . . . . . . . . . . . . . . . 77
4.4.2 Incomplete design . . . . . . . . . . . . . . . . . . . . . 79
4.4.3 A large data example . . . . . . . . . . . . . . . . . . . 80
4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5 Discussion 91
5.1 The optimal CAT for high-stakes testing . . . . . . . . . . . . . 91
5.2 To order, or not to order: that is the question . . . . . . . . . . 93
5.3 We want DIF! . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Bibliography 97
References published chapters 105
Summary 107
Samenvatting 109
Dankwoord 113
Chapter 1
Introduction
Through all stages of education, from kindergarten to university, we use tests

to quantify what students know or can do. In this thesis, I focus on tests
that are designed to measure some sort of ability. Examples of such abilities
are the ability to read, the ability to write, or the ability to interpret graphs
and tables. It is generally accepted that these abilities, sometimes also more
generally referred to as constructs, cannot directly be measured in a single
observation. What can be observed is the response on a single task. Such a
task does not represent the construct as a whole, but represents one aspect
of the construct. Since one single task does not represent the total construct,
tests usually consist of multiple, separately scored tasks, usually called items.
One of the main questions in educational measurement is the following: how
to summarize the item scores into a meaningful final score that represents the
ability that is supposed to be measured. This question is prominent in this
thesis. In this introduction, I will first define the term construct in more detail.
Then, I will elaborate on latent variable models. Finally, I will introduce the
main chapters of this thesis.
1.1 The construct
There are different views on what a construct is. The first is based on the so-
called market basket approach (Mislevy, 1998), where the construct is defined
by a (large) set of items. For instance, if one wants to measure the ability
to interpret graphs at Grade 6 level, the construct interpreting graphs can
be defined with a large collection of tasks covering all relevant aspects at the
intended level. This should include tasks representing the diversity in types of
graphs as well as the diversity in complexity of the figures. If the construct is
1
2 introduction
defined by a large set of items, then it makes sense to define the final score as a
summary statistic on the total set of items, e.g., an estimate of the percentage
of tasks that is mastered.
Another view is to consider a construct as a latent variable (Lord &
Novick, 1968). Since the work of Spearman (1904) and the development of
factor analysis, psychologists mostly think of a psychological construct (e.g.,
intelligence, depression, or introversion) as a trait that cannot directly be
observed, but that exists as a common cause that explains the covariance
between observed variables. The relationship between observed variables and
the latent trait is formalized in the item response theory (IRT, Lord, 1980).
In IRT, the latent trait is operationalized as a parameter in a latent variable
model. These models describe the statistical relationship between
observations on single tasks and the latent variable, usually denoted by θ.
This latent variable approach also became popular in educational testing.
The construct is then viewed as a latent variable, and scoring with respect to
the construct implies statistical inference about a student’s ‘θ-value’.
1.2 Latent variable models
In this thesis, I mainly focus on a particular class of latent variable models:

the unidimensional monotone latent variable models. These models share the
following three assumptions. The first is unidimensionality (UD), which
means that the model contains only one latent variable θ. The second is local
independence (LI), which means that conditional on θ, item scores are
statistically independent. The third is monotonicity (M), which means that
there is a monotone, non-decreasing relationship between item scores and the
latent variable θ.
Within the class of unidimensional monotone latent variable models,
several distinctions can be made. Here, I only describe the distinction
between parametric and nonparametric models. In parametric models, the
relationship between item scores and θ is described by a parametric item
response function (IRF). A well-known example is the Rasch Model (Rasch,
1960) for dichotomous items responses, scored with either 0 or 1. This model
chapter 1 3
is based on the following IRF:
exp(θ − bi )
P (Xi = 1|θ) = ,
1 + exp(θ − bi )
in which P (Xi = 1|θ) denotes the probability of a score 1, conditional on

θ, and bi denotes a parameter related to item i. The θ parameters are also
referred to as person parameters. Other well-known examples are the Two-
and the Three-Parameter Logistic Model (Birnbaum, 1968), and the Normal
Ogive Model (Lord & Novick, 1968). Nonparametric models put nonparametric
restrictions on the IRF P (Xi |θ). Examples are the Monotone Homogeneity
Model (MHM, Mokken, 1971), which only assumes UD, LI, and M, and the
Double Monotonicity Model (Mokken, 1971), which additional to UD, LI, and
M, also assumes invariant item ordering (IIO):
P (X1 = 1|θ) ≤ P (X2 = 1|θ) ≤ · · · ≤ P (XK = 1|θ),
for all θ, and for all K items. The main benefit of these nonparametric models
is that they put in general less restrictions on the data, and are therefore more
likely to fit the data. A counterpart, however, is that some of the well-known
applications of parametric models, such as inference from incomplete data, are
limited.
1.3 This thesis
In this thesis, I will describe three studies related to the use of unidimensional
monotone latent variable models in educational measurement. I will briefly
introduce them in the next three sections.
1.3.1 CML Inference with MST Designs

The first study is about conditional likelihood inference from multistage testing
(MST) designs. In MST designs, items are administered in blocks/modules
consisting of multiple items. The modules differ in difficulty. Modules are
administered to students depending on their responses to earlier modules. The
simplest example of a MST is a two stage test (Lord, 1971b). In the first
stage, all students take the same first module. This module is often called
4 introduction
the routing test. In the second stage, students with a score lower than or
equal to c on the routing test take a more easy module, whereas students
with a score higher than c on the routing test take a more difficult module.
MST is an example of adaptive testing (Van der Linden & Glas, 2010), which
means that the difficulty level of the test is adapted to the ability level of the
student. In order to know which items are easy and which items are difficult,
items used in an adaptive tests are usually pretested in a linear, non adaptive,
pretest. In such a pretest, item characteristics are determined. Thereafter, the
characteristics are assumed to be the same during the adaptive administration.
A consequence is that the final score also depends on this assumption about
the item characteristics. Therefore, especially in high-stakes testing where test
results can have important consequences for the test taker, it is important to
check these assumptions after the adaptive administration. This implies that
we want to estimate, or at least validate, the parameters of the model from the
adaptive test data. In this chapter, I focus on the estimation of item parameters
in MST designs.
It is generally known that item and person parameters cannot consistently
be estimated simultaneously (Neyman & Scott, 1948). For that reason, the
estimation procedure is usually performed in two steps. First, the item
parameters are estimated with a conditional likelihood (Andersen, 1973a) or
marginal likelihood (Bock & Aitkin, 1981) method. This step is called
calibration. In the section step, the person parameters are estimated,
conditional on the item parameters. For MST designs, it was already
described how item parameters can be estimated with the marginal likelihood
method (Glas, 1988; Glas, Wainer, & Bradlow, 2000). And it has been
claimed that for MST designs the conditional likelihood method can not be
used (Glas, 1988; Eggen & Verhelst, 2011; Kubinger, Steinfeld, Reif, &
Yanagida, 2012). In this chapter, I will illustrate that also in MST designs
item parameters can be estimated with the conditional likelihood method, a
method that in some cases is preferable over the marginal likelihood method.
This chapter is therefore not directly about the estimation of θ, but about the
calibration step that precedes the final scoring. With the item parameters
and the data obtained from the MST, the usual methods can be used to
obtain the final θ estimates.
chapter 1 5
1.3.2 The Nonparametric Rasch Model

The second study is about ordering individuals with the sum score. As
introduced above, one of main questions in educational measurement is how
to summarize item scores into a final score. A criterion with which the use of
a particular statistic could be justified, is as follows: if a unidimensional
model fits the data and if the model contains a sufficient statistic for the
parameter θ, then the sufficient statistic could be used as final score, since the
sufficiency property implies that the statistic contains all statistical
information about the parameter θ. Within the class of unidimensional
monotone latent variable models, both the Rasch model (Rasch, 1960), as
well as the One Parameter Logistic Model (Verhelst & Glas, 1995) contain a
sufficient statistic for θ. However, it might be that these models do not fit the
data, and then the justification argument described above does not hold. In
case of a lack of model fit, a nonparametric alternative might be considered.
Chapter 3 is about a nonparametric equivalent of the justification criterion
described above. Nonparametric models can be used for ordinal inferences. If
we want to justify the use of a statistic to order individuals, we must have a
statistic that contains all statistical information about the ordering with
respect to θ. For the well-known MHM, the use of the sum score has often
been justified based on the stochastic ordering of the latent trait (SOL)
property (see, e.g., Mokken, 1971, and Meijer, Sijtsma, & Smid, 1990). In this
chapter, however, we argue that this property is not satisfactory as
justification for using sum scores to order individual students. To arrive at a
nonparametric model that contains a statistic that keeps all available
statistical information about the ordering of θ, or at least does not contradict
it, we first define the ordinal sufficiency property. Then we take the sum
score as an example, and we will introduce a nonparametric model with an
ordinal sufficient statistic for the parameter θ: this model is called the
nonparametric Rasch Model.
1.3.3 DIF in International Surveys

The third study is about final scores in international surveys, especially the
Programme for International Student Assessment (PISA). A factor that
complicates the statistical modeling of surveys is the substantive amount of
6 introduction
differential item functioning (DIF). There is therefore no single model that

fits the data in each country. However, this is exactly what PISA assumes:
after data cleaning and the elimination of some bad performing items, PISA
fits a generalization of the Rasch model in an international calibration
(OECD, 2009a), and the person parameters are taken as final score. The last
couple of years, the consequences of ignoring DIF in the model have been a
topic of debate, and recently a couple of modeling approaches that take DIF
into account have been proposed (Kreiner & Christensen 2007; 2013; Oliveri
& Von Davier 2011; 2014). In this chapter, we explain that these approaches
are not fully satisfactory, and we propose an alternative, DIF-driven modeling
approach for international surveys. The core of this approach is that we
define the construct as a set of items. Therefore, comparisons with respect to
the construct, between different populations, are equivalent to comparisons of
the responses to these items. The only aspect that complicates these
comparisons is the incomplete data collection design. In this chapter, we
illustrate how latent variable models (plural, because different models are
used in different countries) are used to get an estimate of the complete data
matrix. Since we use different models in different countries, this procedure is
very flexible with respect to DIF. With the estimated complete data matrix,
all kinds of comparisons between countries can be made. We will illustrate
this with real PISA data.
1.4 Note about notation
The research projects that are described in the next three chapters are based
on collaboration with some colleagues. Therefore, I write we instead of I.
Furthermore, notation is sometimes not consistent between chapters. However,
within chapters we have striven to be consistent and to introduce all notation.
Chapter 2
Conditional Statistical Inference with

Multistage Testing Designs
Summary
In this paper it is demonstrated how statistical inference from multistage test
designs can be made based on the conditional likelihood. Special attention is given
to parameter estimation, as well as the evaluation of model fit. Two reasons are
provided why the fit of simple measurement models is expected to be better in
adaptive designs, compared to linear designs: more parameters are available for the
same number of observations; and undesirable response behavior, like slipping and
guessing, might be avoided owing to a better match between item difficulty and
examinee proficiency. The results are illustrated with simulated data, as well as
with real data.
This chapter has been accepted for publication as: Zwitser, R.J. & Maris, G. (in press).
Conditional Statistical Inference with Multistage Testing Designs. Psychometrika.
8 cml inference with mst designs
For several decades, test developers have been working on the

development of adaptive test designs in order to obtain more efficient
measurement procedures (Cronbach & Gleser, 1965; Lord, 1971a; Lord,
1971b; Weiss, 1983; Van der Linden & Glas, 2010). It is often shown that the
better match between item difficulty and the proficiency of the examinee
leads to more accurate estimates of person parameters.
Apart from efficiency, there are more reasons for preferring adaptive
designs over linear designs, where all items are administered to all examinees.
The first reason is that a good match between difficulty and proficiency might
decrease the risk of undesirable response behavior. Examples of such behavior
are guessing (slipping), which is an unexpected (in)correct response given the
proficiency of the examinee. The avoidance of guessing or slipping might
therefore diminish the need for parameters to model this type of behavior.
This implies that adaptive designs could go along with more parsimonious
models, compared to linear test designs. The second reason is that model fit
is expected to be better for adaptive designs. Conditional on a fixed number
of items per examinee, an adaptive design contains more items compared to a
linear design. This implies that, although the number of possible observations
is the same in both cases, the measurement model for an adaptive tests
contains more parameters than the same measurement model for a linear test.
An ultimate case is the computerized adaptive test (CAT, Weiss, 1983; Van
der Linden & Glas, 2010) with an infinitely large item pool. A CAT with N
dichotomous items has 2N different response patters. Since the corresponding
probabilities sum to one, the measurement model should estimate 2N − 1
probabilities. Observe, however, that the number of items in such a design is
also 2N − 1. In a later section, we will show that in this case the Rasch model
(Rasch, 1960) is a saturated model.
The usual approach for the calibration of a CAT is to fit an item response
theory model on pretest data obtained from a linear test. The estimated item
parameters are then considered as fixed during the adaptive test
administration (Glas, 2010). This approach is valid if the item parameters
have the same values during the pretest administration and the actual
adaptive test administration. However, factors like item exposure, motivation
of the examinees, and different modes of item presentation may result in
parameter value differences between the pretest stage and the test stage
chapter 2 9
(Glas, 2000). This implies that, for accountability reasons, one should want to
(re)calibrate the adaptive test after test administration.
In this paper, we go into the topic of statistical inference from adaptive test
designs, especially multistage testing (MST) designs (Lord, 1971b; Zenisky,
Hambleton, & Luecht, 2010). These designs have several practical advantages,
as ”the design strikes a balance among adaptability, practicality, measurement
accuracy, and control over test forms” (Zenisky et al., 2010). In MST designs,
items are administered in block/modules with multiple items. The modules
that are administered to an examinee depends on their responses to earlier
modules. An example of an MST design is given in Figure 2.1. In the first
stage, all examinees take the first module.1 This module is often called the
routing test. In the second stage, examinees with a score lower than or equal
to c on the routing test take module 2, whereas examinees with a score higher
than c on the routing test take module 3. Every unique sequence of modules
is called a booklet.
stage 1 stage 2
[1] [2]
X X X [3]
[1]
booklet 1 X+ ≤ c
[1]
booklet 2 X+ > c
Figure 2.1: Example of a multistage design.
In the past, only a few studies have focused on the calibration of items in an
MST design. Those were based on Bayesian inference (Wainer, Bradlow, & Du,
2000) or marginal maximum likelihood (MML) inference (Glas, 1988; Glas et
al., 2000). In this paper, we consider statistical inference from the conditional
maximum likelihood (CML) perspective (Andersen, 1973a). A benefit of this
method is that, in contrast to MML, no assumptions are needed about the
distribution of ability in the population, and it is not necessary to draw a
1 We use a superscript [m] to denote random variables and parameters that relate to the
mth module. Multiple modules, e.g., module 1 and 2, are denoted by the superscript [1,2].
random sample from the population. However, it has been suggested that the
CML method cannot be applied with MST (Glas, 1988; Eggen & Verhelst, 2011;
Kubinger et al., 2012). The main purpose of this paper is to demonstrate that
this conclusion was not correct. This will be shown in Section 2.1. In order
to demonstrate the practical value of this technical conclusion, we elaborate
on the relationship between model fit and adaptive test designs. In Section
2.2, we first show in more detail that the fit of the same measurement model
is expected to be better for adaptive designs in comparison to linear designs.
Then, second, we propose how the model fit can be evaluated. In Section 2.3,
we give some illustrations to elucidate our results. Throughout the paper, we
use the MST design in Figure 2.1 for illustrative purposes. The extent to which
our results for this simple MST design generalize to more complex designs is
discussed in Section 2.4.
2.1 Conditional likelihood estimation
Throughout the paper, we use the Rasch model (Rasch, 1960) in our derivations
and examples. Let X be a matrix with item responses of K examinees on N
items. The model is defined as follows:
N
K Y
Y exp[(θp − bi )xpi ]
P (X = x|θ, b) = , (2.1)
p=1 i=1
1 + exp(θp − bi )
in which xpi denotes the response of examinee p, p = 1, ..., K, on item i,

i = 1, ..., N , and in which θp and bi are parameters related to examinee p
and item i, respectively. The θ-parameters are often called ability parameters,
while the b-parameters are called difficulty parameters. The Rasch model is an
exponential family distribution with the sum score
X
Xp+ = Xpi sufficient for θp
i
and X
X+i = Xpi sufficient for bi .
p
Statistical inference about X is hampered by the fact that the person

parameters θp are incidental. That is, their number increases with the sample
chapter 2 11
size. It is known that, in the presence of an increasing number of incidental

parameters, it is, in general, not possible to estimate the (structural) item
parameters consistently (Neyman & Scott, 1948). This problem can be
overcome in one of two ways. The first is MML inference (Bock & Aitkin,
1981): If the examinees can be conceived of as a random sample from a
well-defined population characterized by an ability distribution G, inferences
can be based on the marginal distribution of the data. That is, we integrate
the incidental parameters out of the model. Rather than estimating each
examinee’s ability, only the parameters of the ability distribution need to be
estimated. The second is CML inference: Since the Rasch model is an
exponential family model, we can base our inferences on the distribution of
the data X conditionally on the sufficient statistics for the incidental
parameters. Obviously, this conditional distribution no longer depends on the
incidental parameters. Under suitable regularity conditions, both methods
can be shown to lead to consistent estimates of the item difficulty parameters.
2.1.1 Estimation of item parameters

Suppose that every examinee responds to all three modules (X[1] , X[2] , and
X[3] ). That is, we have complete data for every examinee. We now consider
how the (distribution of the) complete data relate(s) to the (distribution of the)
data from MST and derive the conditional likelihood upon which statistical
inferences can be based.
The complete data likelihood can be factored as follows:2
[1] [2] [3]

Pb (x|θ) =Pb[1] (x[1] |x+ )Pb[2] (x[2] |x+ )Pb[3] (x[3] |x+ )
[1] [2] [3]
Pb (x+ , x+ , x+ |x+ )Pb (x+ |θ)
2 Whenever possible without introducing ambiguity, we ignore the distinction between
random variables and their realizations in our formulae.

where
Q [m] [m]
[m] exp(−xi bi )
Pb[m] (x[m] |x+ ) = i
, m = 1, 2, 3,
γx[m] (b[m] )
+
γx[1] (b[1] )γx[2] (b[2] )γx[3] (b[3] )

[1] [2] [3] + + +
Pb (x+ , x+ , x+ |x+ ) = ,
γx+ (b)
γx (b) exp(x+ θ)
Pb (x+ |θ) = P + ,
s γs (b) exp(sθ)
and γs (b[m] ) is the elementary symmetric function of order s:
[m] [m]
X Y
γs (b[m] ) = exp(−xi bi ),
x:x+ =s i
[m]
which equals zero if s is smaller than zero or larger than the number of elements
in b[m] .
The various elementary symmetric functions are related to each other in
the following way:
X
γx+ (b) = γi (b[1] )γj (b[2] )γk (b[3] ).
i+j+k=x+
To turn a sample from X into a realization of data from MST, we do the

following: If the score of an examinee on module 1 is lower than or equal to
c, we delete the responses on module 3, otherwise, we delete the responses on
module 2. We now consider this procedure from a formal point of view.
Formally, considering an examinee with score module 1 lower than or
equal to c and deleting the responses on module 3 means that we consider the
[1]
distribution of X[1] and X[2] conditionally on θ and the event X+ ≤ c:
[1] Pb[1,2] (x[1,2] |θ) [1]

Pb[1,2] (x[1,2] |θ, X+ ≤ c) = [1]
, if x+ ≤ c. (2.2)
Pb[1,2] (X+ ≤ c|θ)
That is, the if refers to conditioning and deleting to integrating out. In the
following, it is to be implicitly understood that conditional distributions are
equal to zero if the conditioning event does not occur in the realization of the
random variable.
chapter 2 13
We now show that the conditional distribution in (2.2) factors as follows:
[1]
Pb[1,2] (x[1,2] |θ, X+ ≤ c)
[1,2] [1] [1,2] [1]
=Pb[1,2] (x[1,2] |x+ , X+ ≤ c)Pb[1,2] (x+ |θ, X+ ≤ c).
[1,2]
That is, the score X+ is sufficient for θ, and hence the conditional probability
[1,2] [1]
Pb[1,2] (x[1,2] |x+ , X+ ≤ c) can be used for making inferences about b[1,2] .
[1,2]
First, we consider the distribution of X[1] and X[2] conditionally on X+ ,
which is known to be independent of θ:
Q [1] [1] Q [2] [2]
[1,2] [1,2] i exp(−xi bi ) j exp(−xj bj )
Pb[1,2] (x |x+ ) = [1,2] )
γx [1,2] (b
+
where
[1,2]
nX
[1,2]
γx[1,2] (b )= γj (b[1] )γx[1,2] −j (b[2] ).
+ +
j=0
[1]
Second, we consider the probability that X+ is lower than or equal to c
[1,2]
conditionally on X+ :
Pc
j=0 γj (b[1] )γx[1,2] −j (b[2] )
[1] [1,2] +
Pb[1,2] (X+ ≤ c|x+ ) = Pn[1,2] .
j=0 γj (b[1] )γ [1,2]
x+ −j
(b[2] )
Hence, we obtain
Q [1] [1] Q [2] [2]
[1,2] [1] [1,2] i exp(−xi bi ) j exp(−xj bj )
Pb[1,2] (x |X+ ≤ c, x+ ) = Pc [1] [2]
. (2.3)
j=0 γj (b )γx[1,2] −j (b ) +
[1,2] [1]
We next consider the distribution of X+ conditionally on θ and X+ ≤ c.
[1] [2]
Since the joint distribution of X+ and X+ conditionally on θ has the following
form:
[1] [2]
γx[1] (b[1] )γx[2] (b[2] ) exp([x+ + x+ ]θ)
[1] [2] + +
Pb[1,2] (x+ , x+ |θ) =P ,
0≤j+k≤n[1,2] γj (b[1] )γk (b[2] ) exp([j + k]θ)
we obtain
[1,2] [1]
[1,2] [1] Pb[1,2] (x+ , X+ ≤ c|θ)
Pb[1,2] (x+ |θ, X+ ≤ c) = [1]
Pb[1,2] (X+ ≤ c|θ)
[1] [1,2]
(b[2] ) exp(x+ θ)
P
j≤c γj (b )γx[1,2]
+ −j
=P [1] [2]
.
0≤j+k≤n[1,2] γj (b )γk (b ) exp([j + k]θ)
j≤c
Finally, we can write the probability for a single examinee in MST who
receives a score lower than or equal to c on module 1:
[1]
Pb[1,2] (x[1,2] |θ, X+ ≤ c)
[1,2] [1] [1,2] [1]
=Pb[1,2] (x[1,2] |x+ , X+ ≤ c)Pb[1,2] (x+ |θ, X+ ≤ c)
Q [1] [1] Q [2] [2] [1,2]
i exp(−xi bi ) j exp(−xj bj ) exp(x+ θ)
= P [1] [2]
. (2.4)
j≤c
Obviously, a similar result holds for an examinee who receives a score higher
than c on module 1 and hence takes module 3. With the results from this
section, we can safely use CML inference, using (2.3) as the conditional
probability.
2.1.2 Comparison with alternative estimation procedures

The first way to deal with an MST design is to ignore the fact that the
assignment of items depends on the examinee’s previous responses. This
means that when an examinee receives a score lower than or equal to c on
module 1, we use the probability of the observations conditionally on θ only
Q [1] [1] Q [2] [2] [1,2]
[1,2]
exp(−xi bi ) j exp(−xj bj ) exp(x+ θ)
i
Pb[1,2] (x |θ) = P [1] [2]
(2.5)
instead of the correct probability in (2.4) as the basis for statistical inferences.
It has been observed that if we use the conditional likelihood corresponding
to the distribution in (2.5) as the basis for estimating the item parameters,
we get bias in the estimators (Eggen & Verhelst, 2011). In Section 2.3.1, we
illustrate this phenomenon. If we compare the probability in (2.4) with that
chapter 2 15
in (2.5), we see that the only difference is in the range of the sum in the
denominators. This reflects that in (2.4) we take into account that values of
[1]
X+ larger than c cannot occur, whereas in (2.5) this is not taken into account.
The second way to deal with an MST design is to separately estimate the
parameters in each step of the design (Glas, 1989). This means that inferences
with respect to X[m] are based on the probability of X[m] conditionally on
[m] [m]
X+ = x+ . This procedure leads to unbiased estimates. However, since the
parameters are not identifiable, we need to impose a separate restriction for
[1] [2]
each stage in the design (e.g., b1 = 0 and b1 = 0). As a consequence, it is
not possible to place the items from different stages in the design on the same
scale. More important, it is not possible to use all available information to
obtain a unique estimate of the ability of the examinee.
Third, we consider the use of MML inference. In the previous section, we
derived the probability function of the data conditionally on the design. For
MML inference, we could use the corresponding marginal (w.r.t. θ) probability
[1]
conditionally on the design (X+ ≤ c):
[1]
Pb[1,2] ,λ (x[1,2] |X+ ≤ c)
Z
[1] [1]
= Pb[1,2] (x[1,2] |θ, X+ ≤ c)fb[1] ,λ (θ|X+ ≤ c)dθ,
Rθ
in which λ are the parameters of the distribution of θ.

If we use this likelihood, we disregard any information about the
parameters that is contained in the (marginal distribution of the) design
[1]
variable: Pb[1] ,λ (X+ ≤ c).
We now consider how we can base our inferences on all available
information: the responses on the routing test X[1] ; the responses on the
other modules that were administered, which we denote by Xobs ; and the
[1]
design variable X+ ≤ c. The complete probability of the observations can be
written as follows:
Pb[1,2,3] (X[1] = x[1] , Xobs = xobs |θ) =Pb[2] (X[2] = xobs |θ)Pb[1] (X[1] = x[1] |θ)
[1]
Pb[1] (X+ ≤ c|X[1] = x[1] )+
Pb[3] (X[3] = xobs |θ)Pb[1] (X[1] = x[1] |θ)
[1]
Pb[1] (X+ > c|X[1] = x[1] ). (2.6)
From this, we immediately obtain the marginal likelihood function:
Pb[1,2,3] (X[1] = x[1] , Xobs = xobs )

Z
= Pb[1,2,3] (X[1] = x[1] , Xobs = xobs |θ)fλ (θ)dθ (2.7)
Rθ
Z
[1]
= Pb[2] (X[2] = xobs |θ)Pb[1] (X[1] = x[1] |θ)fλ (θ)dθ P (X+ ≤ c|x[1] )+
Rθ
Z
[1]
Pb[3] (X = x |θ)Pb[1] (X = x |θ)fλ (θ)dθ P (X+ > c|x[1] ).
[3] obs [1] [1]
Rθ
[1] [1] [1]

Since either P (X+ ≤ c|x[1] ) = 1 and P (X+ > c|x[1] ) = 0, or P (X+ ≤ c|x[1] )
[1]
= 0 and P (X+ > c|x[1] ) = 1, the marginal likelihood function we obtain is
equal to the marginal likelihood function we would have obtained if we had
planned beforehand to which examinees we would administer which modules.
This means that we may safely ignore the design and use a computer program
that allows for incomplete data (e.g., the OPLM program, Verhelst, Glas, &
Verstralen, 1993) to estimate the item and population parameters. This is an
instance of a situation where the ignorability principle applies (Rubin, 1976).
As already mentioned, a drawback of the marginal likelihood approach is
that a random sample from a well-defined population is needed and that
additional assumptions about the distribution of ability in this population
need to be added to the model. In Section 2.3.1, we show that
misspecification of the population distribution can cause serious bias in the
estimated item parameters.
2.1.3 Estimation of person parameters

In principle, it is straightforward to estimate the ability parameter θ of an
examinee who was administered the second module by the maximum likelihood
[1,2]
method from the distribution of the sufficient statistic X+ conditionally on
θ and the design:
[1,2]
γj (b[1] )γx[1,2] −j (b[2] ) exp(x+ θ)
P
[1,2] [1] j≤c +
Pb[1,2] (x+ |θ, X+ ≤ c) = P .
0≤j+k≤n[1,2] γj (b[1] )γk (b[2] ) exp([j + k]θ)
j≤c
chapter 2 17
As usual, we consider the item parameters as known when we estimate ability.

However, as is the case for a single-stage design, the ability is estimated at plus
(minus) infinity for an examinee with a perfect (zero) score and can be shown
to be biased. For that reason, we propose a weighted maximum likelihood
(WML) estimator as Warm (1989) did for single-stage designs.
2.2 Model fit
We have mentioned in the introduction that adaptive designs may be beneficial

for model fit. The arguments were that adaptive designs could probably avoid
different kinds of undesirable behavior, and that more parameters are available
for the same number of observations. In the next paragraph, we elucidate the
latter argument. Thereafter, in order to investigate the model fit, we propose
two goodness of fit tests for MST designs.
2.2.1 Model fit in adaptive testing

The Rasch model is known as a very restrictive model. Consider, for instance,
the marginal model with a normal distribution for the person parameters. In
a linear test design with N items, 2N − 1 probabilities are modeled with only
N + 1 parameters (i.e., N item parameters, plus two parameters for the mean
and standard deviation of the examinee population distribution (µ, and σ,
respectively), minus one parameter that is fixed for scale identification, e.g.,
µ = 0).
However, in the following example, we demonstrate that the Rasch model
is less restrictive in cases with adaptive designs. For this example, consider
a theoretically optimal CAT that selects items one-by-one from an infinitely
large item pool. This implies that knots do not exist in the paths of the
administration design. Consequently, a CAT of length two contains three items,
a CAT of length three contains seven items, and so on: a CAT of length N
contains 2N − 1 items.
Let us consider a CAT of length two. This design contains three items: one
routing item, and two follow up items. Hence, we obtain five parameters in
the model. This design has 22 possible outcomes. Since the probabilities of
these four outcomes sum to one, the model describes 22 − 1 probabilities with
four parameters. This over-parameterization could be solved by fixing another
parameters, for instance, by fixing σ to zero. With this fixation, we obtain the
following probabilities:
1
P (X1 = 0, X2 = 0) = P00 = ;
[1 + exp(−b1 )][1 + exp(−b2 )]
exp(−b2 )
P (X1 = 0, X2 = 1) = P01 = ;
[1 + exp(−b1 )][1 + exp(−b2 )]
exp(−b1 )
P (X1 = 1, X3 = 0) = P10 = ;
[1 + exp(−b1 )][1 + exp(−b3 )]
exp(−b1 − b3 )
P (X1 = 1, X3 = 1) = P11 = .
[1 + exp(−b1 )][1 + exp(−b3 )]
These equations could be transformed into the following equations for b1 , b2 ,

and b3 :

P10 + P11
b1 = − log ;
P01 + P00

P01
b2 = − log ;
P00

P11
b3 = − log .
P10
Two things are worth to be noticed. First, we can see the model is saturated.
Second, since σ was fixed to zero, the model results in person parameters
that are all equal, which is remarkable in a measurement context. This taken
together demonstrates nicely that the Rasch model is not suitable for statistical
inference from a CAT. It could easily be shown the same conclusion holds for
extensions to N items.
For MST designs, we easily find that the Rasch model is less restrictive
compared to linear designs. Consider, for instance, a test of four items per
examinee. In a linear design, we obtain fifteen probabilities and five parameters.
However, for the MST design with two stages with and two items within each
stage, we have six items (seven parameters) to model fifteen observation. Since
model restrictiveness is a ratio of the number of possible observations and
the number of parameters we see that the same model can be more or less
restrictive, depending on the administration design.
chapter 2 19
2.2.2 Likelihood ratio test

In order to evaluate model fit, we propose two tests that are based on the
method that was suggested by Andersen (1973b). He showed that the item
parameters b can be estimated by maximizing the conditional likelihood
PK PN
exp(− p=1 i=1 bi xpi )
L(b) = QK ,
p=1 γx+p (b)
as well as by maximizing L(t) (b), which is the likelihood for the subset of data
for which holds that X+ = t. This conclusion has led to the following likelihood
ratio test (LRT): In the general model, item parameters were estimated for
all score groups separately, while in the special model, only one set of item
parameters was estimated for all score groups together. For a complete design
with N items, Andersen (1973b) considered
N
X −1
Z=2 log[L(t) (b̂(t) )] − 2 log[L(b̂)] (2.8)
t=1
as the test statistic, in which b̂(t) are the estimates that are based on the subset
of data with examinees that have a total score equal to t.
Let us denote Kt as the number of examinees with sum score t. It is shown
that if Kt → ∞ for t = 1, · · ·, N − 1, then Z tends to a limiting χ2 -distribution
with (N − 1)(N − 2) degrees of freedom, i.e., the difference between the number
of parameters in the general model and the specific model.
This LRT can also be applied with incomplete designs. Then (2.8)
generalizes to
g −1
X NX
Z=2 log[L(gt) (b̂(gt) )] − 2 log[L(b̂)], (2.9)
g t=1
where Ng denotes the number of items in booklet g, L(gt) (b̂(gt) ) denotes the
likelihood corresponding to the subset of data with examinees that took booklet
g and obtained a total score t, and b̂(gt) denotes the estimates based on this
subset of data. This statistic can also be applied with an MST design. In that
case, the sum over t has to be adjusted for the scores that can be obtained. We
N [1] items N [2] items N [3] items
N [1] +N [2] +1 minimum and

score groups maximum
scores that do
not provide
N [1] +N [3] +1 statistical
score groups information
scale identification
Figure 2.2: Degrees of freedom in a general booklet design.
will illustrate this for the design in Figure 2.1.

Let N [m] be the number of items in module m. Then the number of
parameters estimated in the specific model is
X
N [m] − 1.
m
One parameter cannot be estimated owing to scale identification. In a general

booklet structure without dependencies between modules, we estimate N [1] +
N [2] − 1 parameters in each score group in booklet 1 and N [1] + N [3] − 1
parameters in booklet 2 (see Figure 2.2). In booklet 1, there are N [1] + N [2] + 1
score groups; in booklet 2, there are N [1] + N [3] + 1 score groups. However,
the minimum and the maximum score groups (dark grey in Figure 2.2) do
not provide statistical information and therefore the number of parameters
estimated in the general model is (N [1] + N [2] − 1)(N [1] + N [2] − 1) + (N [1] +
N [3] − 1)(N [1] + N [3] − 1). Finally, the number of degrees of freedom is
(N [1] + N [2] − 1)(N [1] + N [2] − 1)+

(N [1] + N [3] − 1)(N [1] + N [3] − 1)−
(N [1] + N [2] + N [3] − 1).
chapter 2 21
N [1] items N [2] items N [3] items

scores that do
c + N [2] + 1 not provide
score groups statistical
information
(dark grey),
and scores
that cannot be
N [1] +N [3] −c
obtained (light
score groups
grey)
scale identification
Figure 2.3: Degrees of freedom in an MST design.
The number of parameters of the general model in an MST design is slightly

different, owing to the fact that some scores cannot be obtained. This can be
illustrated by Figure 2.3. In booklet 1, there are c + N [2] + 1 score groups.
The score group t = 0 does not contain statistical information about b[1,2] , as
well as the score group t = c + N [2] about b[2] . In the latter case, all items
in X [2] must have been answered correctly. The same kind of reasoning holds
for booklet 2. The number of parameters estimated in the general model is
(c + N [2] )(N [1] − 1) + (c + N [2] − 1)N [2] + (N [1] + N [3] − c − 1)(N [1] − 1) +
(N [1] + N [3] − c − 2)N [3] . Therefore, the number of degrees of freedom is
(c + N [2] )(N [1] − 1) + (c + N [2] − 1)N [2] +

(N [1] + N [3] − c − 1)(N [1] − 1) + (N [1] + N [3] − c − 2)N [3] −
(N [1] + N [2] + N [3] − 1).
Score Groups In (2.8) and (2.9) the estimation of b(t) is based on the data
with sum score t. Here, t is a single value. In cases with many items, the number
of parameters under the general model becomes huge. Consequently, in some
score groups, there may be little statistical information available about some
parameters, e.g., information about easy items in the highest score groups.
The LRT may then become conservative, since the convergence to the χ2 -
distribution is not reached with many parameters and too little observations.
To increase the power, the procedure can also be based on W sets of sum scores
instead of single values t. Then
W
X
Z=2 log[L(Sv ) (b̂(Sv ) )] − 2 log[L(b̂)],
v=1
in which T is the set of possible sum scores t, v denotes the vth score group,
and Sv ⊂ T such that {S1 , S2 , · · ·, Sv , · · ·, SW } = T .
2.2.3 Item fit test

In the LRT defined above, the null hypothesis is tested against the alternative
hypothesis that the Rasch model does not fit. The result does not provide any
information about the type of model violation on the item level. Instead of a
general LRT, item fit tests can also be used to gain insight into the type of
misfit.
What is known about the maximum likelihood estimates is that
L
b̂(Sv ) → N (b(Sv ) , Σ(Sv ) ),
and, under the null hypothesis that the Rasch model holds,
∀v b(Sv ) = b. (2.10)
Since the Rasch model is a member of the exponential family, the

variance-covariance matrix, Σ, can be estimated by minus the inverse of the
second derivative of the log-likelihood function.
If the Rasch model does not fit, the estimates b(Sv ) can provide useful
information about the type of violation, for instance, if the item characteristic
curve (ICC) has a lower asymptote. In this case, the difference between the
parameters of the score groups will have a certain pattern. This is illustrated
by Figure 2.4. Figure 2.4a symbolizes a case where the Rasch model fits.
Here, all ICCs are parallel. The estimate of the item parameter (i.e., the scale
value that corresponds to a probability of 0.5 of giving a correct response to
that item) in the lower scoring group (solid arrow) is expected to be the same
chapter 2 23
1.0
1.0
0.8
0.8
0.6
0.6
P (Xi |θ)
P (Xi |θ)
0.4
0.4
0.2
0.2
0.0
0.0
-6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6
θ θ
(a) A fitting item (b) A non-fitting item
Figure 2.4: Parameter estimates under the Rasch model in three score groups.
as in the middle (dashed arrow) and the higher score group (dotted arrow).
However, if an item has an ICC with a lower asymptote (see Figure 2.4b),
then the estimates of the lower and the middle score groups will be different,
while the estimates of the middle and the high score groups are expected to
be almost the same.
2.3 Examples
In this section, we demonstrate some properties of CML inference in MST

with a simulation study and a real data example. In Section 2.3.1, we will
first describe the design of the simulation study. Then, we will compare the
inference from the correct conditional likelihood with the incorrect inference
from ordinary CML and from MML, in which the population distribution is
misspecified. Finally, we will demonstrate the robustness and efficiency of the
MST design. In Section 2.3.2, we will demonstrate with real data the benefits
of MST on model fit.
2.3.1 Simulation
Test and population characteristics
The first three examples are based on simulated data. We considered a test
of 50 items that was divided into three modules. The first module (i.e., the
routing test) consisted of 10 items with difficulty parameters drawn from a
uniform distribution over the interval from -1 to 1. The second and third
module both consisted of 20 items with difficulty parameters drawn from a
uniform distribution over the interval from -2 to -1 and the interval from 0
to 2, respectively. The person parameters were drawn from a mixture of two
normal distributions: with probability 2/3, they were drawn from a normal
distribution with expectation -1.5 and standard deviation equal to 0.5; with
probability 1/3 they were drawn from a normal distribution with expectation
1 and standard deviation equal to 1. When the test was administered in an
MST design, the cut-off score, c, for the routing test was 5.
Comparison of methods
In the first example, 10,000 examinees were sampled and the test was
administered in an MST design. The item parameters were estimated
according to three methods: first, according to the correct conditional
likelihood as in (2.3); second, according to an ordinary CML method that
takes into account the incomplete design, but not the multistage aspects of
the design; and third, the MML method, in which the person parameters are
assumed to be normally distributed. The scales of the different methods were
equated by fixing the first item parameter at zero.
The average bias, standard errors (SE), and root mean squared error
(RMSE) are per method and per module displayed Table 2.1. Both ordinary
CML and MML inference lead to serious bias in the estimated parameters.
The standard errors were nearly the same between the three methods.
Therefore, finally, the RMSEs of the proposed CML method are much lower
than the RMSEs of the ordinary CML and MML methods.
chapter 2 25
Table 2.1: Average bias, standard error (SE), and root mean squared error (RMSE)
of the item parameters per module.
method module 1 module 2 module 3

BIAS(δ̂, δ) MST CML 0.000 -0.001 -0.001
Ordinary CML 0.001 -0.089 0.291
Ordinary MML -0.003 0.097 -0.345
SE(δ̂) MST CML 0.033 0.036 0.055

Ordinary CML 0.034 0.037 0.052
Ordinary MML 0.030 0.035 0.052
RMSE (δ̂) MST CML 0.033 0.036 0.055

Ordinary CML 0.043 0.096 0.295
Ordinary MML 0.047 0.104 0.349
Goodness of fit
In a second simulation study, we demonstrated the model fit procedure that is
described in Section 2.2. The simulation consisted of 1,000 trials. In each trial,
three different cases were simulated.
• Case 1: the MST design described above.
• Case 2: a complete design with all 50 items, except for the easiest item
in module 3. The excluded item was replaced by an item according to
the 3-parameter logistic model (3PLM, Birnbaum, 1968) which is defined
as follows:
K Y N
exp[ai (θp − bi )xpi ]
Y
P (X = x|θ, a, b, c) = ci + (1 − ci ) ,
p=1 i=1
1 + exp[ai (θp − bi )]
(2.11)
where, compared to the Rasch model, ai and ci are additional parameters
for item i. This 3PLM item has the same item difficulty (i.e., the b-
parameter) as the excluded item. However, instead of a = 1 and c = 0,
which would make (2.11) equal to (2.1), we now have for this item a =
1.2 and c = 0.25. The slope (i.e., the a-parameter) was slightly changed,
so that the ICC is more parallel to the other ICCs.
• Case 3: an MST with the items of case 2.

1.0
1.0
0.8
0.8
0.6
0.6
P (Xi |θ)
P (Xi |θ)
0.4
0.4
0.2
0.2
0.0
0.0
-3 -2 -1 0 1 2 3 -3 -2 -1 0 1 2 3
θ θ
(a) Case 1 (b) Case 2 and 3
Figure 2.5: (a) The ICCs of the 50 Rasch items for case 1. (b) The ICCs of the 49
Rasch items (gray), and the ICC of the 3PLM item (bold black) for case 2 and 3.
The ICCs of case 1 to 3 are displayed in Figure 2.5. Data were generated
for a sample of 10,000 examinees and the item parameters of the Rasch model
were estimated for each case. For the three cases above, an LRT as well as
item fit tests were performed in each trial based on five score groups in each
booklet. The score groups were constructed such that within each booklet the
examinees were equally distributed over the different score groups. The number
of degrees of freedom in cases 1 and 3 is
2 (number of booklets) ×
5 (number of score groups per booklet) ×
29 (number of estimated parameters per score group) −
49 (number of estimated parameters in the specific model)
= 241,
chapter 2 27
Table 2.2: Results Kolmogorov-Smirnov test for testing the p-values of the LRTs
against a uniform distribution.
Case D− p-value
Case 1 0.016 0.774
Case 2 0.968 <0.001
Case 3 0.048 0.100
and in case 2
5 (number of score groups per booklet) ×

49 (number of estimated parameters per score group) −
49 (number of estimated parameters in the specific model)
= 196.
Likelihood Ratio Test If the model fits, then the p-values of the LRTs and
the item fit tests are expected to be uniformly distributed over replications of
the simulation. This hypothesis was checked for each case with a Kolmogorov-
Smirnov test. The results are shown in Table 2.2. It can be seen that the Rasch
model fits in cases 1 and 3, but not in case 2.
Item Fit Test The distribution of the p-values of the item fit statistics is
displayed graphically by QQ-plots in Figure 2.6. The item fit tests clearly
mark the misfitting item in case 2. Notice that, as explained in Section 2.2.3,
the item fit test in case 2 shows an effect between the lower score groups (i.e,
between group 1 and 2, between group 2 and 3, and between group 3 and 4),
while the p-values of the item fit tests between score groups 4 and 5 are nearly
uniformly distributed.
Efficiency
The relative efficiency of an MST design is demonstrated graphically by the
information functions in Figure 2.7. Here, the information of three different
cases is given: All 50 items administered in a complete design, the average
information over 100 random samples of 30 of the 50 items administered in a
complete design, and the MST design described before. In the MST design,
the total test information is
[1] [1]
I(θ) = I [1,2] (θ)P (X+ ≤ c|θ) + I [1,3] (θ)P (X+ > c|θ).
Here, I [1,2] (θ) denotes the Fisher information function for modules 1 and 2.
The distribution of θ is also shown in Figure 2.7. It can be seen that, for
most of the examinees in this population, the MST with 30 items is much more
efficient than the linear test with 30 randomly selected items. In addition, for
many examinees, the information based on the MST is not much less than the
information based on all 50 items.
2.3.2 Real data

The data for the following examples were taken from the Dutch Entrance Test
(in Dutch: Entreetoets), which consists of multiple parts that are
administered annually to approximately 125,000 grade 5 pupils. In this
example, we took the data from 2009, which consists of 127,746 examinees.
One of the parts is a test with 120 math items. To gain insight into the item
characteristics, we first analyzed a sample of 30,000 examinees3 with the
One-Parameter Logistic Model (OPLM, Verhelst & Glas, 1995; Verhelst et
al., 1993). The program contains routines to estimate integer item
discrimination parameters, as well as item difficulty parameters.
The examples in this section illustrate the two factors by which model fit
could be improved with MST designs. First, the difference in restrictiveness
of the same model in different administration designs, and second, the
avoidance of guessing owing to a better match between item difficulty and
examinee proficiency.
Better fit owing to more parameters

In Section 2.2.1, we explained that the restrictiveness of measurement models
depends on the administration design. In order to demonstrate this, two small
examples are given.
3 A sample had to be drawn because of limitations of the OPLM software package w.r.t.
the maximum number of observations.

chapter 2 29
Table 2.3: Item characteristics of nine selected math items in Example 1.
Item no. ai bi prop. correct Item no. ai bi prop. correct

19 4 0.311 0.529 85 4 0.247 0.576
30 2 0.311 0.520 88 3 0.194 0.597
66 2 0.334 0.510 110 3 0.372 0.488
79 3 0.435 0.450 118 4 0.254 0.571
83 2 0.402 0.479
In the first example, nine items were randomly selected from the set of 120
math items. The items were sorted based on the proportion correct in the
original data set. Then they were assigned to two designs:
• a MST design with the three most easy items in module 2, the three most
difficult items in module 3, and the remaining three items in module 1;
• a linear test design with six items, namely the first two of each module.
In the MST design, module 2 will be administered to examinees with a sum

score 0 or 1 on module 1, while module 3 will be administered to examinees
with a sum score 2 or 3 on module 1. Observe that in both designs six items
are administered to each examinee, so in both cases 64 (26 ) different response
patters could occur. However, in the MST case, the Rasch model has 8 free
item parameters to model 63 probabilities, while in the linear test only 5 free
parameters are available. Since the number of different score patterns is limited,
model fit could be evaluated by a comparison between the observed frequencies
(O), and the expected frequencies according to the model (E). The difference
between the two could be summarized with the total absolute difference (TAD):
X
T AD = |Ox − Ex |,
x
in which Ox and Ex are the observed and expected frequency of response

pattern x.
The sampling of items was repeated in 1,000 trials. In each trial, parameters
of both designs were estimated and the TAD for both designs was registered.
The mean TAD over the 1,000 trials was 11,317 for the linear design, while it
was 9,432 for the MST design.
In the second example, nine particular items were selected. The item
characteristics of these items in the original test, based on the OPLM model
(Verhelst & Glas, 1995), are displayed in Table 2.3. The focus in this example
is not on the variation in b-parameters, but on the variation in a-parameters.
With these nine selected items, two administration designs are simulated:
1. a linear test with item 30, 66, 79, 85, 110, and 118
2. a MST with the following modules:
• module 1 (routing test): item 79, 88, and 110 (ai = 3)

• module 2: item 30, 66, and 83 (ai = 2)
• module 3: item 19, 85, and 118 (ai = 4)
Observe that for an individual examinee the maximum difference in

a-parameters is two within the linear test, while it is only one within a
booklet of the MST. We expect that the model fit is better in the second
case, because we avoid that items that have large differences in a-parameters
are assigned to the same examinee.
For both cases, the Rasch model was fitted on the data of the total sample
of 127,746 examinees. The LRTs, based on two score groups, confirm the lack
of fit of both cases, Z(5) = 1,660.16, p < 0.001, and Z(12) = 139.44, p <
0.001, respectively. However, the ratio Z/df indicates that the fit of the Rasch
model is substantially better in the MST design compared to the linear
design. This observation is confirmed by the TAD statistics. The TAD of the
linear test was 15,376, while the TAD of the MST was 4,453.
Better fit owing to avoidance of guessing

For the following example, we have selected 30 items that seem to have parallel
ICCs, although the LRT, based on two score groups, indicated that the Rasch
model did not perfectly fit, Z(29) = 400.93, p < 0.001. In addition to these
30 items, also one 3PLM item was selected. We can consider this example as
an MST by allocating the items to three modules, after which the data of the
examinees with a low (high) score on the routing test are removed from the
third (second) module.
In order to demonstrate the item fit tests, we drew 1,000 samples of 1,000
examinees from the data. First, we estimated the parameters of the 30 Rasch
chapter 2 31
items with a complete design and an MST design. In both cases, all items seem
to fit the Rasch model reasonably well (see Figure 2.8a and Figure 2.8b).
Then we added the 3PLM item to the Rasch items and again analyzed the
complete design and the MST design. It can be seen from Figure 2.8c that
the 3PLM item shows typical misfit in the complete design. The item fit test
was based on three score groups. There is a substantial difference between
the parameter estimates of the lower and the middle score group, while there
seems to be a little difference between the estimates of the middle and the
higher score groups. If the 3PLM item is administered in the third module of
an MST design, the fit improves substantially (see Figure 2.8d).
2.4 Discussion
In this paper, we have shown that the CML method is applicable with data
from an MST. We have demonstrated how item parameters can be estimated
for the Rasch model, and how model fit can be investigated for the total test,
as well as for individual items.
It is known that CML estimators are less efficient than MML estimators.
When the requirements of the MML method are fulfilled, then the MML
method may be preferable above the CML method. However, in practice, for
instance in education, the distribution of person parameters may be skewed
or multi-modal owing to all kinds of selection procedures. It was shown in an
example that, when the population distribution is misspecified, the item
parameters can become seriously biased. For that reason, in cases where not
so much is known about the population distribution, the use of the CML
method may be preferable.
In this paper, we have used the Rasch model in our examples. Although
the Rasch model is known as a restrictive model, we have emphasized that the
Rasch model is less restrictive in adaptive designs compared to linear designs.
However, if more complicated models are needed, then it should be clear that
the method can easily be generalized to other exponential family models, e.g.,
the OPLM (Verhelst & Glas, 1995) and the partial credit model for polytomous
items (Masters, 1982).
Our presumption was that adaptive designs are more robust against
undesirable behavior like guessing and slipping. This has been illustrated by
the simulation in Section 2.3.1. The fit for case 1 and the lack of fit for case 2
were as expected. However, notice that the Rasch model also fits for case 3.
In that case, one of the items is a 3PLM item, but this item was only
administered to examinees with a high score on the routing test, i.e.,
examinees with a high proficiency level. In general, it could be said that
changing the measurement model into a more complicated model is not the
only intervention possible in cases of misfit. Instead, the data generating
design could be changed. The example with real data in Section 2.3.2 did
show that this could also be done afterward. This means that a distinction
could be made between multistage administration and multistage analysis.
Data obtained from a linear test design can be turned into an MST design for
the purpose of calibration. However, this raises the question how to estimate
person parameters in this approach. Should they be based on all item
responses, or only the multistage part with which the item parameters were
estimated? The answer to this question is left for future research.
The design can also be generalized to more modules and more stages, as
long as the likelihood on the design contains statistical information about the
item parameters. It should however be kept in mind that estimation error
with respect to the person parameters can be factorized into two components:
the estimation error of the person parameters conditional on the fixed item
parameters, and the estimation error of the item parameters. The latter part
is mostly ignored, which is defensible when it is very small compared to the
former part. However, when stages are added, while keeping the total number
of items per examinee fixed, more information about the item parameters is
kept in the design, and therefore less information is left for item parameter
estimation. A consequence is that the estimation error with respect to the the
item parameters will increase. When many stages are added, it is even possible
that the increase of estimation error of the item parameters is larger than the
decrease of estimation error of the person parameters conditional on the fixed
item parameters. An ultimate case is a CAT, in which all information about
the item parameters is kept in the design and where no statistical information
is left for the estimation of item parameters. This implies that adding more
and more stages does not necessarily lead to more efficiency. Instead, there
exists an optimal design with respect to the efficiency of the estimation of the
person parameters. Finding the solution with respect to this optimum is left
chapter 2 33
open for further research.

1.0
Rasch items
0.8
0.6
p-value
0.4
0.2
0.0
0.0 0.2 0.4 0.6 0.8 1.0
Uniform Distribution
(a) Case 1
1.0
1.0
Rasch items Rasch items

3PLM score gr1 - gr2 3PLM score gr1 - gr2
0.8
0.8

0.6
0.6
p-value
p-value
0.4
0.4
0.2
0.2
0.0
0.0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Uniform Distribution Uniform Distribution
(b) Case 2 (c) Case 3
Figure 2.6: QQ-plots of the p-values of the item fit tests against the quantiles of a
uniform distribution.
chapter 2 35
10
50 items
2
30 items (multistage)
30 items (random out of 50)
density theta
1.6
8
1.2
6
f (θ)
I(θ)
0.8
4
0.4
2
0
-3 -2 -1 0 1 2 3
Figure 2.7: Person information I(θ) in a complete design with 50 items, an MST
design with 30 items, and a complete design with 30 items, given the density f (θ).
1.0
1.0
0.8
0.8
0.6
0.6
p-value
p-value
0.4
0.4
0.2
0.2
0.0
0.0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
(a) Rasch - Complete design (b) Rasch - MST design

1.0
1.0

3PLM score gr low - mid 3PLM score gr low - mid
3PLM score gr mid - high 3PLM score gr mid - high
0.8
0.8
0.6
0.6
p-value
p-value
0.4
0.4
0.2
0.2
0.0
0.0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
(c) Rasch & 3PLM - Complete design (d) Rasch & 3PLM - MST design
Figure 2.8: QQ-plots of the p-values of the item fit tests from the Entrance Test
example against the quantiles of a uniform distribution.
Chapter 3
Ordering Individuals with Sum Scores:

the Introduction of the Nonparametric
Rasch Model
Summary
When a simple sum or number correct score is used to evaluate the ability of individual
testees, then, from an accountability perspective, the inferences based on the sum
score should be the same as inferences based on the complete response pattern. This
requirement is fulfilled if the sum score is a sufficient statistic for the parameter of
a unidimensional model. However, the models for which this does hold, are known
as being restrictive. It is shown that the less restrictive (non)parametric models
could result in an ordering of persons that is different compared to an ordering based
on the sum score. To arrive at a fair evaluation of ability with a simple number
correct score, ordinal sufficiency is defined as a minimum condition for scoring. The
Monotone Homogeneity Model, together with the property of ordinal sufficiency of
the sum score, is introduced as the nonparametric Rasch Model (npRM). A basic
outline for testable hypotheses about ordinal sufficiency, as well as illustrations with
real data, are provided.
This chapter has been conditionally accepted for publication as: Zwitser, R.J. & Maris, G.
(submitted). Ordering Individuals with Sum Scores: the Introduction of the Nonparametric
Rasch Model. Psychometrika.
38 the nonparametric rasch model
3.1 Introduction
One of the elementary questions in psychological and educational measurement

is: how to score a test? Usually, tests consists of multiple items about the same
topic. One of the issues is whether the scores on the individual items could
fairly be summarized with only one total score, or whether multiple sub scores
are needed. The answer to this question could be justified with the use of item
response theory (IRT) models. If a unidimensional model fits the data, then it
is defensible to report only one total score per person.
Assume that we have a unidimensional test, then the next question is: how
should the total score be computed? One approach could be to estimate the
person parameter, and report these to the testees. However, although this
approach might be intuitively clear for those who have a basic knowledge of
statistics, for the general public a more acceptable approach to communicating
test results might be via an observed score, specifically the sum score (Sijtsma
& Hemker, 2000).
But if someone wants to report sum scores instead of parameter estimates,
then, from an accountability perspective, the following question does arise:
are inferences based on the sum score the same as inferences based on the
parameter estimate? In case of the Rasch model (RM, Rasch, 1960; Fischer,
1974; Hessen, 2005; Maris, 2008) the answer is clearly yes, because in this model
the sum score is a sufficient statistic for the person parameter. This property, as
we will explain in section 3.3, implies that all available information in the data
about the ordering of individual testees is in correspondence with the ordering
of the sum scores. However, the RM is known as a restrictive model. One of
the less restrictive alternatives is the nonparametric Monotone Homogeneity
Model (MHM, Mokken, 1971; see also Sijtsma & Molenaar, 2002). A well-
known property of this model is that the person parameters are stochastically
ordered by the sum score (Mokken, 1971; Grayson, 1988; Huynh, 1994). This
property is very useful for comparisons between groups of persons, because it
implies that testees with a higher sum score have on average a higher value of
the person parameter than testees with a lower sum score. However, it will be
demonstrated in section 3.2.2 that this property is not satisfactory for making
ordinal inferences about individual testees, because the ordering based on the
sum score could be different compared to the ordering of the parameters based
on the available item responses. To arrive at a less restrictive nonparametric
chapter 3 39
model that enables the ordering of individuals based on the sum score, we define
the minimal condition in section 3.3: ordinal sufficiency. With this property we
can introduce the nonparametric Rasch Model. In section 3.4 we derive some
testable implications of ordinal sufficiency. This is illustrated with an example
based on real data.
3.2 Some models under consideration
All IRT models considered in this paper are unidimensional monotone latent
variable models for dichotomous responses, i.e., they all assume at least
Unidimensionality (UD), Local Independence (LI) and Monotonicity (M).
The score on item i is denoted by Xi : Xi = 1 for a correct response and
Xi = 0 otherwise. Let the random vector X = [X1 , X2 , · · ·, Xp ] be the total
score pattern on a test with p items and let x denote a realization of X. The
person parameter, sometimes refered to as ability parameter or latent trait, is
denoted by θ.
3.2.1 Parametric IRT models

Examples of parametric unidimensional monotone latent trait models are the
Rasch Model (RM, Rasch, 1960),
exp(θ − δi )
P (Xi = 1|θ) = P (xi |θ) = ,
1 + exp(θ − δi )
and the Two-Parameter Logistic Model (2PLM, Birnbaum, 1968),
exp[αi (θ − δi )]
P (xi |θ) = ,
1 + exp[αi (θ − δi )]
in which αi and δi are parameters related to item i. Both models contain

sufficient statistics for their parameters.
Definition 1. A statistic H(X) is sufficient for parameter θ if the conditional

distribution of X, given the statistic H(X), does not depend on the parameter
θ, i.e.,
P (X = x|H(X) = a, θ) = P (X = x|H(X) = a).
Sufficiency implies that all statistical information in the data X about the
parameter θ is kept by the statistic H(x). It has already been mentioned that
in the RM the sum score X
Xi = X+
i
is a sufficient statistic for θ. Another well-known example of a sufficient statistic

is the weighted sum score X
αi Xi
i
in the 2PLM, if the weights are known. Therefore, we can easily demonstrate
that in a case where the 2PLM fits the data well, inferences based on θ could
be different compared to inferences based on X+ .
In section 3.3.2 we will also consider the Normal Ogive Model (NOM, Lord
& Novick, 1968),
θ−δi
−t2

1
Z
P (xi |θ) = √ exp dt = Φ(θ − δi ).
−∞ 2π 2
3.2.2 Nonparametric IRT models

A well-known nonparametric model is the Monotone Homogeneity Model
(MHM, Mokken, 1971; see also Sijtsma & Molenaar, 2002). The MHM only
assumes UD, LI and M. For the MHM, it has been shown that X+ has a
likelihood ratio ordering in θ (Grayson, 1988; Huynh, 1994), i.e.,
P (X+ = a|θ2 ) P (X+ = a|θ1 )

∀a > b, θ2 > θ1 : ≥ . (3.1)
P (X+ = b|θ2 ) P (X+ = b|θ1 )
From (3.1) it can easily be derived that
P (Θ > s|X+ = a) ≥ P (Θ > s|X+ = b), (3.2)
for all s, and a > b. The property in (3.2) is called stochastic ordering of the
latent trait by X+ (SOL; Hemker, Sijtsma, Molenaar, & Junker, 1997), also
chapter 3 41
denoted by1
(Θ|X+ = a) ≥ (Θ|X+ = b), if a > b,
st
which equals
E(g(Θ)|X+ = a) ≥ E(g(Θ)|X+ = b)
for all a > b, and all bounded increasing functions g (Ross, 1996, prop. 9.1.2.).
This implies that all statistics for central tendency of Θ, e.g., the median, mode,
or mean, are ordered by X+ .
The SOL property has been used as justification for ordering individuals
with the sum score (see, e.g., Mokken, 1971, and Meijer et al., 1990). However,
for making ordinal inferences about individuals (e.g., passing or failing an exam)
this property might not be sufficient. Consider, for instance, a test with three
items that satisfy the assumptions of the MHM. The first item is a Guttman
item (Guttman, 1950), whereas the last two items have a constant probability
of success, e.g., P (xi |θ) = 0.5. Next, consider two persons. The first person
answers the second and third item correct, while the second person only answers
the first item correct. According to the SOL property, we would conclude that
(Θ|X+ = 2) ≥ (Θ|X+ = 1).

st
However, the item characteristics above imply that
(Θ|X = [0, 1, 1]) < (Θ|X = [1, 0, 0]).

st
Recall that the models considered in this paper are all unidimensional models.
This implies all available information in the data about individual differences
can be summarized with only one score per subject. The accountability issue
mentioned above can be rephrased into the question whether the ordering based
on the sum score is the same as the ordering based on the complete response
pattern. This example demonstrates that the answer to this question is no for
1 In general,
X ≥ Y denotes P (X > a) ≥ P (Y > a) for all a,
st
X > Y denotes P (X > a) > P (Y > a) for all a,
st
X = Y denotes P (X > a) = P (Y > a) for all a.
st
the MHM.
So far, the only model that satisfies this condition is the RM. However,
the RM is known as a restrictive model, which leads to the wish for less
restrictive nonparametric alternatives (Meijer et al., 1990). This alternative
will be considered at the end of the next section.
3.3 Sufficiency
Before we propose a nonparametric alternative that justifies the use of the

sum score for the purpose of individual measurement (section 3.3.3), we first
describe the property of sufficiency in more detail. In section 3.3.1 we describe
the condition under which a sufficient statistic exists. This leads to another
representation of sufficiency whereby we can easily propose ordinal sufficiency
as a weaker form of sufficiency that still enables ordinal measurement with an
observed score (section 3.3.2).
3.3.1 The existence of a sufficient statistic

The derivations in this section are based on the work of Milgrom (1981). We
start with three lemmas.
Lemma 1. X ≥ Y if and only if E(g(X)) ≥ E(g(Y )) for all bounded non-

st
decreasing functions g.
Proof. See Ross (1996, prop. 9.1.2.).
Lemma 2. The distribution of X conditionally on Θ has monotone likelihood

ratio (MLR) if and only if
∀x1 , x2 , Θ :(Θ|X = x2 ) > (Θ|X = x1 )

st
or
(Θ|X = x2 ) < (Θ|X = x1 )
st
or
(Θ|X = x2 ) = (Θ|X = x1 ).
st
chapter 3 43
Proof. (if)
∀x : P (Θ ≤ θ|X = x)
is a bounded non-decreasing function of θ. Hence, if we assume for x1 and x2

that (Θ|X = x2 ) > (Θ|X = x1 ), we infer that
st
∀x3 : E[P (Θ|X = x3 )|X = x2 ] > E[P (Θ|X = x3 )|X = x1 ].
Since
∀x : E[P (Θ|X = x)|X = x] = 1/2,
we obtain that
E[P (Θ|X = x1 )|X = x2 ] > E[P (Θ|X = x2 )|X = x1 ],
or explicitly
∞ θ
P (x1 |θ∗ )f (θ∗ ) P (x2 |θ)f (θ) ∗
Z Z
dθ dθ
−∞ −∞ P (x1 ) P (x2 )
Z ∞Z θ
P (x2 |θ∗ )f (θ∗ ) P (x1 |θ)f (θ) ∗
> dθ dθ,
−∞ −∞ P (x2 ) P (x1 )
and hence
Z ∞ Z θ
[P (x1 |θ∗ )P (x2 |θ) − P (x2 |θ∗ )P (x1 |θ)]f (θ∗ )f (θ)dθ∗ dθ > 0. (3.3)
−∞ −∞
Notice that Lemma 2 holds for all Θ, which denotes the random variable
(uppercase). This implies that (3.3) does hold for every prior f (θ). Therefore,
∀θ∗ < θ : P (x1 |θ∗ )P (x2 |θ) > P (x2 |θ∗ )P (x1 |θ),
which completes the first part of the proof.

(only if) This part of the proof is trivial as MLR implies stochastic ordering
(see, e.g., Ross, 1996).
Lemma 3. If the distribution of X conditionally on Θ has monotone likelihood

ratio (MLR), then there exists a function H such that both X⊥⊥ θ|H(X) and
H(X)|Θ has MLR.
Proof. Let x1 and x2 be such that (Θ|X = x1 ) = (Θ|X = x2 ), and let g be a

st
non-decreasing bounded function such that
H(x) = E(g(Θ)|X = x)
then H(x1 ) = H(x2 ) and
P (x1 )
P (x1 |θ) = P (x2 |θ)
P (x2 )
such that
P (x2 |θ)
P (X = x2 |H(X) = H(x2 ), θ) = P
x1 :H(x1 )=H(x2 ) P (x1 |θ)
P (x2 |θ)
=P P (x1 )
x1 :H(x1 )=H(x2 ) P (x2 ) P (x2 |θ)
P (x2 )
=P ,
x1 :H(x1 )=H(x2 ) P (x1 )
which does not depend on θ, and therefore completes the first part of the proof.
For the second part, let x1 and x2 be such that (Θ|X = x2 ) > (Θ|X = x1 ).
st
Then, obviously, H(x2 ) > H(x1 ). Since (Θ|X = x) = (Θ|H(X) = H(x)) we
st
obtain that (Θ|H(X) = H(x2 )) > (Θ|H(X) = H(x1 )), and the conclusion
st
follows from Lemma 2.
With these lemmas we can now describe under which conditions a sufficient
statistic does exist.
Theorem 1.
∃H : X⊥
⊥ Θ|H(X)
chapter 3 45
if and only if
∀x1 , x2 , Θ :(Θ|X = x2 ) > (Θ|X = x1 ),

st
or
(Θ|X = x2 ) < (Θ|X = x1 ),
st
or
(Θ|X = x2 ) = (Θ|X = x1 ).
st
Proof. Direct from Lemma 2 and 3.
With this representation of sufficiency, we can introduce the minimal

condition for ordering persons with an observed score.
3.3.2 Ordinal sufficiency

From Theorem 1 it follows that sufficiency of statistic H implies that
(Θ|X = x2 ) > (Θ|X = x1 ), if H(x2 ) > H(x1 ), (3.4)

st
and
(Θ|X = x2 ) = (Θ|X = x1 ), if H(x2 ) = H(x1 ). (3.5)
st
The core of this paper is the following: if the purpose of a test is to order
subjects, then (3.4) is the only property of interest: if we order subjects based
on an observed score, then their posterior distributions of Θ should be
stochastically ordered in the same direction. Therefore, we call the condition
in (3.4) ordinal sufficiency (OS).
Definition 2. A statistic H(X) is ordinally sufficient for Θ if H(x2 ) > H(x1 )

implies (Θ|X = x2 ) > (Θ|X = x1 ).
st
OS allows the ordering based on H to be coarser than the ordering based
on the response patterns. That is, the following can occur:
(Θ|X = x2 ) > (Θ|X = x1 ), for some x2 and x1 , for which H(x2 ) = H(x1 ).
st
In the next section, we consider for some specific IRT models whether the
sum score is ordinal sufficient for θ.
Determining ordinal sufficiency of the sum score in a

particular model
Normal Ogive Model The following counter example shows that for the
NOM the sum score is not ordinal sufficient for θ.
Consider a test with 9 items of which the item parameters are
δ = [δ1 , δ2 , · · ·, δ9 ] = [2, 2, 2, 2, 2, −2, −2, −2, −2].
Furthermore, assume that

Θ ∼ N (0, 1).
The δi parameters indicate that the first 5 items are difficult and that the last
4 items are easy.
Consider the following two answer patterns:
• x1 = [1, 1, 1, 1, 1, 0, 0, 0, 0];
• x2 = [0, 0, 0, 0, 0, 1, 1, 1, 1].
The posterior distributions of Θ for these two answer patterns are displayed
in Figure 3.1. From this counter example it can be seen that these posterior
distributions are not stochastically ordered.
2PL Model Consider two response vectors x1 and x2 . These vectors can,
after applying the same permutation of indices to both, be expressed as
x1 = y ∪ (1 − z),
x2 = y ∪ z,
in which y is the common part of x1 and x2 .

It is derived in the appendix that X+ is ordinal sufficient for θ if for an item
response model P (xi |θ) it can be shown that
 P (zg |θ2 )
  P (z |θ ) 
g 2
X 1−P (zg |θ2 ) X 1−P (zg |θ2 ) n
zg log   > (1−zg ) log   , θ2 > θ1 , z+ > ,
g
P (zg |θ1 )
g
P (zg |θ1 ) 2
1−P (zg |θ1 ) 1−P (zg |θ1 )
(3.6)
chapter 3 47
1.0
x1
x2
0.8
F(θ|X = x)
0.6
0.4
0.2
0.0
-1.0 -0.5 0.0 0.5 1.0
Figure 3.1: Posterior distributions of θ for two response patterns under the NOM.
in which z is the subset of n items to which is responded differently in x1 and

x2 ,
Xn
z+ = zg ,
g=1
and where it is assumed that x+2 > x+1 .

For the 2PLM, (3.6) results in
exp[αg (θ2 − δg )] exp[αg (θ2 − δg )]

X X
zg log > (1 − zg ) log
g
exp[αg (θ1 − δg )] g
exp[αg (θ1 − δg )]
⇓
X X
zg αg (θ2 − θ1 ) > (1 − zg )αg (θ2 − θ1 ), θ2 > θ1
g g
⇓
X X n
αg > αg , z+ > . (3.7)
g:z=1 g:z=0
2
Here it can be seen that OS of X+ depends on the αi parameters.

Let α be the vector of all parameters αi , i.e.,
α = [α1 , α2 , ..., αp ]
and let α0 and α∗ be subsets of α such that
α0 ∪ α∗ = α,
α0 ∩ α∗ = ∅,
dim(α0 ) > dim(α∗ ).
Following from (3.7), X+ is ordinal sufficient if

X X
∀α0 , α∗ : αi0 > αj∗ . (3.8)
i j
It can be seen that (3.8) holds if, for p even, the sum of the smallest p2 + 1

elements of α is larger than the sum of the p2 − 1 largest elements of α. If p

is odd, then the sum of the smallest p2 + 12 elements of α has be larger than

the the sum of the largest p2 − 12 elements of α.

Nonparametric IRT Models In Section 3.2.2, it was shown that in the

MHM the sum score is not ordinal sufficient. In this section, we consider two
additional assumptions. The first is invariant item ordering (IIO):
P (x1 |θ) ≤ P (x2 |θ) ≤ · · · ≤ P (xp |θ), for all θ.
The MHM model together with IIO, is known as the Double Monotonicity
Model (DMM, Mokken, 1971; see also Sijtsma & Molenaar, 2002). The second
assumption is monotone traceline ratio (MTR, Post, 1992):
P (xi |θ)
is a non-decreasing function of θ, for all i < j.
P (xj |θ)
In order to show that both the addition of IIO and MTR to the MHM do not
result in a model with an ordinal sufficient sum score, we, once again, consider
an example of a three-item test. The item response functions (IRFs) are as
chapter 3 49
1.0
0.8
P (Xi = 1|θ)
0.6
0.4
0.2
0.0
-3 -2 -1 0 1 2 3
Figure 3.2: IRFs for the three items in (3.9): P (x1 |θ) (solid), P (x2 |θ) (dashed), and
P (x3 |θ) (dotted).
follows:
exp(θ)
P (x1 |θ) =
exp(θ) + 1
exp(θ) + 1.2
P (x2 |θ) = (3.9)
exp(θ) + 2
exp(θ) + 1
P (x3 |θ) =
exp(θ) + 1.2
depicted in Figure 3.2. These IRFs satisfy IIO,
P (x1 |θ) ≤ P (x2 |θ) ≤ P (x3 |θ), for all θ,

1.0
x1
x2
0.8
F(θ|X = x)
0.6
0.4
0.2
0.0
-1.5 -1.0 -0.5 0.0 0.5 1.0 1.5
Figure 3.3: Posterior distributions of θ for two response patterns under the DMM
and MTR.
as well as MTR:
P (x1 |θ)
increasing in θ;
P (x2 |θ)
P (x1 |θ)
increasing in θ;
P (x3 |θ)
P (x2 |θ)
increasing in θ.
P (x3 |θ)
Now consider two particular response patterns:
• x1 = [1, 0, 0];
• x2 = [0, 1, 1].
The posterior distributions of θ given x1 and x2 , based on a standard normal

distribution of Θ, are displayed in Figure 3.3. This counter example shows that
the additional assumptions of IIO and MTR to the MHM does not lead to a
model with an ordinal sufficient sum score.
chapter 3 51
3.3.3 Nonparametric Rasch model

None of the nonparametric models described above have an ordinal sufficient
sum score for θ. Therefore, we propose a new nonparametric model. This
model assumes UD, LI, M, and OS. Since it is known that the RM is the
only model for which the sum score is a sufficient statistic for the latent trait
(Fischer, 1974; Hessen, 2005; Maris, 2008), we call the this new model the
nonparametric Rasch Model (npRM).
The difference between the MHM and npRM can also be displayed in the
following way. Under the MHM, the ordering based on some score patterns
is in accordance with the ordering based on the sum score. Specifically, if z,
which is the uncommon part of x1 and x2 (see Section 3.3.2), is such that is
contains only ones, then
(Θ|X = x2 ) > (Θ|X = x1 ).

st
This can easily be demonstrated with two score patterns that differ on only
one item (e.g. in a case of 3 items [1,1,0] and [0,1,0]). The likelihood ratio of
these two score patterns is
P [1, 1, 0|θ] P (x1 |θ)P (x2 |θ)[1 − P (x3 |θ)]

=
P [0, 1, 0|θ] [1 − P (x1 |θ)]P (x2 |θ)[1 − P (x3 |θ)]
P (x1 |θ)
= .
1 − P (x1 |θ)
Because P (x1 |θ) is a monotone increasing function in θ (model assumption),
P (x1 |θ)
1 − P (x1 |θ)
is also increasing in θ, as well as the likelihood ratio of these two score

patterns. This likelihood ratio ordering implies stochastic ordering (Ross,
1996). For the case of a three-item test, all pairs of response patterns for
which this does hold, are displayed in Figure 3.4. However, in order to
conclude that the sum score is ordinal sufficient, also score patterns for which
z contains zeros and ones, have to lead to a stochastic ordering of θ that is in
accordance with the ordering based on the sum score. In other words, not
only the patterns in Figure 3.4 should have a stochastic ordering of the
110 100
Q Q
QQ QQ
111 101 010 000
Q Q
QQ QQ
011 001
Figure 3.4: Partial likelihood ratio order for a three-item test under the MHM.
110 100
Q
S Q
QQ QQ
S
111 101 S 010 000
Q Q
S

QQ QQS S
011 001
Figure 3.5: Partial likelihood ratio order for a three-item test under the npRM.
posterior distributions of θ, but all patterns that are displayed in Figure 3.5
should meet this condition. For instance, it must hold that
(Θ|X = [0, 1, 1]) > (Θ|X = [1, 0, 0]).

st
The question whether this ordering does hold or not, could be verified with a
statistical test. This test will be introduced in the next section.
3.4 Testable implications of ordinal sufficiency
In order to determine whether the sum score is ordinal sufficient, we introduce

some testable implications of OS. First two lemma’s are provided.
Lemma 4. OS of the sum score for a set of items implies OS of the sum score
for any subset of items.
Proof. Let xk denote a realization of X, xik denote the response on item i in

[i]
xk , xk denote the subset of xk without item i, x+k denote the sum score of
[i] [i]
responses in xk , and x+k denote the sum score of responses in xk .
chapter 3 53
It needs to be proven that if x+1 > x+2 , and
f (θ|X = x1 )
f (θ|X = x2 )
is increasing in θ, then
[i]
f (θ|x1 )
[i]
f (θ|x2 )
[i] [i]
is also increasing in θ if x+1 > x+2 .
Since we assume local independence,
P (xi |θ)P (x[i] |θ)f (θ)

f (θ|x) = ,
P (xi |x[i] )P (x[i] )
and
[i] [i]
f (θ|x1 ) P (xi1 |θ)f (θ|x1 ) P (xi2 |x2 )
= [i] [i]
.
f (θ|x2 ) P (xi2 |θ)f (θ|x2 ) P (xi1 |x1 )
[i] [i]
Let x1 and x2 be such that x+1 > x+2 . Following from local independence, we
are free to assume xi1 = xi2 = xi . Then, we find using the above relation that
[i]
f (θ|x1 ) f (θ|x1 )
[i]
∝ .
f (θ|x2 ) f (θ|x2 )
[i]
Since the right hand side is increasing in θ if x+1 > x+2 , and x+k = x+k + xi ,
the result follows.
Lemma 5. If OS of the sum score holds in all subsets of (p − 1) items, then it

also holds for all p items, provided p is even.
Proof. Any pair of response patterns x1 and x2 for which x+1 > x+2 can, after
applying the same permutation of indices to both, be expressed as
x1 = y∪z
x2 = y ∪ (1 − z)
where z+ > (1 − z)+ . We consider two cases.

The first case is when y 6= ∅. Then
f (θ|z) f (θ|x1 )
∝ .
f (θ|1 − z) f (θ|x2 )
Since the left hand side is increasing in θ if z+ > (1 − z)+ , and

x+k = z+ + y+ , it follows that the right hand side is also increasing in θ, and
that x+1 > x+2 , which completes the first part of the proof.
The second case is when y = ∅. Then any pair of response patterns x1 and x2
with x+1 > x+2 and with an even number of items, can, after applying the
same permutation of indices to both, be expressed as
x1 = (xi = 1) ∪ z
x2 = (xi = 0) ∪ (1 − z)
where z = x[i] and z+ > (1 − z)+ .

Observe that
f (θ|z) P (xi = 1|θ) f (θ|x1 )
∝ .
f (θ|1 − z) P (xi = 0|θ) f (θ|x2 )
Since both parts of the left hand side are increasing in θ if z+ > (1 − z)+ ,
it follows that the right hand side is also increasing in θ if x+1 > x+2 , which
completes the second part of the proof.
From these lemmas it follows that if (3.4) holds for X, it also has to hold
for all subsets of X.
Let I denote the set of all p item indices, i.e., I = {1, 2, 3, · · ·, i, · · ·, p}, let S
denote a subset of item indices, i.e., S ⊂ I, let S denote the vector of responses
[S]
on the items in S, and let X+ denote the sum score of the items of X that are
not in subset S. Following from local independence,
Z
[S] [S]
P (X+ |s) = P (X+ |θ)f (θ|s)dθ.
It is already mentioned that under the MHM, θ has monotone likelihood ratio
(MLR) in X+ (Grayson, 1988; Huynh, 1994). A well-known property of the
MLR is that also the reverse is true, i.e. X+ has monotone likelihood ratio in
chapter 3 55
θ. If the likelihood ratio increases in θ, then it follows that if
[S] [S]
(X+ |s2 ) > (X+ |s1 ),
st
then
(Θ|s2 ) > (Θ|s1 ).
st
This leads to the following analogy for testing ordinal sufficiency: the null
hypothesis (H0 ) is that H(X) is ordinal sufficient for θ. If H0 is true, then
∀s1 , s2 : (Θ|s2 ) > (Θ|s1 ), if H(s2 ) > H(s1 ).

st
These multiple sub-hypotheses can be tested by determining whether
[S] [S]
∀s1 , s2 : (X+ |s2 )>(X+ |s1 ), if H(s2 ) > H(s1 ).
st
If one of these sub-hypotheses is rejected, then H(X) is not ordinal sufficient

for θ. For testing stochastic ordering, we refer to the literature about the
Kolmogorov and Smirnov theorems (see, e.g., Doob, 1949; Conover, 1999b).
3.4.1 Example
This procedure will briefly be demonstrated with an example. The examples are
based on data from the Dutch Entrance Test (in Dutch: Entreetoets), which
consists of multiple parts that are administered annually to 125,000 grade 5
pupils. One of the parts is a test with 120 math items. To gain insight into
the item characteristics, we first analyzed a sample of 30,000 examinees2 with
the One-Parameter Logistic Model (OPLM, Verhelst & Glas, 1995; Verhelst et
al., 1993). The OPLM with integer αi parameters did not fit the data well,
R1c = 5,956, df = 357, p < 0.001, however, the item parameter estimates can
be informative for the selection of subsets of items for this illustration. The
parameters of a selection of six items are displayed in Table 3.1.
The smallest subsets that can be tested on ordinal sufficiency are subsets
of three items. According to the rule in Section 3.3.2 the subset that contains
2 A sample had to be drawn because of limitations of the OPLM software package w.r.t.
the maximum number of observations.

Table 3.1: Estimated OPLM parameters of six items from the example data set.
item αi δi
1 2 0.275
2 4 -0.156
3 3 -0.296
6 2 -0.460
8 4 0.104
10 5 -0.029
item 1, 2, and 3 has an ordinal sufficient sum score. This is confirmed by the
[S] [S]
empirical cumulative distributions (ecds) of (X+ |[0, 0, 1]) and (X+ |[1, 1, 0])
in Figure 3.6a. In contrast, the subset with item 1, 6, and 10 does not have
[S]
an ordinal sufficient sum score. Figure 3.6b displays the ecds of (X+ |[0, 0, 1])
[S] [S]
and (X+ |[1, 1, 0]). A third example are the ecds of (X+ |[0, 0, 1]) and
[S]
(X+ |[1, 1, 0]), based on the subset with item 1, 6 and 8. According to the
[S] [S]
αi -parameters, the (X+ |[0, 0, 1]) and (X+ |[1, 1, 0]) are not expected to be
stochastically ordered. This expectation is confirmed by Figure 3.6c.
These three cases can also be tested with the one-sided Kolmogorov-Smirnov
(KS) test (Conover, 1999a). The corresponding hypotheses are
[S] [S]
H0 : (X+ |[0, 0, 1]) ≤ (X+ |[1, 1, 0]);
st
[S] [S]
HA : (X+ |[0, 0, 1]) > (X+ |[1, 1, 0])
st
The KS-test was performed with the ks.test3 function in R (R Development

Core Team, 2013). The test statistics are D− = 0.0002, p = .9998; D− =
0.0829, p < .001; and D− = 0.0016, p = .9792, respectively.
3.5 Discussion
In the present study, the minimal conditions for ordinal inferences about
individuals are considered. It was shown that common nonparametric models,
which are known for their ordering properties (i.e., SOL), are not fully
satisfactory for the purpose of measurement at the level of individuals. The
3 This function computes the classical KS-test for continuous distributions, and therefore
does not allow for ties. However, alternative analyses with the ks.boot function from the
Matching package (Sekhon, 2011), a function that allows for ties, show similar results.
chapter 3 57
reason was that the ordering based on the sum score is not always in
accordance with the ordering based on the complete response pattern. In
order to guarantee this accordance, OS has been defined as a minimal
condition for fair scoring of individuals.
This aspect of fairness should be distinguished from measurement error and
the asymptotic behavior of a statistic. It could be that the ordinal inferences
change if the test administration is extended or repeated. However, these
changes are then based on additional information. OS refers to inferences
based on all available information.
OS is a property that can hold for any scoring rule, but this study only
focused on the sum score. It has been shown that OS of the sum score need
not hold for the NOM, but that it does for the RM as well as for 2PLMs with
a relatively homogeneous set of discrimination parameters. The latter case
implies that ignoring the weights in the scoring rule need not have an effect on
ordinal inferences.
In Section 3.3.2, it was shown that the MHM, as well as the extensions
with IIO or/and MTR, does not imply OS of the sum score. However, this
does not mean that these models are useless. The property that the latent
trait is stochastically ordered by the sum score is, for instance, very useful in
survey applications. It implies that the means (or other statistics of central
tendency) of the posterior distributions of θ are ordered in accordance with the
sum score. This says that people with a higher sum score have on average a
larger ability compared to people with a lower sum score, and therefore groups
of people can be ordered based on the sum score.
The introduction of OS and the npRM leaves some topics for further
research. The first is about model fit. It was shown in section 3.4 that OS has
testable implications. However, the proposed procedure contains many
pairwise subtests. For instance, for a test with ten items, 29,002 subtests (!)
have to be performed on the same data. Maybe, the procedure could be
reduced to those subtests that provide the most information about the
null-hypothesis that the sum score is ordinal sufficient. This topic needs
further study.
The second is how to equate two tests that both have an ordinal sufficient
score. In order words, how do these score distributions relate to each other?
The final comment is about deriving other OS test statistics. Ordinal
sufficiency is a property that can hold for any scoring rule. And for any
scoring rule provided, the test described above can be used in order to
determine whether that scoring rule is ordinal sufficient or not. However, this
approach can also be used in the reverse direction, i.e., it can be used in order
to find the scoring rule that is ordinal sufficient for a particular test. For
monotone latent variable models, there always exists an ordinal sufficient
statistic for the latent trait. For instance, the statistic that assigns the value 0
to those who made all items incorrect, the value 1 to those who made some
items incorrect and some items correct, and the value 2 to those who made all
items correct. This example is practically of limited value, however, it
demonstrates that one can look for a statistic that assigns examinees to
categories, such that the ordering between categories is ordinal sufficient.
This also demonstrates that OS is as condition a good deal weaker than
sufficiency. Whereas most IRT models do not allow for a sufficient statistic,
they all admit of (at least one) OS statistic.
chapter 3 59
Appendix
Let n be the number of items in z, and define

X
X+ = Xi
i
X
Y+ = Yh
h
X
Z+ = Zg
g
with realizations x+ , y+ and z+ , respectively.

Following from the partitioning of x1 and x2 ,
x+2 = y+ + z+ ,
x+1 = y+ + (n − z+ ).
In cases where
x+2 > x+1 ,
we obtain
z+ > (n − z+ ),
n
z+ > .
2
Now we consider the posterior distribution
P (xi |θ)xi [1 − P (xi |θ)]1−xi f (θ)

Q
i
f (θ|X = x) = .
P (X = x)
The likelihood ratio can be written as

P (xi2 |θ)xi2 [1−P (xi2 |θ)]1−xi2 f (θ)
Q
i
f (θ|X = x2 ) P (X=x2 )
=
f (θ|X = x1 ) P (xi1 θ)xi1 [1−P (xi1 θ)]1−xi1 f (θ)
Q
i
P (X=x1 )
P (xi2 |θ)xi2 [1 − P (xi2 |θ)]1−xi2

Q
i P (X = x2 )
= Q xi1 [1 − P (x |θ)]1−xi1
P (X = x1 ) i P (x i1 |θ) i1
Y P (xi2 |θ)xi2 [1 − P (xi2 θ)]1−xi2 P (X = x1 )
=
i
P (xi1 |θ)xi1 [1 − P (xi1 |θ)]1−xi1 P (X = x2 )
Y P (yh |θ)yh [1 − P (yh |θ)]1−yh Y P (zg |θ)zg [1 − P (zg |θ)]1−zg P (X = x1 )
=
h
P (yh |θ)yh [1 − P (yh |θ)]1−yh g P (zg |θ)1−zg [1 − P (zg |θ)]zg P (X = x2 )
zg
P (zg |θ) 1 − P (zg |θ) 1−zg P (X = x1 )
Y
= .
g
1 − P (zg |θ) P (zg |θ) P (X = x2 )
The natural logarithm of likelihood ratio is

  z  1−z
g g P (X = x )
! 
f (θ|X = x2 ) Y P (zg |θ) 1 − P (zg |θ) 1 
log = log     
f (θ|X = x1 ) g 1 − P (zg |θ) P (zg |θ) P (X = x2 )
!    
P (X = x1 ) X P (zg |θ) X 1 − P (zg |θ)
= log + zg log  + (1 − zg ) log  
P (X = x2 ) g 1 − P (zg |θ) g P (zg |θ)
!    
P (X = x1 ) X P (zg |θ) X P (zg |θ)
= log + zg log  + (zg − 1) log  
P (X = x2 ) g 1 − Pg (zg |θ) g 1 − P (zg |θ)
!  
P (X = x1 ) X P (zg |θ)
= log + (2zg − 1) log  . (3.10)
P (X = x2 ) g 1 − P (zg |θ)
It is generally known that if
f (θ2 |X = x2 ) f (θ1 |X = x2 )
> ,
f (θ2 |X = x1 ) f (θ1 |X = x1 )
then
f (θ2 |X = x2 ) f (θ1 |X = x2 )

log > log .
f (θ2 |X = x1 ) f (θ1 |X = x1 )
chapter 3 61
Now, following from (3.10), the likelihood ratio can be written as

!  
P (X = x1 ) X P (zg |θ2 )
log (2zg − 1) log 
+  >
P (X = x2 ) g 1 − P (zg |θ2 )
!  
P (X = x1 ) X P (zg |θ1 )
log + (2zg − 1) log  
P (X = x2 ) g 1 − P (zg |θ1 )
⇓
   
X P (zg |θ2 ) X P (zg |θ1 )
(2zg − 1) log   > (2zg − 1) log  
g 1 − P (zg |θ2 ) g 1 − P (zg |θ1 )
⇓
   
X P (zg |θ2 ) X P (zg |θ1 )
(2zg − 1) log  − (2zg − 1) log   > 0
g 1 − P (zg |θ2 ) g 1 − P (zg |θ1 )
⇓
    
X P (zg |θ2 ) P (zg |θ1 )
(2zg − 1) log   − log   > 0
g 1 − P (zg |θ2 ) 1 − P (zg |θ1 )
⇓
P (zg |θ2 )

X  1−P (zg |θ2 ) 
(2zg − 1) log 
 P (zg |θ )  > 0

g 1
1−P (zg |θ1 )
⇓
 P (zg |θ2 )  P (z |θ ) 

g 2
X  1−P (zg |θ2 )  X  1−P (zg |θ2 ) 
 P (zg |θ )  >
zg log   (1 − zg ) log 
 P (zg |θ )  ,

g 1 g 1
1−P (zg |θ1 ) 1−P (zg |θ1 )
n
θ2 > θ1 , z+ > .
2
1.0
001
110
0.8
F (X+ |S)
0.6
[S]
0.4
0.2
0.0
0 20 40 60 80 100 120
[S]
X+ |S
(a)
1.0
1.0
001 001
110 110
0.8
0.8
F (X+ |S)
F (X+ |S)
0.6
0.6
[S]
[S]
0.4
0.4
0.2
0.2
0.0
0.0
0 20 40 60 80 100 120 0 20 40 60 80 100 120

[S] [S]
X+ |S X+ |S
(b) (c)
[S] [S]
Figure 3.6: The ecds of (X+ |[0, 0, 1]) and (X+ |[1, 1, 0]). S contains: item 1, 2, and
3 (a); 1, 6, and 10 (b); 1, 6, and 8 (c).
Chapter 4
Monitoring Countries in a Changing

World. A New Look at DIF in
International Surveys
Summary
This paper discusses the issue of differential item functioning (DIF) in international
surveys. DIF is likely to occur in international surveys. What is needed is a
statistical approach that takes DIF into account, while at the same time allowing
for meaningful comparisons between countries. Some existing approaches are
discussed and an alternative is provided. The core of this alternative approach is to
define the construct as a large set of items, and to report in terms of summary
statistics. Since the data are incomplete, measurement models are used to complete
the incomplete data. For that purpose different models can be used across countries.
The method is illustrated with PISA’s reading literacy data. The results indicate
that this approach fits the data better than the current PISA methodology,
however, the league tables are nearly the same. The implications for monitoring
changes over time are discussed.
This chapter has been submitted for publication as: Zwitser, R.J., Glaser, S. & Maris, G.
(submitted). Monitoring Countries in a Changing World. A New Look at DIF in
International Surveys.
64 dif in international surveys
4.1 Introduction
Since a couple of decades, educational surveys have been administered

repeatedly with the purpose to explore what students know or can do, to
compare participating countries or economies, to measure trends over time,
and/or to evaluate educational systems. Examples of such surveys are the
Programme for International Student Assessment (PISA), the Trends in
International Mathematics and Science Study (TIMSS), the Progress in
International Reading Literacy Study (PIRLS), and the European Survey on
Language Competences (ESLC).
The constructs that are supposed to be measured in these surveys are
operationalized in sets of items. These sets are usually too large to administer
completely to each student in the sample. Therefore, the survey is
administered in an incomplete design, in which only subsets of items are
administered to each student. This implies that there are structural missing
data. In order to get comparable scores from incomplete data, surveys make
use of latent variable models (see e.g., Lord & Novick, 1968, for a general
introduction). The models, of which the parameters can be estimated with
the observed incomplete data, describe the distribution on the total set of
items.
To obtain unbiased results, it is important to find a model that fits the
data. However, an issue that complicates the statistical modeling, is the
occurrence of differential item functioning (DIF). Holland & Wainer (1993)
define DIF as follows: ‘DIF is a relative term. An item may perform
differently for one group of examinees relative to the way it performs for
another group of examinees’. In this paper, we focus on two types of DIF.
The first is uniform DIF, which, in terms of latent variable models, means
that conditional on the latent variable, the probability of a particular item
response varies across subpopulations. The second is non-uniform DIF, which
means that the correlation between a particular item response and the latent
variable varies across subpopulations.
It has been demonstrated more than once that there is DIF in educational
surveys (Kreiner, 2011; Kreiner & Christensen, 2013; Oliveri & Ercikan, 2011;
Oliveri & Von Davier, 2011; Oliveri & Von Davier, 2014; OECD, 2009a). The
presence of DIF is usually seen as a threat to validity, and as something that
limits score comparability between subpopulations (American Educational
chapter 4 65
Research Association, American Psychological Association, & National

Council on Measurement in Education, 1999). In this paper, however, we
want to emphasize that for surveys DIF could be one of the main interesting
outcomes, and that it need not invalidate meaningful comparisons between
countries. The purpose of this paper is to describe and illustrate such a
method.
The paper is structured as follows. First, we discuss some of the existing
approaches concerning DIF and surveys (Sections 4.1.1, 4.1.2, and 4.1.3).
Then, in Section 4.2, the new method is proposed and compared to the
existing ones. Throughout the paper, we take the reading literacy items of
PISA as an example. In Section 4.3 we describe two example data sets,
whereafter in Section 4.4 three examples of our methodology are provided.
The results and implications are finally discussed in Section 4.5.
4.1.1 Remove DIF items and ignore DIF in the model

The current practice in PISA is a two-stage calibration procedure (OECD,
2009a, chapter 9). First, a calibration is performed in each country1 . Based
on this calibration, items with poor psychometric properties (the so-called
dodgy items) are marked. One of the criteria to mark an item as dodgy is if
the item difficulty in a particular country is significantly lower or higher than
the average of all available countries. If an item is dodgy in more than 10
countries, then it is removed from the survey, otherwise it is scored as ‘not
administered’ in the countries in which it performs dodgy (OECD, 2009a,
chapter 9). Therefore, it could be said that PISA attempts to remove DIF
with respect to item difficulty. Then, in the second stage, PISA applies the
Mixed Coefficients Multinomial Logit Model (Adams, Wilson, & Wang, 1997;
OECD, 2009a) assuming a common scale for all countries together. This
model is a multidimensional generalization of the Partial Credit Model
(PCM), which is a Rasch model for partial credit scoring (Masters, 1982).
The person parameters, which are in this case a monotone transformation of
the expected sum score on the total set of items, are the basis for comparisons
between countries. The use of a common scale for all countries together
1 In fact, PISA consists of participating economies. However, since most economies are
countries, and since we think that the term countries is easier for the reader, we use the term
countries instead of economies.
implies that DIF is currently not taken into account in the psychometric
model. It has been demonstrated that the procedure described above does not
succeed in removing all DIF in PISA (Kreiner, 2011; Kreiner & Christensen,
2013; Oliveri & Von Davier, 2011; Oliveri & Von Davier, 2014; Oliveri &
Ercikan, 2011). In order to improve the model fit and to study the effect of
DIF on PISA’s final scores and rankings, different alternative methods have
been proposed. These alternatives are discussed in the following two sections.

compare person parameter estimates
To study the score scale comparability in international assessments, Oliveri &
Von Davier (2011; 2014) adjusted the measurement model with additional,
country-specific item parameters. In both papers, they showed that the
additional item parameters had a substantial positive effect on the model fit,
however, the influence of the model adjustments on the final scores was
limited. For all three domains (i.e., Science, Reading, and Mathematics) the
correlation between the country means based on the international parameters
(i.e., the PISA approach) and the country means based on partially-unique
country parameters is at least 0.987 (Oliveri & Von Davier, 2011; 2014).
A problem with this approach, however, is that the (equated) person
parameters are used as final scores, while for these scores the relationship
(e.g., the ordering) between countries depends on an arbitrary decision about
the scaling method. This is illustrated in the following example.
Consider the PCM in the following parameterization:
Pj
exp(jθ − g=1 big )
P (Xi = j|θ) = Pmi Ph (j = 0, · · ·, mi ), (4.1)
1 + h=1 exp(hθ − g=1 bih )
in which it is assumed that the response to item i, denoted by Xi , falls in the

score range {0, 1, · · ·, mi }, and where big , g = 1, · · ·, mi , are the parameters of
item i. The example consists of 20 reading literary items (15 dichotomous, and
5 partial credit with possible scores {0, 1, 2}) of PISA 2006 and two particular
countries: Poland and The Netherlands. Further details about the data are
described in Section 4.3.2.
Let us assume that item covariance structures are different in Poland and
chapter 4 67
The Netherlands. In that case, country specific parameters are needed in the
model. In this example, we choose to estimate separate PCMs on the data
of Poland and The Netherlands, respectively. What we obtain are two sets
of parameters that are not on the same scale. How to equate these scales?
Many answers to this question have been considered (see, for instance, Kolen
& Brennan, 2004, chapter 6). However, to illustrate the in principle arbitrary
aspect of scaling with DIF, we just consider the following three options.
The first option is to fix the means of both sets of item parameters at the
same value, e.g., at zero. This mean-mean method (Loyd & Hoover, 1980)
was also applied by Oliveri & Von Davier (2011; 2014). The corresponding
scatterplot with item parameters2 is depicted in Figure 4.1a. Observe that
the dots are not on an approximately straight line, which indicates that there
is DIF between these two countries. The cumulative distribution of person
parameters is displayed in Figure 4.1b. The θ-distributions are approximately
equal.
2 The parameters of polytomous items are connected with a dotted line.

68
Item parameters Netherlands dif in international surveys
0.0 0.2 0.4 0.6 0.8 1.0

Poland
4
Netherlands
F(θ|X = x)
2
0
-2
-4
-4 -2 0 2 4 -4 -2 0 2 4
Item parameters Poland θ
(a) Item parameters - option 1 (b) Person parameters - option 1

Item parameters Netherlands
0.0 0.2 0.4 0.6 0.8 1.0
Anker Poland
4
Non Anker Netherlands

F(θ|X = x)
2
0
-2
-4
-4 -2 0 2 4 -4 -2 0 2 4
(c) Item parameters - option 2 (d) Person parameters - option 2

Item parameters Netherlands
0.0 0.2 0.4 0.6 0.8 1.0
Anker Poland
4
Non Anker Netherlands

F(θ|X = x)
2
0
-2
-4
-4 -2 0 2 4 -4 -2 0 2 4
(e) Item parameters - option 3 (f) Person parameters - option 3
Figure 4.1: Different anchoring options.

chapter 4 69
The second and third option are based on equating with an anchor set of
items. In the presence of DIF, the choice of an anchor is problematic. There
can be clusters of items, such that within a cluster the relative difficulties are
invariant across populations (Bechger, Maris, & Verstralen, 2010; Bechger &
Maris, 2014). An example of such clusters is shown in Figures 4.1c en 4.1e. The
items depicted with black and grey circles, respectively, could be seen as two
of such clusters. It is not uncommon in psychometrics to take one cluster as
anchor and consider these items as the items without DIF. The remaining items
are then considered to be the DIF items. In this example we have two clusters,
but taking one or the other as anchor (denoted as options 2 and 3), or fixing the
mean of the item parameters (option 1) is statistically equivalent. These three
options have exactly the same likelihood, and are only a re-parameterization
of each other. This implies that there is no empirical evidence to prefer one
option over the other. Observe, however, that the relative difference between
person parameter distributions is different across these three options. The θ-
distributions related to options 2 and 3 are depicted in Figures 4.1d and 4.1f,
respectively. With option 2 the distribution of Poland is stochastically larger
than the distribution of The Netherlands, while the reverse is true with option 3.
Since the ordering of the person parameter distribution of person parameters
depends on an arbitrary decision about the scaling method, we argue that
person parameters are in this case not a suitable basis for comparing countries.

adjust the observed total score
Kreiner & Christensen (2007) also came up with a model with subpopulation-
specific item parameters. Since they also doubted whether comparisons of
person parameters in such models are meaningful, they suggested to estimate
adjusted true scores on a specific set of items. Their procedure implies that for
each focus group the estimated true score is adjusted as if the members of the
focus group were members of the reference group. This procedure was applied
in their paper about the reading items of PISA 2006 (Kreiner & Christensen,
2013). It was observed that the average number of correct items in Belgium
was 15.09, but if these Belgian students would have been from Azerbaijan,
they would have scored 17.59 (see Kreiner & Christensen, 2013, Table A.1.).
Is it meaningful to adjust observed scores based on a DIF analysis? We think
that the answer depends on the type of DIF. If there is an external source
that has a temporary effect on the scores, and the magnitude of the effect
can empirically be estimated, then it makes sense to adjust test scores. For
instance, wind assistance in a 100 meter sprint exists independent of the ability
of the athlete, and benefit of wind assistance can be modeled explicitly (see
e.g., Linthorne, 2014). Since it could have happened that the same athlete
would have ran the race without wind, the adjusted sprint time is a meaningful
measure. In a testing context, however, the sources of DIF are in most cases
unclear (Sandilands, Oliveri, Zumbo, & Ercikan, 2013; American Educational
Research Association et al., 1999). Sources that are reported in a survey context
(e.g., language, cultural, or gender differences) are in a fixed way related to the
subpopulation. They are fixed in the sense that it could not have happened
that the same student was a girl instead of a boy, or that he was English
speaking instead of French speaking. Therefore, we think that DIF-based score
adjustments like ‘if this student was a member of another subpopulation’ have
a limited meaning in a survey context.
4.1.4 DIF as an interesting outcome

Besides the papers mentioned in Section 4.1.1, PISA itself also reports
examples of DIF. For instance, with respect to the math items of PISA 2003,
“among OECD countries, the Slovak Republic ranks around fourteenth
(twelfth to seventeenth) and thirteenth (ninth to seventeenth) for the ‘space
and shape’ and ‘quantity’ scales, but around twenty-fourth (twenty-fourth to
twenty-fifth) in the ‘uncertainty’ scale” (OECD, 2004). Since opposite
patterns occur in other countries (OECD, 2004), one can easily conclude that
there is DIF between countries with respect to the math items. We believe
that this kind of DIF should not be eliminated. Instead, it should be
considered as an interesting outcome of an educational survey. Therefore, the
incomplete survey data should be treated in a way that takes DIF into
account. The previous sections, however, illustrate that the existing
approaches in which DIF is taken into account are not fully satisfactory. In
the next section, we propose another DIF driven approach to model the data
obtained from international surveys.
chapter 4 71
4.2 Method
The main idea is straightforward: define the construct as a large set of items,
collect the data in an incomplete design, use measurement models to complete
the incomplete data, and report in terms of summary statistics. The following
sections describe this procedure in more detail.
4.2.1 The construct

The starting-point of our approach is that the subject domain of interest is
defined by a large set of items. This is in accordance with the third variation of
the market basket approach (Mislevy, 1998). What follows is that the primary
interest is in knowing how students perform on this set of items. Statistically
speaking, we want to estimate the distribution of X, which represents the
matrix with responses from all students to all items. What we therefore need
to have, is an unbiased estimate x. One way to achieve this, is to administer the
complete set of items to a simple random sample of students. However, this is
practically untenable. The current practice is to draw a stratified sample and to
administer the survey with an incomplete design. The correction for stratified
sampling is already taken into account with the student weights and replicate
weights (OECD, 2009b, chapter 3 and 4). What remains is how the incomplete
observed data may be used to get an unbiased estimate of the complete data.
This is the point where the measurement model comes in.
4.2.2 Purpose of the measurement model

The primary role of the model is to describe the scores on the missing data,
conditionally on the observed data. If the model fits, then the estimated
complete matrix x is an unbiased estimate of X. Observe that for this
purpose it is not required that the same model is used for each country.
Instead, it is possible to have different models for different countries, or even
to have different models for different subpopulations within a country. It only
needs to be verified whether the model(s) is (are) suitable to impute missing
data.
4.2.3 Comparability
One of the main applications of surveys is to compare performance across
participating countries. This implies the comparison of subsets of X. Let us
denote the subsets corresponding to countries a and b as Xa and Xb . Under
the conditions described above, xa and xb are unbiased estimates of Xa and
Xb , respectively, and are therefore comparable between countries.
Consequently, the same holds for functions applied to X. If one is interested
in the function f of X (e.g., the sum score over items), then f (xa ) and f (xb )
are unbiased estimates of f (Xa ) and f (Xb ). Notice that this only holds if the
same function f is applied to both xa and xb . With two different functions,
e.g., f1 (xa ) and f2 (xb ), the comparability property is lost.
4.2.4 Difference with existing methods

It could be that one of the models to which is referred in Sections 4.1.2 and
4.1.3 does fit the data. In that case, these models fit in our approach because
they can be used to estimate the complete data, and consequently also
summary statistics on the complete data. The difference between our
approach and the methods described in Sections 4.1.2 and 4.1.3 is about what
is done with the estimated complete data. Oliveri & Von Davier (2011; 2014)
use the model to compare the estimated person parameters. However, the
estimated person parameter is a function of the data, and the item
parameters. If the item parameters differ in different subpopulations, the
comparison of person parameters is based on two different functions of the
data (cf., f1 and f2 ). As we have explained in the section above, scores based
on different functions of the data are not comparable.
Kreiner & Christensen (2013) illustrate their method with complete data,
and use the model to adjust the total scores. This implies that they compare the
observed performance of country a with the adjusted performance of country
b. In contrast, the method that we propose does not contain such adjustments.
If we have complete data, we compare the performance of country a with the
(non-adjusted) performance of country b.
chapter 4 73
4.2.5 Estimation process

For the method described above, it is needed to find a subdivision of the
population, such that for each subpopulation a suitable model can be used.
To simplify the illustrations, this paper only verifies the application of
different models for different countries and for different time points in the
survey. To stay close to illustrations provided in the related research that was
discussed in Section 4.1, two models are taken under consideration: the PCM
as in (4.1), and the One-Parameter Logistic Model (OPLM, Verhelst & Glas,
1995):
Pj
exp[ai (jθ − g=1 big )]
P (Xi = j|θ) = Pmi Ph (j = 0, · · ·, mi ),
1 + h=1 exp[ai (hθ − g=1 bih )]
which compared to (4.1) also contains integer ai -parameters that are considered
as being known before. These are also called discrimination indices.
For both the PCM and the OPLM, the b-parameters are estimated with
the conditional likelihood method (Andersen, 1973a). An important question
related to the OPLM is how to obtain these ai -parameters. Observe that items
with the same value for the ai -parameter do form a Rasch subscale. Therefore,
an OPLM can also be considered as a collection of Rasch subscales. One
way to estimate this model, is to 1) find the different Rasch subscales, and 2)
investigate how these are related to each other, e.g., find the different ai -values
(see Bolsinova, Maris, and Hoijtink (2012) for recent developments with respect
to this approach). For this paper, the ai -parameters are estimated with the
OPCAT procedure in the OPLM program (Verhelst et al., 1993). Since this is
also an estimation procedure, the sample is split randomly into two subsamples.
The first subsample, which is chosen to be approximately twice as large as the
second, is used for estimating the ai -parameters. The second subsample is
used for estimating the bi -parameters of the OPLM, while considering the ai -
parameters as known, and for evaluating the model fit. This estimation is
performed with the OPLM program (Verhelst et al., 1993).
When the item parameters are estimated, then the next step is to estimate
person parameters. In order to do so, five plausible values (PV) are drawn from
the posterior distribution of the person parameters, conditional on the data and
the estimated item parameters. The sample from this posterior distribution
can be drawn with a Gibbs sampler, in which the mean and the variance of
the prior normal distribution are also estimated. A general description of this
method can be obtained from the ESLC Technical Report (Council of Europe,
2012). Detailed discussions about the estimation algorithm (Marsman, Maris,
Bechger, & Glas, 2013a), and about the advantages of using PV (Marsman,
Maris, Bechger, & Glas, 2013b) are elsewhere available.
4.2.6 Plausible responses and plausible scores

As mentioned before, the person parameters are not directly comparable
between countries if the sets of item parameters are not the same across these
countries. However, the PV can be used to obtain plausible responses.
Plausible responses are samples from the item response distribution according
to the measurement model, the estimated item parameters for the particular
country, and the PV for the person parameters in that country. For each PV,
one set of plausible responses was drawn for the parts of the data that were
missing. A full matrix with observed and plausible responses for each student
on each item is an estimate of X. It is already noticed in Section 4.2.3 that
someone might be interested in statistics computed on X. In this paper, we
consider the sum over item responses (observed and plausible), and call these
plausible scores. However, also other summary statistics could be taken. In
Section 4.5, we will discuss that in this approach there are neither right nor
wrong summary statistics.
4.2.7 Model fit evaluation

This approach relies upon a fitting model. In this paper, the fit is evaluated
with two methods. The first method is the R1c statistic. This statistic is, under
the null-hypothesis that the model fits, asymptotically chi-square distributed
(Verhelst et al., 1993). The second method is the exploration of item fit plots.
These plots provide the theoretical (πi|x+ ) and observed (Pi|x+ ) probability
of a particular response (e.g., the correct response) on item i, conditional on
the observed (weighted) sum scores (i.e., the sufficient statistic for the person
parameter, from here on denoted as x+ ). The observed x+ are first binned
into score groups, such that each score group had a substantial number of
observations. Then the weighted average value of πi|x+ and Pi|x+ are computed
chapter 4 75
Table 4.1: Sample sizes data set 1.
Year Canada Mexico

2003 1097 592
2006 1738 2085
in each score group, using the observed frequency of scores in the score group.
4.3 Data
In order to illustrate the proposed method, PISA’s ‘reading literacy’ scale, is

taken as an example3 . For this construct, the PISA cycles of 2003 and 2006
contain the same 28 items. Although PISA does not claim that this set of items
defines reading literacy, we do so to illustrate the method.
4.3.1 Data set 1

For the first two examples, only the data of booklet 11 of PISA 2003 and
booklet 6 of PISA 2006 were taken, because these booklets contain all 28
reading literacy items, and can therefore be considered as complete data.
Within these booklets, two particular countries with a large sample sizes are
selected: Canada, and Mexico. Within these countries, only the cases without
incidental missing values are selected. The sample sizes of the resulting data
set are displayed in Table 4.1.
4.3.2 Data set 2

For the third example all reading literacy data of PISA 2006 were taken. The
total sample consists of 398,750 students. For the example, the following data
cleaning was performed. First, all students with more than 50% missing item
responses within the administered booklet were removed from the data (i.e.,
4,432 students ≈ 1%). Then, two countries were excluded: the USA, because
no reading items were administered, and Liechtenstein, because of the very
small sample size. The sample sizes per country are displayed in the third and
fourth column of Table 4.4. The third column denotes the remaining sample
3 The data were retrieved from http://pisa2003.acer.edu.au/downloads.php and
http://pisa2006.acer.edu.au/downloads.php on August 22nd, 2013.

after the exclusions described above. The fourth column denotes the number
of students that took reading items.
Since the construct is defined in a set of items, and not all reading items
are administered in each country (OECD, 2009a, Tabel 12.5), only the subset
of 20 items was taken that was administered in each country. These are the
same items as those that were taken by Kreiner and Christensen (see Kreiner
& Christensen, 2013, Table 3).
Treatment of missing data

PISA distinguishes three types of missing data: not administered , which not
only means that items are not administered to a person, but also could mean
that an items is removed due to poor psychometric qualities; invalid response,
which implies a missing or double response; and not reached , which denotes
consecutive invalid responses at the end of the test. Invalid and not reached
response codes are considered as wrong (i.e., scored as 0). The coding of the
not administered items needs some more explanation.
All reading items are distributed among 14 booklets. In some countries,
some items are not administered in a particular booklet (see OECD, 2009a,
Table 12.5). These cases are not considered as wrong responses, but literally
as not administered. This implies that in the OPLM program (Verhelst et
al., 1993) additional booklets are defined. These booklets are equivalent to
the corresponding original booklet, besides the not administered item. After
doing that, the data still contain not administered responses. For these cases,
the following rule was defined: for countries where a particular item has more
then 5% not administered responses, an additional booklet was defined for
those students who have a not administered response on that particular item.
This was the case for item R220Q06 in Estonia (multiple booklets) and item
R227Q02T in Azerbaijan (only booklet 13). In cases where a particular item
has less then 5% not administered responses, the responses were considered as
wrong, i.e., scored as 0.
4.4 Illustrations and results
This section describes the method with three of illustrations. The first two
examples are based on data set 1. In the first example (Section 4.4.1) the fit of
chapter 4 77
1.0
1.0
0.8
0.8
P (Xi = 1)
P (Xi = 1)
0.6
0.6
0.4
0.4
0.2
0.2
0.0
0.0
5 10 15 20 25 30 5 10 15 20 25 30
Weighted sum score Weighted sum score
(a) Item 17 (b) Item 25
Figure 4.2: Pi|x+ (dots), and πi|x+ (line) under the PCM for both countries and both
cycles.
different measurement models is explored. The second example (Section 4.4.2)

illustrates the observed sum score distribution of Canada and Mexico, as well
as the estimated sum score distribution based on incomplete data. The last
example (Section 4.4.3) is based on data set 2, and illustrates the impact of
DIF on the average sum scores of countries, and the corresponding rankings.
4.4.1 Exploring the model fit

The same model for both countries
First, the PCM is fitted on the total data set with both countries and both
cycles. This is actually what PISA also does. It turns out that this PCM does
not fit, R1c (231) = 2,233.36, p < 0.0001.
The item fit plots in Figure 4.2 provide a closer look at the fit of the
PCM. The two examples of item 17 and 254 indicate that the addition of
discrimination parameters would improve the model fit significantly.
Therefore, the OPLM (Verhelst & Glas, 1995) is fitted on the total data set
with both countries and both cycles. For this analysis, the data for each
4 The item numbering is according to the order in which the items appear in booklet 6 of
PISA 2006.
Table 4.2: Size subsamples and fit statistics OPLM.
Year Country Sample size Fit statistics

subsample 1 subsample 2 R1c (99) p
2003 Canada 700 397 105.932 0.2985
Mexico 400 192 122.306 0.0561
2006 Canada 1200 538 131.750 0.0155
Mexico 1400 685 150.115 0.0007
Table 4.3: Estimated a-parameters of the OPLM for both countries and both cycles.
Item nr. a Item nr. a Item nr. a Item nr. a

Item 1 3 Item 8 4 Item 15 3 Item 22 3
country and each cycle is split into two subsamples (see section 4.2.5). The
corresponding sample sizes are displayed in column 3 and 4 of Table 4.2. The
integer discrimination parameters are obtained based on subsample 1. These
estimated values are displayed in Table 4.3. The OPLM is estimated with
subsample 2. Although the model fit test is significant, R1c (99) = 255.411,
p < 0.0001, the ratio between the fit statistic and the number of degrees of
freedom (approx. 255/99) indicates that the fit of this model is substantially
better than the fit of the PCM (approx. 2,233/231). This improvement in fit
can also be seen in the item fit plot of item 17 and 25 (see Figure 4.3).
Different models in different countries

In order to investigate DIF between countries, the OPLMs are also estimated
for each country, separately. The corresponding fit statistics, each time based
on subsample 2, are displayed in column 5 and 6 of Table 4.2. The fit
statistics demonstrate that separate models for each country does improve the
fit substantially. Another way to illustrate DIF, is by comparing πi|x+ and
Pi|x+ , where πi|x+ is based on item parameters obtained from a different
country. For Canada and Mexico 2006, two examples are given in Figure 4.4.
chapter 4 79
1.0
1.0
0.8
0.8
P (Xi = 1)
P (Xi = 1)
0.6
0.6
0.4
0.4
0.2
0.2
0.0
0.0
30 40 50 60 70 80 90 30 40 50 60 70 80 90
Figure 4.3: Pi|x+ (dots), and πi|x+ (line) under the OPLM for both countries and
both cycles.
The expected proportions for Canada 2006 are estimated based on the item
parameters that are obtained from the OPLM for Mexico 2006, while the
observed proportions are obtained from the data of Canada 2006. It can be
seen that for the students in Canada in 2006 item 17 is relatively more easy,
and item 21 has a larger discriminative power, compared to students in
Mexico.
4.4.2 Incomplete design

In Section 4.2.2, it is suggested that the primary role of the latent variable
model is to describe the expected score distribution for the missing data. This
is demonstrated in the following example, based on data set 1. For each student,
the responses on all 28 items are available. Now both subsamples are divided
into two groups. For the first group, only the responses on item 1 to 18 are
taken, i.e., item 19 to 28 are considered as missing, while for the second group
only the responses on item 11 to 28 are taken. A graphical representation of
the incomplete design is given in Figure 4.5.
In order to compute the plausible score distribution, the steps as described
in Sections 4.2.5 and 4.2.6 are performed. The distributions of plausible scores
1.0
1.0
0.8
0.8
P (Xi = 1)
P (Xi = 1)
0.6
0.6
0.4
0.4
0.2
0.2
0.0
0.0
40 60 80 100 40 60 80 100
Figure 4.4: πi|x+ (line) and Pi|x+ (dots), where πi|x+ is based on parameters obtained
from Mexico 2006, and where Pi|x+ is based on the data of Canada 2006.
of both countries, based on separately estimated OPLMs in each country, are

displayed in Figure 4.6a. The plausible scores are displayed in grey. The
observed total score on the all 28 items, is displayed in black. It can easily
be seen that the plausible scores, based on 18 out of 28 items, provide a good
estimation for the the sum score on all 28 item. We call this example 1.
In order to demonstrate that the estimation of plausible scores is quite
robust against model misspecification, the plausible scores are also computed
based on the item parameters obtained from a PCM that was fitted on the
complete data of Canada and Mexico separately (example 2), and on the item
parameters obtained from a PCM that was fitted on the complete data of
Canada and Mexico in both 2003 and 2006 together (example 3). Although
these models do not fit, Figures 4.6b and 4.6c, respectively, display that the
missing data can still be estimated accurately.
4.4.3 A large data example

In this final example, the impact of DIF on the average sum score per country
and the corresponding league table is illustrated. For each country, separate
PCMs and OPLMs were fitted. The R1c /df-ratios of the PCMs and OPLMs
chapter 4 81
Item
1, · · · 10,11, · · · 18,19, · · · 28
Group 1 missing
Subsample 1
Group 2 missing
Group 1 missing
Subsample 2
Group 2 missing
Figure 4.5: Graphical representation of the incomplete design.
are displayed in Figure 4.7. From there it can be seen that for the PCMs only
two countries have an R1c /df smaller than 2, while for the OPLMs all R1c /df
are smaller than 2. For this example, we took the OPLM for each country.
Then, for each student, five plausible values were drawn, and these were
transformed to the plausible scores metric. Since five of the items are partial
credit items with three categories, the scale reaches from 0 to 25. Finally,
standard errors were computed according to the regular procedure with student
and replicate weights (see OECD, 2009b), Chapter 2-4).
The original PISA scores (OECD, 2007) and the average plausible scores
per country are depicted in Figure 4.8. The correlation between the two is
0.991. There are differences in ranks, but these are small. To illustrate the
significance of the differences, the 95% confidence levels of the ranks are
simulated as follows. For each country, a score was sampled from a normal
distribution with mean equal to the country’s mean score, and standard
deviation equal to the standard error of the country’s mean estimate. With
these sampled scores, the rank of the countries is determined. Next, this
1.0
Canada
Cumulative distribution
Mexico
0.8
0.6
0.4
0.2
0.0
0 10 20 30
Total score
(a) Example 1
1.0
1.0
Canada Canada
Mexico Mexico
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2
0.0
0.0
0 10 20 30 0 10 20 30
Total score Total score
(b) Example 2 (c) Example 3
Figure 4.6: Distribution of observed (black) and plausible (grey) scores.
sampling scheme is repeated 10,000 times. Consequently, each country

obtains 10,000 rank estimates. Finally, per country the 2.5th and 97.5th
percentile of the simulated ranks are taken as a 95% certainty estimate. The
resulting intervals for the original PISA scores, as well as for the plausible
scores computed in this example, are displayed in Table 4.4 and Figure 4.9.
It can be seen that for the majority of countries the interval is slightly larger
when based on plausible scores. This is because these scores are based on 20
chapter 4 83
10
R1c/df OPLM
8
6
4
2
2 4 6 8 10
R1c/df PCM
Figure 4.7: R1c /df-ratios of the PCMs and OPLMs estimated per country.
items instead of 28 items, and because the plausible scores contain an additional
source of uncertainty. Besides the uncertainty caused by sampling students,
and the estimation error of the parameters based on the observed data, the
plausible scores also contain measurement error in the sampled responses for
the missing data. This latter source of uncertainty, however, is relatively small
compared to the first two.
The most remarkable observation is that besides one country all other
intervals do overlap. Only for Macao-China the rank based on the plausible
scores on the 20 items is significantly higher compared to the PISA ranking.
This overlap illustrates that taking (non-)uniform DIF into account does not
affect the rank order of countries significantly.
4.5 Discussion
The aim of this paper was to provide a statistical method for educational
surveys that takes DIF into account, but at the same time provides final scores
that are comparable between countries. The results make clear that the model
fit improves substantially if country specific item parameters are included in
the method. The final league table, however, is nearly the same as the PISA
league table that is based on an international calibration. This illustrates that
10 12 14 16 18
Plausible score
8
6
300 350 400 450 500 550
PISA score
Figure 4.8: Mean scores per country based on the PISA’s Reading Literacy data from
2006. On the x-axis the original PISA scores, on the y-axis the plausible scores on
the subset of 20 items.
the current PISA methodology is quite robust against (non-)uniform DIF.

DIF is not the only aspect in which the PISA methodology has been
criticized. Other critiques where about the multilevel and multidimensional
structure of the data (Goldstein, 2004), or about the effect of item positions
on item parameters (Le, 2007). These papers suggest to extend the model
with additional parameters. However, when these parameters are added, then
the person parameters are for the same reason as explained in this paper not
directly comparable between countries. Therefore, it should be made clear
how these extended models provide summary statistics that permit
inter-country comparisons, something which is not done in the papers
mentioned above.
Some other critiques on PISA are about the multidimensional structure of
the constructs. In every PISA cycle, one domain (i.e., Math, Reading, or
Science) is chosen to be the major domain. For major domains additional
items are administered, and the scores on the sub domains are reported on
different scales. This implies that every domain has a multidimensional
structure, otherwise the breakdown into subscales would be redundant. Does
this imply that every domain should be modeled with a multidimensional
chapter 4 85
model? The answer to this question is not necessarily yes. An alternative to

the complex technique and reports of a multidimensional models could be an
analysis with the following two steps. First, as described above, fit a
unidimensional model in order to solve the missing data problem. Then,
perform a secondary profile analysis (Verhelst, 2012) in order to test whether
the country’s profile of observed subscores is significantly different compared
to the profile of subscores according to the model. This might, for instance,
make clear that the country stays behind on a particular subset of the
domain.
In reaction to an earlier paper of Kreiner (2011), Adams (2011) pointed
out which factors need to be taken into account if someone is considering an
alternative, more complicated model:
1. The need to comprehensively cover the constructs;
2. The need for analytic techniques that work simultaneously for more than
50 countries;
3. The need to integrate with appropriate sampling methodologies;
4. The requirement to provide a database that is accessible to and usable

by secondary data analysts;
5. The need to support the construction of described proficiency scales;
6. The need to permit inter-country comparison of overall levels of

performance;
7. The need to support the construction of scales that retain their meaning
over time.
We think that our approach fits the first six requirements. The seventh, we
return to later on.
In this paper, we took the sum score as a summary statistic of X, but also
another statistic could have been chosen. If we define the construct as a large set
of items, then, as explained in Section 4.2.3 the data are comparable, and so are
all kinds of summary statistics. Which summary statistics should be taken is a
question that policy makers should answer. Statistical techniques can provide
suggestions about statistics that could display interesting information, however,
there is no empirical ground for considering some statistics to be wrong. If

we can estimate every summary statistic we like without bias from complete
data, we only need to satisfy ourselves that we obtain unbiased estimates from
incomplete data as well.
A question that is not considered in this paper, is which items should and
should not be included in an international survey. This question is a topic of
debate between the participating countries. The aim of PISA is to measure
whether the knowledge and skills of 15-year old students is such that they are
prepared for life. It could be that different countries have different (kinds of)
items in mind when pondering this question. In order to find a compromise,
there are mainly two options: define the greatest common divisor, or define
a broader set that also contains items that fit the ideas of some but not all
countries. The latter approach might cause DIF, but that, as explained in this
paper, does not cause methodological problems. Instead, this diversity in the
item set has the opportunity to display the diversity among countries. One way
or the other, one central requirement for the method proposed in this paper is
that comparisons are based on one set of items that is administered in every
country. Therefore, the participating countries in PISA should agree whether
responses to the final set of items does reflect preparedness for life. This set of
items defines the validity of the survey instrument.
At this stage, we come back to the seventh requirement that was mentioned
above, i.e., ‘the need to support the construction of scales that retain their
meaning over time’. Comparisons between time points, e.g., 2003 and 2006,
could be of particular interest for national policy makers. To turn back to the
example of the Slovak Republic that was given in Section 4.1.4, it is likely that
the findings in 2003 are a reason for the government of the Slovak Republic
to put extra effort in skills related to the ‘uncertainty’ scale. If they would do
so, then it is to be expected that the differences between the three subscales
becomes smaller. This would imply that, for a randomly chosen 15-year old
student in the Slovak Republic, the items related to the ‘uncertainty’ scale
will become relatively more easy. A recent report of the OECD (2012) shows
that countries do react on survey outcomes. The majority of participating
countries report that PISA has become an instrument for benchmarking student
performance, and that PISA has had an influence on policy reform. Here, we
want to emphasize that, if countries change their policy based on the survey
chapter 4 87
outcome, and if they succeed, then they actively create DIF between countries
and within countries over time.
But not only item properties are changing over time. More general, the
world around us is (rapidly) changing. Imagine, for instance, that a construct
like IT literacy would have been part of the PISA survey5 . Some items that
would have been suitable in 2000, would definitely not have been suitable
anymore in 2012. This implies that the content of the construct changes over
time and that, in order to measure whether the skills and knowledge of
15-year-old students are sufficient for real life, the set of items should also
change over (a longer period of) time. But are observed scores comparable
over time if the item set changes? We think they are. Compare this case, for
instance, with stock market indices like the Dow Jones Industrial Average.
Because the Dow Jones Industrial Average has to reflect at each time point
the current state of the market, the content on which the index is based
changes over time (Dieterich, 2013). In the same way, if the consortium of
participating countries agrees at each time point that the current set items
covers the construct of interest at that particular time point, then comparison
in terms observed scores are meaningful, because at each time point they
reflect the construct of interest. For example, a conclusion could then be of
the following structure: ‘country A scores 60% of the items of 2006 correct,
while they score 70% of the 2009 items correct’. Turning back to Adams’s
seventh requirement that an alternative method needs to support the
construction of scales that retain their meaning over time, it could be
concluded that the approach suggested in this paper also fits this requirement.
To conclude, if surveys have impact on policy decisions, then it is likely that
item properties change over time. Moreover, the surveys could serve policy
makers to change the world around us. They can compare the results of the
surveys with their own benchmarks, and choose to adjust their policy. And if
they succeed, survey outcomes should detect this. Therefore, surveys are part
of a dynamic system, and what we need is methodology that does justice to
these dynamics, rather than methodology that is rooted in an inherently static
view on an educational system. In the this paper, we started with providing
another look at DIF in international surveys. However, our approach to focus at
5 Around 2000, it has been discussed whether this construct should be part of the PISA
survey.
a large set of items also provides opportunities to study qualitative differences

between countries and within countries over time. This is a topic for further
research. If we more and more succeed to display all the dynamics described
above, we further improve the survey instruments with respect to their main
purpose: monitoring countries in a changing world.
chapter 4 89
Table 4.4: Results Reading Literacy per country: N1 = total sample size, N2 =
sample that took reading items, UR = upper rank, LR = lower rank.
PISA 2006 Plausible scores
Code Country N1 N2 Score SE UR LR Score SE UR LR
KOR Korea 5171 2791 556 3.8 1 1 17.62 0.17 1 1
FIN Finland 4710 2533 547 2.1 2 2 16.65 0.13 2 3
HKG Hong Kong-China 4566 2465 536 2.4 3 3 16.74 0.16 2 3
CAN Canada 22505 12129 527 2.4 4 5 15.69 0.12 4 5
NZL New Zealand 4798 2559 521 3.0 4 6 15.41 0.17 4 6
IRL Ireland 4572 2467 517 3.5 5 8 15.24 0.18 5 8
AUS Australia 14081 7556 513 2.1 6 9 14.87 0.12 6 11
POL Poland 5540 2975 508 2.8 7 11 14.88 0.14 6 12
SWE Sweden 4420 2371 507 3.4 7 12 14.82 0.17 6 13
NLD Netherlands 4863 2664 507 2.9 7 12 14.59 0.19 7 17
BEL Belgium 8834 4846 501 3.0 9 16 14.27 0.18 11 21
EST Estonia 4861 2631 501 2.9 9 16 14.50 0.16 9 17
CHE Switzerland 12167 6579 499 3.1 10 19 14.54 0.16 8 17
JPN Japan 5923 3196 498 3.6 10 20 14.22 0.18 12 22
TAP Chinese Taipei 8801 4740 496 3.4 11 21 14.51 0.19 8 18
GBR United Kingdom 13044 7045 495 2.3 13 21 13.86 0.12 18 26
DEU Germany 4875 2704 495 4.4 11 23 13.86 0.26 15 28
DNK Denmark 4515 2430 494 3.2 12 22 14.04 0.16 15 24
SVN Slovenia 6589 3634 494 1.0 15 20 14.35 0.13 11 19
MAC Macao-China 4746 2560 492 1.1 17 22 14.64 0.14 8 15
AUT Austria 4922 2646 490 4.1 14 25 13.93 0.19 16 26
FRA France 4673 2530 488 4.1 16 27 13.98 0.21 14 26
ISL Iceland 3756 2009 484 1.9 22 27 13.95 0.12 17 24
NOR Norway 4664 2507 484 3.2 21 28 13.56 0.16 22 29
CZE Czech Republic 5927 3246 483 4.2 21 29 13.58 0.24 19 31
HUN Hungary 4483 2401 482 3.3 22 29 13.61 0.18 20 29
LVA Latvia 4699 2561 479 3.7 23 30 13.71 0.20 18 29
LUX Luxembourg 4559 2446 479 1.3 25 29 13.42 0.12 25 30
HRV Croatia 5203 2774 477 2.8 25 30 13.27 0.13 26 32
PRT Portugal 5093 2785 472 3.6 27 33 13.03 0.19 28 34
LTU Lithuania 4728 2544 470 3.0 29 33 13.03 0.16 29 34
ITA Italy 21671 11636 469 2.4 30 33 13.25 0.12 27 32
SVK Slovak Republic 4724 2550 466 3.1 30 35 12.90 0.18 29 35
ESP Spain 19512 10516 461 2.2 33 35 12.50 0.12 34 36
GRC Greece 4847 2609 460 4.0 32 35 12.85 0.20 30 35
TUR Turkey 4922 2656 447 4.2 36 38 12.44 0.23 33 37
CHL Chile 5141 2783 442 5.0 36 39 11.95 0.24 36 39
RUS Russian Federation 5734 3076 440 4.3 36 39 11.82 0.23 37 39
ISR Israel 4392 2362 439 4.6 36 39 11.78 0.21 37 39
THA Thailand 6153 3342 417 2.6 40 41 10.00 0.13 41 43
URY Uruguay 4671 2523 413 3.4 40 43 10.64 0.18 40 40
MEX Mexico 30383 16433 410 3.1 40 43 9.97 0.15 41 43
BGR Bulgaria 4418 2374 402 6.9 41 49 9.75 0.31 41 47
SRB Serbia 4771 2565 401 3.5 43 47 9.35 0.16 43 48
JOR Jordan 6433 3494 401 3.3 43 47 9.21 0.14 44 49
ROU Romania 5102 2733 396 4.7 43 50 8.83 0.29 45 52
IDN Indonesia 10485 5649 393 5.9 43 51 8.76 0.27 46 52
BRA Brazil 8981 4890 393 3.7 45 50 9.23 0.15 44 49
MNE Montenegro 4436 2369 392 1.2 46 50 8.63 0.11 49 52
COL Colombia 4179 2259 385 5.1 47 52 9.52 0.21 42 47
TUN Tunisia 4534 2422 380 4.0 49 52 8.77 0.20 47 52
ARG Argentina 4068 2207 374 7.2 50 52 8.97 0.29 44 52
AZE Azerbaijan 5184 2784 353 3.1 53 53 6.34 0.13 53 53
QAT Qatar 6061 3281 312 1.2 54 54 5.62 0.10 54 54
KGZ Kyrgyzstan 5313 2870 285 3.5 55 55 4.60 0.13 55 55
Expected sum score

KOR PISA score
FIN
HKG
CAN
NZL
IRL
AUS
POL
SWE
NLD
BEL
EST
CHE
JPN
TAP
GBR
DEU
DNK
SVN
MAC
AUT
FRA
ISL
NOR
CZE
HUN
LVA
LUX
HRV
PRT
LTU
ITA
SVK
ESP
GRC
TUR
CHL
RUS
ISR
THA
URY
MEX
BGR
SRB
JOR
ROU
IDN
BRA
MNE
COL
TUN
ARG
AZE
QAT
KGZ
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55
ranking
Figure 4.9: 95% rank intervals per country.

Chapter 5
Discussion
In the previous three chapters, I have described the results of three research
projects related to the use of latent variable models in educational testing.
Each chapter ended with some topics for discussion. In this section, I will
briefly discuss the topics of this thesis from a more personal perspective. I will
do this chapter by chapter.
5.1 The optimal CAT for high-stakes testing
Following from the research about item parameter estimation in multistage

testing designs, I would like to discuss an issue that was already introduced in
the discussion of Chapter 2, and that is the relationship between the estimation
error of both item and person parameters. The main question is: what is an
optimal computerized adaptive test (CAT) for high-stakes testing?
In high-stakes testing, test results can have important consequences for
the test taker. These consequences force us to be careful with assumptions
that can have major impact on the inferences from the test data. Since
differences in testing stakes may result in different item characteristics
(Brennan, 2006), I think that it is preferable to use item parameter estimates
that are based on the actual test data and not on pretest data that are
obtained under other testing conditions. Furthermore, all sources of random
error should be taken into account in the final θ estimate. As I already
brought up in the discussion in Chapter 2, this latter requirement has
interesting implications for the optimality of adaptive testing designs.
Let us consider first a traditional CAT that is based on a calibrated item
bank, and where items are selected based on the Fisher information criterion
(Fisher, 1922; Van der Linden & Pashley, 2010). A well-known property of such
91
92 discussion
CATs is that, compared to linear tests, person parameters can be estimated

more efficiently. The increase of efficiency depends on a couple of factors. One
is the availability of items which provide high information at the level of the
current θ estimate. It is straightforward that the availability increases if the
size of the item bank increases. And in the limiting case, an optimal CAT is
based on an infinitely large calibrated item bank.
But are CATs based on large item banks really optimal? This way of
optimizing the accuracy of the θ estimates rests on the assumption that the
item parameters are known. In fact, the traditional (optimal) CAT aims to
administer a subset S to an individual student such as to minimize the posterior
variance of ability:
VAR(Θ|xS , λ), (5.1)
where λ denotes a vector of fixed item parameters.

If we take seriously the requirement that item parameters should be
estimated from the actual test administration, we run into trouble. We may
still use the parameters λ from the item bank to identify the set of items S to
administer to an individual student. After test administration, however, we
need to recompute the item parameters from the actual test administration
data.
The total uncertainty about θ now depends both on the set of items S and
on the uncertainty regarding λ. The correct posterior variance, taking into
account uncertainty regarding item parameters, is the following:
E[VAR(Θ|xS , Λ)|xS ] + Var[E(Θ|xS , Λ)|xS ] (5.2)
With a typical linear test (i.e., the same S for all students) administered to a
large enough number of students, the second term in (5.2) is negligible, at the
expense of the first term being potentially inflated. In contrast, in CAT the
first term is minimized, at the expense of the second term being inflated.
This brings me to the following position: optimal inferences from
high-stakes adaptive testing data implies a balance between the estimation
error of person and item parameters. In that sense, MST could be more
efficient compared CAT. I consider the minimization of (5.2) one of the main
outstanding questions for CAT research.
chapter 5 93
5.2 To order, or not to order: that is the question
In Chapter 3, I have discussed the use of the sum score for ordering individual
test takers. This was quite a formal discussion with the purpose to distinguish
between the stochastic ordering of the latent trait (SOL) property (Hemker et
al., 1997) that enables inferences about groups of students, and the sufficiency
property that enables inferences about individual students. The representation
of sufficiency in terms of the stochastic ordering of posterior distributions of
θ provided the baseline for the nonparametric alternative for the Rasch model
(npRM): if the purpose is to order individuals, then we have to investigate
whether the ordering of sum scores corresponds with the stochastic ordering
of the posterior distributions of the person parameters. The only difference
between this npRM and the parametric Rasch model (RM, Rasch, 1960) is
that it is not required that equal sum scores imply stochastically equal posterior
distributions.
This representation of sufficiency, brings me to the following issue in high-
stakes testing: based on which evidence can we classify students into ordered
groups, and when do we have to decide that we cannot make a distinction
between students?
In cases where the RM fits we feel comfortable to order based on the sum
score, because all available statistical informations is kept by the sum score. If
the RM does not fit, one can consider the less restrictive npRM, however, the
difference between both models is only about cases with equal sum scores. In
case of unequal sum scores, both models are equally restrictive. If the RM is
seen as restrictive, then the same could be said about the npRM. But on which
argument can we classify individual students if the npRM also does not fit?
I will mention three options, and discuss them in the context of high-stakes
testing.
The first is to leave the requirement of ordinal sufficiency of the sum score.
Then we arrive at the monotone homogeneity model (Mokken, 1971). If we do
this, and we use the sum score as final statistic, we decide to classify students
into groups based on a summary statistic, while other available information
does not support this classification. I think that this is questionable. If we
classify in high-stakes conditions, then I think that this classification should
not be contradicted by other available test data. The second option is to use
a parametric model with more parameters, e.g., the Two-Parameter Logistic
94 discussion
model (Birnbaum, 1968). However, a disadvantage of these models is that the

relation between observed responses and the final score (i.e., estimated ability)
is less than transparent, and strictly not in agreement with the sum score. The
third option is the one that I would like to emphasize in this discussion: find
another ordinal sufficient statistic, coarser than the sum score. One trivial
option is already provided in Chapter 3: assign the score 0 if all responses are
incorrect, assign the score 1 if some responses are correct and some responses
are incorrect, and assign the score 2 if all responses are correct. This scoring
rule is, of course, rather silly, yet it shows that we may be able to form sum
score groups, which have the ordinal sufficiency property. And what if we
cannot find such groups? In that case, we may ask ourselves whether we have
enough evidence to make ordinal judgements.
5.3 We want DIF!
In Chapter 4, I have considered the methodology for educational surveys from

the perspective of differential item functioning (DIF). DIF is usually seen as
a threat to validity, and therefore as something that should be avoided. But
for educational surveys, I have proposed another view on DIF: DIF could be
an interesting survey outcome, reflecting the diversity among countries and the
dynamics over time.
I think that the discussion about DIF needs more nuance. DIF is only a
threat if it appears where it is not expected. For instance, if a test is designed
to measure a unidimensional construct, and if the theory about the construct
is such that the construct can be represented as a latent variable, and the
relationship between item scores and latent variables is assumed to be the
same in each subpopulation, then DIF contradicts the assumptions, and might
cause bias in the latent variable estimation.
So, the question is: what do we expect with respect to educational
surveys? In the most recent PISA survey, 65 economies were involved
(OECD, 2014). The participating countries are spread over six continents, so
differences with respect to cultural background, curriculum, and school
system are evident. Because of the diversity among countries, I expect that
item scores and background variables are dependent, also after conditioning
on the total score. This implies that I expect DIF. Moreover, it provides new
chapter 5 95
possibilities to use a survey instrument as a monitoring system. For instance,

it could be that students in some country stay behind on a particular type of
math exercises. If item analyses provide such information, then the country’s
content experts and policy makers could think of explanations and
interventions. They could use this information to improve the curriculum,
and if they succeed, then this interaction effect is expected to be different in
the next cycle of the survey. These interventions might contribute to the
quality of the educational system, but at the same time create DIF over time.
However, successful and instrumental DIF that we should not want to avoid.
96 discussion
Bibliography
Adams, R. (2011, 19 April). Comments on Kreiner 2011: Is the

foundation under PISA solid? A critical look at the scaling model
underlying international comparisons of student attainment. Retrieved
from http://www.oecd.org/pisa/47681954.pdf
Adams, R., Wilson, M., & Wang, W. (1997). The multidimensional random
coefficients multinomial logit model. Applied Psychological Measurement,
21 (1), 1-23.
American Educational Research Association, American Psychological
Association, & National Council on Measurement in Education. (1999).
Standards for educational and psychological testing. Washinton, DC:
American Educational Research Association.
Andersen, E. B. (1973a). Conditional inference and models for
measuring. (Unpublished doctoral dissertation). Mentalhygiejnisk
Forskningsinstitut.
Andersen, E. B. (1973b). A goodness of fit test for the Rasch model.
Psychometrika, 38 , 123-140.
Bechger, T. M., & Maris, G. (2014). A statistical test for differential item pair
functioning. Psychometrika.
Bechger, T. M., Maris, G., & Verstralen, H. H. F. M. (2010). A different view
on DIF (Measurement and Research Department Reports No. 2010-4).
Cito.
Birnbaum, A. (1968). Some latent trait models and their use in inferring
an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical
theories of mental test scores (p. 395-479). Reading: Addison-Wesley.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation
of item parameters. Psychometrika, 46 , 443-460.
97
98 bibliography
Bolsinova, M., Maris, G., & Hoijtink, H. (2012, July). Unmixing Rasch scales.
Paper presented at the V European Congress of Methodology, Santiago
de Compostela, Spain.
Brennan, R. L. (2006). Perspectives on the evaluation and future of educational
measurement. In R. L. Brennan (Ed.), Educational measurement (4th ed.,
chap. 1). Westport: Praeger Publishers.
Conover, W. J. (1999a). Practical nonparametric statistics (3rd ed.). New
York: John Wiley & Sons.
Conover, W. J. (1999b). Statistics of the Kolmogorov-Smirnov type. In
W. J. Conover (Ed.), Practical nonparametric statistics (3rd ed., p. 428-
473). John Wiley & Sons.
Council of Europe. (2012). First european survey on language competences:
Technical report. Retrieved from http://www.surveylang.org/
Cronbach, L. J., & Gleser, G. C. (1965). Psychological test and personnel
decisions (2nd ed.). Urbana: University of Illinois Press.
Dieterich, C. (2013, March). In or out, DJIA companies reflect
changing times. The Wall Street Journal . Retrieved from
http://online.wsj.com/news/articles/
SB10001424127887324678604578342113520798752
Doob, J. (1949). Heuristic approach to the Kolmogorov-Smirnov theorems.
The Annals of Mathematical Statistics, 20 , 393-403.
Eggen, T. J. H. M., & Verhelst, N. D. (2011). Item calibration in incomplete
designs. Psychológica, 32 , 107-132.
Fischer, G. H. (1974). Einfuhrung in die Theorie Psychologischer Tests. Bern:
Verlag Hans Huber. (Introduction to the theory of psychological tests.)
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics.
Philosophical Transactions of the Royal Society of London. Series A,
Containing Papers of a Mathematical or Physical Character , 222 , 309-
368. doi: 10.1098/rsta.1922.0009
Glas, C. A. W. (1988). The Rasch model and multistage testing. Journal of
Educational Statistics, 13 , 45-52.
Glas, C. A. W. (1989). Contributions to estimating and testing Rasch models
(Unpublished doctoral dissertation). Arnhem: Cito.
Glas, C. A. W. (2000). Item calibration and parameter drift. In W. J. Van der
Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory
99
and practice (p. 183-199). Kluwer Academic Publishers.

Glas, C. A. W. (2010). Item parameter estimation and item fit analysis. In
W. J. Van der Linden & C. A. W. Glas (Eds.), Elements of adaptive
testing (p. 269-288). Springer.
Glas, C. A. W., Wainer, H., & Bradlow, E. (2000). MML and EAP estimation in
testlet-based adaptive testing. In W. J. Van der Linden & C. A. W. Glas
(Eds.), Computerized adaptive testing: Theory and practice (p. 271-287).
Kluwer Academic Publishers.
Goldstein, H. (2004). International comparisons of student attainment: some
issues arising from the PISA study. Assessment in Education, 11 (3),
319-330. doi: 10.1080/0969594042000304618
Grayson, D. A. (1988). Two-group classification in latent trait theory: Scores
with monotone likelihood ratio. Psychometrika, 53 , 383-392.
Guttman, L. (1950). The basis for scalogram analysis. In S. Stouffer,
L. Guttman, E. Suchman, P. Lazarsfeld, S. Star, & J. Clausen (Eds.),
Measurement and Prediction (Vol. 4, p. 60-90). Princeton, NY: Princeton
University Press.
Hemker, B. T., Sijtsma, K., Molenaar, I. W., & Junker, B. W. (1997).
Stochastic ordering using the latent trait and the sum score in polytomous
IRT models. Psychometrika, 62 (3), 331-347.
Hessen, D. J. (2005). Constant latent odds-ratios models and the Mantel-
Haenszel null hypothesis. Psychometrika, 70 (3), 497-516.
Holland, P., & Wainer, H. (Eds.). (1993). Differential item functioning.
Hillsdale, New Jersey: Lawrence Erlbaum Associates.
Huynh, H. (1994). A new proof for monotone likelihood ratio for the sum of
independent Bernoulli random variables. Psychometrika, 59 , 77-79.
Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking.
methods and practices (Second ed.). New York: Springer.
Kreiner, S. (2011). Is the foundation under PISA solid? A critical
look at the scaling model underlying international comparisons of
student attainment. (Tech. Rep.). Dept. of Biostatistics, University of
Copenhagen.
Kreiner, S., & Christensen, K. B. (2007). Validity and objectivity in
health-related scales: Analysis by graphical loglinear Rasch models. In
M. Von Davier & C. H. Carstensen (Eds.), Multivariate and mixture
100 bibliography
distribution Rasch models (p. 329-346). New York: Springer.

Kreiner, S., & Christensen, K. B. (2013). Analyses of model fit and robustness.
A new look at the PISA scaling model underlying ranking of countries
according to reading literacy. Psychometrika. doi: 10.1007/s11336-013-
9347-z
Kubinger, K. D., Steinfeld, J., Reif, M., & Yanagida, T. (2012). Biased
(conditional) parameter estimation of a rasch model calibrated item pool
administered according to a branched testing design. Psychological Test
and Assessment Modeling, 52 (4), 450-460.
Le, L. T. (2007). Effects of item positions on their difficulty
and discrimination: A study in PISA science data across test
language and countries. Paper presented at the 72nd Annual Meeting
of the Psychometric Society, Tokyo, Japan. Retrieved from
http://research.acer.edu.au/pisa/2/
Linthorne, N. (2014, August). Wind assistance in the 100m sprint. Retrieved
from http://www.brunel.ac.uk/ spstnpl/Publications/
Lord, F. M. (1971a). The self-scoring flexilevel test. Journal of Educational
Measurement, 8 (3), 147-151.
Lord, F. M. (1971b). A theoretical study of two-stage testing. Psychometrika,
36 , 227-242.
Lord, F. M. (1980). Application of item response theory to practical testing
problems. Mahway, New Jersey: Lawrence Erlbaum Associates.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores.
Reading, MA: Addison-Wesley.
Loyd, B. H., & Hoover, H. D. (1980). Vertical equating using the Rasch model.
Journal of educational measurement, 17 (3), 179-193.
Maris, G. (2008). A Note on ”Constant Latent Odds-Ratios Models and the
Mantel-Haenszel Null Hypothesis” Hessen, 2005. Psychometrika, 73 (1),
153-157.
Marsman, M., Maris, G., Bechger, T., & Glas, C. (2013a, July). Composition
algorithms for conditional distributions. Paper presented at the 78th
Annual Meeting of the Psychometric Society, Arnhem, The Netherlands.
Marsman, M., Maris, G., Bechger, T., & Glas, C. (2013b, October). A non-
parametric estimator of latent variable distributions. Paper presented at
the RCEC workshop on IRT and Educational Measurement, Enschede,
101
The Netherlands.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika,
47 , 149-174.
Meijer, R. R., Sijtsma, K., & Smid, N. G. (1990). Theoretical and empirical
comparison of the Mokken and the Rasch approach to IRT. Applied
Psychological Measurement, 14 (3), 283-298.
Milgrom, P. R. (1981). Good news and bad news: Representation theorems
and application. The Bell Journal of Economics, 12 (2), 380-391.
Mislevy, R. J. (1998). Implications of market-basket reporting for achievement-
level setting. Applied Psychological Measurement, 11 (1), 49-63.
Mokken, R. (1971). A theory and procedure of scale analysis. The Hague:
Mouton.
Neyman, J., & Scott, E. L. (1948). Consistent estimates based on partially
consistent observations. Econometrica, 16 , 1-32.
OECD. (2004). Learning for tomorrows world: First results from PISA 2003.
Retrieved from www.oecd.org/dataoecd/1/60/34002216.pdf
OECD. (2007). PISA 2006: Science competencies for tomorrows world:
Volume 1: Analysis.
OECD. (2009a). PISA 2006 technical report.
OECD. (2009b). PISA data analysis manual.
OECD. (2012). The policy impact of PISA: An exploration of the normative
effects of international benchmaring in school system performance
(OECD Education Working Paper No. 71). Organisation for Economic
Co-operation and Development.
OECD. (2014). PISA 2012 results in focus: What 15-year-olds know and what
they can do with what they know.
Oliveri, M. E., & Ercikan, K. (2011). Do different approaches to examining
construct comparability in multilanguage assessments lead to similar
conclusions? Applied Measurement in Education, 24 (4), 349-366. doi:
10.1080/08957347.2011.607063
Oliveri, M. E., & Von Davier, M. (2011). Investigation of model fit and score
scale comparability in international assessments. Psychological Test and
Assessment Modeling, 53 (3), 315-333.
Oliveri, M. E., & Von Davier, M. (2014). Toward increasing
fairness in score scale calibrations employed in international large-
102 bibliography
scale assessments. International Journal of Testing, 14 (1), 1-21. doi:

10.1080/15305058.2013.825265
Post, W. J. (1992). Nonparametric unfolding models. A latent structure
approach. Leiden: DSWO Press.
R Development Core Team. (2013). R: A language and environment for
statistical computing [Computer software manual]. Vienna, Austria.
Retrieved from http://www.R-project.org
Rasch, G. (1960). Probabilistic models for some intelligence and attainment
tests. Copenhagen: The Danish Institute of Educational Research.
(Expanded edition, 1980. Chicago, The University of Chicago Press)
Ross, S. M. (1996). Stochastic processes (Second ed.). New York: John Wiley
& sons.
Rubin, D. (1976). Inference and missing data. Biometrika, 63 , 581-592.
Sandilands, D., Oliveri, M. E., Zumbo, B. D., & Ercikan, K. (2013).
Investigating sources of differential item functioning in international
large-scale assessments using a confirmatory approach. International
Journal of Testing, 13 (2), 152-174. doi: 10.1080/15305058.2012.690140
Sekhon, J. S. (2011). Multivariate and propensity score matching software with
automated balance optimization: The matching package for R. Journal
of Statistical Software, 42 (7), 1-52.
Sijtsma, K., & Hemker, B. T. (2000). A taxonomy of IRT models for ordering
persons and items using simple sum scores. Journal of Educational and
Behavioral Statistics, 25 (4), 391-415.
Sijtsma, K., & Molenaar, I. (2002). Introduction to nonparametric item
response theory. Thousand Oaks, California: Sage Publications, Inc.
Spearman, C. (1904). General intelligence, objectively determined and
measured. American Journal of Psychology, 15 , 201-293.
Van der Linden, W. J., & Glas, C. A. W. (Eds.). (2010). Elements of adaptive
testing. New York: Springer.
Van der Linden, W. J., & Pashley, P. J. (2010). Item selection and ability
estimation in adaptive testing. In W. J. Van der Linden & C. A. W. Glas
(Eds.), Elements of adaptive testing (p. 3-30). Springer.
Verhelst, N. D. (2012). Profile analysis: A closer look at the PISA 2000 reading
data. Scandinavian Journal of Educational Research, 56 (3), 315-332. doi:
10.1080/00313831.2011.583937
103
Verhelst, N. D., & Glas, C. A. W. (1995). The one parameter logistic model:
OPLM. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models:
Foundations, recent developments and applications (p. 215-238). New
York: Springer Verlag.
Verhelst, N. D., Glas, C. A. W., & Verstralen, H. H. F. M. (1993). OPLM:
One parameter logistic model. Computer program and manual. Arnhem:
Cito.
Wainer, H., Bradlow, E., & Du, Z. (2000). Testlet response theory: An analog
for the 3pl model useful in testlet-based adaptive testing. In W. Van der
Linden & C. Glas (Eds.), Computerized adaptive testing: Theory and
practice (p. 245-269). Kluwer Academic Publishers.
Warm, T. (1989). Weighted likelihood estimation of ability in item response
theory. Psychometrika, 54 , 427-450.
Weiss, D. J. (Ed.). (1983). New horizons in testing: Latent trait test theory
and computerized adaptive testing. New York: Academic Press.
Zenisky, A., Hambleton, R. K., & Luecht, R. (2010). Multistage testing: Issues,
desings and research. In W. J. Van der Linden & C. A. W. Glas (Eds.),
Elements of adaptive testing (p. 355-372). Springer.
104 bibliography
References published chapters
Chapters 2, 3, and 4 have been submitted or accepted for publication as,

respectively:
Zwitser, R.J. & Maris, G. (in press). Conditional Statistical Inference with
Multistage Testing Designs. Psychometrika
Zwitser, R.J. & Maris, G. (conditionally accepted). Ordering Individuals with
Sum Scores: the Introduction of the Nonparametric Rasch Model.
Psychometrika
Zwitser, R.J., Glaser, S.S.F. & Maris, G. (submitted). Monitoring Countries
in a Changing World. A New Look at DIF in International Surveys
Interest of co-authors:
• G. Maris is the supervisor of this thesis
• S.S.F. Glaser is a master’s student who has contributed to the analyses

described in Section 4.4.3
105
106 references published chapters
Summary of ‘Contributions to Latent
Variable Modeling in Educational
Measurement’
One of the prominent questions in educational measurement is how to

summarise scored item responses into a final score, such that the final score
reflects the construct that is supposed to be measured. In answering this
question, latent variable models play an important role. This thesis considers
a couple of questions regarding the use of latent variable models in scoring
item responses.
Chapter 1 is an introduction to the rest of the thesis. It is explained that
there are different views on what a construct is. Furthermore, this chapter
introduces some general terms and theory, after which a broad overview of the
thesis is given.
The core of the thesis consists of Chapters 2 to 4, where three distinct
research projects are described. The first project, described in Chapter 2, is
about conditional likelihood inference from multistage testing designs. In
adaptive testing, the scoring of individual test takers is usually done via
estimates of person parameters. To obtain unbiased estimates, it is required
that the item parameters are also unbiased. This chapter shows how to
obtain item parameter estimates from multistage testing designs based on the
conditional likelihood method. Besides this technical result, some more
general issues related to adaptive testing, item parameter estimation, and
model fit are discussed. It is explained that simple measurement models are
more likely to fit the data obtained from adaptive designs compared to data
obtained from linear designs. This is illustrated with simulated data, as well
107
108 summary
as with real data taken from the Dutch Entreetoets.

In Chapter 3, the item response theory (IRT)-based justification of the use
of the sum score is considered. Two IRT-models are well-known for their
relationship between the sum score and the person parameter: the parametric
Rasch Model (RM), in which the sum score is a sufficient statistic for the
person parameter, and the nonparametric Monotone Homogeneity Model
(MHM), in which the latent trait is stochastically ordered by the sum score.
It is illustrated that there is a theoretical gap between the two: the RM
enables scoring individuals by means of the sum score, while the MHM
enables ordering groups by means of the sum score. To fill the gap, the
concept of ordinal sufficiency is defined, and the nonparametric Rasch Model
is introduced as a less restrictive nonparametric alternative that enables
ordering individuals by means of the sum score.
The final project, in Chapter 4, is about differential item functioning
(DIF) in international surveys. Usually, DIF is considered as a threat to
validity, and as a phenomenon that hinders the comparison of performances
between countries. However, in the approach described in Chapter 4, DIF is
not considered as a threat, but as an interesting survey outcome reflecting
qualitative differences between countries. To obtain comparable scores in a
context with DIF, it is proposed not the take the person parameter estimates
as a basis for comparison. Instead, it is proposed to define the construct as a
market basket of items, and to take (a summary statistic of) the item
responses as the basis for comparisons. Since survey data are usually
incomplete, the latent variable models - probably different models in different
countries - are used to describe the distribution of the item responses in the
market basket. This approach is illustrated with data from the PISA cycle of
2006.
Chapter 5 is a general discussion. Three issues related to Chapters 2 to 4
are raised. The first is about an optimal adaptive test design for high-stakes
testing. It is argued that this is not a computerized adaptive test (CAT) with
an infinitely large and calibrated item bank. Instead, a multistage test can lead
to more efficient results. The second is about what to do when the sum score
is not ordinal sufficient for the person parameter. It is argued that, especially
in high-stakes testing, one should look for a coarser statistic that is ordinal
sufficient. The third issue is an elaboration on the positive aspects of DIF.
Samenvatting van ‘Contributions to
Latent Variable Modeling in
Educational Measurement’
Een prominente vraag bij meten in het onderwijs is hoe de scores op

afzonderlijke toetsvragen samengevat moeten worden in een eindscore,
zodanig dat de eindscore het construct - dat is de te meten vaardigheid -
representeert. Bij het beantwoorden van deze vraag spelen
latentevariabelemodellen een belangrijke rol. Dit proefschrift beschouwt een
aantal vraagstukken omtrent het gebruik van latentevariabelemodellen en het
bepalen van eindscores.
Hoofdstuk 1 is een introductie. Er wordt in de eerste plaats beschreven dat
er verschillende visies zijn op wat een construct eigenlijk is. Verder worden een
aantal algemene termen en de nodige theorie geı̈ntroduceerd. Tenslotte wordt
een overzicht gegeven van de rest van het proefschrift.
De hoofdstukken 2 tot en met 4 vormen de kern van het proefschrift. In
deze hoofdstukken worden drie afzonderlijke onderzoeksprojecten beschreven.
Het eerste project, beschreven in hoofdstuk 2, gaat over conditionele
likelihood 1 inferentie bij multistage toetsing. Bij adaptief toetsen worden de
scores meestal toegekend via van schattingen van de persoonsparameters. Om
zuivere schatters te krijgen is het vereist dat de itemparameters ook zuiver
zijn. Dit hoofdstuk laat zien hoe bij multistage toetsing de itemparameters
geschat kunnen worden op basis van de conditionele likelihood methode.
Naast dit technische resultaat wordt ook een aantal algemene thema’s met
betrekking tot adaptief toetsen, itemparameterschattingen en model fit
1 Voor sommige woorden in de Nederlandse samenvatting wordt de Engelse term gebruikt,
omdat deze ook in het Nederlandse jargon gebruikt worden.
109
110 samenvatting
besproken. Daarbij wordt uitgelegd dat eenvoudige meetmodellen

waarschijnlijk beter passen op data van adaptieve toetsen in vergelijking met
data van lineaire toetsen. Dit wordt zowel geı̈llustreerd met gesimuleerde data
als met data van de Nederlandse Entreetoets.
Hoofdstuk 3 gaat over de onderbouwing van het gebruik van de somscore
met behulp van itemresponstheorie (IRT). Twee IRT-modellen zijn bekend
vanwege de relatie tussen de somscore en de persoonsparameter: het
parametrische Rasch Model (RM), waarin de somscore een sufficient statistic
is voor de persoonsparameter, en het niet-parametrische Monotone
Homogeneity Model (MHM), waarin de latente trek stochastisch geordend is
op basis van de somscore. In hoofdstuk 3 wordt betoogd dat het RM het
scoren van individuen op basis van de somscore onderbouwt, terwijl het MHM
het ordenen van groepen op basis van de somscore onderbouwt. Dit laat
ruimte voor een derde model. Om het derde model te kunnen introduceren,
wordt eerst het begrip ordinal sufficiency gedefinieerd. Het model dat
vervolgens wordt geı̈ntroduceerd is het niet-parametrische Rasch model. Dit
is een minder restrictief model dan het RM, waarmee het ordenen van
individuen op basis van de somscore kan worden onderbouwd.
Het laatste project, dat beschreven wordt in hoofdstuk 4, gaat over
differential item functioning (DIF) in internationale onderwijspeilingen.
Meestal wordt DIF gezien als een bedreiging voor de validiteit en als iets dat
de vergelijking van de prestaties van landen bemoeilijkt. Echter, in de
methode zoals die wordt beschreven in hoofdstuk 4, wordt DIF niet gezien als
een bedreiging, maar als een interessant resultaat dat kwalitatieve verschillen
tussen landen weergeeft. Om in een context met DIF te komen tot
vergelijkbare scores wordt voorgesteld om de vergelijking niet te baseren op
de persoonsparameters. In plaats daarvan wordt voorgesteld om het construct
te definiëren als een grote verzameling toetsvragen (de market basket), en om
vergelijkingen te baseren op een samenvattende statistiek op deze market
basket. Aangezien de data van peilingen meestal incompleet zijn, worden
latentevariabelemodellen - mogelijk verschillende modellen in verschillende
landen - gebruikt om de verdeling van de itemscores in de market basket te
schatten. Deze benadering is geı̈llustreerd met PISA data uit 2006.
Hoofdstuk 5 is een algemene discussie. Drie zaken met betrekking tot de
hoofdstukken 2 tot en met 4 worden nader bediscussieerd. De eerste is de
111
vraag wat een optimale adaptieve toets is voor high-stakes toetsing. Daarbij
wordt beargumenteerd dat dit niet een computer adaptieve toets (CAT) met
een oneindig grote en gekalibreerde vragenbank is. In plaats daarvan kan een
multistage toets tot efficiëntere resultaten leiden. De tweede gaat over wat te
doen als de somscore niet ordinal sufficient is voor de persoonsparameter. Er
wordt beargumenteerd dat in het geval van high-stakes toetsing men wellicht
een ruwere statistiek dan de somscore moet zoeken die wel ordinal sufficient is.
De derde is een uitwijding over de positieve aspecten van DIF.
112 samenvatting
Dankwoord
Dit proefschrift heb ik mogen schrijven in de tijd dat ik in dienst was bij Cito.
Graag bedank ik Cito voor alle geboden faciliteiten. Een aantal mensen heeft
de achterliggende jaren een bijzondere bijdrage geleverd. Ik wil hen hierbij
persoonlijk bedanken.
Gunter, ik heb jouw begeleiding enorm gewaardeerd. Door jouw creativiteit,
toegankelijkheid, beschikbaarheid, eigenwijsheid en heldere stijl van uitleggen
heb ik de achterliggende vijf jaar ontzettend veel geleerd. Bedankt voor al je
toewijding en vertrouwen.
Anton, bedankt voor mijn plek bij POK, voor je lessen in pragmatisch
en tactisch handelen, en voor je aanmoediging om hoofdstuk 4 helemaal te
herschrijven. Uiteindelijk is dit hoofdstuk er een stuk beter van geworden.
Bas, bedankt voor je betrokkenheid in brede zin. Al tijdens mijn
afstudeerstage was je als kamergenoot nieuwsgierig naar mijn bezigheden. En
in een later stadium, toen het stageonderzoek ontwikkelde tot hoofdstuk 3
van dit proefschrift, was je genuanceerde feedback op het iets te
ongenuanceerde manuscript een welkome aanvulling.
Timo, bedankt voor de vele momenten waarop je beschikbaar was voor
vragen en voor je feedback op verscheidene delen van het manuscript.
Han, bedankt voor je begeleiding in de laatste maanden. Mede door jouw
betrokkenheid is het gelukt in een drukke tijd tot een afronding te komen.
Saskia, Matthieu en Maarten, bewoners van de ‘prijzenkast’, bedankt voor
de vele gesprekken, koffiemomentjes, flauwe grappen en morele steun. Ik koester
vele mooie herinneringen aan B5.46. Ook dank ik de overige collega’s van Cito,
en in het bijzonder die van POK, voor de fijne tijd, de goede sfeer en de
voortdurende bereidheid om even mee te denken.
Michiel en Matthieu, bedankt voor jullie vriendschap en betrokkenheid in
113
114 dankwoord
de afgelopen jaren. Ik vind het fijn dat jullie mijn paranimfen zijn.
Pa en ma, van kinds af aan hebben jullie mijn nieuwsgierigheid positief
benaderd. Mijn studietijd heeft wat omzwervingen gekend, maar ook daarin
heb ik de mogelijkheid gekregen om mijn weg te vinden. Ik ben blij met dit
proefschrift als resultaat. Bedankt voor al jullie aanmoedigingen.
Janet en Thijmen, ik ben blij dat jullie werk en wetenschap zo kunnen
relativeren. Bedankt voor alles wat jullie mij geven.
Robert Zwitser
December 2014
.
Modeling in Educational Measurement
Contributions to Latent Variable
Robert J. Zwitser
Contributions to Latent Variable Modeling in Educational Measurement Robert J. Zwitser

Proefschrift Robert Zwitser PDF

Uploaded by

Copyright:

Available Formats

Proefschrift Robert Zwitser PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Proefschrift Robert Zwitser PDF

Uploaded by

Copyright:

Available Formats

UvA-DARE (Digital Academic Repository)

Contributions to latent variable modeling in educational measurement

Citation for published version (APA):

UvA-DARE is a service provided by the library of the University of Amsterdam (http://dare.uva.nl)

Download date: 16 Jun 2017

Contributions to Latent Variable Modeling in Educational Measurement

Printed by Ipskamp Drukkers, Enschede

This research was supported by Cito, Institute for Educational Measurement

Contributions to Latent Variable Modeling in

ter verkrijging van de graad van doctor

Robert Johannes Zwitser

Promotor: Prof. dr. G.K.J. Maris Universiteit van Amsterdam

Faculteit: Faculteit der Maatschappij- en Gedragswetenschappen

2 CML Inference with MST Designs 7

3 The Nonparametric Rasch Model 37

4 DIF in International Surveys 63

References published chapters 105

Through all stages of education, from kindergarten to university, we use tests

1.1 The construct

1.2 Latent variable models

In this thesis, I mainly focus on a particular class of latent variable models:

is based on the following IRF:

in which P (Xi = 1|θ) denotes the probability of a score 1, conditional on

P (X1 = 1|θ) ≤ P (X2 = 1|θ) ≤ · · · ≤ P (XK = 1|θ),

1.3 This thesis

1.3.1 CML Inference with MST Designs

1.3.2 The Nonparametric Rasch Model

1.3.3 DIF in International Surveys

differential item functioning (DIF). There is therefore no single model that

1.4 Note about notation

Conditional Statistical Inference with

For several decades, test developers have been working on the

Figure 2.1: Example of a multistage design.

2.1 Conditional likelihood estimation

in which xpi denotes the response of examinee p, p = 1, ..., K, on item i,

Statistical inference about X is hampered by the fact that the person

size. It is known that, in the presence of an increasing number of incidental

2.1.1 Estimation of item parameters

[1] [2] [3]

2 Whenever possible without introducing ambiguity, we ignore the distinction between

random variables and their realizations in our formulae.

γx[1] (b[1] )γx[2] (b[2] )γx[3] (b[3] )

and γs (b[m] ) is the elementary symmetric function of order s:

To turn a sample from X into a realization of data from MST, we do the

[1] Pb[1,2] (x[1,2] |θ) [1]

We now show that the conditional distribution in (2.2) factors as follows:

2.1.2 Comparison with alternative estimation procedures

in which λ are the parameters of the distribution of θ.

From this, we immediately obtain the marginal likelihood function:

Pb[1,2,3] (X[1] = x[1] , Xobs = xobs )

[1] [1] [1]

2.1.3 Estimation of person parameters

As usual, we consider the item parameters as known when we estimate ability.

2.2 Model fit

We have mentioned in the introduction that adaptive designs may be beneficial

2.2.1 Model fit in adaptive testing

These equations could be transformed into the following equations for b1 , b2 ,

2.2.2 Likelihood ratio test

N [1] items N [2] items N [3] items