Review of Educational Research
Winter 1994, Vol. 64, No. 4, pp. 575-603

A Framework for Developing Cognitively Diagnostic Assessments
Paul D. Nichols
University of Wisconsin-Milwaukee

Over the past decade or so, a growing number of writers have argued that
cognitive science and psychometrics could be combined in the service of
instruction. Researchers have progressed beyond statements of intent to the
hands-on business of researching and developing diagnostic assessments
combining cognitive science and psychometrics, what I call cognitively
diagnostic assessment (CDA). In this article, I attempt to organize the many
loosely connected efforts to develop cognitively diagnostic assessments. I
consider the development of assessments to guide specific instructional
decisions, sometimes referred to as diagnostic assessments. Many of my
arguments apply to program evaluation as well: assessments that reveal
the mechanisms test takers use in responding to items or tasks provide
important information on whether instruction is achieving its goals. My
goal in this article is to characterize CDA in terms of the intended use of
assessment and the methods of developing and evaluating assessments.
Towards this goal, I (a) outline the societal trends that motivate the development of CDA, (b) introduce a framework within which the psychological
and statistical aspects of CDA can be coordinated, and (c) summarize efforts
to develop CDA in a five-step methodology that can guide future development
efforts. Finally, I address some of the issues developers of CDA must resolve
if CDA is to succeed.

Over the past decade or so, a growing number of writers have argued that cognitive science and psychometrics could be combined in the service of instruction (Bejar, 1984; Haertel & Calfee, 1983; Linn, 1986; Messick, 1984; Snow & Lohman, 1989). They have criticized traditional testing for losing sight of the psychology of the performance being tested (Glaser, 1981; Glass, 1986). Traditional testing practices appear to place more emphasis on statistical technique than on the psychology of the construct being measured (Anastasi, 1967). They have argued that, given some knowledge of the goals and methods of instruction and of the psychology of the construct, "educational tests might be made more diagnostic of malfunctions in learning and more informative for instructional adaptation" (Snow & Lohman, 1989, p. 266).

I am grateful to Dean Colton, David F. Lohman, David J. Mittelholtz, and Robert L. Brennan for their helpful comments on drafts of this article. Correspondence concerning this article should be addressed to Paul D. Nichols, 765 Enderis Hall, P.O. Box 413, Department of Educational Psychology, University of Wisconsin-Milwaukee, Milwaukee, WI 53201.


Researchers have progressed beyond what Pellegrino (1992) has called verbal statements of intent and handwaving about proposed solutions to the hands-on business of researching and developing diagnostic assessments combining cognitive science and psychometrics, what I call cognitively diagnostic assessment (CDA).1 They are constructing new assessments informed by research on the psychology of learning and achievement and embracing new statistical models for combining observations. They design problems and tasks through systematic, research-based variations in problem characteristics. They score performance using statistical models intended to make inferences about qualities of learners' processes and knowledge structures.

The defining characteristic of CDA is that it makes explicit the substantive assumptions the test developer is using to construct test materials and assign scores. These substantive assumptions describe the processes and knowledge structures a performer in the test domain would use, how they develop, and how more competent performers differ from less competent performers. Often, these substantive assumptions are embodied in psychological theories but may be represented less formally (DuBois & Shalin, in press). These substantive assumptions are testable because they are explicit.
In this article, I attempt to organize the many loosely connected efforts to develop CDA. I consider the development of assessments to guide specific instructional decisions, sometimes referred to as diagnostic assessment. Many of my arguments apply to program evaluation as well: assessments that reveal the mechanisms test takers use in responding to items or tasks provide important information on whether instruction is achieving its academic goals. In contrast, I do not address tests intended to select students for a particular educational institution or program, even though I believe an approach that considers psychology could benefit the development of such tests, as well. My goal in this article is to characterize CDA in terms of the intended use of assessment and the methods of developing and evaluating assessments. Towards this goal, I (a) outline the societal trends that motivate the development of CDA, (b) introduce a framework within which the psychological and statistical aspects of CDA can be coordinated, and (c) summarize efforts to develop CDA in a five-step methodology that can guide future development efforts. Finally, I address some of the issues developers of CDA must resolve if CDA is to succeed. Perhaps the most important issue is the success of CDA with regard to informing instructional adaptation. The ability to inform instructional decisions is the raison d'etre and the Achilles' heel of CDA. Throughout this article, I contrast CDA with traditional assessment that, I argue, was intended for a different use and, subsequently, is developed and evaluated using different methods.
Societal Demand for New Assessment Techniques
Researchers, frustrated by the lack of diagnosticity of traditional tests, have begun researching and designing new assessments to reveal the mechanisms used by test takers in responding. I believe that most readers would agree that research is done within a social context and that assessments are socially constructed. Furthermore, psychological and psychometric discourse can be used to serve all kinds of social functions, which is not to endorse a social constructionist philosophy. In this section, I sketch the different societal contexts that have

motivated, at least in part, development of traditional and diagnostic assessments. I will not pretend to be an observer of social trends, but I will review what others have observed. I present this context to highlight the differences in intended use between CDA and traditional tests, and to describe the role CDA is intended to play in education.
Traditional Context
Traditional assessments were developed to confront the dilemma of educators in the early part of the 20th century. Educators were faced with determining which students would be able to profit best from uniform instruction designed essentially for the majority of the population (Glaser, 1981). Resources limited the information they could gather about each student and precluded tailoring programs to individual students' needs (Mislevy, in press). I largely agree with Mislevy (in press), who argues that current test theory has been effective in selecting students most likely to succeed in a particular educational institution or program.
Traditional assessments developed for selection are based generally on what Snow and Lohman (1989) have termed the educational psychometric measurement (EPM) approach. The test theories that dominate the EPM conception are aimed at estimating a person's location on an underlying latent variable: a true score in classical test theory (CTT) or a latent trait in unidimensional Item Response Theory (IRT).2 This location is typically interpreted as an amount on the latent scale. The model is judged as to how well it places people into a single sequence or aids selection into a single program (Mislevy, in press). Either CTT or IRT may usefully inform decisions about such linearly ordered alternatives (Dawes & Corrigan, 1974).
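As a concrete illustration of what "estimating a person's location on a latent variable" amounts to in practice, the sketch below finds a maximum-likelihood ability estimate under a one-parameter (Rasch) IRT model for a single response vector. It is a minimal, generic sketch of the EPM goal described above, not code from any of the cited programs; the item difficulties and responses are invented.

import math

def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def log_likelihood(theta, responses, difficulties):
    """Log-likelihood of a 0/1 response vector at ability theta."""
    ll = 0.0
    for x, b in zip(responses, difficulties):
        p = rasch_prob(theta, b)
        ll += x * math.log(p) + (1 - x) * math.log(1 - p)
    return ll

def estimate_theta(responses, difficulties):
    """Crude grid-search MLE of the latent trait (the examinee's 'location')."""
    grid = [i / 100.0 for i in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, responses, difficulties))

# Hypothetical data: five items of increasing difficulty, one examinee.
difficulties = [-1.5, -0.5, 0.0, 0.8, 1.6]
responses = [1, 1, 1, 0, 0]
print(round(estimate_theta(responses, difficulties), 2))

Because number correct is a sufficient statistic for the Rasch ability, two examinees with the same number correct on these items receive the same estimate no matter which items they answered correctly; the response pattern itself carries no diagnostic weight, which is the limitation the CDA approach targets.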
Perhaps because of the scientific community's influence at the time of their development, assessments based on the EPM approach fit well early-20th-century theories of learning. For example, traditional assessments are constructed using a sampling approach that fits well associationist models of learning popular at the turn of the century (Resnick & Resnick, 1992). Their design is based on logical taxonomies and content specifications and lacks explicit psychological models of the structures and processes that underlie domain performance (Snow & Mandinach, 1991). Scores are tied to content areas rather than cognitive mechanisms. These psychological assumptions do not necessarily reflect the measurement specialist's level of sophistication in psychology, although Shepard (1991) reports that many measurement specialists hold views of learning consistent with these early psychological theories. Rather, the substantive assumptions appear to be implicit in the traditional methods psychometricians use to develop assessments. As Anastasi (1967) noted:

Psychometricians appear to shed much of their psychological knowledge as they concentrate upon the minutiae of elegant statistical techniques. Moreover, when other types of psychologists use standardized tests in their work, they too show a tendency to slip down several notches in psychological sophistication. (p. 300)
The EPM approach toward constructing assessments may not be helpful for designing assessments intended to directly inform instructional decisions. As

Bejar (1984) notes, scores derived from traditional CTT or IRT approaches provide only general information to guide specific instructional decisions. "Thus, the student with a lower score could benefit, perhaps, from additional or remedial instruction, but there are no guidelines for developing efficient courses of instruction" (p. 185). Furthermore, scores on new performance-based or authentic assessments often provide little more information than traditional assessments to guide specific instructional decisions. Performance-based or authentic assessments may well consist of tasks that are more representative of some intended domain; however, these assessments continue to be developed and evaluated with an eye toward the same criterion: estimating a person's location on an underlying latent continuum. In either case, scores indicate no more than the need for additional instruction.
Current Context
The CDA approach has been motivated, at least in part, by the current emphasis on helping individuals to succeed in educational opportunities, in contrast to selecting individuals for those opportunities (Stiggins, 1991). The requirement now is to design education that helps all students succeed (Carnegie Council on Adolescent Development, 1989; National Education Goals Panel, 1991; National Governors' Association, 1990). According to the National Governors' Association Report on the Task Force on Education, "We must abandon the view that only a small proportion of our population must be well educated, that many can get by with less knowledge and fewer skills" (p. 7). A source of the current concern for the learning needs of all children is the recognition that there is a strong connection between how well a nation can perform and the existence of high-quality, widely distributed education (American Association for the Advancement of Science, 1989). Educators and policymakers are demanding new assessments to help individuals succeed in educational opportunities; they require assessments to evaluate school learning and directly inform instruction.
As part of this movement toward "assessments of enablement" (Glaser, 1988), or assessments intended to improve further learning, researchers have developed CDA informed by research on the psychology of learning and achievement, and embracing statistical models for making inferences regarding the structures and processes that underlie domain performance. These new assessments make explicit the test developer's substantive assumptions regarding the processes and knowledge structures a performer in the test domain would use, how the processes and knowledge structures develop, and how more competent performers differ from less competent performers. The model is judged as to how well it allows inferences regarding the processes and knowledge structures used by an individual to respond to an item or task, for example, strategies using the Rule Space Model (Tatsuoka, 1983, 1990, in press) or knowledge organization using the Pathfinder algorithms (Johnson, Goldsmith, & Teague, in press; Goldsmith, Johnson, & Acton, 1991). Thus, the cognitive assessment approach is distinct from traditional approaches relying on logical taxonomies and content specifications and employing statistical approaches developed for selecting students.
However, the shift away from an EPM approach and toward a CDA approach may not have occurred without preceding developments in psychology and statistics. In psychology, the "cognitive revolution" made respectable such concepts

as thoughts, images, and ideas, and the processes they signify again became legitimate areas of study (Lachman, Lachman, & Butterfield, 1979). Later, psychologists moved beyond studying general mechanisms of thinking and learning and addressed the processes and knowledge structures involved in performing everyday tasks in academic, industrial, and military settings (Glaser, 1988). In statistics, researchers were developing applications of statistical inference not based on the Neyman-Pearson system of classical statistical inference (Jeffrey, 1983).
The CDA approach enables educators and policymakers to draw conclusions about students' instructional needs and the effectiveness of instructional programs because CDA provides instructors and policymakers with information on strategies students use to attack problems, relationships students perceive among concepts, and principles students understand in a domain: student outcomes closely aligned with curriculum movements in science and mathematics. In contrast, traditional assessments, including performance-based or authentic assessments, provide instructors and policymakers with estimates of students' relative positions with regard to the amount of content acquired. CDAs appear to support current needs of educators, whereas traditional assessments appear better suited to past educational approaches that stressed the accretion of facts.
Framework for Cognitively Diagnostic Assessment
Researchers designing cognitively diagnostic assessments realized that they were lacking a test theory that allowed them to diagnose the qualities of learners' processes and knowledge structures. This required more than summing the correct responses.3 They were searching for new test theories: test theories suited to diagnosing cognitive mechanisms. In this section, I attempt to describe the psychological and statistical considerations involved in adopting a test theory. This may seem a strange idea to many measurement specialists and psychologists alike. I argued in the previous section that decisions regarding instructional adaptation require a different approach toward assessment than decisions regarding selection. In this section, I argue that different conceptions of domain performance require different test theories. All domains are not alike, and the test developer should not ignore the unique nature of the performance. I use examples from mathematics to illustrate this argument because mathematics has been studied extensively by psychologists and has, perhaps not coincidentally, been the focus of a number of CDA research efforts.
Following Ippel (1986, 1991) and Lohman and Ippel (1993), I view test theory as consisting of two related aspects: an observation design for constructing and arranging tasks or items, and a measurement design for collecting and combining responses. Current test theories confound these two aspects. The observation and measurement designs provide the test developer a framework within which the psychological and statistical considerations can be coordinated. I will argue later in this section that the validity of both designs must be evaluated within the context of substantive research. The observation and measurement designs are discussed separately, but practical issues concerning them should be considered in company because diagnosis is possible only through their coordination. The most sophisticated inference from any diagnostic measurement design is limited by the richness of the performance elicited through the observation design.

Alternatively, the value of performance elicited through the observation design is limited by the measurement design's power to use the information.
Observation Design
The observation design describes the characteristics of assessment activities, such as tasks or items, that make demands on the test taker, how these characteristics are to be organized in the construction and ordering of tasks or items, and the nature of the responses required. An example of characteristics that make demands on the test taker is provided by Tatsuoka (1990). Some seventh- and eighth-graders, when they must increase the numerator of the first fraction of a problem, reduce the whole number by 1 and add 10 to the first numerator, as follows:

2 5/10 - 6/10 = 1 9/10.

But if the denominator of the fraction is not 10, then the strategy produces an incorrect answer, as follows:

2 1/5 - 2/5 = 1 9/5.

To detect this strategy, mixed fraction subtraction problems must have both of the following characteristics: the numerator of the first fraction must be less than the numerator of the second fraction, and one problem must have a denominator not equal to 10 (Tatsuoka, 1990).
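The buggy strategy Tatsuoka describes can be stated as a procedure. The sketch below is my own illustration, not Tatsuoka's code; it applies the "reduce the whole number by 1 and add 10 to the first numerator" rule and shows that the rule happens to yield correct answers when the denominator is 10 but not otherwise, which is why both item characteristics are needed to detect it.

from fractions import Fraction

def buggy_subtract(whole, num1, num2, den):
    """Compute (whole num1/den) - (num2/den) with the buggy borrowing rule:
    reduce the whole number by 1 and add 10 (rather than the denominator)
    to the first numerator."""
    if num1 < num2:
        whole -= 1
        num1 += 10          # the bug: should be num1 += den
    return whole, num1 - num2, den

def correct_subtract(whole, num1, num2, den):
    """Reference computation using exact rational arithmetic."""
    value = whole + Fraction(num1, den) - Fraction(num2, den)
    w, rem = divmod(value, 1)
    return int(w), rem.numerator, rem.denominator

# 2 5/10 - 6/10: the buggy rule and the correct procedure agree (denominator is 10).
print(buggy_subtract(2, 5, 6, 10), correct_subtract(2, 5, 6, 10))
# 2 1/5 - 2/5: the buggy rule gives 1 9/5, while the correct answer is 1 4/5.
print(buggy_subtract(2, 1, 2, 5), correct_subtract(2, 1, 2, 5))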
The purpose of the observation design is to construct and arrange assessment activities in ways that reveal the mechanisms test takers use in responding. Assessment activities are designed to allow the efficient inference of the processes and knowledge structures possessed by the test taker and the processes and knowledge structures missing. Test developers must infer from a model of performance the sets of characteristics that will discriminate groups of test takers that have different sets of processes or knowledge structures. This sort of inference involves deductive reasoning: reasoning from causes to possible effects. In the case of CDA, deductive reasoning guides reasoning from a psychological model of processes and knowledge structures to inferences about the effects of different types of tasks on observable behaviors (Mislevy, in press).
The observation design implies that the test developer understands the cognitive demands made on the test taker by the assessment activities. The development of any systematic understanding of task demands requires substantive research on the psychology of the test domain. The substantive research in test development identifies what the task or item characteristics are so that they can be systematically manipulated to investigate the cognitive mechanisms influenced by each. Typically, test developers are not prepared to identify or manipulate these characteristics (Snow & Lohman, 1989).
An example of manipulating task characteristics to investigate the mechanisms used by test takers is provided by Marshall (1993) for assessing schema knowledge of arithmetic word problems. The assessment is part of the Story Problem Solver (SPS), a computer-based instructional system developed by Marshall and her associates (Marshall, 1993; Marshall, Barthuli, Brewer, & Rose, 1989) to help

students in remedial college and community college mathematics classes. According to the theory motivating the development of SPS, a schema consists of four types of knowledge: feature recognition, constraint mapping, planning/goal setting, and implementation (Marshall, Pribe, & Smith, 1987). Feature recognition knowledge is used by the learner to recognize that the schema fits the problem. Constraint mapping knowledge is used by the learner to ascertain that enough of the features of a selected schema are present in the problem to solve it. Planning/goal setting knowledge is used by the learner to organize the different schemas in a multi-step problem in a way such that the correct solution can be obtained. Finally, implementation knowledge is used by the learner to select and execute the appropriate arithmetic operation to solve the problem.
The SPS includes assessment designed to reveal learners' development and use of the four knowledge types for the five schemas that appear to be sufficient for arithmetic story problems (Marshall, 1993). The task shown in Figure 1 was developed to assess feature recognition knowledge for the Change schema. The Change schema applies to a problem that has a starting amount, some action that takes place over time and that alters the amount, and an ending amount. For example, the following problem fits the Change schema:

Alex was selling snow shovels one morning when snow began to fall. Alex started out the day with thirty shovels. When the store closed that evening, Alex had two snow shovels left. How many snow shovels did Alex sell?

The assessment task asks test takers to match specific numbers, words, or phrases from a story problem with the appropriate parts of the icon in Figure 1 representing the Change schema. Using the mouse cursor, the test taker selects the phrase or number, moves the mouse cursor into one part of the icon, and copies the highlighted text into the icon by pressing a mouse button.4
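As a rough illustration of how such a feature-recognition task might be scored against the Change schema, the sketch below compares a test taker's phrase-to-slot mapping with a keyed mapping for the snow shovel problem. The data structures, slot names, and keyed answers are my own hypothetical rendering of Marshall's task, not the SPS implementation.

# Slots of the Change schema: a start amount, an action over time that
# alters it, and an ending amount.
CHANGE_SLOTS = ("start", "change", "end")

# Hypothetical key for the snow shovel problem quoted above.
KEY = {
    "start": "thirty shovels",
    "change": "How many snow shovels did Alex sell?",
    "end": "two snow shovels left",
}

def score_mapping(response: dict) -> dict:
    """Return, per schema slot, whether the test taker dropped the keyed
    phrase into that part of the Change icon."""
    return {slot: response.get(slot) == KEY[slot] for slot in CHANGE_SLOTS}

# A test taker who confuses the ending amount with the unknown change:
attempt = {
    "start": "thirty shovels",
    "change": "two snow shovels left",
    "end": "How many snow shovels did Alex sell?",
}
print(score_mapping(attempt))
# {'start': True, 'change': False, 'end': False}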
As this example illustrates, validity evidence for the observation design is based on substantive research in the test domain because the characteristics that make demands on the test taker are identified through substantive research. This underscores the importance of psychology in test development. For example, Marshall's work with schema knowledge extends research that began with Bartlett's (1932) study of text comprehension. Marshall developed and tested her own theory of schema knowledge for arithmetic story problems by examining arithmetic texts and remedial materials (Marshall, 1990), by evaluating instruction designed to encourage elementary school children to develop the five story problem schemas (Marshall et al., 1987), and by developing a computer simulation of how schema knowledge can be used to solve problems (Marshall, 1990). The tasks in SPS were systematically constructed to discriminate between learners who have specific schema knowledge structures and learners who have not developed such structures (Marshall, 1993; Marshall et al., 1989). This approach is fundamentally different from a test developer following a content sampling approach and constructing a certain number of addition, subtraction, multiplication, and division story problems.
Measurement Design
The measurement design defines the object of measurement and the unit of analysis, and describes the procedure or set of procedures to assign a value or
[Figure 1, reproduced from Marshall (1993), presents an example task for assessing feature recognition for the Change schema. The on-screen instructions and story problem read as follows; the Change-schema icon into which the test taker drags phrases, and the OKAY button, are not reproduced here.]

INSTRUCTIONS: Identify the parts of the problem that belong in the diagram. Move the arrow over each part. Click and release the mouse button. Drag the dotted rectangle into the diagram, and click the mouse button again when you have positioned the rectangle correctly in the diagram. If you make a mistake, return to the problem and repeat the process. When you are finished, move the arrow into the OKAY box and click the mouse button.

Harry the computer programmer accidentally erased some of his computer programs while he was hurrying to finish work one Friday afternoon. Much to his dismay, when he returned to work on Monday, he discovered that only 24 programs of his original 92 programs had survived. How many computer programs had been destroyed?

FIGURE 1. Example of task to assess feature recognition for the Change schema (from Marshall, 1993)


category to an object of measurement. In addition, the measurement design must provide ways of addressing the precision of the procedure for assigning a value or category. Test takers make careless mistakes responding to tasks or items, and the measurement design must account for this when expressing the precision associated with the assignment of a value. Perhaps the most widely used measurement procedure is to sum the number of correct responses. Additional examples of measurement procedures include latent class analysis and Bayesian inference.
The purpose of the measurement design is to collect and combine test takers' responses in ways that identify the mechanisms test takers use in responding. The measurement design supports an inference regarding the processes and knowledge structures possessed by a test taker. This sort of inference involves inductive reasoning: reasoning from effects to possible causes. In the case of CDA, inductive reasoning guides reasoning from observations of a given student to inferences about that student's cognitive mechanisms (Mislevy, in press).
For example, the Automated Cognitive Modeler (ACM) system of Langley, Wogulis, and Ohlsson (1990) uses artificial intelligence (AI) and statistical methods to generate a production system model of a student's performance on multicolumn subtraction problems. In this measurement design, the object of measurement is the student and the set of procedures consists of the AI and statistical methods used to assign a production system model to the student. A production system is a set of condition-operation pairs (Anderson, 1983). The condition identifies certain data patterns and the operation executes if the data patterns match elements in working memory. The system begins with a description of a set of subtraction problems, a set of responses from a student, and a set of mental operations the student may have applied to solve the subtraction problems. An example of a set of operations for subtraction is shown in Table 1. For each problem, the system generates a set of solution paths, called path hypotheses, that a student may have followed to respond to a problem.
TABLE 1
Operations for subtraction (from Langley, Wogulis, & Ohlsson, 1990)

Add-Ten (number, row, column): Takes the number in a row and column and replaces it with that number plus ten.
Decrement (number, row, column): Takes the number in a row and column and replaces it with that number minus one.
Find-Difference (number1, number2, column): Takes the two numbers in the same column and writes the difference of the two as the result for that column.
Find-Top (column): Takes a number from the top row of a column and writes that number as the result for that column.
Shift-Column (column): Takes the column which is both focused-on and being processed and shifts both to the column on its left.
Shift-Left (column): Takes the column which is focused-on and shifts the focus of attention to the column on its left.
Shift-Right (column): Takes the column which is focused-on and shifts the focus of attention to the column on its right.


The system produces a production system that identifies inappropriate conditions under which that student may have applied the subtraction operations.
The ACM system uses an AI procedure called heuristic search to generate a production system for a student on a particular subtraction problem. Using heuristic search, the mental operations identified earlier are applied to the initial description of the subtraction problem to produce an intermediate state along the solution path. The operations are applied to each successive intermediate state until the student's response to the problem is generated.
The ACM system uses the χ² statistical procedure to find the production system that most nearly describes the student's responses across the set of subtraction problems. The χ² statistic is used with an AI procedure called learning from solution paths (Sleeman, Langley, & Mitchell, 1982). The system produces positive and negative applications of an operation. A positive application of an operation generates a state on a possible solution path; a negative application of an operation generates a state off a possible solution path. The system adds conditions to operations so that the operation would apply in the most positive applications and the fewest negative applications of the operation. The conditions on each operation allow the production system to follow the solution path and avoid paths that do not lead to the student's response. The ACM system employs a forward approach in selecting the conditions to add to an operator. The system considers individual conditions in turn and selects that condition which, when added to the action to form a production rule, produces the least χ² value. This process continues until enough conditions have been added to the action so that none of the negative applications are produced. Then the system begins a backward approach of dropping each condition from the production rule unless dropping that condition from the production rule significantly increases the χ² value for that rule. The newly constrained production system is applied to the next problem and the system begins searching for possible solution paths.
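A full production-system induction is well beyond a short example, but the following sketch captures the flavor of searching for path hypotheses that reproduce a student's written answer on a two-column subtraction problem. The operator set is drastically simplified and invented for illustration; this is not the ACM code, and the χ² model-selection step over many problems is omitted.

from itertools import product

# Simplified column-level operators, loosely inspired by Table 1. Each maps
# (top digit, bottom digit, borrow_in) to (result digit, borrow_out).
def correct_column(top, bottom, borrow_in):
    top -= borrow_in
    if top >= bottom:
        return top - bottom, 0          # Find-Difference
    return top + 10 - bottom, 1         # Add-Ten / Decrement / Find-Difference

def smaller_from_larger(top, bottom, borrow_in):
    # Buggy rule: ignore borrowing and subtract the smaller digit from the larger.
    return abs(top - bottom), 0

OPERATORS = {"correct": correct_column, "smaller-from-larger": smaller_from_larger}

def apply_strategy(strategy, top_digits, bottom_digits):
    """Apply one operator per column, right to left, returning the answer digits."""
    borrow, answer = 0, []
    for op_name, t, b in zip(strategy, reversed(top_digits), reversed(bottom_digits)):
        digit, borrow = OPERATORS[op_name](t, b, borrow)
        answer.append(digit)
    return list(reversed(answer))

def path_hypotheses(top_digits, bottom_digits, student_answer):
    """Return every per-column operator assignment that reproduces the
    student's written answer (the set of 'path hypotheses')."""
    n = len(top_digits)
    return [s for s in product(OPERATORS, repeat=n)
            if apply_strategy(s, top_digits, bottom_digits) == student_answer]

# 52 - 18: a correct student writes 34; a student with the
# smaller-from-larger bug writes 46.
print(path_hypotheses([5, 2], [1, 8], [3, 4]))
print(path_hypotheses([5, 2], [1, 8], [4, 6]))

On the second problem, more than one hypothesis reproduces the answer 46; across a set of problems, a model-selection step such as the χ² procedure described above is what adjudicates among such competing path hypotheses.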
The validity of the measurement design is supported by substantive research suggesting appropriate procedures for combining observations to assign a value to an object of measurement. For example, the decision model in the ACM system is consistent with the system developers' psychological assumptions regarding students' subtraction problem solving. First, the system generates a set of possible solution paths because the system developers assumed that all problem solving involves search through some problem space (Langley, Wogulis, & Ohlsson, 1990; Ohlsson, 1990). Second, the system produces a production system that identifies inappropriate conditions under which that student may have applied the subtraction operations because the system developers assumed that students' subtraction errors are due to rules with the correct actions but the incorrect conditions. Third, the system uses the χ² statistic to discriminate between competing conditions because the system developers assumed that students sometimes slip when responding, and so obtained response patterns would not frequently match predicted response patterns exactly.
Traditional Approach
The traditional approach to assessment may be analyzed into an observation design and a measurement design. I discuss with reservation the observation and measurement designs of traditional assessments because the traditional approach

confounds these two aspects of test theory. A comparison of the Stanford Diagnostic Mathematics Test (SDMT; Beatty, Madden, Gardner, & Karlsen, 1976) for grades 6 and 7 with VanLehn's (1982) diagnostic test illustrates differences between assessments constructed using the traditional approach and assessments constructed using the CDA approach. Within the computation subtest, the SDMT has a concept/skill domain labeled Subtraction of Whole Numbers. Perhaps the best conceptual framework for the SDMT is provided by Generalizability Theory (GT; Brennan, 1992; Cronbach, Gleser, Nanda, & Rajaratnam, 1972), which I present as an extension of classical test theory. The observation design I associate with the Subtraction of Whole Numbers concept/skill domain describes the set of measurement conditions, or universe of generalization, to which generalization of the test score is intended. The universe is characterized, in part, by facets representing the item objectives. Given that a universe of generalization has been identified, an observation design consists of a random sample from the item objectives. Efforts have been made to address issues of representation. For example, items may be constructed according to specifications that represent the distribution of content categories across major textbooks.
However, efforts to represent content are only vaguely directed at revealing mechanisms test takers use in responding to items or tasks. The objective for each item in the Subtraction of Whole Numbers concept/skill domain is shown in Table 2. As Table 2 shows, the focus of item objectives is on students' behaviors. In contrast, VanLehn's test focuses on students' use of processes underlying the behaviors.
The measurement design I associate with the Subtraction of Whole Numbers concept/skill domain involves estimating a person's universe score. The p × I design in GT (Brennan, 1992) perhaps best describes the measurement design for the Subtraction of Whole Numbers concept/skill domain. A value is assigned to a person by computing the average correct or number correct.
TABLE 2
Item objectives for the Subtraction of Whole Numbers concept/skill domain in the SDMT

The pupil will demonstrate the ability to use the standard algorithm for subtraction (vertical form) with renaming by:

Item 4: finding the unknown addend (remainder) when a number in the tens is subtracted from a number in the hundreds.
Item 5: finding the unknown addend when a number in the hundreds is subtracted from another number in the hundreds.
Item 6: finding the unknown addend when a number in the hundreds is subtracted from a number in the thousands.
Item 7: finding the unknown addend when a number in the hundreds is subtracted from a number in the thousands with a zero in the tens place.
Item 8: finding the unknown addend when a number in the thousands is subtracted from another number in the thousands with zeros in the ones and in the hundreds places.
Item 9: finding the unknown addend when a number in the thousands is subtracted from another number in the thousands with a zero in the tens place.


Interaction between people and items is treated as error under the p × I design. In contrast, VanLehn's (1982) diagnostic test uses that same information (i.e., different response patterns) to diagnose buggy performance.
The measurement designs of CDA and traditional assessments treat differently the information in the matrix defined by persons and items. CDA uses the information from the interaction between persons and items in the matrix: test takers' different response patterns. On VanLehn's (1982) diagnostic test, a test taker who systematically follows the Subtract Smaller From Larger rule would produce a different pattern of responses and a different diagnosis than a test taker systematically following the Stops Borrow at Zero rule. Traditional assessments use the information in the marginals of the matrix. On the SDMT's Subtraction of Whole Numbers subtest, test takers differ in the number correct. These two sources of information from test takers' responses appear not to overlap. Thus, test takers assigned different diagnostic categories may be assigned the same number correct score. This difference between the measurement designs of CDA and traditional assessments makes the two approaches irreconcilable.
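The difference between using the marginals and using the interaction information can be made concrete. In the sketch below, two simulated test takers earn the same number-correct score but produce different response patterns, and only the patterns separate the two bug diagnoses. The predicted patterns are invented for illustration; they are not taken from VanLehn's test.

# Predicted response patterns (1 = correct) for two bugs on six
# hypothetical items; these predictions are invented, not VanLehn's.
PREDICTED = {
    "Subtract Smaller From Larger": [1, 0, 1, 0, 1, 0],
    "Stops Borrow at Zero":         [0, 1, 0, 1, 0, 1],
}

student_a = [1, 0, 1, 0, 1, 0]
student_b = [0, 1, 0, 1, 0, 1]

def number_correct(pattern):
    """The marginal information a traditional number-correct score keeps."""
    return sum(pattern)

def diagnosis(pattern):
    """The interaction information a diagnostic scoring keeps."""
    return [bug for bug, predicted in PREDICTED.items() if predicted == pattern]

for name, pattern in [("A", student_a), ("B", student_b)]:
    print(name, number_correct(pattern), diagnosis(pattern))
# Both students score 3 of 6, yet their patterns point to different bugs.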
Generally, traditional assessments have no explicit substantive model of the test domain. However, a mouse trap is difficult to build without some ideas about mice; an achievement test is difficult to build without some ideas about achievement. According to Mislevy (1993): "Standard test theory evolved as the application of statistical theory with a simple model of ability that suits the decision-making environment of most mass educational systems" (p. 20). As I argued in the review of the societal contexts motivating the development of traditional assessment approaches, the substantive theory appears to be implicit in traditional methods and reflects the assumptions of psychological theories of the early 20th century (Resnick & Resnick, 1992).
In summary, current assessments using traditional approaches, such as GT, are based on an observation design that describes the universe to which generalization of the test score is intended. Generally, the universe of generalization is characterized by descriptions of content. The measurement design of traditional assessments addresses the sources of error in averaging or summing a test taker's performance over items. The focus on average scores is irreconcilable with the focus of the CDA measurement model on patterns of responses. These differences could arguably be attributed to differences in the intended uses of the assessments and the substantive models implicitly or explicitly incorporated into their development. One purpose of the test development approach introduced in the next section is to make explicit the role of psychological assumptions in the test development process.
Methodology for Psychology-Referenced Assessment
Researchers designing CDA appeared to follow a general pattern of test development that constitutes a methodology for developing CDA. In this section, I describe this five-step methodology within which are coordinated the substantive and statistical aspects of test theory (see Table 3). I do not conceive these steps as discrete stages, nor do I conceive the sequence as inviolable. The description of steps is simply a useful device to communicate the activities associated with developing CDA. To illustrate these steps, I recount the development of Gitomer's diagnostic logic gate assessment5 as I have followed it through publications and

TABLE 3
The five steps of test development within psychology-driven test development

STEP 1: SUBSTANTIVE THEORY CONSTRUCTION
The substantive base concerns the development of a model or theory that describes the knowledge and skills hypothesized to be involved in performance and the item or task characteristics hypothesized to interact with the knowledge and skill.

STEP 2: DESIGN SELECTION
In this step, the test developer selects the observation and measurement designs. The selection is informed by the substantive base constructed in Step 1. Subsequently, the test developer constructs items or tasks that will be responded to in predictable ways by test takers with specific knowledge, skills, and other characteristics identified as important in the theory. The procedure for constructing assessments is the operationalization of the assessment design.

STEP 3: TEST ADMINISTRATION
Test administration includes every aspect of the context in which test takers complete the test: the format of the items, the nature of the required response, the technology used to present test materials and record responses, and the environment of the testing session. Decisions concerning the context of the testing session should be informed by research on how aspects of the context influence test takers' performance.

STEP 4: RESPONSE SCORING
The goal of this step is to assign values to test takers' patterns of responses so as to link those patterns to theoretical constructs such as strategies or malrules. As with assessment construction, a scoring procedure is the operationalization of the assessment design.

STEP 5: DESIGN REVISION
Design revision is the process of gathering support for a model or theory. As with any scientific theory, the theory used in test development is never proven; rather, evidence is gradually accumulated that supports or challenges the theory. In this step, the results of administering the assessment are used to revise the substantive base upon which the construction of the assessment was based.

presentations. I use Gitomer's research to illustrate this methodology because his work has matured to produce a viable assessment that includes identifiable observation and measurement designs. In addition, the assessment results from the diagnostic logic gate assessment are linked closely with instructional adaptations. I emphasize that Gitomer has never claimed to have followed these steps.
Substantive Base
The first step in this methodology and the foundation of CDA test development is the construction of a substantive base. A substantive base is constructed from original research and research reviews, but also includes assumptions about how to best represent learning and individual differences. The substantive base is a dynamic element in test development; substantive research by the test developer and by others continues during and after the assessment is developed.

The substantive base is consulted in every stage of test development. As I argued when describing the observation and measurement design, the substantive base provides the rationale for both the observation and the measurement design. Furthermore, the substantive base indicates limits in the generalization from scores. The importance of the substantive base was described well by Messick (1989), who referred to it as construct theory:

Construct theory as a guide to test construction provides a rational basis for selecting task content, for expecting certain consistencies in item responses, and for predicting score relationships. If one starts with well-grounded theory, the whole enterprise is largely deductive in nature, and the approach to construct validation can be well specified and relatively rigorous. (p. 49)
A useful substantive base includes two elements: (a) a description of the cognitive mechanisms a performer in the test domain would use, which may include how the cognitive mechanisms develop and how more competent performers differ from less competent performers; and (b) a specification of task or item characteristics that are hypothesized to influence the cognitive mechanisms used by performers in the domain. Taken together, a description of the performer's cognitive mechanisms and a specification of task or item characteristics constitute the construct representation of an assessment (Embretson, 1983).
The substantive base enables test developers to create a set of items or tasks and infer the processes and knowledge structures used to respond to those tasks. The premise is simple: by recognizing the different processes and knowledge structures that an individual brings to a task or test, the test developer should be able to construct a task or an item that requires those structures. Conversely, the test developer should be able to infer the knowledge and structures used by an individual to respond to an item constructed to reveal such knowledge and structures. The necessary conditions are that the test developer understands the processes and knowledge structures required by performers on each task and that some subset of responses will discriminate individuals who differ on one or more
of those skills (Gitomer & Yamamoto, 1991). This is the process of deductive and inductive reasoning that guides the observation and measurement designs, respectively.
As the work of Gitomer and his colleagues illustrates, the construction of a substantive base to support test development is a laborious task. Gitomer's earlier work in this area focused on defining those skills which characterized competent performance of avionics technicians (Gitomer, 1984). The avionics technician is asked to identify and repair malfunctions in airborne avionics equipment and maintain the troubleshooting equipment. Skills that differentiate more competent from less competent technicians were explored using two approaches. First, skilled performance was characterized through a review of the research on ways experts differ from novices. Next, a series of experiments was conducted to identify differences in processes and knowledge structures for more and less competent avionics technicians. A difference identified in this experimental work was that skilled technicians exhibited greater proficiency in their understanding of digital logic gates than did less skilled technicians. Skill in reading logic gates appeared to enable effective troubleshooting and, thus, was an important area of study.

Subsequently, Gitomer and Van Slyke (1988) examined technicians' understanding of logic gates through a manual error analysis of the responses of avionics technicians on a logic gate test. Technicians were asked to indicate the output value for 288 logic gates that varied in the type of gate (8 types), the number of inputs (1, 2, or 3), and whether or not inputs were negated. The error analysis identified three classes of errors: (a) technicians who made rule-based errors consistently answered incorrectly problems sharing a set of attributes, indicating a misconception in the knowledge needed to solve particular kinds of logic gates; (b) technicians who made weakness area errors had difficulty answering, but did not consistently answer incorrectly, problems sharing a set of attributes, indicating at least an impasse, if not a misconception, in the knowledge needed to solve particular kinds of logic gates; and (c) technicians who made practice area errors made infrequent errors across types of logic gates, indicating that efficiency could be improved. Technicians who made practice errors experienced difficulty when memory demands became great because their knowledge was organized inefficiently. Some technicians showed more than one class of error across logic gates with different sets of shared attributes. The error analysis classified 84 of the 119 avionics technicians in the study, or approximately 71%, as making rule-based and/or weakness area errors. Furthermore, over one third of the sample showed practice area errors. An adaptive instructional system, called GATES, was developed using findings from the error analysis.
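A faceted observation design like the one just described can be written down explicitly. The sketch below enumerates a generic logic gate item bank crossed on gate type, number of inputs, and input negation, and computes the keyed output for each item. The particular gate types and the resulting item count are my own assumptions for illustration; the article does not list the eight gate types, and the actual test contained 288 items.

from itertools import product

# Generic gate types standing in for the (unspecified) eight used in the study.
GATES = {
    "AND":  lambda ins: int(all(ins)),
    "OR":   lambda ins: int(any(ins)),
    "NAND": lambda ins: int(not all(ins)),
    "NOR":  lambda ins: int(not any(ins)),
    "XOR":  lambda ins: int(sum(ins) % 2 == 1),
}

def item_bank():
    """Enumerate one item per (gate type, number of inputs, negation) cell,
    crossed with every possible input vector, and compute the keyed output."""
    items = []
    for name, n_inputs, negated in product(GATES, (1, 2, 3), (False, True)):
        for inputs in product((0, 1), repeat=n_inputs):
            effective = tuple(1 - x for x in inputs) if negated else inputs
            items.append({
                "gate": name, "n_inputs": n_inputs, "negated": negated,
                "inputs": inputs, "key": GATES[name](effective),
            })
    return items

bank = item_bank()
print(len(bank), "items; example:", bank[0])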

Design Selection
The second step in developing CDA is the construction of the observation and measurement designs. As I have argued, the validity of both the observation design and the measurement design is evaluated with respect to the substantive base. As the work of Gitomer (Gitomer, 1987; Gitomer & Van Slyke, 1988) and his colleagues illustrates, the task for the test developer is to construct and organize observations and combine responses in ways that are consistent with the substantive base. The observation and measurement designs used by Gitomer and Van Slyke (1988) in the GATES tutor are summarized in Figure 2. The tutor performs an initial assessment followed, if warranted, by a more detailed diagnosis. The initial assessment was intended to distinguish between technicians who had rule-based misconceptions or weak conceptions of how to solve logic gates, and technicians who understood how to solve logic gates but had inefficiently organized knowledge. The more detailed diagnosis was intended to identify the rule-based misconceptions held by the technicians.

[FIGURE 2. A representation of the observation and measurement design of the GATES tutor (figure not reproduced here).]
The initial assessment consists of a circuit tracing test and a screening test.6 In the circuit tracing test, tutor users trace through a complex arrangement of logic gates and indicate the output for each gate. The observation design demanded that logic gates vary in the type of gate, whether the gate was negated or not negated, and the number of inputs. Furthermore, the observation design demanded that gates be arranged in a complex circuit. The measurement design demanded that overall accuracy be computed on the task. Tutor users who had high overall accuracy on the circuit tracing task exited the tutor, whereas tutor users who answered many logic gates incorrectly attempted a screening test to diagnose the source of their difficulty. In the screening test, tutor users indicated the correct outputs for 48 single gates. The faceted observation design required that gates

vary in the type of gate, whether the gate was negated or not negated, and number of inputs. Furthermore, the observation design required that gates be presented singly. The measurement design demanded that accuracy be computed for each gate type, for negated and nonnegated gates, and for gates differing in number of inputs. Under the measurement design, low accuracy on any set of gates moved the tutor user to a diagnostic module for that set of gates. High accuracy across sets of gates moved the tutor user to a practice module intended to increase the efficiency of the tutor user's access to knowledge of logic gates.
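The measurement design just described amounts to computing accuracy within each facet level and branching on the result. The sketch below is a rough, hypothetical rendering of that branching, not the actual GATES rules; the accuracy threshold and the response data are invented.

from collections import defaultdict

def accuracy_by_facet(responses, facet):
    """Proportion correct within each level of a facet (e.g., gate type)."""
    totals, correct = defaultdict(int), defaultdict(int)
    for r in responses:
        totals[r[facet]] += 1
        correct[r[facet]] += r["correct"]
    return {level: correct[level] / totals[level] for level in totals}

def route(responses, threshold=0.8):
    """Any weak facet level sends the user to a diagnostic module for that
    set of gates; otherwise the user moves on to the practice module."""
    weak = []
    for facet in ("gate", "negated", "n_inputs"):
        for level, acc in accuracy_by_facet(responses, facet).items():
            if acc < threshold:
                weak.append((facet, level))
    return ("diagnose", weak) if weak else ("practice", [])

# Hypothetical screening responses: this user misses most NOR gates, which
# also drags down the overlapping negation and number-of-inputs levels.
responses = (
    [{"gate": "NOR", "negated": False, "n_inputs": 2, "correct": 0}] * 5 +
    [{"gate": "NOR", "negated": False, "n_inputs": 2, "correct": 1}] * 1 +
    [{"gate": "AND", "negated": True,  "n_inputs": 2, "correct": 1}] * 6
)
print(route(responses))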
Both the observation and measurement designs of the initial assessment were motivated by substantive concerns. Substantive research suggested two hypotheses for low accuracy on the first module: users may have had conceptual difficulty with particular kinds of logic gates, or users may have had difficulty accessing logic gate knowledge efficiently. The first task involved tracing a complex circuit and was scored using accuracy because substantive research showed that some less skilled avionics technicians experienced difficulty tracking outputs from logic gates in a complex circuit. These less skilled technicians experienced difficulty in managing memory demands because their logic gate knowledge was organized less efficiently than that of more competent technicians. The second task involved single gates and was scored using accuracy on kinds of gates because substantive research showed that technicians' difficulties interpreting logic gates may be due to misunderstandings, or at least impasses, in the knowledge needed to solve those kinds of gates. Gates were presented singly to reduce the role of efficient access to knowledge in users' performance. Furthermore, the initial assessment consisted of two contrasting tasks, tracing complex circuits versus answering single gates, because substantive research indicated that technicians who have misconceptions or impasses in the knowledge needed to solve particular kinds of logic gates will show a different pattern of accuracy than technicians who have inefficient access to this knowledge. For example, low accuracy on complex circuits but high accuracy on single gates indicates technicians understand how to solve logic gates but need to access this knowledge more efficiently. Alternatively, low accuracy on complex circuits but high accuracy on all but one gate type on single gates indicates technicians have an impasse or misunderstanding in the knowledge needed to solve logic gates.
Tutor users who failed to indicate accurately the correct output for one or more sets of logic gates in the screening test were presented with diagnostic modules for those gates. Each module required tutor users to indicate the correct output for single gates sharing a particular set of attributes. For example, tutor users may have been presented with all gates of one type or all negated gates. The observation design required that all gates be presented singly and that gates within a diagnostic module share the same attributes: all of one type, all negated or nonnegated, or all with the same number of inputs. The measurement design required that latent class analysis be used to assign technicians to qualitatively different classes corresponding to misconceptions regarding logic gates. This was done by matching response vectors with misconceptions. A latent class approach was used because actual response vectors rarely match an ideal response vector exactly. Matches between actual responses and predicted responses increase support for a particular classification, whereas mismatches reduce support for a particular classification.
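The logic of accumulating support from matches while tolerating occasional slips can be illustrated with a simple, naive-Bayes-flavored stand-in for the latent class model. This is not the model actually used in GATES; the classes, ideal vectors, slip rate, and priors below are all invented for illustration.

def class_likelihood(observed, ideal, slip=0.1):
    """Likelihood of an observed 0/1 vector given a class's ideal vector,
    allowing each response to 'slip' with a fixed probability."""
    p = 1.0
    for o, i in zip(observed, ideal):
        p *= (1 - slip) if o == i else slip
    return p

def classify(observed, ideal_vectors, priors=None):
    """Posterior support for each class given the observed responses."""
    priors = priors or {c: 1 / len(ideal_vectors) for c in ideal_vectors}
    joint = {c: priors[c] * class_likelihood(observed, v)
             for c, v in ideal_vectors.items()}
    total = sum(joint.values())
    return {c: round(j / total, 3) for c, j in joint.items()}

# Hypothetical ideal response vectors for two misconception classes and a
# no-misconception class on six single-gate items.
IDEAL = {
    "confuses NAND with AND": [1, 1, 0, 0, 1, 1],
    "treats negation as identity": [0, 1, 1, 0, 0, 1],
    "no misconception": [1, 1, 1, 1, 1, 1],
}
observed = [1, 1, 0, 0, 1, 0]   # one slip relative to the NAND/AND class
print(classify(observed, IDEAL))
# The NAND/AND misconception class receives nearly all of the support.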

As with earlier assessment components of the GATES tutor, the observation and measurement designs of the diagnostic modules were motivated by substantive concerns. The error analysis of Gitomer and Van Slyke (1988) indicated that three quarters of technicians may make rule-based errors. Thus, each diagnostic module consisted of logic gates that shared a set of attributes because the error analysis indicated technicians often held misconceptions in the knowledge needed to solve particular kinds of logic gates. A latent class analysis was used because the error analysis indicated that technicians' misconceptions resulted in systematic patterns of response. Furthermore, the two-stage assessment in which tutor users first completed the screening test and then completed, if indicated, diagnostic modules distinguished technicians who held systematic misconceptions from technicians who had an impasse in the knowledge needed to solve particular
kinds of logic gates. For example, differential levels of accuracy for different gates but no consistent, rule-based errors indicate that technicians had an impasse but not a misunderstanding of the knowledge needed to solve that kind of logic gate. Thus, the construction of the assessment components within the GATES tutor is consistent with Gitomer's substantive research on technicians' logic gate understanding.
Test Administration
In the third step of this methodology, test administration, the test developer must consider aspects of administering the test that may influence test takers' performance. The substantive base can inform decisions regarding aspects of test administration such as item or task format, nature of the response, and the context of assessment. The work of Gitomer and his colleagues provides no clear example of substantive considerations in test administration. However, examples from the subtraction domain illustrate aspects of test administration that influence test takers' performance.
For example, substantive research would have much to say about the format, especially the wording, of subtraction word problems intended for children. Generally, a child's poor performance on such problems may indicate a lack of understanding of part-whole relations. However, some researchers argue that, by the age of 4 or 5, children possess at least a tacit understanding of part-whole relations and that what they learn from instruction or through familiarization with problem solving language is how certain verbal formats map onto those relations (De Corte, Verschaffel, & De Win, 1985; Cummins, 1991). Specifically, children at this age interpret comparative terms as simple possession statements. For example, "Mary has 5 more marbles than John" is interpreted as "Mary has 5 marbles." In addition, the word "altogether" is interpreted as "each." For example, "Mary and John have 5 altogether" is interpreted as "Mary has 5 marbles and John has 5 marbles." As these examples illustrate, test developers would do well to consult substantive research when considering the format of subtraction word problems.
Response Scoring
As the description in Table 3 indicates, response scoring is the implementation of the test theory. Practical questions regarding how to manage scoring must be answered, and these may be challenging. For example, software must be developed

to score and compute individual scores and item and test statistics. I will leave such questions for another occasion.
Instead, I would like to discuss indicators that may be used to evaluate the implementation of the test theory. Generally, implementation of traditional approaches is evaluated using item statistics indicating difficulty and discrimination (Millman & Greene, 1989) and test statistics indicating reliability (Feldt & Brennan, 1989). For the purposes of informing instructional decisions, traditional indicators of item and test functioning may not be of much help. Again, the work of Gitomer provides an example. Gitomer and Yamamoto (1989) report the item p values and biserials for 119 technicians who completed a 20-item diagnostic logic gate assessment. As Gitomer and Yamamoto note, difficulty values were moderate and biserials suggested that doing well on one item bodes well for overall test performance. But these item statistics were developed for assessments intended to select students most likely to succeed in a uniform instructional environment rather than assessments intended to make inferences about cognitive structures and processes. What meaning have these statistics for a diagnostic test?
CDA-based assessments require indicators that may be used to evaluate the implementation of test theory that is intended to identify qualitative differences in individuals' processes and knowledge structures. Researchers have offered several alternatives to traditional CTT indicators of test and item functioning. At the test level, Brown and Burton (1978) propose test diagnosticity as an index to evaluate their diagnostic assessment of students' subtraction misconceptions. Under Brown and Burton's approach, any test taker holding a particular misconception or combination of misconceptions will produce a particular response vector for a given set of test items. The response vectors may be partitioned so that identical response vectors, corresponding to different misconceptions, are placed in the same partition. A perfect diagnostic test would have one response vector in each partition. A less-than-perfect diagnostic test would have at least one partition with more than one response vector.
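The partitioning Brown and Burton describe is easy to carry out once each misconception's predicted response vector is available. The sketch below is illustrative only: the misconception labels and predicted vectors are hypothetical, and the summary index printed at the end (the proportion of misconceptions whose response vectors are shared with no other misconception) is one plausible way to summarize the partition, not necessarily Brown and Burton's own diagnosticity formula.

```python
from collections import defaultdict

# Hypothetical predicted response vectors (1 = correct, 0 = incorrect) for
# three subtraction misconceptions on a four-item test. The bug names and
# vectors are illustrative, not taken from Brown and Burton (1978).
predicted_vectors = {
    "borrow_from_zero":      (1, 0, 0, 1),
    "smaller_from_larger":   (0, 0, 1, 1),
    "zero_minus_n_equals_n": (1, 0, 0, 1),  # identical to borrow_from_zero
}

# Partition the misconceptions by identical response vectors; a partition
# holding more than one misconception means the test cannot tell them apart.
partitions = defaultdict(list)
for bug, vector in predicted_vectors.items():
    partitions[vector].append(bug)

for vector, bugs in partitions.items():
    print(vector, bugs)

# One plausible summary index: the proportion of misconceptions producing a
# response vector shared with no other misconception (a perfect test = 1.0).
unique = sum(len(bugs) for bugs in partitions.values() if len(bugs) == 1)
print("diagnosticity:", unique / len(predicted_vectors))
```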
At the item level, Bart (Bart, 1991; Bart & Williams-Morris, 1990) has proposed two indices. Both indices are computed using an item response-by-rule matrix. For any item, each possible response may correspond to the test taker's use of at least one rule or strategy. An item on which the test taker is asked to respond with true or false will have two possible responses, and each response may correspond to the use of one or more rules. Alternatively, a multiple-choice item with four alternatives, such as the two items represented in Table 4, will have four possible responses, and each response may correspond to one or more rules. I have represented only four rules in Table 4. Note that an item response corresponding to a rule is represented by a one in that cell of the table.
The index of response interpretability captures the degree to which each response to the item is interpretable by at least one rule. The computation of response interpretability is straightforward given an item response-by-rule matrix. The index is the number of responses that are interpreted by one or more rules divided by the number of responses. Values range from 0, which indicates no rule-based responses, to 1, which indicates complete rule-based responses. In Table 4, the response interpretability of Item 1 is 1 whereas the response interpretability of Item 2 is 0.5.
TABLE 4
The item response-by-rule matrix for two multiple-choice items

                   Rule
Response      1     2     3     4
Item 1
    1         1     0     0     0
    2         0     1     0     0
    3         0     0     1     0
    4         0     0     0     1
Item 2
    1         0     0     1     1
    2         0     0     0     0
    3         0     0     0     0
    4         1     1     0     0

The index of response discrimination captures the degree to which each response to an item is interpreted by only one rule. Again, the computation of response discrimination is readily understood given an item response-by-rule matrix. Response discrimination is computed in two steps. In the first step, 1 is divided by the number of rules that may be used to interpret a response (1 divided by 0 is defined as 0). For example, the first response to Item 1 is interpreted by only one rule. In the second step, the sum of the values from the previous step is divided by the number of responses to the item. Values range from 0, which indicates no rule-based responses, to 1, which indicates each response is interpreted by only one rule. In Table 4, the response discrimination for Item 1 is 1 whereas the response discrimination for Item 2 is 0.25.
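Both of Bart's indices operate directly on the item response-by-rule matrix, so they are simple to compute. The sketch below implements the two computations as described above and reproduces the Table 4 values; it is an illustration of the formulas, not code drawn from Bart's work.

```python
def response_interpretability(matrix):
    """Proportion of responses interpreted by at least one rule."""
    interpreted = sum(1 for row in matrix if any(row))
    return interpreted / len(matrix)

def response_discrimination(matrix):
    """Mean over responses of 1 / (number of interpreting rules), with 1/0 defined as 0."""
    values = [1 / sum(row) if sum(row) > 0 else 0 for row in matrix]
    return sum(values) / len(matrix)

# Rows are responses and columns are rules, as in Table 4.
item1 = [[1, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 1]]
item2 = [[0, 0, 1, 1],
         [0, 0, 0, 0],
         [0, 0, 0, 0],
         [1, 1, 0, 0]]

print(response_interpretability(item1), response_discrimination(item1))  # 1.0 1.0
print(response_interpretability(item2), response_discrimination(item2))  # 0.5 0.25
```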
The CDA-based measures of item and test functioning differ from CTT-based measures in at least two ways: (a) none of the CDA-based indices incorporate the notion of consistently linearly ranking individuals, whereas CTT-based discrimination and reliability indices appear to be founded on this notion; and (b) all of the CDA-based indices may be computed a priori, whereas all the CTT-based indices require that the items be administered. These differences reflect differences in emphasis; CDA development focuses on diagnosing qualitative differences in test takers' processes and knowledge structures, whereas CTT development focuses on selecting individuals most likely to succeed in a particular educational environment. Furthermore, CDA development emphasizes the test developers' conception of the construct whereas CTT development emphasizes the practical success of the items.

Design Review
Design review is the process of gathering support for the observation and measurement designs used in test development. The design is like a theory, and, as with any scientific theory, the theory is never proven; rather, evidence is gradually accumulated that supports or challenges the design. Design review is a process that continues before and after test administration. Initially, evidence
supporting the design comes from the strength of the research base. Such evidence is similar to Ebel's notion of intrinsic rational validity (Ebel, 1965).
After the test has been constructed and administered, other sources of evidence may be gathered. One source of evidence regarding the validity of the design is the fit between the predictions of the observation design and test takers' performance. The persuasiveness of this sort of evidence lies in the test developer's success in modeling how test takers may have solved items or tasks. As Messick (1989) explains:
    Almost any kind of information about a test can contribute to its construct validity... Possibly most illuminating of all are direct probes and modeling of the processes underlying test responses, an approach becoming more accessible and more powerful with continuing development in cognitive psychology. (p. 17)
Design revision may be suggested by comparing the results of test administration to the predicted results based on the substantive research used in test development. Anomalous results suggest revisions of the test design and areas of further research. Furthermore, design revision may be suggested by additional substantive research outside of the assessment context. In this way, test development becomes part of basic aptitude and achievement research, and "tests can become vehicles of communication between laboratory and field" (Snow & Peterson, 1985, p. 155).
For example, Gitomer revised the GATES tutor because the latent class measurement design did not assign a value to technicians who made weakness-area errors, that is, technicians who had difficulty answering, but did not consistently answer incorrectly, problems sharing a set of attributes. In response to this failure, Gitomer and Yamamoto (1991) applied Yamamoto's (1989) HYBRID model to assess technicians' understanding of logic gates. Using the HYBRID model, "Individuals whose responses are not consistent with one of the LCM [latent class model] classes may be modeled more conservatively by a continuous model that makes no strong assumptions about qualitative understanding but simply quantifies their overall level of proficiency" (p. 175).
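The strategy described in the quotation, classify a respondent when the responses fit one of the latent classes and fall back to a continuous summary of proficiency when they do not, can be illustrated schematically. The sketch below is not Yamamoto's HYBRID model: the fit check is a simple likelihood threshold and the continuous fallback is just the proportion correct, both introduced here only to show the idea.

```python
# Schematic illustration (not Yamamoto's HYBRID model): report a latent class
# when the response vector fits one acceptably, otherwise fall back to a
# continuous proficiency summary (here, simply the proportion correct).

def class_likelihood(observed, predicted, slip=0.1):
    # Same likelihood rule as in the earlier classification sketch.
    likelihood = 1.0
    for obs, pred in zip(observed, predicted):
        likelihood *= (1 - slip) if obs == pred else slip
    return likelihood

def hybrid_report(observed, class_predictions, fit_threshold=0.05):
    best_class = max(class_predictions,
                     key=lambda name: class_likelihood(observed, class_predictions[name]))
    best_fit = class_likelihood(observed, class_predictions[best_class])
    if best_fit >= fit_threshold:
        return {"model": "latent class", "class": best_class}
    # Responses fit no class well; report only an overall proficiency level.
    return {"model": "continuous", "proficiency": sum(observed) / len(observed)}

predictions = {"misconception_A": [1, 0, 1, 0, 1],
               "misconception_B": [0, 0, 1, 1, 1]}
print(hybrid_report([1, 0, 1, 0, 1], predictions))  # consistent with class A
print(hybrid_report([1, 1, 0, 1, 0], predictions))  # fits neither; continuous fallback
```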
Conclusions on a New Assessment Approach
Cognitive scientists have attempted to better understand differences between more and less accomplished performers in many domains. Observers from both cognitive science and educational measurement have concluded that test developers can take advantage of progress in cognitive science. They have concluded that test developers can design assessments through systematic, research-based variations in problem characteristics. Conversely, test developers can infer the processes and knowledge structures used by an individual through systematic, research-based patterns of problem responses. Gradually, the distinction between psychology and psychometrics has become obscured. But many issues must be addressed before CDA can be considered as viable an assessment methodology as the time-tested CTT, or even the relatively new IRT. I address several of those issues in this final section.
What Is the Role of Substantive Findings in Applications?
Substantive theory and research play a prominent role in CDA. The substantive base is the foundation of CDA. Given the prominence of substantive theory and
research in CDA, the assumptions concerning its role should be examined closely. An assumption underlying CDA is that substantive theory and research may be applied usefully to construct assessments. Critics may object that theory and research from the laboratory have little to say about tasks and variables in real-world domains that are the focus of instruction. This argument appears to misrepresent cognitive science as an enterprise situated solely in the laboratory. As the work in everyday memory (Loftus, 1991) and situated cognition (Lave, 1988) illustrates, cognitive scientists devote much time to understanding how people represent and process information in real-world domains. The work of VanLehn, Tatsuoka, and Gitomer reviewed in this article deals with individual differences in real-world domains.
Another assumption underlying CDA is that assessment results may be used to test theory and to inform research. Critics may raise several objections against this assumption. For instance, they may object that experiment is the only valid way to test theory. True, experiment provides powerful tests of theory, but not the only tests (Bickhard, 1992). Experimental tests in many fields, including geology, evolutionary biology, and astronomy, are generally not possible.
Critics may also object that alternative theories may account for the assessment results and that successful CDA applications therefore have no implications for the specific substantive base used to generate the assessment. In the social sciences, I doubt there is any single data set that cannot be explained in multiple ways in a post hoc fashion. Note that the substantive base in CDA is used to generate assessments and predict results. Any number of explanations are plausible when theorizing is post hoc, but fewer theories are successful in predicting results.

Who Makes These New Assessments?
Under the CDA approach, what kind of expertise must the test developer possess? The emphasis in CDA given to multiple disciplines appears to demand an encyclopedist. However, few individuals are able to master the skills required to be expert in many areas. A solution may be the use of a team to develop CDA (Tittle, 1991). The team would include cognitive scientists to address psychological issues, statisticians to address statistical issues, psychometricians to address measurement issues, and subject matter experts to ensure substantive accuracy. The role of the team leader would be to coordinate these different aspects of test development, and so the team leader must be familiar with the issues in each area. The position of team leader may be best filled by an educational psychologist or other professional whose training includes educational, psychological, and measurement concerns.
In this article, I have outlined several points on which CDA differs from traditional measurement approaches. However, CDA and traditional measurement are similar in the sense that a quality assessment depends on a careful, creative, and conscientious test development staff. The development of a CDA is not straightforward, and such a conclusion should not be drawn from the step-by-step description of test development. Even an assessment generated using an algorithm has an element of art. The success of a CDA depends in large part on the competence of the test development staff.
How Much Support Is There for the Interpretation and Use of CDA?

I will conclude this article with a short, and incomplete, discussion of validation for CDA. CDA has been used only on a small scale, so summative evaluation is premature. Even on a small scale, the number of CDA projects is large and increasing, and I lack space to review all the data from the many projects that might provide evidence bearing on CDA interpretation and use. Reviews of CDA projects are provided in Frederiksen, Glaser, Lesgold, and Shafto (1990); Frederiksen, Mislevy, and Bejar (1993); and Nichols, Chipman, and Brennan (in press). I will only sample the data to illustrate the evidence that supports CDA interpretation and use. I will use this opportunity to make the most reasonable case, based on current evidence, to guide use of CDA and to propose a program of research, focusing on the consequential basis for test validity, to further our understanding of CDA validation.
Much of the available data is offered to support interpretations of CDAs. For example, Marshall (1993) reports that scores on the schema-based assessment module in the SPS instructional system (described earlier in the observation design section) were related to posttest scores and to interview responses. In a small study designed to test the efficacy of schema-based instruction, 14 students completed the SPS, a 10-point, paper-and-pencil posttest, and a series of follow-up interviews. A multiple-regression analysis showed that a group of four schema-based measures accounted for 73% of the variance in the paper-and-pencil posttest scores. Furthermore, scores from the schema-based measures corresponded to representations of students' knowledge constructed from the follow-up interviews. As another example, Goldsmith, Johnson, and Acton (1991) showed that representations of students' cognitive structures, modeled using the Pathfinder scaling algorithm (Schvaneveldt, 1990), were related to classroom performance. In a psychology course, the similarity of the students' representations to the instructor's representation correlated .74 with earned course points. In the mathematics domain, Johnson, Goldsmith, and Teague (in press) report that similarity to the instructor's representation correlated .58 with ACT math scores.
Little of the available data is offered to support use of CDAs. A weak test of validity was provided by Anderson, Conrad, and Corbett (1989), who showed that students performed better on a posttest when a form of CDA, knowledge tracing, was part of computer-delivered instruction. However, the emphasis on helping individuals succeed in educational opportunities may prove to be the Achilles' heel of CDA. Some proponents of CDA appear to assume that these new assessments will necessarily improve instruction and learning. However, assessment will remain detached from instruction unless assessment and instruction are based on common understandings of learning and achievement. As Nicolson (1990) argues, current work in CDA focusing on diagnosing students' misconceptions or bugs does not fit into the pedagogy of the classroom teacher. The problem is a technocratic model of the relationship between research, development, and innovation. Educational innovation must take into account the experiences, practices, and value systems of the classroom (Ruthven, 1985); teachers do not routinely diagnose misconceptions and they do not teach toward misconceptions. When teachers notice a student having difficulty, they point out the error, reteach the specific procedure or idea, and send the child to work on more
problems (Sleeman, Kelly, Martinak, Ward, & Moore, 1989). In contrast, the assessment in Marshall's SPS tutor or Gitomer's GATES tutor is linked closely with instruction because the instruction and assessment all derive from one common substantive model: schema theory in SPS and a model of skilled troubleshooting in GATES.
Finally, future evaluation of CDA interpretation and use must acknowledge the values and rationales that support the interpretation of evidence (Messick, 1989). However, as Messick (1989) observed: "Exposing the value assumptions of a construct theory and its more subtle links to ideology, possibly to multiple, crosscutting ideologies, is an awesome challenge" (p. 62). A few of the sometimes conflicting values and rationales supporting CDA and traditional assessments have been described in the section "Societal Demand for New Assessment Techniques" and may be summarized as follows: (a) traditional tests were developed for selecting students most likely to succeed in a particular educational program whereas CDA tests were developed for tailoring educational programs to students' needs; (b) traditional tests are aimed at estimating a student's location, typically interpreted as an amount, on a latent variable whereas CDA tests are aimed at inferring a student's psychological structures and processes that underlie performance; and (c) traditional test scores are tied to sampling models of domain content whereas CDA test scores are tied to psychological models of domain performance.
The exposé of values and rationales is especially important because not all values motivating the developers of CDA are shared by developers of traditional assessment. The questions asked and interpretations made by developers of traditional assessments may not fit the rationales supporting CDA and may foreclose on their development. For example, much traditional evidence of reliability and validity requires variance in test scores. In contrast, diagnostic test scores could show no variance and still be adequate and appropriate measures of the construct, as in the case of a successful mastery learning program (Chipman, Nichols, & Brennan, in press). As Moss (1994) notes: "Current conceptions of reliability and validity in educational measurement constrain the kinds of assessment practices likely to find favor, and these in turn constrain educational opportunities for teachers and students" (p. 10).

Notes
1. I chose cognitively diagnostic assessment to avoid confusing diagnostic assessments combining cognitive science and psychometrics with traditional diagnostic assessments. There are currently a large number of diagnostic tests, including the Nelson Denny Reading Test, the Stanford Diagnostic Mathematics Test, the Stanford Diagnostic Reading Test, and the Instructional Tests of the Metropolitan Achievement Tests. These tests differ from CDA in three aspects: (a) the design is based on logical taxonomies and content specifications and lacks explicit psychological models of the structures and processes that underlie domain performance; (b) the scores are tied to content areas rather than cognitive mechanisms; and (c) the scores are often computed using methods developed to select students rather than methods developed to make inferences about cognitive structures and processes. Other terms were considered and rejected. Theory-referenced construction has been used but was rejected because psychometric theories have been used in the past and this could be called theory-referenced construction.
2. I classify different IRT models as traditional or cognitively diagnostic based on the approach to learning that the model assumes. I include as traditional those IRT models that treat learner differences as differences on one or a handful of continuous traits. For such models, learning is assumed to be incremental changes in level of the trait continuum. I include as cognitively diagnostic extensions or modifications of IRT by Tatsuoka (1990), Embretson (1984), and others that assume learning is changes in cognitive processes and structures.
3. IRT ability estimates and CTT number-correct scores typically correlate .95 (Mislevy, in press).
4. Instruction in the Story Problem Solver is presented on the computer and uses icons and diagrams. Test takers become familiar with the computer environment and the icons during the instruction that takes place before assessment.
5. Technicians must be able to read and interpret logic gate symbols, common components of schematic diagrams, in troubleshooting electronics equipment.
6. The global assessment included a circuit troubleshooting task that was not scored and so is not discussed.
References
American Association for the Advancement of Science. (1989). Science for all Ameri-
cans. Washington, DC: Author.
Anastasi, A. (1967). Psychology, psychologists, and psychological testing. American
Psychologist, 22, 297-306.
Anderson, J. R. (1983). The architecture of cognition. Cambridge, MA: Harvard
University Press.
Anderson, J. R., Conrad, F., & Corbett, A. T. (1989). Skill acquisition and the LISP
tutor. Cognitive Science, 13, 467-505.
Bart, W. M. (1991, April). A refined item digraph analysis of Siegler's balance beam
tasks. Paperpresented at the Annual Meeting of the American Educational Research
Association, Chicago.
Bart, W. M., & Williams-Morris, R. (1990). A refined item digraph analysis of a
proportional reasoning test. Applied Measurement in Education, 3, 143-165.
Bartlett, F. C. (1932). Remembering: A study in experimental and social psychology.
London: Cambridge University Press.
Beatty, L. S., Madden, R., Gardner, E. F., & Karlsen, B. (1976). Stanford Diagnostic
Mathematics Test: Manual for administering and interpreting. New York: Harcourt
Brace Jovanovich.
Bejar, I. I. (1984). Educational diagnostic assessment. Journal of Educational Mea-
surement, 21, 175-189.
Bickhard, M. H. (1992). Myths of science: Misconceptions of science in contemporary
psychology. Theory & Psychology, 2, 321-337.
Brennan, R. L. (1992). Elements of generalizability theory (2nd ed.). Iowa City, IA:
American College Testing.
Brown, J. S., & Burton, R. R. (1978). Diagnostic models for proceduralbugs in basic
mathematical skills. Cognitive Science, 2, 155-192.
Carnegie Council on Adolescent Development (1989). Turning points: Preparing
American youth for the 21st century. Washington, DC: The Carnegie Corporation
of New York.
Chipman, S. F., Nichols, P., & Brennan, R. L. (in press). Introduction. In P. Nichols,
S. F. Chipman, & R. L. Brennan (Eds.), Cognitively Diagnostic Assessment. Hills-
dale, NJ: Erlbaum.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability
of behavioral measurements: Theory of generalizability for scores and profiles.
New York: Wiley.

Cummins, D. D. (1991). Children's interpretations of arithmetic word problems.


Cognition and Instruction, 8, 261-289.
Dawes, R. M., & Corrigan, B. (1974). Linear models in decision making. Psychologi-
cal Bulletin, 81, 95-106.
De Corte, E., Verschaffel, L., & De Win, L. (1985). The influence of rewording
verbal problems on children's problem representation and solutions. Journal of
Educational Psychology, 77, 460-470.
DuBois, D., & Shalin, V. L. (in press). Adapting cognitive methods to real world
objectives: An application to job knowledge testing. In P. Nichols, S. F. Chipman, &
R. L. Brennan (Eds.), Cognitively Diagnostic Assessment. Hillsdale, NJ: Erlbaum.
Ebel, R. L. (1965). Essentials of educational measurement (1st ed.). Englewood
Cliffs, NJ: Prentice-Hall.
Embretson, S. E. (1983). Construct validity: Construct representation versus nomo-
thetic span. Psychological Bulletin, 93, 179-197.
Embretson, S. E. (1984). A general latent trait model for response processes. Psycho-
metrika, 49, 175-186.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational
Measurement (3rd ed., pp. 105-146). New York: Macmillan.
Frederiksen, N., Glaser, R., Lesgold, A., & Shafto, M. G. (Eds.). (1990). Diagnostic
monitoring of skill and knowledge acquisition. Hillsdale, NJ: Erlbaum.
Frederiksen, N., Mislevy, R. J., & Bejar, I. (Eds.). (1993). Test theory for a new
generation of tests. Hillsdale, NJ: Erlbaum.
Gitomer, D. H. (1984). A cognitive analysis of a complex troubleshooting task.
Unpublished doctoral dissertation, University of Pittsburgh.
Gitomer, D. H. (1987, October). Using error analysis to develop diagnostic instruction.
Paperpresented at the meeting of the Military Testing Association, Ottawa, Ontario.
Gitomer, D. H., & Van Slyke, D. A. (1988). Error analysis and tutor design. Interna-
tional Journal of Machine Mediated Learning, 2, 333-350.
Gitomer, D. H., & Yamamoto, K. (1989, March). Using embedded cognitive task
analysis in assessment. Paper presented at the Annual Meeting of the American
Educational Research Association, San Francisco.
Gitomer, D. H., & Yamamoto, K. (1991). Performancemodeling that integrates latent
trait and class theory. Journal of Educational Measurement, 28, 173-189.
Glaser, R. (1981). The future of testing: A research agenda for cognitive psychology
and psychometrics. American Psychologist, 36, 923-936.
Glaser, R. (1988). Cognitive science and education. International Social Science
Journal, 115, 21-44.
Glass, G. V. (1986). Testing old, testing new: Schoolboy psychology and the allocation
of intellectual resources. In B. S. Plake & J. C. Witt (Eds.), The future of testing
(pp. 9-28). Hillsdale, NJ: Erlbaum.
Goldsmith, T. E., Johnson, P. J., & Acton, W. H. (1991). Assessing structuralknowl-
edge. Journal of Educational Psychology, 83, 88-96.
Haertel, E., & Calfee, R. (1983). School achievement: Thinking about what to test.
Journal of Educational Measurement, 20, 119-132.
Ippel, M. J. (1986). Component-testing:A theory of cognitive aptitude measurement.
Amsterdam, The Netherlands: Free University Press.
Ippel, M. J. (1991). An information-processing approach to item equivalence. In P.
L. Dann, S. H. Irvine, & J. H. Collis (Eds.), Advances in computer-based human
assessment (pp. 377-396). Boston: Kluwer Academic Publishers.
Jeffrey, R. C. (1983). The logic of decision. Chicago: University of Chicago Press.
Johnson, P. J., Goldsmith, T. E., & Teague, K. W. (in press). Similarity, structure,
and knowledge: A representational approach to assessment. In P. Nichols, S. F.

Chipman, & R. L. Brennan (Eds.), Cognitively Diagnostic Assessment. Hillsdale,
NJ: Erlbaum.
Lachman, R., Lachman, J. L., & Butterfield, E. C. (1979). Cognitive psychology and
information processing: An introduction. Hillsdale, NJ: Erlbaum.
Langley, P., Wogulis, J., & Ohlsson, S. (1990). Rules and principles in cognitive
diagnosis. In N. Frederiksen, R. Glaser, A. Lesgold, & M. G. Shafto (Eds.),
Diagnostic monitoring of skill and knowledge acquisition (pp. 217-250). Hillsdale,
NJ: Erlbaum.
Lave, J. (1988). Cognition in practice. Boston: Cambridge University Press.
Linn, R. L. (1986). Educational testing and assessment: Research needs and policy
issues. American Psychologist, 41, 1153-1160.
Loftus, E. F. (1991). The glitter of everyday memory ... and the gold. American
Psychologist, 46, 16-18.
Lohman, D. F., & Ippel, M. J. (1993). Cognitive diagnosis: From statistically-based
assessment toward theory-based assessment. In N. Frederiksen, R. J. Mislevy, & I.
Bejar (Eds.), Test theory for a new generation of tests (pp. 41-71). Hillsdale,
NJ: Erlbaum.
Marshall, S. P. (1990, April). What students learn (and remember) from word problem
instruction. In S. F. Chipman (Chair), Penetrating to the mathematical structure
of word problems. Symposium conducted at the Annual Meeting of the American
Educational Research Association, Boston.
Marshall, S. P. (1993). Assessing schema knowledge. In N. Frederiksen, R. J. Mis-
levy, & I. Bejar (Eds.), Test theory for a new generation of tests (pp. 155-180).
Hillsdale, NJ: Erlbaum.
Marshall, S. P., Barthuli, K. E., Brewer, M. A., & Rose, F. E. (1989). STORY
PROBLEM SOLVER:A schema-based system of instruction (CRMSE Tech. Rep.
No. 89-01). San Diego: Center for Research in Mathematics and Science Education.
Marshall, S. P., Pribe, C. A., & Smith, J. D. (1987). Schema knowledge structures
for representingand understandingarithmetic story problems (Tech. Rep. Contract
No. N00014-85-K-0661). Arlington, VA: Office of Naval Research.
Messick, S. (1984). The psychology of educational measurement. Journal of Educa-
tional Measurement, 21, 215-237.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd ed.,
pp. 13-104). New York: Macmillan.
Millman, J., & Greene, J. (1989). The specification and development of tests of
achievement and ability. In R. L. Linn (Ed.), Educational Measurement (3rd ed.,
pp. 335-366). New York: Macmillan.
Mislevy, R. J. (1993). Foundations of a new test theory. In N. Frederiksen, R. J.
Mislevy, & I. Bejar (Eds.), Test theory for a new generation of tests (pp. 19-40).
Hillsdale, NJ: Erlbaum.
Mislevy, R. J. (in press). Probability-based inference in cognitive diagnosis. In P.
Nichols, S. F. Chipman,& R. L. Brennan (Eds.), Cognitively Diagnostic Assessment.
Hillsdale, NJ: Erlbaum.
Moss, P. A. (1994). Can there be validity without reliability? Educational Researcher,
23(2), 5-12.
National Education Goals Panel. (1991). Measuring progress towards the national
education goals: Potential indicators and strategies. Washington, DC: Author.
National Governors' Association. (1990). Educating America: State strategies for
achieving the national education goals. Washington, DC: Author.
Nichols, P., Chipman, S. F., & Brennan, R. L. (Eds.). (in press). Cognitively Diagnostic
Assessment. Hillsdale, NJ: Erlbaum.

Nicolson, R. I. (1990). Design and evaluation of the SUMIT intelligent teaching
assistant for arithmetic. Interactive Learning Environments, 1, 265-287.
Ohlsson, S. (1990). Trace analysis and spatial reasoning: An example of intensive
cognitive diagnosis and its implications for testing. In N. Frederiksen, R. Glaser,
A. Lesgold, & M. G. Shafto (Eds.), Diagnostic monitoring of skill and knowledge
acquisition (pp. 251-296). Hillsdale, NJ: Erlbaum.
Pellegrino, J. W. (1992). Commentary:Understandingwhat we measure and measur-
ing what we understand. In B. R. Gifford & M. C. O'Connor (Eds.), Changing
assessments: Alternative views of aptitude, achievement, and instruction (pp. 275-
300). Boston, MA: Kluwer Academic Publishers.
Resnick, L. B., & Resnick, D. P. (1992). Assessing the thinking curriculum: New
tools for educational reform. In B. R. Gifford & M. C. O'Connor (Eds.), Changing
assessments: Alternative views of aptitude, achievement, and instruction (pp. 37-
75). Boston, MA: Kluwer Academic Publishers.
Ruthven, K. (1985). The AI dimension? In D. J. Smith (Ed.), Information technology
and education: Signposts and research directions (pp. 21-30). London: Economic
and Social Research Council.
Schvaneveldt, R. W. (Ed.). (1990). Pathfinder associative networks: Studies in knowl-
edge organization. Norwood, NJ: Ablex Publishing Co.
Shepard, L. A. (1991). Psychometricians' beliefs about learning. Educational
Researcher, 20, 2-9.
Sleeman, D., Kelly, A. E., Martinak, R., Ward, R. D., & Moore, J. L. (1989). Studies
of diagnosis and remediation with high school algebra students. Cognitive Science,
13, 551-568.
Sleeman, D. H., Langley, P., & Mitchell, T. M. (1982, Spring). Learning from solution
paths: An approach to the credit assignment problem. AI Magazine, 48-52.
Snow, R. E., & Lohman, D. F. (1989). Implications of cognitive psychology for
educational measurement. In R. L. Linn (Ed.), Educational Measurement (3rd ed.,
pp. 263-332). New York: Macmillan.
Snow, R. E., & Mandinach, E. B. (1991). Integrating assessment and instruction: A
research and development agenda (ETS Research Rep. No. RR-91-8). Princeton,
NJ: Educational Testing Service.
Snow, R. E., & Peterson, P. (1985). Cognitive analyses of tests: Implications for
redesign. In S. E. Embretson (Ed.), Test design: Developments in psychology and
psychometrics (pp. 149-166). New York: Academic Press.
Stiggins, R. J. (1991). Facing the challenges of a new era of educational assessment.
Applied Measurement in Education, 4, 263-273.
Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions
based on item response theory.Journal of Educational Measurement,20, 345-354.
Tatsuoka, K. K. (1990). Toward an integration of item-response theory and cognitive
error diagnosis. In N. Frederiksen, R. Glaser, A. Lesgold, & M. G. Shafto (Eds.),
Diagnostic monitoring of skill and knowledge acquisition (pp. 453-488). Hillsdale,
NJ: Erlbaum.
Tatsuoka,K. K. (in press). Architectureof knowledge structuresand cognitive diagno-
sis: A statistical pattern recognition and classification approach. In P. Nichols, S.
F. Chipman, & R. L. Brennan (Eds.), Cognitively Diagnostic Assessment. Hillsdale,
NJ: Erlbaum.
Tittle, C. K. (1991). Changing models of student and teacher assessment. Educational
Psychologist, 26, 157-165.
VanLehn, K. (1982). Bugs are not enough: Empirical studies of bugs, impasses, and
repairs in procedural skills. Journal of Mathematical Behavior, 3, 3-72.
Yamamoto, K. (1989). Hybrid model of IRT and latent class models (ETS Research
Rep. No. RR-89-41). Princeton, NJ: Educational Testing Service.

Author
PAUL D. NICHOLS is Assistant Professor, Department of Educational Psychology, University of Wisconsin-Milwaukee, 765 Enderis Hall, PO Box 413, Milwaukee, WI 53201. He specializes in alternative assessment and problem solving.

Received November 17, 1993
Revision received July 5, 1994
Accepted July 11, 1994
