Over the past decade or so, a growing number of writers have argued that cognitive science and psychometrics could be combined in the service of instruction. Researchers have progressed beyond statements of intent to the hands-on business of researching and developing diagnostic assessments combining cognitive science and psychometrics, what I call cognitively diagnostic assessment (CDA). In this article, I attempt to organize the many loosely connected efforts to develop cognitively diagnostic assessments. I consider the development of assessments to guide specific instructional decisions, sometimes referred to as diagnostic assessments. Many of my arguments apply to program evaluation as well: assessments that reveal the mechanisms test takers use in responding to items or tasks provide important information on whether instruction is achieving its goals. My goal in this article is to characterize CDA in terms of the intended use of assessment and the methods of developing and evaluating assessments. Towards this goal, I (a) outline the societal trends that motivate the development of CDA, (b) introduce a framework within which the psychological and statistical aspects of CDA can be coordinated, and (c) summarize efforts to develop CDA in a five-step methodology that can guide future development efforts. Finally, I address some of the issues developers of CDA must resolve if CDA is to succeed.
Over the past decade or so, a growing number of writers have argued that cognitive science and psychometrics could be combined in the service of instruction (Bejar, 1984; Haertel & Calfee, 1983; Linn, 1986; Messick, 1984; Snow & Lohman, 1989). They have criticized traditional testing for losing sight of the psychology of the performance being tested (Glaser, 1981; Glass, 1986). Traditional testing practices appear to place more emphasis on statistical technique than on the psychology of the construct being measured (Anastasi, 1967). They have argued that, given some knowledge of the goals and methods of instruction and of the psychology of the construct, "educational tests might be made more diagnostic of malfunctions in learning and more informative for instructional adaptation" (Snow & Lohman, 1989, p. 266).
as thoughts, images, and ideas, and the processes they signify again became legitimate areas of study (Lachman, Lachman, & Butterfield, 1979). Later, psychologists moved beyond studying general mechanisms of thinking and learning and addressed the processes and knowledge structures involved in performing everyday tasks in academic, industrial, and military settings (Glaser, 1988). In statistics, statisticians were developing applications of statistical inference not based on the Neyman-Pearson system of classical statistical inference (Jeffrey, 1983).
The CDA approach enables educators and policymakers to draw conclusions about students' instructional needs and the effectiveness of instructional programs because CDA provides instructors and policymakers with information on strategies students use to attack problems, relationships students perceive among concepts, and principles students understand in a domain: student outcomes closely aligned with curriculum movements in science and mathematics. In contrast, traditional assessments, including performance-based or authentic assessments, provide instructors and policymakers with estimates of students' relative positions with regard to the amount of content acquired. CDAs appear to support current needs of educators, whereas traditional assessments appear better suited to past educational approaches that stressed the accretion of facts.
Framework for Cognitively Diagnostic Assessment
Researchers designing cognitively diagnostic assessments realized that they were lacking a test theory that allowed them to diagnose the qualities of learners' processes and knowledge structures. This required more than summing the correct responses.3 They were searching for new test theories: test theories suited to diagnosing cognitive mechanisms. In this section, I attempt to describe the psychological and statistical considerations involved in adopting a test theory. This may seem a strange idea to many measurement specialists and psychologists alike. I argued in the previous section that decisions regarding instructional adaptation require a different approach toward assessment than decisions regarding selection. In this section, I argue that different conceptions of domain performance require different test theories. All domains are not alike, and the test developer should not ignore the unique nature of the performance. I use examples from mathematics to illustrate this argument because mathematics has been studied extensively by psychologists and has, perhaps not coincidentally, been the focus of a number of CDA research efforts.
Following Ippel (1986, 1991) and Lohman and Ippel (1993), I view test theory as consisting of two related aspects: an observation design for constructing and arranging tasks or items, and a measurement design for collecting and combining responses. Current test theories confound these two aspects. The observation and measurement designs provide the test developer a framework within which the psychological and statistical considerations can be coordinated. I will argue later in this section that the validity of both designs must be evaluated within the context of substantive research. The observation and measurement designs are discussed separately, but practical issues concerning them should be considered together because diagnosis is possible only through their coordination. The most sophisticated inference from any diagnostic measurement design is limited by the richness of the performance elicited through the observation design.
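To make the distinction concrete, the following is a minimal sketch in Python under a toy representation of my own devising, not Ippel's or Lohman and Ippel's formalism: the observation design is reduced to a set of task facets and an arrangement rule, and the measurement design to a rule for combining responses into an instructional inference.

# A minimal sketch, under my own toy representation (not Ippel's or Lohman and
# Ippel's formalism), of the two aspects of test theory: an observation design
# that constructs and arranges tasks, and a measurement design that collects
# and combines responses. Keeping them separate lets each be checked against
# the substantive base on its own terms.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ObservationDesign:
    # Facets the constructed tasks must vary on, plus a rule for arranging them.
    facets: Dict[str, List[str]]
    arrange: Callable[[List[dict]], List[dict]]


@dataclass
class MeasurementDesign:
    # Rule for combining a vector of item responses into a diagnostic inference.
    combine: Callable[[List[int]], str]


# Toy instantiation: tasks varying on one facet, scored by overall accuracy.
observation = ObservationDesign(
    facets={"gate_type": ["AND", "OR", "XOR"]},
    arrange=lambda tasks: sorted(tasks, key=lambda t: t["gate_type"]),
)
measurement = MeasurementDesign(
    combine=lambda responses: "review" if sum(responses) / len(responses) < 0.8 else "advance",
)

tasks = observation.arrange([{"gate_type": "XOR"}, {"gate_type": "AND"}])
print(tasks)                                 # tasks arranged by the observation design
print(measurement.combine([1, 1, 0, 1, 0]))  # "review": accuracy 0.6 falls below 0.8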
But if the denominator of the fraction is not 10, then the strategy produces an incorrect answer, as follows:
FIGURE 1. Example of task to assess feature recognition for the Change schema
(from Marshall, 1993)
TABLE 1
Operations for subtraction (from Langley, Wogulis, & Ohlsson, 1990)

Add-Ten(number, row, column)                Takes the number in a row and column and replaces it with that number plus ten.
Decrement(number, row, column)              Takes the number in a row and column and replaces it with that number minus one.
Find-Difference(number1, number2, column)   Takes the two numbers in the same column and writes the difference of the two as the result for that column.
Find-Top(column)                            Takes a number from the top row of column and writes that number as the result for that column.
Shift-Column(column)                        Takes the column which is both focused-on and being processed and shifts both to the column on its left.
Shift-Left(column)                          Takes the column which is focused-on and shifts the focus of attention to the column on its left.
Shift-Right(column)                         Takes the column which is focused-on and shifts the focus of attention to the column on its right.
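The operators in Table 1 lend themselves to a small procedural model. The sketch below is an illustration only, assuming a simple grid representation and method names of my own choosing rather than Langley, Wogulis, and Ohlsson's implementation; it encodes several of the operators in Python and applies them to a borrowing problem.

# A sketch of several operators from Table 1, applied to a simple grid of
# digits. The state layout and method names are my own assumptions for
# illustration, not Langley, Wogulis, and Ohlsson's implementation.

class SubtractionState:
    def __init__(self, top, bottom):
        # Digits are stored left to right; columns are numbered from the right.
        self.top = list(top)
        self.bottom = list(bottom)
        self.result = [None] * len(top)
        self.focused = 0  # column currently holding the focus of attention

    def _idx(self, column):
        # Convert a right-based column number to a list index.
        return len(self.result) - 1 - column

    def add_ten(self, row, column):
        # Add-Ten: replace the number in (row, column) with that number plus ten.
        row_digits = self.top if row == "top" else self.bottom
        row_digits[self._idx(column)] += 10

    def decrement(self, row, column):
        # Decrement: replace the number in (row, column) with that number minus one.
        row_digits = self.top if row == "top" else self.bottom
        row_digits[self._idx(column)] -= 1

    def find_difference(self, column):
        # Find-Difference: write the difference of the two numbers in a column.
        i = self._idx(column)
        self.result[i] = self.top[i] - self.bottom[i]

    def find_top(self, column):
        # Find-Top: copy the top number of a column into the result row.
        i = self._idx(column)
        self.result[i] = self.top[i]

    def shift_left(self):
        # Shift-Left: move the focus of attention one column to the left.
        self.focused += 1


# Correct borrowing on 52 - 17: decrement the tens, add ten to the ones,
# take the ones difference, shift left, take the tens difference.
state = SubtractionState([5, 2], [1, 7])
state.decrement("top", 1)   # 5 -> 4
state.add_ten("top", 0)     # 2 -> 12
state.find_difference(0)    # 12 - 7 = 5
state.shift_left()
state.find_difference(1)    # 4 - 1 = 3
print(state.result)         # [3, 5], that is, 35
# Buggy strategies can be modeled by omitting or reordering these calls,
# for example skipping the borrow before taking the ones difference.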
TABLE 2
Item objectives for the Subtraction of Whole Numbers concept/skill domain in the SDMT

The pupil will demonstrate the ability to use the standard algorithm for subtraction (vertical form) with renaming by:

Item
4   finding the unknown addend (remainder) when a number in the tens is subtracted from a number in the hundreds.
5   finding the unknown addend when a number in the hundreds is subtracted from another number in the hundreds.
6   finding the unknown addend when a number in the hundreds is subtracted from a number in the thousands.
7   finding the unknown addend when a number in the hundreds is subtracted from a number in the thousands with a zero in the tens place.
8   finding the unknown addend when a number in the thousands is subtracted from another number in the thousands with zeros in the ones and in the hundreds places.
9   finding the unknown addend when a number in the thousands is subtracted from another number in the thousands with a zero in the tens place.
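Objectives of this kind fix the facets an item must exhibit, so items can in principle be generated to specification. The sketch below is a rough illustration under my own assumptions, not part of the SDMT; it generates number pairs satisfying two of the objectives in Table 2, using a simple check on the ones digits to force renaming.

# An illustrative generator, under my own assumptions and not part of the
# SDMT, for number pairs satisfying two of the objectives in Table 2. Each
# objective fixes the magnitudes of the minuend and subtrahend; the check on
# the ones digits forces at least one renaming (borrow).
import random


def item_for_objective_4(rng=random):
    # Objective 4: a number in the tens subtracted from a number in the hundreds.
    while True:
        minuend = rng.randint(100, 999)
        subtrahend = rng.randint(10, 99)
        if subtrahend % 10 > minuend % 10:  # ones digits force a borrow
            return minuend, subtrahend


def item_for_objective_7(rng=random):
    # Objective 7: a number in the hundreds subtracted from a number in the
    # thousands that has a zero in the tens place.
    while True:
        minuend = rng.randint(1000, 9999)
        subtrahend = rng.randint(100, 999)
        zero_tens = (minuend // 10) % 10 == 0
        if zero_tens and subtrahend % 10 > minuend % 10:
            return minuend, subtrahend


print(item_for_objective_4())   # e.g., (341, 27)
print(item_for_objective_7())   # e.g., (5308, 419)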
TABLE 3
The five steps of test development within psychology-driven test development

STEP 1  SUBSTANTIVE THEORY CONSTRUCTION
The substantive base concerns the development of a model or theory that describes the knowledge and skills hypothesized to be involved in performance and the item or task characteristics hypothesized to interact with the knowledge and skill.

STEP 2  DESIGN SELECTION
In this step, the test developer selects the observation and measurement designs. The selection is informed by the substantive base constructed in Step 1. Subsequently, the test developer constructs items or tasks that will be responded to in predictable ways by test takers with specific knowledge, skills, and other characteristics identified as important in the theory. The procedure for constructing assessments is the operationalization of the assessment design.

STEP 3  TEST ADMINISTRATION
Test administration includes every aspect of the context in which test takers complete the test: the format of the items, the nature of the required response, the technology used to present test materials and record responses, and the environment of the testing session. Decisions concerning the context of the testing session should be informed by research on how aspects of the context influence test takers' performance.

STEP 4  RESPONSE SCORING
The goal of this step is to assign values to test takers' patterns of responses so as to link those patterns to theoretical constructs such as strategies or malrules. As with assessment construction, a scoring procedure is the operationalization of the assessment design.

STEP 5  DESIGN REVISION
Design revision is the process of gathering support for a model or theory. As with any scientific theory, the theory used in test development is never proven; rather, evidence is gradually accumulated that supports or challenges the theory. In this step, the results of administering the assessment are used to revise the substantive base upon which the construction of the assessment was based.
Design Selection
The second step in developing CDA is the construction of the observation and measurement designs. As I have argued, the validity of both the observation design and the measurement design is evaluated with respect to the substantive base. As the work of Gitomer (Gitomer, 1987; Gitomer & Van Slyke, 1988) and his colleagues illustrates, the task for the test developer is to construct and organize observations and combine responses in ways that are consistent with the substantive base. The observation and measurement designs used by Gitomer and Van Slyke (1988) in the GATES tutor are summarized in Figure 2. The tutor performs an initial assessment followed, if warranted, by a more detailed diagnosis. The initial assessment was intended to distinguish between technicians who had rule-based misconceptions or weak conceptions of how to solve logic gates, and technicians who understood how to solve logic gates but had inefficiently organized knowledge. The more detailed diagnosis was intended to identify the rule-based misconceptions held by the technicians.
The initial assessment consists of a circuit tracing test and a screening test.6 In the circuit tracing test, tutor users trace through a complex arrangement of logic gates and indicate the output for each gate. The observation design demanded that logic gates vary in the type of gate, whether the gate was negated or not negated, and the number of inputs. Furthermore, the observation design demanded that gates be arranged in a complex circuit. The measurement design demanded that overall accuracy be computed on the task. Tutor users who had high overall accuracy on the circuit tracing task exited the tutor, whereas tutor users who answered many logic gates incorrectly attempted a screening test to diagnose the source of their difficulty. In the screening test, tutor users indicated the correct outputs for 48 single gates. The faceted observation design required that gates
vary in the type of gate, whether the gate was negated or not negated, and number of inputs. Furthermore, the observation design required that gates be presented singly. The measurement design demanded that accuracy be computed for each gate type, for negated and nonnegated gates, and for gates differing in number of inputs. Under the measurement design, low accuracy on any set of gates moved the tutor user to a diagnostic module for that set of gates. High accuracy across sets of gates moved the tutor user to a practice module intended to increase the efficiency of the tutor user's access to knowledge of logic gates.
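The routing just described can be summarized in a short sketch. The threshold, gate sets, and function below are my own assumptions, not the GATES tutor's actual logic; the sketch only illustrates how per-set accuracy on the screening test determines whether a user moves to diagnostic modules or to the practice module.

# A simplified sketch of the screening-test routing described above. The 0.8
# threshold, the gate sets, and the function are my own assumptions, not the
# GATES tutor's actual logic.
from typing import Dict, List


def route_from_screening(accuracy_by_set: Dict[str, float],
                         threshold: float = 0.8) -> List[str]:
    # accuracy_by_set maps each set of gates (a gate type, negated vs.
    # nonnegated gates, or a number of inputs) to the proportion correct.
    weak_sets = [gate_set for gate_set, acc in accuracy_by_set.items() if acc < threshold]
    if weak_sets:
        # Low accuracy on any set sends the user to that set's diagnostic module.
        return ["diagnostic:" + gate_set for gate_set in weak_sets]
    # High accuracy across sets sends the user to practice aimed at more
    # efficient access to logic gate knowledge.
    return ["practice"]


print(route_from_screening({"AND": 0.95, "negated": 0.50, "three-input": 0.90}))
# ['diagnostic:negated']
print(route_from_screening({"AND": 0.95, "negated": 0.90, "three-input": 0.90}))
# ['practice']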
Both the observation and measurement designs of the initial assessment were motivated by substantive concerns. Substantive research suggested two hypotheses for low accuracy on the first module: users may have had conceptual difficulty with particular kinds of logic gates, or users may have had difficulty accessing logic gate knowledge efficiently. The first task involved tracing a complex circuit and was scored using accuracy because substantive research showed that some less skilled avionics technicians experienced difficulty tracking outputs from logic gates in a complex circuit. These less skilled technicians experienced difficulty in managing memory demands because their logic gate knowledge was organized less efficiently than that of more competent technicians. The second task involved single gates and was scored using accuracy on kinds of gates because substantive research showed that technicians' difficulties interpreting logic gates may be due to misunderstandings, or at least impasses, in the knowledge needed to solve those kinds of gates. Gates were presented singly to reduce the role of efficient access to knowledge in users' performance. Furthermore, the initial assessment consisted of two contrasting tasks, tracing complex circuits versus answering single gates, because substantive research indicated that technicians who have misconceptions or impasses in the knowledge needed to solve particular kinds of logic gates will show a different pattern of accuracy than technicians who have inefficient access to this knowledge. For example, low accuracy on complex circuits but high accuracy on single gates indicates technicians understand how to solve logic gates but need to access this knowledge more efficiently. Alternatively, low accuracy on complex circuits but high accuracy on all but one gate type on single gates indicates technicians have an impasse or misunderstanding in the knowledge needed to solve logic gates.
Tutor users who failed to indicate accurately the correct output for one or more sets of logic gates in the screening test were presented with diagnostic modules for those gates. Each module required tutor users to indicate the correct output for single gates sharing a particular set of attributes. For example, tutor users may have been presented with all gates of one type or all negated gates. The observation design required that all gates be presented singly and that gates within a diagnostic module share the same attributes: all of one type, all negated or nonnegated, or all with the same number of inputs. The measurement design required that latent class analysis be used to assign technicians to qualitatively different classes corresponding to misconceptions regarding logic gates. This was done by matching response vectors with misconceptions. A latent class approach was used because actual response vectors rarely match an ideal response vector exactly. Matches between actual responses and predicted responses increase support for a particular classification, whereas mismatches reduce support for a particular classification.
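The logic of matching can be illustrated with a simplified sketch. The code below tallies agreement between an observed response vector and the ideal vector predicted for each misconception; the misconceptions and vectors are invented for illustration, and the tally stands in for, rather than reproduces, the latent class analysis Gitomer and Van Slyke used.

# A simplified sketch of classifying a response vector by its agreement with
# the ideal vector predicted for each misconception. The misconceptions and
# vectors here are invented for illustration, and this simple tally stands in
# for, rather than reproduces, the latent class analysis used in GATES.

def support(observed, ideal):
    # Each match raises support for the classification; each mismatch lowers it.
    return sum(1 if o == i else -1 for o, i in zip(observed, ideal))


ideal_vectors = {
    "confuses-AND-with-OR": (0, 1, 0, 1, 1, 0),
    "ignores-negation":     (1, 0, 1, 0, 1, 0),
    "no-misconception":     (1, 1, 1, 1, 1, 1),
}

observed = (0, 1, 0, 1, 1, 1)
scores = {label: support(observed, ideal) for label, ideal in ideal_vectors.items()}
print(max(scores, key=scores.get), scores)
# 'confuses-AND-with-OR' receives the most support despite one mismatch.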
to score and compute individual scores and item and test statistics. I will leave such questions for another occasion.
Instead, I would like to discuss indicators that may be used to evaluate the implementation of the test theory. Generally, implementation of traditional approaches is evaluated using item statistics indicating difficulty and discrimination (Millman & Greene, 1989) and test statistics indicating reliability (Feldt & Brennan, 1989). For the purposes of informing instructional decisions, traditional indicators of item and test functioning may not be of much help. Again, the work of Gitomer provides an example. Gitomer and Yamamoto (1989) report the item p values and biserials for 119 technicians who completed a 20-item diagnostic logic gate assessment. As Gitomer and Yamamoto note, difficulty values were moderate, and biserials suggested that doing well on one item bodes well for overall test performance. But these item statistics were developed for assessments intended to select students most likely to succeed in a uniform instructional environment rather than assessments intended to make inferences about cognitive structures and processes. What meaning do these statistics have for a diagnostic test?
CDA-based assessments require indicators that may be used to evaluate the implementation of test theory that is intended to identify qualitative differences in individuals' processes and knowledge structures. Researchers have offered several alternatives to traditional CTT indicators of test and item functioning. At the test level, Brown and Burton (1978) propose test diagnosticity as an index to evaluate their diagnostic assessment of students' subtraction misconceptions. Under Brown and Burton's approach, any test taker holding a particular misconception or combination of misconceptions will produce a particular response vector for a given set of test items. The response vectors may be partitioned so that identical response vectors, corresponding to different misconceptions, are placed in the same partition. A perfect diagnostic test would have one response vector in each partition. A less-than-perfect diagnostic test would have at least one partition with more than one response vector.
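A minimal sketch of the partitioning idea is given below. The misconceptions, predicted response vectors, and the summary ratio at the end are my own illustrative assumptions rather than Brown and Burton's index as they computed it; the sketch only shows how collisions among predicted response vectors limit what a test can diagnose.

# A sketch of the partitioning idea behind test diagnosticity. The
# misconceptions, predicted response vectors, and the summary ratio at the end
# are my own illustrative assumptions, not Brown and Burton's (1978) index as
# they computed it.
from collections import defaultdict

# Ideal response vectors (1 = correct, 0 = incorrect) predicted for each
# hypothesized misconception on a four-item subtraction test.
ideal_vectors = {
    "smaller-from-larger":  (0, 0, 1, 1),
    "borrow-across-zero":   (1, 0, 0, 1),
    "stops-at-zero":        (1, 0, 0, 1),   # collides with borrow-across-zero
    "no-misconception":     (1, 1, 1, 1),
}

# Partition the misconceptions by their predicted response vector.
partitions = defaultdict(list)
for misconception, vector in ideal_vectors.items():
    partitions[vector].append(misconception)

for vector, members in partitions.items():
    print(vector, members)

# A perfectly diagnostic test puts one misconception in each partition. Here
# one partition holds two misconceptions, so this item set cannot tell them apart.
uniquely_identified = sum(1 for members in partitions.values() if len(members) == 1)
print(uniquely_identified / len(ideal_vectors))   # 0.5 in this toy example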
At the item level, Bart (Bart, 1991; Bart & Williams-Morris, 1990) has proposed two indices. Both indices are computed using an item response-by-rule matrix. For any item, each possible response may correspond to the test taker's use of at least one rule or strategy. An item on which the test taker is asked to respond with true or false will have two possible responses, and each response may correspond to the use of one or more rules. Alternatively, a multiple-choice item with four alternatives, such as the two items represented in Table 4, will have four possible responses, and each response may correspond to one or more rules. I have represented only four rules in Table 4. Note that an item response corresponding to a rule is represented by a one in that cell of the table.

The index of response interpretability captures the degree to which each response to the item is interpretable by at least one rule. The computation of response interpretability is straightforward given an item response-by-rule matrix. The index is the number of responses that are interpreted by one or more rules divided by the number of responses. Values range from 0, which indicates no rule-based responses, to 1, which indicates complete rule-based responses. In Table 4, the response interpretability of Item 1 is 1, whereas the response interpretability of Item 2 is 0.5.
TABLE 4
The Item Response by Rule matrix for two multiple-choice items

                     Rule
Response     1     2     3     4
Item 1
    1        1     0     0     0
    2        0     1     0     0
    3        0     0     1     0
    4        0     0     0     1
Item 2
    1        0     0     1     1
    2        0     0     0     0
    3        0     0     0     0
    4        1     1     0     0
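The index is simple enough to state in a few lines of code. The sketch below follows the definition given above, responses interpreted by at least one rule divided by the total number of responses, and reproduces the values reported for the two items in Table 4; the Python encoding of the matrix is my own.

# A sketch of Bart's index of response interpretability for the two items in
# Table 4. The list encoding of the matrix is my own; the index follows the
# definition in the text.

def response_interpretability(item_matrix):
    # item_matrix[r][k] is 1 if response r corresponds to rule k, else 0.
    interpretable = sum(1 for row in item_matrix if any(row))
    return interpretable / len(item_matrix)


item1 = [[1, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 1]]

item2 = [[0, 0, 1, 1],
         [0, 0, 0, 0],
         [0, 0, 0, 0],
         [1, 1, 0, 0]]

print(response_interpretability(item1))   # 1.0
print(response_interpretability(item2))   # 0.5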
Design Review
Design review is the process of gathering support for the observation and measurement designs used in test development. The design is like a theory, and, as with any scientific theory, the theory is never proven; rather, evidence is gradually accumulated that supports or challenges the design. Design review is a process that continues before and after test administration. Initially, evidence
How Much Support Is There for the Interpretation and Use of CDA?
Notes
1. I chose cognitively diagnostic assessment to avoid confusing diagnostic assessments combining cognitive science and psychometrics with traditional diagnostic assessments. There are currently a large number of diagnostic tests, including the Nelson-Denny Reading Test, the Stanford Diagnostic Mathematics Test, the Stanford Diagnostic Reading Test, and the Instructional Tests of the Metropolitan Achievement Tests. These tests differ from CDA in three aspects: (a) the design is based on logical taxonomies and content specifications and lacks explicit psychological models of the structures and processes that underlie domain performance; (b) the scores are tied to content areas rather than cognitive mechanisms; and (c) the scores are often computed using methods developed to select students rather than methods developed to make inferences about cognitive structures and processes. Other terms were considered and rejected. Theory-referenced construction has been used but was rejected because psychometric theories have been used in the past and test development based on them could also be called theory-referenced construction.
References
Yamamoto, K. (1989). Hybrid model of IRT and latent class models (ETS Research Rep. No. RR-89-41). Princeton, NJ: Educational Testing Service.
Author
PAUL D. NICHOLS is Assistant Professor, Department of Educational Psychology, University of Wisconsin-Milwaukee, 765 Enderis Hall, PO Box 413, Milwaukee, WI 53201. He specializes in alternative assessment and problem solving.
Received November 17, 1993
Revision received July 5, 1994
Accepted July 11, 1994