SAMUEL MESSICK
Fifteen years ago or so, in papers dealing with personality measurement and the ethics of assessment,
I drew a straightforward but deceptively simple
distinction between the psychometric adequacy of a
test and the appropriateness of its use (Messick,
1964, 1965). I argued that not only should tests be
evaluated in terms of their measurement properties
but that testing applications should be evaluated in
terms of their potential social consequences. I urged
that two questions be explicitly addressed whenever
a test is proposed for a specific purpose: First, is the
test any good as a measure of the characteristics it
is interpreted to assess? Second, should the test be
used for the proposed purpose in the proposed way?
The first question is a scientific and technical one
and may be answered by appraising evidence for the
test's psychometric properties, especially construct
validity. The second question is an ethical one, and
its answer requires a justification of the proposed
use in terms of social values. Good answers to the
first question are not satisfactory answers to the
second. Justification of test use by an appeal to
empirical validity is not enough; the potential social
consequences of the testing should also be appraised,
consistency from one time to another, will be predictive of performance in those domains (Wernimont & Campbell, 1968). A key value question is
whether such "persistence forecasting," as Wallach
(1976) calls it, is desirable in a particular domain
of application. In higher education, for example,
the appropriate model might not be persistence but
development and change, which suggests that in
such instances we be wary of selection procedures
that restrict individual opportunity on the basis of
behavior to date (Hudson, 1976).
The distinction stressed thus far between the
adequacy of a test as a measure of the characteristic it is interpreted to assess and the appropriateness of its use in specific applications underscores
in the first instance the evidential basis of test
interpretation, especially the need for construct
validity evidence, and in the second instance the
consequential basis of test use, through appraisal
of potential social consequences. In developing
this distinction in prior work I emphasized the
importance of construct validity for test use as
well, arguing "that even for purposes of applied
decision making reliance upon criterion validity or
content coverage is not enough," that the meaning
of the measure must also be comprehended in order
to appraise potential social consequences sensibly
(Messick, 1975, p. 956). The present article
extends this argument for the importance of construct validity in test use still further by stressing
its role in providing a "rational foundation for predictive validity" (Guion, 1976b). After thus
elaborating the evidential basis of test use, I consider the value implications of test interpretations
per se, especially those that bear evaluative and
ideological overtones going beyond intended meanings and supporting evidence; the circle is thereby
completed with an examination of the consequential
basis of test interpretation. Finally, the dynamic
interplay between test interpretation and its value
implications, on the one hand, and test use and its
social consequences, on the other, is sketched in a
feedback model that incorporates a pragmatic
component for the empirical evaluation of testing
consequences.
inferences from test scores or other forms of assessment. . . . It is important to note that validity is
itself inferred, not measured. . . . It is, therefore,
something that is judged as adequate, or marginal,
or unsatisfactory" (p. 25). This document also
points out that the many forms of validity questions fall into two broad classes, those dealing with
inferences about what is being measured by the
test and those inquiring into the usefulness of the
measurement as a predictor of other variables.
Furthermore, there are a variety of validation
methods available, but they all entail in principle
a clear designation of what is to be inferred from
the scores and the presentation of data to support
such inferences.
Unfortunately, after this splendid beginning, this and other official documents, namely the Division of Industrial and Organizational Psychology's (1975) Principles for the Validation and Use of Personnel Selection Procedures and the Equal Employment Opportunity Commission et al.'s (1978) "Uniform Guidelines on Employee Selection Procedures," proceed, as Dunnette and Borman (1979) lament, to "perpetuate a conceptual compartmentalization of 'types' of validity: criterion-related, content, and construct. . . . the implication that validities come in different types leads to confusion and, in the face of confusion, over-simplification" (p. 483). One consequence
of this simplism is that many test users focus on
one or another of the types of validity, as though
any one would do, rather than on the specific inferences they intend to make from the scores. There
is an implication that once evidence of one type of
validity is forthcoming, one is relieved of responsibility for further inquiry. Indeed, the "Uniform
Guidelines" seem to treat the three types of validity, in Guion's (1980) words, "as something of a
Holy Trinity representing three different roads to
psychometric salvation. If you can't demonstrate
one kind of validity, you've got two more chances!"
(p. 4).
Different kinds of inferences from test scores
require different kinds of evidence, not different
kinds of validity. By "evidence" I mean both
data, or facts, and the rationale or arguments that
cement those facts into a justification of test-score
inferences. "Another way to put this is to note
that data are not information; information is that
which results from the interpretation of data"
(Mitroff & Sagasti, 1973, p. 123). Or as Kaplan
(1964) states, "What serves as evidence is the result of a process of interpretation; facts do not
[Table 1 fragment: only the column heading "Descriptive designation" and the entry "Content validity" are legible.]
such as the ideologies of social science or of education or of social justice, and hence go beyond construct meaning per se.
INTERPRETIVE MEANINGFULNESS
Construct validation is a process of marshaling evidence to support the inference that an observed
response consistency in test performance has a
particular meaning, primarily by appraising the
extent to which empirical relationships with other
measures, or the lack thereof, are consistent with
that meaning. These empirical relationships may
be assessed in a variety of ways, for example, by
gauging the degree of consistency in correlational
patterns and factor structures, in group differences,
response processes, and changes over time, or in
responsiveness to experimental treatments. The
process attempts to link the reliable response consistencies summarized by test scores to nontest
behavioral consistencies reflective of a presumably
common underlying construct, usually an attribute
or process or trait that is itself embedded in a more
comprehensive network of theoretical propositions
or laws called a nomological network (Feigl, 1956;
Hempel, 1970; Margenau, 1950). An empirically
grounded pattern of such links provides an evidential basis for interpreting the test scores in
construct or process terms, as well as a rational
basis for inferring testable implications of the
scores from the broader theoretical network of the
construct's meaning (Cronbach & Meehl, 1955;
Messick, 1975). Constructs are thus chosen or
created "to organize experience into general lawlike statements" (Gronbach, 1971, p. 462).
Construct validation entails both confirmatory
and disconfirmatory strategies, one to provide
convergent evidence that the measure in question
is coherently related to other measures of the same
construct as well as to other variables that it should
relate to on theoretical grounds, and the other to
provide discriminant evidence that the measure is
not related unduly to exemplars of other distinct
constructs (D. T. Campbell & Fiske, 1959). Discriminant evidence is particularly critical for discounting plausible counterhypotheses to the construct interpretation (Popper, 1959), especially
those pointing to the possibility that the observed
consistencies might instead be attributable to
shared method constraints, response sets, or other
contaminants.
Construct validity emphasizes two intertwined
sets of relationships for the test: one between the
test and different methods for measuring the same
construct or trait, and the other between measures
of the focal construct and exemplars of different
constructs predicted to be variously related to it
on theoretical grounds. Theoretically relevant
empirical consistencies in the first set, indicating a
correspondence between measures of the same construct, have been called trait validity, and those in
the second set, indicating a lawful relatedness between measures of different constructs, have been
called nomological validity (D. T. Campbell, 1960;
Cronbach & Meehl, 1955). In order to discount
competing hypotheses involving alternative constructs or method contaminants, the two sets are
often analyzed simultaneously in a multitrait-multimethod strategy that employs multiple methods for assessing each of two or more different
constructs (D. T. Campbell & Fiske, 1959). Such
an approach highlights the need for both convergent
and discriminant evidence in both trait and nomological validity.
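The Campbell and Fiske logic can be rendered as a minimal sketch, again in Python with simulated scores for two hypothetical traits each measured by two hypothetical methods. The comparison rules below are a simplified reading of their informal criteria, not a complete implementation.

```python
import numpy as np

# Sketch of the multitrait-multimethod logic (Campbell & Fiske, 1959):
# two hypothetical traits (T1, T2), each measured by two methods (M1, M2),
# with simulated data and a deliberate method contaminant shared by M1.
rng = np.random.default_rng(1)
n = 300
t1, t2 = rng.normal(size=n), rng.normal(size=n)
method_bias = rng.normal(scale=0.4, size=n)  # shared method variance in M1

scores = {
    "T1_M1": t1 + method_bias + rng.normal(scale=0.5, size=n),
    "T1_M2": t1 + rng.normal(scale=0.5, size=n),
    "T2_M1": t2 + method_bias + rng.normal(scale=0.5, size=n),
    "T2_M2": t2 + rng.normal(scale=0.5, size=n),
}

def r(a, b):
    return np.corrcoef(scores[a], scores[b])[0, 1]

# Convergent evidence: same trait, different methods (validity diagonal).
convergent = [r("T1_M1", "T1_M2"), r("T2_M1", "T2_M2")]
# Discriminant evidence: different traits, with and without a shared method.
heterotrait_monomethod = [r("T1_M1", "T2_M1"), r("T1_M2", "T2_M2")]
heterotrait_heteromethod = [r("T1_M1", "T2_M2"), r("T1_M2", "T2_M1")]

print("convergent (same trait, different method):", np.round(convergent, 2))
print("heterotrait-monomethod:", np.round(heterotrait_monomethod, 2))
print("heterotrait-heteromethod:", np.round(heterotrait_heteromethod, 2))
# Informal criterion: validity-diagonal correlations should exceed the
# heterotrait correlations; an elevated heterotrait-monomethod value
# relative to heterotrait-heteromethod flags shared method variance.
print("convergent > discriminant:",
      min(convergent) > max(heterotrait_monomethod + heterotrait_heteromethod))
```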
Trait validity deals with the fit between measurement operations and conceptual definitions of the
construct, and nomological validity deals with the fit
between obtained data patterns and theoretical predictions about those patterns (Cook & Campbell, 1979).
So-called "criterion-related validity" is usually considered to comprise two types, concurrent validity
and predictive validity, which differ respectively in
terms of whether the test and criterion data were
collected at the same time or at different times. A
validity. Consensual judgments about the relevance of the test domain as defined to a particular behavioral domain of interest (as, for example,
when choosing a standardized achievement test to
evaluate a new curriculum), along with judgments
of the adequacy of content coverage in the test,
are the kinds of evidence usually offered for content
validity. But note that this is not evidence in
support of inferences from test scores, although it
might influence the nature of those inferences.
This attempt to define content validity as separate from construct validity produces a dysfunctional strain to avoid constructs, as if shunning
them in test development somehow lessens the
import "of response processes in test performance.
The important sampling consideration in test construction is not representativeness of the surface
content of tasks but representativeness of the processes employed by subjects in arriving at a
response (Lennon, 1956). This puts content
validity squarely in the realm of construct validity
(Messick, 1975). Rather than strain after nebulous distinctions, we should inquire how content
considerations contribute to construct validity and
how to strengthen that contribution (Tenopyr,
1977).
Loevinger (1957) incorporated content as an
important feature of construct validity by considering content representativeness and response consistency jointly. What she called "substantive
validity" is "the extent to which the content of the
items included in (and excluded from?) the test
can be accounted for in terms of the trait believed
to be measured and the context of measurement"
(Loevinger, 1957, p. 661). This notion was introduced "because of the conviction that considerations of content alone are not sufficient to establish
validity even when the test content resembles the
trait, and considerations of content cannot be
excluded when the test content least resembles the
trait" (Loevinger, 1957, p. 657). The elimination
of certain items from the test because of poor
empirical response properties may sometimes distort
the test's representativeness in covering the construct domain as originally conceived, but it is
justified if the resulting test thereby becomes a
better exemplar of the construct as empirically
grounded (Loevinger, 1957; Messick, 1975).
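The trade-off Loevinger describes can be sketched minimally, assuming simulated item responses and an illustrative corrected item-total cutoff of .20 (neither the data nor the cutoff comes from the text): empirically weak items are pruned, which sharpens the scale but can narrow its coverage of the construct domain as originally conceived.

```python
import numpy as np

# Sketch of item elimination on empirical grounds: items with low corrected
# item-total correlations are dropped. Data and the 0.20 cutoff are
# illustrative, not prescriptive.
rng = np.random.default_rng(2)
n_persons, n_items = 500, 10
ability = rng.normal(size=n_persons)

# Items 0-7 reflect the construct strongly; items 8-9 are mostly noise.
loadings = np.array([0.7] * 8 + [0.1] * 2)
responses = ability[:, None] * loadings + rng.normal(size=(n_persons, n_items))

total = responses.sum(axis=1)
keep = []
for j in range(n_items):
    rest = total - responses[:, j]  # corrected total: exclude the item itself
    r_it = np.corrcoef(responses[:, j], rest)[0, 1]
    print(f"item {j}: corrected item-total r = {r_it:.2f}")
    if r_it >= 0.20:
        keep.append(j)

print("retained items:", keep)
# If the dropped items had carried distinct, theoretically important content,
# this purely empirical pruning would distort the test's representation of
# the construct domain, which is the tension Loevinger identifies.
```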
Content validity has little to say about the scoring of content samples, and as a result scoring procedures are typically ad hoc (Guion, 1978b). Scoring models in the construct framework, in contrast,
logically parallel the structural relations inherent in
The issue of generalizability just broached for content sampling permeates all of validity. Several aspects of generalizability of special concern have been given distinctive labels, but unfortunately these labels once again invoke the sobriquet validity. The extent to which a measure's empirical relations and construct interpretation generalize to other population groups is called "population validity" (Shulman, 1970); to other situations or settings, "ecological validity" (Bracht & Glass, 1968; Snow, 1974); to other times, "temporal validity" (Messick & Barrows, 1972); and to other tasks representative of the operations called for in the particular domain of interest, "task validity" (Shulman, 1970).

The label validity is especially unsuitable for these important facets of generalizability, for such usage might be taken to imply that the more generalizable a measure is, the more valid. This is not always the case, however, as in the measurement of such constructs as mood, which fluctuates over time, or concrete operations, which typify a particular developmental stage, or administrative role, which operates in special organizational settings, or delusions, which are limited to specific psychotic groups. Rather, the appropriate degree of generalizability for a measure depends upon the nature of the construct assessed and the scope of its theoretical applicability. A closely related issue of "referent generality" (Coan, 1964; Snow, 1974), called "referent validity" by Cook and Campbell (1979), concerns the extent to which research evidence supports a measure's range of reference and the multiplicity of its referent terms. This concept points to the need to tailor the level of construct interpretation to the limits of the evidence and to avoid both oversimplification and overgeneralization in the connotation of construct labels. Nonetheless, constructs refer not only to available evidence but to potential evidence, so that the choice of construct labels is influenced by theory as well as by evidence and, as we shall see, by ideologies about the nature of humanity and society which add value implications that go beyond evidential validity per se.

EVIDENTIAL BASIS OF TEST INTERPRETATION AND USE

To recapitulate thus far, construct validity is the evidential basis of test interpretation. It entails both convergent and discriminant evidence documenting theoretically relevant empirical relationships (a) between the test and different methods for measuring the same construct, as well as (b) between measures of the construct and exemplars of different constructs predicted to be related nomologically. For test use, the relevance of the construct for the applied purpose is determined, in addition, by developing rational hypotheses relating the construct to performance in the applied domain. Some of the construct's nomological relations thus become criterial when made specific to the applied setting. The empirical verification of this rational hypothesis contributes to the construct validity of both the measure and the criterion, and the utility of the applied relation supports the practicality of the proposed use. Thus, the evidential basis of test use is also construct validity, but elaborated to determine the relevance of the construct to the applied purpose and the utility of the measure in the applied setting.

In all of this discussion I have tried to avoid the language of necessary and sufficient requirements, because such language seemed simplistic for a complex and holistic concept like test validity. On the one hand, construct validation is a continuous, never-ending process developing an ever-expanding mosaic of research evidence. At any point new evidence may dictate a change in construct, theory, or measurement, so that in the long run it is difficult to claim sufficiency for any piece. On the other hand, given that the mosaic of evidence is reasonably dense, it is difficult to claim that any piece is necessary, even, as we have seen, empirical evidence for criterion-related predictive relationships in specific applied settings, provided, of course, that other evidence consistently supports a compelling rationale for the application.

Since the evidence in these evidential bases derives from empirical studies evaluating hypotheses about relationships or about the structure of sets of relationships, we must also be concerned about the quality of those studies themselves and about the extent to which the research conclusions are tenable or are threatened by plausible counterhypotheses to explain the results (Guion, 1980). Four classes of threats to the tenability and generalizability of research conclusions are discussed by Cook and Campbell (1979), with primary reference to quasi-experimental and experimental research but also relevant to nonexperimental correlational studies. These four classes deal, respectively, with the questions of (a) whether a relationship exists between two variables, an issue called "statistical conclusion validity"; (b) whether the relationship is plausibly causal from one variable to the other, called "internal validity"; (c) what interpretive constructs underlie the relationship, called "construct validity"; and (d) the extent to which the interpreted relationship generalizes to and across other population groups, settings, and times, called "external validity."

I will not discuss here the first question raised by Cook and Campbell except simply to affirm that the tenability of statistical conclusions about the
Value issues have long been recognized in connection with test use. We have seen that one of the
key questions to be posed whenever a test is suggested for a specific purpose is "Should it be used
for that purpose?" Answers to that question
require an evaluation of the potential consequences
of the testing in terms of social values, but that is
no trivial enterprise. There is no guarantee that
at any point in time we will identify all of the
critical possibilities, especially those unintended
side effects that are distal to the manifest testing
aims.
There are few prescriptions for how to proceed
here, but one recommendation is to contrast the
potential social consequences of the proposed testing with the potential social consequences of alternative procedures and even of procedures antagonistic to testing. This pitting of the proposed test
use against alternative proposals is an instance of
what Churchman (1971) has called Kantian
inquiry; the pitting against antithetical counterproposals is called Hegelian inquiry. The intent
of these strategies is to draw attention to vulnerabilities in the proposed use and to expose its tacit
value assumptions to open debate. In the context
of testing, a particularly powerful and general form
of counterproposal is to weigh the potential social
consequences of the proposed test use against the
potential social consequences of not testing at all
(Ebel, 1964).
The role of values in test use has been intensively
examined in certain selection applications, namely
in those where different population groups display
significantly different means on predictors, or
criteria, or both. Since fair test use implies that
selection decisions will be equally appropriate
Figure 1. Facets of test validity.

                        Test Interpretation    Test Use
Evidential Basis        Construct Validity     Construct Validity + Relevance/Utility
Consequential Basis     Value Implications     Social Consequences
Validity as Evaluation of Evidence and Consequence
Test validity is thus an overall evaluative judgment
of the adequacy and appropriateness of inferences
drawn from test scores. This evaluation rests on
four bases: (1) an inductive summary of convergent
and discriminant research evidence that the test
scores are interpretable in terms of a particular
construct meaning, (2) an appraisal of the value
implications of that interpretation, (3) a rationale
and evidence for the relevance of the construct and
the utility of the scores in particular applications,
and (4) an appraisal of the potential social consequences of the proposed use and of the actual
consequences when used.
Putting these bases together, we can see test
validity to have two interconnected facets linking
the source of justification (either evidential or consequential) to the function or outcome of the testing (either interpretation or use). This crossing of basis and function is portrayed in Figure 1.
The interactions among these aspects are more
dynamic in practice, however, than is implied by
a fourfold classification. In an attempt to represent the interdependence and feedback among
the components, a flow diagram is presented in
Figure 2. The double arrows linking construct
validity and test interpretation in the diagram are
meant to imply a continuous process that starts
sometimes with a construct in search of proper
measurement and sometimes with an existing test
in search of proper meaning.
The model also includes a pragmatic component
for the evaluation of actual consequences of test
practice, pragmatic in the sense that this component is oriented, like pragmatic philosophy,
[Figure 2 fragment: flow diagram of the feedback model; only a partial label ("Implications for Test Interpretation") is legible.]
(Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McNally, 1976.
Snow, R. E. Representative and quasi-representative designs for research on teaching. Review of Educational Research, 1974, 44, 265-291.
Tenopyr, M. L. Content-construct confusion. Personnel
Psychology, 1977, 30, 47-54.
Thorndike, R. L. Personnel selection: Test and measurement techniques. New York: Wiley, 1949.
Thorndike, R. L. Concepts of culture-fairness. Journal of
Educational Measurement, 1971, 8, 63-70.
Vickers, G. The art of judgment. New York: Basic Books,
1965.
Vickers, G. Value systems and social process. Harmondsworth, Middlesex, England: Penguin Books, 1970.
Wallach, M. A. Psychology of talent and graduate education. In S. Messick (Ed.), Individuality in learning.
San Francisco: Jossey-Bass, 1976.
Webb, E. J., Campbell, D. T., Schwartz, R. D., & Sechrest, L. Unobtrusive measures: Nonreactive research in the social sciences. Chicago: Rand McNally, 1966.
Wernimont, P. F., & Campbell, J. P. Signs, samples, and criteria. Journal of Applied Psychology, 1968, 52, 372-376.