SAMUEL MESSICK
Fifteen years ago or so, in papers dealing with personality measurement and the ethics of assessment,
I drew a straightforward but deceptively simple
distinction between the psychometric adequacy of a
test and the appropriateness of its use (Messick,
1964, 1965). I argued that not only should tests be
evaluated in terms of their measurement properties
but that testing applications should be evaluated in
terms of their potential social consequences. I urged
that two questions be explicitly addressed whenever
a test is proposed for a specific purpose: First, is the
test any good as a measure of the characteristics it
is interpreted to assess? Second, should the test be
used for the proposed purpose in the proposed way?
The first question is a scientific and technical one
and may be answered by appraising evidence for the
test's psychometric properties, especially construct
validity. The second question is an ethical one, and
its answer requires a justification of the proposed
use in terms of social values. Good answers to the
first question are not satisfactory answers to the
second. Justification of test use by an appeal to
empirical validity is not enough; the potential social
consequences of the testing should also be appraised,
consistency from one time to another, will be predictive of performance in those domains (Wernimont & Campbell, 1968). A key value question is
whether such "persistence forecasting," as Wallach
(1976) calls it, is desirable in a particular domain
of application. In higher education, for example,
the appropriate model might not be persistence but
development and change, which suggests that in
such instances we be wary of selection procedures
that restrict individual opportunity on the basis of
behavior to date (Hudson, 1976).
The distinction stressed thus far between the
adequacy of a test as a measure of the characteristic it is interpreted to assess and the appropriateness of its use in specific applications underscores
in the first instance the evidential basis of test
interpretation, especially the need for construct
validity evidence, and in the second instance the
consequential basis of test use, through appraisal
of potential social consequences. In developing
this distinction in prior work I emphasized the
importance of construct validity for test use as
well, arguing "that even for purposes of applied
decision making reliance upon criterion validity or
content coverage is not enough," that the meaning
of the measure must also be comprehended in order
to appraise potential social consequences sensibly
(Messick, 1975, p. 956). The present article
extends this argument for the importance of construct validity in test use still further by stressing
its role in providing a "rational foundation for predictive validity" (Guion, 1976b). After thus
elaborating the evidential basis of test use, I consider the value implications of test interpretations
per se, especially those that bear evaluative and
ideological overtones going beyond intended meanings and supporting evidence; the circle is thereby
completed with an examination of the consequential
basis of test interpretation. Finally, the dynamic
interplay between test interpretation and its value
implications, on the one hand, and test use and its
social consequences, on the other, is sketched in a
feedback model that incorporates a pragmatic
component for the empirical evaluation of testing
consequences.
inferences from test scores or other forms of assessment. . . . It is important to note that validity is
itself inferred, not measured. . . . It is, therefore,
something that is judged as adequate, or marginal,
or unsatisfactory" (p. 25). This document also
points out that the many forms of validity questions fall into two broad classes, those dealing with
inferences about what is being measured by the
test and those inquiring into the usefulness of the
measurement as a predictor of other variables.
Furthermore, there are a variety of validation
methods available, but they all entail in principle
a clear designation of what is to be inferred from
the scores and the presentation of data to support
such inferences.
Unfortunately, after this splendid beginning, this and other official documents, namely the Division of Industrial and Organizational Psychology's (1975) Principles for the Validation and Use of Personnel Selection Procedures and the Equal Employment Opportunity Commission et al.'s (1978) "Uniform Guidelines on Employee Selection Procedures," proceed, as Dunnette and Borman (1979) lament, to "perpetuate a conceptual compartmentalization of 'types' of validity: criterion-related, content, and construct. . . . the implication that validities come in different types leads to confusion and, in the face of confusion, over-simplification" (p. 483). One consequence
of this simplism is that many test users focus on
one or another of the types of validity, as though
any one would do, rather than on the specific inferences they intend to make from the scores. There
is an implication that once evidence of one type of
validity is forthcoming, one is relieved of responsibility for further inquiry. Indeed, the "Uniform
Guidelines" seem to treat the three types of validity, in Guion's (1980) words, "as something of a
Holy Trinity representing three different roads to
psychometric salvation. If you can't demonstrate
one kind of validity, you've got two more chances!"
(p. 4).
Different kinds of inferences from test scores
require different kinds of evidence, not different
kinds of validity. By "evidence" I mean both
data, or facts, and the rationale or arguments that
cement those facts into a justification of test-score
inferences. "Another way to put this is to note
that data are not information; information is that
which results from the interpretation of data"
(Mitroff & Sagasti, 1973, p. 123). Or as Kaplan
(1964) states, "What serves as evidence is the result of a process of interpretation; facts do not
[Table 1 fragment: only the column heading "Descriptive designation" and the entry "Content validity" are legible.]
such as the ideologies of social science or of education or of social justice, and hence go beyond construct meaning per se.
INTERPRETIVE MEANINGFULNESS
Construct validation is a process of marshaling evidence to support the inference that an observed
response consistency in test performance has a
particular meaning, primarily by appraising the
extent to which empirical relationships with other
measures, or the lack thereof, are consistent with
that meaning. These empirical relationships may
be assessed in a variety of ways, for example, by
gauging the degree of consistency in correlational
patterns and factor structures, in group differences,
response processes, and changes over time, or in
responsiveness to experimental treatments. The
process attempts to link the reliable response consistencies summarized by test scores to nontest
behavioral consistencies reflective of a presumably
common underlying construct, usually an attribute
or process or trait that is itself embedded in a more
comprehensive network of theoretical propositions
or laws called a nomological network (Feigl, 1956;
Hempel, 1970; Margenau, 1950). An empirically
grounded pattern of such links provides an evidential basis for interpreting the test scores in
construct or process terms, as well as a rational
basis for inferring testable implications of the
scores from the broader theoretical network of the
construct's meaning (Cronbach & Meehl, 1955;
Messick, 1975). Constructs are thus chosen or
created "to organize experience into general lawlike statements" (Gronbach, 1971, p. 462).
Construct validation entails both confirmatory
and disconfirmatory strategies, one to provide
convergent evidence that the measure in question
is coherently related to other measures of the same
construct as well as to other variables that it should
relate to on theoretical grounds, and the other to
provide discriminant evidence that the measure is
not related unduly to exemplars of other distinct
constructs (D. T. Campbell & Fiske, 1959). Discriminant evidence is particularly critical for discounting plausible counterhypotheses to the construct interpretation (Popper, 1959), especially
those pointing to the possibility that the observed
consistencies might instead be attributable to
shared method constraints, response sets, or other
contaminants.
Construct validity emphasizes two intertwined
sets of relationships for the test: one between the
test and different methods for measuring the same
construct or trait, and the other between measures
of the focal construct and exemplars of different
constructs predicted to be variously related to it
on theoretical grounds. Theoretically relevant
empirical consistencies in the first set, indicating a
correspondence between measures of the same construct, have been called trait validity, and those in
the second set, indicating a lawful relatedness between measures of different constructs, have been
called nomological validity (D. T. Campbell, 1960;
Cronbach & Meehl, 1955). In order to discount
competing hypotheses involving alternative constructs or method contaminants, the two sets are
often analyzed simultaneously in a multitrait-multimethod strategy that employs multiple methods for assessing each of two or more different
constructs (D. T. Campbell & Fiske, 1959). Such
an approach highlights the need for both convergent
and discriminant evidence in both trait and nomological validity.
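The Campbell and Fiske logic can be rendered as a minimal sketch, again in Python with simulated scores for two hypothetical traits each measured by two hypothetical methods. The comparison rules below are a simplified reading of their informal criteria, not a complete implementation.

```python
import numpy as np

# Sketch of the multitrait-multimethod logic (Campbell & Fiske, 1959):
# two hypothetical traits (T1, T2), each measured by two methods (M1, M2),
# with simulated data and a deliberate method contaminant shared by M1.
rng = np.random.default_rng(1)
n = 300
t1, t2 = rng.normal(size=n), rng.normal(size=n)
method_bias = rng.normal(scale=0.4, size=n)  # shared method variance in M1

scores = {
    "T1_M1": t1 + method_bias + rng.normal(scale=0.5, size=n),
    "T1_M2": t1 + rng.normal(scale=0.5, size=n),
    "T2_M1": t2 + method_bias + rng.normal(scale=0.5, size=n),
    "T2_M2": t2 + rng.normal(scale=0.5, size=n),
}

def r(a, b):
    return np.corrcoef(scores[a], scores[b])[0, 1]

# Convergent evidence: same trait, different methods (validity diagonal).
convergent = [r("T1_M1", "T1_M2"), r("T2_M1", "T2_M2")]
# Discriminant evidence: different traits, with and without a shared method.
heterotrait_monomethod = [r("T1_M1", "T2_M1"), r("T1_M2", "T2_M2")]
heterotrait_heteromethod = [r("T1_M1", "T2_M2"), r("T1_M2", "T2_M1")]

print("convergent (same trait, different method):", np.round(convergent, 2))
print("heterotrait-monomethod:", np.round(heterotrait_monomethod, 2))
print("heterotrait-heteromethod:", np.round(heterotrait_heteromethod, 2))
# Informal criterion: validity-diagonal correlations should exceed the
# heterotrait correlations; an elevated heterotrait-monomethod value
# relative to heterotrait-heteromethod flags shared method variance.
print("convergent > discriminant:",
      min(convergent) > max(heterotrait_monomethod + heterotrait_heteromethod))
```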
Trait validity deals with the fit between measurement operations and conceptual definitions of the
construct, and nomological validity deals with the fit
between obtained data patterns and theoretical predictions about those patterns (Cook & Campbell, 1979).
So-called "criterion-related validity" is usually considered to comprise two types, concurrent validity
and predictive validity, which differ respectively in
terms of whether the test and criterion data were
collected at the same time or at different times. A
validity. Consensual judgments about the relevance of the test domain as defined to a particular behavioral domain of interest (as, for example,
when choosing a standardized achievement test to
evaluate a new curriculum), along with judgments
of the adequacy of content coverage in the test,
are the kinds of evidence usually offered for content
validity. But note that this is not evidence in
support of inferences from test scores, although it
might influence the nature of those inferences.
This attempt to define content validity as separate from construct validity produces a dysfunctional strain to avoid constructs, as if shunning
them in test development somehow lessens the
import "of response processes in test performance.
The important sampling consideration in test construction is not representativeness of the surface
content of tasks but representativeness of the processes employed by subjects in arriving at a
response (Lennon, 1956). This puts content
validity squarely in the realm of construct validity
(Messick, 1975). Rather than strain after nebulous distinctions, we should inquire how content
considerations contribute to construct validity and
how to strengthen that contribution (Tenopyr,
1977).
Loevinger (1957) incorporated content as an
important feature of construct validity by considering content representativeness and response consistency jointly. What she called "substantive
validity" is "the extent to which the content of the
items included in (and excluded from?) the test
can be accounted for in terms of the trait believed
to be measured and the context of measurement"
(Loevinger, 1957, p. 661). This notion was introduced "because of the conviction that considerations of content alone are not sufficient to establish
validity even when the test content resembles the
trait, and considerations of content cannot be
excluded when the test content least resembles the
trait" (Loevinger, 1957, p. 657). The elimination
of certain items from the test because of poor
empirical response properties may sometimes distort
the test's representativeness in covering the construct domain as originally conceived, but it is
justified if the resulting test thereby becomes a
better exemplar of the construct as empirically
grounded (Loevinger, 1957; Messick, 1975).
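The trade-off Loevinger describes can be sketched minimally, assuming simulated item responses and an illustrative corrected item-total cutoff of .20 (neither the data nor the cutoff comes from the text): empirically weak items are pruned, which sharpens the scale but can narrow its coverage of the construct domain as originally conceived.

```python
import numpy as np

# Sketch of item elimination on empirical grounds: items with low corrected
# item-total correlations are dropped. Data and the 0.20 cutoff are
# illustrative, not prescriptive.
rng = np.random.default_rng(2)
n_persons, n_items = 500, 10
ability = rng.normal(size=n_persons)

# Items 0-7 reflect the construct strongly; items 8-9 are mostly noise.
loadings = np.array([0.7] * 8 + [0.1] * 2)
responses = ability[:, None] * loadings + rng.normal(size=(n_persons, n_items))

total = responses.sum(axis=1)
keep = []
for j in range(n_items):
    rest = total - responses[:, j]  # corrected total: exclude the item itself
    r_it = np.corrcoef(responses[:, j], rest)[0, 1]
    print(f"item {j}: corrected item-total r = {r_it:.2f}")
    if r_it >= 0.20:
        keep.append(j)

print("retained items:", keep)
# If the dropped items had carried distinct, theoretically important content,
# this purely empirical pruning would distort the test's representation of
# the construct domain, which is the tension Loevinger identifies.
```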
Content validity has little to say about the scoring of content samples, and as a result scoring procedures are typically ad hoc (Guion, 1978b). Scoring models in the construct framework, in contrast,
logically parallel the structural relations inherent in
The issue of generalizability just broached for content sampling permeates all of validity. Several aspects of generalizability of special concern have been given distinctive labels, but unfortunately these labels once again invoke the sobriquet validity. The extent to which a measure's empirical relations and construct interpretation generalize to other population groups is called "population validity" (Shulman, 1970); to other situations or settings, "ecological validity" (Bracht & Glass, 1968; Snow, 1974); to other times, "temporal validity" (Messick & Barrows, 1972); and to other tasks representative of the operations called for in the particular domain of interest, "task validity" (Shulman, 1970).

The label validity is especially unsuitable for these important facets of generalizability, for such usage might be taken to imply that the more generalizable a measure is, the more valid. This is not always the case, however, as in the measurement of such constructs as mood, which fluctuates over time, or concrete operations, which typify a particular developmental stage, or administrative role, which operates in special organizational settings, or delusions, which are limited to specific psychotic groups. Rather, the appropriate degree of generalizability for a measure depends upon the nature of the construct assessed and the scope of its theoretical applicability. A closely related issue of "referent generality" (Coan, 1964; Snow, 1974), called "referent validity" by Cook and Campbell (1979), concerns the extent to which research evidence supports a measure's range of reference and the multiplicity of its referent terms. This concept points to the need to tailor the level of construct interpretation to the limits of the evidence and to avoid both oversimplification and overgeneralization in the connotation of construct labels. Nonetheless, constructs refer not only to available evidence but to potential evidence, so that the choice of construct labels is influenced by theory as well as by evidence and, as we shall see, by ideologies about the nature of humanity and society which add value implications that go beyond evidential validity per se.

EVIDENTIAL BASIS OF TEST INTERPRETATION AND USE

To recapitulate thus far, construct validity is the evidential basis of test interpretation. It entails both convergent and discriminant evidence documenting theoretically relevant empirical relationships (a) between the test and different methods for measuring the same construct, as well as (b) between measures of the construct and exemplars of different constructs predicted to be related nomologically. For test use, the relevance of the construct for the applied purpose is determined, in addition, by developing rational hypotheses relating the construct to performance in the applied domain. Some of the construct's nomological relations thus become criterial when made specific to the applied setting. The empirical verification of this rational hypothesis contributes to the construct validity of both the measure and the criterion, and the utility of the applied relation supports the practicality of the proposed use. Thus, the evidential basis of test use is also construct validity, but elaborated to determine the relevance of the construct to the applied purpose and the utility of the measure in the applied setting.

In all of this discussion I have tried to avoid the language of necessary and sufficient requirements, because such language seemed simplistic for a complex and holistic concept like test validity. On the one hand, construct validation is a continuous, never-ending process developing an ever-expanding mosaic of research evidence. At any point new evidence may dictate a change in construct, theory, or measurement, so that in the long run it is difficult to claim sufficiency for any piece. On the other hand, given that the mosaic of evidence is reasonably dense, it is difficult to claim that any piece is necessary, even, as we have seen, empirical evidence for criterion-related predictive relationships in specific applied settings, provided, of course, that other evidence consistently supports a compelling rationale for the application.

Since the evidence in these evidential bases derives from empirical studies evaluating hypotheses about relationships or about the structure of sets of relationships, we must also be concerned about the quality of those studies themselves and about the extent to which the research conclusions are tenable or are threatened by plausible counterhypotheses to explain the results (Guion, 1980). Four classes of threats to the tenability and generalizability of research conclusions are discussed by Cook and Campbell (1979), with primary reference to quasi-experimental and experimental research but also relevant to nonexperimental correlational studies. These four classes deal, respectively, with the questions of (a) whether a relationship exists between two variables, an issue called "statistical conclusion validity"; (b) whether the relationship is plausibly causal from one variable to the other, called "internal validity"; (c) what interpretive constructs underlie the relationship, called "construct validity"; and (d) the extent to which the interpreted relationship generalizes to and across other population groups, settings, and times, called "external validity."

I will not discuss here the first question raised by Cook and Campbell except simply to affirm that the tenability of statistical conclusions about the
Value issues have long been recognized in connection with test use. We have seen that one of the
key questions to be posed whenever a test is suggested for a specific purpose is "Should it be used
for that purpose?" Answers to that question
require an evaluation of the potential consequences
of the testing in terms of social values, but that is
no trivial enterprise. There is no guarantee that
at any point in time we will identify all of the
critical possibilities, especially those unintended
side effects that are distal to the manifest testing
aims.
There are few prescriptions for how to proceed
here, but one recommendation is to contrast the
potential social consequences of the proposed testing with the potential social consequences of alternative procedures and even of procedures antagonistic to testing. This pitting of the proposed test
use against alternative proposals is an instance of
what Churchman (1971) has called Kantian
inquiry; the pitting against antithetical counterproposals is called Hegelian inquiry. The intent
of these strategies is to draw attention to vulnerabilities in the proposed use and to expose its tacit
value assumptions to open debate. In the context
of testing, a particularly powerful and general form
of counterproposal is to weigh the potential social
consequences of the proposed test use against the
potential social consequences of not testing at all
(Ebel, 1964).
The role of values in test use has been intensively
examined in certain selection applications, namely
in those where different population groups display
significantly different means on predictors, or
criteria, or both. Since fair test use implies that
selection decisions will be equally appropriate
Figure 1. Facets of test validity.

                        Test Interpretation    Test Use
Evidential Basis        Construct Validity     Construct Validity + Relevance/Utility
Consequential Basis     Value Implications     Social Consequences
Validity as Evaluation of Evidence and Consequence
Test validity is thus an overall evaluative judgment
of the adequacy and appropriateness of inferences
drawn from test scores. This evaluation rests on
four bases: (1) an inductive summary of convergent
and discriminant research evidence that the test
scores are interpretable in terms of a particular
construct meaning, (2) an appraisal of the value
implications of that interpretation, (3) a rationale
and evidence for the relevance of the construct and
the utility of the scores in particular applications,
and (4) an appraisal of the potential social consequences of the proposed use and of the actual
consequences when used.
Putting these bases together, we can see test
validity to have two interconnected facets linking
the source of justification (either evidential or consequential) to the function or outcome of the testing (either interpretation or use). This crossing of basis and function is portrayed in Figure 1.
The interactions among these aspects are more
dynamic in practice, however, than is implied by
a fourfold classification. In an attempt to represent the interdependence and feedback among
the components, a flow diagram is presented in
Figure 2. The double arrows linking construct
validity and test interpretation in the diagram are
meant to imply a continuous process that starts
sometimes with a construct in search of proper
measurement and sometimes with an existing test
in search of proper meaning.
The model also includes a pragmatic component
for the evaluation of actual consequences of test
practice, pragmatic in the sense that this component is oriented, like pragmatic philosophy,
[Figure 2 fragment: flow diagram of the feedback model; only a partial label ("Implications for Test Interpretation") is legible.]
(Ed.), Handbook of industrial and organizational psychology. Chicago: Rand McNally, 1976.
Snow, R. E. Representative and quasi-representative designs for research on teaching. Review of Educational Research, 1974, 44, 265-291.
Tenopyr, M. L. Content-construct confusion. Personnel
Psychology, 1977, 30, 47-54.
Thorndike, R. L. Personnel selection: Test and measurement techniques. New York: Wiley, 1949.
Thorndike, R. L. Concepts of culture-fairness. Journal of
Educational Measurement, 1971, 8, 63-70.
Vickers, G. The art of judgment. New York: Basic Books,
1965.
Vickers, G. Value systems and social process. Harmondsworth, Middlesex, England: Penguin Books, 1970.
Wallach, M. A. Psychology of talent and graduate education. In S. Messick (Ed.), Individuality in learning.
San Francisco: Jossey-Bass, 1976.
Webb, E. J., Campbell, D. T., Schwartz, R. D., & Sechrest, L. Unobtrusive measures: Nonreactive research in the social sciences. Chicago: Rand McNally, 1966.
Wernimont, P. F., & Campbell, J. P. Signs, samples, and criteria. Journal of Applied Psychology, 1968, 52, 372-376.