CHAPTER 4
VALIDITY
In selecting or constructing an evaluation instrument the most important question is: To
what extent will the results serve the particular uses for which they are intended? This is
the essence of validity.
Many aspects of pupil behavior are evaluated in the school, and the results are expected to
serve a variety of uses. For example, achievement may be evaluated in order to diagnose
learning difficulties or to determine progress toward instructional objectives; scholastic
aptitude may be measured in order to predict success in future learning activities or to group
pupils for instructional purposes; and appraisals of personal-social development may be
obtained in order to better understand pupils or to screen them for referral to a guidance
counselor. Regardless of the area of behavior being evaluated, however, or the use to be
made of the results, all of the various procedures used in an evaluation program should
possess certain common characteristics. The most essential of these characteristics can be
classified under the headings of validity, reliability, and usability.
Validity refers to the extent to which the results of an evaluation procedure serve the
particular uses for which they are intended. If the results are to be used to describe pupil
achievement, we should like them to represent the specific achievement we wish to
describe, to represent all aspects of the achievement we wish to describe, and to represent
nothing else. Our desires in this regard are similar to those of the defense attorney in the
courtroom who wants the truth, the whole truth, and nothing but the truth. If the results are to be used
to predict pupil success in some future activity, we should like them to provide as accurate
an estimate of future success as possible. Basically, then, validity is always concerned with
the specific use to be made of evaluation results and with the soundness of our proposed
interpretations.
Reliability refers to the consistency of evaluation results. If we obtain quite similar
scores when the same test is administered to the same group on two different occasions, we
can conclude that our results have a high degree of reliability from one occasion to another.
Similarly, if different teachers independently rate the same pupils on the same instrument
and obtain similar ratings, we can conclude that the results have a high degree of reliability
from one rater to another. As with validity, reliability is intimately related to the type of
interpretation to be made. For some uses, we may be interested in asking how reliable our
evaluation results are over a given period of time, and for others, how reliable they are over
samples of the same behavior. In all instances in which reliability is being determined,
however, we are concerned with the consistency of the results, rather than with the extent to
which they serve the specific use under consideration.
Although reliability is a highly desired quality, it should be noted that reliability provides
no assurance that evaluation results will yield the desired information. As with a witness
testifying in a courtroom trial, the fact that he consistently tells the same story does not
guarantee that he is telling the truth. The truthfulness of his statements can be determined
only by comparing them with some other evidence. Similarly, with evaluation results
consistency is an important quality but only if it is accompanied by evidence of validity, and
that must be determined independently. Little is accomplished if evaluation results
consistently provide the wrong information. In short, reliability is a necessary but not a
sufficient condition for validity.
In addition to providing results which possess a satisfactory degree of validity and
reliability, an evaluation procedure must meet certain practical requirements. It should be
economical from the viewpoint of both time and money, it should be easily administered
and scored, and it should provide results that can be accurately interpreted and applied by
the school personnel available. These practical aspects of an evaluation procedure can all be
included under the heading of usability. The term usability, then, refers only to the
practicality of the procedure and implies nothing about the other qualities present.
In this chapter we shall consider the validity of evaluation results, and in the following
chapter we shall turn our attention to reliability and usability.
NATURE OF VALIDITY
When using the term validity, in relation to testing and evaluation, there are a number of
cautions to be borne in mind.
1. Validity pertains to the results of a test, or evaluation instrument, and not to the
instrument itself. We sometimes speak of the validity of a test for the sake of
convenience, but it is more appropriate to speak of the validity of the test results, or more
specifically, of the validity of the interpretation to be made from the results.
2. Validity is a matter of degree. It does not exist on an all-or-none basis. Consequently, we
should avoid thinking of evaluation results as valid or invalid. Validity is best considered
in terms of categories that specify degree, such as high validity, moderate validity, and
low validity.
3. Validity is always specific to some particular use. It should never be considered a general
quality. For example, the results of an arithmetic test may have a high degree of validity
for indicating computational skill, a low degree of validity for indicating arithmetical
reasoning, a moderate degree of validity for predicting success in future mathematics
courses, and no validity for predicting success in art or music. Thus, when appraising or
describing validity, it is necessary to consider the use to be made of the results.
Evaluation results are never just valid; they have a different degree of validity for each
particular interpretation to be made.
TYPES OF VALIDITY
Three basic types of validity have been identified and are now commonly used in
educational and psychological measurement.1 They are: content validity, criterion-related
validity, and construct validity. The general meaning of these types of validity is indicated
in Table 4.1. Each type will be explained more fully as the chapter proceeds. For the sake of
clarity, the discussion will be limited to validity as it relates to testing procedures. It should
be recognized, however, that these three types of validity are also applicable to all of the
various kinds of evaluation instruments used in the school.
Content Validity
The content of a course or curriculum may be broadly defined to include both subject-matter content and instructional objectives. The former is concerned with the topics, or
subject-matter areas, to be covered, and the latter with the behavioral changes sought in
pupils. Both of these aspects of content are of concern in determining content validity. We
should like any achievement test we construct, or select, to provide results which are
representative of the topics and behaviors we wish to measure. This is the essence of
content validity. More formally, content validity may be defined as the extent to which a test
measures a representative sample of the subject-matter content and the behavioral changes
under consideration.
TABLE 4.1
THREE TYPES OF VALIDITY

Content Validity: How well the test measures the subject-matter content and behaviors under consideration.

Criterion-Related Validity: How well test performance predicts future performance or estimates current performance on some valued measure other than the test itself.

Construct Validity: How well test performance can be described psychologically.
TABLE SHOWING THE RELATIVE EMPHASIS TO BE GIVEN TO THE VARIOUS SUBJECT-MATTER AREAS AND TO THE CHANGES IN BEHAVIOR FOR A TEST IN ELEMENTARY SCHOOL SCIENCE

Subject-matter Areas: Plants, Animals, Weather, Earth, Sky (Total = 100 per cent, divided equally between "understanding of concepts" and "application of concepts")
The table indicates the percentage of test items that should deal with each subject-matter area (25 per cent, for example, with the sky). If the test is to measure a representative sample
of behavioral changes, 50 per cent of the items should measure the "understanding of
concepts," and 50 per cent should measure the "application of concepts." This, of course,
implies that the specific emphasis on "understanding" and "application" for each subject-matter area will follow that indicated by the percentages in the table of specifications. For
example, 10 per cent of the test items concerned with plants should measure "understanding
of concepts," and 5 per cent of the test items should measure "application of concepts."
It should be noted that this procedure merely provides a rough check on content
validity. Such an analysis reveals the apparent relevance of the test items to the subject-matter areas and behavioral changes to be measured. Content validity is concerned with the
extent to which the test items actually do call forth the responses represented in the table of
specifications. Test items may appear to measure "understanding" but not function as
intended because of defects in the items, unclear directions, inappropriate vocabulary, or
poorly controlled testing conditions. Thus, content validity is dependent on a host of factors
other than the apparent relevance of the test items. Most of what is written in this book
concerning the construction and selection of achievement tests is directed toward improving
the content validity of the obtained results.
Although our discussion of content validity has been limited to achievement testing,
content validity is also of some concern in the measurement of aptitudes, interests,
attitudes, and personal-social adjustment. For example, if we are selecting an interest
inventory we should like it to cover those aspects of interest with which we are concerned.
Similarly, an attitude scale should include those attitudinal topics that are in accord with the
objectives we wish to measure. The procedure here is essentially the same as that in
achievement testing. It is a matter of analyzing the test materials and the outcomes to be
measured and judging the degree of correspondence between them.
Criterion-Related Validity
Whenever test scores are to be used to predict future performance or to estimate current
performance on some valued measure other than the test itself, we are concerned with
criterion-related validity. For example, reading readiness test scores might be used to predict
pupils' future achievement in reading, or a test of dictionary skills might be used to estimate
pupils' current skill in the actual use of the dictionary (as determined by observation). In the
first example, we are interested in prediction and thus in the relationship between the two
measures over an extended period of time. This type of validity is called predictive validity.
In the second example, we are interested in estimating present status and thus in the
relationship between the two measures obtained concurrently. A high relationship in this
case would show that the test of dictionary skills is a good indicator of actual skill in use of
the dictionary. This procedure for determining validity is called concurrent validity. In the
new test Standards,1 the designations of predictive validity and concurrent validity have
been subsumed under the more general category of criterion-related validity. This appears to
be a desirable arrangement because the method of determining and expressing validity is the
same in both cases. The major difference resides in the time period between the two
obtained measures.
Criterion-related validity may be defined as the extent to which test performance is
related to some other valued measure of performance. As noted earlier, the second measure
of performance may be obtained at some future date (when we are interested in predicting
future performance), or concurrently (when we are interested in estimating present
performance). First let us examine the use of criterion-related validity from the standpoint
of predicting success in some future activity. Then we shall return to its second use.
Predicting Future Performance. Suppose that Mr. Young, a junior high school teacher,
wants to determine how well scores from a certain scholastic aptitude test predict success in
his seventh-grade arithmetic class. Since the scholastic aptitude test is administered to all
pupils when they enter junior high school, these scores are readily available to Mr. Young.
His biggest problem is deciding on a criterion of successful achievement in arithmetic. For
lack of a better criterion, Mr. Young decides to use a comprehensive departmental
examination that is administered to the various seventh-grade arithmetic sections at the end
of the school year. It is now possible for Mr. Young to determine how well the scholastic
aptitude test scores predict success in his arithmetic class by comparing the pupils'
scholastic aptitude test scores with their scores on the departmental examination. Do those
pupils who have high scholastic aptitude test scores also tend to have high scores on the
departmental examination? Do those who have low scholastic aptitude test scores also tend
to have low scores on the departmental examination? If this is the case, Mr. Young is
inclined to agree that the scholastic aptitude test scores tend to be accurate in predicting
achievement in this arithmetic class. In short, he recognizes that the test results possess
criterion-related validity.
In our illustration, Mr. Young merely inspected the scholastic aptitude test scores and the
achievement test scores to determine the agreement between them. Although this may be a
desirable preliminary step, it is seldom sufficient for indicating criterion-related validity.
The usual procedure is to correlate statistically the two sets of scores and to report the
degree of relationship between them by means of a correlation coefficient. This enables
validity to be presented in precise and universally understood terms. They are, of course,
"universally understood" only by those who understand and can interpret correlation
coefficients. This should pose no great problem, however, since the meaning of a correlation
coefficient can be easily grasped by persons whose computational skill goes no further than
that of simple arithmetic.
Rank-Difference Correlation. To clarify the calculation and interpretation of correlation
coefficients, let's consider the exact scores Mr. Young's pupils received on both the
scholastic aptitude test and the departmental examination in arithmetic. This information is
provided in the first two columns of Table 4.3. By inspecting these two columns of scores,
as Mr. Young did, it is possible to note that high scores in Column 1 tend to go with high
scores in Column 2. This comparison is difficult to make, however, since the sizes of the
test scores in the two columns are different.
TABLE 4.3
SCHOLASTIC APTITUDE TEST SCORES (COLUMN 1), ARITHMETIC EXAMINATION SCORES (COLUMN 2), AND THE CORRESPONDING RANKS (COLUMNS 3 AND 4) FOR MR. YOUNG'S TWENTY PUPILS
ΣD² = 532
The agreement between the two sets of scores can be seen more easily if the test scores are
converted to ranks. This has been done in Columns 3 and 4 of Table 4.3. Note that the pupil
who was first on the aptitude test ranked third on the arithmetic test; the pupil who was
second on the aptitude test ranked fourth on the arithmetic test; the pupil who was third on
the aptitude test ranked sixth on the arithmetic test; and so on. Comparing the rank order of
the pupils on the two tests, as indicated in Columns 3 and 4 of Table 4.3, gives us a fairly
good picture of the relationship between the two sets of scores. From this inspection we
know that pupils who had a high standing on the aptitude test also had a high standing on
the arithmetic test, and pupils who had a low standing on the aptitude test also had a low
standing on the arithmetic test. Our inspection of Columns 3 and 4 also shows us, however,
that the relationship between the pupils' ranks on the two tests is not perfect. There is some
shifting in rank order from one test to another. Our problem now is: How can we express
the degree of relationship between these two sets of ranks in meaningful terms? This is
where the correlation coefficient becomes useful.
The rank-difference correlation is simply a method of expressing the degree of
relationship between two sets of ranks. The steps in determining a rank-difference
correlation coefficient are presented in the following computing guide.2 Mr. Young's data, in
Table 4.3, are used to illustrate the procedure. It will be noted that the Greek letter rho (ρ) is
used to identify a rank-order correlation coefficient. From our computations for Mr. Young's
data we find that ρ = .60. This correlation coefficient is a statistical summary of the degree
of relationship between the two sets of scores in Mr. Young's data. In this particular
instance, it indicates the extent to which the fall aptitude test scores (predictor) are
predictive of the spring arithmetic test scores (criterion). In short, it refers to the criterion-related validity of the aptitude test scores.
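The computing guide itself is not reproduced here, but the rank-difference formula it applies is the standard one: rho = 1 - 6(sum of squared rank differences) / N(N² - 1). The Python sketch below is a minimal illustration of that formula; the check at the end uses only the figures given in the chapter (twenty pupils and a sum of squared rank differences of 532), which reproduce Mr. Young's coefficient of .60.

def ranks(scores):
    # Rank from highest (rank 1) downward; tied scores share the average of their ranks.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def rank_difference_correlation(x, y):
    # rho = 1 - 6 * (sum of squared rank differences) / (N * (N^2 - 1))
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

# Hypothetical scores, purely to show the function in use.
print(rank_difference_correlation([119, 112, 105, 98], [77, 70, 72, 61]))   # 0.8

# Check against the chapter's figures: twenty pupils and a sum of squared rank
# differences of 532 give rho = 1 - (6 * 532) / (20 * 399) = .60.
n, sum_d_squared = 20, 532
print(round(1 - (6 * sum_d_squared) / (n * (n ** 2 - 1)), 2))   # 0.6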
How good is Mr. Young's validity coefficient of .60? Should Mr. Young be happy with
this finding or should he be disappointed? Does this particular aptitude test provide a good
prediction of future performance in arithmetic?
Unfortunately, simple and straightforward answers cannot be given to such questions.
The interpretation of correlation coefficients is dependent upon information from a variety
of sources. First, we know that the following correlation coefficients indicate the extreme
degrees of relationship that it is possible to obtain between variables:
+1.00 = perfect positive relationship
  .00 = no relationship
-1.00 = perfect negative relationship
Since Mr. Young's validity coefficient is .60, we know that the relationship is positive but
somewhat less than perfect. Obviously, the nearer a validity coefficient approaches 1.00 the
happier we are with it because larger validity coefficients indicate greater accuracy in
predicting from one variable to another.
Another way of evaluating Mr. Young's validity coefficient of .60 is to compare it to the
validity coefficients obtained with other methods of predicting performance in arithmetic. If
this validity coefficient is larger than those obtained with other prediction procedures, Mr.
Young will continue to use the scholastic aptitude test as the best means available to him
for predicting the arithmetic performance of his pupils. Thus, validity coefficients are large
or small only in relation to each other. Where criterion-related validity is an important
consideration, we shall always consider more favorable the test with the largest validity
coefficient. In this regard, even aptitude tests with rather low validity may be useful, however, if they are the best predictors available, and the predictions they provide are better
than chance.
Probably the easiest way of grasping the practical meaning of a correlation coefficient is
to note how the accuracy of prediction increases as the correlation coefficient becomes
larger. This is shown in the various charts presented in Table 4.4. The rows in each chart
represent the fourths of a group on some predictor (such as a scholastic aptitude test) and
the columns indicate the percentage of persons falling in each fourth on the criterion
measure (such as an achievement test). First note that for a correlation coefficient of .00,
being in the top quarter on the predictor provides no basis for predicting where a person
might fall on the criterion measure. His chances of falling in each quarter are equally good.
Now turn to the chart for a correlation coefficient of .60. Note, here, that if a person falls in
the top quarter on the predictor, he has 54 chances out of 100 of falling in the top quarter
on the criterion measure, 28 chances out of 100 of falling in the second quarter, 14 chances
out of 100 of falling in the third quarter, and only 4 chances out of 100 of falling in the
bottom quarter. The remainder of the chart is read in a similar manner.
* Adapted from tables in R. L. Thorndike and E. Hagen, Measurement and Evaluation in Psychology and Education.
By comparing the charts for the different-size correlation coefficients, it is possible to get
some feel for the meaning of correlation coefficient in terms of prediction efficiency. As the
correlation coefficient becomes larger, a person's chances of being in the same quarter on
the criterion measure as he is on the predictor are increased. This can be seen by looking at
the entries in the diagonal cells. With a correlation coefficient of 1.00, each diagonal cell
would, of course, contain 100 per cent of the cases, indicating perfect prediction from one
measure to another.
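Table 4.4 itself is adapted from Thorndike and Hagen and is not reproduced here, but its logic can be approximated by simulation. The sketch below assumes a bivariate normal relationship between predictor and criterion (an assumption of this sketch, not a claim from the text), draws a large sample at a chosen correlation, and tabulates where persons in the top quarter on the predictor fall on the criterion; for a coefficient of .60 the result comes close to the 54-28-14-4 split described above.

import numpy as np

def quarter_chances(r, n=200_000, seed=0):
    # Simulate predictor/criterion pairs with correlation r (bivariate normal assumed),
    # then find where the top quarter on the predictor lands on the criterion.
    rng = np.random.default_rng(seed)
    x, y = rng.multivariate_normal([0, 0], [[1, r], [r, 1]], size=n).T
    x_quarter = np.digitize(x, np.quantile(x, [0.25, 0.5, 0.75]))   # 0 = bottom, 3 = top
    y_quarter = np.digitize(y, np.quantile(y, [0.25, 0.5, 0.75]))
    top_group = y_quarter[x_quarter == 3]
    # Chances out of 100 of landing in the top, second, third, and bottom criterion quarters.
    return [round(100 * float(np.mean(top_group == q))) for q in (3, 2, 1, 0)]

print(quarter_chances(0.60))   # roughly [54, 28, 14, 4], as in the chapter's chart
print(quarter_chances(0.00))   # roughly [25, 25, 25, 25]: no basis for prediction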
Estimating Present Performance. Up to this point we have emphasized the role of
criterion-related validity in predicting future performance. Although this is probably its
major use, there are times when we are interested in the relation of test performance to some
other current measure of performance. In this case, we would obtain both measures at
approximately the same time and correlate the results. This is commonly done when a test is
being considered as a replacement for a more time-consuming method of obtaining
information. For example, Mr. Brown, the biology teacher, wondered if an objective test of
study skills could be used in place of the elaborate observation and rating procedures he was
currently using. He felt that if a test could be substituted for the more complex procedures,
he would have much more time to devote to individual pupils during the supervised study
period. An analysis of the specific pupil behaviors on which he rated the pupils' study skills
indicated that many of the procedures could be stated in the form of objective test questions.
Consequently, he developed an objective test of study skills that he administered to his
pupils. To determine how adequately his test measured study skills he correlated the test
results with his ratings of the pupils' study skills. A resulting correlation coefficient of .75
indicated considerable agreement between the test results and the criterion measure. This
correlation coefficient represents the criterion-related validity of Mr. Brown's test of study
skills.
We might also correlate test performance with some other current measure of
performance to determine if a predictive study is worth doing. For example, if a set of
scholastic aptitude test scores correlated to a sufficiently high degree (e.g., .60) with a set of
achievement test scores obtained at the same time, it would indicate that the scholastic
aptitude test had enough potential as a predictor to make a predictive study worthwhile. On
the other hand, a low correlation would discourage us from carrying out the predictive
study, because we know that the correlation would become still lower when the time period
between measures was extended. Other things being equal, the larger the time span between
two measures the smaller the correlation coefficient.
Expectancy Table. How well a test predicts future performance or estimates current
performance on some criterion measure can also be shown by directly plotting the data in a
twofold chart like the one shown in Figure 4.1. Here, Mr. Young's data (from Table 4.3)
have been tabulated by placing a tally showing each individual's standing on both the fall
aptitude scores and the spring arithmetic scores. For example, John scored 119 on the fall
aptitude test and 77 on the spring arithmetic test, so a tally, representing his performance,
was placed in the upper right-hand cell. The performance of all other pupils on the two tests
was tallied in the same manner. Thus, each tally mark in Figure 4.1 represents how well
each of Mr. Young's twenty pupils performed on the fall and spring tests. The total number
of pupils in each cell, and in each column and row, have also been indicated.
The expectancy grid shown in Figure 4.1 can be used directly as an expectancy table,
simply by using the frequencies in each cell. The interpretation of such information is
simple and direct. For example, of those pupils who scored above average on the fall
aptitude test, none scored below 65 on the spring arithmetic test, 2 out of 5 scored between
65 and 74, and 3 out of 5 scored between 75 and 84. Of those who scored below average on
the fall aptitude test, none scored in the top category on the spring arithmetic test and 4 out
of 5 scored below 65. These interpretations are limited to the group tested but from such
results one might make predictions concerning future pupils. We can say, for example, that
pupils who score above average on the fall aptitude test will probably score above average
on the spring arithmetic test. Other predictions can be made in the same way by noting the
frequencies in each cell of the grid in Figure 4.1.
More commonly, the figures in an expectancy table are expressed in percentages. This is
readily obtained from the grid by converting each cell frequency to a percentage of the total
number of tallies in its row. This has been done for the data in Figure 4.1 and the results are
presented in Table 4.5. The first row of the table shows that of the 5 pupils who scored
above average on the fall aptitude test, 40 per cent (2 pupils) scored between 65 and 74 on
the spring arithmetic test, and 60 per cent (3 pupils) scored between 75 and 84. The
remaining rows are read in a similar manner. The use of percentages makes the figures in
each row and column comparable. Our predictions can then be made in standard terms (that
is, chances out of 100) for all score levels. Our interpretation is apt to be a little clearer
if we say Henry's chances of being in the top group on the criterion measure are 60 out of
100 and Ralph's are only 10 out of 100, than if we say Henry's chances are 3 out of 5 and
Ralph's are 1 out of 10.
TABLE 4.5
EXPECTANCY TABLE SHOWING THE RELATION BETWEEN FALL APTITUDE SCORES AND SPRING ARITHMETIC SCORES
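The conversion from Figure 4.1 to Table 4.5 is simple row arithmetic, sketched below in Python. Only the above-average and below-average rows follow the counts quoted in the text; the middle row is invented for illustration, since the full grid is not reproduced here.

# Row-percentage conversion for an expectancy grid. Criterion categories: below 65,
# 65-74, and 75-84 on the spring arithmetic test. Only the first and last rows follow
# the counts quoted in the text; the middle row is assumed.
grid = {
    "above average": [0, 2, 3],
    "average":       [3, 6, 1],   # assumed
    "below average": [4, 1, 0],
}

for aptitude_level, counts in grid.items():
    total = sum(counts)
    row_percentages = [round(100 * c / total) for c in counts]
    print(f"{aptitude_level:>13}: {row_percentages} per cent of {total} pupils")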
Expectancy tables take many different forms and may be used to show the relation
between various types of measures. The number of categories used with the predictor, or
criterion, may be as few as two or as many as seem desirable. Also, the predictor may be
any set of measures for which we wish to establish criterion-related validity and the
criterion may be course grades, ratings, test scores, or whatever other measure of success is
relevant.
When interpreting expectancy tables based on a small number of cases, like Mr. Young's
class of twenty pupils, our predictions should be regarded as highly tentative. Each
percentage is based on so few pupils that we can expect large fluctuations in these figures
from one group of pupils to another. It is frequently possible to increase the number of
pupils represented in the table by combining test results from several classes. Where this is
done, our percentages are, of course, much more stable, and our predictions can be made
with greater confidence. In any event, expectancy tables provide a simple and direct means
of indicating the validity of test results.
The "Criterion" Problem. In the determination of criterion-related validity, a major
problem is that of obtaining a satisfactory criterion of success. It will be recalled that Mr.
Young used a comprehensive departmental examination as the criterion of success in his
seventh-grade arithmetic class. Mr. Brown used his own ratings of the pupils' study skills. In
each instance the criterion of success was only partially suitable as a basis for test
validation. Mr. Young recognized that the departmental examination did not measure all of
the important learning outcomes that he aimed at in teaching arithmetic. There was not
nearly enough emphasis on arithmetic reasoning; the interpretation of graphs and charts was
sadly neglected; and, of course, the test did not evaluate the pupils' attitudes toward
arithmetic (which Mr. Young considered to be extremely important). Likewise, Mr. Brown
was well aware of the shortcomings of his rating of pupils' study skills. He sensed that some
pupils "put on a show" when they knew they were being observed. In other instances he felt
that some of the pupils were probably overrated on study skills because of their high
achievement in class work. Despite these recognized shortcomings, both Mr. Young and Mr.
Brown found it necessary to use these criterion measures because they were the best
criterion measures available.
The plights of Mr. Young and Mr. Brown in locating a suitable criterion of success for
the purpose of test validation are not unusual. The selection of a satisfactory criterion is one
of the most difficult problems in validating a test. For most educational purposes, no
adequate criterion of success exists. Those which are used tend to be lacking in
comprehensiveness and in most cases provide results that are less stable than those of the
test being validated.
The lack of a suitable criterion for validating achievement tests has important
implications for the classroom teacher. Since statistical types of validity will usually not be
available, teachers will have to depend on procedures of logical analysis to assure test
validity. This means carefully identifying the objectives of instruction, stating these
objectives in terms of specific changes in pupil behavior, and constructing or selecting
evaluation instruments which satisfactorily measure the behavioral changes sought in
pupils. Thus, content validity will assume a role of major importance in the teacher's
evaluation of pupil progress.
Construct Validity
The two types of validity thus far described are both concerned with some specific
practical use of test results. They help us determine how well test scores represent the
achievement of certain learning outcomes (content validity), or how well they predict or
estimate a particular performance (criterion-related validity). In addition to these more
specific and immediately practical uses, we may wish to interpret test scores in terms of
some general psychological quality. For instance, rather than speak about a pupil's score on
a particular arithmetic test, or how well it predicts success in mathematics, we might want
to infer that the pupil possesses a certain degree of reasoning ability. This provides a broad
general description of pupil behavior which has implications for many different uses.
Whenever we wish to interpret test performance in terms of some psychological trait or
quality, we are concerned with construct validity. A construct is a psychological quality
which we assume exists in order to explain some aspect of behavior. Reasoning ability is a
construct. When we interpret test scores as measures of reasoning ability, we are implying
that there is a quality that can be properly called reasoning ability and that it can account to
some degree for performance on the test. Verifying such implications is the task of construct
validation.
Common examples of constructs are intelligence, scientific attitude, critical thinking,
reading comprehension, study skills, and mathematical aptitude. There is an obvious
advantage in being able to interpret test performance in terms of such psychological
constructs. Each construct has an underlying theory which can be brought to bear in
describing and predicting a person's behavior. If we say a person is highly intelligent, for
example, we know what behaviors might be expected of him in various specific situations.
Construct validity may be defined as the extent to which test performance can be
interpreted in terms of certain psychological constructs. The process of determining
construct validity involves the following steps: (1) identifying the constructs presumed to
account for test performance; (2) deriving hypotheses regarding test performance from the
theory underlying the construct; and (3) verifying the hypotheses by logical and empirical
means. For example, let us suppose that we wish to check the claim that a newly
constructed test measures intelligence. From what is known about "intelligence," we might
make the following predictions:
1. The test scores will increase with age (intelligence is assumed to increase with age until
approximately age sixteen).
2. The test scores will predict success in school achievement.
3. The test scores will be positively related to teachers' ratings of intelligence.
4. The test scores will be positively related to scores on other so-called intelligence tests.
5. The test scores will discriminate between groups which are known to differ, such as
"gifted" and "mentally handicapped."
6. The test scores will be little influenced by direct teaching.
Each of these predictions, and others, would then be tested, one by one. If positive results
are obtained for each prediction, the combined evidence lends support to the claim that the
test measures intelligence. If a prediction is not confirmed, say the scores do not increase
with age, we must conclude that either the test is not a valid measure of intelligence, or
there is something wrong with our theory. As Cronbach and Meehl 4 have indicated, with
construct validation both the theory and the test are being validated at the same time.
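As a hedged illustration of step 3, the sketch below uses invented data for a hypothetical new test and checks two of the predictions listed above: that scores rise with age and that they correlate positively with an established intelligence test. Positive coefficients would support, but not by themselves establish, the construct claim.

def pearson_r(x, y):
    # Product-moment correlation computed from raw scores.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# All figures below are invented for illustration only.
new_test = [22, 25, 31, 35, 41, 44, 52, 55]            # scores on the new test
age      = [8, 9, 10, 10, 11, 12, 13, 14]              # pupils' ages (prediction 1)
other_iq = [95, 92, 104, 101, 108, 112, 118, 121]      # established test (prediction 4)

print("correlation with age:            ", round(pearson_r(new_test, age), 2))
print("correlation with the other test: ", round(pearson_r(new_test, other_iq), 2))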
Methods Used in Obtaining Evidence for Construct Validation. As noted in our
illustration, there is no adequate single method of establishing construct validity. It is a
matter of accumulating evidence from many different sources. We may use both content
validity and criterion-related validity as partial evidence to support construct validity, but
neither of them alone is sufficient. Construct validation depends on logical inferences drawn
from a variety of types of data. The following procedures illustrate the broad range of
methods that might be used in obtaining evidence for construct validity:
1. Analysis of the mental process required by the test items. One may analyze the mental
processes involved by examining the test items to determine what factors they appear to
measure and/or by administering the test to individual pupils and having them "think aloud"
as they answer. Thus, examination of a science test may indicate that the test scores are
likely to be influenced by knowledge, comprehension, and quantitative ability. Similarly,
"thinking aloud" on an arithmetic reasoning test may verify that the items call for the
intended reasoning process, or it may reveal that most problems can be solved by a simple
trial-and-error procedure.
The type of validity that is of greatest importance for criterion-referenced mastery tests is content validity. The procedures for obtaining content validity described earlier in this chapter are as applicable here as they are with norm-referenced tests. The fact that criterion-referenced mastery tests are typically confined to a more delimited domain of learning tasks (e.g., unit or chapter) even simplifies the process of defining and selecting a representative sample of tasks. In some cases, the domain of tasks is so limited (e.g., addition of single-digit whole numbers) that a representative sample can be obtained without the use of a table of specifications.
Although content validity is of primary concern with criterion-referenced mastery tests,
we might also be interested in using the test results to make predictions about pupils. We
might, for example, use a criterion-referenced pretest to predict which pupils are likely to
master the material in a unit of instruction, or use an end-of-unit mastery test to determine
which pupils should proceed to the next unit of instruction. Such instructional decisions
require some evidence (criterion-related validity) that our decisions are soundly based. This
evidence can be obtained by means of an expectancy table, like the one shown in Table 4.6.
It will be noted in this table that the majority of pupils with pretest scores of 20 or lower
failed to achieve mastery at the end of the unit. In such a case, a test score of 20 would
provide a good cutoff score for determining which pupils should proceed with the unit and
which should receive remedial help before proceeding. We would, of course, prefer a larger
number of pupils than thirty when selecting such cutoff scores, but this represents a realistic
classroom situation. As noted earlier, it is frequently possible to increase the number of
pupils used in an expectancy table by combining test results from several classes.
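A small sketch of the kind of check involved in choosing such a cutoff score is given below. The pretest scores and mastery outcomes are invented for illustration (the actual Table 4.6 entries are not reproduced here); for each candidate cutoff, the sketch counts how many pupils above it reached mastery and how many below it did not.

# Checking candidate mastery cutoffs against pretest scores. Each pair is
# (pretest score, reached mastery at the end of the unit?); all values are invented.
pupils = [(12, False), (15, False), (18, False), (19, True), (20, False),
          (21, True), (23, True), (24, False), (26, True), (28, True),
          (30, True), (33, True)]

for cutoff in (18, 20, 22):
    above = [mastered for score, mastered in pupils if score > cutoff]
    below = [mastered for score, mastered in pupils if score <= cutoff]
    print(f"cutoff {cutoff}: {sum(above)} of {len(above)} pupils above it reached mastery; "
          f"{len(below) - sum(below)} of {len(below)} below it did not")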
TABLE 4.6
EXPECTANCY TABLE SHOWING THE RELATION BETWEEN PRETEST SCORES AND THE NUMBER OF STUDENTS ATTAINING MASTERY (N = 30)*

There is nothing in the nature of criterion-referenced mastery testing to rule out construct validity. So much of the supporting evidence for construct validity is dependent on correlations and other statistical measures, however, that the construct validity of a criterion-referenced test would, of necessity, be based on rather meager evidence (i.e., only that evidence not dependent on variability among scores).

* From N. E. Gronlund, Preparing Criterion-Referenced Tests for Classroom Instruction. New York: Macmillan,
In evaluating validity coefficients, it is necessary to note the nature of the validation group. How closely it compares in significant characteristics to the group of pupils we wish to test determines how applicable the information is to our particular group.
In evaluating validity coefficients, it is also necessary to consider the nature of the
criterion used. For example, scores on a mathematics aptitude test are likely to provide a
more accurate prediction of achievement in a physics course in which quantitative problems
are stressed than in one where they play only a minor role. Likewise, we can expect scores
on a critical thinking test to correlate more highly with grades in social studies courses
which emphasize critical thinking than in those which depend largely on the memorization
of factual information. Other things being equal, the greater the similarity between the
behaviors measured by the test and the behaviors represented in the criterion, the higher the
validity coefficient.
Since validity information varies with the nature of the group tested and with the
composition of the criterion measures used, published validation data should be considered
as highly tentative. Whenever possible, the validity of the test results should be checked in
the specific local situation.
This discussion of factors influencing the validity of test results should make clear the
pervasive and functional nature of the concept validity. In the final analysis the validity of
test results is based on the extent to which the behavior elicited in the testing situation is a
true representation of the behavior being evaluated. Thus, anything in the construction or
the administration of the test which causes the test results to be unrepresentative of the
characteristics of the person tested contributes to lower validity. In a very real sense, then, it
is the user of the test who must make the final judgment concerning the validity of the test
results. He is the only one who knows how well the test fits his particular use, how well the
testing conditions were controlled, and how typical the responses were to the testing
situation.
A number of factors tend to influence the validity of test results. Some of these influences can be found in the test instrument itself, some in the relation of teaching
to testing, some in the administration and scoring of the test, some in the atypical responses
of pupils to the test situation, and still others in the nature of the group tested and in the
composition of the criterion measures used. A major aim in the construction, selection, and
use of tests, and other evaluation instruments, is to control those factors which have an
adverse effect on validity and to interpret evaluation results in accordance with what
validity information is available.
SUMMARY
The most important quality to consider when selecting or constructing an evaluation instrument
is validity. This refers to the extent to which the evaluation results serve the particular uses for
which they are intended. In interpreting validity information, it is important to keep in mind
that validity refers to the results rather than to the instrument, that its presence is a matter of
degree, and that it is always specific to some particular use.
There are three basic types of validity. Content validity refers to the extent to which a test
measures a representative sample of the subject-matter content and the behavioral changes
under consideration. It is especially important in achievement testing and is determined by
logical analysis of test content. Criterion-related validity is concerned with the extent to which
test performance is accurate in predicting some future performance or estimating some current
performance. This type of validity can be reported by means of a correlation coefficient called a
validity coefficient or by means of an expectancy table. It is of special significance in all types
of aptitude testing, but is pertinent whenever test results are used to make specific predictions,
or whenever a test is being considered as a substitute for a more time-consuming procedure.
Construct validity refers to the extent to which test performance can be interpreted in terms of
certain psychological constructs. The process of construct validation involves identifying and
clarifying the factors which influence test scores so that the test performance can be interpreted
most meaningfully. This involves the accumulation of evidence from a variety of different
studies. Both of the other types of validity may be used as partial support for construct validity,
but it is the combined evidence from all sources that is important. The more complete the
evidence, the more confident we are concerning the psychological qualities measured by the
test.
Because criterion-referenced mastery tests are not designed to discriminate among
individuals, statistical types of validity are inappropriate. For this type of test, we must depend
primarily on content validity. Where the test scores are to be used for prediction (e.g., mastery-nonmastery), an expectancy table can be effectively used.
A number of factors tend to influence the validity of test results. Some of these influences
can be found in the test instrument itself, some in the relation of teaching to testing, some in the
administration and scoring of the test, some in the atypical responses of pupils to the test
situation, and still others in the nature of the group tested and in the composition of the criterion
measures used. A major aim in the construction, selection, and use of tests, and other evaluation
instruments, is to control those factors which have an adverse effect on validity and to interpret
evaluation results in accordance with what validity information is available.