Scoring and Interpretation

Chapter 3

In a norm-referenced instrument such as the PPVT-4 scale, raw scores become more meaningful when they are converted to normative scores or other types of derived scores. Normative scores allow for an individual's performance to be compared with that of a well-defined reference group consisting of a large cross section of people of the same age or in the same grade. In addition to being more interpretable than raw scores, these normative scores can be compared among different tests.

The first portion of this chapter describes the various normative scores available for the PPVT-4 instrument. Next, the chapter explains the procedures for obtaining the various score types. In the next section, the calculation and interpretation of confidence intervals are explained. Next, the chapter discusses the growth scale value (GSV) scale, a nonnormative system that is ideal for measuring change. The chapter concludes with instructions for completing several practice scoring exercises.

Types of Normative Scores

The PPVT-4 instrument has two types of normative scores: deviation and developmental. Standard scores, percentiles, normal curve equivalents (NCEs), and stanines are deviation-type normative scores because they indicate how an examinee's raw score compares with the scores of people of the same age or in the same grade. Age equivalents and grade equivalents are developmental-type normative scores that designate where the examinee's raw score falls on a developmental growth curve.

Deviation-Type Normative Scores

Standard Scores

A standard score indicates the distance of the examinee's raw score from the average for people of the same age or grade, taking into account the range of scores among examinees in that reference group. On the PPVT-4 scale, a standard score of 100 is the average score for the person's age or grade. The standard deviation (SD) of PPVT-4 standard scores is 15. The range of standard scores within 1 SD of the mean (that is, between 85 and 115) includes about 68% of the population, the range of 2 SDs (70 to 130) includes about 95%, and the range of 3 SDs (55 to 145) includes more than 99%. The PPVT-4 standard score scale is the same as the scale used in many other tests, which allows for a direct comparison of PPVT-4 scores with the scores obtained on tests of language, achievement, and ability.

Percentiles

Percentiles (also known as percentile ranks) are commonly reported by examiners because they are readily understood. A percentile indicates the percentage of individuals in the reference group who performed at or below the examinee's raw score. Thus, a percentile of 50 signifies that the examinee's raw score is average for examinees of that age or grade. Although percentiles have a simple, straightforward interpretation, they also have limitations. It is important to ensure that they are not misunderstood as being the percentage of test items answered correctly. Also, percentiles are on an ordinal or rank-order scale of measurement, unlike standard scores, which form an interval scale of measurement. Lacking the property of equal distances between units, percentiles cannot be arithmetically manipulated (e.g., added, subtracted, or averaged) in the way that standard scores can.

Normal Curve Equivalents

NCEs, like standard scores, communicate the distance between the examinee's raw score and the average raw score in the normative reference group. Many state programs use NCEs in reporting test results, because this scale has the convenient property that several NCE values directly relate to percentile units. In particular, NCEs of 1, 50, and 99 correspond to percentiles of 1, 50, and 99, respectively. However, other NCE values do not have direct relationships to percentiles.
Stanines

Stanines are whole-number scores that range from 1 through 9, with a mean of 5 and an SD of 2. Each stanine represents a particular range of percentiles, making stanines useful as cutoff scores and in other applications where a greater level of precision is not needed.

Comparability of Deviation-Type Normative Scores

The distributions of raw scores in the norm sample were adjusted to fit the normal-probability curve. Therefore, all of the deviation-type norms reported for the PPVT-4 instrument carry the same information but present it in varying ways. The graphical display on the record form cover shows the consistent relationship among the various deviation-type normative scores.

Developmental-Type Normative Scores

Age equivalents and grade equivalents are developmental-type normative scores, because they locate an individual's performance along a growth curve across age or grade. In the case of the PPVT-4 scale, the curve is the hearing-vocabulary growth curve. An age equivalent represents the age (in years and months) at which an examinee's raw score is the average score. Likewise, the grade equivalent signifies the grade (in tenths of a grade) at which a given raw score is the average score. Thus, a grade equivalent of 3.0 represents the average raw score obtained by students at the beginning of third grade. Age-equivalent values range from 2:0 (i.e., 2 years 0 months) through 24 (approximately the age at which average test performance plateaus).

An age or grade equivalent does not necessarily mean that the examinee's receptive vocabulary knowledge is qualitatively the same as that of the average person at that age or grade. Because of different life experiences, a person aged 15 with a PPVT-4 age equivalent of 11:6 may tend to know a different set of words than the average 11-year-old. Nevertheless, PPVT-4 age and grade equivalents can be useful for selecting instructional materials or interventions that will be of appropriate difficulty for the individual.

Unlike standard scores, age and grade equivalents are not on an interval scale of measurement, and therefore they should not be added, subtracted, or averaged. Also, because age and grade equivalents are based only on average raw scores at different ages or grades and do not take score variability into account, they can appear to be inconsistent with standard scores and percentiles. When interpreting normative scores, one must keep in mind that developmental-type and deviation-type scores provide fundamentally different types of information.

GSV Scores

The GSV score is useful for measuring change in PPVT-4 performance over time. The GSV is not a normative score, because it does not involve comparison with a norm group. Rather, it is a transformation of the raw score and is superior to raw scores for making statistical comparisons. Methods for using GSV scores to assess growth in vocabulary are discussed later in this chapter and in Appendix G.

Converting a Raw Score to Normative Scores and the GSV

The steps for converting raw scores to derived scores are presented in this section, in the order in which they appear on the record form cover. The norm tables for PPVT-4 Forms A and B are presented in Appendix B. The scores for fictional examinee Noah are used to demonstrate the conversion of a raw score to each type of derived score.

Converting a Raw Score to a Standard Score

Age-based standard scores, which range from 20 to 160, are provided in Table B.1 for individuals aged 2 years 6 months through adulthood. Standard scores for grade (i.e., kindergarten through Grade 12), which also span from 20 to 160, are provided in Table B.2 (Fall, July 1 through December 31) and Table B.3 (Spring, January 1 through June 30). To convert a raw score to a standard score, first mark your choice of norms (Age, Grade: Fall, or Grade: Spring) on the record form cover. Next, consult the applicable table in Appendix B. Locate the section that corresponds to the examinee's chronological age (in years and months) or the examinee's grade (in school years and season). Find the examinee's raw score in the appropriate Raw Score column (A or B). Next, read across this row to the Standard Score column, and locate the corresponding standard score (see Figure 3.1). Transfer this score to the Standard Score box in the Score Summary area on the record form cover.

For example, Noah's raw score of 91 on Form A converts to an age-based standard score of 90 (see Figure 3.2), which is obtained by referring to the section for ages 6:2 through 6:3 in Table B.1.
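The choice of norm table described in this section depends on two simple date rules: the season of testing (Fall is July 1 through December 31, Spring is January 1 through June 30) and the examinee's chronological age in years and months. The following Python sketch illustrates both. The function names are hypothetical, and the age calculation assumes completed calendar months; the manual's own age-calculation procedure is not given in this chapter and takes precedence if it differs.

```python
from datetime import date


def grade_norm_table(test_date: date) -> str:
    """Pick the grade-norm table by season of testing.

    Fall (July 1 - December 31) -> Table B.2
    Spring (January 1 - June 30) -> Table B.3
    """
    return "B.2" if test_date.month >= 7 else "B.3"


def chronological_age(birth: date, test: date) -> tuple:
    """Return age as (years, months), counting completed months only.

    Assumption: a month counts once the day-of-month of the birth date
    has been reached; this may differ from the manual's convention.
    """
    months = (test.year - birth.year) * 12 + (test.month - birth.month)
    if test.day < birth.day:
        months -= 1  # the current month is not yet complete
    return divmod(months, 12)
```

An age of (6, 2) returned by this sketch would point to the section of Table B.1 for ages 6:2 through 6:3, the section used in Noah's example.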
Obtaining a Percentile, NCE, and Stanine

Percentiles, NCEs, and stanines are obtained by converting the standard score, using Table B.4 in Appendix B. This table applies to both age-based and grade-based standard scores. To find the percentile, NCE, and stanine in Table B.4, read down the standard score column, either on the far left or far right. Then read across to locate the corresponding percentile, NCE, and stanine. Record these values in the designated boxes in the Score Summary area on the record form cover. For example, as illustrated in Figure 3.3, Noah's age-based standard score of 90 converts to a percentile of 25, an NCE of 36, and a stanine of 4.

Graphical Profile of Deviation-Type Normative Scores

A Graphical Profile is included on the PPVT-4 record form cover as an aid to interpreting scores and may be used with either age norms or grade norms. To use the profile, mark the examinee's standard score on the Standard Score line. Then draw a straight vertical line through the standard score and across the other scales. This Graphical Profile is used again later. For now, simply verify that the values the drawn line intersects correspond to the percentile, NCE, and stanine values you obtained from the tables in Appendix B.

In Figure 3.4, Noah's age-based standard score of 90 is plotted, and the vertical line is drawn. This line intersects with the percentile, NCE, and stanine values that match those written in the Score Summary area. It is important to note that Noah's score falls at the low end of the average range.

Converting a Raw Score to an Age Equivalent, Grade Equivalent, or GSV

To convert a raw score to an age equivalent, use Table B.5 in Appendix B. Table B.6, also in Appendix B, can be used to convert the raw score to a grade equivalent. In either table, locate the examinee's raw score in the far-left column. Then, read across to the column for the correct form, which shows the age equivalent (in years and months) or grade equivalent (in grade and tenths of a school year) for that raw score on that form. To obtain the GSV, continue to read across (in either table) to the GSV column for the correct form, which is to the right of the Age Equivalent or Grade Equivalent column.

To compare GSVs on the PPVT-III and PPVT-4 instruments, use Table B.7 to convert PPVT-III raw scores to GSVs; for convenience, this table repeats the PPVT-4 GSVs.
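Because the deviation-type scores all derive from the same normal curve, the conversion from a standard score to a percentile, NCE, and stanine can be approximated arithmetically. The sketch below is illustrative only. It assumes the conventional NCE scaling constant of 21.06 and the conventional stanine percentile boundaries (4, 11, 23, 40, 60, 77, 89, 96), neither of which is stated in this chapter, and the function name is hypothetical. The values printed in Table B.4 remain the authority for actual scoring.

```python
from statistics import NormalDist

MEAN, SD = 100.0, 15.0  # PPVT-4 standard score scale (stated in the text)
NCE_SD = 21.06          # assumption: conventional NCE scaling constant
# Assumption: conventional cumulative-percentage boundaries between stanines.
STANINE_CUTS = (4, 11, 23, 40, 60, 77, 89, 96)


def deviation_scores(standard_score):
    """Approximate (percentile, NCE, stanine) for a standard score."""
    z = (standard_score - MEAN) / SD
    pct_exact = 100 * NormalDist().cdf(z)  # area under the normal curve
    percentile = round(pct_exact)
    nce = min(99, max(1, round(50 + NCE_SD * z)))  # NCE scale runs 1 to 99
    stanine = 1 + sum(pct_exact >= cut for cut in STANINE_CUTS)
    return percentile, nce, stanine
```

Under these assumptions, `deviation_scores(90)` gives a percentile of 25, an NCE of 36, and a stanine of 4, which agrees with the values reported for Noah.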
Errors of Measurement and
Confidence Intervals
The scores obtained from any test provide only an
estimate of a person's true ability in the trait or attribute
being measured. The true score cannot be known
because some degree of measurement error is always
present in the obtained score. Measurement errors
occur because all human behavior varies from time to
time and because all tests are imprecise to some degree.
The standard error of measurement (SEM) is the
statistic used to indicate the extent to which error
affects individual test scores. It represents the average
amount by which observed scores differ from true
scores. This statistic is calculated from reliability
coefficients using procedures described in Chapter 5 of
this manual.

To take measurement error into account when
interpreting scores, it is good practice to report a
confidence interval for the obtained standard score. The
confidence interval is a range of scores that has a
specified probability of including the examinee's true
score. For example, there is a 90% probability that a
particular examinee's true score will fall within a 90%
confidence interval.
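As a rough illustration of how an SEM and a confidence interval relate, the following Python sketch derives the SEM from a reliability coefficient and builds a symmetric interval around the obtained score. It assumes the common relationship SEM = SD × sqrt(1 − reliability) and an interval centered on the obtained score; the function names and the reliability value used in the usage note are illustrative. The manual's own confidence-interval values may be constructed differently and should be used for reporting.

```python
from math import sqrt
from statistics import NormalDist


def sem_from_reliability(sd, reliability):
    """Standard error of measurement, assuming SEM = SD * sqrt(1 - r)."""
    return sd * sqrt(1.0 - reliability)


def confidence_interval(score, sem, level=0.90):
    """Symmetric interval around the obtained score (an assumption)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.645 for 90%
    half_width = z * sem
    return round(score - half_width), round(score + half_width)
```

For example, with an SD of 15 and an assumed reliability of .94, the SEM is about 3.7, and a standard score of 90 would have an approximate 90% confidence interval of 84 to 96.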
Using GSV Scores to Measure Vocabulary Growth

The GSV scale was developed so that vocabulary growth could be followed over a period of years on a single continuous scale. Standard scores, percentiles, stanines, and NCEs compare an examinee's vocabulary knowledge with that of a reference group representing all individuals of the same age or grade. In contrast, the GSV measures an examinee's vocabulary with respect to an absolute scale of knowledge. The test performance of any examinee, from a 2-year-old who functions at a low level to an adult with a high level of vocabulary, can be placed on the GSV scale. As an examinee's vocabulary grows, the GSV will increase.

The GSV is an equal-interval scale. Therefore, GSV scores can be added, subtracted, or averaged. GSVs can be compared over time for many purposes, such as to gauge the efficacy of vocabulary improvement programs. Because GSVs from both PPVT-4 forms (A and B) and both PPVT-III forms are all on a common scale, scores across forms and editions may be compared to measure growth. Furthermore, the fact that GSVs can be averaged makes this scale a useful one for tracking the progress of groups.

Standard scores and percentiles are less useful than GSVs for measuring growth, because the reference norm group changes as the examinee moves into a higher age or grade level. If a person's vocabulary increases at the average rate, his or her standard score and percentile would stay the same, whereas the GSV score would increase. See Appendix G for further information about using GSV scores to measure growth.

Interpreting Extremely Low Raw Scores

A raw score of 3 or lower corresponds to a score an examinee would obtain through random responding. Great caution should be taken when reporting or interpreting such scores, because the test administration may not have been a valid measure of the examinee's vocabulary level. The examinee may not have understood the task, or there might have been some other extraneous reason why the examinee was unable to demonstrate his or her knowledge. If you question the validity of the test administration, you may choose to report behavioral observations rather than derived scores.

Qualitative Interpretation of Item Performance

In addition to interpreting derived scores, the user may also perform a qualitative analysis of an examinee's PPVT-4 performance by classifying incorrect responses by part of speech. A worksheet for completing this type of analysis is included as page 7 of each PPVT-4 record form (and in Appendix C, which is a reproducible master). Each PPVT-4 stimulus word is classified as a noun, verb, or attribute (adjective or adverb). An example of a completed Form A worksheet is presented in Figure 3.5.
To complete the analysis, use the following procedures. If you are using a photocopy of Appendix C, write the examinee's name and the test date in the spaces provided. Circle the item numbers of the lowest numbered and highest numbered items administered in order to indicate the range of items given. Tally the number of items administered in each Noun, Verb, and Attribute column, and record these totals in the "# Taken" boxes at the bottom of the columns. Next, tally the number of incorrect items within each Noun, Verb, and Attribute column, and record these totals in the "# Incorrect" boxes at the bottom of the columns. Sum the # Taken values for the Noun columns, then sum the # Taken values for the Verb columns, and then sum the # Taken values for the Attribute columns. Then, transfer these three totals to the respective boxes in the summary table for the PPVT-4 form administered (the table is in the upper right-hand corner). Repeat these steps for the # Incorrect values.

Nouns can usually be learned in a concrete,
straightforward manner, whereas words that denote
actions or that describe attributes of objects or actions
require a more abstract learning process. Therefore, a
comparison of error types by part of speech could
indicate where vocabulary instruction should initially be
focused for the greatest benefit. It might also suggest a
need for a more in-depth assessment of syntactic forms.
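The worksheet tallies (# Taken and # Incorrect for each part of speech) amount to a simple count, sketched here in Python. The function name and the input layout are hypothetical, and the error rate is an added convenience that does not appear on the worksheet itself.

```python
from collections import Counter


def tally_by_part_of_speech(items):
    """Tally administered and incorrect items by part of speech.

    items: iterable of (part_of_speech, is_correct) pairs, one per
    administered item, where part_of_speech is "noun", "verb",
    or "attribute".
    """
    taken, incorrect = Counter(), Counter()
    for part_of_speech, is_correct in items:
        taken[part_of_speech] += 1
        if not is_correct:
            incorrect[part_of_speech] += 1
    return {
        pos: {
            "taken": taken[pos],
            "incorrect": incorrect[pos],
            # Not on the worksheet: proportion of administered items missed.
            "error_rate": incorrect[pos] / taken[pos],
        }
        for pos in taken
    }
```

Comparing the error rates across the three categories gives one way to see whether errors cluster in verbs or attributes rather than nouns.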

Practice Scoring Exercises

The following paragraphs describe exercises for scoring
practice. You will be able to test yourself to see whether
you understand and can apply accurately the scoring
rules presented in this chapter. Figure 3.6 represents a
reproducible worksheet that contains four scoring
boxes from the record form. Complete this worksheet
for each of the five cases presented in Chapter 2 (see
Figures 2.7 through 2.11). It is recommended that you
practice scoring without the aid of reference materials
(other than the norm tables). Refer to page 2 of the
record form when necessary.
These exercises provide practice in the following areas:
• Determining the Basal Set and Ceiling Set and
scoring across the critical range
• Calculating a raw score
• Obtaining and recording normative scores
• Obtaining the confidence interval at the 90% level
Content Development, Tryout, Standardization, and Norms Development

Analyses of Standardization Data
Item analyses were performed on data from the
complete age norm sample, using Rasch techniques to
verify that items were functioning as well as expected
on the basis of national tryout data. The results of the
analyses supported the retention of all items. Four
items, all in the later portions of the test, appeared to
be misordered by difficulty and thus were repositioned
in the item sequence.
The reanalysis of start points by age showed that, for
most ages, the start point could be moved upward by
one item set. With these reset start points,
approximately 85% of examinees established a basal
at their designated starting set.

Analysis of the Equivalence of Forms A and B

As described earlier in this chapter, great care was taken in
assembling the two PPVT-4 forms to make them closely
similar in terms of item content and difficulty. This effort
was made not to produce two forms whose raw scores
would be interchangeable; such an outcome would be
extremely difficult to obtain and would have few practical
benefits, as raw scores are not interpreted directly. Rather,
the forms were balanced in this manner to ensure that
they would measure the same construct and provide the
same kind of testing experience for the examinee.

The degree of similarity was evaluated in two ways. One,
shown in Table 4.17, is a comparison of the means and
standard deviations (SDs) of item difficulties within each
of the 19 item sets in the two forms. These difficulties
were determined from a joint Rasch calibration of all
actual item responses on the two forms, using the age
norm sample plus 136 other cases from the alternate-
forms reliability study; the linkage between the forms
was based on a total of 533 examinees who had taken
both forms. These data demonstrate that the two forms
span a similar range of difficulty and that their item sets
progress in difficulty at a similar rate from beginning to
end. Average item difficulties within sets are very similar.
In the first and last quarters of the test, Form A items are
slightly easier than Form B items, whereas in the middle
half of the test, Form A items are slightly more difficult.
The overall difference between average item difficulties
in the two forms is negligible (.04 logits).

Effect of Colorization

An analysis was conducted to determine whether the change from black-and-white illustrations to color illustrations affected item difficulty. There are 143 items that were essentially unchanged (except for colorization) between the third and fourth editions. A joint Rasch calibration (using only actual item scores) of all PPVT-III and PPVT-4 items was performed using the PPVT-4 age norm sample, the PPVT-4 alternate-form linking sample, and the sample of examinees who had been given one PPVT-III form and one PPVT-4 form, for a total sample of 3,888. (In the last sample, the two forms were those that share few items.) This analysis produced separate Rasch difficulty values for the two versions of each item and avoided the contaminating effect of having the same examinee take both versions of an item.

The average Rasch item difficulties for the black-and-white and color versions of the 143 unchanged items were .07 and .09 logits, respectively, a negligible difference relative to the SDs of 3.24 and 3.27. Thus, on average, colorization had no effect on item difficulty. Of the 143 items, 25 had a statistically significant (p < .05) difference between their black-and-white and color difficulties, with the colorized version being more difficult for 17 items and easier for 8 items. Inspection of item content did not suggest an explanation for any of these changes in difficulty.

Development of Normative Scores

Approximately half of the norm sample took Form A, and the remaining portion took Form B. Because the two forms are not perfectly equivalent, raw scores could not be pooled across forms for norms development. Instead, person ability scores, produced by a Rasch calibration of the two forms, were used for norming. The two forms were calibrated jointly, with the linkage between them established by the sample of 533 examinees who were given both forms (in counterbalanced order) during standardization. Prior to calibration, all items below each examinee's Basal Item were scored 1, and all items above the Ceiling Item were scored 0, so that the data matrix would reflect how the test is scored in practice. Because the PPVT-4 response scale is multiple-choice, which allows the possibility of answering correctly by guessing, this procedure was required in order to create a one-to-one relationship between raw scores and Rasch ability scores. The joint calibration produced an ability score for each examinee. To avoid the decimal fractions and negative values of the Rasch ability scale, these scores were converted to w-ability scores. This process required that each ability score be multiplied by 9.1024 and that a constant (in this case, 150) be added to the resulting value (Woodcock & Dahl, 1971). Because w-ability scores for Forms A and B are interchangeable, they were used in all subsequent norms derivation procedures. In addition, w-ability scores have advantages for score interpretation and are used as the PPVT-4 growth scale values (GSVs).

Five types of normative scores are reported for the PPVT-4 instrument: standard scores, percentiles, stanines, normal curve equivalents (NCEs), and age or grade equivalents. After standard score equivalents have been derived using a normalizing procedure, only a simple conversion table is necessary for finding percentiles, stanines, and NCEs. (The development of age and grade equivalents follows a separate process, described later in this section.)

The procedure for obtaining standard score equivalents was the same for age and grade norms; the age norm process is described here. For each of the 28 age groups used in collecting the age norm sample, the cumulative frequency distribution of w-ability scores was computed, and the midinterval percentile was identified for each
w-ability score value. These percentiles were then converted through an area transformation to the corresponding standard score (i.e., the point on the normal curve where the cumulative area under the curve equals the percentile). This step normalizes the score distribution. The w-ability score values for every 10th standard score from 70 through 130 were then plotted against age, producing seven growth curves, one for each of the selected standard score values. Smooth lines were drawn through these points to regularize the growth trends and make them similar to one another in shape. As an example, Figure 4.3 shows the smoothed growth curve and the actual data points for a standard score of 100 (the median w-ability or GSV score).

Figure 4.3 PPVT-4 median GSV growth curve

Next, w-ability score values for the seven selected standard scores were read from the curves for the 69 narrow age groups reported in the final norm tables. Standard score equivalents of the intervening w-ability scores were computed through linear interpolation. Extrapolation was used to estimate standard scores above 130. For each of the 69 narrow age groups, the standard scores of 110, 120, and 130 were regressed on their corresponding w-ability scores, and this linear regression equation was used to assign standard scores up to 160. An analogous procedure was followed to determine standard scores below 70. Any minor irregularities in the trends of standard scores across age were smoothed. For the final norm table, w-ability scores were replaced by the equivalent raw scores on each form.

The ultimate objective of the norming procedure is to produce a set of standard score conversions that will yield a mean of approximately 100 and an SD of approximately 15 at each age or grade level of the norm sample. This information, along with the means and SDs of raw scores and w-ability scores, is reported in Tables 4.18 and 4.19 for the age norm and grade norm samples, respectively. Results for standard scores are very close to expectations.

In Table B.1 in Appendix B, standard scores ranging from 20 to 160 are presented at 69 chronological-age intervals. Two-month intervals were used for ages 2 years 6 months through 6 years; 3-month intervals for ages 7 through 8; 4-month intervals for ages 9 through 11; 6-month intervals for ages 12 through 17; and 1-, 2-, 5-, and 10-year intervals for ages 18 and older. Standard scores are presented by grade and season in Tables B.2 (fall) and B.3 (spring). Table B.4 presents the conversions of standard scores to percentiles, stanines, and NCEs. A single table can be used for both Forms A and B and for both age and grade norms for these conversions because the score distributions had been adjusted to fit the normal probability curve.

Age and Grade Equivalents

An age or grade equivalent represents the age (in years and months) or grade (in school years and months) at which a particular w-ability score is the average score. To derive the age and grade equivalents, the average w-ability score for each of the 69 age intervals or 26 grade/season intervals in the norm tables was obtained. These points represent a function relating average w-ability score to age or grade, as illustrated previously in Figure 4.3. The same smoothing procedure that was used to develop the standard scores was applied to smooth this set of data points. The smoothed function was then used to prepare the w-ability score to age equivalent and grade equivalent norm tables. To produce Tables B.5 (age equivalents) and B.6 (grade equivalents), w-ability scores were replaced with raw scores.

Age and grade equivalents can be useful measures for educators, but only for those ages at which receptive vocabulary is increasing at a reasonable rate. Therefore, age equivalents were derived only for ages 2 through 24 years. Age equivalents of 2:6 and greater are based on actual data; those below that value are extrapolations and should be used with caution.
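Two arithmetic steps in the norming procedure described in this section lend themselves to a short illustration: the conversion of Rasch ability scores (logits) to w-ability scores, and the area transformation from midinterval percentiles to normalized standard scores. The Python sketch below shows both for a single age group. The function names and the sample data are hypothetical, and the smoothing, interpolation, and extrapolation steps described in the text are deliberately omitted, so this is not a reproduction of the published norms.

```python
from statistics import NormalDist


def w_ability(logit):
    """Convert a Rasch ability score to the w-ability (GSV) scale."""
    return 9.1024 * logit + 150  # multiplier and constant from the text


def normalized_standard_scores(w_scores, mean=100.0, sd=15.0):
    """Map each distinct w-ability value in one age group to a
    normalized standard score via its midinterval percentile."""
    n = len(w_scores)
    below = 0  # number of scores strictly below the current value
    result = {}
    for value in sorted(set(w_scores)):
        count = w_scores.count(value)
        mid_proportion = (below + 0.5 * count) / n  # midinterval percentile / 100
        z = NormalDist().inv_cdf(mid_proportion)    # area transformation
        result[value] = round(mean + sd * z)
        below += count
    return result


# Hypothetical w-ability scores for a tiny illustrative "age group".
group = [140, 150, 150, 160]
scores = normalized_standard_scores(group)
```

In this toy group the middle value receives a standard score of 100 because its midinterval percentile is 50; real norm groups are far larger, and the published tables also reflect the smoothing across ages described above.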
Chapter 5

This chapter discusses technical characteristics of the PPVT-4 instrument that have important implications for the interpretation of scores. The first portion of the chapter reports on the internal consistency, alternate-form, and test-retest reliability of PPVT-4 scores, and explains how the standard errors of measurement (SEMs) were derived and how they can be used when interpreting scores. The chapter concludes with several types of evidence supporting the validity of inferences based on PPVT-4 scores, including content selection procedures, the curve of growth with age, correlations with other tests, and the average scores obtained by individuals with a variety of clinical diagnoses or educational classifications.

Reliability of Scores

Chapter 4 briefly discussed measurement error and confidence intervals. This chapter presents the reliability data on which that information was based and explains how the various confidence intervals were calculated.

Reliability refers to the precision of scores, that is, the degree to which they are free of measurement error. Reliability is expressed on a numerical scale ranging from 0 (no precision) to 1.0 (completely free of error). Several types of reliability were computed for the PPVT-4 instrument that are sensitive to different sources of measurement error. The first two types, split-half reliability and coefficient alpha, are indicators of internal consistency reliability, that is, the degree of consistency of performance on different sections of a test. The third type, alternate-form reliability, reflects the similarity in performance on different but parallel forms administered at about the same time. Both internal consistency and alternate-form reliability mainly are sensitive to measurement error arising from the use of different sets of items. Test-retest reliability, the fourth type, is a measure of stability that indicates the consistency of scores when the same set of items is readministered after a period of time (in this case, about 4 weeks). It is sensitive to measurement error caused by variability over time in the examinee's state (motivation level, fatigue, etc.) as well as by any incidental differences in the administration procedure.

Internal Consistency Reliability

Split-half reliability and coefficient alpha of each form were calculated for each of the 28 age groups in the age norm sample and for each of the 13 groups in the grade norm sample. The procedure for computing split-half reliability began by dividing the form into halves, one containing the odd-numbered items and the other the even-numbered items. The anchored item difficulty values from the calibration of the entire test were used to convert raw scores on the halves to Rasch ability scores, which were then correlated. The Spearman-Brown prophecy formula was applied to these correlations to estimate the preliminary reliability coefficient for the full length of each form. Finally, to prevent differences between the samples taking Form A and Form B from affecting the results, each reliability was adjusted by referencing it to the standard deviation (SD) of ability scores in the complete norm sample at that age or grade. The split-half reliabilities are presented in Table 5.1 for the age norm sample and in Table 5.2 for the grade norm sample. As shown in the tables, the split-half reliabilities are consistently very high across the entire age and grade ranges, averaging .94 or .95 on each form. One of the goals of this revision was to improve measurement at the youngest age levels, and the data in these tables indicate that that goal was accomplished. Reliabilities tend to be at least as high, if not higher, at the preschool ages and at kindergarten than at the older ages and higher grades.
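The Spearman-Brown step used for the split-half reliabilities can be written in a few lines. The sketch below uses the general form of the prophecy formula, which reduces to 2r / (1 + r) when a half-length test is projected to full length. The function name is illustrative, and the correlation in the usage note is a made-up value rather than a PPVT-4 result.

```python
def spearman_brown(half_correlation, length_factor=2.0):
    """Project the reliability of a lengthened test from a part-test
    correlation. length_factor=2 projects two halves to full length."""
    r = half_correlation
    return length_factor * r / (1.0 + (length_factor - 1.0) * r)
```

For instance, a hypothetical odd-even correlation of .90 projects to a full-length reliability of about .947.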
Internal consistency reliabilities (split-half and coefficient alpha) for each form were adjusted by the following method. First, the unadjusted reliability was used to compute the within-form SEM. Next, that SEM and the combined-forms SD were inserted into the basic reliability formula ($\text{reliability} = 1 - \text{SEM}^2/\text{SD}^2$) to produce a reliability value referenced to the SD of ability scores in the entire norm sample at that age.
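The adjustment in this note can be illustrated as two applications of the same formula. The sketch below assumes that the within-form SEM is obtained by inverting the basic reliability formula with the within-form SD (SEM = SD × sqrt(1 − reliability)); the note does not spell out that step, so it is an assumption, and the function name and the numbers in the usage note are hypothetical.

```python
from math import sqrt


def adjust_reliability(unadjusted, within_form_sd, combined_sd):
    """Re-reference a reliability coefficient to the combined-forms SD."""
    # Step 1 (assumed form): within-form SEM from the unadjusted reliability.
    sem = within_form_sd * sqrt(1.0 - unadjusted)
    # Step 2: basic reliability formula with the combined-forms SD.
    return 1.0 - sem ** 2 / combined_sd ** 2
```

If the subsample that took one form happened to be less variable than the full norm sample (say, an SD of 9 against 10), an unadjusted reliability of .90 would be adjusted upward to about .919; with equal SDs the value is unchanged.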

Computation of the other internal consistency reliability statistic, coefficient alpha, requires that every examinee have a score on every item. For unadministered items, scores were estimated by using the Rasch-based parameters for item difficulty and person ability to calculate the probability of the examinee passing the item, and then converting that probability to a 1 or 0. Coefficient alpha was computed separately for each form, and the values were adjusted for the full-sample SD by the same method used for the split-half reliabilities. As with the split-half reliability coefficients, alpha is consistently high at all ages and grades, averaging .97 and .96 for Forms A and B, respectively, for both the age and grade breakdowns (see Tables 5.1 and 5.2). However, because scores were filled in for unadministered items, coefficient alpha reliabilities tend to be overestimates.

Alternate-Form Reliability

Alternate-form reliability coefficients, derived from the administration of two different test forms to the same group of people, are sensitive to two main sources of error, content sampling and the effects of applying basal and ceiling rules. Basal and ceiling rules are essential for focusing test administration on the examinee's critical range. However, measurement error affects where the basal set and ceiling set are located on any particular administration, which may affect the raw score.
During standardization, 508 examinees took both
Form A and Form B, usually in the same testing
session but occasionally as many as 7 days apart.
Approximately half of the examinees took Form A first,
and the remaining examinees took Form B first. The
demographic characteristics of this sample are
described in Table 5.3. Table 5.4 reports the
correlations between Form A and Form B age-based
standard scores for the five age groups. These
reliabilities are very high: adjusted for range restriction,
all fall between .87 and .93, with a mean of .89. Also
noteworthy is the very close similarity of standard score
means and SDs on the two forms.

Test-Retest Reliability
Unlike the three types of reliability previously
discussed, test-retest reliability is not affected by
measurement error resulting from content sampling,
but instead is influenced solely by variability in a
person's performance over time. This variability can
have numerous causes, which include intervening
learning, practice effects from the prior administration,
differences in the examinee's physical or emotional state
(e.g., fatigue, illness, or interest level), and unintended
differences in administration procedure. As with
alternate-form reliability, the basal and ceiling rules also
are a source of error.
During standardization, 340 examinees in five age
groups were retested with the same PPVT-4 form an
average of 4 weeks after the initial testing.
Approximately half of the sample took Form A, and
the remainder took Form B. Table 5.5 provides
information on the demographic characteristics of the
test-retest samples, and Table 5.6 reports the
correlations between age-based standard scores on the
two administrations. The average test-retest correlation
of .93 (range = .92 to .96) is almost as high as the
internal consistency reliability, indicating that PPVT-4
performance is quite resistant to factors (such as those
just listed) that might cause a person to perform
differently at different times. The fact that the test
assesses acquired knowledge and makes minimal
demands of the examinee may account for this very
high level of score stability.

Standard Error of Measurement

The SEM is a means of quantifying the typical amount
of error in the scores obtained on a particular test.
A person's true score on a test, which is unknown, is
the score the person would obtain if the test were error
free (that is, had a reliability of 1.0). The SEM may
be thought of as the average amount by which the
obtained score differs from the true score. It is based
on the reliability coefficient, and for the PPVT-4
instrument the split-half internal consistency reliability
coefficients were used for this purpose. The SEM for
each form, in standard score units, is presented in Table
5.1 for each of the 28 standardization age levels and in
Table 5.2 for each of the 13 grade levels. The average
SEM is the same on Form A and Form B, both across
ages (M = 3.6) and across grades (M = 3.7).
When a person's observed standard score is banded
by the SEM, his or her true score, in theory, would
fall within the resulting range of scores 68% of the time.
This range is referred to as a 68% confidence interval.
Confidence intervals corresponding to other
percentages, such as 90% or 95%, may be constructed
by using the appropriate multiple of the SEM. The 90%
and 95% confidence intervals, which are included in
the PPVT-4 standard score norm tables, are centered
on the estimated true standard score using the formula
from Daniel (1999).

When interpreting standard scores and their confidence
intervals, one should consider that the distribution
of measurement error is in the form of a normal
probability curve. Therefore, it is highly probable that
an individual's obtained score and true score are quite
close, and it is less probable that the true score falls
toward one of the extremes of the confidence interval.
Thus, it is important to keep in mind that the obtained
score is the best single estimate of a person's true score.
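
As a rough illustration of how the SEM and the confidence bands relate, the following Python sketch derives an SEM from a reliability coefficient and builds intervals from multiples of it. It rests on stated assumptions: the SEM uses the conventional formula SD × sqrt(1 − reliability), the multiples come from the standard normal curve, and the intervals are centered on the obtained score for simplicity. The norm tables instead center on the estimated true score using the Daniel (1999) formula, which is not reproduced here. Function names are hypothetical.

```python
import math
from statistics import NormalDist


def standard_error_of_measurement(sd, reliability):
    """Conventional SEM formula: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def confidence_interval(score, sem, level):
    """Symmetric interval around `score` for a coverage level such as 0.90."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # normal-curve multiple of the SEM
    return (score - z * sem, score + z * sem)


# Standard score SD of 15 and a split-half reliability of .94 (illustrative).
sem = standard_error_of_measurement(15.0, 0.94)
low68, high68 = confidence_interval(100.0, sem, 0.68)
low90, high90 = confidence_interval(100.0, sem, 0.90)
low95, high95 = confidence_interval(100.0, sem, 0.95)
print(round(sem, 2), (round(low95, 1), round(high95, 1)))
```

With a reliability of .94 the SEM comes out near 3.7 standard score points, in line with the averages reported above; the 68% band is approximately one SEM on either side of the score.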

Validity is a characteristic of inferences drawn from
test scores. As a simple example, if a person obtains
a PPVT-4 age-based standard score of 100, one might
infer that he or she has a level of receptive vocabulary
that is average for his or her age. The most important
assumption underlying this inference is that the PPVT-4
instrument measures vocabulary level; another
assumption is that the PPVT-4 norms accurately
represent the population. The soundness of the latter
assumption is amply supported by the information
on the norming procedures (see Chapter 4 for details)
and will be further supported in the comparisons of
mean scores on the PPVT-4 and other instruments
reported later in this chapter. The discussion in the
present section focuses primarily on the first
assumption, that is, the question of what the PPVT-4
scale measures; this is referred to as construct validity.

Various types of evidence are relevant to construct
validity. One type is based on the content of the test
itself. Because the PPVT-4 instrument measures
achievement, one can evaluate how the overall set of
test items compares with a specification of the domain
of knowledge the test is designed to assess. Another is
provided by comparing the trend of average
performance across age with the profile of growth and
decline that the research literature would lead one to
expect. Construct validity can also be supported by
correlations with other tests, an analysis of which
would reveal the degree of consistency between the
observed pattern of correlations and the pattern that
would characterize a valid vocabulary measure.

Broadly speaking, validity includes the accuracy of
decisions that might be made using test scores. Here,
useful evidence comes from the mean PPVT-4 scores
of individuals who have been independently assigned
to a variety of clinical diagnoses or educational
classifications. In some applied settings, diagnostic
or classification decisions will be influenced by PPVT-4
scores (in the context of other information), and it is
helpful for the professional to know how members of
these clinical groups typically score on the instrument.
Such information also provides additional evidence of
construct validity.

Content Validity
The qualitative or rational evidence of the content
validity of the PPVT-4 scale as an achievement measure
of hearing vocabulary for standard American English
is supported by the stimulus word selection process
(see Chapter 4 for a description). In brief, these stimulus
words were selected from a pool of words that could
be illustrated by color drawings and that represented
20 content areas. The pool consisted primarily of
entries in Merriam-Webster's Collegiate Dictionary (2003)
and various editions of Webster's New Collegiate
Dictionary (1953, 1967, 1981). Table 4.5 in Chapter 4
reports the comparability of Forms A and B with
respect to the representation in each form of items
in the 20 content categories.

Test Performance and Age
Numerous studies of cognitive abilities over the lifespan
have shown that crystallized ability (Gc), measured by tests
of vocabulary and other skills that are dependent on
acquired knowledge, increases rapidly in childhood and
plateaus in early adulthood until its gradual decline
starting around the 60s or 70s (Kaufman & Lichtenberger,
2006). In contrast, many other cognitive abilities have
a briefer plateau and begin to decline at younger ages.

The growth curve of average performance in the PPVT-4
age norm sample, shown graphically in Figure 4.3 and
numerically in Table 4.18, follows the pattern typical
of measures of Gc. Median GSV scores increase steadily
until about age 30, maintain that level through the
early 60s, and then decrease. The steepest part of the
growth curve occurs during the early ages, from age
2 years 6 months through about 14 years.

Correlations With Other Tests
Four of the correlation studies described in this section
compare PPVT-4 scores with scores obtained on
instruments that measure expressive vocabulary,
language ability, and reading achievement. These
studies provide convergent evidence of the validity of
PPVT-4 scores as measures of vocabulary knowledge,
because it is expected that any measure of vocabulary
will correlate very strongly with other vocabulary tests
and at a somewhat lower, but still substantial, level with
measures of other aspects of language and with reading
skill. The fifth study is a correlation between PPVT-4
scores and PPVT-III scores, the purpose of which is
to assess the degree of continuity in the construct
measured by these two editions.

These studies were carried out during PPVT-4
standardization, using samples that were reasonably
representative of the general population with respect
to sex, race/ethnicity, socioeconomic status (SES),
geographic region, and special-education classification.
The data were collected by numerous examiners across
the United States. The demographic characteristics of
the correlation study samples are summarized in
Table 5.7 and Table 5.8.

In each study, the sequence of administration of the
instruments was counterbalanced so that the
comparability of mean scores on the two tests could
be measured, independent of practice effects. The
correlations themselves were computed separately for
each of the sequences; adjusted for range restriction
using the formula of Cohen, Cohen, West, and Aiken
(2003, p. 58); and averaged using Fisher's
z transformation and the appropriate weighting.
About half of the cases in each study took each of the
two PPVT-4 forms.
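
The correlation-handling steps described for these studies (correct each correlation for range restriction, then average through Fisher's z transformation with weights) can be sketched as follows. This is an illustration under assumptions rather than the procedure actually used: the correction shown is the widely used formula based on the ratio of the population SD to the sample SD, which may or may not match the cited Cohen et al. formula exactly, and the weights of n − 3 are a conventional choice for the unspecified "appropriate weighting". All names and numbers are hypothetical.

```python
import math


def correct_for_range_restriction(r, sample_sd, population_sd):
    """Adjust r for the sample SD differing from the population SD (assumed formula)."""
    u = population_sd / sample_sd
    return (r * u) / math.sqrt(1.0 + r * r * (u * u - 1.0))


def fisher_z_average(correlations, sample_sizes):
    """Average correlations in Fisher's z metric, weighting each by n - 3 (assumed)."""
    weights = [n - 3 for n in sample_sizes]
    z_values = [math.atanh(r) for r in correlations]
    mean_z = sum(w * z for w, z in zip(weights, z_values)) / sum(weights)
    return math.tanh(mean_z)


# Hypothetical example: one correlation per administration sequence.
sequence_a = correct_for_range_restriction(0.60, sample_sd=12.0, population_sd=15.0)
sequence_b = correct_for_range_restriction(0.70, sample_sd=15.0, population_sd=15.0)
combined = fisher_z_average([sequence_a, sequence_b], [55, 60])
print(round(sequence_a, 3), round(sequence_b, 3), round(combined, 3))
```

A sample SD smaller than the population SD raises the corrected correlation; when the two SDs are equal the correlation is returned unchanged.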
Correlations With the Expressive Vocabulary Test,
Second Edition
As described in Chapter 4, the Expressive Vocabulary
Test, Second Edition (EVT-2; Williams, 2007), was
administered to the entire PPVT-4 age norm sample of
3,540 individuals. Because this sample closely matches
the general U.S. population along multiple
demographic variables, the resulting correlations are
robust and stable measures of the relationship between
the PPVT-4 and EVT-2 instruments. (The demographic
characteristics of this sample are not included in Tables
5.7 and 5.8, as they are described in detail in Chapter 4.)
All examinees were administered the PPVT-4
instrument first.

The EVT-2 consists of two types of items, labeling items
in which the examinee says a word (noun, verb, or
descriptor) that describes a picture, and synonym items
in which the examinee says a word that is a synonym
of a stimulus word spoken by the examiner and that
appropriately describes a corresponding picture. The
principle underlying the selection of words for the
EVT-2 was similar to the one that guided PPVT-4
development, namely, to sample standard American
English vocabulary. In addition to measuring
vocabulary knowledge, however, the EVT-2 also
assesses word retrieval ability.

Table 5.9 presents the correlations between PPVT-4
and EVT-2 scores for seven age groups, each including
between 245 and 750 examinees. Because this is the
entire norm sample, there was no need to adjust for
range restriction. The correlations are remarkably
uniform across age, ranging from .80 to .84 (M = .82).
Thus, about two thirds of the variance is common
between the two instruments, as would be expected if
they both measure vocabulary knowledge. The fact that
the average PPVT-4 correlation with EVT-2 is lower
than the average correlation of .89 between PPVT-4
Forms A and B (as reported in Table 5.3), however, is
consistent with the interpretation of EVT-2 scores as
also being measures of the construct of word retrieval.

Correlations With Measures of Oral Language
The project team also investigated the correlations of
PPVT-4 scores with scores from two oral language
instruments. The Comprehensive Assessment of Spoken
Language (CASL; Carrow-Woolfolk, 1999) contains
several subtests that measure the domain of lexical
knowledge: Basic Concepts, Synonyms, Antonyms, and
Sentence Completion. These subtests were administered
along with the PPVT-4 instrument (in a counterbalanced
sequence) to two groups of children, one at preschool
age (3 through 5 years) and the other at elementary-
school age (8 through 12 years). The interval between
administrations was very brief; most examinees took
both tests on the same day, and the maximum interval
was 9 days for the younger group and 28 days for the
older group. Correlations and mean scores are presented
in Table 5.10.

In a second, independent study, the PPVT-4 scale
was administered along with portions of the Clinical
Evaluation of Language Fundamentals®, Fourth Edition
(CELF®-4 instrument; Semel, Wiig, & Secord, 2003),
to a total of 111 children in two age groups (5 through
8 years and 9 through 12 years). The CELF-4 subtests
measure a wider range of aspects of language than the
PPVT-4 instrument, including receptive language and
expressive language. As in the CASL study, the two
instruments were given in a counterbalanced sequence,
with a brief interval (averaging 1 or 2 days) between
administrations. Results of this study are shown
in Table 5.11.
The correlations found in these two studies are of
similar magnitude (mid-.60s to high .70s) for
examinees of elementary-school age. These moderate to
high correlations indicate that the broader oral language
instruments measure related but somewhat different
abilities than the PPVT-4 instrument. The CASL
Lexical/Semantic Composite is the criterion score most
closely aligned with the PPVT-4 score, with a
correlation of .79 that approximates that of the PPVT-4
scale with EVT-2 (.82). For the preschool-age sample,
the moderate correlations (range = .41 to .54) show that
the PPVT-4 instrument is not measuring the construct
measured by the CASL, although this may in part be a
function of the difficulty of obtaining reliable test scores
from young children on expressive language tests.

Correlations With the Group Reading Assessment
and Diagnostic Evaluation
The significant role of vocabulary in the acquisition and
comprehension of reading was discussed in Chapter 1.
In order to assess the magnitude of the relationship of
PPVT-4 scores to reading achievement, the PPVT-4
instrument was administered to 487 students in
kindergarten through Grade 11 who also took the
Group Reading Assessment and Diagnostic Evaluation
(GRADE; Williams, 2001). In addition to a Total Test
score, the GRADE contains composite scores for
Vocabulary and Comprehension and has subtests
measuring additional skills, including Listening
Comprehension, Word Reading, and Phonological
Awareness. Table 5.12 lists the correlations of GRADE
composite and subtest scores with the PPVT-4 scale for
nine grade-level groups of about 55 students each.
The PPVT-4 scale correlates .63, on average, with the
GRADE Total Test score, and correlates at a similar level
with the Vocabulary and Comprehension composites
(mean correlations of .58 and .62, respectively). PPVT-4
scores correlate more strongly with the Vocabulary
(M = .65) and Concepts (.66) subtests than with
any other GRADE subtest or composite. Even though
the examinee listens rather than reads when taking the
PPVT-4 scale, the level of correlation with Listening
Comprehension (M = .47) is lower than with the
reading comprehension scores. The PPVT-4 instrument
generally correlates at a moderate to low level with
the other subtests. Overall, the pattern of correlations
reinforces the point made in Chapter 1 that vocabulary
plays a central role in reading comprehension.

Correlations With the Peabody Picture Vocabulary
Test-Third Edition
The PPVT-III and PPVT-4 instruments were
administered in a counterbalanced sequence to 322
examinees in five age groups, with most administrations
being conducted on the same day and the greatest
interval between administrations being 11 days.
Examinees took the "alternate" form (e.g., PPVT-4
Form A and PPVT-III Form IIIB) to reduce content
overlap. Mean scores and correlations are reported
in Table 5.13.

As shown in the table, correlations are consistently
high. Their average (.84) indicates a strong relationship
between scores on the two editions, although this
average is slightly lower than the correlation of .89
between the two PPVT-4 alternate forms. Notably,
mean scores on the two editions are identical.

PPVT-4 Studies With Special Populations
The PPVT-4 scale, like other standardized measures,
is often used with individuals who are exceptional in
some way. The following studies, which report the
average PPVT-4 standard scores of 12 different groups
that represent specific clinical diagnoses or special-
education classifications, offer empirical evidence for
the validity of the PPVT-4 scale for clinical purposes.

The difference between a group's mean score and the
population average of 100 could be influenced by the
unique demographic characteristics of the group of
interest. To avoid this potential confound, multiple
regression was used to estimate the size of the true
standard score difference between the group of interest
and the norm sample, controlling for sex, race/ethnicity,
and SES. This method of analysis is similar in principle
to using a demographically matched control group,
but because data from a large portion of the age norm
sample are used, it is more sensitive to small differences
and more likely to produce statistically significant
results. Reported next are the differences between
the clinical samples and the general population, all
of which are statistically significant at the .001 level.

Data for these studies were collected during
standardization by numerous examiners across the
United States. For each case that was included, the
examiner was required to provide evidence that the
classification criteria specified for that study had been
met. Individuals classified with mental retardation,
developmental delay, autism, hearing impairment, or
visual impairment were excluded from these samples
(except for the mental retardation and hearing-
impairment studies). The demographic characteristics
of the special-population samples are reported
in Table 5.14.
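
The idea of estimating a group difference while controlling for demographic variables can be illustrated with a small ordinary least-squares sketch in Python. Everything in it is hypothetical: the data are synthetic and noise-free, only two covariates (a sex indicator and an SES code) stand in for the fuller set named above, and the coding of the variables is an assumption made for the example. It shows how an unadjusted mean difference can differ from the regression-adjusted difference when the clinical group is demographically unlike the comparison group.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(design, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic cases: the clinical group is over-represented at lower SES codes.
cases = []
for clinical in (0, 1):
    for female in (0, 1):
        for ses in ((-1, 0, 1) if clinical == 0 else (-1, 0)):
            score = 100 - 6 * clinical + 2 * female + 3 * ses  # built-in effect: -6
            cases.append((clinical, female, ses, score))

design = [[1.0, c, f, s] for c, f, s, _ in cases]  # intercept, group, covariates
scores = [y for _, _, _, y in cases]
adjusted_difference = ols(design, scores)[1]

clinical_scores = [y for c, _, _, y in cases if c == 1]
other_scores = [y for c, _, _, y in cases if c == 0]
raw_difference = (sum(clinical_scores) / len(clinical_scores)
                  - sum(other_scores) / len(other_scores))
print(round(raw_difference, 2), round(adjusted_difference, 2))
```

In this constructed example the raw gap overstates the deficit because part of it reflects the SES imbalance; the coefficient on the group indicator recovers the difference that was built into the data.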
Speech Impairment
The PPVT-4 instrument was administered to two samples
of examinees with speech impairments, one consisting of
children and adolescents aged 5 through 15 (N = 178)
and the other consisting of adults aged 50 through 96
(N = 60). The criterion for inclusion in the younger sample
was the existence of a communication disorder (such as
impaired articulation, stuttering, or voice impairment) that
adversely affected educational performance and for which
the child was receiving services from a speech-language
pathologist. For the adult sample, the criteria for inclusion
were the existence of dysarthria, whether developmental
(cerebral palsy), recovering (stroke), or degenerative
(amyotrophic lateral sclerosis [ALS] or Parkinson's
disease), and the absence of head injury.

Previous research has indicated that individuals with
speech impairments score similarly to those without
such impairments on measures of receptive vocabulary
(Carrow-Woolfolk, 1995). The same finding was
reported in the PPVT-III manual from a study involving
a sample of 50 children (aged 5 through 13) with
speech impairments whose average score did not differ
significantly from that of a matched control group
(Dunn & Dunn, 1997).

Table 5.15 presents the means and SDs of PPVT-4
standard scores in the two clinical samples, along with
the estimated differences between their mean scores and
those of the general population without speech
impairments. Results are similar in the two age groups.
The differences from the general population are
statistically significant but modest in size. Examinees
with speech impairments are estimated to score about
6 or 7 standard score points below the general
population mean, but these mean scores are well within
the average range.

Language Disorder
The PPVT-4 scale was administered to two samples
of individuals with language disorder: a child sample
(ages 8 through 12, N = 65) and an adult sample
(ages 50 through 92, N = 45). To be included in the
younger sample, the person must have been diagnosed
with a communication disorder classified as a language
disorder for which special services were being received.
Individuals in the older sample had been diagnosed
with aphasia and did not have head injury. As with the
sample with language delay, these groups were expected
to earn low PPVT-4 scores because of the strong
relationship between vocabulary and language ability.

Table 5.17 shows that these groups of examinees with
language disorders score lower than the general
population without language disorders. The size of the
deficit relative to the general population is about 10
standard score points for the child sample and 13
points for the adult sample.
Hearing Impairment
The PPVT-4 instrument was administered to two samples
of children with hearing impairments, one group without
cochlear implants (N = 53) and the other with implants
(N = 46). Ages in both groups ranged from 4 through 12.
The group without cochlear implants had mild to
moderate hearing loss (40 to 55 dB) that adversely
affected educational performance, and attended
mainstream educational classes for most or all of the
school day. Of the 49 children for whom information was
available, 44 had impaired hearing in both ears and 5
in just one ear. Individuals were tested using the same
means of amplification that was routinely used in the
classroom, but sign language was not permitted for
PPVT-4 administration. The mean score for this group
was expected to be lower than the general-population
average because hearing impairment frequently causes
a delay in language and vocabulary development.

The children with cochlear implants are presumed to
have had more severe hearing impairment. Of the 46
children, 38 had an implant in one ear, and 6 had
implants in both ears (information was unavailable
for 2 children). In general, cochlear implants are
appropriate for only those individuals with the most
severe hearing impairments, because the medical
procedure required prevents any normal hearing once
a person receives the implant. Government guidelines
suggest that cochlear implantation be reserved for
individuals who are severely or profoundly deaf, who
would receive minimal benefit from amplification by
other means, or who can identify spoken words with
less than 40% accuracy (U.S. Department of Health
and Human Services, 2006). Children who meet
these criteria may have a range of conditions, from
prelingually deaf, to congenitally hard of hearing
in the severe to profound range, to those who lost
hearing after learning to speak. Many children with
such histories would be expected to have significant
difficulty in a classroom setting (Teagle & Moore,
2002). However, children who develop normal
language skills before becoming deaf, and children
who acquire functional hearing through cochlear
implantation before the passage of critical periods of
language development, "may learn to develop spoken
language at a rate that parallels the rate of language
development in children with normal hearing"
(Teagle & Moore, 2002, p. 163).

Table 5.18 shows that, as expected, both groups scored
substantially lower than the reference group. The
sample without cochlear implants scored about 1 SD
below the general-population mean, and the group
with implants scored quite a bit lower, almost 1 SD
below the first group and nearly 2 SDs below the
general-population mean.

Specific Learning Disability (Reading)
The PPVT-4 instrument was given to a sample of
71 individuals aged 8 through 14 who had learning
disabilities in reading. This classification was made
by the participating school and included a severe
discrepancy between ability and reading achievement.
Because the PPVT-4 scale requires no reading, and
because vocabulary is a component of many ability tests
and correlates strongly with verbal ability (as discussed
in Chapter 1), these individuals might be expected to
achieve average-level PPVT-4 scores. On the other
hand, a reading disability is a hindrance to learning
new vocabulary through reading and, therefore, could
depress PPVT-4 scores.

Table 5.19 shows that this sample scored substantially
lower than the general population of individuals
without a reading disability. The estimated difference
is about 10 standard score points.
Mental Retardation
The PPVT-4 instrument was administered to 70
examinees aged 6 through 17 who had been diagnosed
with mild to moderate mental retardation. To be included
in this study, the individual had to have a full-scale IQ
score in the range of 50 to 70 and have a deficit in
adaptive behavior. Because of the strong relationship
between vocabulary and general cognitive ability, study
participants were expected to have low PPVT-4 scores.
Table 5.20 presents the mean score for this group, which
is almost 2 SDs lower than that of the general population.
Attention-Deficit/Hyperactivity Disorder
The PPVT-4 scale was given to 91 examinees aged
6 through 17 with attention-deficit/hyperactivity
disorder (ADHD). In order to be included in this study,
individuals had to have met the criteria for ADHD as
specified by the Diagnostic and Statistical Manual of
Mental Disorders, Fourth Edition (DSM-IV™ manual;
American Psychiatric Association, 1994). The average
PPVT-4 score for this sample, reported in Table 5.23,
is about one-half SD lower than for individuals in the
general population without ADHD.
Giftedness
The PPVT-4 instrument was given to 55 examinees
aged 8 through 17 who were enrolled in programs for
giftedness. The criteria for inclusion were those used by
the participating schools and included a score of 125 or
higher on a standardized intelligence test. Table 5.21
shows that this group obtained substantially higher
PPVT-4 scores than the general population; after
adjustment for demographic factors, the difference is
slightly less than 1 SD.

