Test Review: Intelligence
REVIEWER:
Bruce Gordon
Alvin Buckwold Child Development Program
Saskatoon, Saskatchewan
General Information
There have also been major changes at the subtest level. Seven new subtests
have been added. Many of these were created to update the WPPSI’s theoretical
foundations. One goal in this update was to better assess fluid reasoning abilities.
To this end, the Picture Concepts, Matrix Reasoning, and Word Reasoning subtests
were created. A second goal of the update was to measure processing speed
abilities in preschoolers, as is done with the WISC-III (Wechsler, 1991) and the WAIS-III
(Wechsler, 1997). Modified versions of the Coding and Symbol Search subtests
from the WISC-III were created for preschoolers in the older age band. Two other
subtests, Receptive Vocabulary and Picture Naming, were added to better assess
language skills.
Five subtests have been dropped from the WPPSI-R. Gone are Arithmetic,
Animal Pegs, Geometric Design, Mazes, and Sentences. Of note is the WPPSI-III's decreased emphasis on directly assessing memory skills. The test authors
suggest that the clinician wanting a good assessment of memory skills in a younger
child can turn to other tools such as the Children’s Memory Scales (Cohen, 1997).
Several changes were made to make the WPPSI more developmentally appropriate
for younger children. Test instructions were designed to better fit the language level of
preschoolers. Also, more teaching items and prompts were built into the subtests. As
well, outside of the Processing Speed subtests, an attempt was made to reduce possible
confounds of speed on performance by removing the bonus points for quick completion
of Block Design or Object Assembly items. An effort was also made to reduce possible
confounds of expressive language development on verbal reasoning abilities. For
example, the Word Reasoning subtest, which requires only a one-word answer from the
child, is part of the core verbal subtests, whereas the Comprehension subtest, which requires more verbal output, is relegated to the supplemental subtests.
A number of changes were made to give the WPPSI-III better clinical utility. A linking study was done with the WIAT-II to allow for better assessment of ability-
achievement discrepancies. Testing time was reduced to allow for the waning
enthusiasm most preschoolers show for structured activities as the assessment drags
on tediously. Also, an attempt was made to simplify administration procedures. The
most dramatic example of this comes on the Object Assembly subtest. Gone is the
shield behind which the examiner would attempt to lay out the puzzle pieces, while
often frantically trying to keep the child from peeking over it or snatching it away. Now,
the clinician simply lays out the pieces in plain sight (much like a Blackjack dealer)
and flips them over. The challenge is still there of encouraging the child not to grab
the pieces before they are all turned over, but at least without the shield, the clinician
now has less to guard and a fighting chance at success. The new instructions for laying
out the pieces take some thinking through, but once mastered, they greatly speed up
administration time.
Structure
The WPPSI-III's subtests are designated core, supplemental, or optional. All have a mean of 10 and a standard deviation of 3. Core subtests are those considered to provide the best measures of overall cognitive ability and are used to calculate the FSIQ, VIQ, and PIQ. Supplemental subtests can be used
to provide additional clinical information about the child and may be used to
substitute for a core subtest if it were invalidated or inappropriate for the child.
"Optional" is the label given to the two subtests that make up the GLC for the older age band; these are not included in any of the composite scores and may not be used to substitute for a core subtest.
Table 1
Subtests for WPPSI-III Younger Age Band
the GLC composite along with Receptive Vocabulary or can substitute for Receptive
Vocabulary in VIQ and FSIQ.
Table 2
Subtests for WPPSI-III Older Age Band
Technical Evaluation
Standardization Sample
The American norms for the WPPSI-III are based on a sample of 1,700 children
designed to match the 2000 U.S. Census in terms of ethnic origin, parent education
level, and geographic location. Two hundred children were included for each six-
month age interval. This sample would earn a rating of "Good" under the Flanagan & Alfonso (1995) criteria in terms of the following: recency, having adequate numbers of children at each age level, and being representative of the U.S. population.
Reliability
Internal consistency reliability for the WPPSI-III is excellent at both the
composite score level and subtest level. The average reliability for FSIQ is .96 (safely
surpassing the Flanagan & Alfonso standard) with reliability coefficients for the
second-order factors ranging from .89 (PSQ) to .95 (VIQ). Reliability at the subtest level is particularly impressive, with all of the subtests boasting average reliability coefficients above .80, ranging from .83 (Symbol Search) to .95 (Similarities).
Flanagan & Alfonso set high standards for test-retest reliability which only the
Differential Ability Scales (Elliott, 1990) was able to meet in their 1995 review. They
argue that clinicians must look not just at the reliability coefficients generated by
the test-retest sample but at the characteristics of the sample itself. To meet their
standards, the test-retest sample should have the following characteristics: be
sufficiently large, match the demographic characteristics of the population, sample
adequately the entire age range of the test, and have a reasonably short test-retest
interval.
Across the three age groups, the average test-retest correlation was .92 for FSIQ and ranged
from .86 to .91 for the composite scores. At the subtest level, across the three age
level samples, test-retest reliabilities ranged from .74 for Object Assembly to .90
for Similarities. The average practice effect was 5.2 points for FSIQ, with smaller
practice effects seen for younger children.
The WPPSI-III has done a laudable job of building the case for interrater
reliability. The manual indicates that two examiners independently double-scored
all of the WPPSI-III standardization protocols and that their overall agreement
rate was .98 to .99. To assess interrater reliability for the subtests requiring more subjective scoring judgement, a separate study was conducted. Particularly impressive was that the raters chosen were not WPPSI-III experts but rather four graduate students with no previous WPPSI-III experience. Still, they achieved
interrater reliability coefficients ranging from .92 to .95 at the item level and .97 to
.99 for their total subtest score.
Validity
The WPPSI-III’s technical manual (Wechsler, 2002b) makes the point that
contemporary definitions of validity point us away from evaluating it in terms of
three separate and distinct domains: content, criterion-related, and construct
validity. Instead, the current conceptualization is to view validity as more unitary in
nature and to evaluate a test’s validity in terms of the lines of validity evidence. The
goal is to make a judgement about whether there is evidence to support that the test
measures what it says it measures and that the results from the test can be used to
interpret what the testmakers say they can interpret.
The first line of validity evidence that the WPPSI-III measures cognitive abilities and processes in young children comes from what the manual refers to as evidence based on test content and response processes. The case is made that
the content of the WPPSI-III was chosen based on extensive literature reviews and
expert advice. Further, during the development phase of the WPPSI-III, children’s
specific responses were studied to guide the understanding of the cognitive
processes involved.
To build the case that the WPPSI-III matches its theoretical factor structure, intercorrelational studies among the subtests were examined. The goal is to show that the subtests from a proposed factor correlate more highly with each other (convergent validity) than with subtests from another proposed factor
(discriminant validity). At the younger age band, the two Verbal subtests (Receptive Vocabulary and Information) do correlate more highly with each other than with the Performance subtests. However, the two Performance subtests (Block Design and
Object Assembly) correlate about equally well with each other as they do with the
Verbal subtests. The test manual argues that this is likely due to the high g loadings
of all of the subtests and that cognitive abilities of younger children are not as well-
differentiated. This sounds like the beginnings of an argument that at the younger
age band the WPPSI-III is actually a one-factor test like the DAS.
At the older age band, both the Verbal and the Processing Speed subtests
show the predicted pattern of correlating more highly with each other than with
the subtests from the other scales. However, again, there are difficulties with the Performance subtests. While Block Design and Matrix Reasoning correlate highly with the other Performance subtests, they also show high correlations with some
of the Verbal and Processing Speed subtests. Again, the manual suggests that
this reflects that subtests with high g loadings will tend to correlate highly with
each other. As well, both Picture Concepts and Picture Completion fail to show
discriminant validity, correlating highly with subtests both on the Verbal and
Performance scales. The manual indicates this is likely due to the importance of verbal
mediation for children in solving these types of problems.
The next line of validity evidence cited comes from exploratory factor analyses of the core subtests. The results were consistent with predictions for a two-factor (Verbal-Performance) model for the younger age band and a three-factor (Verbal-Performance-Processing Speed) model for the older age band. Exploratory factor
analyses were repeated, this time adding in the supplemental subtests. Again,
for the younger age band, results matched the predicted two-factor model. For
the older age band, results were basically consistent with the predicted three-
factor model with the exception of the Picture Concepts subtest. It actually loaded
somewhat higher on the Verbal factor.
Confirmatory factor analysis data are presented to provide another line of
evidence for the WPPSI-III’s validity. A two-factor model was found to work best
for the younger age band. For the older age band, two different three-factor models
worked equally well. One featured the problematic Picture Concepts subtest loading
on the Performance factor, whereas the other allowed Picture Concepts to load on
both the Verbal and Performance factors.
All of this begs the question of whether Picture Concepts is really a Performance
subtest. In a somewhat similar situation with the DAS, the decision was made that
the Early Numbers Concept subtest would be part of the core subtests contributing
to the overall score but not placed on either the Verbal or Nonverbal cluster.
Instead, the WPPSI-III development team actually turns to the DAS to justify
their decision to place Picture Concepts on the Performance factor. They argue
(Wechsler, 2002b) that Picture Concepts correlates higher with the DAS Nonverbal
Cluster than with its Verbal Cluster score. However, the justifications presented for
the decision to classify Picture Concepts as a Performance subtest seem strained,
and, as they admit, further research will be needed to understand what Picture
Concepts measures.
The next line of validity evidence for the WPPSI-III comes from studies
examining its relationship with other measures of intelligence for preschoolers. The
WPPSI-III’s FSIQ score correlates highly with the overall scores provided by the
WPPSI-R, WISC-III, Bayley Scales of Infant Development (Bayley, 1993), and the
DAS. This is evidence that the WPPSI-III is measuring a similar construct, namely
intelligence.
Of particular interest was the much lower than expected "Flynn effect"
(Flynn, 1987) for the WPPSI-III. One of the reasons for updating any measure of
intelligence is to take into account that average performance on intelligence tests
has been rising over the years. However, children in the WPPSI-III/WPPSI-R
comparison study scored just 1.2 points lower on the WPPSI-III. This compares
to "Flynn effects" of 5.3 and 2.9 FSIQ points for the WISC-III and the WAIS-III, respectively.

For clinicians assessing young children with possible developmental delays, a critical issue is floor effects. At the composite score level,
the minimum possible valid score must be at least two standard deviations below
the mean, and it should be at least minus three standard deviations to have real
clinical utility. Table 3 shows the minimum possible overall score for the WPPSI-III
composite scores. The WPPSI-III’s minimum possible FSIQ score meets the
standard of falling three standard deviations below the mean across all age ranges.
At the second-order factor level, both the VIQ and PIQ meet the minus three
standard deviation goal across all age levels with one exception. The minimum
possible VIQ score is slightly higher at 58 until age 3:0. Floor effects are a problem
though for the new PSQ. The minimum possible PSQ is 78 at age 4:0. It is only at
age 5:0 that the minimum PSQ dips below 2 standard deviations to 64, and not until age 6:0 that it drops below the three standard deviation mark to 49. The minimum
possible GLC score is at least two standard deviations below the mean across the
entire WPPSI-III age range, reaching the three standard deviation mark at age 3:0.
Table 3
Minimum Possible Valid Composite Scores on WPPSI-III
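The composite-floor standards above reduce to simple arithmetic: WPPSI-III composite scores have a mean of 100 and a standard deviation of 15, so the minus two and minus three standard deviation cutoffs fall at 70 and 55. The following minimal sketch applies those cutoffs to the PSQ minimums reported in the text; the function name and rating labels are my own shorthand, not Flanagan and Alfonso's terminology.

```python
# Composite scores (FSIQ, VIQ, PIQ, PSQ, GLC) have mean 100, SD 15.
MEAN, SD = 100, 15

def floor_rating(minimum_score: int) -> str:
    """Rate a composite's lowest obtainable score against the
    -2 SD and -3 SD floor standards."""
    if minimum_score <= MEAN - 3 * SD:   # 55 or lower
        return "meets -3 SD standard (real clinical utility)"
    if minimum_score <= MEAN - 2 * SD:   # 70 or lower
        return "meets -2 SD standard"
    return "inadequate floor"

# The minimum PSQ scores reported in the text:
print(floor_rating(78))  # PSQ at age 4:0 -> inadequate floor
print(floor_rating(64))  # PSQ at age 5:0 -> meets -2 SD standard
print(floor_rating(49))  # PSQ at age 6:0 -> meets -3 SD standard
```

The same check applied to the FSIQ minimums (which stay at or below 55 across the age range) confirms why only the PSQ is singled out as problematic.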
Flanagan and Alfonso (1995) further set the standard that for each subtest
to have an adequate floor, a raw score of 1 should be more than two standard deviations below the mean. For the WPPSI-III this means that to meet this standard, a subtest raw score of 1 should produce a scaled score no higher than
3. Similar to the other preschool intelligence tests, the WPPSI-III struggles with
this standard at its earliest age levels. At the younger age band, only Block Design
meets the standard across the entire age range. Receptive Vocabulary and Picture Naming only comply by age 3:0, and Object Assembly takes until age 3:6 to meet the standard.
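Because subtest scaled scores have a mean of 10 and a standard deviation of 3, "more than two standard deviations below the mean" works out to a scaled score below 4, i.e. 3 or lower, which is why the standard is stated as a scaled score no higher than 3. A minimal sketch of that criterion (the function name is my own, for illustration only):

```python
# Subtest scaled scores have mean 10, SD 3.
MEAN, SD = 10, 3

def has_adequate_floor(scaled_score_at_raw_1: int) -> bool:
    """Flanagan & Alfonso criterion: a raw score of 1 must fall
    more than two SDs below the mean, i.e. scaled score < 4."""
    return scaled_score_at_raw_1 < MEAN - 2 * SD

# Values from the text: Comprehension and Similarities at age 4:0
print(has_adequate_floor(6))  # Comprehension -> False (inadequate)
print(has_adequate_floor(7))  # Similarities  -> False (inadequate)
print(has_adequate_floor(3))  # the passing threshold -> True
```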
Several subtests are problematic for the older age band. Three of the seven core
subtests (Picture Concepts, Word Reasoning, and Coding) have inadequate floors at
age 4:0. As would be expected, given the difficulties experienced with the PSQ, both of its subtests (Coding and Symbol Search) struggle with floor effects. Coding only reaches an adequate floor by age 5:0, and it is only by age 6:0 that Symbol Search meets the standard. Along with Coding, two other core subtests (Picture Concepts and Word Reasoning) have inadequate floors at age 4:0. Picture Concepts
achieves an adequate floor by age 4:6, but Word Reasoning takes until age 5:6 to
meet the standard.
Two of the supplementary subtests, Comprehension and Similarities, also have
problems with inadequate floors. The minimum possible scaled score at age 4:0 is
6 for Comprehension and 7 for Similarities. Comprehension takes until age 6:0 to achieve an adequate floor, whereas it is only by age 6:4 that Similarities meets the
standard. This has particular importance for clinicians who might be tempted to
substitute these subtests for other core verbal subtests. Although some clinicians
might feel that Comprehension and Similarities could provide better clinical
information about verbal reasoning abilities, their individual inadequate floors can
result in raising the floor for the VIQ and the FSIQ. For example, using the core
subtests at age 4:0, the minimum possible scores are 55 for the VIQ and 51 for the
FSIQ. However, were a clinician to substitute Comprehension and Similarities for
Information and Vocabulary, then the minimum possible scores would inflate to 77
for the VIQ and 61 for the FSIQ.
Practical Evaluation
Excellent psychometrics are a necessary but not sufficient condition for a test to be clinically useful. Practical utility ranges from whether the test activities engage the child to whether the scores produced by the test actually help in better understanding the child's difficulties and pointing towards solutions.
At the practical level, the WPPSI-III is a quite significant improvement over the
WPPSI-R. First, the reduction in administration time is a major improvement.
The WPPSI-R was a fine test that provided a great deal of excellent clinical
information, but it went on too long for the attention spans of most young children
who would actually be referred for a cognitive assessment. At the younger age band,
data indicate (Wechsler, 2002b) that 50% of the children can complete the four core subtests in just under half an hour, with 90% done by 45 minutes. At the older
age band, 50% of the children completed the 7 core subtests in just under three-
quarters of an hour, with 90% done by 60 minutes. This is a critical improvement
because it makes completion of the WPPSI viable for a greater percentage of
children with developmental difficulties.
to try even though many find it difficult. Although only a supplemental subtest
for the older age band, it can be deployed strategically in the assessment to keep
motivation up. Word Reasoning with its riddle-like format is also quite engaging,
although one of my clients found the repetition of the clue more frustrating and
distracting than helpful. Vocabulary is the least enjoyable of the subtests, often
the equivalent of a root canal for a preschooler with language difficulties, but it is
no worse than any other vocabulary subtest from other measures. The new Block
Design is also now much less torturous than its predecessor. The WPPSI-R’s
Block Design required the child to fail three consecutive items with two trials each
(so six failures in a row) before the mercy rule could be invoked and the subtest
discontinued. All this repeated failure would convince some children that it was
pointless to continue trying with any of the test activities. The new format has
many items with only one trial, speeds up administration by dropping the shield
(like Object Assembly), and it includes some easier items involving stacking blocks.
The net effect is a much more endurable subtest for children who have trouble with
Block Design.
At our clinic, our mainstay for preschool cognitive assessment is the Differential
Ability Scales (DAS; Elliott, 1990). The DAS has both excellent psychometric properties and clinical utility (Gordon & Elliott, 2001) for young children, and it serves as a worthy standard against which to compare the WPPSI-III.
The WPPSI-III has several practical/clinical advantages over the DAS. Most importantly, it provides a much better assessment of a young child's verbal reasoning and expressive language skills. The core verbal subtests provide a much
better opportunity to assess a child’s ability to think and communicate in phrases
and sentences than the DAS’ Picture Naming subtest.
In the Wechsler tradition, a good test should not simply quantify cognitive
abilities, but it should also provide a thorough clinical assessment of the client.
With the increased verbal opportunities, the WPPSI-III provides a significantly
better chance than the DAS of gaining information about the child’s feelings or
worldview. The supplemental verbal subtests are particularly useful for the clinician
interested in this type of information. For example, a six-year-old boy whom I
assessed finished the Similarities item "Mothers and sisters are both..." with a conspiratorial glance and the word "annoying". The answer provided valuable
insight into the world of a little boy coping with a baby sister.
The DAS still provides a better assessment of receptive language skills. The
WPPSI-III added the Receptive Vocabulary subtest, but its PPVT-like format requires only that the child point to the picture that best represents the target
word spoken by the examiner. The DAS’ Verbal Comprehension subtest allows
the clinician to assess the child’s understanding of short phrases and two-step
directions.
Both the DAS and WPPSI-III feature a four core-subtest battery for their
younger age band. However, this extends only to age 3:5 for the DAS, whereas the
shorter version of the WPPSI extends to age 3:11. Many three-year-olds at our clinic
with autism or fetal alcohol syndrome find four subtests about their limit, which
gives the WPPSI an advantage over the DAS for children from age 3:6 to age 3:11.
That said, the DAS' format still provides much greater flexibility to tailor the test to the individual child.
Interpretation
Included in the WPPSI-III Technical and Interpretive Manual (Wechsler,
2002b) is a quite helpful guide to profile interpretation. It takes the clinician through a 10-step process beginning with considering the meaning of the composite
scores, then examining differences between composite and subtest scores, and
finally analyzing details of how the child solved or failed items. The manual also suggests that the clinician consider whether the difference observed is not just statistically but
also clinically significant. The record form also provides some gentle shepherding
to the clinician to consider these questions with a section that walks one through
score discrepancy calculations, including attending to the base rate of the difference, before moving on to the interpretative process.
With the new WPPSI, the clinician now has two new composite scores to interpret: the PSQ and the GLC. The interpretative manual provides a good
explanation of what in theory these constructs are measuring. However, in practice
there are some complications. The PSQ subtests are difficult for many 4-year-olds.
The manual indicates that fewer than 10% of 4-year-olds were unable to do the core subtests (Wechsler, 2002b), but information provided by Hazel Wheldon from TPC
(Wheldon, 2002) indicates that 20% of typical four-year-olds do not understand
how to do the Coding and Symbol Search subtests. So caution will be needed in
interpreting what the PSQ score does and does not mean when assessing young
four-year-olds. As well, in the clinical validity studies (Wechsler, 2002b), contrary
to predictions, no difference was found in PSQ scores between children with ADHD
and a matched control group without attention problems. Further research will be
needed to illuminate what the PSQ can offer to our understanding of the cognitive
abilities and processes of young children.
In a clinical validity study (Wechsler, 2002b), essentially no difference was found between the GLC (79.2) and
the VIQ (80.2) in a sample of children with limited English proficiency.
Conclusions
expressive language skills. I have not given up the DAS, but I find myself frequently
considering and sometimes choosing the WPPSI-III (which is not the case for the
WPPSI-R) for the children with developmental delays who come to my clinic.
References