A statistic useful in describing sources of test score variability is the variance (σ²), the standard deviation squared. This statistic is useful because it can be broken into components. Variance from true differences is true variance, and variance from irrelevant, random sources is error variance. If σ² represents the total variance, σ²_tr the true variance, and σ²_e the error variance, then the relationship of the variances can be expressed as

σ² = σ²_tr + σ²_e
The term reliability refers to the proportion of the total variance attributed to true variance. The
greater the proportion of the total variance attributed to true variance, the more reliable the test.
Because true differences are assumed to be stable, they are presumed to yield consistent scores on
repeated administrations of the same test as well as on equivalent forms of tests. Because error
variance may increase or decrease a test score by varying amounts, consistency of the test score—
and thus the reliability—can be affected.
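As a minimal sketch, this decomposition can be illustrated with hypothetical variance components; the numbers below are assumptions chosen for illustration, not values from any actual test.

```python
# A minimal sketch (hypothetical numbers): reliability as the proportion of
# total variance that is true variance.
true_var = 80.0    # variance attributable to true differences among test takers
error_var = 20.0   # variance attributable to irrelevant, random sources

total_var = true_var + error_var      # total variance = true variance + error variance
reliability = true_var / total_var    # proportion of total variance that is true variance

print(f"Total variance: {total_var}")    # 100.0
print(f"Reliability:    {reliability}")  # 0.8
```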
Sources of error variance include test construction, administration, scoring, and/or interpretation.
Test construction
One source of variance during test construction is item sampling or content sampling, terms that
refer to variation among items within a test as well as to variation among items between tests.
Consider two or more tests designed to measure a specific skill, personality attribute, or body of
knowledge. Differences are sure to be found in the way the items are worded and in the exact
content sampled. The extent to which a test taker’s score is affected by the content sampled on the
test and by the way the content is sampled (that is, the way in which the item is constructed) is a
source of error variance. From the perspective of a test creator, a challenge in test development is to
maximize the proportion of the total variance that is true variance and to minimize the proportion of
the total variance that is error variance.
Test administration
Sources of error variance that occur during test administration may influence the test taker’s
attention or motivation. The test taker’s reactions to those influences are the source of one kind of
error variance. Examples of untoward influences during administration of a test include factors
related to the test environment: the room temperature, the level of lighting, and the amount of
ventilation and noise, for instance. Other environment-related variables include the instrument used
to enter responses and even the writing surface on which responses are entered. A pencil with a dull
or broken point can hamper the blackening of little grids. The writing surface on a school desk may
be riddled with heart carvings, the legacy of past years’ students who felt compelled to express their
eternal devotion to someone now long forgotten. Other potential sources of error variance during
test administration are test taker variables. Pressing emotional problems, physical discomfort, lack
of sleep, and the effects of drugs or medication can all be sources of error variance. A test taker may,
for whatever reason, make a mistake in entering a test response. For example, the examinee might
blacken a “b” grid when he or she meant to blacken the “d” grid. An examinee may simply misread a
test item. Formal learning experiences, casual life experiences, therapy, illness, and changes in mood
or mental state are other potential sources of test taker-related error variance. Examiner-related
variables are potential sources of error variance. The examiner’s physical appearance and demeanor
—even the presence or absence of an examiner—are some factors for consideration here. Some
examiners in some testing situations might knowingly or unwittingly depart from the procedure
prescribed for a particular test. On an oral examination, some examiners may unwittingly provide
clues by emphasizing key words as they pose questions. They might convey information about the
correctness of a response through head nodding, eye movements, or other nonverbal gestures.
Clearly, the level of professionalism exhibited by examiners is a source of error variance.
Test scoring and interpretation
The advent of computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated error variance caused by scorer differences in many tests. However, individually
administered intelligence tests, some tests of personality, tests of creativity, various behavioral
measures, and countless other tests still require hand scoring by trained personnel. Manuals for
individual intelligence tests tend to be very explicit about scoring criteria lest examinees’ measured
intelligence vary as a function of who is doing the testing and scoring. In some tests of personality,
examinees are asked to supply open-ended responses to stimuli such as pictures, words, sentences,
and inkblots, and it is the examiner who must then quantify or qualitatively evaluate responses. Here, it is the examiner’s task to determine which responses will be awarded credit and which will not, whether those responses are open-ended verbal answers or, on a test of creativity, constructions built from blocks. Scorers and scoring systems are potential sources of error variance. A test may
employ objective-type items amenable to computer scoring of well-documented reliability. Yet even then, a technical glitch could contaminate the data. If subjectivity is involved
in scoring, then the scorer (or rater) can be a source of error variance. Consider a behavioral rating item that asks simply whether a patient greeted two staff members in the morning. One rater might judge the patient’s eye contact and mumbling of something to the two staff members to qualify as a yes response; another rater might feel strongly that a no response to the item is appropriate. Such problems in scoring
agreement can be addressed through rigorous training designed to make the consistency—or
reliability—of various scorers as nearly perfect as can be.
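As a rough, hedged illustration of how scorer agreement might be quantified, the sketch below computes simple percent agreement between two hypothetical raters on a set of yes/no items; the ratings are assumptions for illustration, and chance-corrected indices such as Cohen's kappa are often preferred in practice.

```python
# A hedged sketch: quantifying agreement between two scorers on yes/no items.
# The ratings below are hypothetical; in practice, chance-corrected indices
# such as Cohen's kappa are often preferred over simple percent agreement.
rater_a = ["yes", "no", "yes", "yes", "no", "yes", "no", "yes"]
rater_b = ["yes", "no", "no",  "yes", "no", "yes", "yes", "yes"]

agreements = sum(a == b for a, b in zip(rater_a, rater_b))
percent_agreement = agreements / len(rater_a)

print(f"Percent agreement: {percent_agreement:.2f}")  # 0.75
```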
Certain types of assessment situations lend themselves to particular varieties of systematic and
nonsystematic error. For example, consider assessing the extent of agreement between partners
regarding the quality and quantity of physical and psychological abuse in their relationship. As
Moffitt et al. (1997) observed, “Because partner abuse usually occurs in private, there are only two
persons who ‘really’ know what goes on behind closed doors: the two members of the couple”.
Potential sources of nonsystematic error in such an assessment situation include forgetting, failing to
notice abusive behavior, and misunderstanding instructions regarding reporting. A number of
studies have suggested that underreporting or overreporting of the perpetration of abuse may also
contribute to systematic error. Females, for example, may underreport abuse because of fear,
shame, or social desirability factors and overreport abuse if they are seeking help. Males may
underreport abuse because of embarrassment and social desirability factors and overreport abuse if
they are attempting to justify the report. Just as the amount of abuse one partner suffers at the
hands of the other may never be known, so the amount of test variance that is true relative to error
may never be known.
Reliability Estimates
A ruler made from the highest-quality steel can be a very reliable instrument of measurement. Every
time you measure something that is exactly 12 inches long, for example, your ruler will tell you that
what you are measuring is exactly 12 inches long. The reliability of this instrument of measurement
may also be said to be stable over time. Whether you measure the 12 inches today, tomorrow, or
next year, the ruler is still going to measure 12 inches as 12 inches. By contrast, a ruler constructed
of putty might be a very unreliable instrument of measurement. One minute it could measure some
known 12-inch standard as 12 inches, the next minute it could measure it as 14 inches, and a week
later it could measure it as 18 inches. One way of estimating the reliability of a measuring instrument
is by using the same instrument to measure the same thing at two points in time. In psychometric
parlance, this approach to reliability evaluation is called the test-retest method, and the result of
such an evaluation is an estimate of test-retest reliability.
Test-retest reliability is an estimate of reliability obtained by correlating pairs of scores from the
same people on two different administrations of the same test. The test-retest measure is
appropriate when evaluating the reliability of a test that purports to measure something that is
relatively stable over time, such as a personality trait. If the characteristic being measured is
assumed to fluctuate over time, then there would be little sense in assessing the reliability of the
test using the test-retest method. As time passes, people change. For example, people may learn
new things, forget some things, and acquire new skills. It is generally the case (although there are
exceptions) that, as the time interval between administrations of the same test increases, the
correlation between the scores obtained on each testing decreases. The passage of time can be a
source of error variance. The longer the time that passes, the greater the likelihood that the
reliability coefficient will be lower. When the interval between testing is greater than six months, the
estimate of test-retest reliability is often referred to as the coefficient of stability. A low estimate of
test-retest reliability might be found even when the interval between testings is relatively brief. This
may well be the case when the testings occur during a time of great developmental change with respect to the variables the test is designed to assess. An evaluation of a test-retest reliability
coefficient must therefore extend beyond the magnitude of the obtained coefficient. If we are to
come to proper conclusions about the reliability of the measuring instrument, evaluation of a test-
retest reliability estimate must extend to a consideration of possible intervening factors between
test administrations.
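A minimal sketch of how a test-retest reliability estimate might be computed follows, assuming hypothetical scores from ten test takers on two administrations of the same test.

```python
# A minimal sketch of a test-retest reliability estimate: correlate pairs of
# scores from the same people on two administrations of the same test.
# The scores below are hypothetical.
import numpy as np

time_1 = np.array([24, 31, 18, 27, 35, 22, 29, 33, 20, 26])  # first administration
time_2 = np.array([26, 30, 20, 25, 36, 21, 31, 32, 19, 27])  # second administration

# The Pearson r between the two sets of scores serves as the reliability estimate.
r_test_retest = np.corrcoef(time_1, time_2)[0, 1]
print(f"Test-retest reliability estimate: {r_test_retest:.2f}")
```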
An estimate of split-half reliability is obtained by correlating two pairs of scores obtained from
equivalent halves of a single test administered once. It is a useful measure of reliability when it is
impractical or undesirable to assess reliability with two tests or to administer a test twice (because
of factors such as time or expense). The computation of a coefficient of split-half reliability generally
entails three steps:
Step 1. Divide the test into equivalent halves.
Step 2. Calculate a Pearson r between scores on the two halves of the test.
Step 3. Adjust the half-test reliability using the Spearman-Brown formula.
When it comes to calculating split-half reliability coefficients, there’s more than one way to split a
test—but there are some ways you should never split a test. Simply dividing the test in the middle is
not recommended because it’s likely this procedure would spuriously raise or lower the reliability
coefficient. Different amounts of fatigue for the first as opposed to the second part of the test,
different amounts of test anxiety, and differences in item difficulty as a function of placement in the
test are all factors to consider. One acceptable way to split a test is to randomly assign items to one
or the other half of the test. Another acceptable way to split a test is to assign odd-numbered items
to one half of the test and even-numbered items to the other half. This method yields an estimate of
split-half reliability that is also referred to as odd-even reliability. Yet another way to split a test is to
divide the test by content so that each half contains items equivalent with respect to content and
difficulty. In general, a primary objective in splitting a test in half for the purpose of obtaining a split-
half reliability estimate is to create what might be called “mini-parallel-forms,” with each half equal
to the other—or as nearly equal as humanly possible—in format, stylistic, statistical, and related
aspects. Step 2 in the procedure entails the computation of a Pearson r, which requires little
explanation at this point. However, the third step requires the use of the Spearman-Brown formula.
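A minimal sketch of Steps 1 and 2 under the odd-even split described above is shown below; the item-level scores are hypothetical, and the resulting half-test correlation still needs the Spearman-Brown adjustment (Step 3) discussed next.

```python
# A minimal sketch of Steps 1 and 2 under the odd-even split: the item-level
# scores below are hypothetical (rows = test takers, columns = items scored 0/1).
import numpy as np

item_scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
])

# Step 1: assign odd-numbered items to one half and even-numbered items to the other.
odd_half = item_scores[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7, 9
even_half = item_scores[:, 1::2].sum(axis=1)  # items 2, 4, 6, 8, 10

# Step 2: Pearson r between scores on the two halves; this half-test correlation
# still needs the Spearman-Brown adjustment (Step 3) to estimate full-test reliability.
r_halves = np.corrcoef(odd_half, even_half)[0, 1]
print(f"Correlation between odd and even halves: {r_halves:.2f}")
```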
The Spearman-Brown formula allows a test developer or user to estimate internal consistency
reliability from a correlation of two halves of a test. It is a specific application of a more general
formula to estimate the reliability of a test that is lengthened or shortened by any number of items.
Because the reliability of a test is affected by its length, a formula is necessary for estimating the
reliability of a test that has been shortened or lengthened. The general Spearman-Brown formula (r_SB) is

r_SB = (n × r_xy) / [1 + (n − 1) × r_xy]

where r_SB is the reliability adjusted by the Spearman-Brown formula, r_xy is the Pearson r obtained on the original-length test, and n is the number of items in the revised version divided by the number of items in the original version. A Spearman-
Brown formula could also be used to determine the number of items needed to attain a desired level
of reliability.
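A minimal sketch of the Spearman-Brown adjustment follows, using a hypothetical half-test correlation; the second function shows the rearrangement sometimes used to estimate how much a test would have to be lengthened to reach a desired reliability.

```python
# A minimal sketch of the Spearman-Brown adjustment (hypothetical values):
#   r_SB = (n * r_xy) / (1 + (n - 1) * r_xy)
def spearman_brown(r_xy: float, n: float) -> float:
    """Reliability of a test lengthened (or shortened) by a factor of n."""
    return (n * r_xy) / (1 + (n - 1) * r_xy)

# Step 3 of the split-half procedure: adjust a half-test correlation back to
# full length, so n = 2.
r_half = 0.70
print(f"Split-half reliability: {spearman_brown(r_half, n=2):.2f}")  # about .82

# Rearranging the formula gives the lengthening factor n needed to move from a
# current reliability r_xy to a desired reliability r_desired:
#   n = r_desired * (1 - r_xy) / (r_xy * (1 - r_desired))
def lengthening_factor(r_xy: float, r_desired: float) -> float:
    return (r_desired * (1 - r_xy)) / (r_xy * (1 - r_desired))

print(f"Lengthening factor to reach .90 from .70: {lengthening_factor(0.70, 0.90):.2f}")  # about 3.86
```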