Course Data Analysis For Social Science Teachers Topic Module Id 1.3
Learning Objectives:
1. Defining reliability, including the different types and how they are assessed;
2. Defining validity, including the different types and how they are assessed; and
3. Describing kinds of evidence that would be relevant to assessing the reliability and
validity of a particular measure.
As an informal example, imagine that you have been dieting for a month. Your clothes seem
to be fitting more loosely, and several friends have asked if you have lost weight. If at this point
your bathroom scale indicated that you had lost 10 pounds, this would make sense and you
would continue to use the scale. But if it indicated that you had gained 10 pounds, you would
rightly conclude that it was broken and either fix it or get rid of it. In evaluating a measurement
method, psychologists consider two general dimensions: reliability and validity.
RELIABILITY
Test-Retest Reliability
When researchers measure a construct that they assume to be consistent across time, the scores
they obtain should also be consistent across time. Test-retest reliability is the extent to which
this is actually the case. For example, intelligence is generally thought to be consistent across
time. A person who is highly intelligent today will be highly intelligent next week. This means
that any good measure of intelligence should produce roughly the same scores for this
individual next week as it does today. Clearly, a measure that produces highly inconsistent
scores over time cannot be a very good measure of a construct that is supposed to be consistent.
Assessing test-retest reliability requires using the measure on a group of people at one time,
using it again on the same group of people at a later time, and then looking at the test-retest
correlation between the two sets of scores. This is typically done by graphing the data in
a scatterplot and computing the correlation coefficient. Figure 3.1 shows the correlation
between two sets of scores of several university students on the Rosenberg Self-Esteem Scale,
administered two times, a week apart. The correlation coefficient for these data is +.95. In
general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.
Figure 3.1 Test-Retest Correlation between Two Sets of Scores of Several College Students
on the Rosenberg Self-Esteem Scale, Administered Two Times, a Week Apart
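If you would like to try this computation yourself, the following is a minimal sketch in Python; the scores are hypothetical, invented purely for illustration.

```python
# Minimal sketch (hypothetical data): test-retest reliability is the
# correlation between scores from two administrations of the same measure.
import numpy as np

# Invented self-esteem scores for the same ten students, one week apart.
time1 = np.array([32, 25, 38, 30, 22, 35, 28, 36, 27, 33])
time2 = np.array([33, 24, 37, 31, 23, 34, 29, 35, 26, 34])

# Pearson correlation coefficient between the two sets of scores.
r = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest correlation: r = {r:+.2f}")
# A value of +.80 or greater is generally taken to indicate good reliability.
```

Plotting time1 against time2 with any graphing tool reproduces the kind of scatterplot shown in Figure 3.1.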
Again, high test-retest correlations make sense when the construct being measured is assumed
to be consistent over time, which is the case for intelligence, self-esteem, and the Big Five
personality dimensions. But other constructs are not assumed to be stable over time. The very
nature of mood, for example, is that it changes. So a measure of mood that produced a low test-
retest correlation over a period of a month would not be a cause for concern.
Internal Consistency
A second kind of reliability is internal consistency, which is the extent to which responses are
consistent across the items on a multiple-item measure. In general, all the items on such a
measure are supposed to reflect the same underlying construct, so people's scores on those
items should be correlated with each other. If people's responses to the different items are
not correlated with each other, then it
would no longer make sense to claim that they are all measuring the same underlying construct.
This is as true for behavioral and physiological measures as for self-report measures. For
example, people might make a series of bets in a simulated game of roulette as a measure of
their level of risk-seeking. This measure would be internally consistent to the extent that
individual participants’ bets were consistently high or low across trials.
Like test-retest reliability, internal consistency can only be assessed by collecting and analyzing
data. One approach is to look at a split-half correlation. This involves splitting the items into
two sets, such as the first and second halves of the items or the even- and odd-numbered items.
Then a score is computed for each set of items, and the relationship between the two sets of
scores is examined. For example, Figure 3.2 shows the split-half correlation between several
university students’ scores on the even-numbered items and their scores on the odd-numbered
items of the Rosenberg Self-Esteem Scale. The correlation coefficient for these data is +.88. A
split-half correlation of +.80 or greater is generally considered good internal consistency.
Figure 3.2 Split-Half Correlation between Several College Students’ Scores on the Even-
Numbered Items and their Scores on the Odd-Numbered Items of the Rosenberg Self-Esteem
Scale
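A similar sketch, again with invented data, shows how a split-half correlation can be computed from a participants-by-items response matrix:

```python
# Minimal sketch (hypothetical data): split-half correlation for a
# 10-item measure, splitting the items into odd- and even-numbered halves.
import numpy as np

rng = np.random.default_rng(0)
# Each participant has a "true" level of the construct; their responses
# to the 10 items scatter around it (ratings clipped to a 1-4 scale).
true_level = rng.normal(3.0, 0.6, size=20)
responses = np.clip(np.round(true_level[:, None]
                             + rng.normal(0.0, 0.4, size=(20, 10))), 1, 4)

# Item numbering is 1-based, so odd-numbered items sit at even indices.
odd_half = responses[:, 0::2].sum(axis=1)   # items 1, 3, 5, 7, 9
even_half = responses[:, 1::2].sum(axis=1)  # items 2, 4, 6, 8, 10

r = np.corrcoef(odd_half, even_half)[0, 1]
print(f"Split-half correlation: r = {r:+.2f}")
```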
Perhaps the most common measure of internal consistency used by researchers in psychology
is a statistic called Cronbach’s α (the Greek letter alpha). Conceptually, α is the mean of all
possible split-half correlations for a set of items. For example, there are 252 ways to split a set
of 10 items into two sets of five. Cronbach’s α would be the mean of the 252 split-half
correlations. Note that this is not how α is actually computed, but it is a correct way of
interpreting the meaning of this statistic. Again, a value of +.80 or greater is generally taken to
indicate good internal consistency.
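Although statistical packages compute α directly, the standard variance-based formula is easy to express. Here is a minimal sketch, with a tiny invented score matrix for illustration:

```python
# Minimal sketch: Cronbach's alpha from a participants x items matrix,
# using the standard formula
#   alpha = k/(k-1) * (1 - sum(item variances) / variance of total scores).
import numpy as np

def cronbach_alpha(scores):
    """scores: 2-D array with one row per participant, one column per item."""
    k = scores.shape[1]                              # number of items
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of sum scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Example with a tiny invented matrix (4 participants, 3 items):
scores = np.array([[3, 3, 4],
                   [2, 1, 2],
                   [4, 4, 4],
                   [3, 2, 3]])
print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")
```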
Inter-rater Reliability
Many behavioral measures involve significant judgment on the part of an observer or a rater.
Inter-rater reliability is the extent to which different observers are consistent in their
judgments. For example, if you are interested in measuring university students' social skills,
you could make video recordings of them as they interact with other students whom they are
meeting for the first time. Then you could have two or more observers watch the videos and
rate each student's level of social skills. To the extent that each participant does, in fact, have
some level of social skills that can be detected by an attentive observer, different observers'
ratings should be highly correlated with each other. Inter-rater reliability would also have been
measured in Bandura’s Bobo doll study. In this case, the observers’ ratings of how many acts
of aggression a particular child committed while playing with the Bobo doll should have been
highly positively correlated. Inter-rater reliability is often assessed using Cronbach’s α when
the judgments are quantitative or an analogous statistic called Cohen’s κ (the Greek letter
kappa) when they are categorical.
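For categorical judgments, Cohen's κ compares the observed agreement between two raters with the agreement expected by chance. A minimal sketch, with invented codings:

```python
# Minimal sketch: Cohen's kappa = (p_o - p_e) / (1 - p_e), where p_o is
# observed agreement and p_e is the agreement expected by chance.
from collections import Counter

def cohens_kappa(rater1, rater2):
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    counts1, counts2 = Counter(rater1), Counter(rater2)
    p_e = sum((counts1[c] / n) * (counts2[c] / n)
              for c in set(rater1) | set(rater2))
    return (p_o - p_e) / (1 - p_e)

# Invented example: two observers code eight acts as aggressive or not.
rater1 = ["agg", "agg", "not", "agg", "not", "not", "agg", "not"]
rater2 = ["agg", "not", "not", "agg", "not", "agg", "agg", "not"]
print(f"Cohen's kappa = {cohens_kappa(rater1, rater2):.2f}")  # 0.50
```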
VALIDITY
Validity is the extent to which the scores from a measure represent the variable they are
intended to. But how do researchers make this judgment? We have already considered one
factor that they take into account — reliability. When a measure has good test-retest reliability
and internal consistency, researchers should be more confident that the scores represent what
they are supposed to. There has to be more to it, however, because a measure can be extremely
reliable but have no validity whatsoever. As an absurd example, imagine someone who believes
that people’s index finger length reflects their self-esteem and therefore tries to measure self-
esteem by holding a ruler up to people’s index fingers. Although this measure would have
extremely good test-retest reliability, it would have absolutely no validity. The fact that one
person’s index finger is a centimeter longer than another’s would indicate nothing about which
one has higher self-esteem.
Discussions of validity are usually divided into several distinct “types.” But a good way to
interpret these types is that there are other kinds of evidence—in addition to reliability—that
should be taken into account when judging the validity of a measure. Here we consider three
basic kinds: face validity, content validity, and criterion validity.
Face Validity
Face validity is the extent to which a measurement method appears “on its face” to measure
the construct of interest. Most people would expect a self-esteem questionnaire to include items
about whether they see themselves as a person of worth and whether they think they have good
qualities. So a questionnaire that included these kinds of items would have good face validity.
The finger-length method of measuring self-esteem, on the other hand, seems to have nothing
to do with self-esteem and therefore has poor face validity. Although face validity can be
assessed quantitatively—for example, by having a large sample of people rate a measure in
terms of whether it appears to measure what it is intended to—it is usually assessed informally.
Face validity is at best a very weak kind of evidence that a measurement method is measuring
what it is supposed to. One reason is that it is based on people’s intuitions about human
behavior, which are frequently wrong. It is also the case that many established measures in
psychology work quite well despite lacking face validity. The Minnesota Multiphasic
Personality Inventory-2 (MMPI-2) measures many personality characteristics and disorders by
having people decide whether each of 567 different statements applies to them—where
many of the statements do not have any obvious relationship to the construct that they measure.
For example, the items “I enjoy detective or mystery stories” and “The sight of blood doesn’t
frighten me or make me sick” both measure the suppression of aggression. In this case, it is not
the participants’ literal answers to these questions that are of interest, but rather whether the
pattern of the participants’ responses to a series of questions matches those of individuals who
tend to suppress their aggression.
Content Validity
Content validity is the extent to which a measure “covers” the construct of interest. For
example, if a researcher conceptually defines test anxiety as involving both sympathetic
nervous system activation (leading to nervous feelings) and negative thoughts, then the
measure of test anxiety should include items about both nervous feelings and negative thoughts.
Or consider that attitudes are usually defined as involving thoughts, feelings, and
actions toward something. By this conceptual definition, a person has a positive attitude toward
exercise to the extent that he or she thinks positive thoughts about exercising, feels good about
exercising and actually exercises. So to have good content validity, a measure of people’s
attitudes toward exercise would have to reflect all three of these aspects. Like face validity,
content validity is not usually assessed quantitatively. Instead, it is assessed by carefully
checking the measurement method against the conceptual definition of the construct.
Criterion Validity
Criterion validity is the extent to which people’s scores on a measure are correlated with other
variables (known as criteria) that one would expect them to be correlated with. For example,
people’s scores on a new measure of test anxiety should be negatively correlated with their
performance on an important school exam. If it were found that people’s scores are in fact
negatively correlated with their exam performance, then this would be a piece of evidence that
these scores really represent people’s test anxiety. But if it were found that people score equally
well on the exam regardless of their test anxiety scores, then this would cast doubt on the
validity of the measure.
A criterion can be any variable that one has a reason to think should be correlated with the
construct being measured, and there will usually be many of them. For example, one would
expect test anxiety scores to be negatively correlated with exam performance and course grades
and positively correlated with general anxiety and with blood pressure during an exam. Or
imagine that a researcher develops a new measure of physical risk-taking. People’s scores on
this measure should be correlated with their participation in “extreme” activities such as
snowboarding and rock climbing, the number of speeding tickets they have received, and even
the number of broken bones they have had over the years. When the criterion is measured at
the same time as the construct, criterion validity is referred to as concurrent validity; however,
when the criterion is measured at some point in the future (after the construct has been
measured), it is referred to as predictive validity (because scores on the measure have
“predicted” a future outcome).
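As a concrete sketch of this logic, the code below correlates invented scores on a new test-anxiety measure with two invented criteria and checks whether each correlation has the expected sign:

```python
# Minimal sketch (hypothetical data): criterion validity as the pattern
# of correlations between a new measure and its expected criteria.
import numpy as np

rng = np.random.default_rng(1)
n = 50
anxiety = rng.normal(50, 10, n)  # invented scores on a new test-anxiety measure

criteria = {
    # criterion name: (invented scores, expected sign of the correlation)
    "exam performance": (80 - 0.5 * anxiety + rng.normal(0, 8, n), "negative"),
    "general anxiety":  (20 + 0.6 * anxiety + rng.normal(0, 10, n), "positive"),
}
for name, (scores, expected) in criteria.items():
    r = np.corrcoef(anxiety, scores)[0, 1]
    print(f"{name}: r = {r:+.2f} (expected: {expected})")
```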
Criteria can also include other measures of the same construct. For example, one would expect
new measures of test anxiety or physical risk-taking to be positively correlated with existing
established measures of the same constructs. This is known as convergent validity.
Assessing convergent validity requires collecting data using the measure. Researchers John
Cacioppo and Richard Petty did this when they created their self-report Need for Cognition
Scale to measure how much people value and engage in thinking (Cacioppo & Petty,
1982)[1]. In a series of studies, they showed that people’s scores were positively correlated with
their scores on a standardized academic achievement test and that their scores were negatively
correlated with their scores on a measure of dogmatism (a tendency toward rigid, closed-minded
thinking). In the years since it was created, the Need for Cognition Scale has been used in
literally hundreds of studies and has been shown to be correlated with a wide variety of other
variables, including the effectiveness of an advertisement, interest in politics, and juror
decisions (Petty, Briñol, Loersch, & McCaslin, 2009)[2].
Discriminant Validity
Discriminant validity, on the other hand, is the extent to which scores on a measure
are not correlated with measures of variables that are conceptually distinct. For example, self-
esteem is a general attitude toward the self that is fairly stable over time. It is not the same as
mood, which is how good or bad one happens to be feeling right now. So people’s scores on a
new measure of self-esteem should not be very highly correlated with their moods. If the new
measure of self-esteem were highly correlated with a measure of mood, it could be argued that
the new measure is not really measuring self-esteem; it is measuring mood instead.
When they created the Need for Cognition Scale, Cacioppo and Petty also provided evidence
of discriminant validity by showing that people’s scores were not correlated with certain other
variables. For example, they found only a weak correlation between people’s need for cognition
and a measure of their cognitive style—the extent to which they tend to think analytically by
breaking ideas into smaller parts or holistically in terms of “the big picture.” They also found
no correlation between people’s need for cognition and measures of their test anxiety and their
tendency to respond in socially desirable ways. All these low correlations provide evidence
that the measure is reflecting a conceptually distinct construct.
Self-Assessment:
1. Practice: Ask several friends to complete the Rosenberg Self-Esteem Scale. Then
assess its internal consistency by making a scatter plot to show the split-half
correlation (even- vs. odd-numbered items). Compute the correlation coefficient too if
you know how.
2. Discussion: Think back to the last college exam you took and think of the exam as a
psychological measure. What construct do you think it was intended to measure?
Comment on its face and content validity. What data could you collect to assess its
reliability and criterion validity?
Note: The above content is based on “Reliability and Validity of Measurement” by Paul C.
Price, Rajiv Jhangiani, I-Chant A. Chiang, Dana C. Leighton, & Carrie Cuttler, which is
licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
License, except where otherwise noted.
References:
1. Cacioppo, J. T., & Petty, R. E. (1982). The need for cognition. Journal of Personality
and Social Psychology, 42, 116–131.
2. Petty, R. E., Briñol, P., Loersch, C., & McCaslin, M. J. (2009). The need for cognition.
In M. R. Leary & R. H. Hoyle (Eds.), Handbook of individual differences in social
behavior (pp. 318–329). New York, NY: Guilford Press.