Reliability and Validity of Measurement - Research Methods in Psychology - 2nd Canadian Edition

3/1/23, 12:18 AM Reliability and Validity of Measurement – Research Methods in Psychology – 2nd Canadian Edition
RESEARCH METHODS IN PSYCHOLOGY – 2ND CANADIAN EDITION
Chapter 5: Psychological Measurement
Reliability and Validity of Measurement
Learning Objectives
1. Define reliability, including the different types and how they are assessed.
2. Define validity, including the different types and how they are assessed.
3. Describe the kinds of evidence that would be relevant to assessing the reliability
and validity of a particular measure.
Again, measurement involves assigning scores to individuals so that they represent some char-
acteristic of the individuals. But how do researchers know that the scores actually represent the
characteristic, especially when it is a construct like intelligence, self-esteem, depression, or
working memory capacity? The answer is that they conduct research using the measure to con-
firm that the scores make sense based on their understanding of the construct being measured.
This is an extremely important point. Psychologists do not simply assume that their measures
work. Instead, they collect data to demonstrate that they work. If their research does not
demonstrate that a measure works, they stop using it.
As an informal example, imagine that you have been dieting for a month. Your clothes seem to
be fitting more loosely, and several friends have asked if you have lost weight. If at this point
your bathroom scale indicated that you had lost 10 pounds, this would make sense and you
would continue to use the scale. But if it indicated that you had gained 10 pounds, you would
rightly conclude that it was broken and either fix it or get rid of it. In evaluating a measurement
method, psychologists consider two general dimensions: reliability and validity.
Reliability
Reliability refers to the consistency of a measure. Psychologists consider three types of
consistency: over time (test-retest reliability), across items (internal consistency), and across
different researchers (inter-rater reliability).
Test-Retest Reliability
When researchers measure a construct that they assume to be consistent across time, then the
scores they obtain should also be consistent across time. Test-retest reliability is the extent to
which this is actually the case. For example, intelligence is generally thought to be consistent
across time. A person who is highly intelligent today will be highly intelligent next week. This
means that any good measure of intelligence should produce roughly the same scores for this
individual next week as it does today. Clearly, a measure that produces highly inconsistent
scores over time cannot be a very good measure of a construct that is supposed to be
consistent.
Assessing test-retest reliability requires using the measure on a group of people at one time,
using it again on the same group of people at a later time, and then looking at
test-retest correlation between the two sets of scores. This is typically done by graphing the
data in a scatterplot and computing Pearson’s r. Figure 5.2 shows the correlation between two
sets of scores of several university students on the Rosenberg Self-Esteem Scale, administered
two times, a week apart. Pearson’s r for these data is +.95. In general, a test-retest correlation
of +.80 or greater is considered to indicate good reliability.
Figure 5.2 Test-Retest Correlation Between Two Sets of Scores of Several College Students on the
Rosenberg Self-Esteem Scale, Given Two Times a Week Apart
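As a rough illustration of the computation described above, the following sketch calculates Pearson’s r for two administrations of a scale. The six pairs of scores are invented for illustration and are not the data plotted in Figure 5.2; the helper name pearson_r is likewise an assumption of this sketch.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

# Hypothetical total scores for six students, measured one week apart.
time1 = [22, 25, 18, 15, 24, 20]
time2 = [23, 24, 17, 16, 25, 19]

test_retest_r = pearson_r(time1, time2)
print(f"test-retest r = {test_retest_r:+.2f}")
print("meets the +.80 guideline:", test_retest_r >= 0.80)
```

With real data, the same two lists would simply hold each person’s score at the first and second administration, in the same order.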
Again, high test-retest correlations make sense when the construct being measured is assumed
to be consistent over time, which is the case for intelligence, self-esteem, and the Big Five per-
sonality dimensions. But other constructs are not assumed to be stable over time. The very na-
ture of mood, for example, is that it changes. So a measure of mood that produced a low test-
retest correlation over a period of a month would not be a cause for concern.
Internal Consistency
A second kind of reliability is internal consistency, which is the consistency of people’s
responses across the items on a multiple-item measure. In general, all the items on such measures
are supposed to reflect the same underlying construct, so people’s scores on those items should
be correlated with each other. On the Rosenberg Self-Esteem Scale, people who agree that they
are a person of worth should tend to agree that they have a number of good qualities. If
people’s responses to the different items are not correlated with each other, then it would no
longer make sense to claim that they are all measuring the same underlying construct. This is
as true for behavioural and physiological measures as for self-report measures. For example,
people might make a series of bets in a simulated game of roulette as a measure of their level
of risk seeking. This measure would be internally consistent to the extent that individual
participants’ bets were consistently high or low across trials.
Like test-retest reliability, internal consistency can only be assessed by collecting and analyz-
ing data. One approach is to look at a split-half correlation. This involves splitting the items
into two sets, such as the first and second halves of the items or the even- and odd-numbered
items. Then a score is computed for each set of items, and the relationship between the two sets
of scores is examined. For example, Figure 5.3 shows the split-half correlation between several
university students’ scores on the even-numbered items and their scores on the odd-numbered
items of the Rosenberg Self-Esteem Scale. Pearson’s r for these data is +.88. A split-half corre-
lation of +.80 or greater is generally considered good internal consistency.
Figure 5.3 Split-Half Correlation Between Several College Students’ Scores on the Even-Numbered
Items and Their Scores on the Odd-Numbered Items of the Rosenberg Self-Esteem Scale
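The split-half procedure described above can be sketched in a few lines. The response matrix here is hypothetical (six people, ten items scored 0 to 3, with any reverse-worded items assumed to be already recoded), so the resulting correlation is not the +.88 shown in Figure 5.3.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

# Hypothetical data: one row per person, one column per item (items 1-10).
responses = [
    [3, 3, 2, 3, 3, 2, 3, 3, 3, 2],
    [2, 2, 2, 1, 2, 2, 3, 2, 2, 2],
    [1, 1, 0, 1, 1, 2, 1, 1, 0, 1],
    [3, 2, 3, 3, 2, 3, 3, 3, 2, 3],
    [0, 1, 1, 0, 1, 0, 1, 1, 0, 0],
    [2, 1, 2, 2, 1, 2, 2, 1, 2, 2],
]

# Items 1, 3, 5, ... sit at list positions 0, 2, 4, ...; items 2, 4, 6, ... at 1, 3, 5, ...
odd_scores = [sum(row[0::2]) for row in responses]
even_scores = [sum(row[1::2]) for row in responses]

split_half_r = pearson_r(odd_scores, even_scores)
print(f"split-half r = {split_half_r:+.2f}")
```

Splitting by odd and even items, rather than first half and second half, is one common choice because it spreads any order effects (such as fatigue) across both halves.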
Perhaps the most common measure of internal consistency used by researchers in psychology
is a statistic called Cronbach’s α (the Greek letter alpha). Conceptually, α is the mean of all
possible split-half correlations for a set of items. For example, there are 252 ways to choose
five of 10 items for one half (126 distinct splits, because each split is counted twice).
Cronbach’s α would be the mean of these split-half correlations. Note that this is not how α is
actually computed, but it is a correct way of interpreting the
meaning of this statistic. Again, a value of +.80 or greater is generally taken to indicate good
internal consistency.
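To make the interpretation of α more concrete, the sketch below computes α with the usual variance-based formula and then compares it with the mean over all equal-sized splits. Two assumptions should be kept in mind: the responses are invented, and the split-half value used is a variance-based split-half coefficient (often attributed to Flanagan and Rulon) rather than the raw Pearson correlation between halves, because it is for this coefficient that the equality with α holds exactly.

```python
from itertools import combinations
from statistics import mean, variance

# Hypothetical data: one row per person, one column per item (10 items).
responses = [
    [3, 3, 2, 3, 3, 2, 3, 3, 3, 2],
    [2, 2, 2, 1, 2, 2, 3, 2, 2, 2],
    [1, 1, 0, 1, 1, 2, 1, 1, 0, 1],
    [3, 2, 3, 3, 2, 3, 3, 3, 2, 3],
    [0, 1, 1, 0, 1, 0, 1, 1, 0, 0],
    [2, 1, 2, 2, 1, 2, 2, 1, 2, 2],
]

k = len(responses[0])                      # number of items
totals = [sum(row) for row in responses]   # each person's total score

# Standard formula: alpha = k/(k-1) * (1 - sum of item variances / variance of totals).
item_variances = [variance([row[i] for row in responses]) for i in range(k)]
alpha = k / (k - 1) * (1 - sum(item_variances) / variance(totals))

def split_half_coefficient(first_half):
    """Variance-based split-half coefficient for one split of the items."""
    a = [sum(row[i] for i in first_half) for row in responses]
    b = [t - s for t, s in zip(totals, a)]  # score on the remaining items
    return 2 * (1 - (variance(a) + variance(b)) / variance(totals))

# Every way of choosing 5 of the 10 items for the first half.
halves = list(combinations(range(k), k // 2))
mean_split_half = mean(split_half_coefficient(h) for h in halves)

print(f"alpha = {alpha:.3f}, mean over {len(halves)} splits = {mean_split_half:.3f}")
```

In practice only the first formula is used; enumerating the splits is shown here purely to illustrate what the statistic means.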
Interrater Reliability
Many behavioural measures involve significant judgment on the part of an observer or a rater.
Inter-rater reliability is the extent to which different observers are consistent in their
judgments. For example, if you were interested in measuring university students’ social skills, you
could make video recordings of them as they interacted with another student whom they are
meeting for the first time. Then you could have two or more observers watch the videos and
rate each student’s level of social skills. To the extent that each participant does in fact have
some level of social skills that can be detected by an attentive observer, different observers’
ratings should be highly correlated with each other. Inter-rater reliability would also have been
measured in Bandura’s Bobo doll study. In this case, the observers’ ratings of how many acts
of aggression a particular child committed while playing with the Bobo doll should have been
highly positively correlated. Interrater reliability is often assessed using Cronbach’s α when the
judgments are quantitative or an analogous statistic called Cohen’s κ (the Greek letter kappa)
when they are categorical.
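For categorical judgments, Cohen’s κ compares the observed proportion of agreement between two raters with the agreement expected by chance from each rater’s category frequencies. The following is a minimal sketch; the function name cohens_kappa and the eight codings are hypothetical and do not come from any study described here.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who each assign one category per case."""
    n = len(rater_a)
    # Proportion of cases on which the two raters chose the same category.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over categories.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical codings of eight video clips by two observers.
rater_a = ["aggressive", "neutral", "aggressive", "neutral",
           "neutral", "aggressive", "neutral", "neutral"]
rater_b = ["aggressive", "neutral", "neutral", "neutral",
           "neutral", "aggressive", "neutral", "aggressive"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

Note that the two raters agree on six of eight clips (75%), yet κ is considerably lower because much of that agreement would be expected by chance alone.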
Validity
Validity is the extent to which the scores from a measure represent the variable they are in-
tended to. But how do researchers make this judgment? We have already considered one factor
that they take into account—reliability. When a measure has good test-retest reliability and in-
ternal consistency, researchers should be more confident that the scores represent what they are
supposed to. There has to be more to it, however, because a measure can be extremely reliable
but have no validity whatsoever. As an absurd example, imagine someone who believes that
people’s index finger length reflects their self-esteem and therefore tries to measure self-esteem
by holding a ruler up to people’s index fingers. Although this measure would have extremely
good test-retest reliability, it would have absolutely no validity. The fact that one person’s in-
dex finger is a centimetre longer than another’s would indicate nothing about which one had
higher self-esteem.
Discussions of validity usually divide it into several distinct “types.” But a good way to inter-
pret these types is that they are other kinds of evidence—in addition to reliability—that should
be taken into account when judging the validity of a measure. Here we consider three basic
kinds: face validity, content validity, and criterion validity.
Face Validity
Face validity is the extent to which a measurement method appears “on its face” to measure
the construct of interest. Most people would expect a self-esteem questionnaire to include
items about whether they see themselves as a person of worth and whether they think they have
good qualities. So a questionnaire that included these kinds of items would have good face
validity. The finger-length method of measuring self-esteem, on the other hand, seems to have
nothing to do with self-esteem and therefore has poor face validity. Although face validity can
be assessed quantitatively (for example, by having a large sample of people rate a measure in
terms of whether it appears to measure what it is intended to), it is usually assessed informally.
Face validity is at best a very weak kind of evidence that a measurement method is measuring
what it is supposed to. One reason is that it is based on people’s intuitions about human be-
haviour, which are frequently wrong. It is also the case that many established measures in psy-
chology work quite well despite lacking face validity. The Minnesota Multiphasic Personality
Inventory-2 (MMPI-2) measures many personality characteristics and disorders by having people
decide whether each of 567 different statements applies to them, where many of the
statements do not have any obvious relationship to the construct that they measure. For example,
the items “I enjoy detective or mystery stories” and “The sight of blood doesn’t frighten
me or make me sick” both measure the suppression of aggression. In this case, it is not the par-
ticipants’ literal answers to these questions that are of interest, but rather whether the pattern of
the participants’ responses to a series of questions matches those of individuals who tend to
suppress their aggression.
Content Validity
Content validity is the extent to which a measure “covers” the construct of interest. For exam-
ple, if a researcher conceptually defines test anxiety as involving both sympathetic nervous
system activation (leading to nervous feelings) and negative thoughts, then his measure of test
anxiety should include items about both nervous feelings and negative thoughts. Or consider
that attitudes are usually defined as involving thoughts, feelings, and actions toward some-
thing. By this conceptual definition, a person has a positive attitude toward exercise to the ex-
tent that he or she thinks positive thoughts about exercising, feels good about exercising, and
actually exercises. So to have good content validity, a measure of people’s attitudes toward ex-
ercise would have to reflect all three of these aspects. Like face validity, content validity is not
usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement
method against the conceptual definition of the construct.
Criterion Validity
Criterion validity is the extent to which people’s scores on a measure are correlated with oth-
er variables (known as criteria) that one would expect them to be correlated with. For exam-
ple, people’s scores on a new measure of test anxiety should be negatively correlated with their
performance on an important school exam. If it were found that people’s scores were in fact
negatively correlated with their exam performance, then this would be a piece of evidence that
these scores really represent people’s test anxiety. But if it were found that people scored
equally well on the exam regardless of their test anxiety scores, then this would cast doubt on
the validity of the measure.
A criterion can be any variable that one has reason to think should be correlated with the con-
struct being measured, and there will usually be many of them. For example, one would expect
test anxiety scores to be negatively correlated with exam performance and course grades and
positively correlated with general anxiety and with blood pressure during an exam. Or imagine
that a researcher develops a new measure of physical risk taking. People’s scores on this mea-
sure should be correlated with their participation in “extreme” activities such as snowboarding
and rock climbing, the number of speeding tickets they have received, and even the number of
broken bones they have had over the years. When the criterion is measured at the same time as
the construct, criterion validity is referred to as concurrent validity; however, when the crite-
rion is measured at some point in the future (after the construct has been measured), it is re-
ferred to as predictive validity (because scores on the measure have “predicted” a future
outcome).
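One way to organize criterion evidence is to write down the expected direction of each correlation in advance and then check the observed correlations against it. The sketch below does this for two of the criteria mentioned above; all scores are invented, and the names test_anxiety, criteria, and results are assumptions of this example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

# Hypothetical scores on a new test anxiety measure for six students.
test_anxiety = [12, 30, 22, 8, 27, 18]

# Each criterion: (hypothetical scores, expected sign of its correlation with test anxiety).
criteria = {
    "exam score": ([88, 61, 70, 92, 65, 79], -1),
    "general anxiety": ([10, 24, 20, 9, 22, 13], +1),
}

results = {}
for name, (scores, expected_sign) in criteria.items():
    r = pearson_r(test_anxiety, scores)
    matches = (r > 0) == (expected_sign > 0)   # observed direction agrees with prediction?
    results[name] = (r, matches)
    print(f"{name}: r = {r:+.2f}, direction as expected: {matches}")
```

Whether a given criterion counts as concurrent or predictive evidence depends only on when its scores were collected relative to the measure, not on how the correlation is computed.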
Criteria can also include other measures of the same construct. For example, one would expect
new measures of test anxiety or physical risk taking to be positively correlated with existing
measures of the same constructs. This is known as convergent validity.
Assessing convergent validity requires collecting data using the measure. Researchers John Ca-
cioppo and Richard Petty did this when they created their self-report Need for Cognition Scale
to measure how much people value and engage in thinking (Cacioppo & Petty, 1982)[1]. In a
series of studies, they showed that people’s scores were positively correlated with their scores
on a standardized academic achievement test, and that their scores were negatively correlated
with their scores on a measure of dogmatism (which represents a tendency toward obedience).
In the years since it was created, the Need for Cognition Scale has been used in literally hun-
dreds of studies and has been shown to be correlated with a wide variety of other variables, in-
cluding the effectiveness of an advertisement, interest in politics, and juror decisions (Petty,
Briñol, Loersch, & McCaslin, 2009)[2].
Discriminant Validity
Discriminant validity, on the other hand, is the extent to which scores on a measure are not
correlated with measures of variables that are conceptually distinct. For example, self-esteem is
a general attitude toward the self that is fairly stable over time. It is not the same as mood,
which is how good or bad one happens to be feeling right now. So people’s scores on a new
measure of self-esteem should not be very highly correlated with their moods. If the new measure
of self-esteem were highly correlated with a measure of mood, it could be argued that the
new measure is not really measuring self-esteem; it is measuring mood instead.
When they created the Need for Cognition Scale, Cacioppo and Petty also provided evidence
of discriminant validity by showing that people’s scores were not correlated with certain other
variables. For example, they found only a weak correlation between people’s need for cogni-
tion and a measure of their cognitive style—the extent to which they tend to think analytically
by breaking ideas into smaller parts or holistically in terms of “the big picture.” They also
found no correlation between people’s need for cognition and measures of their test anxiety
and their tendency to respond in socially desirable ways. All these low correlations provide ev-
idence that the measure is reflecting a conceptually distinct construct.
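Convergent and discriminant evidence are often examined side by side: the new measure should correlate strongly with an established measure of the same construct and only weakly with a measure of a distinct one. The sketch below illustrates that contrast with invented scores for six people; none of the numbers come from the Need for Cognition research.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

# Hypothetical scores for the same six people on three measures.
new_self_esteem = [22, 25, 18, 15, 24, 20]          # the measure being evaluated
established_self_esteem = [21, 26, 17, 16, 23, 21]  # existing measure, same construct
current_mood = [4, 5, 5, 4, 3, 6]                   # conceptually distinct construct

convergent_r = pearson_r(new_self_esteem, established_self_esteem)
discriminant_r = pearson_r(new_self_esteem, current_mood)

print(f"convergent r (same construct)       = {convergent_r:+.2f}")
print(f"discriminant r (distinct construct) = {discriminant_r:+.2f}")
```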
Key Takeaways
Psychological researchers do not simply assume that their measures work. Instead,
they conduct research to show that they work. If they cannot show that they work,
they stop using them.
There are two distinct criteria by which researchers evaluate their measures: relia-
bility and validity. Reliability is consistency across time (test-retest reliability),
across items (internal consistency), and across researchers (interrater reliability).
Validity is the extent to which the scores actually represent the variable they are in-
tended to.
Validity is a judgment based on various types of evidence. The relevant evidence
includes the measure’s reliability, whether it covers the construct of interest, and
whether the scores it produces are correlated with other variables they are expected
to be correlated with and not correlated with variables that are conceptually
distinct.
The reliability and validity of a measure are not established by any single study but
by the pattern of results across multiple studies. The assessment of reliability and
validity is an ongoing process.
Exercises

1. Practice: Ask several friends to complete the Rosenberg Self-Esteem Scale. Then
assess its internal consistency by making a scatterplot to show the split-half corre-
lation (even- vs. odd-numbered items). Compute Pearson’s r too if you know how.
2. Discussion: Think back to the last college exam you took and think of the exam as
a psychological measure. What construct do you think it was intended to measure?
Comment on its face and content validity. What data could you collect to assess its
reliability and criterion validity?
1. Cacioppo, J. T., & Petty, R. E. (1982). The need for cognition. Journal of Personality and
Social Psychology, 42, 116–131.
2. Petty, R. E., Briñol, P., Loersch, C., & McCaslin, M. J. (2009). The need for cognition. In
M. R. Leary & R. H. Hoyle (Eds.), Handbook of individual differences in social behaviour
(pp. 318–329). New York, NY: Guilford Press.
Research Methods in Psychology - 2nd Canadian Edition by Paul C. Price, Rajiv Jhangiani, &
I-Chant A. Chiang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0
International License, except where otherwise noted.