Reliability PDF
Reliability
Reliability and validity, the topics of this and the next chapter, are twins and
cannot be completely separated. These two concepts comprise the dual
“holy grail” of research, and outside of the central importance of theory,
they are crucial to any sort of meaningful research. Without reliability and validity,
research is nonsense. Many forms of reliability and of validity have been identified,
perhaps to the point that the words themselves have been stretched thin.
Test-Retest Reliability
Items 1 and 2 would probably have high internal consistency reliability. Items 1, 2,
3, and 4 would probably have moderately high reliability. Item 5 really does not fit
this test and would show little internal consistency reliability with the other items.
Inter-Rater Reliability
A somewhat different sort of reliability is at issue when the same stimulus
(person, event, behavior, etc.) must be rated by more than one rater. For example, in
studies of the relationship between physical attractiveness and social development,
the researchers need to know how attractive the person is. (Research of this kind
asks questions such as, “do prettier people develop better social skills?”) How can
this rating be done? Calculate the ratio of length of nose to distance between the
ears? While some such physical indexes of attractiveness have been developed,
the most common way is to assemble a panel of “judges” to rate the “stimuli.”
(Sounds like figure skating judging, but there is no bribery involved.) The
researcher needs to look at the extent to which the raters agree on their ratings. When
inter-rater reliability is low, the researcher has to wonder if it is possible to classify
persons on a dimension such as attractiveness or whether his or her attempts to
do so have failed.
A Research Example
The researcher wants to know if personality traits are related to sexual
activity. She cannot perform a true experiment because personality traits cannot be
controlled or manipulated, so she must be content with simply measuring the
traits and assessing sexual behavior. The trait of interest is Self-Monitoring, a blend
of extraversion, willingness to self-present, self-presentation abilities, and low social
anxiety. High Self-Monitors are outgoing people who know how to act in various
situations to get what they want, and do so. A measure of sexual behavior is
created just for this study, the “Sexual Activity Test” (SAT).
The Self-Monitoring Scale (SMS) already exists, so she will use it as it is. She must,
however, construct the SAT from scratch. The details of how she would actually
do this are complicated, but in the end she has a 20-item test that assesses
various interpersonal sexual behaviors. The researcher theorizes that, overall, the
SAT measures a “unitary construct.” A unitary construct has one central idea or
dimension rather than several sub-dimensions. For example, some psychologists
believe that IQ may not be a unitary construct, arguing that there are several
different kinds of intelligence. If true, then a single IQ score is meaningless.
To determine if the SAT assesses a unitary construct, the researcher gives the
initial version of the test to a large sample of people who will not be in the real
study, then performs an item analysis. She looks at the internal consistency
reliability of the 20 items to see if they “hang together.” (She also does some other
things that are beyond our interest here.) Coefficient alpha, a common measure
of internal consistency, turns out to be α=.45. This is too low, indicating that
the items are not measuring the same thing. She has two choices: (1) give up on
the idea that the SAT will assess a unitary construct and try to find the two or
more sub-dimensions that represent interpersonal sexual activity; or (2) find the
bad items and get rid of them. She chooses the latter. A bad item is one that is
poorly related to the other items, sort of like a human who refuses to fit into
a social group. To find the “bad” items, she correlates each item with the total
score (the average of all the items). These 20 correlations are termed “item-total
correlations.” She looks for items with a poor relationship to the total score and
eliminates them from the SAT. Then she recalculates coefficient alpha on the new,
smaller test and, hooray, it is now α=.85 (very good). As the Japanese say: “the nail
that sticks up gets pounded down.”
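The item analysis described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated (six items driven by one trait plus two unrelated items), the function names are hypothetical, and the 0.40 cutoff for a "poor" item-total correlation is an arbitrary choice, since the text does not give one. The total score is taken as the average of all the items, as in the description above.

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha for a (respondents x items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the summed score
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

def item_total_correlations(scores):
    """Correlate each item with the total score (mean of all items)."""
    scores = np.asarray(scores, dtype=float)
    total = scores.mean(axis=1)
    return np.array([np.corrcoef(scores[:, j], total)[0, 1]
                     for j in range(scores.shape[1])])

# Simulated (hypothetical) pilot data: 200 respondents, 8 items.
rng = np.random.default_rng(0)
trait = rng.normal(size=(200, 1))
good_items = trait + rng.normal(scale=0.8, size=(200, 6))  # share one trait
bad_items = rng.normal(size=(200, 2))                      # unrelated noise
scores = np.hstack([good_items, bad_items])

alpha_before = cronbach_alpha(scores)
r_item_total = item_total_correlations(scores)

# Drop items that relate poorly to the total (0.40 is an assumed cutoff).
keep = r_item_total >= 0.40
alpha_after = cronbach_alpha(scores[:, keep])

print(f"alpha before: {alpha_before:.2f}")
print("item-total correlations:", np.round(r_item_total, 2))
print(f"alpha after dropping {int((~keep).sum())} item(s): {alpha_after:.2f}")
```

In practice, many analysts use a "corrected" item-total correlation that leaves the item itself out of the total; the version here follows the simpler procedure described in the text.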
Next, she wants to make sure that the SAT is stable over time. She finds still
another sample of people and gives them the SAT twice, one month apart. The
correlation between time 1 and time 2, the test-retest reliability, turns out to be
r=.70. This is very good given the fact that people do change over time, and some
instability in the test is expected for this reason rather than due to the test’s
qualities.
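Test-retest reliability is simply the Pearson correlation between the two administrations. A minimal sketch, assuming each person has one total score per administration; the eight pairs of scores below are made up for illustration, and the function name is hypothetical.

```python
import numpy as np

def retest_correlation(time1, time2):
    """Pearson r between scores from two administrations of the same test."""
    time1 = np.asarray(time1, dtype=float)
    time2 = np.asarray(time2, dtype=float)
    return np.corrcoef(time1, time2)[0, 1]

# Hypothetical total scores for the same 8 people, one month apart.
scores_time1 = [12, 15, 9, 20, 17, 11, 14, 18]
scores_time2 = [13, 14, 11, 18, 19, 10, 15, 16]

r = retest_correlation(scores_time1, scores_time2)
print(f"test-retest r = {r:.2f}")
```

The two lists must be in the same order, so that position i in each list refers to the same person.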
Now, finally, she is ready to perform her research. She selects a sample of 50
males and 50 females in the range 22-25 years old, all unmarried, and administers
her two tests to them four times, once every six months. She gives Form A of
the SMS and Form A of the SAT the first time, Form B the second time, Form A
the third time, etc. Unfortunately, she gets very “noisy” results: the correlations
between SAT and SMS are in the right direction, but low.
She concludes that something else is affecting sexual activity and, based on other
social psychology research, theorizes that it is the physical attractiveness of the
subjects. She must now evaluate each subject’s physical attractiveness. She brings
all 100 into her lab and takes professional quality photos of them from “a variety
of angles.” Then she assembles a panel of two men and two women in the 22-25
age range and has them rate each photo on attractiveness. To make sure they are
producing a reliable measure, she looks at the agreement rates among the four
raters. Coefficient kappa comes out to be κ=.65, which is good enough. Now
she can use the attractiveness ratings to lower the noise (error variance) in her
data.
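The text does not say which form of kappa is used. Cohen's kappa is defined for two raters, so the sketch below assumes Fleiss' kappa, a common generalization for a fixed number of raters, and assumes that attractiveness is rated in a small number of categories (here 0 = low, 1 = medium, 2 = high). The ratings and the function name are hypothetical.

```python
import numpy as np

def fleiss_kappa(ratings, n_categories):
    """Fleiss' kappa for a (subjects x raters) array of category codes
    0..n_categories-1, with every subject rated by the same number of raters."""
    ratings = np.asarray(ratings)
    n_subjects, n_raters = ratings.shape
    # counts[i, c] = number of raters who placed subject i in category c
    counts = np.stack([(ratings == c).sum(axis=1)
                       for c in range(n_categories)], axis=1)
    # Observed agreement: proportion of agreeing rater pairs per subject.
    p_subject = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_observed = p_subject.mean()
    # Chance agreement from the overall share of each category.
    p_category = counts.sum(axis=0) / (n_subjects * n_raters)
    p_chance = np.sum(p_category ** 2)
    return (p_observed - p_chance) / (1.0 - p_chance)

# Hypothetical ratings: 6 photographed subjects, 4 raters each.
ratings = [
    [2, 2, 2, 2],
    [0, 0, 1, 0],
    [1, 1, 1, 2],
    [2, 2, 1, 2],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]

kappa = fleiss_kappa(ratings, n_categories=3)
print(f"Fleiss' kappa = {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually lower than the simple percentage of matching ratings. If the judges rate on a numeric scale instead of categories, a correlation-based index would be a more natural choice.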
How would such a study actually come out? This particular study has not been
performed, but components of it have. Self-Monitoring does predict higher sexual
activity, and attractive people do get more dates. Based on other studies, we
would predict that Self-Monitoring would cause sexual activity, not the opposite.
Adding attractiveness would very likely strengthen the results of the study.
Reference
Gabrenya, W. K., Jr., & Arkin, R. M. (1980). Self-Monitoring Scale: Factor structure
and correlates. Personality and Social Psychology Bulletin, 6, 13-22.