Module 5 - Reliability and Validity in Measurement
Learning Outcomes
- Recognize and explain how the reliability and validity of psychological measures
determine the utility of these measures for testing hypotheses and for practical
applications
- Understand the concept of measurement error and distinguish between random and
systematic measurement error
- Differentiate between reliability and validity as indices of the quality of psychological
measures
- Identify key methods for assessing the reliability of psychological measures such as
testing parallel forms, temporal stability, internal consistency, and inter-rater reliability
- Recognize what face validity is and explain why it is an inadequate method for assessing
the validity of psychological measures
- Identify key methods for assessing the validity of psychological measures, such as
criterion validity testing and construct validity testing
- Recognize the need to establish convergent and discriminant validity of psychological
measures
Overview
- There are 2 major types of measurement error that cause scores on a measure to differ
from the true value that the measure is designed to estimate: random measurement
error and systematic measurement error
Random Measurement Error
- The first major type of measurement error is random measurement error (aka noise)
- It causes scores on the measure to deviate from the true value in random, unpredictable
ways during the process of measurement
- The reliability of a measure is inversely related to the level of random measurement error
in that measure
- Increasing the number of distinct observations that get averaged together into an
estimate will reduce random measurement error because random errors tend to cancel
each other out when they are averaged (see the simulation sketch after this list)
- The amount of random measurement error in a measure is assessed through
tests of that measure’s reliability
- In psychological measurement there are a variety of possible sources of random
measurement error that may cause the measured value to deviate randomly from the
true value
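- A minimal simulation sketch of this averaging principle (all numbers below are invented for illustration, not taken from the module):

```python
import numpy as np

# Simulate one person's true score plus random measurement error, then show
# that averaging k observations brings the estimate closer to the true value.
rng = np.random.default_rng(0)
true_score = 50.0   # the value the measure is designed to estimate (made up)
noise_sd = 10.0     # spread of the random measurement error (made up)

for k in (1, 5, 25, 100):
    # 10,000 simulated sessions, each averaging k noisy observations
    estimates = rng.normal(true_score, noise_sd, size=(10_000, k)).mean(axis=1)
    print(f"k = {k:>3}: typical error of the estimate = {estimates.std():.2f}")
# The typical error shrinks roughly as noise_sd / sqrt(k): random errors
# tend to cancel each other out when averaged.
```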
Common Sources of Random Measurement Error in Psychological Measurement
1. Random variability in the psychological states of participants or respondents
- These instabilities in the tested individual include factors such as fluctuations in their
mood, fatigue, and attentiveness during the testing process
- Random variance: variance in a measure that is unsystematic and does not exhibit
consistent patterns across repeated measures or modes of assessment
- Systematic variance: variance in a measure that exhibits discernible patterns, including
properties such as stability of results across repeated measures and consistency of
results across modes of assessment
Useful Measures Must be Reliable Measures
- For a psychological measure to be useful it needs to be reliable, meaning that the
measure would likely yield similar results if it was used repeatedly to quantify a set of
behaviours or psychological experiences
- Reliable measures are needed in order to produce dependable results when researchers
test their hypotheses - for this reason psychologists have developed a variety of methods
to evaluate the reliability of their measures
- The reliability of a measure is inversely related to the level of measurement error
associated with that measure
- The more error-prone a measure is the less reliable the results of that measure
will tend to be
- You can get a sense of what a reliable measure is like by considering common synonyms
for reliability:
- Consistency: a reliable measure is a measure that produces a consistent pattern
of results whenever it is applied to record a given set of behaviours or
psychological experiences
- Stability: the scores that a reliable measure yields tend to be more stable across
time and across settings
- Dependability: a reliable measure is dependable in the sense that researchers
can depend on their results replicating if these results are based on reliable
measures
- Random error undermines the consistency, stability and dependability of measures
because when random error is high the results produced by a measure are likely to vary
in unpredictable ways
Systematic Variance: The Signal Researchers Aim to Detect
- Reliability assesses the proportion of the total measured variance in a sample that is
systematic variance, meaning variance that shows discernible and stable patterns,
versus the variance that is due to random measurement error
- Signal-to-noise ratio is high -> we may be able to accurately detect the signal
- Signal-to-noise ratio is low -> it can be very hard to detect any signal amidst the high
background noise
- Low signal-to-noise ratio: trying to have a conversation with a friend in a noisy
environment, such as a busy restaurant
- High signal-to-noise ratio: conversing as you stroll down a quiet street
- In this analogy, the signal you are trying to detect is what your friend is saying, and
the noise is the background sounds that are not part of your conversation
- Analogously in psychological measurement it is harder to estimate the value of some
latent variable when the measure that is used to assess that value contains a high level
of random measurement error or noise
- There are a variety of strategies that psychologists use to assess how well a
psychological measure captures systematic variance as opposed to measuring random
variability
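- The signal-to-noise idea can be illustrated with a small simulation (a hedged sketch with made-up data, not a procedure from the module):

```python
import numpy as np

# The same latent "signal" measured with low vs. high random noise: the
# noisier measure correlates less with the latent variable it should capture.
rng = np.random.default_rng(1)
latent = rng.normal(size=1_000)              # the signal researchers want
quiet = latent + rng.normal(0, 0.5, 1_000)   # high signal-to-noise ratio
noisy = latent + rng.normal(0, 3.0, 1_000)   # low signal-to-noise ratio

print("high SNR:", round(np.corrcoef(latent, quiet)[0, 1], 2))   # ~0.89
print("low SNR: ", round(np.corrcoef(latent, noisy)[0, 1], 2))   # ~0.32
```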
Procedures for Assessing Reliability: Parallel Forms
- Parallel forms: 2 distinct forms of the same psychological measure that have the same
overall structure and format but differ in the specific items that they contain
- Parallel forms reliability: technique that estimates the reliability of a measure by
assessing the magnitude of the correlation between scores on parallel forms of that
measure
- Test bank: a corpus of test items that are designed to assess some psychological
construct(s) or knowledge of some topic(s)
Creating Two Forms of a Measure to Assess Reliability
- One method for assessing the reliability of a measure involves constructing 2 parallel
forms of the measure that each have distinct content, with no identical items across the 2
forms of the measure
- Parallel forms reliability can be assessed by administering both parallel forms of the
measure to the same sample of participants and observing the degree to which scores
on the 2 forms correlate
- If there is a strong positive correlation in participants’ scores on the 2 parallel forms of
the measure then this indicates that the measures are assessing the same latent
variable
- A strong positive correlation would mean that participants who scored higher than the
sample average on one of the measures also scored higher than the sample average on
the parallel measure, participants who scored close to the sample average on one
measure also tended to score close to average on the parallel measure, and participants
who scored lower than the sample average on one measure also tended to score lower
than the sample average on the parallel measure
- A strong positive correlation between the scores on the parallel measures doesn't
necessarily mean that individual participants get the exact same score on both measures
- However, it means that the rank of an individual's score relative to other
participants in the sample is nearly the same for each measure of the construct
- Ex. if there is a high correlation between scores on the parallel measures
then a participant whose score was in the 95th percentile on one measure
will tend to be in the 95th percentile on the parallel measure
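- A minimal sketch of how parallel forms reliability could be computed (the scores below are hypothetical):

```python
import numpy as np

# Correlate the same participants' total scores on Form A and Form B.
form_a = np.array([38, 45, 29, 50, 41, 33, 47, 36])  # invented Form A totals
form_b = np.array([40, 43, 31, 48, 44, 30, 46, 35])  # invented Form B totals

r = np.corrcoef(form_a, form_b)[0, 1]
print(f"parallel forms reliability estimate: r = {r:.2f}")
```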
- Temporal stability (i.e., test-retest reliability): technique that estimates the reliability of a
measure by administering the same measure to the same sample of participants across
two or more sessions separated by some meaningful interval of time - the magnitude of
the correlation in participants' scores across these sessions indicates the reliability of the
measure
Stability of a Measure Over Time
- Stability is also referred to as test-retest reliability
- To assess the stability of a measure researchers administer the same measure to a
sample of participants during an initial session then wait some interval of time before
administering the measure to the same sample again in a subsequent session
- The researchers then compute the correlation between participants’ scores
across these separate testing sessions
- If there is a strong positive correlation between the first set of scores and the
second set of scores then this indicates that participants' scores on the measure
are highly stable
- A strong positive correlation between scores at the different time points would mean
that participants who scored higher than the sample average in the first session also
scored higher than the sample average in the second session, participants who scored
close to the sample average in the first session also tended to score close to average in
the second session, and participants who scored lower than the sample average in the
first session also tended to score lower than the sample average in the second session
- If a measure is highly stable it doesn’t necessarily mean that individual participants get
the exact same score each time that the measure is administered to them. However, it
means that the rank of an individual's score relative to other participants in the sample is
nearly the same each time that the measure is administered to that sample.
- For example, if a measure is highly stable then a participant whose score was in
the 95th percentile during the first session will tend to be in the 95th percentile
again in the second session. Returning to the Mind-Eyes example, if this
measure is highly stable we would expect to see participants' rankings, based on
their score, to be consistent across different time points. In other words,
participants that ranked high at time 1, relative to other participants, should rank
high at time 2, though their specific score may vary slightly. Conversely, if this
measure is unstable or inconsistent over time, individual participants' rankings
should vary across time points
- To the extent that participants' scores on a given measure are highly stable across
testing sessions, in terms of how they rank relative to other scores in the sample, this
indicates that the measure is reliably discriminating individual differences in something
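- A minimal sketch of a test-retest computation (hypothetical scores; the Spearman correlation is one way to index the rank-order stability emphasized above):

```python
import numpy as np
from scipy import stats

# Correlate the same participants' scores across two sessions; the Spearman
# correlation indexes the stability of their relative rankings specifically.
time1 = np.array([22, 30, 17, 28, 25, 19, 31, 24])  # invented session 1 scores
time2 = np.array([24, 29, 18, 27, 23, 21, 30, 26])  # invented session 2 scores

r, _ = stats.pearsonr(time1, time2)      # stability of raw scores
rho, _ = stats.spearmanr(time1, time2)   # stability of relative rankings
print(f"test-retest r = {r:.2f}, rank-order stability rho = {rho:.2f}")
```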
Some Authentic Change May be Expected Over Time
- Using the temporal stability of a measure to assess reliability assumes that any change
in the relative ranking of participants’ scores within the sample is due to random
measurement error (i.e., low reliability) rather than authentic change in whatever
psychological variable the measure is assessing
- This assumption is more plausible if there is not an extended interval between
the administrations of the measure
- The length of the interval during which authentic change might plausibly occur
depends on the nature of the variable that is being assessed
- Usually personality traits, fundamental values, and core abilities would not be expected
to change across long stretches of time - months or even years
- For assessing the temporal stability of measures of these types of variables
researchers should schedule the testing sessions to be separated by at least
several weeks or even months
- Even shorter intervals between testing sessions should be scheduled when
these variables are assessed during periods of more rapid change (e.g.,
childhood, early adolescence)
- Other variables (person’s attitudes, preferences, and feelings) might be expected
to undergo authentic change in relatively shorter intervals of just a few weeks
- If longer intervals are used to test the stability of measures of these more rapidly changing
variables then it would be ambiguous whether low correlations between the testing
sessions indicate random measurement error or authentic change
Determining an Interval that is ‘Just Right’
- Researchers who are testing the temporal stability of a measure need to be careful not
to schedule too long an interval between testing sessions because this makes it
ambiguous whether low correlations in scores are due to measurement error or authentic
change
- However, researchers also need to be careful not to schedule too short an
interval between testing sessions because in this case it would make it
ambiguous whether high correlations in scores are due to reliability of the
measure or due to repeated testing effects
- If there is a short interval between testing sessions then participants may
be able to recall their responses from the first session and they may try to
give a similar response in the second session just to appear consistent
- This inflates the estimates of the stability of the measure
- Goldilocks problem in determining the optimal interval between sessions to test the
temporal stability of a measure
- If the interval between testing sessions is too long the reliability of the measure
may be underestimated
- If interval between the testing sessions is too short then the reliability of the
measure may be overestimated
- There is no one-size-fits-all rule for scheduling the interval between testing sessions
- For most variables it’s customary to separate the testing sessions by at least 2
weeks to balance the cross-pressures of problems due to repeated testing effects
on the one hand versus authentic change in the variable on the other hand
Think and Respond
- Related reading: https://skepticalinquirer.org/2002/01/snaring_the_fowler_mark_twain_debunks_phrenology/
1. Increasing Reliability by Aggregating Multiple Measurements
- When multiple distinct observations are aggregated through some process of summing
or averaging, the resulting aggregated measure tends to be more reliable than any of the
individual component observations are
- This is because any given observation contains some amount of measurement
error
- As more distinct observations are aggregated together their individual errors
should tend to cancel each other out and the signal for whatever content these
observations share in common should by consequence become clearer following
aggregation
- So, as a general rule increasing the number of measurements taken will tend to reduce
random measurement error and improve the overall reliability of an aggregate measure
- However, there are some important considerations to bear in mind when deciding
whether to add items to a scale or test, or additional observers to an
observational study.
To Add Items or Not to Add Items? Further Considerations
1. When considering whether to add measurements to improve the reliability of a
measurement system it is important to take care to add measurements that will share
some content in common with the existing measurements
- Adding more measurements to an aggregate measure will only improve the reliability of
that measure if the added measurements overlap somewhat in their content with the
original measurements
- If the added measurements involve unique content that diverges strongly from the
content of the existing measurements then adding these additional measurements will
not improve the reliability of that measurement system
- If the added items correlate poorly with the existing items they may not improve the
reliability of the questionnaire, and there is some risk that the added items will lower its
reliability
- Likewise, adding observers whose ratings correlate poorly with those of the existing
observers will not improve the inter-rater reliability of a coding system
2. Even if the added measurements share something in common with the existing
measurements there are diminishing returns to adding more and more measurements
to enhance an aggregate measure’s reliability
- Ex. you will get a larger boost in reliability by adding 5 items to a 10-item scale (a 50%
increase in items) than you will get from adding 5 items to a 50-item scale (just a 10%
increase in items) - the Spearman-Brown sketch after this list makes this precise
3. Additional measurements do not reduce the impact of systematic measurement error, or
the level of bias in measurement
- If a measure contains a systematic bias in estimating some target value (ex. a systematic
bias to overestimate the value of the variable), adding additional measurements will not
counteract such systematic biases
- Ex. if people are biased to report higher levels of satisfaction with their romantic
relationship than they actually feel, perhaps because they want to convey a favourable
impression to the researcher, then adding more items to a relationship satisfaction scale
will not correct this bias in self-reporting
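- The diminishing-returns point in consideration 2 can be made precise with the Spearman-Brown prophecy formula, a standard psychometric result (not stated explicitly in the module; the starting reliability below is an assumption):

```python
# Spearman-Brown prophecy formula: lengthening a test by a factor n changes
# its reliability r to n*r / (1 + (n - 1)*r).
def spearman_brown(r: float, n: float) -> float:
    return n * r / (1 + (n - 1) * r)

r_current = 0.70  # assumed reliability of the current scale (made up)
print(round(spearman_brown(r_current, 15 / 10), 2))  # 10 -> 15 items: ~0.78
print(round(spearman_brown(r_current, 55 / 50), 2))  # 50 -> 55 items: ~0.72
```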
2. Increasing Reliability by Removing Weakly Associated Measurements
- Another approach for enhancing the reliability of a measure involves identifying any
items that show very low correlations with the other items
- Items that are not correlated with the other items will be contributing to the noise in the
measure and this noise will make it harder to detect the signal from the other items
- Removing these poorly performing items from the measure could help to improve that
measure’s reliability metrics
- With this approach, researchers could use the patterns of intercorrelation among the
items within a measure to select the most consistently performing items and refine the
measure
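- A hedged sketch of this item-screening idea (data simulated here; corrected item-total correlations are one common way to spot weakly associated items):

```python
import numpy as np

# Simulated scale: five items that track a common trait plus one pure-noise
# item. Corrected item-total correlations flag the weak item for removal.
rng = np.random.default_rng(2)
trait = rng.normal(size=200)
items = np.column_stack([trait + rng.normal(0, 1, 200) for _ in range(5)])
items = np.column_stack([items, rng.normal(size=200)])  # item 6: pure noise

for j in range(items.shape[1]):
    rest = items.sum(axis=1) - items[:, j]   # scale total excluding item j
    r = np.corrcoef(items[:, j], rest)[0, 1]
    note = "  <- weakly associated, candidate for removal" if r < 0.3 else ""
    print(f"item {j + 1}: corrected item-total r = {r:.2f}{note}")
```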
- Dr. Johnson makes the insightful observation that when experimenters examine the
consistency of people’s behaviour across just two experimental situations this is
essentially like looking at the test-retest reliability of a single-item measure. As we
reviewed in the module, measures that consist of a small number of items tend
not to be very reliable. So, when researchers assess a single behaviour across two
situations, we should not be surprised that there tends to be relatively low stability in
those behaviours. This low stability likely represents the unreliability of a single-item
measure rather than a lack of consistency in personality expression. To provide a more
reliable assessment of the consistency of personality what researchers should do is
observe the behaviour of a sample of individuals in multiple situations, and compute the
average of each individual’s behaviour in a randomly selected half of those situations
and the average of their behaviour in the other half of the situations and then compute
the correlation between those two averages. If multiple observations go into each of
these averages then each average should be a relatively reliable measure of the
individuals’ behavioural tendencies and thus we would expect the correlation between
these two behavioural averages to be relatively large. For example, suppose that you
observed a sample of individuals across 10 situations. You could then randomly divide
these 10 situations into two sets of 5 situations. You would then calculate each
individual's average behaviour score in each of these sets of 5 situations and compute the
correlation between those two averages. This procedure should result in a much higher
consistency in behaviour than you would get if you computed the correlations in
individuals’ behaviours across only 2 situations, as was typically done in the past
experimental research that questioned the consistency of personality.
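- A sketch of the procedure Dr. Johnson describes, using simulated data (the sample size and noise level are assumptions for illustration):

```python
import numpy as np

# Observe each person in 10 situations, randomly split the situations into
# two sets of 5, average behaviour within each set, and correlate the averages.
rng = np.random.default_rng(3)
n_people, n_situations = 100, 10
trait = rng.normal(size=(n_people, 1))                       # stable tendency
behaviour = trait + rng.normal(0, 2, (n_people, n_situations))

order = rng.permutation(n_situations)
half_a = behaviour[:, order[:5]].mean(axis=1)
half_b = behaviour[:, order[5:]].mean(axis=1)

print("two single situations:    r =",
      round(np.corrcoef(behaviour[:, 0], behaviour[:, 1])[0, 1], 2))
print("two 5-situation averages: r =",
      round(np.corrcoef(half_a, half_b)[0, 1], 2))
```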
Validation of Measures
- Criterion validity: technique for assessing the validity of a measure by examining how
well it predicts some key outcome that it was specifically designed to predict
- Face validity: the extent to which the content of the measure appears to resemble
whatever latent construct it was designed to measure
- Measurement confound: variability in the measured value that can be attributed to a
source that is not the latent construct that the measure was designed to assess
Does the Measure Actually Get at the Variable of Interest?
- Establishing the reliability of a psychological measure is the critical first step in
determining whether it will be a useful instrument for testing hypotheses about the
variable of interest
- When a measure is reliable this tells us that it is measuring something that is
consistent and stable enough to yield replicable results
- For a measure to be useful it is not sufficient to demonstrate that it is measuring
something
- We must establish that the psychological measure is measuring the specific
variable that it was designed to measure, as opposed to something else
- This critical next step entails testing the validity of the measure
- There is always some risk that a measure that was designed to measure a particular
variable might measure variance in other unintended variables
- If some variability in a measure can be attributed to sources other than the variable that
it was designed to measure then this is referred to as a measurement confound
- To establish that a measure will be useful for research purposes a researcher
needs to demonstrate that most of the variability in that measure is related to the
variable that it was designed to measure and relatively little of its variability is due
to some measurement confound
- Much of the work in assessing the validity of a measure focuses on testing
whether potential measurement confounds can be ruled out
Face Validity
- Face validity is the most simplistic and least scientifically compelling evidence for the
validity of a measure
- A measure is said to be face valid if most people who examine just the content of the
measure would agree that it appears to be measuring what it is designed to measure
- Face validity is thus equivalent to the proverbial “duck test”
- “If it looks like a duck and it quacks like a duck, then it must be a duck”
- Similar to the duck test, a researcher might claim that a measure is a face valid
measure of relationship satisfaction because the items consist of statements that
sound like the kinds of things a satisfied person would endorse
- Although face validity is intuitively compelling it is not scientifically compelling evidence
of a measure's validity because experienced researchers know that things often
are not as simple and straightforward as they may appear
- Some people might endorse a statement such as “I enjoy spending time with my
partner” not because they are actually satisfied with their relationship but
because they fear that they will look bad if they don’t say positive things like this
about their partner
- So a measure that on the surface appears to be a face valid measure of
relationship satisfaction may actually be tapping into a person's motivation
to convey a positive impression to the experimenter
- The principle that we should be suspicious of seemingly face valid evidence
extends beyond research to many practical domains of life
- “You can’t judge a book by its cover”
- Some of the most compelling examples of fallibility of face validity come from the
criminal justice system
- Ex. people naively assume that a suspect's confession to a crime is face
valid evidence of that person's guilt - yet so many confessions have been
proven false and overturned based on definitive forensic evidence that we
should be extremely skeptical of leaping to the conclusion that a confession
proves a person's guilt
Criterion Validity
- Some measures are designed to assess variability in a psychological construct in order
to predict specific criterion outcomes of interest
- A criterion outcome is a specific, definitive outcome that usually has some real
world significance or practical value
- In these cases the criterion outcome serves as a “gold standard” for
indexing the validity of the measure in question
- Criterion validity is thus restricted to measures that are designed for a quite
narrow purpose and is not relevant to most psychological measures, which are
designed to assess broader constructs. To validate such broader measures
psychologists need to engage in a more extensive form of validity testing,
called construct validation
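- A minimal sketch of a criterion validity computation (the measure, criterion, and all scores below are hypothetical):

```python
import numpy as np

# Criterion validity as the correlation between scores on a selection measure
# and the real-world outcome it was built to predict (all numbers invented).
test_scores = np.array([62, 75, 58, 90, 70, 66, 84, 73])
job_performance = np.array([3.1, 3.8, 2.9, 4.5, 3.5, 3.0, 4.2, 3.6])

r = np.corrcoef(test_scores, job_performance)[0, 1]
print(f"criterion validity coefficient: r = {r:.2f}")
```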
Construct Validity: Convergent Validity
- Construct validity: approach for estimating the validity of a measure by testing a network
of predictions about what patterns the measure should exhibit in relation to other
measures and outcomes, and in relation to meaningful groups
- Convergent validity: technique for estimating the validity of a measure by assessing
correlations between that target measure and other existing measures that were
designed to assess the same construct or same closely related construct
- Halo effect: source of systematic measurement error in observational measures that
occurs when observers are biased to attribute positive qualities to targets who are
physically attractive or who make a favourable initial impression compared to targets
who are less attractive or make a less favourable initial impression
- Known groups validation: technique for estimating the validity of a measure by
administering that measure to two or more groups of participants that, according to the
researchers’ theory of the construct, are predicted to differ in their levels of the
psychological variable of interest
- Shared method variance: correspondence between measures that is due to
methodological elements that they share in common rather than due to their convergent
measurement of the same psychological construct
- Social desirability response bias: source of systematic measurement error in self-report
measures that occurs when participants are motivated to respond to self-report items in
a way that will promote a favourable impression
- Theory of the construct: the researcher’s theoretical assumptions about the nature,
scope, and properties of the latent construct that they are studying
Validity Derived From Accumulation of Many Successful Predictions
- Construct validity assesses the validity of a measure by testing a network of
predictions about what patterns the measure should exhibit in relation to other measures
and outcomes and in relation to meaningful groups
- These predictions are derived from the researcher’s theoretical understanding of
whatever latent variable the measure was designed to assess
- This theory of the latent variable is referred to as the researcher’s theory of the
construct, which is why this approach is known as construct validity
- The accumulation of many successful predictions involving the measure and diverse
observations relevant to the theory of the construct enhances the research community’s
confidence in the construct validity of that measure
- Establishing a measure’s construct validity is thus an ongoing process that is never
definitively resolved and that may need to be revisited as the field’s theory of the
construct evolves
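- A hedged sketch of the correlation pattern that convergent and discriminant validity testing looks for (simulated data; the "established" and "unrelated" measures here are hypothetical stand-ins):

```python
import numpy as np

# The new measure should correlate strongly with an established measure of
# the same construct (convergent validity) and weakly with a measure of a
# construct it should be distinct from (discriminant validity).
rng = np.random.default_rng(4)
construct = rng.normal(size=300)
new_measure = construct + rng.normal(0, 0.7, 300)
established = construct + rng.normal(0, 0.7, 300)   # same construct
unrelated = rng.normal(size=300)                    # different construct

print("convergent r  =", round(np.corrcoef(new_measure, established)[0, 1], 2))
print("discriminant r =", round(np.corrcoef(new_measure, unrelated)[0, 1], 2))
```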
Testing Construct Validity: Known Groups Validation
- One useful approach for testing the construct validity of a measure involves
administering the measure to two or more groups of participants that, according to the
researchers’ theory of the construct, are predicted to differ in their levels of the
psychological variable of interest
- This is referred to as known groups validation
- These groups could be a group that has a clinical condition that entails atypically high or
atypically low levels of the variable of interest and a comparison group of participants
who should have typical levels of the variable of interest
- Known groups testing of a measure might also involve administering the measure to a
group of participants who work in a field that requires above-average levels of the variable
of interest and a comparison group of participants who work in a field that does not
require particularly high levels of that variable
- Known groups validation has been used to test the construct validity of the Mind-Eyes
test
- As you will recall this measure was designed to assess the latent variable,
cognitive empathy
- The researchers' theory of cognitive empathy led them to hypothesize that people who
have an autism spectrum disorder (ASD) diagnosis should be lower in cognitive empathy
than people who are not on the autism spectrum
- This was predicted because ASD is theorized to involve a core deficit in the
individual's theory of mind, which is the ability to make accurate inferences about
other people's cognitive states
- Thus, if the Mind-Eyes test is a valid measure of cognitive empathy then the
researchers predicted that a sample of participants who had ASD should get
significantly lower scores on the Mind-Eyes test than a comparison sample of
participants who did not have ASD
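- A minimal sketch of a known groups comparison (scores invented, patterned after the Mind-Eyes prediction above; a t-test is one common way to compare the group means):

```python
import numpy as np
from scipy import stats

# Compare the measure's scores across groups that theory predicts should differ.
asd_group = np.array([18, 21, 17, 20, 19, 16, 22, 18])   # invented scores
comparison = np.array([25, 27, 24, 28, 23, 26, 29, 24])  # invented scores

t, p = stats.ttest_ind(asd_group, comparison)
print(f"t = {t:.2f}, p = {p:.4f}")  # significantly lower scores support validity
```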
- Dr. Johnson notes that items such as "I feel a kinship with other people” assess social
adjustment. Social adjustment is one of the outcomes that spirituality should be
correlated with according to theories of spirituality, but social adjustment is not one of the
defining characteristics of spirituality. If items that assess social adjustment are included
in a measure of spirituality then this would result in spirituality being confounded with
social adjustment in that measure. We know from other research that social adjustment
tends to predict positive physical and psychological outcomes. So, the confounding of
spirituality and social adjustment in a spirituality measure makes it difficult to determine
whether the positive correlation between that spirituality measure and
physical/psychological outcomes is due to the spirituality-relevant items in the spirituality
measure or due to the social adjustment-related items within that measure. Thus, when
researchers are designing a new measure they need to be careful to ensure that the
contents of the measure only include items that are relevant to the core defining
characteristics of whatever construct the measure is being designed to assess and
exclude any items that are specific to other variables that the researchers may want to
use their measure to predict.
Summary
- Testing psychological hypotheses about the relations between variables depends upon
using measures that accurately represent those variables. Thus, researchers need to
demonstrate that the measures they use to operationalize their hypothesized variables
accurately represent those variables. For a measure to be considered an accurate
representation of a variable that measure must be shown to have adequate reliability
and validity
- A measure is reliable to the extent that it yields consistent, dependable results when it is
used repeatedly to measure variability within a sample. The reliability of a measure is
inversely related to the amount of random measurement error in that measure. A
variety of methods are used to assess the reliability of psychological measures including
the temporal stability of scores on that measure, the consistency between scores on
parallel forms of the same measure, and the internal consistency of the scores on
individual items within that measure. For observational measures, reliability is assessed
by estimating inter-rater agreement in the ratings that independent observers give to
the same sample of observations. When a measure is found to have low reliability
researchers may seek to improve its reliability by incorporating more items into the
aggregated measure, by removing items that are weakly or inconsistently associated
with other items in the measure, and by clarifying the items in the measure and
standardizing the conditions of administering or scoring the measure
- A measure is valid to the extent that it assesses the specific variable that it was designed
to measure as opposed to measuring some other confounded variable. To establish
that a measure will be useful for research purposes a researcher needs to demonstrate
that most of the variability in that measure is related to the variable that it was designed
to measure and relatively little of its variability is due to some measurement confound.
Researchers may be tempted to assume that a measure is valid simply because the
content of the measure appears to resemble the variable that it was designed to
measure. This is known as face validity and it is not considered to be an adequate
demonstration of a measure's validity. When a measure is designed to predict some
specific criterion outcome then the validity of a measure can be assessed by testing how
well the measure predicts that outcome, which is known as criterion validity. For most
psychological measures, which are designed to predict a broad range of psychological
states and behaviours, validity testing usually involves extensive research that tests the
relations between the measure of interest and a variety of other measures and
outcomes, which is known as construct validity. One method of assessing a measure's
construct validity involves administering that measure to two or more groups of
participants that, according to the researchers' theory of the construct, are predicted to
differ in their levels of the psychological variable of interest, which is referred to as
testing known groups validity. Another technique for testing the construct validity of a
measure involves assessing correlations between that target measure and other existing
measures that were designed to assess the same construct or some closely related
construct, which is known as testing the measure's convergent validity. Another
important technique for estimating the construct validity of a measure involves testing
whether that target measure is unrelated to measures that it should be distinct from
according to the researcher's theory of the construct, which is known as testing the
measure's discriminant validity.