
Reliability and Validity


Measurement Error

Whatever measurement we make of a psychological construct, we make it with some amount of error
Any observed score for an individual is their true score with error added in
There are different types of error, but here we are concerned with a measure's inability to capture the true response for an individual
Observed Score = True Score + Error of Measurement (illustrated in the sketch below)
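As a rough illustration of this decomposition (a minimal sketch with made-up numbers, not any real measure), we can simulate true scores, add random error, and recover reliability as the share of observed-score variance that is true-score variance:

import numpy as np

rng = np.random.default_rng(0)
n = 10_000

true_score = rng.normal(loc=100, scale=15, size=n)  # hypothetical true scores
error = rng.normal(loc=0, scale=8, size=n)          # random measurement error
observed = true_score + error                       # Observed = True + Error

# Reliability: proportion of observed variance that is true-score variance
reliability = true_score.var() / observed.var()
print(round(reliability, 2))  # close to 15**2 / (15**2 + 8**2) = 0.78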

Reliability
Reliability refers to a measure's ability to capture an individual's true score, i.e. to distinguish accurately one person from another
While a reliable measure will be consistent, consistency is really a by-product of reliability; with perfect consistency (everyone scores the same and gets the same score repeatedly), reliability coefficients could not even be calculated
No variance/covariance to give a correlation
The error in our analyses is due not only to individual differences but also to the measure not being perfectly reliable

Reliability
Criteria of reliability
Test-retest
Test components (internal consistency)

Test-retest reliability
Consistency of measurement for individuals over time
Scores should be similar, e.g. today and 6 months from now

Issues
Memory
If the administrations are too close in time, the correlation between scores is due to memory of item responses rather than to the true score being captured
Chance covariation
Any two variables will always have a non-zero correlation
Reliability is not constant across subsets of a population
IQ scores in the general population: good reliability
IQ scores among college students only: less reliable
Restriction of range means fewer individual differences (see the sketch below)
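A minimal sketch of both ideas, with hypothetical numbers: test-retest reliability is simply the correlation between scores from two occasions, and restricting the sample to a narrow range of ability shrinks that correlation.

import numpy as np

rng = np.random.default_rng(1)
n = 5_000

true_iq = rng.normal(100, 15, n)
time1 = true_iq + rng.normal(0, 6, n)  # score today
time2 = true_iq + rng.normal(0, 6, n)  # score six months later (memory effects ignored)

r_full = np.corrcoef(time1, time2)[0, 1]

# Restriction of range: keep only a high-ability subset (e.g., college students)
subset = true_iq > 115
r_restricted = np.corrcoef(time1[subset], time2[subset])[0, 1]

print(round(r_full, 2), round(r_restricted, 2))  # the restricted correlation is noticeably lower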

Internal Consistency
We can use a sort of average correlation among items to assess the reliability of a measure
As one would intuitively assume, having more measures of something is better than having few
Having more items that correlate with one another will increase the test's reliability (see the sketch below)
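As a sketch (simulated items, not real data), Cronbach's alpha gives such an internal-consistency estimate, and the Spearman-Brown formula shows how lengthening the test raises reliability:

import numpy as np

def cronbach_alpha(items):
    # alpha for an (n_persons x n_items) score matrix
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

def spearman_brown(rel, factor):
    # predicted reliability when the test is lengthened by `factor`
    return factor * rel / (1 + (factor - 1) * rel)

rng = np.random.default_rng(2)
n, k = 1_000, 5
trait = rng.normal(0, 1, n)
items = trait[:, None] + rng.normal(0, 1.2, (n, k))  # 5 noisy items tapping one trait

alpha_5 = cronbach_alpha(items)
print(round(alpha_5, 2))                     # alpha for the 5-item test (around .78 here)
print(round(spearman_brown(alpha_5, 2), 2))  # predicted alpha if doubled to 10 items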

What's good reliability?
While we have conventions, it really depends
As mentioned, the reliability of a measure may differ across groups of people
What we may need to do is compare its reliability to that of measures already in place and deemed good, and obtain interval estimates to convey the uncertainty in our reliability estimate
Note also that reliability estimates are biased upward and so are a bit optimistic
Also, many of our techniques do not take the reliability of our measures into account, and poor reliability can result in lower statistical power, i.e. an increase in Type II error (see the sketch below)
Though technically, increasing reliability can in some situations also lower power
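One way to see the reliability-power link is the classic attenuation formula: the correlation we observe is the true correlation shrunk by the square root of the two measures' reliabilities. A rough sketch with hypothetical values:

import math

def attenuated(r_true, rel_x, rel_y):
    # observed correlation after attenuation due to unreliable measures
    return r_true * math.sqrt(rel_x * rel_y)

r_true = 0.40
print(round(attenuated(r_true, 0.90, 0.90), 2))  # 0.36 with good reliability
print(round(attenuated(r_true, 0.60, 0.60), 2))  # 0.24 with poor reliability: harder to detect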

Replication and Reliability


While reliability implies replicability, assessing reliability does not provide a probability of replication
Note also that statistical significance is not a measure of reliability or replicability
Replication is perhaps not conducted as much as it should be in psychology, for a number of reasons
Practical concerns, lack of publishing outlets etc.
Furthermore, knowing our estimates are themselves biased and variable, we might even expect that in many cases research findings would not be consistent
In psychology, many people spend a lot of time debating the merits of some theory back and forth, citing cases where it did or did not replicate
However, the lack of replication could be due to low power, low reliability, problem data, incorrectly carrying out the experiment etc.
In other words, the result didn't repeat because of methodology, not because the theory was wrong

Factors affecting the utility of replications


"You can't step in the same river twice!" - Heraclitus

When
Later replications do not provide as much new information; however, they can contribute greatly to the overall assessment of an effect
Meta-analysis

How
There is no perfect replication (different people involved, the time it takes to conduct, etc.)
An exact replication gives us more confidence in the original finding (should it hold), but may not offer much in the way of generalization
Example: running a gender-difference study at UNT over and over. Does it work for non-college folk? For people outside of Texas?

Factors affecting the utility of replications


By whom
It is well known that those with a vested interest in some idea tend to find confirming evidence more often than those who don't
Replications by others are still being done by people with an interest in that research topic, and so may have a precorrelation inherent in the attempt
Direct: correlation of attributes of the persons involved
Indirect: correlation of the data to be obtained
Gist: we can't have truly independent replication attempts, but we must strive to minimize bias
The more independent replication attempts are, the more informative they will be

Validity
Validity refers to the question of whether our measurements are actually tapping the construct we think they are
While we can obtain specific statistics for reliability (even different types), validity is more of a global assessment based on the evidence available
We can have reliable measurements that are invalid
Classic example: a scale that is consistent and able to distinguish one person from the next, but is actually off by 5 pounds (see the sketch below)
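The biased-scale example can be made concrete (a toy sketch with made-up weights): the readings correlate perfectly with true weight, so the scale is highly reliable, yet every reading is systematically wrong.

import numpy as np

rng = np.random.default_rng(3)
true_weight = rng.normal(160, 20, 1_000)  # hypothetical true weights in pounds
scale_reading = true_weight + 5           # consistent, but always 5 lbs off

print(np.corrcoef(true_weight, scale_reading)[0, 1])  # essentially 1.0: perfectly reliable ordering of people
print((scale_reading - true_weight).mean())           # 5.0: systematic bias, so not valid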

Validity Criteria in Psychological Testing


Content validity
Criterion validity
Concurrent
Predictive

Construct-related validity
Convergent
Discriminant

Content validity
Items represent the kinds of material (or content areas) they are supposed to represent
Are the questions worth a flip, in the sense that they cover all domains of a given construct?
E.g. job satisfaction = salary, relationship w/ boss, relationship w/ coworkers etc.

Validity Criteria in Psychological Testing


Criterion validity
The degree to which the measure correlates with various outcomes

Does some new personality measure correlate with the Big Five?

Concurrent
Criterion is in the present

Measure of ADHD and current scholastic behavioral problems

Predictive
Criterion in the future

SAT scores and college GPA

Validity Criteria in Psychological Testing


Construct-related validity
How much is it actually a measure of the construct of interest?

Convergent
Correlates well with other measures of the construct

A depression scale correlates well with other depression scales

Discriminant
Is distinguished from related but distinct constructs

Depression scale != stress scale (see the sketch below)
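A minimal sketch of both patterns, using simulated (hypothetical) scale scores: a new depression scale should correlate strongly with an established depression scale (convergent) but clearly less with a stress scale (discriminant).

import numpy as np

rng = np.random.default_rng(4)
n = 2_000
depression = rng.normal(0, 1, n)
stress = 0.4 * depression + rng.normal(0, 1, n)     # related but distinct construct

new_dep_scale = depression + rng.normal(0, 0.5, n)  # the measure being validated
old_dep_scale = depression + rng.normal(0, 0.5, n)  # established depression scale
stress_scale = stress + rng.normal(0, 0.5, n)

r_convergent = np.corrcoef(new_dep_scale, old_dep_scale)[0, 1]   # should be high (~.8 here)
r_discriminant = np.corrcoef(new_dep_scale, stress_scale)[0, 1]  # should be clearly lower (~.3 here)
print(round(r_convergent, 2), round(r_discriminant, 2))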

Validity Criteria in Experimentation


Statistical conclusion validity
Is there a causal relationship between X and Y?
Correlation is our starting point (i.e. correlation isn't causation, but it can point toward it)
Related to this is the question of whether the study was sufficiently sensitive to pick up on the correlation

Internal validity
Has the study been conducted so as to rule out other effects which were controllable?
Poor instruments, experimenter bias

External validity
Will the relationship be seen in other settings?

Construct validity
Same concerns as before
Ex. Is reaction time an appropriate measure of learning?

Summary
Reliability and validity are key concerns in psychological research
Part of the problem in psychology is the lack of reliable measures of the things we are interested in
Assuming that they are valid to begin with, we must always press for more reliable measures if we are to progress scientifically
This means letting go of supposed standards when they are no longer as useful, and looking for ways to improve current ones
