CHET7080@6

Assignment 6
Attempt both the Questions in the same document

a. What are four levels of measurement?
Levels of measurement, also called scales of measurement, tell you how precisely variables
are recorded. In scientific research, a variable is anything that can take on different values
across your data set (e.g., height or test scores).
There are 4 levels of measurement:
• Nominal: the data can only be categorized
• Ordinal: the data can be categorized and ranked
• Interval: the data can be categorized, ranked, and evenly spaced
• Ratio: the data can be categorized, ranked, and evenly spaced, and have a natural zero.
Depending on the level of measurement of the variable, what you can do to analyze your
data may be limited. There is a hierarchy in the complexity and precision of the level of
measurement, from low (nominal) to high (ratio).
Nominal, ordinal, interval, and ratio data
Going from lowest to highest, the 4 levels of measurement are cumulative. This means that
they each take on the properties of lower levels and add new properties.
Nominal level
You can categorize your data by labelling them in mutually exclusive groups, but there is no
order between the categories.
Examples of nominal scales:
• City of birth
• Gender
• Ethnicity

Ordinal level
You can categorize and rank your data in an order, but you cannot say anything about the
intervals between the rankings. Although you can rank the top 5 Olympic medallists, this
scale does not tell you how close or far apart they are in number of wins.
Examples of ordinal scales:
• Top 5 Olympic medallists
• Language ability (e.g., beginner, intermediate, fluent)
• Likert-type questions (e.g., very dissatisfied to very satisfied)

Interval level
You can categorize, rank, and infer equal intervals between neighboring data points, but
there is no true zero point. The difference between any two adjacent temperatures is the
same: one degree. But zero degrees is defined differently depending on the scale – it doesn’t
mean an absolute absence of temperature. The same is true for test scores and personality
inventories. A zero on a test is arbitrary; it does not mean that the test-taker has an absolute
lack of the trait being measured.
Examples of interval scales:
• Test scores (e.g., IQ or exams)
• Personality inventories
• Temperature in Fahrenheit or Celsius

Ratio level
You can categorize, rank, and infer equal intervals between neighboring data points, and
there is a true zero point. A true zero means there is an absence of the variable of interest. In
ratio scales, zero does mean an absolute lack of the variable. For example, in the Kelvin
temperature scale, there are no negative degrees of temperature – zero means an absolute
lack of thermal energy.
Examples of ratio scales:
• Height
• Age
• Weight
• Temperature in Kelvin
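The difference between interval and ratio scales can be checked with a little arithmetic. The sketch below uses two made-up temperatures (the values and variable names are illustrative assumptions, not data from the assignment). It shows that the ratio of two readings changes with the scale when the zero point is arbitrary, so a statement such as "twice as hot" only makes sense on a ratio scale like Kelvin.

```python
# Illustrative values only: two temperatures in degrees Celsius.
def celsius_to_kelvin(c):
    return c + 273.15

def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

warm_c, cool_c = 20.0, 10.0

# Differences are meaningful on interval scales: 10 Celsius degrees
# always correspond to 18 Fahrenheit degrees.
diff_c = warm_c - cool_c
diff_f = celsius_to_fahrenheit(warm_c) - celsius_to_fahrenheit(cool_c)

# Ratios are not: the answer depends on where each scale puts its zero.
ratio_c = warm_c / cool_c
ratio_f = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cool_c)

# Kelvin has a true zero, so this ratio is the physically meaningful one.
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(f"Celsius ratio:    {ratio_c:.3f}")
print(f"Fahrenheit ratio: {ratio_f:.3f}")
print(f"Kelvin ratio:     {ratio_k:.3f}")
```

The Celsius and Fahrenheit ratios disagree with each other and with the Kelvin ratio, even though all three describe the same pair of temperatures.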
b. What are reliability and validity of a measure?

Reliability
Reliability refers to the consistency of a measure. Psychologists consider three types of
consistency: over time (test-retest reliability), across items (internal consistency), and
across different researchers (inter-rater reliability).
• Test-Retest Reliability
When researchers measure a construct that they assume to be consistent across
time, then the scores they obtain should also be consistent across time. Test-retest
reliability is the extent to which this is actually the case. For example, intelligence is
generally thought to be consistent across time. A person who is highly intelligent today will
be highly intelligent next week. This means that any good measure of intelligence should
produce roughly the same scores for this individual next week as it does today. Clearly, a
measure that produces highly inconsistent scores over time cannot be a very good measure
of a construct that is supposed to be consistent.
Assessing test-retest reliability requires using the measure on a group of people at
one time, using it again on the same group of people at a later time, and then looking at the
test-retest correlation between the two sets of scores. This is typically done by graphing the
data in a scatterplot and computing Pearson’s r. Figure 5.2 shows the correlation between
two sets of scores of several university students on the Rosenberg Self-Esteem Scale,
administered two times, a week apart. Pearson’s r for these data is +.95. In general, a test-
retest correlation of +.80 or greater is considered to indicate good reliability.
Figure 5.2 Test-Retest Correlation Between Two Sets of Scores of Several College
Students on the Rosenberg Self-Esteem Scale, Given Two Times a Week Apart (score at
time 1 on the x-axis, score at time 2 on the y-axis; the scores are fairly consistent)
Again, high test-retest correlations make sense when the construct being measured
is assumed to be consistent over time, which is the case for intelligence, self-esteem, and
the Big Five personality dimensions. But other constructs are not assumed to be stable over
time. The very nature of mood, for example, is that it changes. So a measure of mood that
produced a low test-retest correlation over a period of a month would not be a cause for
concern.
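The test-retest procedure described above can be sketched in a few lines of Python. The scores below are hypothetical (they are not the Rosenberg data behind Figure 5.2), and the function name pearson_r is chosen for this example. The +.80 cut-off is the rule of thumb quoted in the text.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scores for the same six people, measured one week apart.
time1 = [22, 25, 18, 30, 27, 15]
time2 = [23, 24, 19, 29, 28, 16]

r = pearson_r(time1, time2)
is_good = r >= 0.80  # rule of thumb from the text

print(f"test-retest r = {r:+.2f}, good reliability: {is_good}")
```

In practice a statistics package would normally be used for this; the hand-written function is only there to make the calculation visible.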
• Internal Consistency
A second kind of reliability is internal consistency, which is the consistency of
people’s responses across the items on a multiple-item measure. In general, all the items on
such measures are supposed to reflect the same underlying construct, so people’s scores on
those items should be correlated with each other. On the Rosenberg Self-Esteem Scale,
people who agree that they are a person of worth should tend to agree that they have a
number of good qualities. If people’s responses to the different items are not correlated
with each other, then it would no longer make sense to claim that they are all measuring the
same underlying construct. This is as true for behavioural and physiological measures as
for self-report measures. For example, people might make a series of bets in a simulated
game of roulette as a measure of their level of risk seeking. This measure would be
internally consistent to the extent that individual participants’ bets were consistently high
or low across trials.
Like test-retest reliability, internal consistency can only be assessed by collecting
and analyzing data. One approach is to look at a split-half correlation. This involves splitting
the items into two sets, such as the first and second halves of the items or the even- and
odd-numbered items. Then a score is computed for each set of items, and the relationship
between the two sets of scores is examined. For example, Figure 5.3 shows the split-half
correlation between several university students’ scores on the even-numbered items and
their scores on the odd-numbered items of the Rosenberg Self-Esteem Scale. Pearson’s r for
these data is +.88. A split-half correlation of +.80 or greater is generally considered good
internal consistency.
Figure 5.3 Split-Half Correlation Between Several College Students’ Scores on the
Even-Numbered Items and Their Scores on the Odd-Numbered Items of the Rosenberg
Self-Esteem Scale (score on even-numbered items on the x-axis, score on odd-numbered
items on the y-axis; the scores are fairly consistent)
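As a rough illustration of the split-half approach, the sketch below scores the odd- and even-numbered items separately and then correlates the two half-scores. The response matrix is invented for this example (five respondents, six items), and the helper names are assumptions.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical data: one row per respondent, one column per item (items 1-6).
responses = [
    [4, 4, 3, 4, 4, 3],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 3, 2, 3, 3],
    [1, 2, 1, 1, 2, 1],
    [4, 3, 4, 4, 4, 4],
]

# Items 1, 3, 5 sit at list indices 0, 2, 4; items 2, 4, 6 at indices 1, 3, 5.
odd_scores = [sum(row[0::2]) for row in responses]
even_scores = [sum(row[1::2]) for row in responses]

split_half_r = pearson_r(odd_scores, even_scores)
print(f"split-half r = {split_half_r:+.2f}")
```

Splitting by odd and even item numbers is only one of the possible splits mentioned in the text; splitting into first and second halves would use `row[:3]` and `row[3:]` instead.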
Perhaps the most common measure of internal consistency used by researchers in
psychology is a statistic called Cronbach’s α (the Greek letter alpha). Conceptually, α is the
mean of all possible split-half correlations for a set of items. For example, there are 252
ways to choose five of 10 items for one half, which gives 126 distinct splits into two sets of
five because each split is counted twice. Cronbach’s α would be the mean of the 126
split-half correlations. Note that this is not how α is actually computed, but it is a
correct way of interpreting the meaning of this statistic. Again, a value of +.80 or greater is
generally taken to indicate good internal consistency.
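The text notes that α is not actually computed by averaging split-half correlations. A minimal sketch of the usual variance-based formula follows; the formula is not given in the assignment and is stated here as background: α = k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The response matrix is hypothetical and the function name is an assumption.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a matrix with one row per respondent
    and one column per item (sample variances throughout)."""
    k = len(responses[0])                       # number of items
    item_columns = list(zip(*responses))        # transpose to per-item columns
    sum_item_variances = sum(variance(col) for col in item_columns)
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)

# Hypothetical data: five respondents answering six items.
responses = [
    [4, 4, 3, 4, 4, 3],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 3, 2, 3, 3],
    [1, 2, 1, 1, 2, 1],
    [4, 3, 4, 4, 4, 4],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")
```

When the items move together, the variance of the total scores is much larger than the sum of the item variances, which pushes α toward 1.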
• Inter-rater Reliability
Many behavioural measures involve significant judgment on the part of an
observer or a rater. Inter-rater reliability is the extent to which different observers are
consistent in their judgments. For example, if you were interested in measuring university
students’ social skills, you could make video recordings of them as they interacted with
another student whom they are meeting for the first time. Then you could have two or more
observers watch the videos and rate each student’s level of social skills. To the extent that
each participant does in fact have some level of social skills that can be detected by an
attentive observer, different observers’ ratings should be highly correlated with each other.
Inter-rater reliability would also have been measured in Bandura’s Bobo doll study. In this
case, the observers’ ratings of how many acts of aggression a particular child committed
while playing with the Bobo doll should have been highly positively correlated. Inter-rater
reliability is often assessed using Cronbach’s α when the judgments are quantitative or an
analogous statistic called Cohen’s κ (the Greek letter kappa) when they are categorical.
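For categorical judgments, Cohen’s κ compares the agreement that was observed with the agreement expected by chance: κ = (p_observed − p_chance) / (1 − p_chance). The sketch below is a minimal illustration for two raters. The ratings are invented, and the category labels "agg" and "non" (aggressive or non-aggressive act) are assumptions made for this example.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who each assign one category per case.
    Undefined (division by zero) if chance agreement is exactly 1."""
    n = len(ratings_a)
    # Observed agreement: proportion of cases where the raters match.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: for each category, multiply the two raters'
    # marginal proportions, then sum over categories.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(counts_a) | set(counts_b)
    p_chance = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical codes from two observers for the same ten behaviours.
rater_a = ["agg", "agg", "non", "non", "agg", "non", "agg", "non", "non", "agg"]
rater_b = ["agg", "agg", "non", "agg", "agg", "non", "agg", "non", "non", "non"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")
```

Here the raters agree on 8 of 10 cases, but half of that agreement would be expected by chance alone, so κ is lower than the raw agreement rate.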
Validity
Validity is the extent to which the scores from a measure represent the variable they are
intended to. But how do researchers make this judgment? We have already considered one
factor that they take into account—reliability. When a measure has good test-retest
reliability and internal consistency, researchers should be more confident that the scores
represent what they are supposed to. There has to be more to it, however, because a
measure can be extremely reliable but have no validity whatsoever. As an absurd example,
imagine someone who believes that people’s index finger length reflects their self-esteem
and therefore tries to measure self-esteem by holding a ruler up to people’s index fingers.
Although this measure would have extremely good test-retest reliability, it would have
absolutely no validity. The fact that one person’s index finger is a centimetre longer than
another’s would indicate nothing about which one had higher self-esteem.
Discussions of validity usually divide it into several distinct “types.” But a good way to
interpret these types is that they are other kinds of evidence—in addition to reliability—
that should be taken into account when judging the validity of a measure. Here we consider
four basic kinds: face validity, content validity, criterion validity, and discriminant validity.
• Face Validity
Face validity is the extent to which a measurement method appears “on its face” to
measure the construct of interest. Most people would expect a self-esteem questionnaire to
include items about whether they see themselves as a person of worth and whether they
think they have good qualities. So a questionnaire that included these kinds of items would
have good face validity. The finger-length method of measuring self-esteem, on the other
hand, seems to have nothing to do with self-esteem and therefore has poor face validity.
Although face validity can be assessed quantitatively—for example, by having a large
sample of people rate a measure in terms of whether it appears to measure what it is
intended to—it is usually assessed informally.
Face validity is at best a very weak kind of evidence that a measurement method is
measuring what it is supposed to. One reason is that it is based on people’s intuitions about
human behaviour, which are frequently wrong. It is also the case that many established
measures in psychology work quite well despite lacking face validity. The Minnesota
Multiphasic Personality Inventory-2 (MMPI-2) measures many personality characteristics
and disorders by having people decide whether each of 567 different statements
applies to them—where many of the statements do not have any obvious relationship to
the construct that they measure. For example, the items “I enjoy detective or mystery
stories” and “The sight of blood doesn’t frighten me or make me sick” both measure the
suppression of aggression. In this case, it is not the participants’ literal answers to these
questions that are of interest, but rather whether the pattern of the participants’ responses
to a series of questions matches those of individuals who tend to suppress their aggression.
• Content Validity
Content validity is the extent to which a measure “covers” the construct of interest.
For example, if a researcher conceptually defines test anxiety as involving both sympathetic
nervous system activation (leading to nervous feelings) and negative thoughts, then his
measure of test anxiety should include items about both nervous feelings and negative
thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and
actions toward something. By this conceptual definition, a person has a positive attitude
toward exercise to the extent that he or she thinks positive thoughts about exercising, feels
good about exercising, and actually exercises. So to have good content validity, a measure of
people’s attitudes toward exercise would have to reflect all three of these aspects. Like face
validity, content validity is not usually assessed quantitatively. Instead, it is assessed by
carefully checking the measurement method against the conceptual definition of the
construct.
• Criterion Validity
Criterion validity is the extent to which people’s scores on a measure are correlated
with other variables (known as criteria) that one would expect them to be correlated with.
For example, people’s scores on a new measure of test anxiety should be negatively
correlated with their performance on an important school exam. If it were found that
people’s scores were in fact negatively correlated with their exam performance, then this
would be a piece of evidence that these scores really represent people’s test anxiety. But if it
were found that people scored equally well on the exam regardless of their test anxiety
scores, then this would cast doubt on the validity of the measure.
A criterion can be any variable that one has reason to think should be correlated
with the construct being measured, and there will usually be many of them. For example,
one would expect test anxiety scores to be negatively correlated with exam performance
and course grades and positively correlated with general anxiety and with blood pressure
during an exam. Or imagine that a researcher develops a new measure of physical risk
taking. People’s scores on this measure should be correlated with their participation in
“extreme” activities such as snowboarding and rock climbing, the number of speeding
tickets they have received, and even the number of broken bones they have had over the
years. When the criterion is measured at the same time as the construct, criterion validity is
referred to as concurrent validity; however, when the criterion is measured at some point in
the future (after the construct has been measured), it is referred to as predictive validity
(because scores on the measure have “predicted” a future outcome).
Criteria can also include other measures of the same construct. For example, one
would expect new measures of test anxiety or physical risk taking to be positively
correlated with existing measures of the same constructs. This is known as convergent
validity.
Assessing convergent validity requires collecting data using the measure.
Researchers John Cacioppo and Richard Petty did this when they created their self-report
Need for Cognition Scale to measure how much people value and engage in thinking
(Cacioppo & Petty, 1982). In a series of studies, they showed that people’s scores were
positively correlated with their scores on a standardized academic achievement test, and
that their scores were negatively correlated with their scores on a measure of dogmatism
(which represents a tendency toward obedience). In the years since it was created, the
Need for Cognition Scale has been used in literally hundreds of studies and has been shown
to be correlated with a wide variety of other variables, including the effectiveness of an
advertisement, interest in politics, and juror decisions (Petty, Briñol, Loersch, & McCaslin,
2009).
• Discriminant Validity
Discriminant validity, on the other hand, is the extent to which scores on a measure
are not correlated with measures of variables that are conceptually distinct. For example,
self-esteem is a general attitude toward the self that is fairly stable over time. It is not the
same as mood, which is how good or bad one happens to be feeling right now. So people’s
scores on a new measure of self-esteem should not be very highly correlated with their
moods. If the new measure of self-esteem were highly correlated with a measure of mood, it
could be argued that the new measure is not really measuring self-esteem; it is measuring
mood instead.
When they created the Need for Cognition Scale, Cacioppo and Petty also provided
evidence of discriminant validity by showing that people’s scores were not correlated with
certain other variables. For example, they found only a weak correlation between people’s
need for cognition and a measure of their cognitive style—the extent to which they tend to
think analytically by breaking ideas into smaller parts or holistically in terms of “the big
picture.” They also found no correlation between people’s need for cognition and measures
of their test anxiety and their tendency to respond in socially desirable ways. All these low
correlations provide evidence that the measure is reflecting a conceptually distinct
construct.
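Convergent and discriminant validity can be checked side by side by correlating a new measure with an established measure of the same construct and with a measure of a distinct construct. The sketch below uses invented scores for six people. The variable names and the numbers are assumptions for illustration only and do not come from Cacioppo and Petty’s studies.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scores for the same six people.
new_self_esteem = [10, 14, 18, 22, 26, 30]          # the measure being validated
established_self_esteem = [12, 15, 17, 23, 25, 31]  # same construct
current_mood = [5, 7, 3, 6, 4, 6]                   # conceptually distinct

# Convergent validity: expect a strong positive correlation.
r_convergent = pearson_r(new_self_esteem, established_self_esteem)

# Discriminant validity: expect a correlation near zero.
r_discriminant = pearson_r(new_self_esteem, current_mood)

print(f"convergent r   = {r_convergent:+.2f}")
print(f"discriminant r = {r_discriminant:+.2f}")
```

A high first value together with a second value near zero is the pattern the text describes as evidence that the measure reflects a conceptually distinct construct.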
