
Reliability


UNIT-4

RELIABILITY
By: Dr. Khusboo
Assistant Professor
RELIABILITY
 Reliability of a test is a criterion of test quality relating to the accuracy of psychological
measurements. The higher the reliability of a test, the freer it is of measurement errors. Some
regard it as the stability of results in repeated testing: the same individual or object is tested
in the same way and yields the same value from moment to moment, provided that the thing
measured has itself not changed in the meantime.
 In other words, reliability refers to the consistency of a measure. A test is considered
reliable if we get the same result repeatedly. For example, if a test is designed to measure a
trait (such as introversion), then each time the test is administered to a subject, the results
should be approximately the same.
 There are two types of reliability – internal and external reliability.
• Internal reliability assesses the consistency of results across items within a test.
• External reliability refers to the extent to which a measure varies from one use to
another.
TEST-RETEST RELIABILITY
 The most frequently used method of finding the reliability of a test is to repeat the same test
on a second occasion. The reliability coefficient (r) in this case is the correlation between the
scores obtained by the same person on the two administrations of the test. The main problem with
this method is deciding the interval between the two administrations: if the interval is long
(say, six months) and the subjects are young children, growth changes will affect the test scores.
 For example, intelligence is generally thought to be consistent across time. A person who is
highly intelligent today will be highly intelligent next week. This means that any good
measure of intelligence should produce roughly the same scores for this individual next week
as it does today.
 Assessing test-retest reliability requires using the measure on a group of people at one time,
using it again on the same group of people at a later time, and then looking at the test-retest
correlation between the two sets of scores. This is typically done by graphing the data in a
scatterplot and computing Pearson’s r.
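
A minimal sketch with hypothetical scores (not taken from the slide): the test-retest reliability is simply Pearson's r between the two administrations.

```python
import numpy as np

# Hypothetical scores for the same five people tested on two occasions
time1 = np.array([12, 18, 25, 30, 22])   # first administration
time2 = np.array([14, 17, 27, 29, 21])   # second administration, after an interval

# Pearson's r between the two administrations is the test-retest reliability
r = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability: r = {r:.2f}")
```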
PARALLEL FORM OR ALTERNATE FORMS
 To overcome the difficulties of practice effects and the time interval in the test–retest
method, the method of parallel or alternate forms is used. Using equivalent or parallel forms has
some advantages, such as lessening the possible effects of practice and recall, but this method
presents the additional problem of constructing and standardizing a second form.
 In other words, Parallel forms reliability measures the correlation between two equivalent
versions of a test. The most common way to measure parallel forms reliability is to produce a
large set of questions to evaluate the same thing, then divide these randomly into two question
sets. The same group of respondents answers both sets, and you calculate the correlation
between the results. High correlation between the two indicates high parallel forms reliability.
 For example, a set of questions is formulated to measure financial risk aversion in a group of
respondents. The questions are randomly divided into two sets, and the respondents are randomly
divided into two groups. Both groups take both tests: group A takes test A first, and group B
takes test B first. The results of the two tests are compared and found to be almost identical,
indicating high parallel forms reliability.
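
A minimal sketch of this procedure, assuming hypothetical item data: the item pool is randomly divided into two forms, each respondent's total on each form is summed, and the two totals are correlated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: each respondent's item responses reflect a latent trait plus noise
ability = rng.normal(0, 1, size=(50, 1))             # latent trait per respondent
items = ability + rng.normal(0, 1, size=(50, 20))    # 50 respondents x 20 equivalent items

# Randomly divide the 20 items into two parallel forms of 10 items each
order = rng.permutation(items.shape[1])
form_a, form_b = items[:, order[:10]], items[:, order[10:]]

# Correlation between total scores on the two forms estimates parallel-forms reliability
r = np.corrcoef(form_a.sum(axis=1), form_b.sum(axis=1))[0, 1]
print(f"Parallel-forms reliability estimate: r = {r:.2f}")
```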
SPLIT-HALF METHOD
 The advantage that this method has over the test–retest method is that only one testing is
needed. This technique is also better than the parallel-form method for finding reliability
because only one test is required. In this method, the test from a single administration is
scored in two halves, so that variation brought about by differences between two testing
situations is eliminated.
 In other words, split-half reliability is determined by dividing the total set of items (e.g.,
questions) relating to a construct of interest into halves (e.g., odd-numbered and even-
numbered questions) and comparing the results obtained from the two subsets of items thus
created. The closer the correlation between results from the two versions, the greater
the reliability of the survey or instrument.
 A reliability coefficient of this type is called a coefficient of internal consistency. Internal
consistency reflects the extent to which items within an instrument measure various aspects of
the same characteristic or construct.
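
A minimal sketch of the odd/even split described above, using hypothetical data; the half-test correlation is commonly stepped up to full-test length with the Spearman-Brown formula (not mentioned in the slide, included here for completeness).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical responses: 100 examinees x 10 items measuring one construct
ability = rng.normal(0, 1, size=(100, 1))
scores = ability + rng.normal(0, 1, size=(100, 10))

# Split the items into odd- and even-numbered halves and total each half
odd_half = scores[:, 0::2].sum(axis=1)
even_half = scores[:, 1::2].sum(axis=1)

# Correlation between the two half-tests
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction estimates the reliability of the full-length test
r_full = 2 * r_half / (1 + r_half)
print(f"Half-test r = {r_half:.2f}, Spearman-Brown corrected = {r_full:.2f}")
```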
OTHER TYPES OF INTERNAL CONSISTENCY
 The Kuder-Richardson Formula 20, often abbreviated KR-20, is used to measure
the internal consistency reliability of a test with dichotomous choices, i.e., each question only
has two answers: right or wrong.
 The Kuder–Richardson formula is applicable for finding the internal consistency of tests whose
items are scored as right or wrong, or according to some other all-or-none system. Some tests,
however, have multiple-choice items, and on a personality inventory there are more than two
response categories. For such tests, a generalized formula has been derived, known as coefficient
alpha (Cronbach, 1951). Cronbach’s alpha quantifies the level of agreement on a standardized
0 to 1 scale. Higher values indicate higher agreement between items.
 High Cronbach’s alpha values indicate that response values for each participant across a set of
questions are consistent. For example, when participants give a high response for one of the
items, they are also likely to provide high responses for the other items. This consistency
indicates the measurements are reliable and the items might measure the same characteristic.
 Conversely, low values indicate the set of items do not reliably measure the same
construct. High responses for one question do not suggest that participants rated the other
items highly. Consequently, the questions are unlikely to measure the same property
because the measurements are unreliable.
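
A minimal sketch with hypothetical data: coefficient alpha can be computed as k/(k-1) * (1 - sum of item variances / variance of total scores), and applying the same computation to items scored 0/1 corresponds to KR-20.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha: (k / (k - 1)) * (1 - sum of item variances / variance of totals)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(2)
ability = rng.normal(0, 1, size=(200, 1))

# Likert-type items (more than two response categories) -> coefficient alpha
likert = np.clip(np.round(3 + ability + rng.normal(0, 1, size=(200, 8))), 1, 5)
print(f"Cronbach's alpha: {cronbach_alpha(likert):.2f}")

# Dichotomous (right/wrong) items -> the same computation corresponds to KR-20
dichotomous = (ability + rng.normal(0, 1, size=(200, 8)) > 0).astype(float)
print(f"KR-20 (alpha on 0/1 items): {cronbach_alpha(dichotomous):.2f}")
```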
 Inter-rater reliability measures the agreement between subjective ratings by multiple
raters, inspectors, judges, or appraisers. It answers the question, is the rating system
consistent? High inter-rater reliability indicates that multiple raters’ ratings for the same
item are consistent. Conversely, low reliability means they are inconsistent.
 For example, judges evaluate the quality of academic writing samples using ratings of 1–5.
When multiple raters assess the same writing, how similar are their ratings?
 Evaluating inter-rater reliability is vital for understanding how likely a measurement
system will misclassify an item. A measurement system is invalid when ratings do not
have high inter-rater reliability because the judges frequently disagree.
 For the writing example, if the judges give vastly different ratings to the same writing,
you cannot trust the results because the ratings are inconsistent. However, if the ratings
are very similar, the rating system is consistent.
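
Inter-rater agreement can be quantified in several ways; the sketch below uses Cohen's kappa (an index not named in the slide) via scikit-learn, with hypothetical 1–5 ratings from two judges. Quadratic weighting treats near-misses on the ordinal scale as partial agreement.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 ratings given by two judges to the same ten writing samples
judge_a = [4, 3, 5, 2, 4, 3, 1, 5, 4, 2]
judge_b = [4, 3, 4, 2, 5, 3, 2, 5, 4, 2]

# Quadratically weighted kappa: chance-corrected agreement on an ordinal scale
kappa = cohen_kappa_score(judge_a, judge_b, weights="quadratic")
print(f"Weighted Cohen's kappa: {kappa:.2f}")
```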
GENERALIZABILITY THEORY
 A highly useful theory that informs reliability, validity, elements of study design, and
data analysis is Generalizability theory (G-theory). G-theory is a statistical framework for
examining, determining, and designing the reliability of various observations or ratings.
 In performance-based assessments we need to consider potential influences on
assessment scores, such as rater bias, relative difficulty of items or stations, the rater's or
examinee's attention or mood, the abilities of standardized patients, and the overall
environment. G-theory offers a way to quantify the variance contributed by these
factors, which G-theory refers to as facets. Each form of a given facet is called
a condition.
 In G theory, sources of variation are referred to as facets. Facets are similar to the
"factors" used in analysis of variance, and may include persons, raters, items/forms,
time, and settings among other possibilities. These facets are potential sources of error
and the purpose of generalizability theory is to quantify the amount of error caused by
each facet and interaction of facets. The usefulness of data gained from a G study is
crucially dependent on the design of the study. Therefore, the researcher must carefully
consider the ways in which he/she hopes to generalize any specific results.
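
A minimal sketch of a one-facet G study, assuming a hypothetical fully crossed persons x raters design: variance components for persons, raters, and residual error are estimated from two-way ANOVA mean squares, and a relative generalizability coefficient is computed from them.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical fully crossed design: 30 persons each rated by the same 4 raters
n_p, n_r = 30, 4
person_effect = rng.normal(0, 1.0, size=(n_p, 1))   # true differences between persons
rater_effect = rng.normal(0, 0.3, size=(1, n_r))     # rater leniency/severity
scores = 3 + person_effect + rater_effect + rng.normal(0, 0.5, size=(n_p, n_r))

# Two-way ANOVA mean squares (no replication)
grand = scores.mean()
ms_p = n_r * ((scores.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)
ms_r = n_p * ((scores.mean(axis=0) - grand) ** 2).sum() / (n_r - 1)
resid = scores - scores.mean(axis=1, keepdims=True) - scores.mean(axis=0, keepdims=True) + grand
ms_res = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))

# Variance components for each facet (negative estimates truncated at zero)
var_res = ms_res
var_p = max((ms_p - ms_res) / n_r, 0.0)
var_r = max((ms_r - ms_res) / n_p, 0.0)

# Relative G coefficient: how dependably do scores averaged over n_r raters rank persons?
g_coef = var_p / (var_p + var_res / n_r)
print(f"Person variance {var_p:.2f}, rater variance {var_r:.2f}, residual {var_res:.2f}")
print(f"Generalizability coefficient (relative, {n_r} raters): {g_coef:.2f}")
```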
