
Reliability (statistics)

In statistics and psychometrics, reliability is the overall consistency of a measure.[1] A measure is said to
have a high reliability if it produces similar results under consistent conditions:

"It is the characteristic of a set of test scores that relates to the amount of random error from the
measurement process that might be embedded in the scores. Scores that are highly reliable are
precise, reproducible, and consistent from one testing occasion to another. That is, if the testing
process were repeated with a group of test takers, essentially the same results would be
obtained. Various kinds of reliability coefficients, with values ranging between 0.00 (much
error) and 1.00 (no error), are usually used to indicate the amount of error in the scores."[2]

For example, measurements of people's height and weight are often extremely reliable.[3][4]

Contents
Types
Difference from validity
General model
Classical test theory
Item response theory
Estimation
See also
References
External links

Types
There are several general classes of reliability estimates:

Inter-rater reliability assesses the degree of agreement between two or more raters in their
appraisals. For example, a person gets a stomach ache and different doctors all give the
same diagnosis.[5]: 71 
Test-retest reliability assesses the degree to which test scores are consistent from one test
administration to the next. Measurements are gathered from a single rater who uses the
same methods or instruments and the same testing conditions.[4] This includes intra-rater
reliability.
Inter-method reliability assesses the degree to which test scores are consistent when there
is a variation in the methods or instruments used. This allows inter-rater reliability to be ruled
out. When dealing with forms, it may be termed parallel-forms reliability.[6]
Internal consistency reliability assesses the consistency of results across items within a
test.[6]

Difference from validity


Reliability does not imply validity. That is, a reliable measure that measures something consistently is
not necessarily measuring what you want it to measure. For example, while there are many reliable tests
of specific abilities, not all of them would be valid for predicting, say, job performance.

While reliability does not imply validity, reliability does place a limit on the overall validity of a test. A test
that is not perfectly reliable cannot be perfectly valid, either as a means of measuring attributes of a person
or as a means of predicting scores on a criterion. While a reliable test may provide useful valid information,
a test that is not reliable cannot possibly be valid.[7]

For example, if a set of weighing scales consistently measured the weight of an object as 500 grams over
the true weight, then the scale would be very reliable, but it would not be valid (as the returned weight is
not the true weight). For the scale to be valid, it should return the true weight of an object. This example
demonstrates that a perfectly reliable measure is not necessarily valid, but that a valid measure necessarily
must be reliable.

General model
In practice, testing measures are never perfectly consistent. Theories of test reliability have been developed
to estimate the effects of inconsistency on the accuracy of measurement. The basic starting point for almost
all theories of test reliability is the idea that test scores reflect the influence of two sorts of factors:[7]

1. Factors that contribute to consistency: stable characteristics of the individual or the attribute that one is
trying to measure.

2. Factors that contribute to inconsistency: features of the individual or the situation that can affect test
scores but have nothing to do with the attribute being measured.

These factors include:[7]

Temporary but general characteristics of the individual: health, fatigue, motivation, emotional
strain
Temporary and specific characteristics of individual: comprehension of the specific test task,
specific tricks or techniques of dealing with the particular test materials, fluctuations of
memory, attention or accuracy
Aspects of the testing situation: freedom from distractions, clarity of instructions, interaction
of personality, etc.
Chance factors: luck in selection of answers by sheer guessing, momentary distractions

The goal of estimating reliability is to determine how much of the variability in test scores is due to errors
in measurement and how much is due to variability in true scores.[7]

A true score is the replicable feature of the concept being measured. It is the part of the observed score that
would recur across different measurement occasions in the absence of error.

Errors of measurement are composed of both random error and systematic error. They represent the
discrepancies between scores obtained on tests and the corresponding true scores.
This conceptual breakdown is typically represented by the simple equation:

Observed test score = true score + errors of measurement
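The decomposition above can be sketched numerically. The snippet below simulates repeated measurements of a single hypothetical test taker; the true score of 100 and the error standard deviation of 5 are made-up values for illustration only.

```python
import random

random.seed(42)

# Classical model: observed score X = true score T + error E.
# All numbers here are hypothetical.
true_score = 100.0
observed = [true_score + random.gauss(0, 5) for _ in range(10_000)]

# Across many measurements the random errors average out to roughly
# zero, so the mean observed score approaches the true score.
mean_error = sum(x - true_score for x in observed) / len(observed)
```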

Classical test theory


The goal of reliability theory is to estimate errors in measurement and to suggest ways of improving tests so
that errors are minimized.

The central assumption of reliability theory is that measurement errors are essentially random. This does not
mean that errors arise from random processes. For any individual, an error in measurement is not a
completely random event. However, across a large number of individuals, the causes of measurement error
are assumed to be so varied that measurement errors act as random variables.[7]

If errors have the essential characteristics of random variables, then it is reasonable to assume that errors are
equally likely to be positive or negative, and that they are not correlated with true scores or with errors on
other tests.

It is assumed that:[8]

1. Mean error of measurement = 0

2. True scores and errors are uncorrelated

3. Errors on different measures are uncorrelated

Reliability theory shows that the variance of obtained scores is simply the sum of the variance of true
scores plus the variance of errors of measurement:[7]

σ²X = σ²T + σ²E

This equation suggests that test scores vary as the result of two factors:

1. Variability in true scores

2. Variability due to errors of measurement.

The reliability coefficient ρxx′ provides an index of the relative influence of true and error scores on
attained test scores. In its general form, the reliability coefficient is defined as the ratio of true score variance
to the total variance of test scores, or, equivalently, as one minus the ratio of error score variance to
observed score variance:

ρxx′ = σ²T / σ²X = 1 − σ²E / σ²X

Unfortunately, there is no way to directly observe or calculate the true score, so a variety of methods are
used to estimate the reliability of a test.
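As a numeric sketch of the two equivalent definitions of the reliability coefficient (the variance components below are made up; in practice they cannot be observed directly):

```python
# Hypothetical variance components; real true and error variances
# are unobservable and must be estimated.
var_true = 80.0
var_error = 20.0
var_observed = var_true + var_error  # additivity assumed by the theory

r_from_true = var_true / var_observed        # true variance / total variance
r_from_error = 1 - var_error / var_observed  # 1 - error / total variance
# Both expressions give the same reliability coefficient, here 0.8.
```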

Some examples of the methods to estimate reliability include test-retest reliability, internal consistency
reliability, and parallel-test reliability. Each method comes at the problem of figuring out the source of error
in the test somewhat differently.

Item response theory
It was well known to classical test theorists that measurement precision is not uniform across the scale of
measurement. Tests tend to distinguish better for test-takers with moderate trait levels and worse among
high- and low-scoring test-takers. Item response theory extends the concept of reliability from a single
index to a function called the information function. The IRT information function is the inverse of the
conditional observed score standard error at any given test score.
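A minimal sketch of the idea, assuming a hypothetical one-parameter logistic (Rasch) item with difficulty b = 0, for which the item information takes the standard form P(θ)(1 − P(θ)):

```python
import math

def information(theta, b=0.0):
    """Item information for a hypothetical Rasch item: I = P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

# Information peaks near the item's difficulty and shrinks at the
# extremes, so measurement precision is not uniform across trait levels.
peak = information(0.0)
tail = information(3.0)
```

In the usual formulation, the conditional standard error of measurement is the reciprocal square root of the total test information, so more information at a given trait level means a smaller standard error there.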

Estimation
The goal of estimating reliability is to determine how much of the variability in test scores is due to errors in
measurement and how much is due to variability in true scores.

Four practical strategies have been developed that provide workable methods of estimating test reliability.[7]

1. Test-retest reliability method: directly assesses the degree to which test scores are consistent from one
test administration to the next.

It involves:

Administering a test to a group of individuals
Re-administering the same test to the same group at some later time
Correlating the first set of scores with the second

The correlation between scores on the first test and the scores on the retest is used to estimate the reliability
of the test using the Pearson product-moment correlation coefficient: see also item-total correlation.
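The steps above can be sketched with made-up scores; the pearson_r helper is written out here for self-containment, though in practice a statistics library would supply it:

```python
def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical scores from two administrations of the same test
first = [12, 15, 11, 18, 14, 16]
retest = [13, 14, 10, 17, 15, 16]

test_retest_reliability = pearson_r(first, retest)
```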

2. Parallel-forms method:

The key to this method is the development of alternate test forms that are equivalent in terms of content,
response processes and statistical characteristics. For example, alternate forms exist for several tests of
general intelligence, and these tests are generally seen as equivalent.[7]

With the parallel test model it is possible to develop two forms of a test that are equivalent in the sense that a
person's true score on form A would be identical to their true score on form B. If both forms of the test
were administered to a number of people, differences between scores on form A and form B may be due to
errors in measurement only.[7]

It involves:

Administering one form of the test to a group of individuals
At some later time, administering an alternate form of the same test to the same group of people
Correlating scores on form A with scores on form B

The correlation between scores on the two alternate forms is used to estimate the reliability of the test.

This method provides a partial solution to many of the problems inherent in the test-retest reliability
method. For example, since the two forms of the test are different, carryover effect is less of a problem.
Reactivity effects are also partially controlled: although taking the first test may change responses to the
second test, it is reasonable to assume that the effect will not be as strong with alternate forms of
the test as with two administrations of the same test.[7]

However, this technique has its disadvantages:

It may be very difficult to create several alternate forms of a test
It may also be difficult if not impossible to guarantee that two alternate forms of a test are parallel measures

3. Split-half method:

This method treats the two halves of a measure as alternate forms. It provides a simple solution to the
problem that the parallel-forms method faces: the difficulty in developing alternate forms.[7]

It involves:

Administering a test to a group of individuals
Splitting the test in half
Correlating scores on one half of the test with scores on the other half of the test

The correlation between these two split halves is used in estimating the reliability of the test. This half-test
reliability estimate is then stepped up to the full test length using the Spearman–Brown prediction formula.

There are several ways of splitting a test to estimate reliability. For example, a 40-item vocabulary test could
be split into two subtests, the first one made up of items 1 through 20 and the second made up of items 21
through 40. However, the responses from the first half may be systematically different from responses in the
second half due to an increase in item difficulty and fatigue.[7]

In splitting a test, the two halves would need to be as similar as possible, both in terms of their content and
in terms of the probable state of the respondent. The simplest method is to adopt an odd-even split, in which
the odd-numbered items form one half of the test and the even-numbered items form the other. This
arrangement guarantees that each half will contain an equal number of items from the beginning, middle,
and end of the original test.[7]
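The split-half procedure and the Spearman–Brown step-up can be sketched as follows, using made-up per-person totals from an odd-even split:

```python
def pearson_r(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical per-person totals from an odd-even split of one test
odd_half = [8, 11, 7, 13, 10]
even_half = [9, 10, 6, 12, 11]

r_half = pearson_r(odd_half, even_half)

# Spearman-Brown prediction formula for doubling the test length:
# the full-length estimate is always at least the half-test correlation.
r_full = (2 * r_half) / (1 + r_half)
```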

4. Internal consistency: assesses the consistency of results across items within a test. The most common
internal consistency measure is Cronbach's alpha, which is usually interpreted as the mean of all possible
split-half coefficients.[9] Cronbach's alpha is a generalization of an earlier form of estimating internal
consistency, Kuder–Richardson Formula 20.[9] Although the most commonly used, there are some
misconceptions regarding Cronbach's alpha.[10][11]
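A self-contained sketch of Cronbach's alpha in its variance form, α = k/(k−1) · (1 − Σσ²item / σ²total), with made-up item responses:

```python
def cronbach_alpha(item_scores):
    """Cronbach's alpha from a list of item-score lists (one list per item)."""
    k = len(item_scores)
    n = len(item_scores[0])

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    # Total score for each of the n respondents
    totals = [sum(item[p] for item in item_scores) for p in range(n)]
    sum_item_var = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - sum_item_var / variance(totals))

# Hypothetical responses: 3 items answered by 4 people
items = [
    [2, 4, 3, 5],
    [1, 4, 3, 4],
    [2, 5, 4, 5],
]
alpha = cronbach_alpha(items)
```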

These measures of reliability differ in their sensitivity to different sources of error and so need not be equal.
Also, reliability is a property of the scores of a measure rather than of the measure itself, and is thus said to
be sample-dependent. Reliability estimates from one sample might differ from those of a second sample
(beyond what might be expected due to sampling variations) if the second sample is drawn from a different
population because the true variability is different in this second population. (This is true of measures of all
types—yardsticks might measure houses well yet have poor reliability when used to measure the lengths of
insects.)

Reliability may be improved by clarity of expression (for written assessments), lengthening the measure,[9]
and other informal means. However, formal psychometric analysis, called item analysis, is considered the
most effective way to increase reliability. This analysis consists of computation of item difficulties and item
discrimination indices, the latter index involving computation of correlations between the items and sum of
the item scores of the entire test. If items that are too difficult, too easy, and/or have near-zero or negative
discrimination are replaced with better items, the reliability of the measure will increase.
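The item statistics described above can be sketched with hypothetical right/wrong (1/0) responses; difficulty is the proportion answering correctly, and discrimination is approximated here by the uncorrected item-total correlation:

```python
def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical 1/0 responses: responses[person][item]
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 1],
]
n_people = len(responses)
n_items = len(responses[0])
totals = [sum(person) for person in responses]

# Item difficulty: proportion of test takers answering the item correctly
difficulty = [sum(r[i] for r in responses) / n_people for i in range(n_items)]

# Item discrimination: correlation of each item with the total score
discrimination = [pearson_r([r[i] for r in responses], totals)
                  for i in range(n_items)]
```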


See also
Coefficient of variation
Congeneric reliability
Consistency (statistics)
Homogeneity (statistics)
Test-retest reliability
Internal consistency
Levels of measurement
Accuracy and precision
Reliability theory
Reliability engineering
Reproducibility
Validity (statistics)

References
1. William M.K. Trochim, Reliability (http://www.socialresearchmethods.net/kb/reliable.php)
2. National Council on Measurement in Education
http://www.ncme.org/ncme/NCME/Resource_Center/Glossary/NCME/Resource_Center/Gloss
hkey=4bb87415-44dc-4088-9ed9-e8515326a061#anchorR
3. Carlson, Neil R., et al. (2009). Psychology: the science of behaviour (https://archive.org/details/psychologyscienc0004unse) (4th Canadian ed.). Toronto: Pearson. ISBN 978-0-205-64524-4.
4. The Marketing Accountability Standards Board (MASB) endorses this definition as part of its
ongoing Common Language: Marketing Activities and Metrics Project (http://www.themasb.o
rg/common-language-project/) Archived (https://web.archive.org/web/20130212100753/htt
p://www.themasb.org/common-language-project/) 12 February 2013 at the Wayback
Machine.
5. Durand, V. Mark. (2015). Essentials of abnormal psychology. [Place of publication not
identified]: Cengage Learning. ISBN 978-1305633681. OCLC 884617637 (https://www.worl
dcat.org/oclc/884617637).
6. Types of Reliability (http://www.socialresearchmethods.net/kb/reltypes.php) The Research
Methods Knowledge Base. Last Revised: 20 October 2006
7. Murphy, Kevin R.; Davidshofer, Charles O. (2005). Psychological testing: principles and applications (6th ed.). Upper Saddle River, N.J.: Pearson/Prentice Hall. ISBN 0-13-189172-3.
8. Gulliksen, Harold (1987). Theory of mental tests. Hillsdale, N.J.: L. Erlbaum Associates.
ISBN 978-0-8058-0024-1.
9. Cortina, J.M., (1993). What Is Coefficient Alpha? An Examination of Theory and
Applications. Journal of Applied Psychology, 78(1), 98–104.
10. Ritter, N. (2010). Understanding a widely misunderstood statistic: Cronbach's alpha. Paper
presented at Southwestern Educational Research Association (SERA) Conference 2010,
New Orleans, LA (ED526237).
11. Eisinga, R.; Te Grotenhuis, M.; Pelzer, B. (2012). "The reliability of a two-item scale:
Pearson, Cronbach or Spearman-Brown?" (https://repository.ubn.ru.nl/bitstream/2066/11673
5/1/116735pre.pdf) (PDF). International Journal of Public Health. 58 (4): 637–642.
doi:10.1007/s00038-012-0416-3 (https://doi.org/10.1007%2Fs00038-012-0416-3).
hdl:2066/116735 (https://hdl.handle.net/2066%2F116735). PMID 23089674 (https://pubmed.
ncbi.nlm.nih.gov/23089674).

External links
Internal and external reliability and validity explained. (http://www.loopa.co.uk/internal-extern
al-reliability-and-validity-in-psychology-aqa-a-explained-easily/)
Uncertainty models, uncertainty quantification, and uncertainty processing in engineering (ht
tp://www.uncertainty-in-engineering.net)
The relationships between correlational and internal consistency concepts of test reliability
(http://www.visualstatistics.net/Statistics/Principal%20Components%20of%20Reliability/PC
ofReliability.asp)
The problem of negative reliabilities (http://www.visualstatistics.net/Statistics/Reliability%20
Negative/Negative%20Reliability.asp)

Retrieved from "https://en.wikipedia.org/w/index.php?title=Reliability_(statistics)&oldid=1074421426"

This page was last edited on 28 February 2022, at 05:05 (UTC).

Text is available under the Creative Commons Attribution-ShareAlike License 3.0; additional terms may apply.