CLSI GP10-A
Vol. 15 No. 19
December 1995
Replaces GP10-T, Vol. 13 No. 28
This document provides a protocol for evaluating the accuracy of a test to discriminate between two
subclasses of subjects where there is some clinically relevant reason to separate them. In addition to
the use of ROC plots, the importance of defining the question, selecting the sample group, and
determining the "true" clinical state are emphasized.
GP10-A
THIS NCCLS DOCUMENT HAS BEEN
REAFFIRMED
WITHOUT CHANGE
AS AN APPROVED CONSENSUS DOCUMENT
EFFECTIVE MAY 2001
NCCLS...
Serving the World's Medical Science Community Through Voluntary Consensus
NCCLS is an international, interdisciplinary, nonprofit, standards-developing and educational organization that promotes the development and use of voluntary consensus standards and guidelines within the healthcare community. It is recognized worldwide for the application of its unique consensus process in the development of standards and guidelines for patient testing and related healthcare issues. NCCLS is based on the principle that consensus is an effective and cost-effective way to improve patient testing and healthcare services.

In addition to developing and promoting the use of voluntary consensus standards and guidelines, NCCLS provides an open and unbiased forum to address critical issues affecting the quality of patient testing and health care.

...scope, approach, and utility, and a line-by-line review of its technical and editorial content.

Tentative. A tentative standard or guideline is made available for review and comment only when a recommended method has a well-defined need for a field evaluation or when a recommended protocol requires that specific data be collected. It should be reviewed to ensure its utility.

Approved. An approved standard or guideline has achieved consensus within the healthcare community. It should be reviewed to assess the utility of the final document, to ensure attainment of consensus (i.e., that comments on earlier versions have been satisfactorily addressed), and to identify the need for additional consensus documents.
Abstract
Assessment of the Clinical Accuracy of Laboratory Tests Using Receiver Operating Characteristic (ROC)
Plots; Approved Guideline (NCCLS document GP10-A) provides guidance for laboratorians who assess
clinical test accuracy. It is not a recipe; rather it is a set of concepts to be used to design an assessment
of test performance or to interpret data generated by others. In addition to the use of ROC plots, the
importance of defining the question, selecting a sample group, and determining the “true” clinical state
are emphasized. The statistical data generated can be useful whether one is considering replacing an
existing test, adding a new test, or eliminating a current test.
[NCCLS. Assessment of the Clinical Accuracy of Laboratory Tests Using Receiver Operating
Characteristic (ROC) Plots; Approved Guideline. NCCLS Document GP10-A (ISBN 1-56238-285-3).
NCCLS, 940 West Valley Road, Suite 1400, Wayne, Pennsylvania 19087, 1995.]
The NCCLS consensus process, which is the mechanism for moving a document through two
or more levels of review by the clinical laboratory testing community, is an ongoing process.
(See the inside front cover of this document for more information on the consensus process.)
Users should expect revised editions of any given document. Because rapid changes in
technology may affect the procedures, bench and reference methods, and evaluation protocols
used in clinical laboratory testing, users should replace outdated editions with the current
editions of NCCLS documents. Current editions are listed in the NCCLS Catalog, which is
distributed to member organizations, or to nonmembers on request. If your organization is not
a member and would like to become one, or to request a copy of the NCCLS Catalog, contact
the NCCLS Executive Offices. Telephone: 610.688.1100; Fax: 610.688.6400.
This publication is protected by copyright. No part of it may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or
otherwise) without written permission from NCCLS, except as stated below.
NCCLS hereby grants permission to reproduce limited portions of this publication for use in laboratory
procedure manuals at a single site, for interlibrary loan, or for use in educational programs provided that
multiple copies of such reproduction shall include the following notice, be distributed without charge,
and, in no event, contain more than 20% of the document's text.
Reproduced with permission, from NCCLS publication GP10-A, Assessment of the Clinical
Accuracy of Laboratory Tests Using Receiver Operating Characteristic (ROC) Plots; Approved
Guideline. Copies of the current edition may be obtained from NCCLS, 940 West Valley Road,
Suite 1400, Wayne, Pennsylvania 19087, USA.
Permission to reproduce or otherwise use the text of this document to an extent that exceeds the
exemptions granted here or under the Copyright Law must be obtained from NCCLS by written request.
To request such permission, address inquiries to the Executive Director, NCCLS, 940 West Valley Road,
Suite 1400, Wayne, Pennsylvania 19087, USA.
Suggested Citation
NCCLS. Assessment of the Clinical Accuracy of Laboratory Tests Using Receiver Operating Characteristic
(ROC) Plots; Approved Guideline. NCCLS Document GP10-A (ISBN 1-56238-285-3). NCCLS, 940 West
Valley Road, Suite 1400, Wayne, Pennsylvania 19087, USA, 1995.
Proposed Guideline
March 1987
Tentative Guideline
December 1993
Approved Guideline
Approved by Membership
November 1995
Published
December 1995
ISBN 1-56238-285-3
ISSN 0273-3099
Contents
Page
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i
Foreword . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
5 The Use of ROC Plots: Examples from the Clinical Laboratory Literature . . . 11
6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Committee Membership
Max Robinowitz, M.D. FDA Center for Devices and Radiological Health
Rockville, Maryland
Advisors
William C. Dierksheide, Ph.D. FDA Center for Devices and Radiological Health
Rockville, Maryland
Jerome A. Donlon, M.D., Ph.D. FDA Center for Biologics Evaluation and Research
Rockville, Maryland
Foreword
As laboratorians, we are often interested in how well a test performs clinically. This is true whether we
are considering replacing an existing test with a newer one, adding a new test to our laboratory's menu,
eliminating tests where possible, or simply wanting to know something about the value of what we do.
This project was originally intended to make recommendations about assessing the clinical
performance of diagnostic tests. We elected to adopt the concepts of Swets and Pickett,1 whereby
clinical performance is divided into (1) a discrimination or diagnostic accuracy element and (2) a decision
or efficacy element. Laboratory tests are ordered to help answer questions about patient management.
How much help an individual test result provides is variable and, in any case, a highly complicated issue.
Management decisions and strategies are complex activities that require the physician to consider
probabilities of disease, quality of the data available, effectiveness of various treatment/management
alternatives, probability of outcomes, and value (and cost) of outcomes to the patient. Many types of
clinical data (including laboratory results) are usually integrated into a complex decision-making process.
Most often, a single laboratory test result is not the sole basis for a diagnosis or a patient-management
decision. Therefore, some have criticized the practice of evaluating the diagnostic performance of a test
as if it were used alone. However, each clinical tool, whether it is a clinical chemistry test, an
electroencephalogram, an electrocardiogram, a nuclide scan, an x-ray, a biopsy, a view through an orifice,
a pulmonary function test, or a sonogram, is meant to make some definable discrimination. It is important
to know just how inherently accurate each tool (test) is as a diagnostic discriminator. Note that assessing
clinical accuracy, without engaging in comprehensive clinical decision analysis, is a valid and useful
activity for the clinical laboratory. Clinical accuracy is the most fundamental characteristic of the test itself
as a classification device; it measures the ability of the test to discriminate among alternative states of
health. In the simplest form, this property is the ability to distinguish between just two states of health
or circumstances. Sometimes this involves distinguishing health from disease; other times it might involve
distinguishing between benign and malignant disease, between patients responding to therapy and those
not responding, or predicting who will get sick versus who will not. This ability to distinguish or
discriminate between two states among patients who could be in either of the two states is a property of
the test itself.
Indeed, the ability of the test to distinguish between the relevant alternative states or conditions of the
subject (i.e., clinical accuracy) is the most basic property of a laboratory test as a device to help in
decision making. This property is the place to start when assessing what value a test has in contributing
to the patient-management process. If the test cannot provide the relevant distinction, it will not be
valuable for patient care. On the other hand, once we establish that a test does discriminate well, then
we can explore its role in the process of patient management to determine the practical usefulness of the
information in a management strategy. This exploration is clinical decision analysis, and measures of test
accuracy provide part of the data used to carry out that analysis.
Usefulness or efficacy refers to the practical value of the information in managing patients. A test can
have considerable ability to discriminate, yet not be of practical value for patient care. This could happen
for several reasons. For instance, the cost or undesirability of false results can be so high that there is no
decision threshold for the test where the trade-off between sensitivity and specificity is acceptable.
Perhaps there are less invasive or less expensive means to obtain comparable information. The test may
be so expensive or technically demanding that its availability is limited. It could be so uncomfortable or
invasive that the subjects do not want to submit to it.
Exploration of the usefulness of medical information, such as test data, involves a number of factors or
parameters that are not properties of the test system or device; rather they are properties of the
circumstances of the clinical application. These include the probability of disease (prevalence), the
possible outcomes and the relative values of those outcomes, the costs to the patient (and others) of
incorrect information (false-positive and false-negative classifications), and the costs and benefits of
various treatment options. These are characteristics or properties of the context in which test information
is used, but they are not properties of the tests themselves. These factors interact with test
results to affect the usefulness of the test. Thus, it is helpful to conceptually separate the characteristic
that is fundamental and inherent to the tests themselves, discrimination ability, from the interaction that
results when this discrimination ability is mixed with external factors in the course of patient
management.
In summary, we define clinical accuracy as the basic ability to discriminate between two subclasses of
subjects where there is some clinically relevant reason to separate them. This concept of clinical accuracy
refers to the quality of the information (classification) provided by the test and it should be distinguished
from the practical usefulness of the information.1 Both are aspects of test performance. Second, we
suggest that the assessment of clinical accuracy is the place to start in evaluating test performance. If a
test cannot discriminate between clinically relevant subclasses of subjects, then there is little incentive to
go any further in exploring a possible clinical role. If, on the other hand, a test does exhibit substantial
ability to discriminate, then by examining the degree of accuracy of the test and/or by comparing its
accuracy to that of other tests, we can decide whether to delve into a more complex assessment of its
role in patient-care management (decision analysis). This document addresses the assessment of
diagnostic accuracy but not the analysis of usefulness, or the role of the test in patient-care strategy.
The subcommittee believes that this guideline will be of value to a wide variety of possible users
including:
• Manufacturers of reagents and other devices for performing tests who are interested in assessing or validating test performance in terms of clinical accuracy

• Clinical laboratories that are reviewing data and literature, and/or generating their own data, to make decisions about which tests to employ in their laboratory

• Health care/scientific workers interested in critical evaluation of data being presented on clinical test performance.
Key Words
Acknowledgment
The subcommittee thanks Dr. Gregory Campbell (Director, Division of Biostatistics, Office of Surveillance and Biometrics, Center for Devices and Radiological Health, Food and Drug Administration, Rockville, MD) for his invaluable expert statistical consultation on this document.
1 Scope

This guideline outlines the steps and principles for designing a prospective study to evaluate the intrinsic diagnostic accuracy of a clinical laboratory test, i.e., its fundamental ability to discriminate correctly among alternative states of health, expressed in terms of sensitivity and specificity. Each of the steps is discussed in detail, along with its rationale and suggestions for its execution. These same concepts can be used in critical evaluations of data already generated.

2 Glossary

Clinical accuracy (diagnostic accuracy): The ability of a diagnostic test to discriminate between two or more clinical states, for example, discrimination between rheumatoid arthritis and systemic lupus erythematosus, between rheumatoid arthritis and "no joint disease," between chronic hepatitis and "no liver disease," and between rheumatoid arthritis and a "mixture" of other joint diseases.

Clinical state: A state of health or disease that has been defined either by a clinical definition or some other independent reference standard. Examples of clinical states include "no disease found," "disease 1" (where 1 represents the first clinical state under consideration), "disease 2" (where 2 represents the second clinical state under investigation), and so on.

Decision threshold (also decision level, cutoff): A test score used as the criterion for a "positive test." All test scores at or beyond this test score are considered to be "positive"; those not at or beyond the score are considered to be "negative." In some cases, a low test score is considered to be "abnormal," e.g., L/S ratio or hemoglobin. In other cases, a high test score is considered to be "abnormal," e.g., cardiac enzyme or uric acid concentration.

Diagnostic test: A measurement or examination used to classify patients into a particular class or clinical state.

Efficacy: Actual practical value of the data, i.e., usefulness for clinical purposes.

False-negative result (FN): Negative test result in a subject in whom the disease or condition is present.

False-positive result (FP): Positive test result in a subject in whom the disease or condition is absent.

False-negative fraction (FNF): Ratio of subjects who have the disease but who have a negative test result to all subjects who have the disease; FN/(FN + TP); same as (1 - sensitivity).

False-positive fraction (FPF): Ratio of subjects who do not have the disease but who have a positive test result to all subjects who do not have the disease; FP/(FP + TN); same as (1 - specificity).

Prevalence: The pretest probability of a particular clinical state in a specified population; the frequency of a disease in the population of interest at a given point in time.

Receiver operating characteristic (ROC) plot: A graphical description of test performance representing the relationship between the true-positive fraction (sensitivity) and the false-positive fraction (1 - specificity). Customarily, the true-positive fraction is plotted on the vertical axis and the false-positive fraction (or, alternatively, the true-negative fraction) is plotted on the horizontal axis. Clinical accuracy, in terms of sensitivity and specificity, is displayed for the entire spectrum of decision levels.

Sensitivity (clinical sensitivity): Test positivity in disease; true-positive fraction; ability of a test to correctly identify disease at a particular decision threshold.

Specificity (clinical specificity): Test negativity in health; true-negative fraction; ability of a test to correctly identify the absence of disease at a particular decision threshold.
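The fractions defined above are simple arithmetic on a 2x2 table of test result versus true clinical state. The following Python sketch is illustrative only (the function name and the example counts are invented for this purpose); it computes each fraction exactly as defined in the glossary:

```python
def diagnostic_fractions(tp, fp, tn, fn):
    """Compute the glossary fractions from a 2x2 table of
    test result (positive/negative) vs. true clinical state."""
    return {
        "sensitivity (TPF)": tp / (tp + fn),  # TP/(TP + FN)
        "specificity (TNF)": tn / (tn + fp),  # TN/(TN + FP)
        "FPF": fp / (fp + tn),                # same as 1 - specificity
        "FNF": fn / (fn + tp),                # same as 1 - sensitivity
    }

# Hypothetical study: 80 diseased subjects (72 test positive)
# and 100 nondiseased subjects (90 test negative).
fractions = diagnostic_fractions(tp=72, fp=10, tn=90, fn=8)
```

At any single decision threshold, sensitivity and FNF sum to 1, as do specificity and FPF; an ROC plot simply traces these complementary fractions over the full range of decision thresholds.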
Study group: A group of persons representing a sample of a clinically defined population of interest. The population of interest is the target group to which the test being evaluated will be applied in practice. Subgroups of the study group will be designated as belonging to particular clinical states by applying the standard criteria (see text).

True-negative result (TN): Negative test result in a subject in whom the disease is absent.

True-positive result (TP): Positive test result in a subject in whom the disease is present.

True-negative fraction (TNF): Ratio of subjects who do not have the disease and have a negative test to all subjects who do not have the disease; TN/(TN + FP); specificity.

True-positive fraction (TPF): Ratio of subjects who have the disease and a positive test to all subjects who have the disease; TP/(TP + FN); sensitivity.

3 Outline of the Evaluation Procedure

3.1 Define the Clinical Question (See Section 4.1)

Use the following procedure to define the clinical question:

(1) Characterize the subject population.

(2) State the management decision to be made.

(3) Identify the role of the test in making the decision.

3.2 Select a Representative Study Sample (See Section 4.2)

Use the following procedure to select a representative study sample:

(1) Select, prospectively, a statistically valid sample that consists of subjects who are representative of the population identified in Section 3.1 above.

(3) Account for patients for whom data are incomplete.

3.3 Establish the "True" Clinical State of Each Subject (See Section 4.3)

Use the following procedure to establish the true clinical state of each subject:

(1) Adopt independent external standards or criteria of diagnostic truth for each relevant clinical state so as to classify each subject as accurately as possible. This may be based on a rigorous diagnostic workup or, alternatively, an assessment of clinical course or outcome.

(2) Classify subjects independent of the test being evaluated, i.e., without knowing the test results and without including the test results in the criteria.

3.4 Test the Study Subjects (See Section 4.4)

Use the following procedure to test the study subjects:

(1) Perform the test without knowing the clinical classification of the subjects.

(2) When comparing multiple tests, perform all tests on all subjects, preferably in a batch mode, and at the same point in their clinical course.

3.5 Assess the Clinical Accuracy of the Test (See Section 4.5)

Use the following procedure to assess the clinical accuracy of the test:

(1) Construct and analyze receiver operating characteristic (ROC) plots to evaluate test accuracy.

(2) Compare alternative tests on the basis of their ROC plots and analysis.
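The ROC construction in step 3.5 can be sketched in code. The following Python fragment is a hedged illustration (the scores are hypothetical, and higher scores are assumed to indicate disease): it computes one (FPF, TPF) point per distinct decision threshold and summarizes the plot with the trapezoidal area, one common index used when comparing tests.

```python
def roc_points(diseased_scores, nondiseased_scores):
    """Return (FPF, TPF) pairs, one per distinct decision threshold,
    assuming higher scores indicate disease."""
    thresholds = sorted(set(diseased_scores) | set(nondiseased_scores),
                        reverse=True)
    points = [(0.0, 0.0)]  # strictest threshold: no subject is "positive"
    for t in thresholds:
        tpf = sum(s >= t for s in diseased_scores) / len(diseased_scores)
        fpf = sum(s >= t for s in nondiseased_scores) / len(nondiseased_scores)
        points.append((fpf, tpf))
    return points

def area_under(points):
    """Trapezoidal area under the ROC plot (0.5 = chance, 1.0 = perfect)."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

diseased = [3.1, 4.7, 5.2, 6.0]     # hypothetical scores, disease present
nondiseased = [1.0, 2.2, 2.9, 4.1]  # hypothetical scores, disease absent
curve = roc_points(diseased, nondiseased)
```

Because every decision threshold is represented, the plot displays the sensitivity/specificity trade-off over the entire spectrum of decision levels, not at a single cutoff.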
4 Designing the Basic Evaluation Study

4.1 Define the Clinical Question

Laboratory tests are requested to provide information that can be helpful in managing patients. There is always a relevant clinical question. Defining the clinical question is fundamental, then, because it establishes the particular patient-care issue being addressed by the evaluation. Can CK-2 concentrations be used to discriminate between acute myocardial infarction (AMI) and other causes of chest pain in subjects who present to an emergency department with a history suggestive of AMI? Which, among several tests, is the best to use in discriminating between those subjects with breast cancer who will respond to a particular chemotherapy and those who will not? Which, among several tests, is most accurate in distinguishing between iron deficiency and other causes of anemia in elderly patients who present with previously undiscovered anemia?

A given test can perform differently in different clinical settings. A test can perform well in helping to discriminate between young, apparently healthy men with no prostatic disease and middle-aged men with prostatic cancer, but it might not do so well in helping to discriminate between middle-aged men with benign prostatic disease and middle-aged men with malignant prostatic disease. The latter distinction addresses a relevant clinical question applied to symptomatic middle-aged men, whereas the former distinction addresses a different issue that might not be clinically relevant at all.

Usually, the clinical question or goal involves a population of apparently similar subjects (grouped together on the basis of information available before the test under evaluation is done) that should be subdivided into relevant management subgroups. The results of the test should indicate to which management subgroup individual subjects belong. For example, a radioimmunoassay (RIA) for serum angiotensin-converting enzyme activity might be expected to answer the following question: "Among patients with hypercalcemia, which ones have sarcoidosis?" The apparently similar patients share the common characteristic of hypercalcemia. The test helps in the attempt to divide them into subgroups: those with sarcoidosis and those with some other cause of hypercalcemia (such as malignancy or hyperparathyroidism), each of which would receive different management.

For the previously mentioned cases, the target population must be defined carefully, including the nature, duration, and magnitude of the qualifying conditions. For example, this might include a serum calcium concentration greater than "X" on two occasions at least one week apart, as well as age range, sex, and other findings (for example, chest x-ray) that are required for including and excluding subjects from the population.

4.2 Select a Representative Study Sample

The process of clearly defining the clinical question actually serves to identify the population relevant to the test evaluation. From this clinical population, choose a sample of subjects for the study. These subjects should be selected to represent the larger population of clinical interest about which conclusions are to be drawn.

The meaningfulness of the results depends on the care with which the relevant population is identified and sampled. The conclusions that can be drawn follow from the definition of the question and the nature of the subjects selected for study.

It is commonplace in routine laboratory practice to adopt or establish reference intervals, which are usually available with patient results to aid in their interpretation. These intervals are frequently derived from test-result data gathered from blood donors, laboratory workers, students, or other ambulatory, "healthy" volunteers. Note that such groups might not be relevant for the evaluations of diagnostic accuracy described in this guideline. When the accuracy of a test as a screening tool is being assessed, then a sample representative of the population to be screened should be used. Consider, for example, fecal occult blood testing for colon cancer. If the goal is to evaluate the accuracy of the test in discovering occult cancer in middle-aged subjects with no specific signs or symptoms suggestive of the disease, then the sample studied should be taken entirely from such a population. Studying a group of cancer-free,
healthy volunteers and a group already known to have carcinoma of the colon is not appropriate.

The same principles apply when a test is being used, not for screening, but for differentiating between disease states in symptomatic patients. If a test is to be used to identify acute pancreatitis in patients with a history and presentation indicating the possibility of pancreatitis, the sample should comprise such persons. Because the test is not intended to distinguish between healthy volunteers and patients with well-defined pancreatitis, a study sample composed of such subjects is not appropriate. Conclusions based on such a sample would not serve the purpose of the study.

4.2.1 Selection Bias

predetermined number of subjects is obtained. Once chosen, subjects should not be dropped from the study. If some patients do not complete the study (because of technical errors, analytical interferences, death, or loss to follow-up), they should be accounted for in the final analysis of the data. The uncertainty and possible biases that the lost subjects cause in the study's conclusions must be considered and reported.

4.2.4 Prevalence of Disease

The approach described here is independent of prevalence of disease, so it is not necessary to have a sample that reflects actual prevalence. It is desirable to have approximately equal numbers of subjects who are truly affected and truly unaffected by the disease.
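The claim in Section 4.2.4 that sensitivity and specificity are independent of prevalence can be illustrated numerically. In the hedged Python sketch below (all counts are invented), two study samples with very different prevalences give identical sensitivity and specificity, because each fraction is computed entirely within its own "truth" group:

```python
def sens_spec(tp, fn, tn, fp):
    """Sensitivity and specificity from a 2x2 table."""
    return tp / (tp + fn), tn / (tn + fp)

# The same underlying test behavior (90% sensitivity, 80% specificity)
# observed in two samples with very different disease prevalence:
low_prev = sens_spec(tp=9, fn=1, tn=720, fp=180)      # prevalence ~1%
high_prev = sens_spec(tp=450, fn=50, tn=400, fp=100)  # prevalence 50%
# low_prev and high_prev agree; only prevalence-dependent quantities,
# such as the predictive value of a positive result TP/(TP + FP), differ.
```

This is why a study sample with roughly equal numbers of affected and unaffected subjects can be used even when the disease is rare in practice.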
substantially more reliable than the diagnostic system [test] undergoing evaluation."6(p. 723)

4.3.1 Validity of Evaluation

When evaluating the clinical accuracy of a test, the validity of the evaluation is limited by the accuracy with which the subjects are classified. A perfect test can appear to perform poorly simply because the "truth" was not established accurately for each patient and, therefore, the test results disagree with the apparent "true" diagnosis. On the other hand, when test results do agree with an inaccurate classification, the test will appear to perform better than it actually does. It is important, then, to attempt to classify individual persons as correctly as possible, as well as to consider the possible biases in the results caused by the classification scheme. The closer the classifications are to the truth, the less distortion there will be in the apparent performance of any test being evaluated.

4.3.2 True Clinical Subgroup

Routine clinical diagnoses are likely to be inadequate for evaluation studies. Determining a patient's true clinical subgroup can require such procedures as biopsy, surgical exploration, autopsy examination, angiography, or long-term follow-up of response to therapy and clinical outcome. Although such procedures can add to the financial cost of the evaluation, a less expensive, routine clinical evaluation can prove quite costly in the long term if its erroneous conclusions lead to improper test use or improper patient management.

4.3.3 Approaches to Classification

In many clinical situations, obtaining an independent, accurate classification of the patient's true clinical condition is difficult. Several strategies have been developed to deal with the difficulties in identifying true states of health. One strategy is to define the diagnostic problem in terms of measurable clinical outcomes.7 A second approach is to employ some sort of consensus, majority rule, or expert review to arrive at a less error-prone identification process.8 A third solution is to assume, for the comparison of several accurate tests, that there is some unknown mixture of diseased and nondiseased persons in the subject population and then to estimate this mixture parameter, as well as the other parameters.9 A fourth approach is, rather than definitively assigning each such patient to one of the groups (say, "diseased" or "nondiseased"), to assign to each a value between 0 and 1 that corresponds to the (subjective) assessment of how likely it is that this patient belongs to the diseased group (this could be accomplished by logistic regression). Then there is no need to discard the data from these gray, fuzzy cases where group assignment is equivocal.10-13

Although diagnostic categories often do predict complications and therapeutic responses, the best evaluation of a test can be in terms of its ability to indicate clinical course or outcome, rather than its ability to assign a diagnosis. For example, it might be possible to classify patients with suspected prostatic disease into those who have cancer and those who do not have cancer based on biopsy results; however, it might be more useful to classify them in terms of which patients progress to overt disease. If the goal of the evaluation is to assess the accuracy of a serum marker in discriminating between those patients who need intervention and those who do not, then it is more relevant to know which patients will progress than to know which have histologic evidence of disease at that moment. This issue is actually one that is properly confronted earlier in formulating the original clinical management task to be addressed by the test under evaluation. Thus, lack of an immediate definitive diagnostic category does not necessarily prevent a valid assessment of the clinical accuracy of a test. In fact, even when the correct diagnosis can be easily established, a study correlating test results with the clinical course can provide a more useful clinical evaluation than a study that merely correlates test results with patient diagnoses.

4.3.4 Independent Classification

To avoid bias in evaluating the clinical accuracy of a test, the true clinical state should also be determined independent of the test(s) under investigation or used for comparison. Obviously, the new test should not be included in the criteria used to classify the subjects. Neither should a closely related test be included in the criteria for classifying subjects. For example, if an RIA for CK-MB is being evaluated for the diagnosis of AMI, neither CK-MB by electrophoresis nor CK-MB by immuno-inhibition should be included in the "gold standard" workup for
classifying the study subjects. Furthermore, if the performance of the CK-MB assay is to be compared directly to the performance of the LD-1/LD-2 isoenzyme ratio, then LD isoenzyme results should also not be included in the diagnostic criteria, because the apparent performance will be biased in favor of any test that is part of the "truth standard."

4.3.5 Masked Evaluation

To ensure that the classification is not influenced by the result of the test under evaluation, it should be done masked, that is, without knowing the results of the test. Furthermore, the criteria for classifying each patient into a management subgroup should be as objective as possible. When the classification rests on subjective evaluation of clinical or morphological patterns, such as radionuclide scans or bone marrow smears, the decision for each patient should reflect the consensus of experts who each interpret the material masked and independent of each other.

4.4 Test the Study Subjects

4.4.1 Conduct a Masked Study

The person performing the test under evaluation should do so masked, that is, without knowing the clinical status of the subject. Ideally, the testing should be done before the clinical question is answered. Knowing the answer to the clinical question can introduce subtle biases. Results that do not fit the clinical status might be selectively repeated or rejected on the basis of supposed technical difficulties or interfering factors.

4.4.2 Identical Specimens

When comparing two or more tests, it is important that the subjects and specimens be identical for all tests. Failure to use the identical subjects for evaluating each test can result in misleading conclusions based on sampling errors. Furthermore, subtle biases can affect the selection of subjects for the different groups. Thus, apparent differences in test performance can simply be reflections of differences in the composition of the groups tested. If some subjects have more advanced and, presumably, more easily detectable disease and are tested by only some of the tests, those tests could appear to have better sensitivity than the others. Conversely, inclusion of subjects with minimal disease, which might be harder to detect, would tend to diminish the apparent sensitivity of tests performed on these subjects, as compared with tests not done on these subjects. Performing all tests on all subjects ensures that differences in sensitivity and specificity are not simply due to inconsistent application of the diagnostic criteria.

Similarly, if two or more tests are applied to the same subject at different times during the course of his illness, an apparent superiority of one of the tests might simply reflect that it was done when the disease was more easily detected. Therefore, all tests should be performed at the same point in the course of each subject's illness. Using identical specimens for all tests obviates all of the above pitfalls.

4.4.3 Testing Mode

Assaying all samples in one batch, when possible, to minimize the influence of between-run analytical variance, is suggested. However, attention should be given to maintaining analyte stability through proper storage conditions.

4.5 Assess the Clinical Accuracy of the Test

Assessing the performance of a test by examining its clinical accuracy, that is, its ability to correctly classify individual persons into two subgroups, for example, a subgroup of persons affected by some disease (and therefore needing treatment) and a second subgroup of unaffected persons, is suggested. If there is no overlap in test results from these two subgroups, then the test can identify all persons correctly and discriminate between the two subgroups perfectly. However, if there is some overlap in the test results for the two subgroups, the ability of the test to discriminate is not perfect. In either case, it is desirable to have a way to represent and measure this power to discriminate (accuracy).

4.5.1 Diagnostic or Clinical Sensitivity and Specificity

The ability of a test to identify or recognize the presence of disease is its diagnostic sensitivity; its ability to recognize the absence of disease is its diagnostic specificity. Both are measures of
accuracy and can be expressed as percentages, rates, or decimal fractions. A perfect test achieves a sensitivity and specificity of 100% or 1.0. However, tests are rarely perfect, and they usually do not achieve a sensitivity and a specificity of 100% at the same time.

Diagnostic sensitivity (the true-positive rate or fraction) is the fraction of persons truly affected by a disease who have positive test results. Diagnostic specificity (the true-negative rate or fraction) is the fraction of persons who are truly unaffected by a disease who have negative test results.

Often, a test is said to have a particular sensitivity and specificity. However, there is not a single sensitivity or specificity for a test; rather there is a continuum of sensitivities and specificities. By varying the decision threshold (or decision level, upper-limit-of-normal, cut-off value, or reference value), any sensitivity from 0 to 100% can be obtained, and each one will have a corresponding specificity. For each decision threshold used to classify the subjects as "positive" or "negative" based on test results, there is a single combination of sensitivity and specificity. These parameters occur, then, in pairs, and the accuracy of a test is reflected in the spectrum of pairs that can occur (not all pairs being possible for a particular test). For any test in which the distributions of results from the two categories of subjects overlap, there are inevitable "trade-offs" between sensitivity and specificity. As the decision threshold is varied over the range of observed results, the sensitivity and specificity will move in opposite directions: as one increases, the other decreases. For each decision threshold, then, there is a corresponding sensitivity and specificity pair. Which one(s) describe(s) the accuracy of the test? All of them do. Only the entire spectrum of sensitivity–specificity pairs provides a complete picture of test accuracy.

Furthermore, a test can have one set of sensitivity–specificity pairs in one clinical situation but a different set in another clinical situation with a different group of subjects. If CK-BB had been measured in postoperative patients suspected of having an AMI, instead of in emergency department patients (as in Figure 1, p. 13), the sensitivity–specificity pairs could be quite different. The spectrum of pairs contained in the test characterizes its basic accuracy for a particular clinical setting.

4.5.2 Receiver Operating Characteristic Plots

4.5.2.1 General

The spectrum of trade-offs between sensitivity and specificity is conveniently represented by the ROC plot.14 ROC methodology is based on statistical decision theory and was developed in the context of electronic signal detection and issues surrounding the behavior and use of radar receivers in the middle of the twentieth century.6 An ROC-type plot was used in the 1950s to
characterize the ability of an automated Pap smear analyzer to discriminate between smears with and without malignant cells.15

The ROC plot graphically displays this entire spectrum of a test's performance for a particular sample group of affected and unaffected subjects. It is, then, a "test performance curve," representing the fundamental clinical accuracy of the test by plotting all the sensitivity–specificity pairs resulting from continuously varying the decision threshold over the entire range of results observed. The important part of the plot is generated when the decision threshold is varying within the region where results from the affected and unaffected subjects overlap. Outside of the overlap region, either sensitivity or specificity is 1.0 and not varying; within the overlap region, neither is 1.0 and both are varying as the decision threshold varies. On the Y axis, sensitivity, or the true-positive fraction (TPF), is plotted. On the X axis, the false-positive fraction (FPF) (or 1 − specificity) is plotted. This is the fraction of truly unaffected subjects who nevertheless have positive test results; therefore, it is a measure of specificity (FPF = 1 − specificity).

Another option is to plot specificity directly (the true-negative fraction) on the X axis. This results in a left-to-right "flip," giving a mirror-image of the plot described above. However, if the X axis is labeled from 0 to 1.0 from right to left (instead of left to right), then the plot is not flipped over.

As mentioned above for sensitivity and specificity, TP and FP fractions vary continuously with the decision threshold within the region of overlapping results. Each decision threshold has a corresponding pair of TP (sensitivity) and FP (1 − specificity) fractions. The rates observed also depend on the clinical setting, as reflected by the study group chosen. The FP fraction is influenced by the type of unaffected subjects included in the study group. If, for example, the unaffected subjects are all healthy blood donors who are free of any signs or symptoms, a test can appear to have much lower FP fractions than if the unaffected subjects are persons who clinically resemble those who actually have the disease.

Likewise, the TP fraction also depends on the study group. A test used to detect cancer can have higher TP fractions when applied to patients with active or advanced disease than to patients with stable or limited disease. This dependence of TP and FP fractions on the study population is the reason that an ROC plot must be generated for each clinical situation.

In the ROC plot, the various combinations of sensitivity and specificity possible for the test in a given setting are readily apparent. Also apparent, then, are the "trade-offs" inherent in varying the decision threshold for that test. As the decision level changes, sensitivity improves at the expense of specificity, or vice versa. This can be appreciated directly from the plot. Note that the decision thresholds, though known, are not part of the plot. However, selected decision thresholds can be displayed at the point on the plot where the corresponding sensitivity and specificity appear.

Because true- and false-positive fractions are calculated entirely separately, using the test results from two different subgroups of persons (affected, unaffected), the ROC plot is independent of the prevalence in the sample of the disease or condition of interest. However, as mentioned above, the TPFs and FPFs, and thus the ROC plot, are still influenced by the type (spectrum) of subjects included in the sample.

The ROC plot provides a general, global assessment of performance that is not provided when only one or a few sensitivity–specificity pairs are known. The test performance data obtained to derive ROC plots may also be used to select decision thresholds for particular clinical applications of the test. Several elements besides test performance determine which of the possible sensitivity–specificity pairs (and thus the corresponding decision threshold) is most appropriate for a given patient-care application: (a) the relative cost or undesirability of errors, i.e., false-positive and false-negative classifications (the benefits of correct classifications may also be considered); (b) the value (or "utility") of various outcomes (death, cure, prolongation of life, or change in the quality of life); and (c) the relative proportions of the two states of health that the test is intended to discriminate between (prevalence of the conditions or diseases). While the selection of a decision threshold is usually required for using a test for patient management, this important step is beyond the scope of this guideline. Discussion of this complex issue can be found elsewhere.3,16-19
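The threshold sweep described above is easy to make concrete in code. The following minimal Python sketch is only illustrative: the data values are invented (they are not from the guideline), and results at or above the threshold are arbitrarily called "positive." It computes the (FPF, TPF) pair for every candidate decision threshold:

```python
# Sketch: compute the (FPF, TPF) pair at every candidate decision threshold.
# Example values are invented for illustration; "positive" means result >= threshold.

affected = [8.2, 6.5, 7.9, 5.1, 9.0, 6.8]      # results from truly affected subjects
unaffected = [3.1, 5.5, 2.8, 4.9, 6.0, 3.7]    # results from truly unaffected subjects

def roc_points(affected, unaffected):
    """Return (FPF, TPF) pairs, one per observed result used as a threshold."""
    points = []
    # Every distinct observed value is a candidate decision threshold.
    for t in sorted(set(affected + unaffected)):
        tpf = sum(x >= t for x in affected) / len(affected)      # sensitivity
        fpf = sum(x >= t for x in unaffected) / len(unaffected)  # 1 - specificity
        points.append((fpf, tpf))
    return points

for fpf, tpf in roc_points(affected, unaffected):
    print(f"FPF={fpf:.2f}  TPF={tpf:.2f}")
```

Plotting these pairs with FPF on the X axis and TPF on the Y axis yields the staircase ROC plot described in Section 4.5.2.2.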
4.5.2.2 Generating the ROC Plot; Ties

Usually, clinical data occur in one of two forms: discrete or continuous. Most clinical laboratory data are continuous, being generated from a measuring device with sufficient resolution to provide observations on a continuum. Measurements of electrolyte, therapeutic drug, hormone, enzyme, and tumor-marker concentrations are essentially continuous. Urinalysis dipstick results, on the other hand, are discrete data, as are rapid pregnancy testing devices, which give positive/negative results. Scales in diagnostic imaging also generally provide discrete (ratings) data with rating categories such as "definitely abnormal," "probably abnormal," "equivocal," "probably normal," and "definitely normal."

A tie in laboratory data is of interest when a member of the diseased group has the same result as does a member of the nondiseased group. Such ties are more likely to occur when there are few data categories (i.e., few different results), such as with coarse discrete data (dipstick data, for example), rather than when the number of different results is large, as with continuous data. This results from grouping or "binning" the data into ordered categories. In clinical laboratories, when observations are made on a continuous scale, ties are much less likely (unless intentional grouping into "bins" has occurred). Theoretically, if measurements are exact enough, no two persons would have the same result on a continuous scale. However, the resolution of results in the clinical laboratory is often not so fine as to prevent this, and some ties will occur even with continuous data. Furthermore, intentional binning of continuous data also increases the chance for ties. This occurs when, for example, gonadotrophin results are expressed as whole numbers even though the assay provides concentrations to 0.1 of a unit. It also occurs when all results within intervals, such as 0–50, 51–100, etc., are grouped together. Ties can be caused, then, either by the intentional binning of data or by the degree of analytical resolution of the method of observation.

For both tied and untied data, one merely plots the calculated (1 − specificity, sensitivity) points at all the possible decision thresholds (observed values) of the test. (This can be limited to the decision thresholds in the region of overlapping results; see Section 4.5.2.1.) It is the graph of these points that is the ROC plot. For data with no ties, adjacent points can be connected with horizontal and vertical lines in a unique manner to give a staircase figure (Figure 2, p. 14). As the threshold changes, inclusion of a true-positive result in the decision rule produces a vertical line; inclusion of a false-positive result produces a horizontal line. As the numbers of persons in the two groups increase, the steps in the staircase become smaller and the plot usually appears less jagged. Because this ROC plot uses all the information in the data directly through the ranks of the test results in the combined sample, it can also be called the nonparametric ROC plot. The term "nonparametric" here refers to the lack of parameters needed to model the behavior of the plot, in contrast to parametric approaches that rely on models with parameters to be estimated.

When there are ties in continuous data, both the true-positive and false-positive fractions change simultaneously, resulting in a point displaced both horizontally and vertically from the last point. Connecting such adjacent points produces diagonal (nonhorizontal and nonvertical) lines on the plot. Diagonal segments in the ROC plot, then, indicate ties.

As mentioned above, ties may be intentionally introduced in the display of the test results by grouping the results into intervals. A common approach often adopted in the literature is to plot the ROC at only a few points by using only a few decision thresholds and connecting adjacent points with straight line segments. All data falling in an interval between thresholds are treated as tied. Although this bin approach has the advantage of plotting ease, it discards much of the data and introduces many ties in the data. If the points are few and far between, this approximation can be poor and it can misrepresent the actual plot.

4.5.2.3 Qualitative Interpretation of the ROC Plot

A test with good clinical performance achieves high TPFs (sensitivity) while having low FPFs (corresponding to high specificity). Tests with high diagnostic accuracy, then, have ROC plots with points close to the upper left corner, where TPFs are high and FPFs are low. A test with perfect accuracy, giving perfect discrimination between affected and unaffected groups, achieves a TPF of 1.0 (100% sensitivity) and an
FPF of 0.0 (100% specificity) at one or more decision thresholds. This ROC plot, then, goes through the point (0, 1.0) in the upper left corner. A simple rule of thumb is that the closer the plot is to this point, the more clinically accurate the test usually is. A test that does not discriminate between truly affected and truly unaffected subgroups has an ROC plot that runs at a 45° angle from the point (0, 0) to (1.0, 1.0). Along this line, TPF equals FPF at all points, regardless of the decision threshold. (See "X" in Figure 2, p. 14.) All tests have plots between the 45° diagonal and the ideal upper left corner. The closer the plot is to the upper left corner, the higher the discriminating ability of the test. Visual inspection of the plot, then, provides a direct qualitative assessment of accuracy.

Figure 2 (p. 14) has an ROC plot for a test with modest accuracy. Here the plot is in an intermediate position between the 45° diagonal and the ideal upper left corner. Figure 3 (p. 15) has an ROC plot for a test with high accuracy. Note how closely the plot passes to the upper left corner, where sensitivity is highest and the FPF (1 − specificity) is lowest. Figure 4 (p. 16) shows ROC plots of results of three tests, all derived from the same sample of persons. This provides a convenient comparison of accuracies. The plot for amylase lies above and to the left of the plot for phospholipase A (PLA). Thus, at most sensitivities (TPF), amylase has a lower FPF (higher specificity) than PLA. Conversely, at most FPFs, amylase has a higher TPF (better sensitivity) than does PLA. Amylase and lipase have nearly identical ROC plots, indicating virtually the same ability to discriminate. Both appear to be more accurate than PLA.

4.5.2.4 Area Under a Single ROC Plot

One convenient way to quantify the diagnostic accuracy of a laboratory test is to express its performance by a single number. The most common measure is the area under the ROC plot. By convention, this area is always ≥ 0.5 (if it is not, one can reverse the decision rule to make it so). Values range between 1.0 (perfect separation of the test values of the two groups) and 0.5 (no apparent distributional difference between the two groups of test values). The area does not depend only on a particular portion of the plot, such as the point closest to the upper left corner or the sensitivity at some chosen specificity, but on the entire plot. This is a quantitative, descriptive expression of how close the ROC plot is to the perfect one (area = 1.0). The statistician readily recognizes the ROC area as the Mann–Whitney version of the nonparametric two-sample statistic20,21 introduced by the chemist Frank Wilcoxon. An area of 0.8, for example, means that a randomly selected person from the diseased group has a laboratory test value larger than that for a randomly chosen person from the nondiseased group 80% of the time. It does not mean that a positive result occurs with probability 0.80 or that a positive result is associated with disease 80% of the time.

When there are no ties between the diseased and nondiseased groups, this area is easily computed from the plot as the sum of the rectangles under this graph. Analytical formulas to calculate the area appear in reports by Bamber20 and Hanley and McNeil.21 Alternatively, the area can be obtained indirectly from the Wilcoxon rank-sum statistic.22

Parametric approaches to calculating area, employing some model for fitting a curve, have also been described. Both parametric and nonparametric methods are discussed and compared in published reviews.13,23

In using global indices such as area under the ROC plot, there is a loss of information. Therefore, it is undesirable to consider area without visual examination of the ROC plot itself as well.

4.5.2.5 Statistical Comparison of Multiple Tests

Direct statistical comparison of multiple diagnostic tests is frequent in clinical laboratories. Usually, two (or more) tests are performed on the same subjects, as in a split-sample comparison.

Tests can be compared to one another at a single observed or theoretical sensitivity or specificity.24-26 Alternatively, a portion of the ROC plot can be used to compare tests.27

A global approach is to compare entire ROC plots by using an overall measure, such as area under the plot; this can be performed either nonparametrically or parametrically.13 This is especially attractive to laboratories because the comparison does not rely on the selection of a particular decision threshold.
The ROC plot has the following advantages: It is simple, graphical, and easily appreciated visually. It is a comprehensive representation of pure clinical accuracy, i.e., discriminating ability, over the entire range of the test. It does not require selection of a particular decision threshold, because the whole spectrum of possible decision thresholds is included. It is independent of prevalence: no care need be taken to obtain samples with representative prevalence; in fact, it is usually preferable to have equal numbers of subjects with both conditions. It provides a direct visual comparison between tests on a common scale. It requires no grouping or binning of data, and both specificity and sensitivity are readily accessible.

4.5.2.8 Disadvantages of ROC Plots

The ROC plot has several drawbacks: Actual decision thresholds usually do not appear on the plot (though they are known and are used to generate the graph). The number of subjects is also not part of the plot. Without computer assistance, the generation of plots and analysis is cumbersome. (See the Appendix for available software packages.)

Several other authors also used ROC plots in various ways. Carson et al30 investigated the abilities of four different assays of prostatic acid phosphatase to discriminate between those subjects with prostatic cancer and those subjects with either some other urologic abnormality or no known urologic abnormality. Hermann et al31 compared the clinical accuracies of two versions of a commercial assay for thyrotropin to test a claim that the newer one was superior for discriminating between euthyroidism and hyperthyroidism. Kazmierczak et al32 used ROC plots in a study of the clinical accuracies of lipase, amylase, and phospholipase A in discriminating acute pancreatitis from other diseases in 151 consecutive patients seen with abdominal pain. Flack et al33 used ROC plots and areas to compare the abilities of urinary free cortisol and 17-hydroxysteroid suppression tests to discriminate between Cushing disease and other causes of Cushing syndrome. Guyatt et al34 studied the ability of seven tests, including ferritin, transferrin saturation, mean cell volume, and erythrocyte protoporphyrin, to discriminate between iron-deficiency anemia and other causes of anemia in subjects older than 65 years who were admitted to the hospital with anemia. Beck,35 while studying iron-deficiency anemia, also used ROC plots to compare four tests.
6 Summary
Figure 1. Dot diagram of serum CK-BB concentrations 16 hours after the onset of symptoms in 70
subjects presenting to an emergency department with typical chest pain. Fifty were eventually considered to have had acute myocardial infarction (AMI); 20 were not. (Data from Van Steirteghem AC, Zweig
MH, Robertson EA, et al. Comparison of the effectiveness of four clinical chemical assays in classifying
patients with chest pain. Clin Chem 1982;28:1319–1324.)
Figure 2. Nonparametric ROC plot of serum apolipoprotein A-I/B ratios used in identifying clinically
significant coronary artery disease (CAD) in 304 men suspected of having CAD. Presence or absence of
CAD was established by coronary angiography. Area under the ROC plot is 0.75. The line labeled "X"
represents the theoretical plot of a test with no ability to discriminate (area = 0.5). (From Zweig MH.
Apolipoproteins and lipids in coronary artery disease: Analysis of diagnostic accuracy using receiver
operating characteristic plots and areas. Arch Pathol Lab Med 1994; 118:141–144.)
Figure 3. Nonparametric ROC plot of serum myoglobin concentrations, 5 hours after the onset of
symptoms in 55 emergency department patients suspected of having acute myocardial infarction. The
area under the plot is 0.953. Thirty-seven subjects had an AMI; 18 did not. (Data from Van Steirteghem
AC, Zweig MH, Robertson EA, et al. Comparison of the effectiveness of four clinical chemical assays in
classifying patients with chest pain. Clin Chem 1982;28:1319–1324.)
Figure 4. ROC plots for peak serum amylase, lipase, and phospholipase A (PLA) concentrations in
identifying acute pancreatitis in 151 consecutive patients with abdominal pain. (From Kazmierczak SC,
Van Lente F, Hodges ED. Diagnostic and prognostic utility of phospholipase A activity in patients with
acute pancreatitis: comparison with amylase and lipase. Clin Chem 1991;37:356–360.)
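The Mann–Whitney interpretation of the ROC area described in Section 4.5.2.4 can be sketched directly in code. In this illustrative Python fragment the data values are invented, and ties between the two groups are counted as one-half, which is the usual Mann–Whitney convention:

```python
# Sketch: the area under the nonparametric ROC plot equals the Mann-Whitney
# probability that a randomly chosen affected subject's result exceeds a
# randomly chosen unaffected subject's result (ties counted as one-half).
# Example values are invented for illustration.

def roc_area(affected, unaffected):
    """Fraction of (affected, unaffected) pairs in which the affected result is larger."""
    wins = 0.0
    for a in affected:
        for u in unaffected:
            if a > u:
                wins += 1.0
            elif a == u:
                wins += 0.5   # tie: counted as half a "win"
    return wins / (len(affected) * len(unaffected))

affected = [8.2, 6.5, 7.9, 5.1, 9.0, 6.8]
unaffected = [3.1, 5.5, 2.8, 4.9, 6.0, 3.7]
print(roc_area(affected, unaffected))  # fraction of correctly ordered pairs
```

With no ties, the same number is obtained by summing the rectangles under the nonparametric staircase plot, as noted in Section 4.5.2.4.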
Appendix (Continued)
TESTIMATE, idv-Data Analysis and Study Planning, Wessobrunner Str. 6, 82131 Gauting/Munich, Germany. Fax: +49.89.8503666.

SmarTest, idv-Data Analysis and Study Planning, Wessobrunner Str. 6, 82131 Gauting/Munich, Germany. Fax: +49.89.8503666.
References
4. Zweig MH, Robertson EA. Why we need better test evaluations. Clin Chem 1982;28:1272-1276.

5. Lachs MS, Nachamkin I, Edelstein PH, et al. Spectrum bias in the evaluation of diagnostic tests: lessons from the rapid dipstick test for urinary tract infection. Ann Intern Med 1992;117:135-140.

6. Metz CE. ROC methodology in radiologic imaging. Invest Radiol 1986;21:720-733.

7. Valenstein PN. Evaluating diagnostic tests with imperfect standards. Am J Clin Pathol 1990;93:252-258.

8. Revesz G, Kundel HL, Bonitatibus M. The effect of verification on the assessment of imaging techniques. Invest Radiol 1983;18:194-198.

12. Campbell G, Levy D, Bailey JJ. Bootstrap comparison of fuzzy R.O.C. curves for ECG-LVH algorithms using data from the Framingham heart study. J Electrocardiol 1990;23(suppl):132-137.

*13. Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine. Clin Chem 1993;39:561-577.

*14. Metz CE. Basic principles of ROC analysis. Semin Nucl Med 1978;8:283-298.

15. Lusted LB. ROC recollected [editorial]. Med Decis Making 1984;4:131-135.

16. Lusted LB. Decision making studies in patient management. N Engl J Med 1971;284:416-424.
17. Lusted LB. Signal detectability and medical decision-making. Science 1971;171:1217-1219.

18. McNeil BJ, Keeler E, Adelstein SJ. Primer on certain elements of medical decision making. N Engl J Med 1975;293:211-215.

19. Weinstein MC, Fineberg HV. Clinical Decision Analysis. Philadelphia: WB Saunders, 1980.

20. Bamber D. The area above the ordinal dominance graph and the area below the receiver operating characteristic curve. J Math Psychol 1975;12:387-415.

21. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29-36.

22. Hollander M, Wolfe DA. Nonparametric statistical methods. New York: John Wiley, 1973:67-78.

*23. Hanley JA. Receiver operating characteristic (ROC) methodology: the state of the art. Crit Rev Diagn Imaging 1989;29:307-335.

24. Beck JR, Shultz EK. The use of relative operating characteristic (ROC) curves in test performance evaluation. Arch Pathol Lab Med 1986;110:13-20.

*25. McNeil BJ, Hanley JA. Statistical approaches to the analysis of receiver operating characteristic (ROC) curves. Med Decis Making 1984;2:137-150.

26. Greenhouse SW, Mantel N. The evaluation of diagnostic tests. Biometrics 1950;6:399-412.

27. Wieand S, Gail MH, James BR, James KL. A family of nonparametric statistics for comparing diagnostic markers with paired or unpaired data. Biometrika 1989;76:585-592.

28. Van Steirteghem AC, Zweig MH, Robertson EA, et al. Comparison of the effectiveness of four clinical chemical assays in classifying patients with chest pain. Clin Chem 1982;28:1319-1324.

29. Leung FY, Galbraith LV, Jablonsky G, et al. Re-evaluation of the diagnostic utility of serum total creatine kinase and creatine kinase-2 in myocardial infarction. Clin Chem 1989;35:1435-1440.

30. Carson JL, Eisenberg JM, Shaw LM, et al. Diagnostic accuracy of four assays of prostatic acid phosphatase. Comparison using receiver operating characteristic curve analysis. J Am Med Assoc 1985;253:665-669.

31. Hermann GA, Sugiura HT, Krumm RP. Comparison of thyrotropin assays by relative operating characteristics analysis. Arch Pathol Lab Med 1986;110:21-25.

32. Kazmierczak SC, Van Lente F, Hodges ED. Diagnostic and prognostic utility of phospholipase A activity in patients with acute pancreatitis: comparison with amylase and lipase. Clin Chem 1991;37:356-360.

33. Flack MR, Oldfield EH, Cutler GB, et al. Urine free cortisol in the high-dose dexamethasone suppression test for the differential diagnosis of the Cushing syndrome. Ann Intern Med 1992;116:211-217.

*Note that these articles give detailed reviews of procedures. Review of these articles is especially recommended.
34. Guyatt GH, Patterson C, Ali M, et al. Diagnosis of iron-deficiency anemia in the elderly. Am J Med 1990;88:205-209.

35. Beck JR. The role of new laboratory tests in clinical-decision making. Clin Lab Med 1982;2:51-77.

36. Zweig MH. Apolipoproteins and lipids in coronary artery disease: Analysis of diagnostic accuracy using receiver operating characteristic plots and areas. Arch Pathol Lab Med 1994;118:141-144.

37. Centor RM, Keightley GE. Receiver operating characteristic (ROC) curve area analysis using The ROC ANALYZER. Proceedings of the Symposium for Computer Applications to Medical Care, 1989:222-226.

38. Pellar TG, Leung FY, Henderson AR. A computer program for rapid generation of receiver operating characteristic curves and likelihood ratios in the evaluation of diagnostic tests. Ann Clin Biochem 1988;25:411-416.
GP10-T: Assessment of Clinical Sensitivity and Specificity of Laboratory Tests; Tentative Guideline
General
1. We were very impressed with the document and believe it will be of value to the clinical
laboratory. Although most laboratories may not do studies that lead to ROC plots, they
certainly need to understand how they are developed and what they mean. This document will
be a good start.
2. The document is a summary of the relevant issues written at an introductory primer level. It will
therefore be of use to clinical "laboratorians" who will (one hopes) be guided by senior
investigators responsible for experimental design and analysis. In fact, perhaps the most telling
line of the document is this (page 5): "Consultation with a professional statistician is
recommended..."
In particular, none of the subtle issues involved in the data analysis are mentioned in the
document; there is no display (or explanation) of the results on the double-probability scale that
is most frequently used to fit the results with a straight line. Finally, there is a good list of
available software and one can find technical guidance by working through the references at the
back.
In a few words, this is an OK introductory primer on the subject. Nevertheless, it is historic and
important.
3. Our group, which routinely determines diagnostic efficiency, prefers cumulative distribution
analysis graphs (see, for example, BI Bluestein et al. Cancer Research 1984;44:4131–4136)
rather than ROC curves.
Cumulative distribution analysis graphs are more readily understood. Sensitivity and specificity
are immediately known for any concentration cutoff. ROC curves do not show concentration at
all and specificity only indirectly.
! Regarding cumulative distribution graphs, the subcommittee recognizes that these have
desirable features including the display of decision thresholds. An important limitation is that
multiple tests cannot be plotted together and compared directly to one another because the
abscissa depends on the concentration scale peculiar to each test. This is the feature that
allows for the display of decision thresholds but interferes with comparison of tests. ROC plots,
because the axes are normalized, permit all tests to be evaluated, either singly or in multiples,
on the very same scale, regardless of the original scale. The subcommittee did not intend to
review all graphical or statistical approaches to evaluating test performance, nor did it intend to
select one as the best or only approach. As ROC plots have finally received fairly widespread
recognition, we feel it is appropriate to recommend them without contending that they are
necessarily the only useful approach.
Specificity can be shown directly on an ROC plot by using the variant in which the abscissa
scale runs from right to left instead of left to right. This variation is already mentioned in the
document in the third paragraph of Section 4.5.2.1.
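The normalization the response describes can be illustrated with a brief sketch (the data and function name below are hypothetical, not from the guideline): expressing each decision threshold as a (FP fraction, TP fraction) pair places every test on the same unit square, whatever its original concentration scale.

```python
# Illustrative sketch only: hypothetical data and names, assuming a test
# whose values increase with disease.

def roc_points(diseased, nondiseased):
    """Return (FP fraction, TP fraction) pairs, one per candidate
    decision threshold."""
    thresholds = sorted(set(diseased) | set(nondiseased))
    points = []
    for t in thresholds:
        tpf = sum(x >= t for x in diseased) / len(diseased)        # sensitivity
        fpf = sum(x >= t for x in nondiseased) / len(nondiseased)  # 1 - specificity
        points.append((fpf, tpf))
    return points

# Two hypothetical tests on very different concentration scales:
test_a = roc_points([8, 9, 11, 14], [3, 5, 7, 10])
test_b = roc_points([800, 950, 1200], [300, 600, 900])
# Both sets of points lie on the same unit square, so the plots can be
# compared directly despite the different raw units.
```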
Foreword
4. In the sentence before "Note that assessing..." the “a” should be removed from the sentence to
read: "It is important to know just how inherently accurate each tool (test) is as a diagnostic
discriminator."
Section 4.2.5
5. In this section, the authors recommend consulting a statistician. In our view, this should be
emphasized very strongly because, as the authors point out in the response to Comment #46,
the statistical techniques and issues are not simple. This is evident even in the subcommittee's
own recommendation of McNemar's test or Fisher's exact test to compare ROC plots, which
really are not appropriate. Greenhouse and Mantel (Biometrics 1950;6:399–412) derived
appropriate non-parametric test statistics to use in this context. This class of statistics was
generalized by Wieand, Gail, James, and James (Biometrika 1989;76(3):585–592), who
provided a useful general nonparametric approach. In addition, parametric binormal models
which are discussed by Metz, Hanley, and others, are computationally more manageable than
nonparametric approaches, but they require careful assessment of the appropriateness of the
statistical assumptions on which the tests and estimators are based.
As a corollary to the above comment, more emphasis should be given to the importance of
adequate sample size to provide a sufficiently precise estimate of the ROC curve and use of
confidence intervals to assess the precision of the estimates. Because of the special nature of
the test statistics, power/sample size computations are not possible with any currently available
packages of which we are aware.
The standard method of comparing ROC curves using area under the curve, although well
accepted, is a blunt instrument, which receives much more emphasis than it deserves. This
measure averages in ranges of sensitivity/specificity, which would be of little clinical usefulness
and therefore are irrelevant to deciding between two competing technologies. Comparisons of
ROC curves at a definite specificity, or over a limited range of relevant specificities, as
proposed by Wieand et al. above, are better.
! The subcommittee recognizes the points made here and acknowledges the statistical
complexities involved. Because we do not feel it is appropriate to deal with these extensive
statistical issues in the document, we have revised Section 4.5.2.5, second paragraph, to be
more general and refer the reader to more primary sources, including Greenhouse and Mantel,
1950, and Wieand et al, 1989. Also, we revised Section 4.5.2.4 by adding the caveat that
global quantitative indices, such as area under the curve, can mask important information and
that visual inspection of the plot is necessary to fully appreciate test accuracy. Likewise,
Section 4.5.2.5 is revised to recommend visual inspection when comparing multiple tests. A
sentence was added to Section 4.2.5 that emphasizes the need for appropriate sample size.
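As a rough illustration of the fixed-specificity comparison advocated in the comment, one might sketch it as follows (hypothetical data and function name, not part of the guideline): two tests are compared at a clinically relevant specificity instead of by averaging over the whole area under each curve.

```python
def sensitivity_at_specificity(diseased, nondiseased, target_spec):
    """Sensitivity at the lowest threshold whose specificity reaches
    target_spec, assuming test values increase with disease."""
    for t in sorted(set(diseased) | set(nondiseased)):
        specificity = sum(x < t for x in nondiseased) / len(nondiseased)
        if specificity >= target_spec:
            return sum(x >= t for x in diseased) / len(diseased)
    return 0.0

# Hypothetical comparison of two tests at a specificity of 0.75:
sens_a = sensitivity_at_specificity([8, 9, 11, 14], [3, 5, 7, 10], 0.75)
sens_b = sensitivity_at_specificity([6, 9, 12, 13], [3, 5, 7, 10], 0.75)
# At this operating point the two tests can differ even when their
# global areas under the curve are similar.
```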
Section 4.3
6. In the special case when comparison is being done between different implementations of the
same test (for instance, comparison of CKMB on different analyzers), it may not be necessary to
go through the rigor of establishing the "true" clinical state of the patients. While I believe this
is essential for a new test, when the clinical laboratory is assessing equivalence, it may be
sufficient to simply compare the current implementation of the test with the new one using the
final diagnosis on the chart. Although this diagnosis is biased, because it was determined using
the laboratory's current test, the study should be valid because the question being asked is,
"Are the two implementations of the test equivalent?" If the ROC plots show that the tests
being compared are equivalent, then no additional studies would need to be done. However, if
the ROC plots were substantially different, then additional work would need to be done to
understand the difference. If the committee agrees with this and could include this type of
information in the current guideline without much delay, I believe it would be of value.
! While we recognize the logic in the approach used for the particular circumstances described in
the comment, we prefer not to encourage users of the document to compromise on the rigor of
their classification (diagnosis). Those users who are well acquainted with the principles will
know when it may not be necessary to seek definitive classifications. Even in the situation
described in the comment it is still advisable to establish the "true" clinical state if this had not
been done originally when the "current" (old) test was studied.
7. It is our understanding that the terms "blind/blinded/blinding" are now politically incorrect.
Contemporary terms are "masked/masking."
! The subcommittee changed "blind" and "blindly" to "masked" in Sections 4.3.5 and 4.4.1.
Section 4.5.2.1
8. In the second paragraph you write about plotting sensitivity/specificity pairs "over the entire
range of results observed." That is unnecessary. Results only have to be plotted for the
overlap region (the range in which sensitivity and specificity are both less than 1.0). This same
error occurs in the fourth paragraph with the statement that "TP and FP fractions vary
continuously with the decision threshold." No, they only vary when the cut-off point yields true
positive fractions >0 and <1.
An easy way to decide what range to plot is to look at the extremes for each group (disease vs.
non-disease). For a test that increases with disease, the range of values to plot on the ROC
curve is between the lowest value for the disease group and the highest value for the non-
disease group.
In the last paragraph of this section, the next to last sentence should read: "While the selection
of a decision...."
! The subcommittee agrees that results only have to be plotted for the overlap region. Sections
4.5.2.1 and 4.5.2.2 have been revised to add a statement to that effect.
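The overlap-region rule in Comment 8 can be sketched briefly (hypothetical data; the function name is ours): for a test that increases with disease, only thresholds between the lowest diseased value and the highest non-diseased value move the ROC point.

```python
def overlap_region(diseased, nondiseased):
    """Range of thresholds worth plotting for a test that increases with
    disease: from the lowest diseased value to the highest non-diseased
    value.  Outside this range the TP or FP fraction is pinned at 0 or 1
    and the ROC point no longer moves."""
    low, high = min(diseased), max(nondiseased)
    if low > high:
        return None  # groups separate perfectly; no overlap to plot
    return (low, high)

print(overlap_region([8, 9, 11, 14], [3, 5, 7, 10]))  # (8, 10)
```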
9. On page 11, line 5, the word "section" should be selection—"While the selection of a decision
threshold..."
Section 4.5.2.3
10. The word "accuracy" is used where I believe the word "sensitivity" should be used.
! The term "accuracy," not "sensitivity," was indeed intended. Accuracy is used here to refer to
the overall ability of the diagnostic device to discriminate between alternative states of health
(see the Foreword). Sensitivity and specificity are components of accuracy. No change is
indicated.
11. The statistical discussion is somewhat difficult to understand. However, the authors'
recommendation to use commercially available programs is a good one.
Section 4.5.2.5
12. I believe this section should be expanded. McNemar's statistic for paired data and Fisher's
exact test for unpaired data should be thoroughly described. Sample calculations would also be
useful. Comparing two tests using their areas under the plot has significant weaknesses. For
example, for tests where the ROC plots cross at some point, one test may be significantly
better than the other at a certain decision point. This may not be reflected by comparing areas
under each plot.
There is no discussion in this section on test efficiency (TP + TN)/(Total subjects). Efficiency
should be defined in the glossary and explained in this section. It is a commonly used method
to describe a test's usefulness at a particular decision point. Tests can also be compared by
their maximum efficiencies.
! See Comment 5. The subcommittee recognizes the statistical complexity and notes that
Comment 13 also addresses Fisher's exact test. As mentioned in Comment 5, we have added
some primary references and simplified the discussion in the document in the belief that a
thorough description of all of these approaches is beyond the scope of the document.
The term "efficiency" was removed previously in response to an earlier comment (#58).
Because efficiency is very dependent on prevalence, it is not actually a characteristic of the test
itself but of the interaction of the test with the setting.
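The response's point that efficiency depends on prevalence can be made concrete with a small sketch (illustrative figures only): writing efficiency as sensitivity × prevalence + specificity × (1 − prevalence) shows the same test yielding different efficiencies in different settings.

```python
def efficiency(sensitivity, specificity, prevalence):
    """Efficiency, (TP + TN) / total subjects, rewritten in terms of
    sensitivity, specificity, and disease prevalence."""
    return sensitivity * prevalence + specificity * (1 - prevalence)

# The same hypothetical test (sensitivity 0.90, specificity 0.80)
# in two settings with different prevalence:
print(round(efficiency(0.90, 0.80, 0.50), 3))  # 0.85  (50% prevalence)
print(round(efficiency(0.90, 0.80, 0.05), 3))  # 0.805 (5% prevalence)
```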
13. On page 14, second paragraph, the discussion of the Fisher's exact test seems vague to us. An idea of the intended
audience can be obtained from the "Summary of Comments."
Appendix
14. Rulemaker is not available. It never completed beta testing, and Digital Medicine, Inc. has not
made it available. I am not sure why, since it progressed far enough to be used in studies and
mentioned in publications.
! Rulemaker is still under development and the date of availability is projected to be 1996. GP10
has been revised accordingly.
15. I agree with item 2 in Comment 50 (page 38). An expansion of this document to discuss
selection of decision limits and predictive values would be a significant value. Possibly
discussions of "gray zones" could also be included. My experience is that there is a significant
lack of understanding of the concepts, how they are determined, and how they should be used.
While it may be beyond the scope of NCCLS to address what is essentially an educational issue,
I believe guidelines similar in scope to those in the ROC document would help the educational
process. However, I would not want to see the ROC document delayed to incorporate this
information. I believe it has value and should be approved.
EP6-P Evaluation of the Linearity of Quantitative Analytical Methods; Proposed Guideline (1986).
EP6-P discusses the verification of the analytical range (or linearity) of a clinical chemistry
device.
EP7-P Interference Testing in Clinical Chemistry; Proposed Guideline (1986). EP7-P discusses
interference testing during characterization and evaluation of a clinical laboratory method
or device.
EP9-T Method Comparison and Bias Estimation Using Patient Samples; Tentative Guideline
(1993). EP9-T discusses procedures for determining the relative bias between two clinical
chemistry methods or devices. It also discusses the design of a method comparison
experiment using split patient samples and analysis of the data.