GP10-A
Vol. 15 No. 19
December 1995

Replaces GP10-T
Vol. 13 No. 28

Assessment of the Clinical Accuracy of Laboratory Tests Using
Receiver Operating Characteristic (ROC) Plots; Approved Guideline

This document provides a protocol for evaluating the accuracy of a test to discriminate between two subclasses of subjects where there is some clinically relevant reason to separate them. In addition to the use of ROC plots, the importance of defining the question, selecting the sample group, and determining the "true" clinical state is emphasized.

GP10-A

THIS NCCLS DOCUMENT HAS BEEN REAFFIRMED WITHOUT CHANGE
AS AN APPROVED CONSENSUS DOCUMENT, EFFECTIVE MAY 2001.

NCCLS: Serving the World's Medical Science Community Through Voluntary Consensus
NCCLS is an international, interdisciplinary, nonprofit, standards-developing and educational organization that promotes the development and use of voluntary consensus standards and guidelines within the healthcare community. It is recognized worldwide for the application of its unique consensus process in the development of standards and guidelines for patient testing and related healthcare issues. NCCLS is based on the principle that consensus is an effective and cost-effective way to improve patient testing and healthcare services.

In addition to developing and promoting the use of voluntary consensus standards and guidelines, NCCLS provides an open and unbiased forum to address critical issues affecting the quality of patient testing and health care.

PUBLICATIONS

An NCCLS document is published as a standard, guideline, or committee report.

Standard: A document developed through the consensus process that clearly identifies specific, essential requirements for materials, methods, or practices for use in an unmodified form. A standard may, in addition, contain discretionary elements, which are clearly identified.

Guideline: A document developed through the consensus process describing criteria for a general operating practice, procedure, or material for voluntary use. A guideline may be used as written or modified by the user to fit specific needs.

Report: A document that has not been subjected to consensus review and is released by the Board of Directors.

CONSENSUS PROCESS

The NCCLS voluntary consensus process is a protocol establishing formal criteria for:

• The authorization of a project

• The development and open review of documents

• The revision of documents in response to comments by users

• The acceptance of a document as a consensus standard or guideline.

Most NCCLS documents are subject to two levels of consensus, "proposed" and "approved." Depending on the need for field evaluation or data collection, documents may also be made available for review at an intermediate (i.e., "tentative") consensus level.

Proposed: An NCCLS consensus document undergoes the first stage of review by the healthcare community as a proposed standard or guideline. The document should receive a wide and thorough technical review, including an overall review of its scope, approach, and utility, and a line-by-line review of its technical and editorial content.

Tentative: A tentative standard or guideline is made available for review and comment only when a recommended method has a well-defined need for a field evaluation or when a recommended protocol requires that specific data be collected. It should be reviewed to ensure its utility.

Approved: An approved standard or guideline has achieved consensus within the healthcare community. It should be reviewed to assess the utility of the final document, to ensure attainment of consensus (i.e., that comments on earlier versions have been satisfactorily addressed), and to identify the need for additional consensus documents.

NCCLS standards and guidelines represent a consensus opinion on good practices and reflect the substantial agreement by materially affected, competent, and interested parties obtained by following NCCLS's established consensus procedures. Provisions in NCCLS standards and guidelines may be more or less stringent than applicable regulations. Consequently, conformance to this voluntary consensus document does not relieve the user of responsibility for compliance with applicable regulations.

COMMENTS

The comments of users are essential to the consensus process. Anyone may submit a comment, and all comments are addressed, according to the consensus process, by the NCCLS committee that wrote the document. All comments, including those that result in a change to the document when published at the next consensus level and those that do not result in a change, are responded to by the committee in an appendix to the document. Readers are strongly encouraged to comment in any form and at any time on any NCCLS document. Address comments to the NCCLS Executive Offices, 940 West Valley Road, Suite 1400, Wayne, PA 19087, USA.

VOLUNTEER PARTICIPATION

Healthcare professionals in all specialties are urged to volunteer for participation in NCCLS projects. Please contact the NCCLS Executive Offices for additional information on committee participation.

Assessment of the Clinical Accuracy of Laboratory Tests Using
Receiver Operating Characteristic (ROC) Plots; Approved Guideline

Abstract
Assessment of the Clinical Accuracy of Laboratory Tests Using Receiver Operating Characteristic (ROC) Plots; Approved Guideline (NCCLS document GP10-A) provides guidance for laboratorians who assess clinical test accuracy. It is not a recipe; rather, it is a set of concepts to be used to design an assessment of test performance or to interpret data generated by others. In addition to the use of ROC plots, the importance of defining the question, selecting a sample group, and determining the "true" clinical state is emphasized. The statistical data generated can be useful whether one is considering replacing an existing test, adding a new test, or eliminating a current test.

[NCCLS. Assessment of the Clinical Accuracy of Laboratory Tests Using Receiver Operating
Characteristic (ROC) Plots; Approved Guideline. NCCLS Document GP10-A (ISBN 1-56238-285-3).
NCCLS, 940 West Valley Road, Suite 1400, Wayne, Pennsylvania 19087, 1995.]

The NCCLS consensus process, which is the mechanism for moving a document through two
or more levels of review by the clinical laboratory testing community, is an ongoing process.
(See the inside front cover of this document for more information on the consensus process.)
Users should expect revised editions of any given document. Because rapid changes in
technology may affect the procedures, bench and reference methods, and evaluation protocols
used in clinical laboratory testing, users should replace outdated editions with the current
editions of NCCLS documents. Current editions are listed in the NCCLS Catalog, which is
distributed to member organizations, or to nonmembers on request. If your organization is not
a member and would like to become one, or to request a copy of the NCCLS Catalog, contact
the NCCLS Executive Offices. Telephone: 610.688.1100; Fax: 610.688.6400.



GP10-A
ISBN 1-56238-285-3
ISSN 0273-3099
December 1995

Assessment of the Clinical Accuracy of Laboratory Tests Using
Receiver Operating Characteristic (ROC) Plots; Approved Guideline

Volume 15 Number 19

Mark H. Zweig, M.D.
Edward R. Ashwood, M.D.
Robert S. Galen, M.D., M.P.H.
Ronley H. Plous, M.D., FCAP
Max Robinowitz, M.D.


This publication is protected by copyright. No part of it may be reproduced, stored in a retrieval system,
or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or
otherwise) without written permission from NCCLS, except as stated below.

NCCLS hereby grants permission to reproduce limited portions of this publication for use in laboratory
procedure manuals at a single site, for interlibrary loan, or for use in educational programs provided that
multiple copies of such reproduction shall include the following notice, be distributed without charge,
and, in no event, contain more than 20% of the document's text.

Reproduced with permission, from NCCLS publication GP10-A, Assessment of the Clinical
Accuracy of Laboratory Tests Using Receiver Operating Characteristic (ROC) Plots; Approved
Guideline. Copies of the current edition may be obtained from NCCLS, 940 West Valley Road,
Suite 1400, Wayne, Pennsylvania 19087, USA.

Permission to reproduce or otherwise use the text of this document to an extent that exceeds the
exemptions granted here or under the Copyright Law must be obtained from NCCLS by written request.
To request such permission, address inquiries to the Executive Director, NCCLS, 940 West Valley Road,
Suite 1400, Wayne, Pennsylvania 19087, USA.

Copyright ©1995. The National Committee for Clinical Laboratory Standards.

Suggested Citation

NCCLS. Assessment of the Clinical Accuracy of Laboratory Tests Using Receiver Operating Characteristic
(ROC) Plots; Approved Guideline. NCCLS Document GP10-A (ISBN 1-56238-285-3). NCCLS, 940 West
Valley Road, Suite 1400, Wayne, Pennsylvania 19087, USA.

Proposed Guideline: March 1987
Tentative Guideline: December 1993
Approved Guideline:
  Approved by Board of Directors: August 1995
  Approved by Membership: November 1995
  Published: December 1995

ISBN 1-56238-285-3
ISSN 0273-3099


Contents
                                                                              Page

Abstract .......................................................................... i

Committee Membership ............................................................ vii

Foreword ......................................................................... ix

1  Scope ........................................................................... 1

2  Glossary ........................................................................ 1

3  Outline of the Evaluation Procedure ............................................. 2

   3.1  Define the Clinical Question ............................................... 2
   3.2  Select a Representative Study Sample ....................................... 2
   3.3  Establish the "True" Clinical State of Each Subject ........................ 2
   3.4  Test the Study Subjects .................................................... 2
   3.5  Assess the Clinical Accuracy of the Test ................................... 2

4  Designing the Basic Evaluation Study ............................................ 3

   4.1  Define the Clinical Question ............................................... 3
   4.2  Select a Representative Study Sample ....................................... 3
   4.3  Establish the "True" Clinical State of Each Subject ........................ 4
   4.4  Test the Study Subjects .................................................... 6
   4.5  Assess the Clinical Accuracy of the Test ................................... 6

5  The Use of ROC Plots: Examples from the Clinical Laboratory Literature ........ 11

6  Summary ........................................................................ 11

Figures ........................................................................... 13

Appendix: Computer Software for ROC Plotting and Analysis ........................ 17

References ........................................................................ 19

Summary of Comments and Subcommittee Responses ................................... 22

Related NCCLS Publications ....................................................... 27


Committee Membership

Area Committee on General Laboratory Practices

Gerald A. Hoeltge, M.D., Chairholder
  The Cleveland Clinic Foundation, Cleveland, Ohio

Donald A. Dynek, M.D., Vice Chairholder
  Pathology Medical Services, P.C., Lincoln, Nebraska

Subcommittee on Clinical Evaluation of Tests

Mark H. Zweig, M.D., Chairholder
  National Institutes of Health, Bethesda, Maryland

Edward R. Ashwood, M.D.
  University of Utah School of Medicine, Salt Lake City, Utah

Robert S. Galen, M.D., M.P.H.
  Case Western Reserve University, Cleveland, Ohio

Ronley H. Plous, M.D., FCAP
  LabOne, Inc., Shawnee Mission, Kansas

Max Robinowitz, M.D.
  FDA Center for Devices and Radiological Health, Rockville, Maryland

Advisors

George S. Cembrowski, M.D., Ph.D.
  Park Nicollet Medical Center, St. Louis Park, Minnesota

William Lee Collinsworth, Ph.D.
  Boehringer Mannheim Diagnostics, Inc., Indianapolis, Indiana

William C. Dierksheide, Ph.D.
  FDA Center for Devices and Radiological Health, Rockville, Maryland

Jerome A. Donlon, M.D., Ph.D.
  FDA Center for Biologics Evaluation and Research, Rockville, Maryland

Marlene E. Haffner, M.D.
  Food and Drug Administration, Rockville, Maryland

Marianne C. Watters, M.T.(ASCP), Board Liaison
  Parkland Memorial Hospital, Dallas, Texas

Denise M. Lynch, M.T.(ASCP), M.S., Staff Liaison
  NCCLS, Wayne, Pennsylvania


Foreword

As laboratorians, we are often interested in how well a test performs clinically. This is true whether we are considering replacing an existing test with a newer one, adding a new test to our laboratory's menu, eliminating tests where possible, or simply wanting to know something about the value of what we are doing. This project was originally intended to make recommendations about assessing the clinical
performance of diagnostic tests. We elected to adopt the concepts of Swets and Pickett,1 whereby
clinical performance is divided into (1) a discrimination or diagnostic accuracy element and (2) a decision
or efficacy element. Laboratory tests are ordered to help answer questions about patient management.
How much help an individual test result provides is variable and, in any case, a highly complicated issue.
Management decisions and strategies are complex activities that require the physician to consider
probabilities of disease, quality of the data available, effectiveness of various treatment/management
alternatives, probability of outcomes, and value (and cost) of outcomes to the patient. Many types of
clinical data (including laboratory results) are usually integrated into a complex decision-making process.
Most often, a single laboratory test result is not the sole basis for a diagnosis or a patient-management
decision. Therefore, some have criticized the practice of evaluating the diagnostic performance of a test
as if it were used alone. However, each clinical tool, whether it is a clinical chemistry test, an
electroencephalogram, an electrocardiogram, a nuclide scan, an x-ray, a biopsy, a view through an orifice,
a pulmonary function test, or a sonogram, is meant to make some definable discrimination. It is important
to know just how inherently accurate each tool (test) is as a diagnostic discriminator. Note that assessing
clinical accuracy, without engaging in comprehensive clinical decision analysis, is a valid and useful
activity for the clinical laboratory. Clinical accuracy is the most fundamental characteristic of the test itself
as a classification device; it measures the ability of the test to discriminate among alternative states of
health. In the simplest form, this property is the ability to distinguish between just two states of health
or circumstances. Sometimes this involves distinguishing health from disease; other times it might involve
distinguishing between benign and malignant disease, between patients responding to therapy and those
not responding, or predicting who will get sick versus who will not. This ability to distinguish or
discriminate between two states among patients who could be in either of the two states is a property of
the test itself.

Indeed, the ability of the test to distinguish between the relevant alternative states or conditions of the
subject (i.e., clinical accuracy) is the most basic property of a laboratory test as a device to help in
decision making. This property is the place to start when assessing what value a test has in contributing
to the patient-management process. If the test cannot provide the relevant distinction, it will not be
valuable for patient care. On the other hand, once we establish that a test does discriminate well, then
we can explore its role in the process of patient management to determine the practical usefulness of the
information in a management strategy. This exploration is clinical decision analysis, and measures of test
accuracy provide part of the data used to carry out that analysis.

Usefulness or efficacy refers to the practical value of the information in managing patients. A test can
have considerable ability to discriminate, yet not be of practical value for patient care. This could happen
for several reasons. For instance, the cost or undesirability of false results can be so high that there is no
decision threshold for the test where the trade-off between sensitivity and specificity is acceptable.
Perhaps there are less invasive or less expensive means to obtain comparable information. The test may
be so expensive or technically demanding that its availability is limited. It could be so uncomfortable or
invasive that the subjects do not want to submit to it.

Exploration of the usefulness of medical information, such as test data, involves a number of factors or
parameters that are not properties of the test system or device; rather they are properties of the
circumstances of the clinical application. These include the probability of disease (prevalence), the
possible outcomes and the relative values of those outcomes, the costs to the patient (and others) of
incorrect information (false-positive and false-negative classifications), and the costs and benefits of
various treatment options. These are characteristics or properties of the context in which test information
is used, but they are not properties of the tests themselves. These factors interact with test
results to affect the usefulness of the test. Thus, it is helpful to conceptually separate the characteristic
that is fundamental and inherent to the tests themselves, discrimination ability, from the interaction that
results when this discrimination ability is mixed with external factors in the course of patient
management.

In summary, we define clinical accuracy as the basic ability to discriminate between two subclasses of
subjects where there is some clinically relevant reason to separate them. This concept of clinical accuracy
refers to the quality of the information (classification) provided by the test and it should be distinguished
from the practical usefulness of the information.1 Both are aspects of test performance. Second, we
suggest that the assessment of clinical accuracy is the place to start in evaluating test performance. If a
test cannot discriminate between clinically relevant subclasses of subjects, then there is little incentive to
go any further in exploring a possible clinical role. If, on the other hand, a test does exhibit substantial
ability to discriminate, then by examining the degree of accuracy of the test and/or by comparing its
accuracy to that of other tests, we can decide whether to delve into a more complex assessment of its
role in patient-care management (decision analysis). This document addresses the assessment of
diagnostic accuracy but not the analysis of usefulness, or the role of the test in patient-care strategy.

The subcommittee believes that this guideline will be of value to a wide variety of possible users, including:

• Investigators who are developing new tests for specific applications

• Manufacturers of reagents and other devices for performing tests who are interested in assessing or validating test performance in terms of clinical accuracy

• Regulatory agencies interested in establishing requirements for claims related to diagnostic accuracy

• Clinical laboratories that are reviewing data, literature, and/or generating their own data to make decisions about which tests to employ in their laboratory

• Health care/scientific workers interested in critical evaluation of data being presented on clinical test performance.

Key Words

Clinical accuracy, sensitivity, specificity, true-positive fraction, false-positive fraction, false-negative fraction, true-negative fraction, receiver operating characteristic (ROC) plot, performance evaluation, medical decision analysis.

Acknowledgment

The subcommittee thanks Dr. Gregory Campbell (Director, Division of Biostatistics, Office of Surveillance
and Biometrics, Center for Devices/Radiological Health, Food and Drug Administration, Rockville, MD) for
his invaluable expert statistical consultation on this document.


Assessment of the Clinical Accuracy of Laboratory Tests Using
Receiver Operating Characteristic (ROC) Plots; Approved Guideline

1 Scope

This guideline outlines the steps and principles for designing a prospective study to evaluate the intrinsic diagnostic accuracy of a clinical laboratory test, i.e., its fundamental ability to discriminate correctly among alternative states of health expressed in terms of sensitivity and specificity. Each of the steps is discussed in detail, along with its rationale and suggestions for its execution. These same concepts can be used in critical evaluations of data already generated.

2 Glossary

Clinical accuracy (diagnostic accuracy): The ability of a diagnostic test to discriminate between two or more clinical states, for example, discrimination between rheumatoid arthritis and systemic lupus erythematosus, between rheumatoid arthritis and "no joint disease," between chronic hepatitis and "no liver disease," and between rheumatoid arthritis and a "mixture" of other joint diseases.

Clinical state: A state of health or disease that has been defined either by a clinical definition or some other independent reference standard. Examples of clinical states include "no disease found," "disease 1" (where 1 represents the first clinical state under consideration), "disease 2" (where 2 represents the second clinical state under investigation), and so on.

Decision threshold (also decision level, cutoff): A test score used as the criterion for a "positive test." All test scores at or beyond this test score are considered to be "positive"; those not at or beyond the score are considered to be "negative." In some cases, a low test score is considered to be "abnormal," e.g., L/S ratio or hemoglobin. In other cases, a high test score is considered to be "abnormal," e.g., cardiac enzyme or uric acid concentration.

Diagnostic test: A measurement or examination used to classify patients into a particular class or clinical state.

Efficacy: Actual practical value of the data, i.e., usefulness for clinical purposes.

False-negative result (FN): Negative test result in a subject in whom the disease or condition is present.

False-positive result (FP): Positive test result in a subject in whom the disease or condition is absent.

False-negative fraction (FNF): Ratio of subjects who have the disease but who have a negative test result to all subjects who have the disease; FN/(FN + TP); same as (1 - sensitivity).

False-positive fraction (FPF): Ratio of subjects who do not have the disease but who have a positive test result to all subjects who do not have the disease; FP/(FP + TN); same as (1 - specificity).

Prevalence: The pretest probability of a particular clinical state in a specified population; the frequency of a disease in the population of interest at a given point in time.

Receiver operating characteristic (ROC) plot: A graphical description of test performance representing the relationship between the true-positive fraction (sensitivity) and the false-positive fraction (1 - specificity). Customarily, the true-positive fraction is plotted on the vertical axis and the false-positive fraction (or, alternatively, the true-negative fraction) is plotted on the horizontal axis. Clinical accuracy, in terms of sensitivity and specificity, is displayed for the entire spectrum of decision levels.

Sensitivity (clinical sensitivity): Test positivity in disease; true-positive fraction; ability of a test to correctly identify disease at a particular decision threshold.

Specificity (clinical specificity): Test negativity in health; true-negative fraction; ability of a test to correctly identify the absence of disease at a particular decision threshold.


Study group: A group of persons representing a sample of a clinically defined population of interest. The population of interest is the target group to which the test being evaluated will be applied in practice. Subgroups of the study group will be designated as belonging to particular clinical states by applying the standard criteria (see text).

True-negative result (TN): Negative test result in a subject in whom the disease is absent.

True-positive result (TP): Positive test result in a subject in whom the disease is present.

True-negative fraction (TNF): Ratio of subjects who do not have the disease and have a negative test to all subjects who do not have the disease; TN/(TN + FP); specificity.

True-positive fraction (TPF): Ratio of subjects who have the disease and a positive test to all subjects who have the disease; TP/(TP + FN); sensitivity.

3 Outline of the Evaluation Procedure

3.1 Define the Clinical Question (See Section 4.1)

Use the following procedure to define the clinical question:

(1) Characterize the subject population.

(2) State the management decision to be made.

(3) Identify the role of the test in making the decision.

3.2 Select a Representative Study Sample (See Section 4.2)

Use the following procedure to select a representative study sample:

(1) Select, prospectively, a statistically valid sample that consists of subjects who are representative of the population identified in Section 3.1 above.

(2) Select the sample independent of test results.

(3) Account for patients for whom data are incomplete.

3.3 Establish the "True" Clinical State of Each Subject (See Section 4.3)

Use the following procedure to establish the true clinical state of each subject:

(1) Adopt independent external standards or criteria of diagnostic truth for each relevant clinical state so as to classify each subject as accurately as possible. This may be based on a rigorous diagnostic workup or, alternatively, an assessment of clinical course or outcome.

(2) Classify subjects independent of the test being evaluated, i.e., without knowing the test results and without including the test results in the criteria.

3.4 Test the Study Subjects (See Section 4.4)

Use the following procedure to test the study subjects:

(1) Perform the test without knowing the clinical classification of the subjects.

(2) When comparing multiple tests, perform all tests on all subjects, preferably in a batch mode, and at the same point in their clinical course.

3.5 Assess the Clinical Accuracy of the Test (See Section 4.5)

Use the following procedure to assess the clinical accuracy of the test:

(1) Construct and analyze receiver operating characteristic (ROC) plots to evaluate test accuracy.

(2) Compare alternative tests on the basis of their ROC plots and analysis.
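
The four fractions defined in the glossary (Section 2) are simply the row-wise proportions of the 2x2 table of true clinical state versus test result. As a concrete illustration, here is a minimal Python sketch (not part of the guideline; the counts are invented for the example):

    # 2x2 table: true state (affected/unaffected) versus test result.
    TP, FN = 45, 5    # affected subjects: positive / negative results
    FP, TN = 10, 40   # unaffected subjects: positive / negative results

    TPF = TP / (TP + FN)   # true-positive fraction (sensitivity)
    FNF = FN / (FN + TP)   # false-negative fraction = 1 - sensitivity
    TNF = TN / (TN + FP)   # true-negative fraction (specificity)
    FPF = FP / (FP + TN)   # false-positive fraction = 1 - specificity

    print(f"TPF={TPF:.2f} FNF={FNF:.2f} TNF={TNF:.2f} FPF={FPF:.2f}")
    # prints: TPF=0.90 FNF=0.10 TNF=0.80 FPF=0.20

Note that, as the definitions require, the two fractions computed from the affected subjects (TPF, FNF) sum to 1, as do the two computed from the unaffected subjects (TNF, FPF).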


4 Designing the Basic Evaluation Study

4.1 Define the Clinical Question

Laboratory tests are requested to provide information that can be helpful in managing patients. There is always a relevant clinical question. Defining the clinical question is fundamental, then, because it establishes the particular patient-care issue being addressed by the evaluation. Can CK-2 concentrations be used to discriminate between acute myocardial infarction (AMI) and other causes of chest pain in subjects who present to an emergency department with a history suggestive of AMI? Which, among several tests, is the best to use in discriminating between those subjects with breast cancer who will respond to a particular chemotherapy and those who will not? Which, among several tests, is most accurate in distinguishing between iron deficiency and other causes of anemia in elderly patients who present with previously undiscovered anemia?

A given test can perform differently in different clinical settings. A test can perform well in helping to discriminate between young, apparently healthy men with no prostatic disease and middle-aged men with prostatic cancer, but it might not do so well in helping to discriminate between middle-aged men with benign prostatic disease and middle-aged men with malignant prostatic disease. The latter distinction addresses a relevant clinical question applied to symptomatic middle-aged men, whereas the former distinction addresses a different issue that might not be clinically relevant at all.

Usually, the clinical question or goal involves a population of apparently similar subjects (grouped together on the basis of information available before the test under evaluation is done) that should be subdivided into relevant management subgroups. The results of the test should indicate to which management subgroup individual subjects belong. For example, a radioimmunoassay (RIA) for serum angiotensin-converting enzyme activity might be expected to answer the following question: "Among patients with hypercalcemia, which ones have sarcoidosis?" The apparently similar patients share the common characteristic of hypercalcemia. The test helps in the attempt to divide them into subgroups: those with sarcoidosis and those with some other cause of hypercalcemia (such as malignancy or hyperparathyroidism), each of which would receive different management.

For the previously mentioned cases, the target population must be defined carefully, including the nature, duration, and magnitude of the qualifying conditions. For example, this might include a serum calcium concentration greater than "X" on two occasions at least one week apart, as well as age range, sex, and other findings (for example, chest x-ray) that are required for including and excluding subjects from the population.

4.2 Select a Representative Study Sample

The process of clearly defining the clinical question actually serves to identify the population relevant to the test evaluation. From this clinical population, choose a sample of subjects for the study. These subjects should be selected to represent the larger population of clinical interest about which conclusions are to be drawn.

The meaningfulness of the results depends on the care with which the relevant population is identified and sampled. The conclusions that can be drawn follow from the definition of the question and the nature of the subjects selected for study.

It is commonplace in routine laboratory practice to adopt or establish reference intervals, which are usually available with patient results to aid in their interpretation. These intervals are frequently derived from test-result data gathered from blood donors, laboratory workers, students, or other ambulatory, "healthy" volunteers. Note that such groups might not be relevant for the evaluations of diagnostic accuracy described in this guideline. When the accuracy of a test as a screening tool is being assessed, then a sample representative of the population to be screened should be used. Consider, for example, fecal occult blood testing for colon cancer. If the goal is to evaluate the accuracy of the test in discovering occult cancer in middle-aged subjects with no specific signs or symptoms suggestive of the disease, then the sample studied should be taken entirely from such a population. Studying a group of cancer-free,
healthy volunteers and a group already known to have carcinoma of the colon is not appropriate.

The same principles apply when a test is being used, not for screening, but for differentiating between disease states in symptomatic patients. If a test is to be used to identify acute pancreatitis in patients with a history and presentation indicating the possibility of pancreatitis, the sample should comprise such persons. Because the test is not intended to distinguish between healthy volunteers and patients with well-defined pancreatitis, a study sample composed of such subjects is not appropriate. Conclusions based on such a sample would not serve the purpose of the study.

4.2.1 Selection Bias

To avoid selection biases that could compromise the study's validity or relevance to the question being posed, choose subjects carefully. Using only patients with well-established or clinically apparent disease, for example, can exclude the more typical patients, especially those with occult or early disease. Likewise, using young healthy volunteers can be inappropriate to the presumptive application of the test. The measures of accuracy used here are influenced by the spectrum of disease in the target population and, therefore, in the sample. The importance of the proper spectrum of subjects is discussed in detail in the literature.2-6

4.2.2 Retrospective Study

Do not allow the test result or the testing procedures to affect the selection of subjects. Excluding patients with unexpected, equivocal, or discordant results is likely to make the test appear more useful than it is. A retrospective study with only patients who actually had their test results reported excludes patients who could not be successfully tested for various reasons, again possibly distorting the performance of the test.

4.2.3 Selection Before Testing

Choosing subjects before testing begins acts as a precaution against the biases introduced when the test result directly or indirectly influences the selection of subjects. To avoid any biases, include in the test all patients who meet the definition of the clinical group of interest until a predetermined number of subjects is obtained. Once chosen, subjects should not be dropped from the study. If some patients do not complete the study (because of technical errors, analytical interferences, death, or loss to follow-up), they should be accounted for in the final analysis of the data. The uncertainty and possible biases that the lost subjects cause in the study's conclusions must be considered and reported.

4.2.4 Prevalence of Disease

The approach described here is independent of prevalence of disease, so it is not necessary to have a sample that reflects actual prevalence. It is desirable to have approximately equal numbers of subjects who are truly affected and truly unaffected by the disease.

4.2.5 Consult a Statistician

Consultation with a professional statistician is recommended when planning the definition, size, and selection of study populations that will be used for critical evaluation of test performance. The sample size should be appropriate to the goals of the evaluation and provide valid estimates of ROC plots and comparisons among tests. When this is not possible, the criteria for selection should be clearly described.

4.3 Establish the "True" Clinical State of Each Subject

An objective assessment of clinical accuracy requires comparing the results provided by the test with some independent, external definition of truth. The clinical question, defined above, establishes what the categories of "truth" (states of health) are, relevant to the evaluation. Criteria or standards are applied to place individual persons in their respective categories of truth. The standards may include biopsy data, surgical or autopsy findings, imaging data, and long-term follow-up. Unfortunately, classifying individual persons into distinct categories can be an imperfect operation. The standards can be unreliable and/or can produce bias.6 Some subjects might not fit clearly into one of the defined states of health. Metz suggests that "truth is ultimately a philosophical concept, of course, and standards of truth are adequate for practical purposes if they are
substantially more reliable than the diagnostic system [test] undergoing evaluation."6(p. 723)

4.3.1 Validity of Evaluation

When evaluating the clinical accuracy of a test, the validity of the evaluation is limited by the accuracy with which the subjects are classified. A perfect test can appear to perform poorly simply because the "truth" was not established accurately for each patient and, therefore, the test results disagree with the apparent "true" diagnosis. On the other hand, when test results do agree with an inaccurate classification, the test will appear to perform better than it actually does. It is important, then, to attempt to classify individual persons as correctly as possible, as well as to consider the possible biases in the results caused by the classification scheme. The closer the classifications are to the truth, the less distortion there will be in the apparent performance of any test being evaluated.

4.3.2 True Clinical Subgroup

Routine clinical diagnoses are likely to be inadequate for evaluation studies. Determining a patient's true clinical subgroup can require such procedures as biopsy, surgical exploration, autopsy examination, angiography, or long-term follow-up of response to therapy and clinical outcome. Although such procedures can add to the financial cost of the evaluation, a less expensive, routine clinical evaluation can prove quite costly in the long term if its erroneous conclusions lead to improper test use or improper patient management.

4.3.3 Approaches to Classification

In many clinical situations, obtaining an independent, accurate classification of the patient's true clinical condition is difficult. Several strategies have been developed to deal with the difficulties in identifying true states of health. One strategy is to define the diagnostic problem in terms of measurable clinical outcomes.7 A second approach is to employ some sort of consensus, majority rule, or expert review to arrive at a less error-prone identification process.8 A third solution is to assume, for the comparison of several accurate tests, that there is some unknown mixture of diseased and nondiseased persons in the subject population and then to estimate this mixture parameter, as well as the other parameters.9 A fourth approach is, rather than definitively assigning each such patient to one of the groups, say, "diseased" or "nondiseased," to assign to each a value between 0 and 1 that corresponds to the (subjective) assessment of how likely it is that this patient belongs to the diseased group (this could be accomplished by logistic regression). Then there is no need to discard the data from these gray, fuzzy cases where group assignment is equivocal.10-13

Although diagnostic categories often do predict complications and therapeutic responses, the best evaluation of a test can be in terms of its ability to indicate clinical course or outcome, rather than its ability to assign a diagnosis. For example, it might be possible to classify patients with suspected prostatic disease into those who have cancer and those who do not have cancer based on biopsy results; however, it might be more useful to classify them in terms of which patients progress to overt disease. If the goal of the evaluation is to assess the accuracy of a serum marker in discriminating between those patients who need intervention and those who do not, then it is more relevant to know which patients will progress than to know which have histologic evidence of disease at that moment. This issue is actually one that is properly confronted earlier in formulating the original clinical management task to be addressed by the test under evaluation. Thus, lack of an immediate definitive diagnostic category does not necessarily prevent a valid assessment of the clinical accuracy of a test. In fact, even when the correct diagnosis can be easily established, a study correlating test results with the clinical course can provide a more useful clinical evaluation than a study that merely correlates test results with patient diagnoses.

4.3.4 Independent Classification

To avoid bias in evaluating the clinical accuracy of a test, the true clinical state should also be determined independent of the test(s) under investigation or used for comparison. Obviously, the new test should not be included in the criteria used to classify the subjects. Neither should a closely related test be included in the criteria for classifying subjects. For example, if an RIA for CK-MB is being evaluated for the diagnosis of AMI, neither CK-MB by electrophoresis nor CK-MB by immuno-inhibition should be included in the "gold standard" workup for
classifying the study subjects. Furthermore, if the performance of the CK-MB assay is to be compared directly to the performance of the LD-1/LD-2 isoenzyme ratio, then LD isoenzyme results should also not be included in the diagnostic criteria, because the apparent performance will be biased in favor of any test that is part of the "truth standard."

4.3.5 Masked Evaluation

To ensure that the classification is not influenced by the result of the test under evaluation, it should be done masked, that is, without knowing the results of the test. Furthermore, the criteria for classifying each patient into a management subgroup should be as objective as possible. When the classification rests on subjective evaluation of clinical or morphological patterns, such as radionuclide scans or bone marrow smears, the decision for each patient should reflect the consensus of experts who each interpret the material masked and independent of each other.

4.4 Test the Study Subjects

4.4.1 Conduct a Masked Study

The person performing the test under evaluation should do so masked, that is, without knowing the clinical status of the subject. Ideally, the testing should be done before the clinical question is answered. Knowing the answer to the clinical question can introduce subtle biases. Results that do not fit the clinical status might be selectively repeated or rejected on the basis of supposed technical difficulties or interfering factors.

4.4.2 Identical Specimens

When comparing two or more tests, it is important that the subjects and specimens be identical for all tests. Failure to use the identical subjects for evaluating each test can result in misleading conclusions based on sampling errors. Furthermore, subtle biases can affect the selection of subjects for the different groups. Thus, apparent differences in test performance can simply be reflections of differences in the composition of the groups tested. If some subjects have more advanced and, presumably, more easily detectable disease and are tested by only some of the tests, those tests could appear to have better sensitivity than the others. Conversely, inclusion of subjects with minimal disease, which might be harder to detect, would tend to diminish the apparent sensitivity of tests performed on these subjects, as compared with tests not done on these subjects. Performing all tests on all subjects ensures that differences in sensitivity and specificity are not simply due to inconsistent application of the diagnostic criteria.

Similarly, if two or more tests are applied to the same subject at different times during the course of his illness, an apparent superiority of one of the tests might simply reflect that it was done when the disease was more easily detected. Therefore, all tests should be performed at the same point in the course of each subject's illness. Using identical specimens for all tests obviates all of the above pitfalls.

4.4.3 Testing Mode

Assaying all samples in one batch, when possible, is suggested, to minimize the influence of between-run analytical variance. However, attention should be given to maintaining analyte stability through proper storage conditions.

4.5 Assess the Clinical Accuracy of the Test

It is suggested that the performance of a test be assessed by examining its clinical accuracy, that is, its ability to correctly classify individual persons into two subgroups, for example, a subgroup of persons affected by some disease (and therefore needing treatment) and a second subgroup of unaffected persons. If there is no overlap in test results from these two subgroups, then the test can identify all persons correctly and discriminate between the two subgroups perfectly. However, if there is some overlap in the test results for the two subgroups, the ability of the test to discriminate is not perfect. In either case, it is desirable to have a way to represent and measure this power to discriminate (accuracy).

4.5.1 Diagnostic or Clinical Sensitivity and Specificity

The ability of a test to identify or recognize the presence of disease is its diagnostic sensitivity; its ability to recognize the absence of disease is its diagnostic specificity. Both are measures of
accuracy and can be expressed as percentages, rates, or decimal fractions. A perfect test achieves a sensitivity and specificity of 100% or 1.0. However, tests are rarely perfect, and usually they do not achieve a sensitivity and a specificity of 100% at the same time.

Diagnostic sensitivity (true-positive rate or fraction) is defined as follows:

    TPF = TP / (TP + FN)        (1)

This is the fraction of persons who are truly affected by a disease who have positive test results.

Diagnostic specificity (true-negative fraction) is defined as follows:

    TNF = TN / (TN + FP)        (2)

This is the fraction of persons who are truly unaffected by a disease who have negative test results.

Often, a test is said to have a particular sensitivity and specificity. However, there is not a single sensitivity or specificity for a test; rather there is a continuum of sensitivities and specificities. By varying the decision threshold (or decision level, upper-limit-of-normal, cut-off value, or reference value), any sensitivity from 0 to 100% can be obtained, and each one will have a corresponding specificity. For each decision threshold used to classify the subjects as "positive" or "negative" based on test results, there is a single combination of sensitivity and specificity. These parameters occur, then, in pairs, and the accuracy of a test is reflected in the spectrum of pairs that can occur (not all pairs being possible for a particular test). For any test in which the distributions of results from the two categories of subjects overlap, there are inevitable "trade-offs" between sensitivity and specificity. As the decision threshold is varied over the range of observed results, the sensitivity and specificity will move in opposite directions. As one increases, the other decreases. For each decision threshold, then, there is a corresponding sensitivity and specificity pair. Which one(s) describe(s) the accuracy of the test? All of them do. Only the entire spectrum of sensitivity/specificity pairs provides a complete picture of test accuracy.

In Figure 1 (p. 13), at a threshold of 6 µg/L, CK-BB exhibits a sensitivity of 100% or 1.0. All 50 subjects with acute myocardial infarction (AMI) are correctly classified as "positive" or "affected." Likewise, at this same threshold, 9 of the 20 subjects without AMI are incorrectly classified as positive, so the specificity is only 55% (55% true negatives, 45% false positives). However, when the decision threshold is 12 µg/L instead of 6, the sensitivity decreases to 96% (0.96) because only 48 of the 50 subjects with AMI are correctly classified as "positive." Furthermore, because all non-AMI subjects are now correctly classified as unaffected, specificity has increased to 100% (100% true negatives, 0% false positives). Thus the shift in the threshold from 6 to 12 µg/L results in a decrease in sensitivity and an increase in specificity. Note that sensitivity is calculated entirely from the affected (AMI) subjects, while specificity is calculated from the unaffected subgroup.

Furthermore, a test can have one set of sensitivity–specificity pairs in one clinical situation but a different set in another clinical situation with a different group of subjects. If CK-BB had been measured in postoperative patients suspected of having an AMI, instead of in emergency department patients (as in Figure 1, p. 13), the sensitivity–specificity pairs could be quite different. The spectrum of pairs contained in the test characterizes its basic accuracy for a particular clinical setting.
4.5.2 Receiver Operating Characteristic Plots

4.5.2.1 General

The spectrum of trade-offs between sensitivity and specificity is conveniently represented by the ROC plot.14 ROC methodology is based on statistical decision theory and was developed in the context of electronic signal detection and issues surrounding the behavior and use of radar receivers in the middle of the twentieth century.6 An ROC-type plot was used in the 1950s to
characterize the ability of an automated Pap smear analyzer to discriminate between smears with and without malignant cells.15

The ROC plot graphically displays this entire spectrum of a test's performance for a particular sample group of affected and unaffected subjects. It is, then, a "test performance curve," representing the fundamental clinical accuracy of the test by plotting all the sensitivity–specificity pairs resulting from continuously varying the decision threshold over the entire range of results observed. The important part of the plot is generated when the decision threshold is varying within the region where results from the affected and unaffected subjects overlap. Outside of the overlap region, either sensitivity or specificity is 1.0 and not varying; within the overlap region, neither is 1.0 and both are varying as the decision threshold varies. On the Y axis, sensitivity, or the true-positive fraction (TPF), is plotted. On the X axis, the false-positive fraction (FPF) (or 1-specificity) is plotted. This is the fraction of truly unaffected subjects who nevertheless have positive test results; therefore, it is a measure of specificity.

Another option is to plot specificity directly (the true-negative fraction) on the X axis. This results in a left-to-right "flip," giving a mirror image of the plot described above. However, if the X axis is labeled from 0 to 1.0 from right to left (instead of left to right), then the plot is not flipped over.

As mentioned above for sensitivity and specificity, TP and FP fractions vary continuously with the decision threshold within the region of overlapping results. Each decision threshold has a corresponding pair of TP (sensitivity) and FP (1-specificity) fractions. The rates observed also depend on the clinical setting, as reflected by the study group chosen. The FP fraction is influenced by the type of unaffected subjects included in the study group. If, for example, the unaffected subjects are all healthy blood donors who are free of any signs or symptoms, a test can appear to have much lower FP fractions than if the unaffected subjects are persons who clinically resemble those who actually have the disease.

Likewise, the TP fraction also depends on the study group. A test used to detect cancer can have higher TP fractions when applied to patients with active or advanced disease than to patients with stable or limited disease. This dependence of TP and FP fractions on the study population is the reason that an ROC plot must be generated for each clinical situation.

In the ROC plot, the various combinations of sensitivity and specificity possible for the test in a given setting are readily apparent. Also apparent, then, are the "trade-offs" inherent in varying the decision threshold for that test. As the decision level changes, sensitivity improves at the expense of specificity, or vice versa. This can be appreciated directly from the plot. Note that the decision thresholds, though known, are not part of the plot. However, selected decision thresholds can be displayed at the points on the plot where the corresponding sensitivity and specificity appear.

Because true- and false-positive fractions are calculated entirely separately, using the test results from two different subgroups of persons (affected, unaffected), the ROC plot is independent of the prevalence in the sample of the disease or condition of interest. However, as mentioned above, the TPFs and FPFs, and thus the ROC plot, are still influenced by the type (spectrum) of subjects included in the sample.

The ROC plot provides a general, global assessment of performance that is not provided when only one or a few sensitivity–specificity pairs are known. The test performance data obtained to derive ROC plots may also be used to select decision thresholds for particular clinical applications of the test. Several elements besides test performance determine which of the possible sensitivity–specificity pairs (and thus the corresponding decision threshold) is most appropriate for a given patient-care application: (a) the relative cost or undesirability of errors, i.e., false-positive and false-negative classifications (the benefits of correct classifications may also be considered); (b) the value (or "utility") of various outcomes (death, cure, prolongation of life, or change in the quality of life); and (c) the relative proportions of the two states of health that the test is intended to discriminate between (prevalence of the conditions or diseases). While the selection of a decision threshold is usually required for using a test for patient management, this important step is beyond the scope of this guideline. Discussion of this complex issue can be found elsewhere.3,16-19


4.5.2.2 Generating the ROC Plot; Ties

Usually, clinical data occur in one of two forms: discrete or continuous. Most clinical laboratory data are continuous, being generated from a measuring device with sufficient resolution to provide observations on a continuum. Measurements of electrolyte, therapeutic drug, hormone, enzyme, and tumor-marker concentrations are essentially continuous. Urinalysis dipstick results, on the other hand, are discrete data, as are the positive/negative results of rapid pregnancy testing devices. Scales in diagnostic imaging also generally provide discrete (ratings) data, with rating categories such as "definitely abnormal," "probably abnormal," "equivocal," "probably normal," and "definitely normal."

A tie in laboratory data is of interest when a member of the diseased group has the same result as does a member of the nondiseased group. Such ties are more likely to occur when there are few data categories (i.e., few different results), such as with coarse discrete data (dipstick data, for example), than when the number of different results is large, as with continuous data. This results from grouping or "binning" the data into ordered categories. In clinical laboratories, when observations are made on a continuous scale, ties are much less likely (unless intentional grouping into "bins" has occurred). Theoretically, if measurements are exact enough, no two persons would have the same result on a continuous scale. However, the resolution of results in the clinical laboratory is often not so fine as to prevent this, and some ties will occur even with continuous data. Furthermore, intentional binning of continuous data also increases the chance for ties. This occurs when, for example, gonadotrophin results are expressed as whole numbers even though the assay provides concentrations to 0.1 of a unit. It also occurs when all results within intervals, such as 0–50, 51–100, etc., are grouped together. Ties can be caused, then, either by the intentional binning of data or by the degree of analytical resolution of the method of observation.

For both tied and untied data, one merely plots the calculated (1-specificity, sensitivity) points at all the possible decision thresholds (observed values) of the test. (This can be limited to the decision thresholds in the region of overlapping results; see Section 4.5.2.1.) It is the graph of these points that is the ROC plot. For data with no ties, adjacent points can be connected with horizontal and vertical lines in a unique manner to give a staircase figure (Figure 2, p. 14). As the threshold changes, inclusion of a true-positive result in the decision rule produces a vertical line; inclusion of a false-positive result produces a horizontal line. As the numbers of persons in the two groups increase, the steps in the staircase become smaller and the plot usually appears less jagged. Because this ROC plot uses all the information in the data directly through the ranks of the test results in the combined sample, it can also be called the nonparametric ROC plot. The term "nonparametric" here refers to the lack of parameters needed to model the behavior of the plot, in contrast to parametric approaches that rely on models with parameters to be estimated.

When there are ties in continuous data, both the true-positive and false-positive fractions change simultaneously, resulting in a point displaced both horizontally and vertically from the last point. Connecting such adjacent points produces diagonal (nonhorizontal and nonvertical) lines on the plot. Diagonal segments in the ROC plot, then, indicate ties.

As mentioned above, ties may be intentionally introduced in the display of the test results by grouping the results into intervals. A common approach often adopted in the literature is to plot the ROC at only a few points by using only a few decision thresholds and connecting adjacent points with straight line segments. All data falling in an interval between thresholds are treated as tied. Although this bin approach has the advantage of plotting ease, it discards much of the data and introduces many ties. If the points are few and far between, this approximation can be poor and can misrepresent the actual plot.
intervals, such as 0–50, 51–100, etc., are 4.5.2.3 Qualitative Interpretation of the ROC
grouped together. Ties can be caused, then, Plot
either by the intentional binning of data or by the
degree of analytical resolution of the method of A test with good clinical performance achieves
observation. high TPFs (sensitivity), while having low FPFs
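
The plotting procedure described above is simple enough to sketch in a few lines of Python. The sketch below is illustrative only (the data and the function name roc_points are assumptions of the example, and results at or above a threshold are treated as positive); it emits one (FPF, TPF) point per observed threshold, and a value shared by the two groups, such as the tie at 5.2 below, produces consecutive points whose connecting segment is diagonal.

    # Illustrative sketch: points of the nonparametric ROC plot. Every distinct
    # observed value serves as a decision threshold; connecting consecutive
    # points yields the staircase (or diagonal segments where there are ties).

    def roc_points(diseased, nondiseased):
        """Return [(FPF, TPF), ...] from (0, 0) to (1, 1)."""
        thresholds = sorted(set(diseased) | set(nondiseased), reverse=True)
        points = [(0.0, 0.0)]
        for t in thresholds:
            tpf = sum(x >= t for x in diseased) / len(diseased)
            fpf = sum(x >= t for x in nondiseased) / len(nondiseased)
            points.append((fpf, tpf))
        return points

    diseased = [8.1, 6.5, 9.7, 5.2, 7.8]     # hypothetical data
    nondiseased = [3.1, 4.8, 5.2, 2.7, 4.2]  # note the tie at 5.2
    for fpf, tpf in roc_points(diseased, nondiseased):
        print(f"FPF = {fpf:.2f}, TPF = {tpf:.2f}")
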
4.5.2.3 Qualitative Interpretation of the ROC Plot

A test with good clinical performance achieves high TPFs (sensitivity), while having low FPFs (corresponding to high specificity). Tests with high diagnostic accuracy, then, have ROC plots with points close to the upper left corner, where TPFs are high and FPFs are low. A test with perfect accuracy, giving perfect discrimination between affected and unaffected groups, achieves a TPF of 1.0 (100% sensitivity) and an FPF of 0.0 (100% specificity) at one or more decision thresholds.
This ROC plot, then, goes through the point (0, 1.0) in the upper left corner. A simple rule of thumb is that the closer the plot is to this point, the more clinically accurate the test usually is. A test that does not discriminate between truly affected and truly unaffected subgroups has an ROC plot that runs at a 45° angle from the point (0, 0) to (1.0, 1.0). Along this line, TPF equals FPF at all points, regardless of the decision threshold. (See "X" in Figure 2.) All tests have plots between the 45° diagonal and the ideal upper left corner. The closer the plot is to the upper left corner, the higher the discriminating ability of the test. Visual inspection of the plot, then, provides a direct qualitative assessment of accuracy.

Figure 2 shows an ROC plot for a test with modest accuracy. Here the plot is in an intermediate position between the 45° diagonal and the ideal upper left corner. Figure 3 shows an ROC plot for a test with high accuracy. Note how closely the plot passes to the upper left corner, where sensitivity is highest and the FPF (1-specificity) is lowest. Figure 4 shows ROC plots of results of three tests, all derived from the same sample of persons. This provides a convenient comparison of accuracies. The plot for amylase lies above and to the left of the plot for phospholipase A (PLA). Thus, at most sensitivities (TPF), amylase has a lower FPF (higher specificity) than PLA. Conversely, at most FPFs, amylase has a higher TPF (better sensitivity) than does PLA. Amylase and lipase have nearly identical ROC plots, indicating virtually the same ability to discriminate. Both appear to be more accurate than PLA.

4.5.2.4 Area Under a Single ROC Plot

One convenient way to quantify the diagnostic accuracy of a laboratory test is to express its performance by a single number. The most common measure is the area under the ROC plot. By convention, this area is always ≥0.5 (if it is not, one can reverse the decision rule to make it so). Values range between 1.0 (perfect separation of the test values of the two groups) and 0.5 (no apparent distributional difference between the two groups of test values). The area does not depend only on a particular portion of the plot, such as the point closest to the upper left corner or the sensitivity at some chosen specificity, but on the entire plot. It is a quantitative, descriptive expression of how close the ROC plot is to the perfect one (area = 1.0). The statistician readily recognizes the ROC area as the Mann–Whitney version of the nonparametric two-sample statistic20,21 introduced by the chemist Frank Wilcoxon. An area of 0.8, for example, means that a randomly selected person from the diseased group has a laboratory test value larger than that for a randomly chosen person from the nondiseased group 80% of the time. It does not mean that a positive result occurs with probability 0.80 or that a positive result is associated with disease 80% of the time.

When there are no ties between the diseased and nondiseased groups, this area is easily computed from the plot as the sum of the rectangles under the staircase graph. Analytical formulas to calculate the area appear in reports by Bamber20 and Hanley and McNeil.21 Alternatively, the area can be obtained indirectly from the Wilcoxon rank-sum statistic.22

Parametric approaches to calculating the area, employing some model for fitting a curve, have also been described. Both parametric and nonparametric methods are discussed and compared in published reviews.13,23

In using global indices such as the area under the ROC plot, there is a loss of information. Therefore, it is undesirable to consider the area without visual examination of the ROC plot itself as well.
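
The equivalence between the area and the Mann-Whitney statistic can be made concrete with a short sketch. The following Python fragment is illustrative only (hypothetical data and an assumed function name); it scores each diseased-nondiseased pair as 1 if the diseased result is larger, 0.5 for a tie, and 0 otherwise, which reproduces the trapezoidal area under the nonparametric plot.

    # Illustrative sketch: area under the nonparametric ROC plot computed as
    # the Mann-Whitney probability P(diseased result > nondiseased result),
    # with ties counted as one-half.

    def roc_area(diseased, nondiseased):
        wins = 0.0
        for d in diseased:
            for n in nondiseased:
                if d > n:
                    wins += 1.0
                elif d == n:
                    wins += 0.5
        return wins / (len(diseased) * len(nondiseased))

    diseased = [8.1, 6.5, 9.7, 5.2, 7.8]     # hypothetical data
    nondiseased = [3.1, 4.8, 5.2, 2.7, 4.2]
    print(f"Area under the ROC plot = {roc_area(diseased, nondiseased):.3f}")
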
4.5.2.5 Statistical Comparison of Multiple Tests

Direct statistical comparison of multiple diagnostic tests is frequent in clinical laboratories. Usually, two (or more) tests are performed on the same subjects, as in a split-sample comparison.

Tests can be compared to one another at a single observed or theoretical sensitivity or specificity.24–26 Alternatively, a portion of the ROC plot can be used to compare tests.27

A global approach is to compare entire ROC plots by using an overall measure, such as the area under the plot; this can be performed either nonparametrically or parametrically.13 This is especially attractive to laboratories because the comparison does not rely on the selection of a particular decision threshold (which should consider prevalence and cost trade-off information). However, the user should always inspect the ROC plot visually when comparing tests, rather than relying on the area, which condenses all the information into a single number.
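
To show the mechanics of one such comparison, the following Python sketch resamples paired subjects and examines the difference between the two tests' areas. It is emphatically not a method prescribed by this guideline; a formal analysis should follow the cited statistical literature, the sample below is far too small for a real study, and all data, names, and parameters are assumptions of the example.

    # Hedged sketch: paired bootstrap of the difference in ROC areas between
    # two tests run on the same subjects (test A is index 0, test B index 1).
    import random

    def roc_area(diseased, nondiseased):
        """Mann-Whitney area, ties counted one-half (as in the sketch above)."""
        pairs = [(d, n) for d in diseased for n in nondiseased]
        return sum(1.0 if d > n else 0.5 if d == n else 0.0
                   for d, n in pairs) / len(pairs)

    def bootstrap_area_difference(diseased_pairs, nondiseased_pairs,
                                  n_boot=2000, seed=1):
        """Resample subjects with replacement; return sorted area differences."""
        rng = random.Random(seed)
        diffs = []
        for _ in range(n_boot):
            d = [rng.choice(diseased_pairs) for _ in diseased_pairs]
            n = [rng.choice(nondiseased_pairs) for _ in nondiseased_pairs]
            diffs.append(roc_area([a for a, _ in d], [a for a, _ in n])
                         - roc_area([b for _, b in d], [b for _, b in n]))
        return sorted(diffs)

    # Hypothetical paired results: (test A, test B) for each subject.
    diseased_pairs = [(8.1, 7.0), (6.5, 6.8), (9.7, 8.9), (5.2, 5.9), (7.8, 7.1)]
    nondiseased_pairs = [(3.1, 3.9), (4.8, 5.1), (5.2, 4.6), (2.7, 3.3), (4.2, 4.4)]
    diffs = bootstrap_area_difference(diseased_pairs, nondiseased_pairs)
    lo, hi = diffs[int(0.025 * len(diffs))], diffs[int(0.975 * len(diffs))]
    print(f"95% bootstrap interval for the area difference: ({lo:.3f}, {hi:.3f})")
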
4.5.2.6 Other ROC Statistics

Confidence intervals around a point or points on the ROC plot can be estimated both parametrically and nonparametrically for those who desire such estimates.13 When "true" diagnoses are not well known for the subjects being studied ("fuzzy" cases), the probability that a given patient belongs to a particular diagnostic category can be assigned, and a "fuzzy" ROC plot can be constructed.13

4.5.2.7 Advantages of ROC Plots13

The ROC plot has the following advantages: It is simple, graphical, and easily appreciated visually. It is a comprehensive representation of pure clinical accuracy, i.e., discriminating ability, over the entire range of the test. It does not require selection of a particular decision threshold, because the whole spectrum of possible decision thresholds is included. It is independent of prevalence: no care need be taken to obtain samples with representative prevalence; in fact, it is usually preferable to have equal numbers of subjects with both conditions. It provides a direct visual comparison between tests on a common scale. It requires no grouping or binning of data, and both specificity and sensitivity are readily accessible.

4.5.2.8 Disadvantages of ROC Plots

The ROC plot has several drawbacks: Actual decision thresholds usually do not appear on the plot (though they are known and are used to generate the graph). The number of subjects is also not part of the plot. Without computer assistance, the generation of plots and their analysis is cumbersome. (See the Appendix for available software packages.)

5 The Use of ROC Plots: Examples From the Clinical Laboratory Literature

Van Steirteghem et al28 compared the accuracies of myoglobin, CK-BB, CK-MB, and total CK in discriminating between persons with and without acute myocardial infarction who presented to an emergency department with typical chest pain. Because measurements were made on multiple, closely sequential serum samples timed from the onset of pain, ROC plots could be constructed for any sampling time. Leung et al29 performed a similarly detailed evaluation of total CK and CK-2 in 310 patients admitted to a cardiac care unit with chest pain. These authors also used ROC plots to describe the changing clinical accuracy at various time intervals after the onset of pain.

Several other authors have also used ROC plots in various ways. Carson et al30 investigated the abilities of four different assays of prostatic acid phosphatase to discriminate between subjects with prostatic cancer and subjects with either some other urologic abnormality or no known urologic abnormality. Hermann et al31 compared the clinical accuracies of two versions of a commercial assay for thyrotropin to test a claim that the newer one was superior for discriminating between euthyroidism and hyperthyroidism. Kazmierczak et al32 used ROC plots in a study of the clinical accuracies of lipase, amylase, and phospholipase A in discriminating acute pancreatitis from other diseases in 151 consecutive patients seen with abdominal pain. Flack et al33 used ROC plots and areas to compare the abilities of urinary free cortisol and 17-hydroxysteroid suppression tests to discriminate between Cushing disease and other causes of Cushing syndrome. Guyatt et al34 studied the ability of seven tests, including ferritin, transferrin saturation, mean cell volume, and erythrocyte protoporphyrin, to discriminate between iron-deficiency anemia and other causes of anemia in subjects older than 65 years who were admitted to the hospital with anemia. Beck,35 while studying iron-deficiency anemia, also used ROC plots to compare four tests.

6 Summary

The first step in designing a study to evaluate the clinical accuracy of a test is to establish the clinical goal clearly and explicitly. It is essential
to identify what issue of consequence to patient management is to be addressed by the test. The following guidelines are suggested for a clinical test evaluation or diagnostic trial.

•  Carefully define the clinical question or goal.

•  Choose study subjects who are representative of the clinical population to which the test is ultimately to be applied. Advance consultation with a statistician is recommended.

•  Perform all tests being evaluated on the same specimens from the same subjects; perform all tests on individual subjects at the same point in their clinical course.

•  Classify the subjects as either "affected" or "unaffected," or into other relevant management subgroups, by rigorous and complete means so that the true diagnoses or outcomes are approached closely. Diagnostic procedures that go beyond routine clinical practice, or the use of extensive follow-up, may be required for the purpose of the evaluation. All classification criteria should be independent of the test or tests being studied.

•  Evaluate and compare clinical accuracy in terms of sensitivity and specificity at all decision thresholds using ROC plots.

Figure 1. Dot diagram of serum CK-BB concentrations 16 hours after the onset of symptoms in 70 subjects presenting to an emergency department with typical chest pain. Fifty were eventually considered to have had acute myocardial infarction (AMI); 20 were not. (Data from Van Steirteghem AC, Zweig MH, Robertson EA, et al. Comparison of the effectiveness of four clinical chemical assays in classifying patients with chest pain. Clin Chem 1982;28:1319–1324.)

Figure 2. Nonparametric ROC plot of serum apolipoprotein A-I/B ratios used in identifying clinically
significant coronary artery disease (CAD) in 304 men suspected of having CAD. Presence or absence of
CAD was established by coronary angiography. Area under the ROC plot is 0.75. The line labeled "X"
represents the theoretical plot of a test with no ability to discriminate (area = 0.5). (From Zweig MH.
Apolipoproteins and lipids in coronary artery disease: Analysis of diagnostic accuracy using receiver
operating characteristic plots and areas. Arch Pathol Lab Med 1994; 118:141–144.)

Figure 3. Nonparametric ROC plot of serum myoglobin concentrations 5 hours after the onset of symptoms in 55 emergency department patients suspected of having acute myocardial infarction. The area under the plot is 0.953. Thirty-seven subjects had an AMI; 18 did not. (Data from Van Steirteghem AC, Zweig MH, Robertson EA, et al. Comparison of the effectiveness of four clinical chemical assays in classifying patients with chest pain. Clin Chem 1982;28:1319–1324.)

Figure 4. ROC plots for peak serum amylase, lipase, and phospholipase A (PLA) concentrations in
identifying acute pancreatitis in 151 consecutive patients with abdominal pain. (From Kazmierczak SC,
Van Lente F, Hodges ED. Diagnostic and prognostic utility of phospholipase A activity in patients with
acute pancreatitis: comparison with amylase and lipase. Clin Chem 1991;37:356–360.)

Appendix: Computer Software for ROC Plotting and Analysis

Commercial, public-domain, and shareware programs are available to calculate sensitivities and specificities; generate ROC plots; calculate areas under the plots; generate other statistics, such as standard deviations and confidence intervals; and perform analyses such as comparing the areas for multiple tests. This software has not been evaluated by NCCLS. The list was generated as a starting point for users who might wish to evaluate and purchase a package.

Some programs were developed primarily to deal with discrete or ratings-type data typically used by radiologists, for example, where the number of different results is small. Laboratories almost always generate continuous data with a virtually infinite number of possible results. Some programs are designed to use all the raw continuous data without grouping or compressing it into fewer categories or bins. Whether the data are continuous (not grouped or "binned") or originally discrete (or made discrete by binning), the ROC plot can be generated parametrically or nonparametrically. The nonparametric approach does not use models to fit the curve but rather simply plots the calculated (1-specificity, sensitivity) points at all the possible observed values of the tests. The points are connected to produce the plot. Parametric approaches rely on models with parameters to be estimated in order to fit a curve to the data. These approaches, the underlying assumptions, and a discussion of their characteristics, advantages, and disadvantages for laboratory data are published.13

Several commercial and public-domain software products for ROC plotting and analysis are listed below. Note that only three of the programs are designed to treat continuous data directly, without binning (forcing into discrete intervals) the data (numbers 4, 5, and 7 in the list). A comparison of features is published.13

CLINROC. Henry T. Sugiura and George A. Hermann, R. Phillip Custer Laboratories, Presbyterian University of Pennsylvania Medical Center, 39th & Market Streets, Philadelphia, PA 19104. CLINROC produces its parametric analysis of likelihood ratios not through maximum likelihood methods but rather based on the assumption of normality in the original or log-transformed scale.

Metz programs: LABROC1, CLABROC, ROCFIT, and CORROC. Charles E. Metz, Department of Radiology, MC2026, The University of Chicago Medical Center, 5841 South Maryland Avenue, Chicago, IL 60637-1470; [FAX (312) 702-6779; Internet address: c-metz@uchicago.edu]. The Metz programs are, for a single diagnostic test, LABROC1 for continuous data and ROCFIT for discrete data, and, for two correlated tests, CLABROC and CORROC, respectively. Program requesters are asked to specify the platform and to include, for microcomputer requests, two appropriate floppy disks. A version for the Macintosh is available.

ROC ANALYZER. Robert M. Centor, 10806 Stoneycreek Drive, Richmond, VA 23233. [E-mail address: rcentor@gim.meb.vab.edu]. This program is described by Centor and Keightley.37

ROCLAB. James M. DeLeo, Bldg. 12A, Room 2013, Division of Computer Research and Technology, National Institutes of Health, Bethesda, MD 20892. [E-mail address: deleoj@6100.dcrt.nih.gov]. ROCLAB provides maximal, as well as trapezoidal, areas for ties. It also has the ability to do ROC plots for fuzzy data.

RULEMAKER. Digital Medicine, Inc., Hanover, NH 03755; [(603) 643-3686]. Rulemaker, which will run on a Macintosh, is still being developed; a release version is anticipated in 1996.

SIGNAL. SYSTAT, Inc., 1800 Sherman Avenue, Evanston, IL 60201. SIGNAL is a module of the much larger commercial package SYSTAT.

TEP-UH (Test Evaluation Program - University Hospital). Thomas G. Pellar, Department of Clinical Biochemistry, University Hospital, P.O. Box 5339, 339 Windemere Road, London, Ontario, Canada N6A 5A5. Running TEP-UH requires the parent program MUMPS (Micronetics Design Corp., Rockville, MD).38

Two other packages are available:

TESTIMATE. idv-Data Analysis and Study Planning, Wessobrunner Str. 6, 82131 Gauting/Munich, Germany. Fax: +49.89.8503666.

SmarTest. idv-Data Analysis and Study Planning, Wessobrunner Str. 6, 82131 Gauting/Munich, Germany. Fax: +49.89.8503666.

Product/Vendor List in NCCLS Standards

This list includes products known to NCCLS at the time this guideline was published, but it is not all inclusive. NCCLS has not evaluated the listed products. Inclusion of products and/or vendors on the list does not constitute endorsement by NCCLS.

References

1. Swets JA, Pickett RM. Evaluation of Diagnostic Systems. New York: Academic Press Inc., 1982:1-6.

2. Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978;299:926-930.

3. Robertson EA, Zweig MH, Van Steirteghem AC. Evaluating the clinical efficiency of laboratory tests. Am J Clin Pathol 1983;79:78-86.

4. Zweig MH, Robertson EA. Why we need better test evaluations. Clin Chem 1982;28:1272-1276.

5. Lachs MS, Nachamkin I, Edelstein PH, et al. Spectrum bias in the evaluation of diagnostic tests: lessons from the rapid dipstick test for urinary tract infection. Ann Intern Med 1992;117:135-140.

6. Metz CE. ROC methodology in radiologic imaging. Invest Radiol 1986;21:720-733.

7. Valenstein PN. Evaluating diagnostic tests with imperfect standards. Am J Clin Pathol 1990;93:252-258.

8. Revesz G, Kundel HL, Bonitatibus M. The effect of verification on the assessment of imaging techniques. Invest Radiol 1983;18:194-198.

9. Henkelman RM, Kay I, Bronskill MJ. Receiver operator characteristic (ROC) analysis without truth. Med Decis Making 1990;10:24-29.

10. Campbell G, DeLeo JM. Fundamentals of fuzzy receiver operating characteristic (ROC) functions. In: Malone L, Beck K, eds. Computing Science and Statistics: Proceedings of the Twenty-First Symposium on the Interface. Alexandria, VA: American Statistical Association, 1989:543-548.

11. DeLeo JM, Campbell G. The fuzzy receiver operating characteristic function and medical decisions with uncertainty. Proceedings of the First International Symposium on Uncertainty Modeling and Analysis. College Park, MD: IEEE Computer Society Press, 1990:694-699.

12. Campbell G, Levy D, Bailey JJ. Bootstrap comparison of fuzzy R.O.C. curves for ECG-LVH algorithms using data from the Framingham Heart Study. J Electrocardiol 1990;23(suppl):132-137.

*13. Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine. Clin Chem 1993;39:561-577.

*14. Metz CE. Basic principles of ROC analysis. Semin Nucl Med 1978;8:283-298.

15. Lusted LB. ROC recollected [editorial]. Med Decis Making 1984;4:131-135.

16. Lusted LB. Decision making studies in patient management. N Engl J Med 1971;284:416-424.

17. Lusted LB. Signal detectability and medical decision-making. Science 1971;171:1217-1219.

18. McNeil BJ, Keeler E, Adelstein SJ. Primer on certain elements of medical decision making. N Engl J Med 1975;293:211-215.

19. Weinstein MC, Fineberg HV. Clinical Decision Analysis. Philadelphia: WB Saunders, 1980.

20. Bamber D. The area above the ordinal dominance graph and the area below the receiver operating characteristic curve. J Math Psychol 1975;12:387-415.

21. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29-36.

22. Hollander M, Wolfe DA. Nonparametric Statistical Methods. New York: John Wiley, 1973:67-78.

*23. Hanley JA. Receiver operating characteristic (ROC) methodology: the state of the art. Crit Rev Diagn Imaging 1989;29:307-335.

24. Beck JR, Shultz EK. The use of relative operating characteristic (ROC) curves in test performance evaluation. Arch Pathol Lab Med 1986;110:13-20.

*25. McNeil BJ, Hanley JA. Statistical approaches to the analysis of receiver operating characteristic (ROC) curves. Med Decis Making 1984;2:137-150.

26. Greenhouse SW, Mantel N. The evaluation of diagnostic tests. Biometrics 1950;6:399-412.

27. Wieand S, Gail MH, James BR, James KL. A family of nonparametric statistics for comparing diagnostic markers with paired or unpaired data. Biometrika 1989;76:585-592.

28. Van Steirteghem AC, Zweig MH, Robertson EA, et al. Comparison of the effectiveness of four clinical chemical assays in classifying patients with chest pain. Clin Chem 1982;28:1319-1324.

29. Leung FY, Galbraith LV, Jablonsky G, et al. Re-evaluation of the diagnostic utility of serum total creatine kinase and creatine kinase-2 in myocardial infarction. Clin Chem 1989;35:1435-1440.

30. Carson JL, Eisenberg JM, Shaw LM, et al. Diagnostic accuracy of four assays of prostatic acid phosphatase. Comparison using receiver operating characteristic curve analysis. J Am Med Assoc 1985;253:665-669.

31. Hermann GA, Sugiura HT, Krumm RP. Comparison of thyrotropin assays by relative operating characteristics analysis. Arch Pathol Lab Med 1986;110:21-25.

32. Kazmierczak SC, Van Lente F, Hodges ED. Diagnostic and prognostic utility of phospholipase A activity in patients with acute pancreatitis: comparison with amylase and lipase. Clin Chem 1991;37:356-360.

33. Flack MR, Oldfield EH, Cutler GB, et al. Urine free cortisol in the high-dose dexamethasone suppression test for the differential diagnosis of the Cushing syndrome. Ann Intern Med 1992;116:211-217.

34. Guyatt GH, Patterson C, Ali M, et al. Diagnosis of iron-deficiency anemia in the elderly. Am J Med 1990;88:205-209.

35. Beck JR. The role of new laboratory tests in clinical decision making. Clin Lab Med 1982;2:51-77.

36. Zweig MH. Apolipoproteins and lipids in coronary artery disease: Analysis of diagnostic accuracy using receiver operating characteristic plots and areas. Arch Pathol Lab Med 1994;118:141-144.

37. Centor RM, Keightley GE. Receiver operating characteristic (ROC) curve area analysis using the ROC ANALYZER. Proceedings of the Symposium for Computer Applications to Medical Care, 1989:222-226.

38. Pellar TG, Leung FY, Henderson AR. A computer program for rapid generation of receiver operating characteristic curves and likelihood ratios in the evaluation of diagnostic tests. Ann Clin Biochem 1988;25:411-416.

*Note that these articles give detailed reviews of procedures. Review of these articles is especially recommended.

Summary of Comments and Subcommittee Responses

GP10-T: Assessment of Clinical Sensitivity and Specificity of Laboratory Tests; Tentative Guideline

General

1. We were very impressed with the document and believe it will be of value to the clinical
laboratory. Although most laboratories may not do studies that lead to ROC plots, they
certainly need to understand how they are developed and what they mean. This document will
be a good start.

•  The subcommittee is pleased to receive this praise. No changes were requested.

2. The document is a summary of the relevant issues written at an introductory primer level. It will
therefore be of use to clinical "laboratorians" who will (one hopes) be guided by senior
investigators responsible for experimental design and analysis. In fact, perhaps the most telling
line of the document is this (page 5): "Consultation with a professional statistician is
recommended..."

In particular, none of the subtle issues involved in the data analysis are mentioned in the
document; there is no display (or explanation) of the results on the double-probability scale that
is most frequently used to fit the results with a straight line. Finally, there is a good list of
available software and one can find technical guidance by working through the references at the
back.

In a few words, this is an OK introductory primer on the subject. Nevertheless, it is historic and
important.

•  The subcommittee is pleased to receive these comments. No changes were requested.

3. Our group, which routinely determines diagnostic efficiency, prefers cumulative distribution
analysis graphs (see, for example, BI Bluestein et al. Cancer Research 1984;44:4131–4136)
rather than ROC curves.

Cumulative distribution analysis graphs are more readily understood. Sensitivity and specificity
are immediately known for any concentration cutoff. ROC curves do not show concentration at
all and specificity only indirectly.

•  Regarding cumulative distribution graphs, the subcommittee recognizes that these have
desirable features including the display of decision thresholds. An important limitation is that
multiple tests cannot be plotted together and compared directly to one another because the
abscissa depends on the concentration scale peculiar to each test. This is the feature that
allows for the display of decision thresholds but interferes with comparison of tests. ROC plots,
because the axes are normalized, permit all tests to be evaluated, either singly or in multiples,
on the very same scale, regardless of the original scale. The subcommittee did not intend to
review all graphical or statistical approaches to evaluating test performance, nor did it intend to
select one as the best or only approach. As ROC plots have finally received fairly widespread
recognition, we feel it is appropriate to recommend them without contending that they are
necessarily the only useful approach.

Specificity can be shown directly on an ROC plot simply by employing the variation in which the abscissa scale runs right to left instead of left to right. This is already mentioned in the document in the third paragraph of Section 4.5.2.1.

Foreword

4. In the sentence before "Note that assessing..." the “a” should be removed from the sentence to
read: "It is important to know just how inherently accurate each tool (test) is as a diagnostic
discriminator."

•  The subcommittee removed "a" from this sentence.

Section 4.2.5

5. In this section, the authors recommend consulting a statistician. In our view, this should be
emphasized very strongly because, as the authors point out in the response to Comment #46,
the statistical techniques and issues are not simple. This is evident even in the subcommittee's
own recommendation of McNemar's test or Fisher's exact test to compare ROC plots, which
really are not appropriate. Greenhouse and Mantel (Biometrics 1950;12:399) derived
appropriate non-parametric test statistics to use in this context. This class of statistics was
generalized by Wieand, Gail, James, and James (Biometrika 1989;76:3:585–592), who
provided a useful general nonparametric approach. In addition, parametric binormal models,
which are discussed by Metz, Hanley, and others, are computationally more manageable than
nonparametric approaches, but they require careful assessment of the appropriateness of the
statistical assumptions on which the tests and estimators are based.

As a corollary to the above comment, more emphasis should be given to the importance of
adequate sample size to provide a sufficiently precise estimate of the ROC curve and use of
confidence intervals to assess the precision of the estimates. Because of the special nature of
the test statistics, power/sample size computations are not possible with any currently available
packages of which we are aware.

The standard method of comparing ROC curves using area under the curve, although well
accepted, is a blunt instrument, which receives much more emphasis than it deserves. This
measure averages in ranges of sensitivity/specificity, which would be of little clinical usefulness
and therefore are irrelevant to deciding between two competing technologies. Comparisons of
ROC curves at a definite specificity, or over a limited range of relevant specificities, as
proposed by Wieand et al above, are better.

•  The subcommittee recognizes the points made here and acknowledges the statistical
complexities involved. Because we do not feel it is appropriate to deal with these extensive
statistical issues in the document, we have revised Section 4.5.2.5, second paragraph, to be
more general and refer the reader to more primary sources, including Greenhouse and Mantel,
1950, and Wieand et al, 1989. Also, we revised Section 4.5.2.4 by adding the caveat that
global quantitative indices, such as area under the curve, can mask important information and
that visual inspection of the plot is necessary to fully appreciate test accuracy. Likewise,
Section 4.5.2.5 is revised to recommend visual inspection when comparing multiple tests. A
sentence was added to Section 4.2.5 that emphasizes the need for appropriate sample size.

Section 4.3

6. In the special case when comparison is being done between different implementations of the
same test (for instance, comparison of CKMB on different analyzers), it may not be necessary to
go through the rigor of establishing the "true" clinical state of the patients. While I believe this
is essential for a new test, when the clinical laboratory is assessing equivalence, it may be
sufficient to simply compare the current implementation of the test with the new one using the
final diagnosis on the chart. Although this diagnosis is biased, because it was determined using
the laboratory's current test, the study should be valid because the question being asked is,
"Are the two implementations of the test equivalent?" If the ROC plots show that the tests
being compared are equivalent, then no additional studies would need to be done. However, if
the ROC plots were substantially different, then additional work would need to be done to
understand the difference. If the committee agrees with this and could include this type of
information in the current guideline without much delay, I believe it would be of value.

•  While we recognize the logic in the approach used for the particular circumstances described in
the comment, we prefer not to encourage users of the document to compromise on the rigor of
their classification (diagnosis). Those users who are well acquainted with the principles will
know when it may not be necessary to seek definitive classifications. Even in the situation
described in the comment it is still advisable to establish the "true" clinical state if this had not
been done originally when the "current" (old) test was studied.

Section 4.3.5 & 4.4.1

7. It is our understanding that the terms "blind/blinded/blinding" are now politically incorrect.
Contemporary terms are "masked/masking."

•  The subcommittee changed "blind" and "blindly" to "masked" in Sections 4.3.5 and 4.4.1.

Section 4.5.2.1

8. In the second paragraph you write about plotting sensitivity/specificity pairs "over the entire
range of results observed." That is unnecessary. Results only have to be plotted for the
overlap region (the range in which sensitivity and specificity are both less than 1.0). This same
error occurs in the fourth paragraph with the statement that "TP and FP fractions vary
continuously with the decision threshold." No, they only vary when the cut-off point yields true
positive fractions >0 and <1.

An easy way to decide what range to plot is to look at the extremes for each group (disease vs.
non-disease). For a test that increases with disease, the range of values to plot on the ROC
curve is between the lowest value for the disease group and the highest value for the non-
disease group.

In the last paragraph of this section, the next to last sentence should read: "While the selection
of a decision...."

•  The subcommittee agrees that results only have to be plotted for the overlap region. Sections
4.5.2.1 and 4.5.2.2 have been revised to add a statement to that effect.

9. On page 11, line 5, the word "section" should be selection—"While the selection of a decision
threshold..."

! "Section" has been changed to "selection" as suggested.

Section 4.5.2.3

10. The word "accuracy" is used where I believe the word "sensitivity" should be used.

! The term "accuracy," not "sensitivity," was indeed intended. Accuracy is used here to refer to
the overall ability of the diagnostic device to discriminate between alternative states of health
(see the Foreword). Sensitivity and specificity are components of accuracy. No change is
indicated.

11. The statistical discussion is a little bit difficult to understand. However, the author's
recommendation to use commercially available programs is a good one.

•  No change was requested.

Section 4.5.2.5

12. I believe this section should be expanded. McNemar's statistic for paired data and Fisher's
exact test for unpaired data should be thoroughly described. Sample calculations would also be
useful. Comparing two tests using their areas under the plot has significant weaknesses. For
example, for tests where the ROC plots cross at some point, one test may be significantly
better than the other at a certain decision point. This may not be reflected by comparing areas
under each plot.

There is no discussion in this section on test efficiency (TP + TN)/(Total subjects). Efficiency
should be defined in the glossary and explained in this section. It is a commonly used method
to describe a test's usefulness at a particular decision point. Tests can also be compared by
their maximum efficiencies.

•  See Comment 5. The subcommittee recognizes the statistical complexity and notes that
Comment 13 also addresses Fisher's exact test. As mentioned in Comment 5, we have added
some primary references and simplified the discussion in the document in the belief that a
thorough description of all of these approaches is beyond the scope of the document.

The term "efficiency" was removed previously in response to an earlier comment (#58).
Because efficiency is very dependent on prevalence, it is not actually a characteristic of the test
itself but of the interaction of the test with the setting.
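
A purely hypothetical calculation illustrates this prevalence dependence; the sensitivity and specificity values below are assumptions chosen for the example, not data from the document:

    # Efficiency, (TP + TN)/total, for an assumed test with sensitivity 0.90
    # and specificity 0.80, evaluated at two different disease prevalences.
    for prevalence in (0.50, 0.05):
        efficiency = 0.90 * prevalence + 0.80 * (1 - prevalence)
        print(f"prevalence {prevalence:.2f}: efficiency = {efficiency:.3f}")
    # Same test, different settings: 0.850 at 50% prevalence, 0.805 at 5%.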

13. On page 14, 2nd paragraph, the Fisher's exact test seems vague to us. An idea of the intended
audience can be obtained from the "Summary of Comments."

•  See Comment 12.

Appendix

14. Rulemaker is not available. It never completed beta testing, and Digital Medicine, Inc. has not
made it available. I am not sure why, since it progressed far enough to be used in studies and
mentioned in publications.

•  Rulemaker is still under development and the date of availability is projected to be 1996. GP10
has been revised accordingly.

Summary of Comments and Subcommittee Responses; Comment 50

15. I agree with item 2 in Comment 50 (page 38). An expansion of this document to discuss
selection of decision limits and predictive values would be of significant value. Possibly
discussions of "gray zones" could also be included. My experience is that there is a significant
lack of understanding of the concepts, how they are determined, and how they should be used.
While it may be beyond the scope of NCCLS to address what is essentially an educational issue,
I believe guidelines similar in scope to those in the ROC document would help the educational
process. However, I would not want to see the ROC document delayed to incorporate this
information. I believe it has value and should be approved.

•  The subcommittee is pleased to receive the recommendation for approval.

Related NCCLS Publications

EP5-T2 Precision Performance of Clinical Chemistry Devices—Second Edition; Tentative Guideline (1992). EP5-T2 contains guidelines for designing an experiment to evaluate the precision performance of clinical chemistry devices; recommendations on comparing the resulting precision estimates with manufacturer's precision performance claims and determining when such comparisons are valid; and manufacturer's guidelines for establishing claims.

EP6-P Evaluation of the Linearity of Quantitative Analytical Methods; Proposed Guideline (1986).
EP6-P discusses the verification of the analytical range (or linearity) of a clinical chemistry
device.

EP7-P Interference Testing in Clinical Chemistry; Proposed Guideline (1986). EP7-P discusses
interference testing during characterization and evaluation of a clinical laboratory method
or device.

EP9-T Method Comparison and Bias Estimation Using Patient Samples; Tentative Guideline
(1993). EP9-T discusses procedures for determining the relative bias between two clinical
chemistry methods or devices. It also discusses the design of a method comparison
experiment using split patient samples and analysis of the data.

EP10-T2 Preliminary Evaluation of Quantitative Clinical Laboratory Methods—Second Edition; Tentative Guideline (1993). EP10-T2 addresses experimental design and data analysis for preliminary evaluation of the performance of an analytical method or device.
