Reference Manual on Scientific Evidence: Third Edition
Reference Manual on
Scientific Evidence
Third Edition
Committee on the Development of the Third Edition of the
Reference Manual on Scientific Evidence
Committee on Science, Technology, and Law
Policy and Global Affairs
FEDERAL JUDICIAL CENTER
Copyright © National Academy of Sciences. All rights reserved.
THE NATIONAL ACADEMIES PRESS 500 Fifth Street, N.W. Washington, DC 20001
The Federal Judicial Center contributed to this publication in furtherance of the Center’s
statutory mission to develop and conduct educational programs for judicial branch employees. The views expressed are those of the authors and not necessarily those of the Federal
Judicial Center.
NOTICE: The project that is the subject of this report was approved by the Governing
Board of the National Research Council, whose members are drawn from the councils of
the National Academy of Sciences, the National Academy of Engineering, and the Institute
of Medicine. The members of the committee responsible for the report were chosen for
their special competences and with regard for appropriate balance.
The development of the third edition of the Reference Manual on Scientific Evidence was supported by Contract No. B5727.R02 between the National Academy of Sciences and the
Carnegie Corporation of New York and a grant from the Starr Foundation. The views
expressed in this publication are those of the authors and do not necessarily reflect those of
the National Academies or the organizations that provided support for the project.
International Standard Book Number-13: 978-0-309-21421-6
International Standard Book Number-10: 0-309-21421-1
Library of Congress Cataloging-in-Publication Data
Reference manual on scientific evidence. — 3rd ed.
p. cm.
Includes bibliographical references and index.
ISBN-13: 978-0-309-21421-6 (pbk.)
ISBN-10: 0-309-21421-1 (pbk.)
1. Evidence, Expert—United States. I. Federal Judicial Center.
KF8961.R44 2011
347.73′67—dc23
2011031458
Additional copies of this report are available from the National Academies Press, 500 Fifth
Street, N.W., Lockbox 285, Washington, DC 20055; (800) 624-6242 or (202) 334-3313
(in the Washington metropolitan area); Internet, http://www.nap.edu.
Copyright 2011 by the National Academy of Sciences. All rights reserved.
Printed in the United States of America
THE FEDERAL JUDICIAL CENTER
The Federal Judicial Center is the research and education agency of the federal judicial
system. It was established by Congress in 1967 (28 U.S.C. §§ 620–629), on the recommendation of the Judicial Conference of the United States, with the mission to “further
the development and adoption of improved judicial administration in the courts of the
United States.” By statute, the Chief Justice of the United States chairs the Federal Judicial
Center’s Board, which also includes the director of the Administrative Office of the U.S.
Courts and seven judges elected by the Judicial Conference.
The Center undertakes empirical and exploratory research on federal judicial processes,
court management, and sentencing and its consequences, often at the request of the Judicial
Conference and its committees, the courts themselves, or other groups in the federal system.
In addition to orientation and continuing education programs for judges and court staff on
law and case management, the Center produces publications, videos, and online resources.
The Center provides leadership and management education for judges and court employees,
and other training as needed. Center research informs many of its educational efforts. The
Center also produces resources and materials on the history of the federal courts, and it
develops resources to assist in fostering effective judicial administration in other countries.
Since its founding, the Center has had nine directors. Judge Barbara J. Rothstein became
director of the Federal Judicial Center in 2003.
www.fjc.gov
The National Academy of Sciences is a private, nonprofit, self-perpetuating society
of distinguished scholars engaged in scientific and engineering research, dedicated to the
furtherance of science and technology and to their use for the general welfare. Upon the
authority of the charter granted to it by the Congress in 1863, the Academy has a mandate that requires it to advise the federal government on scientific and technical matters.
Dr. Ralph J. Cicerone is president of the National Academy of Sciences.
The National Academy of Engineering was established in 1964, under the charter
of the National Academy of Sciences, as a parallel organization of outstanding engineers.
It is autonomous in its administration and in the selection of its members, sharing with
the National Academy of Sciences the responsibility for advising the federal government.
The National Academy of Engineering also sponsors engineering programs aimed at
meeting national needs, encourages education and research, and recognizes the superior
achievements of engineers. Dr. Charles M. Vest is president of the National Academy of
Engineering.
The Institute of Medicine was established in 1970 by the National Academy of Sciences
to secure the services of eminent members of appropriate professions in the examination
of policy matters pertaining to the health of the public. The Institute acts under the
responsibility given to the National Academy of Sciences by its congressional charter to
be an adviser to the federal government and, upon its own initiative, to identify issues of
medical care, research, and education. Dr. Harvey V. Fineberg is president of the Institute
of Medicine.
The National Research Council was organized by the National Academy of Sciences
in 1916 to associate the broad community of science and technology with the Academy’s
purposes of furthering knowledge and advising the federal government. Functioning in
accordance with general policies determined by the Academy, the Council has become
the principal operating agency of both the National Academy of Sciences and the National
Academy of Engineering in providing services to the government, the public, and the scientific and engineering communities. The Council is administered jointly by both Academies
and the Institute of Medicine. Dr. Ralph J. Cicerone and Dr. Charles M. Vest are chair and
vice chair, respectively, of the National Research Council.
www.national-academies.org
Committee on the Development of the Third Edition of the
Reference Manual on Scientific Evidence
Co-Chairs:
JEROME P. KASSIRER (IOM), Distinguished Professor, Tufts University
School of Medicine
GLADYS KESSLER, Judge, U.S. District Court for the District of Columbia
Members:
MING W. CHIN, Associate Justice, The Supreme Court of California
PAULINE NEWMAN, Judge, U.S. Court of Appeals for the Federal Circuit
KATHLEEN MCDONALD O’MALLEY, Judge, U.S. Court of Appeals for
the Federal Circuit
JED S. RAKOFF, Judge, U.S. District Court, Southern District of New York
CHANNING R. ROBERTSON, Ruth G. and William K. Bowes Professor,
School of Engineering, and Professor, Department of Chemical Engineering,
Stanford University
JOSEPH V. RODRICKS, Principal, Environ
ALLEN WILCOX, Senior Investigator, Institute of Environmental Health
Sciences
SANDY L. ZABELL, Professor of Statistics and Mathematics, Weinberg
College of Arts and Sciences, Northwestern University
Consultant to the Committee:
JOE S. CECIL, Project Director, Program on Scientific and Technical Evidence,
Division of Research, Federal Judicial Center
Staff:
ANNE-MARIE MAZZA, Director
STEVEN KENDALL, Associate Program Officer
GURUPRASAD MADHAVAN, Program Officer (until November 2010)
Board of the Federal Judicial Center
The Chief Justice of the United States, Chair
Judge Susan H. Black, U.S. Court of Appeals for the Eleventh Circuit
Magistrate Judge John Michael Facciola, U.S. District Court for the District of
Columbia
Judge James B. Haines, U.S. Bankruptcy Court for the District of Maine
Chief Judge James F. Holderman, U.S. District Court for the Northern District
of Illinois
Judge Edward C. Prado, U.S. Court of Appeals for the Fifth Circuit
Chief Judge Loretta A. Preska, U.S. District Court for the Southern District of
New York
Chief Judge Kathryn H. Vratil, U.S. District Court for the District of Kansas
James C. Duff, Director of the Administrative Office of the U.S. Courts
Committee on Science, Technology, and Law
National Research Council
DAVID KORN (Co-Chair), Professor of Pathology, Harvard Medical School,
and formerly, Inaugural Vice Provost for Research, Harvard University
RICHARD A. MESERVE (Co-Chair), President, Carnegie Institution for
Science, and Senior of Counsel, Covington & Burling LLP
FREDERICK R. ANDERSON, JR., Partner, McKenna Long & Aldridge LLP
ARTHUR I. BIENENSTOCK, Special Assistant to the President for Federal
Research Policy, and Director, Wallenberg Research Link, Stanford
University
BARBARA E. BIERER, Professor of Medicine, Harvard Medical School,
and Senior Vice President, Research, Brigham and Women’s Hospital
ELIZABETH H. BLACKBURN, Morris Herzstein Professor of Biology and
Physiology, University of California, San Francisco
JOHN BURRIS, President, Burroughs Wellcome Fund
ARTURO CASADEVALL, Leo and Julia Forchheimer Professor of
Microbiology and Immunology; Chair, Department of Microbiology and
Immunology; and Professor of Medicine, Albert Einstein College of
Medicine
JOE S. CECIL, Project Director, Program on Scientific and Technical
Evidence, Division of Research, Federal Judicial Center
ROCHELLE COOPER DREYFUSS, Pauline Newman Professor of Law
and Director, Engelberg Center on Innovation Law and Policy, New York
University School of Law
DREW ENDY, Assistant Professor, Bioengineering, Stanford University, and
President, The BioBricks Foundation
PAUL G. FALKOWSKI, Board of Governors Professor in Geological and
Marine Science, Department of Earth and Planetary Science, Rutgers, The
State University of New Jersey
MARCUS FELDMAN, Burnet C. and Mildred Wohlford Professor of
Biological Sciences, Stanford University
ALICE P. GAST, President, Lehigh University
JASON GRUMET, President, Bipartisan Policy Center
BENJAMIN W. HEINEMAN, JR., Senior Fellow, Harvard Law School and
Harvard Kennedy School of Government
D. BROCK HORNBY, U.S. District Judge for the District of Maine
ALAN B. MORRISON, Lerner Family Associate Dean for Public Interest and
Public Service, George Washington University Law School
PRABHU PINGALI, Deputy Director of Agricultural Development, Global
Development Program, Bill and Melinda Gates Foundation
HARRIET RABB, Vice President and General Counsel, Rockefeller
University
BARBARA JACOBS ROTHSTEIN, Director, The Federal Judicial Center
DAVID S. TATEL, Judge, U.S. Court of Appeals for the District of Columbia
Circuit
SOPHIE VANDEBROEK, Chief Technology Officer and President, Xerox
Innovation Group, Xerox Corporation
Staff
ANNE-MARIE MAZZA, Director
STEVEN KENDALL, Associate Program Officer
Foreword
In 1993, in the case Daubert v. Merrell Dow Pharmaceuticals, Inc., the Supreme
Court instructed trial judges to serve as “gatekeepers” in determining whether the
opinion of a proffered expert is based on scientific reasoning and methodology.
Since Daubert, scientific and technical information has become increasingly important in all types of decisionmaking, including litigation. As a result, the scientific and
legal communities have sought expanded opportunities for collaboration.
Our two institutions have been at the forefront of trying to improve the use
of science by judges and attorneys. In Daubert, the Supreme Court cited an amicus
curiae brief submitted by the National Academy of Sciences and the American
Association for the Advancement of Science to support the view of science as “a
process for proposing and refining theoretical explanations about the world that
are subject to further testing and refinement.” Similarly, in Kumho Tire Co. v.
Carmichael (1999) the Court cited an amicus brief filed by the National Academy
of Engineering for its assistance in explaining the process of engineering.
Soon after the Daubert decision the Federal Judicial Center published the first
edition of the Reference Manual on Scientific Evidence, which has become the leading
reference source for federal judges for difficult issues involving scientific testimony.
The Center also undertook a series of research studies and judicial education programs intended to strengthen the use of science in courts.
More recently the National Research Council through its Committee on Science, Technology, and Law has worked closely with the Federal Judicial Center to
organize discussions, workshops, and studies that would bring the two communities together to explore the nature of science and engineering, and the processes
by which science and technical information informs legal issues. It is in that spirit
that our organizations joined together to develop the third edition of the Reference
Manual on Scientific Evidence. This third edition, which was supported by grants from
the Carnegie Corporation of New York and the Starr Foundation, builds on the foundation of the
first two editions, published by the Center. This edition was overseen by a National
Research Council committee composed of judges and scientists and engineers who
share a common vision that together scientists and engineers and members of the
judiciary can play an important role in informing judges about the nature and work
of the scientific enterprise.
Our organizations benefit from the contributions of volunteers who give
their time and energy to our efforts. During the course of this project, two of
the chapter authors passed away: Margaret Berger and David Freedman. Both
Margaret and David served on NRC committees and were frequent contributors
to Center judicial education seminars. Both were involved in the development of
the Reference Manual from the beginning, both aided each of our institutions
through their service on committees, and both made substantial contributions to our understanding of law and science through their individual scholarship.
They will be missed but their work will live on in the thoughtful scholarship they
have left behind.
We extend our sincere appreciation to Dr. Jerome Kassirer and Judge Gladys
Kessler and all the members of the committee who gave so generously to make
this edition possible.
The Honorable Barbara J. Rothstein
Director, Federal Judicial Center

Ralph J. Cicerone
President, National Academy of Sciences
Acknowledgments
This report has been reviewed in draft form by individuals chosen for their diverse
perspectives and technical expertise, in accordance with procedures approved
by the National Academies’ Report Review Committee. The purpose of this
independent review is to provide candid and critical comments that will assist the
institution in making its published report as sound as possible and to ensure that
the report meets institutional standards for objectivity, accuracy, and responsiveness to the study charge. The review comments and draft manuscript remain
confidential to protect the integrity of the process.
We wish to thank the following individuals for their review of selected chapters of this report: Bert Black, Mansfield, Tanick & Cohen; Richard Bjur, University of Nevada; Michael Brick, Westat; Edward Cheng, Vanderbilt University; Joel
Cohen, Rockefeller University; Morton Corn, Morton Corn and Associates; Carl
Cranor, University of California, Riverside; Randall Davis, Massachusetts Institute of Technology; John Doull, University of Kansas; Barry Fisher, Los Angeles
County Sheriff’s Department; Edward Foster, University of Minnesota; David
Goldston, Natural Resources Defense Council; James Greiner, Harvard University; Susan Haack, University of Miami; David Hillis, University of Texas; Karen
Kafadar, Indiana University; Graham Kalton, Westat; Randy Katz, University of
California, Berkeley; Alan Leshner, American Association for the Advancement
of Science; Laura Liptai, Biomedical Forensics; Patrick Malone, Patrick Malone
& Associates; Geoffrey Mearns, Cleveland State University; John Monahan, The
University of Virginia; William Nordhaus, Yale University; Fernando Olguin,
U.S. District Court for the Central District of California; Jonathan Samet, University of Southern California; Nora Cate Schaeffer, University of Wisconsin;
Shira Scheindlin, U.S. District Court for the Southern District of New York; and
Reggie Walton, U.S. District Court for the District of Columbia.
Although the reviewers listed above have provided many constructive comments and suggestions, they were not asked to endorse the report, nor did they
see the final draft of the report before its release. The review of this report was
overseen by D. Brock Hornby, U.S. District Judge for the District of Maine.
Appointed by the National Academies, he was responsible for making certain that
an independent examination of this report was carried out in accordance with
institutional procedures and that all review comments were carefully considered.
Responsibility for the final content of this report rests entirely with the authoring
committee and the institution.
Preface
Supreme Court decisions during the last decade of the twentieth century mandated that federal courts examine the scientific basis of expert testimony to ensure
that it meets the same rigorous standard employed by scientific researchers and
practitioners outside the courtroom. Needless to say, this requirement places a
demand on judges not only to comprehend the complexities of modern science
but to adjudicate between parties’ differing interpretations of scientific evidence.
Science, meanwhile, advances. Methods change, new fields are born, new tests
are introduced, the lexicon expands, and fresh approaches to the interpretation of
causal relations evolve. Familiar terms such as enzymes and molecules are replaced
by microarray expression and nanotubes; single-author research studies have now
become multi-institutional, multi-author, international collaborative efforts.
No field illustrates the evolution of science better than forensics. The evidence provided by DNA technology was so far superior to other widely accepted
methods and called into question so many earlier convictions that the scientific
community had to reexamine many of its time-worn forensic science practices.
Although flaws of some types of forensic science evidence, such as bite and footprint analysis, lineup identification, and bullet matching, were recognized, even
the most revered form of forensic science—fingerprint identification—was found
to be fallible. Notably, even the “gold standard” of forensic evidence, namely
DNA analysis, can lead to an erroneous conviction if the sample is contaminated,
if specimens are improperly identified, or if appropriate laboratory protocols and
practices are not followed.
Yet despite its advances, science has remained fundamentally the same. In its
ideal expression, it examines the nature of nature in a rigorous, disciplined manner
in, whenever possible, controlled environments. It still is based on principles of
hypothesis generation, scrupulous study design, meticulous data collection, and
objective interpretation of experimental results. As in other human endeavors,
however, this ideal is not always met. Feverish competition between researchers
and their parent institutions, fervent publicity seeking, and the potential for dazzling financial rewards can impair scientific objectivity. In recent years we have
experienced serious problems that range from the introduction of subtle bias in
the design and interpretation of experiments to overt fraudulent studies. In this
welter of modern science, ambitious scientists, self-designated experts, billion-dollar corporate entities, and aggressive claimants, judges must weigh evidence,
judge, and decide.
As with previous editions of the Reference Manual, this edition is organized
according to many of the important scientific and technological disciplines likely
to be encountered by federal (or state) judges. We wish to highlight here two
critical issues germane to the interpretation of all scientific evidence, namely issues
of causation and conflict of interest. Causation is the task of attributing cause
and effect, a normal everyday cognitive function that ordinarily takes little or
no effort. Fundamentally, the task is an inferential process of weighing evidence
and using judgment to conclude whether or not an effect is the result of some
stimulus. Judgment is required even when using sophisticated statistical methods.
Such methods can provide powerful evidence of associations between variables,
but they cannot prove that a causal relationship exists. Theories of causation
(evolution, for example) lose their designation as theories only if the scientific
community has rejected alternative theories and accepted the causal relationship as fact. Elements that are often considered in helping to establish a causal
relationship include predisposing factors, proximity of a stimulus to its putative
outcome, the strength of the stimulus, and the strength of the events in a causal
chain. Unfortunately, judges may be in a less favorable position than scientists to
make causal assessments. Scientists may delay their decision while they or others
gather more data. Judges, on the other hand, must rule on causation based on
existing information. Concepts of causation familiar to scientists (no matter what
stripe) may not resonate with judges who are asked to rule on general causation
(i.e., is a particular stimulus known to produce a particular reaction?) or specific
causation (i.e., did a particular stimulus cause a particular consequence in a specific instance?). In the final analysis, a judge does not have the option of suspending
judgment until more information is available, but must decide after considering
the best available science. Finally, given the enormous amount of evidence to be
interpreted, expert scientists from different (or even the same) disciplines may not
agree on which data are the most relevant, which are the most reliable, and what
conclusions about causation are appropriate to be derived.
Like causation, conflict of interest is an issue that cuts across most, if not all,
scientific disciplines and could have been included in each chapter of the Reference
Manual. Conflict of interest manifests as bias, and given the high stakes and adversarial nature of many courtroom proceedings, bias can have a major influence on
evidence, testimony, and decisionmaking. Conflicts of interest take many forms
and can be based on religious, social, political, or other personal convictions. The
biases that these convictions can induce may range from serious to extreme, but
these intrinsic influences and the biases they can induce are difficult to identify.
Even individuals with such prejudices may not appreciate that they have them, nor
may they realize that their interpretations of scientific issues may be biased by them.
Because of these limitations, we consider here only financial conflicts of interest;
such conflicts are discoverable. Nonetheless, even though financial conflicts can
be identified, having such a conflict, even one involving huge sums of money,
does not necessarily mean that a given individual will be biased. Having a financial
relationship with a commercial entity produces a conflict of interest, but it does
not inevitably evoke bias. In science, financial conflict of interest is often accompanied by disclosure of the relationship, leaving to the public the decision whether
the interpretation might be tainted. Needless to say, such an assessment may be
difficult. The problem is compounded in scientific publications by obscure ways
in which the conflicts are reported and by a lack of disclosure of dollar amounts.
Judges and juries, however, must consider financial conflicts of interest when
assessing scientific testimony. The threshold for pursuing the possibility of bias
must be low. In some instances, judges have been frustrated in identifying expert
witnesses who are free of conflict of interest because entire fields of science seem
to be co-opted by payments from industry. Judges must also be aware that the
research methods of studies funded specifically for purposes of litigation could
favor one of the parties. Though awareness of such financial conflicts in itself is
not necessarily predictive of bias, such information should be sought and evaluated
as part of the deliberations.
The Reference Manual on Scientific Evidence, here in its third edition, is formulated to provide the tools for judges to manage cases involving complex scientific
and technical evidence. It describes basic principles of major scientific fields
from which legal evidence is typically derived and provides examples of cases in
which such evidence was used. Authors of the chapters were asked to provide an
overview of principles and methods of the science and provide relevant citations.
We expect that few judges will read the entire manual; most will use the volume
in response to a need when a particular case arises involving a technical or scientific issue. To help in this endeavor, the Reference Manual contains completely
updated chapters as well as new ones on neuroscience, exposure science, mental
health, and forensic science. This edition of the manual has also gone through the
thorough review process of the National Academy of Sciences.
As in previous editions, we continue to caution judges regarding the proper
use of the reference guides. They are not intended to instruct judges concerning what evidence should be admissible or to establish minimum standards for
acceptable scientific testimony. Rather, the guides can assist judges in identifying
the issues most commonly in dispute in these selected areas and in reaching an
informed and reasoned assessment concerning the basis of expert evidence. They
are designed to facilitate the process of identifying and narrowing issues concerning scientific evidence by outlining for judges the pivotal issues in the areas of
science that are often subject to dispute. Citations in the reference guides identify
cases in which specific issues were raised; they are examples of other instances
in which judges were faced with similar problems. By identifying scientific areas
commonly in dispute, the guides should improve the quality of the dialogue
between the judges and the parties concerning the basis of expert evidence.
In our committee discussions, we benefited from the judgment and wisdom
of the many distinguished members of our committee, who gave time without compensation. They included Justice Ming Chin of the Supreme Court
of California; Judge Pauline Newman of the U.S. Court of Appeals for the
Federal Circuit in Washington, D.C.; Judge Kathleen McDonald O’Malley of
the U.S. Court of Appeals for the Federal Circuit; Judge Jed Rakoff of the U.S.
District Court for the Southern District of New York; Channing Robertson,
Ruth G. and William K. Bowes Professor, School of Engineering, and Professor,
Department of Chemical Engineering, Stanford University; Joseph Rodricks,
Principal, Environ, Arlington, Virginia; Allen Wilcox, Senior Investigator, Institute of Environmental Health Sciences, Research Triangle Park, North Carolina;
and Sandy Zabell, Professor of Statistics and Mathematics, Weinberg College of
Arts and Sciences, Northwestern University.
Special commendation, however, goes to Anne-Marie Mazza, Director of
the Committee on Science, Technology, and Law, and Joe Cecil of the Federal
Judicial Center. These individuals not only shepherded each chapter and its
revisions through the process, but provided critical advice on content and editing.
They, not we, are the real editors.
Finally, we would like to express our gratitude for the superb assistance of
Steven Kendall and for the diligent work of Guru Madhavan, Sara Maddox, Lillian
Maloy, and Julie Phillips.
Jerome P. Kassirer and Gladys Kessler
Committee Co-Chairs
Summary Table of Contents
A detailed Table of Contents appears at the front of each chapter.
Introduction, 1
Stephen Breyer
The Admissibility of Expert Testimony, 11
Margaret A. Berger
How Science Works, 37
David Goodstein
Reference Guide on Forensic Identification Expertise, 55
Paul C. Giannelli, Edward J. Imwinkelried, & Joseph L. Peterson
Reference Guide on DNA Identification Evidence, 129
David H. Kaye & George Sensabaugh
Reference Guide on Statistics, 211
David H. Kaye & David A. Freedman
Reference Guide on Multiple Regression, 303
Daniel L. Rubinfeld
Reference Guide on Survey Research, 359
Shari Seidman Diamond
Reference Guide on Estimation of Economic Damages, 425
Mark A. Allen, Robert E. Hall, & Victoria A. Lazear
Reference Guide on Exposure Science, 503
Joseph V. Rodricks
Reference Guide on Epidemiology, 549
Michael D. Green, D. Michal Freedman, & Leon Gordis
Reference Guide on Toxicology, 633
Bernard D. Goldstein & Mary Sue Henifin
Reference Guide on Medical Testimony, 687
John B. Wong, Lawrence O. Gostin, & Oscar A. Cabrera
Reference Guide on Neuroscience, 747
Henry T. Greely & Anthony D. Wagner
Reference Guide on Mental Health Evidence, 813
Paul S. Appelbaum
Reference Guide on Engineering, 897
Channing R. Robertson, John E. Moalli, & David L. Black
Appendix A. Biographical Information of Committee and Staff, 961
Introduction
STEPHEN BREYER
Stephen Breyer, LL.B., is Associate Justice of the Supreme Court of the United States.
Portions of this Introduction appear in Stephen Breyer, The Interdependence of Science and Law, 280
Science 537 (1998).
In this age of science, science should expect to find a warm welcome, perhaps a permanent home, in our courtrooms. The reason is a simple
one. The legal disputes before us increasingly involve the principles and tools of
science. Proper resolution of those disputes matters not just to the litigants, but
also to the general public—those who live in our technologically complex society
and whom the law must serve. Our decisions should reflect a proper scientific and
technical understanding so that the law can respond to the needs of the public.
Consider, for example, how often our cases today involve statistics—a tool
familiar to social scientists and economists but, until our own generation, not to
many judges. In 2007, the U.S. Supreme Court heard Zuni Public Schools District
No. 89 v. Department of Education,1 in which we were asked to interpret a statistical formula to be used by the U.S. Secretary of Education when determining
whether a state’s public school funding program “equalizes expenditures” among
local school districts. The formula directed the Secretary to “disregard” school
districts with “per-pupil expenditures . . . above the 95th percentile or below the
5th percentile of such expenditures . . . in the State.” The question was whether
the Secretary, in identifying the school districts to be disregarded, could look to
the number of pupils in a district as well as the district’s expenditures per pupil.
Answering that question in the affirmative required us to draw upon technical
definitions of the term “percentile” and to consider five different methods by
which one might calculate the percentile cutoffs.
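The point about competing definitions can be illustrated with a short computational sketch. The expenditure figures below are invented for illustration (they are not drawn from the Zuni record), and the two conventions shown are only two of the several textbook methods for computing a percentile; the sketch simply shows that different methods can select different cutoffs, and therefore different districts to disregard, from the same data.

```python
import math

# Hypothetical per-pupil expenditures for ten districts (invented figures).
spend = sorted([4200, 4500, 4700, 4800, 5100, 5300, 5600, 6000, 6400, 9800])

def nearest_rank(data, p):
    """Nearest-rank convention: the value at rank ceil(p/100 * n)."""
    k = math.ceil(p / 100 * len(data))
    return data[k - 1]

def linear_interp(data, p):
    """Linear interpolation between closest ranks (a common spreadsheet rule)."""
    h = (len(data) - 1) * p / 100
    lo = int(h)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (h - lo) * (data[hi] - data[lo])

for p in (5, 95):
    print(f"{p}th percentile: nearest-rank={nearest_rank(spend, p)}, "
          f"interpolated={linear_interp(spend, p)}")
# Nearest-rank places the 5th/95th cutoffs at 4200 and 9800, so no district
# falls outside them; interpolation places the cutoffs at 4335.0 and 8270.0,
# so the lowest- and highest-spending districts are both disregarded.
```

Under the first convention no district is excluded; under the second, two are. That divergence, multiplied across the several recognized methods, is precisely the kind of technical question the Court had to confront.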
In another recent Term, the Supreme Court heard two cases involving consideration of statistical evidence. In Hunt v. Cromartie,2 we ruled that summary
judgment was not appropriate in an action brought against various state officials,
challenging a congressional redistricting plan as racially motivated in violation of
the Equal Protection Clause. In determining that disputed material facts existed
regarding the motive of the state legislature in redrawing the redistricting plan, we
placed great weight on a statistical analysis that offered a plausible alternative interpretation that did not involve an improper racial motive. Assessing the plausibility
of this alternative explanation required knowledge of the strength of the statistical
correlation between race and partisanship, understanding of the consequences of
restricting the analysis to a subset of precincts, and understanding of the relationships among alternative measures of partisan support.
In Department of Commerce v. United States House of Representatives,3 residents
of a number of states challenged the constitutionality of a plan to use two forms
of statistical sampling in the upcoming decennial census to adjust for expected
“undercounting” of certain identifiable groups. Before examining the constitutional issue, we had to determine if the residents challenging the plan had standing
to sue because of injuries they would be likely to suffer as a result of the sampling
1. 127 S. Ct. 1534 (2007).
2. 119 S. Ct. 1545 (1999).
3. 119 S. Ct. 765 (1999).
plan. In making this assessment, it was necessary to apply the two sampling strategies to population data in order to predict the changes in congressional apportionment that would most likely occur under each proposed strategy. After resolving
the standing issue, we had to determine if the statistical estimation techniques were
consistent with a federal statute.
In each of these cases, we judges were not asked to become expert statisticians, but we were expected to understand how the statistical analyses worked.
Trial judges today are asked routinely to understand statistics at least as well, and
probably better.
But science is far more than tools, such as statistics. And that “more” increasingly enters directly into the courtroom. The Supreme Court, for example, has
recently decided cases involving basic questions of human liberty, the resolution
of which demanded an understanding of scientific matters. Recently we were
asked to decide whether a state’s method of administering a lethal injection to
condemned inmates constituted cruel and unusual punishment in violation of the
Eighth Amendment.4 And in 1997, we were asked to decide whether the Constitution protects a right to physician-assisted suicide.5 Underlying the legal questions
in these cases were medical questions: What effect does a certain combination of
drugs, administered in certain doses, have on the human body, and to what extent
can medical technology reduce or eliminate the risk of dying in severe pain? The
medical questions did not determine the answer to the legal questions, but to do
our legal job properly, we needed to develop an informed—although necessarily
approximate—understanding of the science.
Nor were the lethal-injection and “right-to-die” cases unique in this respect.
A different case concerned a criminal defendant who was found to be mentally
competent to stand trial but not mentally competent to represent himself. We
held that a state may insist that such a defendant proceed to trial with counsel.6
Our opinion was grounded in scientific literature suggesting that mental illness
can impair functioning in different ways, and consequently that a defendant may
be competent to stand trial yet unable to carry out the tasks needed to present
his own defense.
The Supreme Court’s docket is only illustrative. Scientific issues permeate
the law. Criminal courts consider the scientific validity of, say, DNA sampling or
voiceprints, or expert predictions of defendants’ “future dangerousness,” which
can lead courts or juries to authorize or withhold the punishment of death. Courts
review the reasonableness of administrative agency conclusions about the safety of
a drug, the risks attending nuclear waste disposal, the leakage potential of a toxic
waste dump, or the risks to wildlife associated with the building of a dam. Patent
law cases can turn almost entirely on an understanding of the underlying technical
4. Baze v. Rees, 128 S. Ct. 1520 (2008).
5. Washington v. Glucksberg, 521 U.S. 702 (1997); Vacco v. Quill, 521 U.S. 793 (1997).
6. Indiana v. Edwards, 128 S. Ct. 2379 (2008).
or scientific subject matter. And, of course, tort law often requires difficult determinations about the risk of death or injury associated with exposure to a chemical
ingredient of a pesticide or other product.
The importance of scientific accuracy in the decision of such cases reaches
well beyond the case itself. A decision wrongly denying compensation in a toxic
substance case, for example, can not only deprive the plaintiff of warranted compensation but also discourage other similarly situated individuals from even trying
to obtain compensation and encourage the continued use of a dangerous substance.
On the other hand, a decision wrongly granting compensation, although of immediate benefit to the plaintiff, can improperly force abandonment of the substance.
Thus, if the decision is wrong, it will improperly deprive the public of what can
be far more important benefits—those surrounding a drug that cures many while
subjecting a few to less serious risk, for example. The upshot is that we must search
for law that reflects an understanding of the relevant underlying science, not for law
that frees companies to cause serious harm or forces them unnecessarily to abandon
the thousands of artificial substances on which modern life depends.
The search is not a search for scientific precision. We cannot hope to investigate all the subtleties that characterize good scientific work. A judge is not a
scientist, and a courtroom is not a scientific laboratory. But consider the remark
made by the physicist Wolfgang Pauli. After a colleague asked whether a certain
scientific paper was wrong, Pauli replied, “That paper isn’t even good enough
to be wrong!”7 Our objective is to avoid legal decisions that reflect that paper’s
so-called science. The law must seek decisions that fall within the boundaries of
scientifically sound knowledge.
Even this more modest objective is sometimes difficult to achieve in practice.
The most obvious reason is that most judges lack the scientific training that might
facilitate the evaluation of scientific claims or the evaluation of expert witnesses
who make such claims. Judges typically are generalists, dealing with cases that can
vary widely in subject matter. Our primary objective is usually process-related:
seeing that a decision is reached fairly and in a timely way. And the decision in a
court of law typically (though not always) focuses on a particular event and specific
individualized evidence.
Furthermore, science itself may be highly uncertain and controversial with
respect to many of the matters that come before the courts. Scientists often express
considerable uncertainty about the dangers of a particular substance. And their
views may differ about many related questions that courts may have to answer.
What, for example, is the relevance to human cancer of studies showing that a
substance causes some cancers, perhaps only a few, in test groups of mice or rats?
What is the significance of extrapolations from toxicity studies involving high
doses to situations where the doses are much smaller? Can lawyers or judges or
anyone else expect scientists always to be certain or always to have uniform views
7. Peter W. Huber, Galileo’s Revenge: Junk Science in the Courtroom 54 (1991).
with respect to an extrapolation from a large dose to a small one, when the causes
of and mechanisms related to cancer are generally not well known? Many difficult
legal cases fall within this area of scientific uncertainty.
Finally, a court proceeding, such as a trial, is not simply a search for dispassionate truth. The law must be fair. In our country, it must always seek to protect
basic human liberties. One important procedural safeguard, guaranteed by our
Constitution’s Seventh Amendment, is the right to a trial by jury. A number of
innovative techniques have been developed to strengthen the ability of juries to
consider difficult evidence.8 Any effort to bring better science into the courtroom
must respect the jury’s constitutionally specified role—even if doing so means that,
from a scientific perspective, an incorrect result is sometimes produced.
Despite the difficulties, I believe there is an increasingly important need for
law to reflect sound science. I remain optimistic about the likelihood that it will
do so. It is common to find cooperation between governmental institutions and
the scientific community where the need for that cooperation is apparent. Today,
as a matter of course, the President works with a science adviser, Congress solicits
advice on the potential dangers of food additives from the National Academy of
Sciences, and scientific regulatory agencies often work with outside scientists, as
well as their own, to develop a product that reflects good science.
The judiciary, too, has begun to look for ways to improve the quality of
the science on which scientifically related judicial determinations will rest. The
Federal Judicial Center is collaborating with the National Academy of Sciences
through the Academy’s Committee on Science, Technology, and Law.9 The
Committee brings together on a regular basis knowledgeable scientists, engineers,
judges, attorneys, and corporate and government officials to explore areas of interaction and improve communication among the science, engineering, and legal
communities. The Committee is intended to provide a neutral, nonadversarial
forum for promoting understanding, encouraging imaginative approaches to problem solving, and discussing issues at the intersection of science and law.
In the Supreme Court, as a matter of course, we hear not only from the parties to a case but also from outside groups, which file amicus curiae briefs that help
us to become more informed about the relevant science. In the “right-to-die”
case, for example, we received about 60 such documents from organizations of
doctors, psychologists, nurses, hospice workers, and handicapped persons, among
others. Many discussed pain-control technology, thereby helping us to identify
areas of technical consensus and disagreement. Such briefs help to educate the
justices on potentially relevant technical matters, making us not experts, but
moderately educated laypersons, and that education improves the quality of our
decisions.
8. See generally Jury Trial Innovations (G. Thomas Munsterman et al. eds., 1997).
9. A description of the program can be found at Committee on Science, Technology, and Law,
http://www.nationalacademies.org/stl (last visited Aug. 10, 2011).
Moreover, our Court has made clear that the law imposes on trial judges the
duty, with respect to scientific evidence, to become evidentiary gatekeepers.10
The judge, without interfering with the jury’s role as trier of fact, must determine
whether purported scientific evidence is “reliable” and will “assist the trier of
fact,” thereby keeping from juries testimony that, in Pauli’s sense, isn’t even good
enough to be wrong. This requirement extends beyond scientific testimony to all
forms of expert testimony.11 The purpose of Daubert’s gatekeeping requirement
“is to make certain that an expert, whether basing testimony upon professional
studies or personal experience, employs in the courtroom the same level of intellectual rigor that characterizes the practice of an expert in the relevant field.”12
Federal trial judges, looking for ways to perform the gatekeeping function better, increasingly have used case-management techniques such as pretrial
conferences to narrow the scientific issues in dispute, pretrial hearings where
potential experts are subject to examination by the court, and the appointment
of specially trained law clerks or scientific special masters. For example, Judge
Richard Stearns of Massachusetts, acting with the consent of the parties in a
highly technical genetic engineering patent case,13 appointed a Harvard Medical
School professor to serve “as a sounding board for the court to think through
the scientific significance of the evidence” and to “assist the court in determining
the validity of any scientific evidence, hypothesis or theory on which the experts
base their testimony.”14 Judge Robert E. Jones of Oregon appointed experts from
four different fields to help him assess the scientific reliability of expert testimony
in silicone gel breast implant litigation.15 Judge Gladys Kessler of the District of
Columbia hired a professor of environmental science at the University of California at Berkeley “to answer the Court’s technical questions regarding the meaning
of terms, phrases, theories and rationales included in or referred to in the briefs
and exhibits” of the parties.16 Judge A. Wallace Tashima of the Ninth Circuit has
described the role of technical advisor as “that of a … tutor who aids the court
in understanding the ‘jargon and theory’ relevant to the technical aspects of the
evidence.”17
10. Gen. Elec. Co. v. Joiner, 522 U.S. 136 (1997); Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579 (1993).
11. Kumho Tire Co. v. Carmichael, 119 S. Ct. 1167 (1999).
12. Id. at 1176.
13. Biogen, Inc. v. Amgen, Inc., 973 F. Supp. 39 (D. Mass. 1997).
14. MediaCom Corp. v. Rates Tech., Inc., 4 F. Supp. 2d 17 app. B at 37 (D. Mass. 1998) (quoting the Affidavit of Engagement filed in Biogen, Inc. v. Amgen, Inc., 973 F. Supp. 39 (D. Mass. 1997) (No. 95-10496)).
15. Hall v. Baxter Healthcare Corp., 947 F. Supp. 1387 (D. Or. 1996).
16. Conservation Law Found. v. Evans, 203 F. Supp. 2d 27, 32 (D.D.C. 2002).
17. Ass’n of Mexican-American Educators v. State of California, 231 F.3d 572, 612 (9th Cir. 2000) (en banc) (Tashima, J., dissenting).
Judge Jack B. Weinstein of New York suggests that courts should sometimes “go beyond the experts proffered by the parties” and “appoint independent experts” as the Federal Rules of Evidence allow.18 Judge Gerald Rosen of
Michigan appointed a University of Michigan Medical School professor to testify
as an expert witness for the court, helping to determine the relevant facts in a
case that challenged a Michigan law prohibiting partial-birth abortions.19 Chief
Judge Robert Pratt of Iowa hired two experts—a professor of insurance and an
actuary—to help him review the fairness of a settlement agreement in a complex
class-action insurance-fraud case.20 And Judge Nancy Gertner of Massachusetts
appointed a professor from Brandeis University to assist the court in assessing a
criminal defendant’s challenge to the racial composition of the jury venire in the
Eastern Division of the District of Massachusetts.21
In what one observer has described as “the most comprehensive attempt to
incorporate science, as scientists practice it, into law,”22 Judge Sam Pointer, Jr.,
of Alabama appointed a “neutral science panel” of four scientists from different
disciplines to prepare a report and testimony on the scientific basis of claims in silicone gel breast implant product liability cases consolidated as part of a multidistrict
litigation process.23 The panel’s report was cited in numerous decisions excluding expert testimony that connected silicone gel breast implants with systemic
injury.24 The scientists’ testimony was videotaped and made part of the record
so that judges and jurors could consider it in cases returned to the district courts
from the multidistrict litigation process. The use of such videotape testimony can
result in more consistent decisions across courts, as well as great savings of time
and expense for individual litigants and courts.
These case-management techniques are neutral, in principle favoring neither
plaintiffs nor defendants. When used, they have typically proved successful. Nonetheless, judges have not often invoked their rules-provided authority to appoint
their own experts.25 They may hesitate simply because the process is unfamiliar
or because the use of this kind of technique inevitably raises questions. Will use
of an independent expert, in effect, substitute that expert’s judgment for that of
the court? Will it inappropriately deprive the parties of control over the presentation of the case? Will it improperly intrude on the proper function of the jury?
Where is one to find a truly neutral expert? After all, different experts, in total
honesty, often interpret the same data differently. Will the search for the expert
18. Jack B. Weinstein, Individual Justice in Mass Tort Litigation: The Effect of Class Actions,
Consolidations, and Other Multiparty Devices 116 (1995).
19. Evans v. Kelley, 977 F. Supp. 1283 (E.D. Mich. 1997).
20. Grove v. Principal Mutual Life Ins. Co., 200 F.R.D. 434, 443 (S.D. Iowa 2001).
21. United States v. Green, 389 F. Supp. 2d 29, 48 (D. Mass. 2005).
22. Olivia Judson, Slide-Rule Justice, Nat’l J., Oct. 9, 1999, at 2882, 2885.
23. In re Silicone Gel Breast Implant Prod. Liab. Litig., Order 31 (N.D. Ala. filed May 30,
1996) (MDL No. 926).
24. See Laura L. Hooper et al., Assessing Causation in Breast Implant Litigation: The Role of Science
Panels, 64 Law & Contemp. Probs. 139, 181 n.217 (collecting cases).
25. Joe S. Cecil & Thomas E. Willging, Accepting Daubert’s Invitation: Defining a Role for Court-Appointed Experts in Assessing Scientific Validity, 43 Emory L.J. 995, 1004 (1994).
create inordinate delay or significantly increase costs? Who will pay the expert?
Judge William Acker, Jr., of Alabama writes:
Unless and until there is a national register of experts on various subjects and a
method by which they can be fairly compensated, the federal amateurs wearing black robes will have to overlook their new gatekeeping function lest they
assume the intolerable burden of becoming experts themselves in every discipline
known to the physical and social sciences, and some as yet unknown but sure
to blossom.26
A number of scientific and professional organizations have come forward
with proposals to aid the courts in finding skilled experts. The National Conference of Lawyers and Scientists, a joint committee of the American Association for
the Advancement of Science (AAAS) and the Science and Technology Section
of the American Bar Association, has developed a program to assist federal and
state judges, administrative law judges, and arbitrators in identifying independent experts in cases that present technical issues, when the adversarial system is
unlikely to yield the information necessary for a reasoned and principled resolution of the disputed issues. The program locates experts through professional and
scientific organizations and with the help of a Recruitment and Screening Panel
of scientists, engineers, and health care professionals.27
The Private Adjudication Center at Duke University—which unfortunately
no longer exists—established a registry of independent scientific and technical
experts who were willing to provide advice to courts or serve as court-appointed
experts.28 Registry services also were available to arbitrators and mediators and
to parties and lawyers who together agreed to engage an independent expert at
the early stages of a dispute. The registry recruited experts primarily from major
academic institutions and conducted targeted searches to find experts with the
qualifications required for particular cases. Registrants were required to adhere to
a code of conduct designed to ensure confidence in their impartiality and integrity.
Among those judges who have thus far experimented with court-appointed
scientific experts, the reaction has been mixed, ranging from enthusiastic to disappointed. The Federal Judicial Center has examined a number of questions
arising from the use of court-appointed experts and, based on interviews with
participants in Judge Pointer’s neutral science panel, has offered lessons to guide
courts in future cases. We need to learn how better to identify impartial experts,
to screen for possible conflicts of interest, and to instruct experts on the scope of
26. Letter from Judge William Acker, Jr., to the Judicial Conference of the United States et al.
(Jan. 2, 1998).
27. Information on the AAAS program can be found at Court Appointed Scientific Experts,
http://www.aaas.org/spp/case/case.htm (last visited Aug. 10, 2011).
28. Letter from Corinne A. Houpt, Registry Project Director, Private Adjudication Center, to
Judge Rya W. Zobel, Director, Federal Judicial Center (Dec. 29, 1998) (on file with the Research
Division of the Federal Judicial Center).
their duties. Also, we need to know how better to protect the interests of the parties and the experts when such extraordinary procedures are used. We also need to
know how best to prepare a scientist for the sometimes hostile legal environment
that arises during depositions and cross-examination.29
It would also undoubtedly be helpful to recommend methods for efficiently
educating (i.e., in a few hours) willing scientists in the ways of the courts, just as
it would be helpful to develop training that might better equip judges to understand the ways of science and the ethical, as well as practical and legal, aspects of
scientific testimony.30
In this age of science we must build legal foundations that are sound in science as well as in law. Scientists have offered their help. We in the legal community should accept that offer. We are in the process of doing so. This manual
seeks to open legal institutional channels through which science—its learning,
tools, and principles—may flow more easily and thereby better inform the law.
The manual represents one part of a joint scientific–legal effort that will further
the interests of truth and justice alike.
29. Laura L. Hooper et al., Neutral Science Panels: Two Examples of Panels of Court-Appointed
Experts in the Breast Implants Product Liability Litigation 93–98 (Federal Judicial Center 2001);
Barbara S. Hulka et al., Experience of a Scientific Panel Formed to Advise the Federal Judiciary on Silicone
Breast Implants, 342 New Eng. J. Med. 812 (2000).
30. Gilbert S. Omenn, Enhancing the Role of the Scientific Expert Witness, 102 Envtl. Health Persp.
674 (1994).
The Admissibility of Expert
Testimony
MARGARET A. BERGER
Margaret A. Berger, J.D., was the Trustee Professor of Law, Brooklyn Law School, Brooklyn,
New York.
[Editor’s Note: While revising this chapter Professor Berger became ill and, tragically,
passed away. We have published her last revision, with a few edits to respond to
suggestions by reviewers.]
CONTENTS
I. Supreme Court Cases, 12
A. Daubert v. Merrell Dow Pharmaceuticals, Inc., 12
B. General Electric v. Joiner, 14
C. Kumho Tire Co. v. Carmichael, 16
D. Weisgram v. Marley, 18
II. Interpreting Daubert, 19
A. Atomization, 19
B. Conflating Admissibility with Sufficiency, 20
C. Credibility, 21
III. Applying Daubert, 22
A. Is the Expert Qualified? 22
B. Assessing the Scientific Foundation of Studies from Different
Disciplines, 23
C. How Should the Courts Assess Exposure? 25
IV. Forensic Science, 26
A. Validity, 27
B. Proficiency, 28
C. Malfunctioning Laboratories, 28
D. Interpretation, 29
E. Testimony, 29
F. Assistance for the Defense and Judges, 29
G. Confrontation Clause, 30
V. Procedural Context, 30
A. Class Certification Proceedings, 30
B. Discovery, 32
1. Amended discovery rules, 32
2. E-discovery, 34
C. Daubert Hearings, 35
VI. Conclusion, 36
I. Supreme Court Cases
In 1993, the Supreme Court’s opinion in Daubert v. Merrell Dow Pharmaceuticals1
ushered in a new era with regard to the admissibility of expert testimony. As
expert testimony has become increasingly essential in a wide variety of litigated
cases, the Daubert opinion has had an enormous impact. If plaintiffs’ expert proof is
excluded on a crucial issue, plaintiffs cannot win and usually cannot even get their
case to a jury. This discussion begins with a brief overview of the Supreme Court’s
three opinions on expert testimony—often called the Daubert trilogy2—and their
impact. It then examines a fourth Supreme Court case that relates to expert testimony, before turning to a variety of issues that judges are called upon to resolve,
particularly when the proffered expert testimony hinges on scientific knowledge.
A. Daubert v. Merrell Dow Pharmaceuticals, Inc.
In the seminal Daubert case, the Court granted certiorari to decide whether the
so-called Frye (or “general acceptance”) test,3 which some federal circuits (and
virtually all state courts) used in determining the admissibility of scientific evidence, had been superseded by the enactment of the Federal Rules of Evidence
in 1975. The Court held unanimously that the Frye test had not survived. Six
justices joined Justice Blackmun in setting forth a new test for admissibility after
concluding that “Rule 702 . . . clearly contemplates some degree of regulation of
the subjects and theories about which an expert may testify.”4 While the two other
members of the Court agreed with this conclusion about the role of Rule 702,
they thought that the task of enunciating a new rule for the admissibility of expert
proof should be left to another day.5
The majority opinion in Daubert sets forth a number of major themes that run
throughout the trilogy. First, it recognized the trial judge as the “gatekeeper” who
must screen proffered expert testimony.6 Second, the objective of the screening
is to ensure that expert testimony, in order to be admissible, must be “not only
relevant, but reliable.”7 Although there was nothing particularly novel about the
Supreme Court finding that a trial judge has the power to make an admissibility
determination—Federal Rules of Evidence 104(a) and 702 pointed to such a
conclusion—and federal trial judges had excluded expert testimony long before
1. 509 U.S. 579 (1993).
2. The other two cases are Gen. Elec. Co. v. Joiner, 522 U.S. 136 (1997) and Kumho Tire Co. v.
Carmichael, 526 U.S. 137 (1999). The disputed issue in all three cases was causation.
3. Frye v. United States, 293 F. 1013 (D.C. Cir. 1923).
4. Daubert, 509 U.S. at 589.
5. Id. at 601.
6. Id. at 589.
7. Id.
Daubert, the majority opinion in Daubert stated that the trial court has not only
the power but the obligation to act as gatekeeper.8
The Court then considered the meaning of its two-pronged test of relevancy
and reliability in the context of scientific evidence. With regard to relevancy, the
Court explained that expert testimony cannot assist the trier in resolving a factual
dispute, as required by Rule 702, unless the expert’s theory is tied sufficiently to
the facts of the case. “Rule 702’s ‘helpfulness’ standard requires a valid scientific
connection to the pertinent inquiry as a precondition to admissibility.”9 This
consideration, the Court remarked, “has been aptly described by Judge Becker
as one of ‘fit.’”10
To determine whether proffered scientific testimony or evidence satisfies
the standard of evidentiary reliability,11 a judge must ascertain whether it is
“ground[ed] in the methods and procedures of science.”12 The Court, emphasizing that “[t]he inquiry envisioned by Rule 702 is . . . a flexible one,”13 then
examined the characteristics of scientific methodology and set out a nonexclusive
list of four factors that bear on whether a theory or technique has been derived
by the scientific method.14 First and foremost, the Court viewed science as an
empirical endeavor: “[W]hether [a theory or technique] can be (and has been)
tested” is the “methodology [that] distinguishes science from other fields of human
inquiry.”15 The Court also mentioned as indicators of good science whether the
technique or theory has been subjected to peer review or publication, whether
the existence of known or potential error rates has been determined, and whether
standards exist for controlling the technique’s operation.16 In addition, although
general acceptance of the methodology within the scientific community is no
longer dispositive, it remains a factor to be considered.17
The Court did not apply its new test to the eight experts for the plaintiffs
who sought to testify on the basis of in vitro, animal, and epidemiological studies
8. Id.
9. Id. at 591–92.
10. Id. at 591. Judge Becker used this term in discussing the admissibility of expert testimony
about factors that make eyewitness testimony unreliable. See United States v. Downing, 753 F.2d
1224, 1242 (3d Cir. 1985) (on remand court rejected the expert testimony on ground of “fit” because
expert discussed factors such as the high likelihood of inaccurate cross-racial identifications that were
not present in the case) and United States v. Downing, 609 F. Supp. 784, 791–92 (E.D. Pa. 1985),
aff’d, 780 F.2d 1017 (3d Cir. 1985).
11. Commentators have faulted the Court for using the label “reliability” to refer to the concept
that scientists term “validity.” The Court’s choice of language was deliberate. It acknowledged that
scientists typically distinguish between validity and reliability and that “[i]n a case involving scientific
evidence, evidentiary reliability will be based upon scientific validity.” Daubert, 509 U.S. at 590 n.9.
12. Id. at 590.
13. Id. at 594.
14. Id. at 593–94. “[W]e do not presume to set out a definitive checklist or test.” Id. at 593.
15. Id.
16. Id. at 593–94.
17. Id. at 594.
that the drug Bendectin taken by the plaintiffs’ mothers during pregnancy could
cause or had caused the plaintiffs’ birth defects. Instead, it reversed and remanded
the case. Nor did the Court deal with any of the procedural issues raised by the
Daubert opinion, such as the burden, if any, on the party seeking a ruling excluding expert testimony, or the standard of review on appeal.
The Daubert opinion soon led to Daubert motions followed by Daubert hearings as parties moved in limine to have their opponents’ experts precluded from
testifying at trial for failure to satisfy the new requirements for expert testimony.
The motions raised numerous questions that the Court had not addressed, some
of which were dealt with in the next two opinions by the Supreme Court.
B. General Electric v. Joiner
In General Electric Co. v. Joiner,18 the second case in the trilogy, certiorari was
granted in order to determine the appropriate standard an appellate court should
apply in reviewing a trial court’s Daubert decision to admit or exclude scientific
expert testimony. In Joiner, the 37-year-old plaintiff, a longtime smoker with a
family history of lung cancer, claimed that exposure to polychlorinated biphenyls
(PCBs) and their derivatives had promoted the development of his small-cell lung
cancer. The trial court applied the Daubert criteria, excluded the opinions of the
plaintiff’s experts, and granted the defendants’ motion for summary judgment.19
The court of appeals reversed the decision, stating that “[b]ecause the Federal
Rules of Evidence governing expert testimony display a preference for admissibility, we apply a particularly stringent standard of review to the trial judge’s
exclusion of expert testimony.”20
All the justices joined Chief Justice Rehnquist in holding that abuse of discretion is the correct standard for an appellate court to apply in reviewing a district
court’s evidentiary ruling, regardless of whether the ruling allowed or excluded
expert testimony.21 The Court unequivocally rejected the suggestion that a more
stringent standard is permissible when the ruling, as in Joiner, is “outcome determinative” because it resulted in a grant of summary judgment for the defendant
once the plaintiff failed to produce evidence of causation.22 In a concurring
opinion, Justice Breyer urged judges to avail themselves of techniques, such as the
use of court-appointed experts, that would assist them in making determinations
about the admissibility of complex scientific or technical evidence.23
18. 522 U.S. 136 (1997).
19. Joiner v. Gen. Elec. Co., 864 F. Supp. 1310 (N.D. Ga. 1994).
20. Joiner v. Gen. Elec. Co., 78 F.3d 524, 529 (11th Cir. 1996).
21. Gen. Elec. Co. v. Joiner, 522 U.S. at 141–43.
22. Id. at 142–43.
23. Id. at 147–50. This issue is discussed in further detail in Justice Breyer’s introduction to
this manual.
With the exception of Justice Stevens, who dissented from this part of the
opinion, the justices then did what they had not done in Daubert—they examined
the record, found that the plaintiff’s experts had been properly excluded, and
reversed the court of appeals decision without a remand to the lower court. The
Court concluded that it was within the district court’s discretion to find that the
statements of the plaintiff’s experts with regard to causation were nothing more
than speculation. The Court noted that the plaintiff never explained “how and
why the experts could have extrapolated their opinions”24 from animal studies
far removed from the circumstances of the plaintiff’s exposure.25 It also observed
that the district court could find that the four epidemiological studies the plaintiff
relied on were insufficient as a basis for his experts’ opinions.26 Consequently, the
court of appeals had erred in reversing the district court’s determination that the
studies relied on by the plaintiff’s experts “were not sufficient, whether individually or in combination, to support their conclusions that Joiner’s exposure to PCBs
contributed to his cancer.”27
The plaintiff in Joiner had argued that the epidemiological studies showed a
link between PCBs and cancer if the results of all the studies were pooled, and
that this weight-of-the-evidence methodology was reliable. Therefore, according
to the plaintiff, the district court erred when it excluded a conclusion based on a
scientifically reliable methodology because it thereby violated the Court’s precept
in Daubert that the “focus, of course, must be solely on principles and methodology, not on the conclusions that they generate.”28 The Supreme Court responded
to this argument by stating that
conclusions and methodology are not entirely distinct from one another. Trained
experts commonly extrapolate from existing data. But nothing in either Daubert
or the Federal Rules of Evidence requires a district court to admit opinion evidence which is connected to existing data only by the ipse dixit of the expert. A
court may conclude that there is simply too great an analytical gap between the
data and the opinion proffered.29
24. Id. at 144.
25. The studies involved infant mice that had massive doses of PCBs injected directly into their
bodies; Joiner was an adult who was exposed to fluids containing far lower concentrations of PCBs.
The infant mice developed a different type of cancer than Joiner did, and no animal studies showed that
adult mice exposed to PCBs developed cancer or that PCBs lead to cancer in other animal species. Id.
26. The authors of the first study of workers at an Italian plant found lung cancer rates among
ex-employees somewhat higher than might have been expected but refused to conclude that PCBs
had caused the excess rate. A second study of workers at a PCB production plant did not find the
somewhat higher incidence of lung cancer deaths to be statistically significant. The third study made
no mention of exposure to PCBs, and the workers in the fourth study who had a significant increase
in lung cancer rates also had been exposed to numerous other potential carcinogens. Id. at 145–46.
27. Id. at 146–47.
28. Id. at 146 (quoting Daubert, 509 U.S. at 595).
29. Id. at 146.
Justice Stevens, in his partial dissent, assumed that the plaintiff’s expert was
entitled to rely on such a methodology, which he noted is often used in risk assessment, and that a district court that admits expert testimony based on a weight-of-the-evidence methodology does not abuse its discretion.30 Justice Stevens would
have remanded the case for the court below to determine if the trial court had
abused its discretion when it excluded the plaintiff’s experts.31
C. Kumho Tire Co. v. Carmichael
Less than one year after deciding Joiner, the Supreme Court granted certiorari in
Kumho to decide if the trial judge’s gatekeeping obligation under Daubert applies
only to scientific evidence or if it extends to proffers of “technical, or other specialized knowledge,” the other categories of expertise recognized in Federal Rule of
Evidence 702. In addition, there was uncertainty about whether disciplines such as
economics, psychology, and other “soft” sciences were governed by this standard;
about when the four factors endorsed in Daubert as indicators of reliability had to
be applied; and how experience factors into the gatekeeping process. Although
Rule 702 specifies that an expert may be qualified through experience, the Court’s
emphasis in Daubert on “testability” suggested that an expert should not be allowed
to base a conclusion solely on experience if the conclusion can easily be tested.
In Kumho, the plaintiffs brought suit after a tire blew out on a minivan, causing an accident in which one passenger died and others were seriously injured.
The tire, which was manufactured in 1988, had been installed on the minivan
sometime before it was purchased as a used car by the plaintiffs in 1993. In their
diversity action against the tire’s maker and its distributor, the plaintiffs claimed
that the tire was defective. To support this allegation, the plaintiffs relied primarily
on deposition testimony by an expert in tire-failure analysis, who concluded on
the basis of a visual inspection of the tire that the blowout was caused by a defect
in the tire’s manufacture or design.
When the defendants moved to exclude the plaintiffs’ expert, the district
court agreed with the defendants that the Daubert gatekeeping obligation applied
not only to scientific knowledge but also to “technical analyses.”32 The district
court excluded the plaintiffs’ expert and granted summary judgment. Although
the court conceded on a rehearing that it had erred in treating the four factors discussed in Daubert as mandatory, it adhered to its original determination because the
court simply found the Daubert factors appropriate, analyzed them, and discerned
no competing criteria sufficiently strong to outweigh them.33
30. Id. at 153–54.
31. Id. at 150–51.
32. Carmichael v. Samyang Tire, Inc., 923 F. Supp. 1514, 1522 (S.D. Ala. 1996), rev’d, 131
F.3d 1433 (11th Cir. 1997), rev’d sub nom. Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999).
33. Id. at 1522, 1524.
The Eleventh Circuit reversed the district court’s decision in Kumho, holding,
as a matter of law under a de novo standard of review, that Daubert applies only
to scientific opinions.34 The court of appeals drew a distinction between expert
testimony that relies on the application of scientific theories or principles—which
would be subject to a Daubert analysis—and testimony that is based on the expert’s
“skill- or experience-based observation.”35 The court then found that the testimony proffered by plaintiff was “non-scientific” and that “the district court erred
as a matter of law by applying Daubert in this case.”36 The circuit court agreed that
the trial court has a gatekeeping obligation; its quarrel with the district court was
with that court’s assumption that Daubert’s four factors had to be applied.
All of the justices of the Supreme Court, in an opinion by Justice Breyer, held
that the trial court’s gatekeeping obligation extends to all expert testimony,37 and
unanimously rejected the Eleventh Circuit’s dichotomy between the expert who
“relies on the application of scientific principles” and the expert who relies on
“skill- or experience-based observation.”38 The Court noted that Federal Rule of
Evidence 702 “makes no relevant distinction between ‘scientific’ knowledge and
‘technical’ or ‘other specialized’ knowledge,” and “applies its reliability standard
to all . . . matters within its scope.”39 Furthermore, said the Court, “no clear line”
can be drawn between the different kinds of knowledge, and “no one denies that
an expert might draw a conclusion from a set of observations based on extensive
and specialized experience.”40
The Court also unanimously found that the court of appeals had erred when
it used a de novo standard, instead of the Joiner abuse-of-discretion standard, to
determine that Daubert’s criteria were not reasonable measures of the reliability
of the expert’s testimony.41 As in Joiner, and again over the dissent of Justice
Stevens,42 the Court then examined the record and concluded that the trial court
had not abused its discretion when it excluded the testimony of the witness.
Accordingly, it reversed the opinion of the Eleventh Circuit.
The opinion adopts a flexible approach that stresses the importance of identifying “the particular circumstances of the particular case at issue.”43 The court
must then make sure that the proffered expert will observe the same standard of
“intellectual rigor” in testifying as he or she would employ when dealing with
similar matters outside the courtroom.44
34. Carmichael v. Samyang Tire, Inc., 131 F.3d 1433, 1435 (11th Cir. 1997).
35. Id.
36. Id. at 1436 (footnotes omitted).
37. Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999).
38. Id. at 151.
39. Id. at 148.
40. Id. at 156.
41. Id. at 152.
42. Id. at 158.
43. Id. at 150.
44. Id. at 152.
How this extremely flexible approach of the Court is to be applied emerges in
Part III of the opinion when the Court engages in a remarkably detailed analysis
of the record that illustrates its comment in Joiner that an expert must account for
“how and why” he or she reached the challenged opinion.45
The Court illustrated the application of this standard to the facts of the case
and its deference to the district court findings as follows:
After examining the transcript in some detail, and after considering respondents’
defense of Carlson’s methodology, the District Court determined that Carlson’s
testimony was not reliable. It fell outside the range where experts might reasonably differ, and where the jury must decide among the conflicting views of
different experts, even though the evidence is shaky. In our view, the doubts
that triggered the District Court’s initial inquiry here were reasonable, as was the
court’s ultimate conclusion.46
Although Kumho is the most recent pronouncement by the Supreme Court
on how to determine whether proffered testimony by an expert is admissible,
and Rule 702 of the Federal Rules of Evidence was amended in 2000 to provide
“some general standards that the trial court must use to assess the reliability and
helpfulness of proffered expert testimony,” it is still Daubert that trial courts cite
and rely on most frequently when ruling on a motion to preclude expert testimony.47 Even though Daubert interprets a federal rule of evidence, and rules of
evidence are designed to operate at trial, Daubert’s greatest impact has been pretrial: If plaintiff’s experts can be excluded from testifying about an issue crucial to
plaintiff’s case, the litigation may end with summary judgment for the defendant.
Furthermore, although summary judgment grants are reviewed de novo by an
appellate court, there is nothing to review if plaintiff failed to submit admissible
evidence on a material issue. Consequently, only the less stringent abuse-of-discretion standard will apply, and there will be less chance for a reversal on appeal.
D. Weisgram v. Marley
Plaintiff is entitled to only one chance to select an expert who can withstand a
Daubert motion. In a fourth Supreme Court case, Weisgram v. Marley,48 the district
court ruled for plaintiffs on a Daubert motion and the plaintiffs won a jury verdict.
On appeal, the circuit court found that, despite the abuse-of-discretion standard,
plaintiff’s experts should have been excluded and granted judgment as a matter
of law for the defendants. Plaintiffs argued that they now had the right to a new
trial at which they could introduce more expert testimony. The Supreme Court
45. Gen. Elec. Co v. Joiner, 522 U.S. 136, 144 (1997).
46. Kumho Tire Co. v. Carmichael, 526 U.S. at 153.
47. A search of federal cases on Westlaw after Kumho was decided indicates that the Daubert
decision has been cited more than twice as often as the Kumho decision.
48. 528 U.S. 440 (2000).
granted certiorari limited to the new trial issue (it did not review the Daubert
determination) but refused to grant a new trial. Justice Ginsberg explained:
Since Daubert, moreover, parties relying on expert testimony have had notice of
the exacting standards of reliability such evidence must meet. . . . It is implausible to suggest, post-Daubert, that parties will initially present less than their best
expert evidence in the expectation of a second chance should their first trial fail.49
Weisgram causes tactical problems for plaintiffs about how much to spend
for expert testimony. Should they pay for additional expensive expert testimony
even though they think the district court would rule in their favor on a Daubert
motion, or is the risk of a reversal on Daubert grounds and a consequent judgment
for the defendant too great despite the abuse-of-discretion standard? Weisgram
may indeed push plaintiffs to bring the very best expertise into litigation—a
stated goal of the trilogy—but it may also make it difficult to litigate legitimate
claims because of the cost of expert testimony. Is access to the federal courts less
important than regulating the admissibility of expert testimony? Even if plaintiffs
successfully withstand a Daubert motion, that does not guarantee they will win
were the case to be tried. But very few cases now go to trial, and an inability by
the defendant to exclude plaintiffs’ experts undoubtedly affects the willingness
of the defendant to negotiate a settlement.
II. Interpreting Daubert
Although almost 20 years have passed since Daubert was decided, a number of
basic interpretive issues remain.
A. Atomization
When there is a Daubert challenge to an expert, should the court look at all the
studies on which the expert relies for their collective effect or should the court
examine the reliability of each study independently? The issue arises with proof of
causation in toxic tort cases when plaintiff’s expert relies on studies from different
scientific disciplines, or studies within a discipline that present different strengths
and weaknesses, in concluding that defendant’s product caused plaintiff’s adverse
health effects. Courts rarely discuss this issue explicitly, but some appear to look
at each study separately and give no consideration to those studies that cannot
alone prove causation.
Although some use the language in Joiner as the basis for this slicing-and-dicing approach,50 scientific inference typically requires consideration of numerous
49. 528 U.S. at 445 (internal citations omitted).
50. See discussion, supra notes 28–31 and related text.
findings, which, when considered alone, may not individually prove the contention.51 It appears that many of the most well-respected and prestigious scientific
bodies (such as the International Agency for Research on Cancer (IARC), the
Institute of Medicine, the National Research Council, and the National Institute
for Environmental Health Sciences) consider all the relevant available scientific
evidence, taken as a whole, to determine which conclusion or hypothesis regarding a causal claim is best supported by the body of evidence. In applying the scientific method, scientists do not review each scientific study individually for whether
by itself it reliably supports the causal claim being advocated or opposed. Rather,
as the Institute of Medicine and National Research Council noted, “summing,
or synthesizing, data addressing different linkages [between kinds of data] forms a
more complete causal evidence model and can provide the biological plausibility
needed to establish the association” being advocated or opposed.52 The IARC has
concluded that “[t]he final overall evaluation is a matter of scientific judgment
reflecting the weight of the evidence derived from studies in humans, studies in
experimental animals, and mechanistic and other relevant data.”53
B. Conflating Admissibility with Sufficiency
In Daubert, Justice Blackmun’s opinion explicitly acknowledges that in some cases
admissible evidence may not suffice to support a verdict in favor of plaintiffs. In
other words, it seems to recognize that the admissibility determination comes first
and is separate from the sufficiency determination. But in Joiner the Court pays
little attention to this distinction and suggests that plaintiff’s expert testimony may
be excluded if the evidence on which he seeks to rely is itself deemed insufficient.
But what difference does it make if sufficiency is conflated with admissibility?54
After all, the case’s final outcome will be the same. As Daubert recognizes, the trial
judge’s authority to decide whether the plaintiff has produced sufficient evidence
to withstand a dispositive motion under Rule 56 or 50 is indisputable; a one-step
process that considers sufficiency when adjudicating a Daubert motion is arguably
51. See e.g., Susan Haack, An Epistemologist in the Bramble-Bush: At the Supreme Court with
Mr. Joiner, 26 J. Health Pol. Pol’y & L. 217–37 (1999) (discussing the individual studies that lead to
the compelling inference of a double-helical structure of a DNA molecule, which, when considered
separately, fail to compel that inference). See also Milward v. Acuity Specialty Products Group, Inc.,
639 F.3d 11, 26 (1st Cir. 2011) (reversing the district court’s exclusion
of expert testimony based on an assessment of the direct causal effect of the individual studies, finding
that the “weight of the evidence” properly supported the expert’s opinion that exposure to benzene
can cause acute promyelocytic leukemia).
52. Institute of Medicine and National Research Council, Dietary Supplements: A Framework
for Evaluating Safety 262 (2005).
53. Vincent J. Cogliano et al., The Science and Practice of Carcinogen Identification and Evaluation,
112 Envtl. Health Persp. 1272 (2004).
54. The distinction between admissibility and sufficiency is also discussed in Michael D. Green
et al., Reference Guide on Epidemiology, Section VII, in this manual.
more efficient than a two-step process that requires the district judge to analyze
admissibility before it can turn to sufficiency.
There are, however, consequences to conflating admissibility and sufficiency.
The de novo standard of review that ordinarily applies to judgments as a matter of
law following a determination of insufficient evidence is converted into the lower
abuse-of-discretion standard that governs evidentiary rulings on admissibility, and
thereby undermines the jury trial mandate of the Seventh Amendment. Science
proceeds by cumulating and synthesizing evidence until there is enough for a new
paradigm. That does not mean that every study meets the most rigorous scientific
standards. Judgment is required in determining which inferences are appropriate,
but an approach that encourages looking at studies sequentially rather than holistically has costs that must be considered.
C. Credibility
Daubert and the expense of litigation make it difficult for courts to hew to the line
that assigns credibility issues to the jury rather than the court. One troublesome
area is conflicts of interest. To what extent should a court permit the plaintiff to
inquire into the defense expert’s relationship with the defendant? If the expert
testified at trial, information that could have skewed the expert’s testimony could
be brought to the attention of the jury through cross-examination or extrinsic
evidence. Impeachment by bias suffers from fewer constraints than other forms
of impeachment.55 But suppose the defendant seeks through a Daubert challenge
to exclude the plaintiff’s expert witness as relying on unreliable evidence to show
causation in a toxic tort action. The defendant supports its argument with testimony by an academic from a highly respected institution whose research shows
that the defendant’s product is safe. Should the court permit the plaintiff to inquire
whether the expert was on the payroll of the defendant corporation, or attended
conferences paid for by the defendant, or received gifts from the defendant? What
about corporate employees ghostwriting reports about their products that are then
submitted in someone else’s name? Other ties that an expert may have to industry
have also been reported: royalties, stock ownership, working in an institution that
receives considerable funding from the defendant. These are all practices that have
been reported in the media and are practices that the plaintiff would like to question the expert about under oath.56 A court is unlikely to allow a wide-ranging
55. See United States v. Abel, 469 U.S. 45, 50 (1984) (explaining that “proof of bias is almost
always relevant because the jury, as finder of fact and weigher of credibility, has historically been
entitled to assess all evidence which might bear on the accuracy and truth of a witness’ testimony”).
56. See, e.g., In re Welding Fume Products, 534 F. Supp. 2d 761, 764 (N.D. Ohio 2008)
(requiring all parties to the litigation to “disclose the fact of, and the amounts of, payments they made,
either directly or indirectly, to any entity (whether an individual or organization) that has authored
or published any study, article, treatise, or other text upon which any expert in this MDL litigation
relies, or has relied”).
fishing expedition if the plaintiff has no proof that the defense expert engaged in
such behavior. But even if the plaintiff has extrinsic evidence available that points
to conflicts of interest on the part of the expert, how should a court assess this
information in ruling on the admissibility of plaintiff’s experts? Is this a credibility
determination? Should allegations about conflicts be resolved by the judge at an
in limine hearing, or should the plaintiff’s expert be permitted to testify so that
this issue can be explored at trial?
Another troublesome issue about credibility arises when an expert seeks to
base an opinion on controverted evidence in the case. May the court exclude the
expert’s opinion on a Daubert motion if it finds that the expert’s model did not
incorporate the appropriate data that fit the facts of the case, or is this an issue
for the jury?57
Does the court avoid a credibility determination if it finds that the expert is
qualified but the court disagrees with the theory on which the expert is relying?
In Kochert v. Greater Lafayette Health Serv. Inc.,58 a complex antitrust case, the court
held that the trial court, after ruling for the plaintiff on Daubert challenges, properly
excluded the plaintiff’s economic experts on the ground that the plaintiff’s antitrust
theory was based on the wrong legal standard.
III. Applying Daubert
Application of Daubert raises a number of persistent issues, many of which relate
to proof of causation. The three cases in the trilogy and Weisgram all turned on
questions of causation, and the plaintiffs in each of the cases ultimately lost because
they failed to introduce admissible expert testimony on this issue.
Causation questions have been particularly troubling in cases in which plaintiffs allege that the adverse health effects for which they seek damages are a result
of exposure to the defendant’s product.
A. Is the Expert Qualified?
As a threshold matter, the witness must be qualified as an expert to present
expert opinion testimony. An expert needs more than proper credentials, whether
grounded in “skill, experience, training or education” as set forth in Rule 702 of
the Federal Rules of Evidence. A proposed expert must also have “knowledge.”
57. Compare Consol. Insured Benefits, Inc. v. Conseco Med. Ins. Co., No. 03-cv-3211, 2006
WL 3423891 (D.S.C. 2006) (fraud case; Daubert motion to exclude plaintiff’s expert economist’s testimony on damages; court finds that the testimony raises a question of weight, not admissibility) with Concord Boat
Corp. v. Brunswick Corp., 207 F.3d 1039, 1055-56 (8th Cir. 2000) (excluding expert’s testimony as
“mere speculation” that ignored inconvenient evidence).
58. 463 F.3d 710 (7th Cir. 2006).
For example, an expert who seeks to testify about the findings of epidemiological
studies must be knowledgeable about the results of the studies and must take into
account those studies that reach conclusions contrary to the position the expert
seeks to advocate.
B. Assessing the Scientific Foundation of Studies from Different
Disciplines
Expert opinion is typically based on multiple studies, and those studies may come
from different scientific disciplines. Some courts have explicitly stated that certain
types of evidence proffered to prove causation have no probative value and therefore cannot be reliable.59 Opinions based on animal studies have been rejected
because of reservations about extrapolating from animals to humans or because the
plaintiff’s extrapolated dose was lower than the animals’—which is invariably the
case because one would have to study unmanageably large numbers of animals
to see results if animals were not given high doses. The field of toxicology, which,
unlike epidemiology, is an experimental science, is rapidly evolving, and prior case
law regarding such studies may not take into account important new developments.
But even when there are epidemiological studies, a court may conclude that
they cannot prove causation because they are not conclusive and therefore unreliable. And if they are unreliable, they cannot be combined with other evidence.60
Experts will often rely on multiple studies, each of which has some probative
value but, when considered separately, cannot prove general causation.
As noted above, trial judges have great discretion under Daubert and a court
is free to choose an atomistic approach that evaluates the available studies one by
one. Some judges have found this practice contrary to that of scientists who look
at knowledge incrementally.61 But there are no hard-and-fast scientific rules for
synthesizing evidence, and most research can be critiqued on a variety of grounds.
59. See, e.g., In re Rezulin, 2004 WL 2884327, at *3 (S.D.N.Y. 2004); Cloud v. Pfizer Inc.,
198 F. Supp. 2d 1118, 1133 (D. Ariz. 2001) (stating that case reports were merely compilations of
occurrences and have been rejected as reliable scientific evidence supporting an expert opinion that
Daubert requires); Haggerty v. Upjohn Co., 950 F. Supp. 1160, 1164 (S.D. Fla. 1996), aff’d, 158 F.3d
588 (11th Cir. 1998) (“scientifically valid cause and effect determinations depend on controlled clinical
trials and epidemiological studies”); Wade-Greaux v. Whitehall Labs., Inc., 874 F. Supp. 1441, 1454
(D.V.I. 1994), aff’d, 46 F.3d 1120 (3d Cir. 1994) (stating there is a need for consistent epidemiological
studies showing statistically significant increased risks).
60. See Hollander v. Sandoz Pharm. Corp., 289 F.3d 1193, 1216 n.21 (10th Cir. 2002) (“To suggest that those individual categories of evidence deemed unreliable by the district court may be added to
form a reliable theory would be to abandon ‘the level of intellectual rigor of the expert in the field.’”).
61. See, e.g., In re Ephedra, 393 F. Supp. 2d 181, 190 (S.D.N.Y. 2005) (allowing scientific expert
testimony regarding “a confluence of suggestive, though non-definitive, scientific studies [that] make[s]
it more-probable-than-not that a particular substance . . . contributed to a particular result. . . .”; after
a two-week Daubert hearing in a case in which there would never be epidemiological evidence, the
court concluded that some of plaintiffs’ experts could testify on the basis of animal studies, analogous human studies, plausible theories of the mechanisms involved, etc.); Milward v. Acuity Specialty Prods. Group, Inc., 639 F.3d 11 (1st Cir. 2011).
Copyright © National Academy of Sciences. All rights reserved.
Reference Manual on Scientific Evidence: Third Edition
Reference Manual on Scientific Evidence
Few studies are flawless. Epidemiology is vulnerable to attack because of problems
with confounders and bias. Furthermore, epidemiological studies are grounded
in statistical models. What role should statistical significance play in assessing the
value of a study? Epidemiological studies that are not conclusive but show some
increased risk do not prove a lack of causation. Some courts find that they therefore have some probative value,62 at least in proving general causation.63
Even if, however, plaintiffs convince the trial judge that their experts relied
on reliable and relevant evidence in establishing general causation, that is, in opining that the defendant’s product can cause the adverse effects for which plaintiffs
seek compensation, plaintiffs must also present admissible expert testimony that
the defendant’s product caused their specific injuries. For example, in the Zyprexa
litigation,64 the court found that plaintiffs’ expert’s conclusion that Zyprexa may
cause excessive weight gain leading to diabetes was well supported, but the expert’s
assertion that Zyprexa had a direct adverse effect on cells essential to the production of insulin by the body in cases in which there was no documented weight
gain lacked scientific support. The record demonstrated that the expert’s opinions
relied on a subjective methodology, a fast-and-loose application of his scientific
theories to the facts, and conclusion-driven assessments on the issues of causation
in the cases on which he proposed to testify. He was not allowed to testify because
his opinions were neither “based upon sufficient facts or data,” nor were they “the
product of reliable principles and methods,” and he had not “applied the principles
and methods reliably to the facts of the case.”65
Courts handling Daubert motions sometimes sound as though only one possible answer is legitimate. If scientists seeking to testify for opposing sides disagree,
some courts conclude that one side must be wrong.66 The possibility that both
sides are offering valid scientific inferences is rarely recognized, even though this
happens often in the world of science.
As noted above, district courts have great discretion in deciding how to proceed when faced with evidence from different scientific disciplines presenting
different degrees of scientific rigor. In assessing the proffered testimony of the
62. See Cook v. Rockwell Int’l Corp., 580 F. Supp. 2d 1071 (D. Colo. 2006) (discussing why
the court declined to exclude the expert’s testimony, even though his epidemiological study did not produce statistically significant results).
63. In re Viagra Prods., 572 F. Supp. 2d 1071 (D. Minn. 2008) (extensive review of all expert
evidence proffered in multidistricted product liability case).
64. See In re Zyprexa Prods., 2009 WL 1357236 (E.D.N.Y. May 12, 2009) (providing citations
to opinions dealing with Daubert rulings and summary judgment motions in the Zyprexa litigation).
65. See Fed. R. Evid. 702; cf. Gen. Elec. Co. v. Joiner, 522 U.S. 136, 146 (1997) (opinion that “is connected to existing data only by the ipse dixit of the expert” need not be admitted).
66. See Soldo v. Sandoz Pharm. Corp., 2003 WL 22005007 (W.D. Pa. 2003) (stating that court
appointed three experts to assist it pursuant to Fed. R. Evid. 706 and then rejected opinion expressing minority view).
expert in light of the studies on which the testimony is based, courts may choose
to limit the opinion that the expert would be allowed to express if the case went
to trial.67 Given the expense of trials, the paucity of trials, and the uncertainty
about how jurors would evaluate such testimony, limiting an expert’s opinion
may lead to settlements.68
The abuse-of-discretion standard may lead to inconsistent results in how
courts handle proof of causation. There can be inconsistencies even within circuits when district judges disagree on whether plaintiffs’ experts have met their
burden of proof.69
C. How Should the Courts Assess Exposure?
Another difficulty in proving causation in toxic tort cases is that plaintiff must
establish that he or she was exposed to defendant’s product. Obviously this is not a
problem with prescription drugs, but in other types of cases, such as environmental torts, establishing exposure and the extent of the exposure can be difficult.70
Although exact data on exposure are not required, an expert should be able to provide reasonable explanations for his or her conclusions about
the amount of exposure and about why it sufficed to cause plaintiffs’ injuries.71
67. See, e.g., In re Ephedra, 393 F. Supp. 2d 181 (S.D.N.Y. 2005) (stating that qualified experts
may testify to a reliable basis for believing that ephedra may contribute to cardiac injury and strokes in
persons with high blood pressure, certain serious heart conditions, or a genetic sensitivity to ephedra;
experts would have to acknowledge that none of this has been the subject of definitive studies and
may yet be disproved).
68. But cf. Giles v. Wyeth, 556 F.3d 596 (7th Cir. 2009) (plaintiff won Daubert challenge but
lost at trial).
69. Compare Bonner v. ISP Techs., Inc., 259 F.3d 924 (8th Cir. 2001) (affirming jury verdict
that exposure to solvent caused plaintiff’s psychological and cognitive impairment and Parkinsonian
symptoms; defendant argued that expert’s opinion based on case reports, animal studies, structural
analysis studies should have been excluded on Daubert grounds; the court stated: “The first several
victims of a new toxic tort should not be barred from having their day in court simply because the
medical literature, which will eventually show the connection between the victims’ condition and the
toxic substance, has not yet been completed.”) with Glastetter v. Novartis Pharm. Corp., 107 F. Supp.
2d 1015 (E.D. Mo. 2000), aff’d per curiam, 252 F.3d 986 (8th Cir. 2001) (plaintiff claimed that drug she
had taken for lactation suppression had caused her stroke; trial court held that Daubert precluded experts
from finding causation on the basis of case reports, animal studies, human dechallenge/rechallenge
data, internal documents from defendant, and Food and Drug Administration’s revocation of drug for
lactation suppression; appellate court stated: “We do not discount the possibility that stronger evidence
of causation exists, or that, in the future, physicians will demonstrate to a degree of medical certainty
that Parlodel can cause ICHs. Such evidence has not been presented in this case, however, and we
have no basis for concluding that the district court abused its discretion in excluding Glastetter’s expert
evidence.” Id. at 992.)
70. Issues involving assessment of exposure are discussed in Joseph V. Rodricks, Reference
Guide on Exposure Science, in this manual.
71. Anderson v. Hess Corp., 592 F. Supp. 2d 1174, 1178 (D.N.D. 2009) (“[A] plaintiff [in a toxic tort case] is not required to produce a mathematically precise table equating levels of exposure with levels of harm—plaintiff must only produce evidence from which a reasonable person could conclude that the defendant’s emissions probably caused the plaintiff’s harms.”).
Suppose, for example, that plaintiff alleges that her unborn child suffered
injuries when her room was sprayed with an insecticide. Plaintiff’s expert is prepared to testify that she relied on another expert’s opinion that the insecticide can
cause harm of the sort suffered by the child and that academic studies have found
injuries when less than the amount sprayed in this case was used. But the expert
who offered this opinion reached this conclusion without considering the size of
the house, or the area treated, or how it was applied, or the amount applied to
the outside of the house. And no one had measured this substance in the mother.
Consequently, the court found that plaintiff had not provided adequate proof of
exposure.72
A recent case that illustrates the complex problems that arise with exposure
issues is Henricksen v. ConocoPhillips Co.73 In Henricksen, the plaintiff, who drove a
gasoline tanker truck for 30 years, alleged that his acute myelogenous leukemia
(AML) was caused by his occupational exposure to benzene, a component of gasoline. Although some studies show that AML, or at least some forms of AML, may
be caused by exposure to benzene, the same is not true with regard to gasoline.
The court rejected testimony by plaintiff’s experts that sought to link the exposure
to the benzene in the gasoline to plaintiff’s claim. There were numerous problems:
Did plaintiff manifest symptoms typical of AML that was chemically induced and
not idiopathic? How could one calculate how much benzene plaintiff would have
been exposed to considering how many hours he worked and how the gasoline
was delivered? How much benzene exposure is required to support the conclusion that general causation has been established? Each of these issues is discussed
in considerable detail, suggesting that the studies that would logically be needed
to conclude that the alleged exposure can be linked to causation may simply not
have been done. Because the plaintiff bears the burden of proof, this means that
plaintiff’s experts often will be excluded.
IV. Forensic Science
To date, Daubert has rarely been raised in the forensic context, but this may be
about to change.74 We do not know as yet what shifts may occur in response to
the National Academies’ highly critical report on the forensic sciences.75 We do
know that the report played a role in the Supreme Court’s opinion in Melendez-
72. Junk v. Terminix Int’l. Co., 594 F. Supp. 2d 1062 (S.D. Iowa 2008).
73. 605 F. Supp. 2d 1142 (E.D. Wash. 2009).
74. These issues are discussed at greater length in Paul C. Giannelli et al., Reference Guide on
Forensic Identification Expertise, in this manual.
75. National Research Council, Strengthening Forensic Science in the United States: A Path
Forward (2009).
Diaz v. Massachusetts76 concerning the application of the Confrontation Clause to
expert forensic testimony. But it will take some time to understand the repercussions of this opinion for the criminal justice system.
Even aside from this constitutional development and in the absence of congressional or other institutional action, the extensive coverage of the National
Academies’ report by the media and academia may bring about change. Furthermore, analysts of the more than 200 DNA exonerations to date claim that in more
than 50% of the cases, invalid, improperly conducted, or misleadingly interpreted forensic science contributed to the wrongful convictions.77 The seriousness
of these mistakes is aggravated because some of the inmates were on death row.
These developments may affect judicial approaches to opinions offered by prosecution experts. Also, as judges write more sharply focused opinions in civil cases,
the very different approach they use in criminal cases stands out in vivid contrast.
Supposedly, the federal rules are trans-substantive, and it is certainly arguable that
errors that bear on life and liberty should weigh more heavily than errors in civil
cases concerned primarily with money.
To date, however, few prosecution experts have been excluded as witnesses
in criminal prosecutions.78 Usually judges have allowed them to testify or, at most,
have curtailed some of the conclusions that prosecution experts sought to offer.79
However, there are a number of issues in forensic sciences that may become the
object of Daubert challenges.
A. Validity
As the discussion in Chapter 5 of the National Academies’ report recounts, forensic fields vary considerably with regard to the quantity and quality of research done
to substantiate that a given technique is capable of making reliable individualized
76. 129 S. Ct. 2527, 2536 (2009).
77. The Innocence Project, available at www.innocenceproject.org.
78. See Maryland v. Rose, Case No. K06-0545 at 31 (Balt. County Cir. Ct. Oct. 19, 2007)
(excluding fingerprint evidence in a death penalty case as a “subjective, untested, unverifiable identification procedure that purports to be infallible”).
79. See, e.g., United States v. Green, 405 F. Supp. 2d 104 (D. Mass. 2005) (explaining that an
expert would be permitted to describe similarities between shell casings but prohibited from testifying
to match; Judge Gertner acknowledged that toolmark identification testimony should be excluded
under Daubert, but that every single court post-Daubert admitted the testimony); United States v.
Glynn, 578 F. Supp. 2d 567 (S.D.N.Y. 2008) (explaining that testimony linking bullet and casings
to the defendant was inadmissible under Daubert, but testimony that the evidence was “more likely
than not” from the firearm was admissible under Federal Rule of Evidence 401); United States v.
Rutherford, 104 F. Supp. 2d 1190, 1193 (D. Neb. 2000) (handwriting experts permitted to testify to
similarities between sample from defendant and document in question but not permitted to conclude
that defendant was the author). See United States v. Hines, 55 F. Supp. 2d 530 (D. Md. 2002).
identifications. Non-DNA forensic techniques often turn on subjective analyses.80
But making Daubert objections in these fields requires defense counsel to understand in detail how the particular technique works, as well as to be knowledgeable
about the scientific method and statistical issues.81
B. Proficiency
Non-DNA forensic techniques often rely on subjective judgments, and the proficiency of the expert to make such judgments may become the focus of a Daubert
challenge. In theory, proficiency tests could determine whether well-trained
experts in those fields can reach results with low error rates. In practice, however,
there are numerous obstacles to such tests. Sophisticated proficiency tests are difficult and expensive to design. If the tests are too easy, the results will not assess the
ability of examiners to draw correct conclusions when forensic evidence presents
a difficult challenge in identifying a specific individual or source.82 Furthermore,
in many jurisdictions, forensic examiners are not independent of law enforcement
agencies and/or prosecutors’ offices and can often obtain information about a
proficiency testing program through those sources.
C. Malfunctioning Laboratories
Numerous problems have been identified in crime laboratories ranging from uncertified laboratory professionals and unaccredited laboratories performing incompetent work to acts of deliberate fraud, such as providing falsified results from
tests that were never done.83 Although outright fraud may be rare, unintended
inaccurate results that stem from inadequate supervision, training, and record
keeping, failure to prevent contamination, and failure to follow proper statistical
procedures can have devastating effects. Evidence that a laboratory has engaged in
such practices should certainly lead to Daubert challenges for lack of reliability, but
this requires that such investigations be undertaken and the defense have access to
the results. Whether courts can be persuaded to almost automatically reject laboratory results in the absence of proper accreditation of laboratories and certification
80. See National Research Council, supra note 75, at 133.
81. Specific forensic science techniques are discussed in Paul C. Giannelli et al., Reference
Guide on Forensic Identification Expertise, Sections V–X, in this manual.
82. United States v. Llera Plaza, 188 F. Supp. 2d 549 (E.D. Pa. 2002) (court acknowledged
that defense raised real questions about the adequacy of proficiency tests taken by FBI fingerprint
examiners but concluded that fingerprint testimony satisfied Daubert in part because no examples were
shown of erroneous identifications by FBI examiners). An erroneous FBI identification was made in
the Brandon Mayfield case discussed in the introduction to Strengthening Forensic Science in the United
States, supra note 75, at 45–46.
83. See National Research Council, supra note 75, at 183–215 and Paul C. Giannelli et al.,
Reference Guide on Forensic Identification Expertise, Section IV, in this manual.
of forensic practitioners remains to be seen. Laboratory techniques, such as drug
analyses, that do not suffer from the same uncertainties regarding validity as the
forensic identification techniques can, of course, also produce erroneous results if
the laboratory fails to follow proper procedures.
D. Interpretation
Forensic techniques that rest on subjective judgments are susceptible to cognitive
biases.84 We have seen instances of contextual bias, but as yet there has been little
research on contextual or other types of cognitive bias. We do not yet know
whether courts will consider this type of evidence when expertise is challenged.
E. Testimony
Defense counsel may of course object to testimony that a prosecution expert seeks
to give. When the prosecution relies on a subjective identification technique,
lawyers for the defense should attempt to clarify what “match” means if the expert
uses this terminology and to explain to the jury that studies to date do not permit conclusions about individualization. To do this, the defense may have to call
its own experts and ask for jury instructions. Defense counsel must also remain
alert and object to prosecution testimony in which the witness claims to know
probabilities—that have not been established in a particular field—on the basis of
extensive personal experience. Objections also should be raised to testimony about
zero error rates. The defense must also remember that the Daubert opinion itself
recognized that testimony can be excluded under Federal Rule of Evidence 403
if its prejudicial effect substantially outweighs its probative value.
F. Assistance for the Defense and Judges
Perhaps the most troubling aspect of trying to apply Daubert to forensic evidence is
that very few defense counsel are equipped to take on this challenge. Such counsel
lack the training and resources to educate judges on these complex issues. Judges
in the state criminal justice systems, which handle the great majority of criminal cases,
often have overloaded dockets and little or no assistance. Whether a defendant in a
particular case is constitutionally entitled to expert assistance is a complicated issue
that defense counsel needs to explore.85 Possibly the best chance for the defense to
get meaningful help that also would assist the court is to get pro bono assistance
84. National Research Council, supra note 75, at 184–185.
85. See Ake v. Oklahoma, 470 U.S. 68 (1985) (recognizing indigent’s right to psychiatric expert
assistance in a capital case in which defendant raised insanity defense). Jurisdictions differ widely in
how they interpret Ake.
from other counsel who are knowledgeable about Daubert and have a sophisticated
understanding of statistical reasoning. Lawyers who have handled complex issues
about causation may be able to transfer their expertise to other difficult issues
relating to expert testimony.86 Judges might also consider asking for amicus briefs
from appropriate organizations or governmental units.
G. Confrontation Clause
The majority in Melendez-Diaz v. Massachusetts, in an opinion by Justice Scalia over
a strong dissent by Justice Kennedy, held that the defendant has a constitutional
right to demand that a forensic analyst whose conclusions the prosecution wishes
to introduce into evidence must be produced in court for cross-examination. In
a drug case, for example, the prosecution may not simply introduce a report or
an affidavit from the analyst if the defendant demands production of the analyst
for cross-examination. When the analyst is produced, this will give the defense
the opportunity through cross-examination to raise questions about fraud, incompetence, and carelessness and to ask questions about laboratory procedures and
other issues discussed in the National Research Council report. Effective cross-examination will demand of defense counsel the same type of expertise needed
to succeed on Daubert challenges. Numerous unanswered questions about the
operation of Melendez-Diaz will have to be litigated. It remains to be seen how
often, if at all, defense counsel will take advantage of the Confrontation Clause
or whether they will waive the defendant’s right to confront expert witnesses.87
V. Procedural Context
Apart from their effect on admissibility of expert testimony, Daubert and its subsequent interpretations have also affected the broader context in which such cases
are litigated and have altered the role of testifying experts in the pretrial stages of
litigation.
A. Class Certification Proceedings
One question that arises with increasing frequency is whether and how Daubert is
to be applied at class certification proceedings. The problem arises because of the
commonality requirement in Rule 23(a) and the predominance requirement in Rule 23(b)(3) of the Federal Rules
86. Cf. Kitzmiller v. Dover Area School Dist., 400 F. Supp. 2d 707 (M.D. Pa. 2005) (attorneys
who specialized in defense product liability litigation and had expertise about the nature of science
participated in case objecting to teaching intelligent design in public schools).
87. Both defendants and prosecutors face concerns about the resources required to fully implement such protections. See National Research Council, supra note 75, at 187.
of Civil Procedure and has emerged with regard to a wide variety of substantive
claims that plaintiffs seek to bring as a class action. For example, in Sanneman v.
Chrysler Corp.,88 plaintiff sought class certification of a common-law fraud action
and a breach-of-warranty action, the gist of which was “that Chrysler had fraudulently concealed a paint defect in many of the vehicles it manufactured beginning
on or about 1990.”89 Plaintiff’s expert testified at the class certification hearing
that the paint problem is always caused by ultraviolet rays, but acknowledged
“that other causes may contribute to or exacerbate the problem.”90 After oral
argument, the court concluded that plaintiff’s expert’s testimony satisfied Daubert,
but because ultraviolet rays are not always the only cause of problems with paint,
proof of damages would probably have to be made vehicle by vehicle. The motion
for class certification was therefore denied. Daubert challenges have been raised to
class certification in numerous other cases.91
As of this writing, there is a decided trend toward rejecting class certification on the ground that plaintiff’s proffered expert testimony does not satisfy the
Rule 23(a) requirements, although the circuits are not unanimous in how rigorous the examination of expert proof needs to be. Must the expert testimony be
subjected to the same rigorous scrutiny to determine whether it is relevant and
reliable as when the issue is admissibility at trial, or is a less searching analysis
appropriate at the certification stage? In other words, should the trial judge conduct a Daubert hearing and analysis identical to that undertaken when a defendant
seeks to preclude a plaintiff’s witness from testifying at trial? Not only “should”
the trial judge conduct a Daubert hearing, but, as the Seventh Circuit has ruled in
American Honda, the trial judge “must” do so. If a full Daubert hearing is required
in every class certification case, what has happened to the broad and case-familiar
discretion that a trial judge is supposed to exercise?
The trial judge in Rhodes v. E.I. du Pont de Nemours & Co.92 concluded that
the expert opinions offered in support of class certification should be subjected
to a full-scale Daubert analysis, including a Daubert hearing. The judge explained
88. 191 F.R.D. 441 (E.D. Pa. 2000).
89. Id. at 443.
90. Id. at 451.
91. See, e.g., Blades v. Monsanto Co., 400 F.3d 562 (8th Cir. 2005) (antitrust price-fixing conspiracy); Rhodes v. E.I. du Pont de Nemours & Co., 2008 WL 2400944 (S.D. W. Va. June 11, 2008)
(medical monitoring claim in toxic tort action); Gutierrez v. Johnson & Johnson, 2006 WL 3246605
(D.N.J. Nov. 6, 2006) (employment discrimination); Nichols v. SmithKline Beecham Corp., 2003 WL
302352 (E.D. Pa. Jan. 29, 2003) (violation of Sherman Antitrust Act); In re St. Jude Med., Inc., 2003
WL 1589527 (D. Minn. Mar. 27, 2003) (product liability action); Bacon v. Honda of Am. Mfg. Inc.,
205 F.R.D. 466 (S.D. Ohio 2001) (same); Midwestern Mach. v. Northwest Airlines, Inc., 211 F.R.D.
562 (D. Minn. 2001) (violation of Clayton Act); In re Polypropylene Carpet, 996 F. Supp. 18 (N.D.
Ga. 1997) (same); In re Monosodium Glutamate, 205 F.R.D. 229 (D. Minn. 2001).
92. 2008 WL 2400944 (S.D. W. Va. June 11, 2008). See also American Honda Motor Co. v.
Allen, 600 F.3d 813, 816 (7th Cir. 2010) (district court must perform a full Daubert analysis before
certifying a class action where the expert’s report or testimony is critical to class certification).
that decisions that see a more limited role for Daubert in class certification hearings
stem in part from misinterpreting the Supreme Court’s opinion in Eisen v. Carlisle
& Jacquelin.93 In Eisen, which predated Daubert by 19 years, the Court instructed
district courts to refrain from conducting “a preliminary inquiry into the merits
of a proposed class action” when they consider certification.94 At this time, only
the Ninth Circuit forbids the lower courts from examining evidence that relates to
the merits and from requiring a rigorous examination of the expert testimony and
Rule 23(a) requirements.95 The Rhodes case deplored this approach because the
overwhelming majority of class actions settle and therefore allowing the action to
proceed as a class action “might invite plaintiffs to seek class status for settlement
purposes.” On the other hand, knocking out the possibility of class certification
early in the proceedings affects the possibility of settling cases in which liability is
debatable. A possible compromise is partial certification that would allow a common issue to be established at a class trial, leaving individual issues for separate
proceedings.
B. Discovery
1. Amended discovery rules
Rule 26 of the Federal Rules of Civil Procedure—the core rule on civil discovery—
was amended in 1993 more or less contemporaneously with Daubert to allow judges
to exert greater control of expert testimony. Those amendments required experts
retained or specially employed to provide expert testimony, or whose duties as the
party’s employee regularly involve giving expert testimony, to furnish an extensive
report prior to his or her deposition.96 These reports were required to indicate
93. 417 U.S. 156 (1974).
94. Id. at 177–78.
95. See Dukes v. Wal-Mart, Inc., 474 F.3d 1214 (9th Cir. 2007). The Supreme Court declined
an opportunity to address the role of Daubert in class certification when it granted certiorari in Dukes,
even though the issue was raised in some of the petitions. The Court subsequently granted a petition
for certiorari in Erica P. John Fund Inc. v. Halliburton Co. (U.S. Jan. 7, 2011) (No. 09-1403), which
raises related questions regarding the extent to which the district court may consider the merits of the
underlying litigation and require that loss causation be demonstrated by a preponderance of admissible
evidence at the class certification stage under Federal Rule of Civil Procedure 23. Other courts accord
Daubert a limited role, such as requiring the trial judge to determine only that the expert testimony is
“not fatally flawed.” See Fogarazzo v. Lehman Bros., Inc., 2005 WL 361205 (S.D.N.Y. Feb. 16, 2005).
96. Fed. R. Civ. P. 26(a)(2)(B), as amended December 1, 2010, made substantial changes to the
1993 amendments. The 1993 amendments also recognized a second category of testifying experts who
were not retained or specially employed in anticipation of litigation, such as treating physicians, who
were not required to provide reports. But see 3M v. Signtech USA, 177 F.R.D. 459 (D. Minn. 1998)
(requiring report from employee experts who do not regularly provide expert testimony because it
eliminates surprise and is consistent with the spirit of Rule 26(a)(2)(B)). Under the 2010 amendments
the attorney must submit a report indicating the subject matter and the facts and opinions to which an
unretained testifying expert is expected to testify. Fed. R. Civ. P. 26(a)(2)(C) (amended Dec. 1, 2010).
“the data or other information considered by the expert witness in forming the opinions” (emphasis added). Many, although not all, courts construed this language as
opening the door to discovery of anything conveyed by counsel to the expert.97
Courts taking this approach found that all communications between counsel and
experts were discoverable even if the communication was opinion work product.
In other words, these courts found that the protection for opinion work product in
Rule 26(b)(3) was trumped by the disclosure provisions in Rule 26(a)(2)(B). These
courts also required disclosure of all the expert’s draft reports and notes.
Trigon Ins. Co. v. United States98 went a step further. It held that drafts
prepared with the assistance of consultants who would not testify, as well as all
communications between the consultants and the experts, including e-mails,
were discoverable. In Trigon, many of these materials had been destroyed. The
court ordered the defendant to hire an outside technology consultant to retrieve
as much of these data as possible, allowed adverse inferences to be drawn against
the defendant, and awarded more than $179,000 in fees and costs to plaintiff.99
Those who favored the free discovery of communications between counsel and
experts and of draft reports justified these results as shedding light on whether the
expert’s opinions are his or her own or those of counsel. Critics of this approach
found it costly and time-consuming and pointed out that lawyers have developed
strategies to overcome transparency, such as retaining two sets of experts—one to
consult and the other to testify—which makes discovery even more expensive.
After a series of public hearings, the Advisory Committee on Civil Rules
determined that the disclosure rules increased the cost of litigation with no offsetting advantage to the conduct of litigation. The report of the Advisory Committee
noted that such an extensive inquiry into expert communications with attorneys
did not lead to better testing of expert opinions “because attorneys and expert
witnesses go to great lengths to forestall discovery.”100
Under amended rules that became effective in December 2010, disclosure is
limited to “the facts or data” considered by the expert, and does not extend to
“other information.” Draft reports are no longer discoverable, and communications between counsel and an expert are protected from discovery unless the
communications: (1) relate to compensation for the expert’s study or testimony;
97. See Karn v. Ingersoll Rand, 168 F.R.D. 633 (N.D. Ind. 1996) (requiring disclosure of all
documents reviewed by experts in forming their opinions); Reg’l Airport Auth. v. LFG, LLC, 460
F.3d 697, 716 (6th Cir. 2006) (“other information” interpreted to include all communications by
counsel to expert).
98. 204 F.R.D. 277 (E.D. Va. 2002).
99. Id. See also Semtech Corp. v. Royal Ins. Co., 2007 WL 5462339 (C.D. Cal. Oct. 24, 2007)
(explaining that preclusion of expert from testifying for failure to disclose drafts and failing to disclose
input of counsel at hearing made it impossible to discern the basis for his opinion).
100. Report of the Civil Rules Advisory Committee, from Honorable Mark R. Kravitz, Chair,
Advisory Committee on Federal Rules of Civil Procedure, to Honorable Lee H. Rosenthal, Chair,
Standing Committee on Rules of Practice and Procedure (May 8, 2008), available at http://www.
uscourts.gov/uscourts/RulesAndPolicies/rules/Reports/CV05-2009.pdf.
(2) identify facts or data provided by counsel and considered by the expert; or
(3) identify assumptions furnished by counsel that the expert relied upon in forming opinions. Testifying experts who were not required to provide a report under
the previous rules—such as treating physicians—are now required to provide a
summary of the facts or opinions to which the witness expects to testify. While this
requirement relating to experts not required to file a report would provide more
disclosure than under the 1993 amendments, the main thrust of the 2010 amendments is to narrow expert discovery with an eye toward minimizing expense and
focusing attention on the expert’s opinion.
Nothing in the amendments precludes asking an expert at a deposition to
explain the bases or foundations for his or her opinions or asking whether the
expert considered other possible approaches, but inquiries into counsel’s input
would be severely curtailed. Aside from communications with counsel relating
to compensation, or inquiring into “facts or data” provided by counsel that the
expert considered, the expert may also be asked if counsel furnished him or her
with assumptions on which he or she relied. Now that the amended rules have
become effective, it remains to be seen how broadly courts and magistrate judges will
interpret the “assumptions” provision. Are there instances in which it will be
inferred that counsel was seeking to have the expert make an assumption although
this was never explicitly stated? Those who think more transparency is desirable
in dealing with expert testimony will certainly push to expand this category.
Whether these amendments can constrain the gamesmanship that surrounds expert testimony remains to be seen.
2. E-discovery
Also uncertain is whether experts will be needed to determine the proper scope
of e-discovery. Rule 26(b)(2)(B) provides the following:
A party need not provide discovery of electronically stored information from
sources that the party identifies as not reasonably accessible because of undue
burden or cost.
The burden is on the party from whom discovery is sought to show this
undue burden or cost, but the court may nevertheless order discovery if the
requesting party can show good cause.
May a requesting party moving to compel proffer expert testimony to show that the requested information would have been readily accessible had the responding party used a different search methodology? Recent opinions by a magistrate judge suggest as much.101 Magistrate Judge John Facciola notes
that “[w]hether search terms or ‘keywords’ will yield the information sought is a
complicated question involving the interplay, at least, of the sciences of computer
101. See, e.g., United States v. O’Keefe, 537 F. Supp. 2d 14 (D.D.C. 2008); Equity Analytics,
LLC v. Lunden, 248 F.R.D. 331 (D.D.C. 2008).
technology, statistics and linguistics. . . . This topic is clearly beyond the ken of a
layman and requires that any conclusion be based on evidence that, for example,
meets the criteria of Rule 702 of the Federal Rules of Evidence.”102
Superimposing Daubert hearings on top of e-discovery proceedings will make
an already costly procedure even more costly, one of the consequences that
Rule 26(b)(2)(B) seeks to avoid. On the other hand, a search that would not lead
to the information sought defeats the objectives of discovery. A helpful opinion on how these factors should be balanced can be found in Victor Stanley, Inc. v. Creative Pipe, Inc.,103 which examines the issues a court must consider and contains a very brief overview of the various techniques for conducting searches of
electronically stored information. A court may well require technical assistance in
dealing with these issues. In some instances, a court-appointed expert or a special
master appointed pursuant to Rule 53 of the Federal Rules of Civil Procedure
might be more desirable than a full-fledged Daubert battle among experts, particularly if one of the parties has far fewer resources than its opponent.
C. Daubert Hearings
When a Daubert issue arises, the trial court has discretion about how to proceed.104
It need not grant an evidentiary hearing and has leeway to decide when and how
issues about the admissibility of expert testimony should be determined. The burden is on the parties to persuade the court that a particular procedure is needed.105
The generally unfettered power of the trial judge to make choices emerges
clearly if we look at United States v. Nacchio,106 a criminal case. The defendant
claimed that the trial judge erred in granting the government’s Daubert motion to
exclude his expert in the middle of the trial without an evidentiary hearing, leading to his conviction. On appeal, a divided panel of the Tenth Circuit reversed
on the ground that the expert testimony had been improperly excluded and
remanded for a new trial. After rehearing en banc, the conviction was reinstated in a 5-4
opinion. The majority rejected the defense’s central argument that the court had
to take into account that this was a criminal case; the majority saw this purely as a
Daubert issue and found that the burden of satisfying Daubert and convincing the
trial judge to hold a hearing rested solely on the defendant. Although there may
be some cases in which a reviewing court would find that the trial court abused
its discretion in the procedures it used in handling a Daubert motion,107 this has
102. See Equity Analytics, 248 F.R.D. at 333.
103. 250 F.R.D. 251 (D. Md. 2008).
104. Kumho Tire Co. v. Carmichael, 526 U.S. 137, 150 (1999).
105. For example, in the government’s RICO tobacco case, all Daubert issues were decided on
the papers without any testimony being presented. United States v. Philip Morris Inc., 2002 WL
34233441, at *1 (D.D.C. Sept. 30, 2002).
106. 555 F.3d 1234 (10th Cir. 2009).
107. See Padillas v. Stork-Gamco, Inc., 186 F.3d 412 (3d Cir. 1999).
become more and more unlikely in civil cases as Daubert rulings have accumulated
and courts increasingly expect litigators to understand their obligations.
VI. Conclusion
The Daubert trilogy has dramatically changed the legal landscape with regard to
expert witness testimony. The Supreme Court attempted in Daubert to articulate
basic principles to guide trial judges in making decisions about the admissibility
of complex scientific and technological expert testimony. Unfortunately, the
Daubert trilogy has, in actuality, spawned a huge, and expensive, new subject of
litigation and has left many procedural and substantive questions unanswered.
Moreover, there are serious concerns about whether the guidelines enunciated by
the Court have been interpreted by lower courts to limit, rather than respect, the
discretion of trial judges to manage their complex cases, whether the guidelines
conflict with the preference for admissibility contained in both the Federal Rules
of Evidence and Daubert itself, and whether the guidelines have resulted in trial
judges encroaching on the province of the jury to decide highly contested factual
issues and to judge the overall credibility of expert witnesses and their scientific
theories. Perhaps most disturbingly, there are serious concerns on the part of
many scientists as to whether the courts are, as Daubert prescribed, making admissibility decisions—decisions that may well determine the ultimate outcome of a
case—which are in fact “ground[ed] in the methods and procedures of science.”108
108. Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579, 590 (1993).
How Science Works
DAVID GOODSTEIN
David Goodstein, Ph.D., is Professor of Physics and Applied Physics, and the Frank J. Gilloon
Distinguished Teaching and Service Professor, Emeritus, California Institute of Technology,
Pasadena, California, where he also served for 20 years as Vice Provost.
CONTENTS
I. Introduction, 38
II. A Bit of History, 38
III. Theories of Science, 39
A. Francis Bacon’s Scientific Method, 39
B. Karl Popper’s Falsification Theory, 40
C. Thomas Kuhn’s Paradigm Shifts, 41
D. An Evolved Theory of Science, 43
IV. Becoming a Professional Scientist, 45
A. The Institutions, 45
B. The Reward System and Authority Structure, 46
V. Some Myths and Facts About Science, 47
VI. Comparing Science and the Law, 51
A. Language, 51
B. Objectives, 52
VII. A Scientist’s View of Daubert, 52
I. Introduction
Recent Supreme Court decisions have put judges in the position of having to
decide what is scientific and what is not.1 Some judges may not be entirely
comfortable making such decisions, despite the guidance supplied by the Court
and illuminated by learned commentators.2 The purpose of this chapter is not
to resolve the practical difficulties that judges will encounter in reaching those
decisions; it is to demystify somewhat the business of science and to help judges
understand the Daubert decision, at least as it appears to a scientist. In the hope
of accomplishing these tasks, I take a mildly irreverent look at some formidable
subjects. I hope the reader will accept this chapter in that spirit.
II. A Bit of History
Modern science can reasonably be said to have come into being during the time
of Queen Elizabeth I of England and William Shakespeare. Almost immediately,
it came into conflict with the law.
While Shakespeare was composing his sonnets and penning his plays in
England, Galileo Galilei in Italy was inventing the idea that careful experiments
in a laboratory could reveal universal truths about the way objects move through
space. A bit later, after hearing about the newly invented telescope, he made
one for himself, and with it he made discoveries in the heavens that astonished
and thrilled all of Europe. Nonetheless, in 1633, Galileo was put on trial for his
scientific teachings. The trial of Galileo is usually portrayed as a conflict between
science and the Roman Catholic Church, but it was, after all, a trial, with judges
and lawyers, and all the other trappings of a formal legal procedure.
Another great scientist of the day, William Harvey, who discovered the circulation of blood, worked not only at the same time as Galileo, but also at the same
place—the University of Padua, not far from Venice. If you visit the University of
Padua today and tour the old campus at the heart of the city, you will be shown
Galileo’s cattedra, the wooden pulpit from which he lectured (and curiously, one
of his vertebrae in a display case just outside the rector’s office—maybe the rector
needs to be reminded to have a little spine). You will also be shown the lecture
1. These Supreme Court decisions are discussed in Margaret A. Berger, The Admissibility of
Expert Testimony, Sections II–III, IV.A, in this manual. For a discussion of the difficulty in distinguishing between science and engineering, see Channing R. Robertson et al., Reference Guide on
Engineering, in this manual.
2. Since publication of the first edition of this manual, a number of works have been developed
to assist judges and attorneys in understanding a wide range of scientific evidence. See, e.g., 1 & 2
Modern Scientific Evidence: The Law and Science of Expert Testimony (David L. Faigman et al. eds.,
1997); Expert Evidence: A Practitioner’s Guide to Law, Science, and the FJC Manual (Bert Black &
Patrick W. Lee eds., 1997).
theater in which Harvey dissected cadavers while eager students peered downward from tiers of overhanging balconies. Because dissecting cadavers was illegal
in Harvey’s time, the floor of the theater was equipped with a mechanism that
whisked the body out of sight when a lookout gave the word that the authorities
were coming. Obviously, both science and the law have changed a great deal since
the seventeenth century.
Another important player who lived in the same era was not a scientist at all,
but a lawyer who rose to be Lord Chancellor of England in the reign of Elizabeth’s
successor, James I. His name was Sir Francis Bacon, and in his magnum opus,
which he called Novum Organum, he put forth the first theory of the scientific
method. In Bacon’s view, the scientist should be an impartial observer of nature,
collecting observations with a mind cleansed of harmful preconceptions that might
cause error to creep into the scientific record. Once enough such observations
were gathered, patterns would emerge, giving rise to truths about nature.
Bacon’s theory has been remarkably influential down through the centuries,
even though in his own time there were those who knew better. “That’s exactly
how a Lord Chancellor would do science,” William Harvey is said to have grumbled.
III. Theories of Science
Today, in contrast to the seventeenth century, few would deny the central importance of science to our lives, but not many would be able to give a good account
of what science is. To most, the word probably brings to mind not science itself,
but the fruits of science, the pervasive complex of technology and discoveries that
has transformed all of our lives. However, science might equally be thought
to include the vast body of knowledge we have accumulated about the natural
world. There are still mysteries, and there always will be mysteries, but the fact is
that, by and large, we understand how nature works.
A. Francis Bacon’s Scientific Method
But science is even more than that. Ask a scientist what science is, and the answer
will almost surely be that it is a process—a way of examining the natural world
and discovering important truths about it. In short, the essence of science is the
scientific method.3
3. The Supreme Court, in Daubert v. Merrell Dow Pharmaceuticals, Inc., acknowledged the importance of defining science in terms of its methods as follows: “‘Science is not an encyclopedic body of
knowledge about the universe. Instead, it represents a process for proposing and refining theoretical
explanations about the world that are subject to further testing and refinement’” (emphasis in original).
509 U.S. 579, 590 (1993) (quoting Brief for the American Association for the Advancement of Science
and the National Academy of Sciences as Amici Curiae at 7–8).
This stirring description suffers from an important shortcoming. We do not
really know what the scientific method is.4 There have been many attempts at
formulating a general theory of how science works, or at least how it should work,
starting, as we have seen, with the theory of Sir Francis Bacon. But Bacon’s idea,
that science proceeds through the collection of observations without prejudice, has
been rejected by all serious thinkers. Everything about the way we do science—
the language we use, the instruments we use, the methods we use—depends on
clear presuppositions about how the world works. Modern science is full of things
that cannot be observed at all, such as force fields and complex molecules. At the
most fundamental level, it is impossible to observe nature without having some
reason to choose what is and is not worth observing. Once that elementary choice
is made, Bacon has been left behind.
B. Karl Popper’s Falsification Theory
Over the past century, the ideas of the Vienna-born philosopher Sir Karl
Popper have had a profound effect on theories of the scientific method.5 In
contrast to Bacon, Popper believed that all science begins with a prejudice, or
perhaps more politely, a theory or hypothesis. Nobody can say where the theory
comes from. Formulating the theory is the creative part of science, and it cannot be analyzed within the realm of philosophy. However, once the theory is in
hand, Popper tells us, it is the duty of the scientist to extract from it logical but
unexpected predictions that, if they are shown by experiment not to be correct,
will serve to render the theory invalid.
Popper was deeply influenced by the fact that a theory can never be proved
right by agreement with observation, but it can be proved wrong by disagreement
with observation. Because of this asymmetry, science uniquely makes progress by
proving that good ideas are wrong so that they can be replaced by even better
ideas. Thus, Bacon’s impartial observer of nature is replaced by Popper’s skeptical
theorist. The good Popperian scientist somehow comes up with a hypothesis that
fits all or most of the known facts, then proceeds to attack that hypothesis at its
weakest point by extracting from it predictions that can be shown to be false. This
process is known as falsification.6
4. For a general discussion of theories of the scientific method, see Alan F. Chalmers, What Is
This Thing Called Science? (1982). For a discussion of the ethical implications of the various theories,
see James Woodward & David Goodstein, Conduct, Misconduct and the Structure of Science, 84 Am.
Scientist 479 (1996).
5. See, e.g., Karl R. Popper, The Logic of Scientific Discovery (Karl R. Popper trans., 1959).
6. The Supreme Court in Daubert recognized Popper’s conceptualization of scientific knowledge by noting that “[o]rdinarily, a key question to be answered in determining whether a theory or
technique is scientific knowledge that will assist the trier of fact will be whether it can be (and has
been) tested.” 509 U.S. at 593. In support of this point, the Court cited as parenthetical passages from
both Carl Gustav Hempel, Philosophy of Natural Science 49 (1966) (“‘[T]he statements constituting
Popper’s ideas have been fruitful in weaning the philosophy of science away
from the Baconian view and some other earlier theories, but they fall short in a
number of ways in describing correctly how science works. The first of these is
the observation that, although it may be impossible to prove a theory is true by
observation or experiment, it is almost equally impossible to prove one is false
by these same methods. Almost without exception, in order to extract a falsifiable
prediction from a theory, it is necessary to make additional assumptions beyond
the theory itself. Then, when the prediction turns out to be false, it may well be
one of the other assumptions, rather than the theory itself, that is false. To take a
simple example, early in the twentieth century it was found that the orbits of the
outermost planets did not quite obey the predictions of Newton’s laws of gravity
and mechanics. Rather than take this to be a falsification of Newton’s laws, astronomers concluded that the orbits were being perturbed by an additional unseen
body out there. They were right. That is precisely how Pluto was discovered.
The apparent asymmetry between falsification and verification that lies at the
heart of Popper’s theory thus vanishes. But the difficulties with Popper’s view go
even beyond that problem. It takes a great deal of hard work to come up with a
new theory that is consistent with nearly everything that is known in any area of
science. Popper’s notion that the scientist’s duty is then to attack that theory at its
most vulnerable point is fundamentally inconsistent with human nature. It would
be impossible to invest the enormous amount of time and energy necessary to
develop a new theory in any part of modern science if the primary purpose of all
that work was to show that the theory was wrong.
This point is underlined by the fact that the behavior of the scientific community is not consistent with Popper’s notion of how it should be. Credit in
science is most often given for offering correct theories, not wrong ones, or for
demonstrating the correctness of unexpected predictions, not for falsifying them.
I know of no example of a Nobel Prize awarded to a scientist for falsifying his or
her own theory.
C. Thomas Kuhn’s Paradigm Shifts
Another towering figure in the twentieth century theory of science is Thomas
Kuhn.7 Kuhn was not a philosopher but a historian (more accurately, a physicist who retrained himself as a historian). It is Kuhn who popularized the word
paradigm, which has today come to seem so inescapable.
A paradigm, for Kuhn, is a kind of consensual worldview within which scientists work. It comprises an agreed-upon set of assumptions, methods, language, and
a scientific explanation must be capable of empirical test’”) and Karl R. Popper, Conjectures and
Refutations: The Growth of Scientific Knowledge 37 (5th ed. 1989) (“‘[T]he criterion of the scientific
status of a theory is its falsifiability, or refutability, or testability’”).
7. Thomas S. Kuhn, The Structure of Scientific Revolutions (1962).
everything else needed to do science. Within a given paradigm, scientists make
steady, incremental progress, doing what Kuhn calls “normal science.”
As time goes on, difficulties and contradictions arise that cannot be resolved,
but the tendency among scientists is to resist acknowledging them. One way or
another they are swept under the rug, rather than being allowed to threaten the
central paradigm. However, at a certain point, enough of these difficulties accumulate to make the situation intolerable. At that point, a scientific revolution
occurs, shattering the paradigm and replacing it with an entirely new one.
This new paradigm, says Kuhn, is so radically different from the old that
normal discourse between the practitioners of the two paradigms becomes impossible. They view the world in different ways and speak different languages. It is
not even possible to tell which of the two paradigms is superior, because they
address different sets of problems. They are incommensurate. Thus, science does
not progress incrementally, as the science textbooks would have it, except during
periods of normal science. Every once in a while, a scientific revolution brings
about a paradigm shift, and science heads off in an entirely new direction.
Kuhn’s view was formed largely on the basis of two important historical
revolutions. One was the original scientific revolution that started with Nicolaus
Copernicus and culminated with the new mechanics of Isaac Newton. The
very word revolution, whether it refers to the scientific kind, the political kind,
or any other kind, refers metaphorically to the revolutions in the heavens that
Copernicus described in a book, De Revolutionibus Orbium Caelestium, published
as he lay dying in 1543.8 Before Copernicus, the dominant paradigm was the
worldview of ancient Greek philosophy, frozen in the fourth century B.C.E. ideas
of Plato and Aristotle. After Newton, whose masterwork, Philosophiæ Naturalis
Principia Mathematica, was published in 1687, every scientist was a Newtonian, and
Aristotelianism was banished forever from the world stage. It is even possible that
Sir Francis Bacon’s disinterested observer was a reaction to Aristotelian authority.
Look to nature, not to the ancient texts, Bacon may have been saying.
The second revolution that served as an example for Kuhn occurred early in
the twentieth century. In a headlong series of events that lasted a mere 25 years,
the Newtonian paradigm was overturned and replaced with the new physics, in
the form of quantum mechanics and Einstein’s theories of special and general
relativity. This second revolution, although it happened much faster, was no less
profound than the first.
The idea that science proceeds by periods of normal activity punctuated by
shattering breakthroughs that make scientists rethink the whole problem is an
appealing one, especially to the scientists themselves, who believe from personal
experience that it really happens that way. Kuhn’s contribution is important. It
offers us a useful context (a paradigm, one might say) for organizing the entire
history of science.
8. I. Bernard Cohen, Revolution in Science (1985).
Nonetheless, Kuhn’s theory does suffer from a number of shortcomings. One
of them is that it contains no measure of how big the change must be in order
to qualify as a revolution or paradigm shift. Most scientists will say that there is
a paradigm shift in their laboratory every 6 months or so (or at least every time
it becomes necessary to write another proposal for research support). That is not
exactly what Kuhn had in mind.
Another difficulty is that even when a paradigm shift is truly profound, the
paradigms it separates are not necessarily incommensurate. The new sciences of
quantum mechanics and relativity, for example, did indeed show that Newton’s
laws of mechanics were not the most fundamental laws of nature. However,
the new theories did not show that Newton’s laws were wrong. Quite the contrary, they showed why
Newton’s laws were right: Newton’s laws arose out of newly discovered laws that
were even deeper and that covered a wider range of circumstances unimagined
by Newton and his followers—that is, phenomena as small as atoms, or nearly as
fast as the speed of light, or as dense as black holes. In our more familiar realms of
experience, Newton’s laws go on working just as well as they always did. Thus,
there is no quarrel and no ambiguity at all about which paradigm is “better.” The
new laws of quantum mechanics and relativity subsume and enhance the older
Newtonian worldview.
D. An Evolved Theory of Science
If neither Bacon nor Popper nor Kuhn gives us a perfect description of what
science is or how it works, all three of them help us to gain a much deeper
understanding of it.
Scientists are not Baconian observers of nature, but all scientists become
Baconians when it comes to describing their observations. With very few exceptions, scientists are rigorously, even passionately, honest about reporting scientific
results and how they were obtained. Scientific data are the coin of the realm in
science, and they are always treated with reverence. Those rare instances in which
scientists are found to have fabricated or altered their data in some way are always
traumatic scandals of the first order.9
Scientists are also not Popperian falsifiers of their own theories, but they do
not have to be. They do not work in isolation. If a scientist has a rival with a
different theory of the same phenomenon, the rival will be more than happy to
perform the Popperian duty of attacking the scientist’s theory at its weakest point.
9. Such instances are discussed in David Goodstein, Scientific Fraud, 60 Am. Scholar 505
(1991). For a summary of recent investigations into scientific fraud and lesser instances of scientific
misconduct, see Office of Research Integrity, Department of Health and Human Services, Scientific
Misconduct Investigations: 1993–1997, http://ori.dhhs.gov/PDF/scientific.pdf (last visited Nov. 21,
1999) (summarizing 150 scientific misconduct investigations closed by the Office of Research
Integrity).
Moreover, if falsification is no more definitive than verification, and scientists
prefer in any case to be right rather than wrong, they nonetheless know how to
hold verification to a very high standard. If a theory makes novel and unexpected
predictions, and those predictions are verified by experiments that reveal new and
useful or interesting phenomena, then the chances that the theory is correct are
greatly enhanced. And, even if it is not correct, it has been fruitful in the sense
that it has led to the discovery of previously unknown phenomena that might
prove useful in themselves and that will have to be explained by the next theory
that comes along.
Finally, science does not, as Kuhn seemed to think, periodically self-destruct
and need to start over again. It does, however, undergo startling changes of perspective that lead to new and, invariably, better ways of understanding the world.
Thus, although science does not proceed smoothly and incrementally, it is one
of the few areas of human endeavor that is genuinely progressive. There is no
doubt at all that the quality of twentieth-century science is better than that of
nineteenth-century science, and we can be absolutely confident that the quality of
science in the twenty-first century will be better still. One cannot say the same about, say,
art or literature.10
To all of this, a few things must be added. The first is that science is, above
all, an adversarial process. It is an arena in which ideas do battle, with observations and data the tools of combat. The scientific debate is very different from
what happens in a court of law, but just as in the law, it is crucial that every idea
receive the most vigorous possible advocacy, just in case it might be right. Thus,
the Popperian ideal of holding one’s hypothesis in a skeptical and tentative way
is not merely inconsistent with reality; it would be harmful to science if it were
pursued. As will be discussed shortly, not only ideas, but the scientists themselves,
engage in endless competition according to rules that, although they are not written down, are nevertheless complex and binding.
In the competition among ideas, the institution of peer review plays a central
role. Scientific articles submitted for publication and proposals for funding often
are sent to anonymous experts in the field, in other words, to peers of the author,
for review. Peer review works superbly to separate valid science from nonsense,
10. The law, too, can claim to be progressive. The development of legal constructs, such as due
process, equal protection, and individual privacy, reflects notable progress in the betterment of mankind. See Laura Kalman, The Strange Career of Legal Liberalism 2–4 (1996) (recognizing the “faith”
of legal liberalists in the use of law as an engine for progressive social change in favor of society’s
disadvantaged). Such progress is measured by a less precise form of social judgment than the consensus
that develops regarding scientific progress. See Steven Goldberg, The Reluctant Embrace: Law and Science
in America, 75 Geo. L.J. 1341, 1346 (1987) (“Social judgments, however imprecise, can sometimes be
reached on legal outcomes. If a court’s decision appears to lead to a sudden surge in the crime rate,
it may be judged wrong. If it appears to lead to new opportunities for millions of citizens, it may be
judged right. The law does gradually change to reflect this kind of social testing. But the process is
slow, uncertain, and controversial; there is nothing in the legal community like the consensus in the
scientific community on whether a particular result constitutes progress.”).
How Science Works
or, in Kuhnian terms, to ensure that the current paradigm has been respected.11
It works less well as a means of choosing between competing valid ideas, in part
because the peer doing the reviewing is often a competitor for the same resources
(space in prestigious journals, funds from government agencies or private foundations) being sought by the authors. It works very poorly in catching cheating
or fraud, because all scientists are socialized to believe that even their toughest
competitor is rigorously honest in the reporting of scientific results, which makes
it easy for a purposefully dishonest scientist to fool a referee. Despite all of this,
peer review is one of the venerated pillars of the scientific edifice.
IV. Becoming a Professional Scientist
Science as a profession or career has become highly organized and structured.12
It is not, relatively speaking, a very remunerative profession—that would be
inconsistent with the Baconian ideal—but it is intensely competitive, and material
well-being does tend to follow in the wake of success (successful scientists, one
might say, do get to bring home the Bacon).
A. The Institutions
These are the institutions of science: Research is done in the Ph.D.-granting
universities and, to a lesser extent, in colleges that do not grant Ph.D.s. It is also
done in national laboratories and in industrial laboratories. Before World War II,
basic science was financed mostly by private foundations (Rockefeller, Carnegie),
but since the war, the funding of science (except in industrial laboratories) has
largely been taken over by agencies of the federal government, notably the
National Science Foundation (an independent agency), the National Institutes of
11. The Supreme Court received differing views regarding the proper role of peer review.
Compare Brief for Amici Curiae Daryl E. Chubin et al. at 10, Daubert v. Merrell Dow Pharms., Inc.,
509 U.S. 579 (1993) (No. 92-102) (“peer review referees and editors limit their assessment of submitted articles to such matters as style, plausibility, and defensibility; they do not duplicate experiments
from scratch or plow through reams of computer-generated data in order to guarantee accuracy or
veracity or certainty”), with Brief for Amici Curiae New England Journal of Medicine, Journal of the
American Medical Association, and Annals of Internal Medicine in Support of Respondent, Daubert
v. Merrell Dow Pharms., Inc., 509 U.S. 579 (1993) (No. 92-102) (proposing that publication in a
peer-reviewed journal be the primary criterion for admitting scientific evidence in the courtroom). See
generally Daryl E. Chubin & Edward J. Hackett, Peerless Science: Peer Review and U.S. Science Policy
(1990); Arnold S. Relman & Marcia Angell, How Good Is Peer Review? 321 New Eng. J. Med. 827–29
(1989). As a practicing scientist and frequent peer reviewer, I can testify that Chubin’s view is correct.
12. The analysis that follows is based on David Goodstein & James Woodward, Inside Science,
68 Am. Scholar 83 (1999).
Health (part of the Department of Health and Human Services), and parts of the
Department of Energy and the Department of Defense.
Scientists who work at all these organizations—universities, colleges, national
and industrial laboratories, and funding agencies—belong to scientific societies that
are organized mostly by discipline. There are large societies, such as the American
Physical Society and the American Chemical Society; societies for subdisciplines,
such as optics and spectroscopy; and even organizations of societies, such as
FASEB, the Federation of American Societies for Experimental Biology.
Scientific societies are private organizations that elect their own officers,
hold scientific meetings, publish journals, and finance their operations from the
collection of dues and from the proceeds of their publishing and educational
activities. The American Association for the Advancement of Science also holds
meetings and publishes Science, a famous journal, but it is not restricted to any one
discipline. The National Academy of Sciences holds meetings and publishes the
Proceedings of the National Academy of Sciences, and, along with the National Academy of Engineering, Institute of Medicine, and their operational arm, the National
Research Council, advises various government agencies on matters pertaining to
science, engineering, and health. In addition to the advisory activities, one of its
most important activities is to elect its own members.
These are the basic institutions of American science. It should not come as
news that the universities and colleges engage in a fierce but curious competition, in which no one knows who is keeping score, but everyone knows roughly
what the score is. (In recent years, some national and international media outlets
have found it worthwhile to appoint themselves scorekeepers in this competition. Academic officials dismiss these journalistic judgments, except when their
own institutions come out on top.) Departments in each discipline compete with
one another, as do national and industrial laboratories and even funding agencies.
Competition in science is at its most refined, however, at the level of individual
careers.
B. The Reward System and Authority Structure
To regulate competition among scientists, there is a reward system and an authority
structure. The fruits of the reward system are fame, glory, and immortality. The
purposes of the authority structure are power and influence. The reward system and
the authority structure are closely related to one another, but scientists distinguish
sharply between them. When they speak of a colleague who has become president
of a famous university, they will say sadly, “It’s a pity—he was still capable of good
work,” sounding like warriors lamenting the loss of a fallen comrade. The university president is a kingpin of the authority structure, but, with rare exceptions, he
is a dropout from the reward system. Similar kinds of behavior can be observed
in industrial and government laboratories, but a description of what goes on in
universities will be enough to illustrate how the system works.
A career in academic science begins at the first step on the reward system
ladder, earning a Ph.D., followed in many areas by one or two stints as a postdoctoral fellow. The Ph.D. and postdoctoral positions had best be at universities
(or at least departments) that are high up in that fierce but invisible competition, because all subsequent steps are more likely than not to take the individual
sideways or downward on the list. The next step is a crucial one: appointment
to a tenure-track junior faculty position. About two-thirds of all postdoctoral
fellows in biology in American universities believe that they are going to make
this step, but in fact, only about a quarter of them succeed. This step and all subsequent steps require growing renown as a scientist beyond the individual’s own
circle of acquaintances. Thus, it is essential by this time that the individual has
accomplished something. The remaining steps up the reward system ladder are
promotion to an academic tenured position and full professorship; various prizes,
medals, and awards given out by the scientific societies; an endowed chair (the
virtual equivalent of Galileo’s wooden cattedra); election to the National Academy
of Sciences; particularly prestigious awards up to and including the Nobel Prize;
and, finally, a reputation equivalent to immortality.
Positions in the authority structure are generally rewards for having achieved
a certain level in the reward system. For example, starting from the Ph.D. or
junior faculty level, it is possible to step sideways temporarily or even permanently
into a position as contract officer in a funding agency. Because contract officers
influence the distribution of research funds, they have a role in deciding who will
succeed in the climb up the reward system ladder. At successively higher levels one
can become a journal editor; department chair; dean, provost, director of a national
research laboratory, or president of a university; and even the head of a funding
agency, a key player in determining national policy as it relates to science and
technology. People in these positions have stepped out of the traditional reward
system, but they have something to say about who succeeds within it.
V. Some Myths and Facts About Science
“In matters of science,” Galileo wrote, “the authority of thousands is not worth
the humble reasoning of one single person.”13 Doing battle with the Aristotelian
professors of his day, Galileo believed that kowtowing to authority was the enemy
of reason. But, contrary to Galileo’s famous remark, the fact is that within the
scientific community itself, authority is of fundamental importance. If a paper’s
13. I found this statement framed on the office wall of a colleague in Italy in the form, “In
questioni di scienza L’autorità di mille non vale l’umile ragionare di un singolo.” However, I have not been
able to find the famous remark in this form in Galileo’s writings. An equivalent statement in different
words can be found in Galileo’s Il Saggiatore (1623). See Andrea Frova & Mariapiera Marenzona,
Parola di Galileo 473 (1998).
author is a famous scientist, the paper is probably worth reading. The triumph of
reason over authority is just one of the many myths about science. Following is
a brief list of some others:
Myth: Scientists must have open minds, being ready to discard old ideas in
favor of new ones.
Fact: Because science is an adversarial process through which each idea
deserves the most vigorous possible defense, it is useful for the successful progress of science that scientists tenaciously cling to their own
ideas, even in the face of contrary evidence.
Myth: The institution of peer review assures that all published papers are
sound and dependable.
Fact: Peer review generally will catch something that is completely out of
step with majority thinking at the time, but it is practically useless for
catching outright fraud, and it is not very good at dealing with truly
novel ideas. Peer review mostly assures that all papers follow the current paradigm (see comments on Kuhn, above). It certainly does not
ensure that the work has been fully vetted in terms of the data analysis
and the proper application of research methods.
Myth: Science must be an open book. For example, every new experiment
must be described so completely that any other scientist can reproduce it.
Fact: There is a very large component of skill in making cutting-edge
experiments work. Often, the only way to import a new technique
into a laboratory is to hire someone (usually a postdoctoral fellow)
who has already made it work elsewhere. Nonetheless, scientists have
a solemn responsibility to describe the methods they use as fully and
accurately as possible. And, eventually, the skill will be acquired by
enough people to make the new technique commonplace.
Myth: When a new theory comes along, the scientist’s duty is to falsify it.
Fact: When a new theory comes along, the scientist’s instinct is to verify it.
When a theory is new, the effect of a decisive experiment that shows
it to be wrong is that both the theory and the experiment are in most
cases quickly forgotten. This result leads to no progress for anybody in
the reward system. Only when a theory is well established and widely
accepted does it pay off to prove that it is wrong.
Myth: University-based research is pure and free of conflicts of interest.
Fact: The Bayh-Dole Act of the early 1980s permits universities to patent
the results of research supported by the federal government. Many universities have become adept at obtaining such patents. In many cases
this raises conflict-of-interest problems when a university's interest
in pursuing knowledge comes into conflict with its need for revenue.
This is an area that has generated considerable scrutiny. For instance,
the recent Institute of Medicine report Conflict of Interest in Medical
Research, Education, and Practice sheds light on the changing dimensions
of conflicts of interest associated with growing interdisciplinary collaborations among individuals, universities, and industry, especially in the
life sciences and biomedical research.14
Myth: Real science is easily distinguished from pseudoscience.
Fact: This is what philosophers call the problem of demarcation: One of
Popper’s principal motives in proposing his standard of falsifiability
was precisely to provide a means of demarcation between real science
and impostors. For example, Einstein’s general theory of relativity
(with which Popper was deeply impressed) made clear predictions that
could certainly be falsified if they were not correct. In contrast, Freud’s
theories of psychoanalysis (with which Popper was far less impressed)
could never be proven wrong. Thus, to Popper, relativity was science
but psychoanalysis was not.
Real scientists do not behave as Popper says they should, and
there is another problem with Popper’s criterion (or indeed any other
criterion) for demarcation: Would-be scientists read books too. If it
becomes widely accepted (and to some extent it has) that falsifiable
predictions are the signature of real science, then pretenders to the
throne of science will make falsifiable predictions too.15 There is no
simple, mechanical criterion for distinguishing real science from something that is not real science. That certainly does not mean, however,
that the job cannot be done. As I discuss below, the Supreme Court,
in the Daubert decision, has made a respectable stab at showing how
to do it.16
14. Institute of Medicine, Conflict of Interest in Medical Research, Education, and Practice
(Bernard Lo & Marilyn Field eds., 2009).
15. For a list of such pretenders, see Larry Laudan, Beyond Positivism and Relativism 219
(1996).
16. The Supreme Court in Daubert identified four nondefinitive factors that were thought to be
illustrative of characteristics of scientific knowledge: testability or falsifiability, peer review, a known or
potential error rate, and general acceptance within the scientific community. 509 U.S. at 590. Subsequent cases have expanded on these factors. See, e.g., In re TMI Litig. Cases Consol. II, 911 F. Supp.
775, 787 (M.D. Pa. 1995) (which considered the following additional factors: the relationship of the
technique to methods that have been established to be reliable, the qualifications of the expert witness
testifying based on the methodology, the nonjudicial uses of the method, logical or internal consistency
of the hypothesis, the consistency of the hypothesis with accepted theories, and the precision of the
hypothesis or theory). See generally Bert Black et al., Science and the Law in the Wake of Daubert: A New
Search for Scientific Knowledge, 72 Tex. L. Rev. 715, 783–84 (1994) (discussion of expanded list of factors).
Myth: Scientific theories are just that: theories. All scientific theories are
eventually proved wrong and are replaced by other theories.
Fact: The things that science has taught us about how the world works are
the most secure elements in all of human knowledge. Here I must
distinguish between science at the frontiers of knowledge (where by
definition we do not yet understand everything and where theories
are indeed vulnerable) and textbook science that is known with great
confidence. Matter is made of atoms, DNA transmits the blueprints of
organisms from generation to generation, light is an electromagnetic
wave—these things are not likely to be proved wrong. The theory of
relativity and the theory of evolution are in the same class and are still
called “theories” for historic reasons only.17 The GPS device in my
car routinely uses the general theory of relativity to make calculations
accurate enough to tell me exactly where I am and to take me to my
destination with unerring precision. The phenomenon of natural selection has been observed under numerous field conditions as well as in
controlled laboratory experiments.
In recent times, the courts have had much to say about the teaching of the theory of evolution in public schools.18 In one instance
the school district decided that students should be taught the “gaps/
problems” in Darwin’s theory and given “Intelligent Design” as an
alternative explanation. The court (Judge Jones of the United States
District Court for the Middle District of Pennsylvania) came down
hard on the side of Darwin, ruling that “Intelligent Design” was thinly
disguised religion that had no place in the science classroom.
It should be said here that the incorrect notion that all theories
must eventually be wrong is fundamental to the work of both Popper
and Kuhn, and these theorists have been crucial in helping us understand how science works. Thus, their theories, like good scientific
theories at the frontiers of knowledge, can be both useful and wrong.
Myth: Scientists are people of uncompromising honesty and integrity.
Fact: They would have to be if Bacon were right about how science works,
but he was not. Most scientists are rigorously honest where honesty
matters most to them: in the reporting of scientific procedures and data
in peer-reviewed publications. In all else, they are ordinary mortals.
17. According to the National Academy of Sciences and Institute of Medicine’s 2008 report
Science, Evolution, and Creationism, “the strength of a theory rests in part on providing scientists with
the basis to explain observed phenomena and to predict what they are likely to find when exploring
new phenomena and observations.” The report also helps differentiate a theory from a hypothesis, the
latter being a testable natural explanation that may offer tentative scientific insight.
18. Kitzmiller v. Dover Area School District, 400 F. Supp. 2d 707 (M.D. Pa. 2005).
VI. Comparing Science and the Law
Science and the law differ both in the language they use and the objectives they
seek to accomplish.
A. Language
Oscar Wilde (and G.B. Shaw too) once remarked that the United States and
England are two nations divided by a common language. Something similar can
be said, with perhaps more truth (if less wit), of science and the law. There are
any number of words commonly used in both disciplines, but with different
meanings.
For example, the word force, as it is used by lawyers, has connotations of violence and the domination of one person’s will over another, when used in phrases
such as “excessive use of force” and “forced entry.” In science, force is something
that, when applied to a body, causes its speed and direction of motion to change.
Also, all forces arise from a few fundamental forces, most notably gravity and the
electric force. The word carries no other baggage.
In contrast, the word evidence is used much more loosely in science than in
law. The law has precise rules of evidence that govern what is admissible and what
is not. In science, the word merely seems to mean something less than “proof.” A
certain number of the papers in any issue of a scientific journal will have titles that
begin with “Evidence for (or against) . . .” What that means is, the authors were
not able to prove their point, but are presenting their results anyway.
The word theory is a particularly interesting example of a word that has different meanings in each discipline. A legal theory is a proposal that fits the known
facts and legal precedents and that favors the attorney’s client. What’s required
of a theory in science is that it make new predictions that can be tested by new
experiments or observations and falsified or verified (as discussed above).
Even the word law has different meanings in the two disciplines. To a legal
practitioner, a law is something that has been promulgated by some human
authority, such as a legislature or parliament. In science, a law is a law of nature,
something that humans can hope to discover and describe accurately, but that can
never be changed by any human authority or intervention.
My final example is, to me, the most interesting of all. It is the word error.
In the law, and in common usage, error and mistake are more or less synonymous.
A legal decision can be overturned if it is found to be contaminated by judicial
error. In science, however, error and mistake have different meanings. Anyone can
make a mistake, and scientists have no obligation to report theirs in the scientific
literature. They just clean up the mess and go on to the next attempt. Error,
on the other hand, is intrinsic to any measurement, and far from ignoring it or
covering it up or even attempting to eliminate it, authors of every paper about a
scientific experiment will include a careful analysis of the errors to put limits on
the uncertainty in the measured result. To make mistakes is human, one might
say, but error is intrinsic to our interaction with nature, and is therefore part of
science.
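The distinction can be illustrated with a minimal sketch. The numbers and the code below are purely illustrative assumptions, not drawn from any real experiment: repeated measurements of the same quantity scatter, and the scientist reports the mean together with its standard error, putting explicit limits on the uncertainty rather than offering a bare value.

```python
import statistics

# Five hypothetical repeated measurements of the same quantity
# (illustrative numbers only, not taken from any real experiment).
measurements = [9.79, 9.81, 9.83, 9.80, 9.82]

mean = statistics.mean(measurements)
# Standard error of the mean: sample standard deviation / sqrt(n).
sem = statistics.stdev(measurements) / len(measurements) ** 0.5

# The result is reported with explicit limits on its uncertainty,
# "9.81 +/- 0.01", rather than as a bare number.
print(f"{mean:.2f} +/- {sem:.2f}")
```

The scatter here reflects error in the scientific sense, intrinsic to the measurement; a mistake, by contrast, would be something like transcribing 9.81 as 98.1, which the scientist would simply correct and never report.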
B. Objectives
Beyond the meanings of certain key words, science and the law differ fundamentally in their objectives. The objective of the law is justice; that of science
is truth.19 These are among the highest goals to which humans can aspire, but
they are not the same thing. Justice, of course, also seeks truth, but it requires
that clear decisions be made in a reasonable and limited period of time. In the
scientific search for truth there are no time limits and no point at which a final
decision must be made.
And yet, despite all these differences, science and the law share, at the deepest
possible level, the same aspirations and many of the same methods. Both disciplines seek, in structured debate and using empirical evidence, to arrive at rational
conclusions that transcend the prejudices and self-interest of individuals.
VII. A Scientist’s View of Daubert
In the 1993 Daubert decision, the U.S. Supreme Court took it upon itself to
resolve, once and for all, the knotty problem of the demarcation between science
and pseudoscience. Better yet, it undertook to enable every federal judge to solve
that problem in deciding the admissibility of each scientific expert witness in every
case that arises. In light of all the uncertainties discussed in this chapter, it must be
considered an ambitious thing to do.20
The presentation of scientific evidence in a court of law is a kind of shotgun
marriage between the two disciplines. Both are obliged to some extent to yield
19. This point was made eloquently by D. Allen Bromley in Science and the Law, Address at
the 1998 Annual Meeting of the American Bar Association (Aug. 2, 1998).
20. Chief Justice Rehnquist, responding to the majority opinion in Daubert, was the first to
express his uneasiness with the task assigned to federal judges, as follows: “I defer to no one in my
confidence in federal judges; but I am at a loss to know what is meant when it is said that the scientific
status of a theory depends on its ‘falsifiability,’ and I suspect some of them will be, too.” 509 U.S. at
579 (Rehnquist, C.J., concurring in part and dissenting in part). His concern was then echoed by Judge
Alex Kozinski when the case was reconsidered by the U.S. Court of Appeals for the Ninth Circuit
following remand by the Supreme Court. 43 F.3d 1311, 1316 (9th Cir. 1995) (“Our responsibility,
then, unless we badly misread the Supreme Court’s opinion, is to resolve disputes among respected,
well-credentialed scientists about matters squarely within their expertise, in areas where there is no
scientific consensus as to what is and what is not ‘good science,’ and occasionally to reject such expert
testimony because it was not ‘derived by the scientific method.’ Mindful of our position in the hierarchy of the federal judiciary, we take a deep breath and proceed with this heady task.”).
to the central imperatives of the other’s way of doing business, and it is likely
that neither will be shown in its best light. The Daubert decision is an attempt
(not the first, of course) to regulate that encounter. Judges are asked to decide the
“evidential reliability” of the intended testimony, based not on the conclusions to
be offered, but on the methods used to reach those conclusions.
In particular, Daubert says, the methods should be judged by the following
four criteria:
1. The theoretical underpinnings of the methods must yield testable predictions by means of which the theory could be falsified.
2. The methods should preferably be published in a peer-reviewed journal.
3. There should be a known rate of error that can be used in evaluating the
results.
4. The methods should be generally accepted within the relevant scientific
community.
In reading these four illustrative criteria mentioned by the Court, one is struck
immediately by the specter of Karl Popper looming above the robed justices. (It
is no mere illusion. The dependence on Popper is explicit in the written decision.) Popper alone is not enough, however, and the doctrine of falsification is
supplemented by a bow to the institution of peer review, an acknowledgment of
the scientific meaning of error, and a paradigm check (really, an inclusion of the
earlier Frye standard).21
The Daubert case and two others (General Electric v. Joiner22 and Kumho Tire
v. Carmichael23) have focused judges' attention more closely on scientific and
technical issues and have led to the increased exclusion of expert testimony,
but the Daubert criteria seem too general to resolve many of the difficult decisions
the courts face when considering scientific evidence. Nonetheless, despite some
inconsistency in rulings by various judges, the Daubert decision has given the
courts new flexibility, and so far, it has stood the test of time.
All in all, I would give the decision pretty high marks.24 The justices ventured
into the treacherous crosscurrents of the philosophy of science—where even most
scientists fear to tread—and emerged with at least their dignity intact. Falsifiability
may not be a good way of doing science, but it is not the worst a posteriori way
to judge science, and that is all that’s required here. At least they managed to avoid
the Popperian trap of demanding that the scientists be skeptical of their own ideas.
21. In Frye v. United States, 293 F. 1013, 1014 (D.C. Cir. 1923), the court stated that expert
opinion based on a scientific technique is inadmissible unless the technique is “generally accepted” as
reliable in the relevant scientific community.
22. 522 U.S. 136 (1997).
23. 526 U.S. 137 (1999).
24. For a contrary view, see Gary Edmond & David Mercer, Recognizing Daubert: What Judges
Should Know About Falsification, 5 Expert Evid. 29–42 (1996).
The other considerations help lend substance and flexibility.25 The jury is still out
(so to speak) on how well this decision will work in practice, but it is certainly an
impressive attempt to serve justice, if not truth. Applying it in practice will never
be easy, but then that is what this manual is about.26
25. See supra note 16.
26. For further reading, see John Ziman, Public Knowledge: An Essay Concerning the Social
Dimension of Science (Cambridge University Press 1968).
Reference Guide on
Forensic Identification Expertise
PAUL C. GIANNELLI, EDWARD J. IMWINKELRIED,
AND JOSEPH L. PETERSON
Paul C. Giannelli, LL.M., is Albert J. Weatherhead III and Richard W. Weatherhead
Professor of Law, and Distinguished University Professor, Case Western Reserve University.
Edward J. Imwinkelried, J.D., is Edward L. Barrett, Jr. Professor of Law and Director of
Trial Advocacy, University of California, Davis.
Joseph L. Peterson, D.Crim., is Professor of Criminal Justice and Criminalistics, California
State University, Los Angeles.
CONTENTS
I. Introduction, 57
II. Development of Forensic Identification Techniques, 58
III. Reappraisal of Forensic Identification Expertise, 60
A. DNA Profiling and Empirical Testing, 60
B. Daubert and Empirical Testing, 62
IV. National Research Council Report on Forensic Science, 64
A. Research, 66
B. Observer Effects, 67
C. Accreditation and Certification, 68
D. Proficiency Testing, 69
E. Standard Terminology, 70
F. Laboratory Reports, 70
V. Specific Techniques, 71
A. Terminology, 71
VI. Fingerprint Evidence, 72
A. The Technique, 73
B. The Empirical Record, 76
1. Proficiency testing, 78
2. The Mayfield case, 79
C. Case Law Development, 81
VII. Handwriting Evidence, 83
A. The Technique, 83
B. The Empirical Record, 85
1. Comparison of experts and laypersons, 86
2. Proficiency studies comparing experts’ performance to chance, 87
C. Case Law Development, 89
VIII. Firearms Identification Evidence, 91
A. The Technique, 91
1. Firearms, 91
2. Ammunition, 92
3. Class characteristics, 92
4. Subclass characteristics, 93
5. Individual characteristics, 93
6. Consecutive matching striae, 94
7. Cartridge identification, 94
8. Automated identification systems, 95
9. Toolmarks, 96
B. The Empirical Record, 97
C. Case Law Development, 100
IX. Bite Mark Evidence, 103
A. The Technique, 104
1. Theory of uniqueness, 105
2. Methods of comparison, 106
3. ABFO Guidelines, 107
B. The Empirical Record, 108
1. DNA exonerations, 109
C. Case Law Development, 110
1. Specificity of opinion, 111
2. Post-Daubert cases, 112
X. Microscopic Hair Evidence, 112
A. The Technique, 112
B. The Empirical Record, 113
1. Mitochondrial DNA, 116
2. Proficiency testing, 116
3. DNA exonerations, 117
C. Case Law Development, 117
XI. Recurrent Problems, 120
A. Clarity of Testimony, 120
B. Limitations on Testimony, 121
C. Restriction of Final Argument, 124
XII. Procedural Issues, 124
A. Pretrial Discovery, 125
1. Testifying beyond the report, 126
B. Defense Experts, 127
I. Introduction
Forensic identification expertise encompasses fingerprint, handwriting, firearms (“ballistics”), and toolmark comparisons, all of which are used by crime
laboratories to associate a suspect with, or dissociate a suspect from, a crime. Shoe and tire prints
also fall within this large pattern evidence domain. These examinations consist of
comparing a known exemplar with evidence collected at a crime scene or from
a suspect. Bite mark analysis can be added to this category, although it developed
within the field of forensic dentistry as an adjunct of dental identification and is
not conducted by crime laboratories. In a broad sense, the category includes trace
evidence such as the analysis of hairs, fibers, soil, glass, and wood. Some forensic
disciplines attempt to individuate and thus attribute physical evidence to a particular source—a person, object, or location.1 Other techniques are useful because
they narrow possible sources to a discrete category based upon what are known as
“class characteristics” (as opposed to “individual characteristics”). Moreover, some
techniques are valuable because they eliminate possible sources.
Following this introduction, Part II of this guide sketches a brief history of
the development of forensic expertise and crime laboratories. Part III discusses
the impact of the advent of DNA analysis and the Supreme Court’s 1993 Daubert
decision,2 developments that prompted a reappraisal of the trustworthiness of testimony by forensic identification experts. Part IV focuses on the 2009 National
Research Council (NRC) report on forensic science.3 Parts V through X examine
specific identification techniques: (1) fingerprint analysis, (2) questioned document
examination, (3) firearms and toolmark identification, (4) bite mark comparison,
and (5) microscopic hair analysis. Part XI considers recurrent problems, including
the clarity of expert testimony, limitations on its scope, and restrictions on closing
arguments. Part XII addresses procedural issues—pretrial discovery and access to
defense experts.
1. Some forensic scientists believe the word individualization is more accurate than identification.
Paul L. Kirk, The Ontogeny of Criminalistics, 54 J. Crim. L., Criminology & Police Sci. 235, 236 (1963).
The identification of a substance as heroin, for example, does not individuate, whereas a fingerprint
identification does.
2. Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579 (1993). Daubert is discussed in Margaret
A. Berger, The Admissibility of Expert Testimony, in this manual.
3. National Research Council, Strengthening Forensic Science in the United States: A Path
Forward (2009) [hereinafter NRC Forensic Science Report], available at http://www.nap.edu/catalog.
php?record_id=12589.
II. Development of Forensic Identification
Techniques
An understanding of the current issues requires some appreciation of the past.
The first reported fingerprint case was decided in 1911.4 This case preceded the
establishment of the first American crime laboratory, which was created in Los
Angeles in 1923.5 The Federal Bureau of Investigation (FBI) laboratory came
online in 1932. At its inception, the FBI laboratory performed only firearms
identification and fingerprint examination.6 Handwriting comparisons, trace evidence examinations, and serological testing of blood and semen were added later.
When initially established, crime laboratories handled a modest number of cases.
For example, in its first full year of operation, the FBI laboratory processed fewer
than 1000 cases.7
Several sensational cases in these formative years highlighted the value of
forensic identification evidence. The Sacco and Vanzetti trial in 1921 was one
of the earliest cases to rely on firearms identification evidence.8 In 1935, the
extensive use of handwriting comparison testimony9 and wood evidence10 at
the Lindbergh kidnapping trial raised the public consciousness of identification
expertise and solidified its role in the criminal justice system. Crime laboratories
soon sprang up in other large cities such as Chicago and New York.11
4. People v. Jennings, 96 N.E. 1077 (Ill. 1911).
5. See John I. Thornton, Criminalistics: Past, Present and Future, 11 Lex et Scientia 1, 23 (1975)
(“In 1923, Vollmer served as Chief of Police of the City of Los Angeles for a period of one year.
During that time, a crime laboratory was established at his direction.”).
6. See Federal Bureau of Investigation, U.S. Department of Justice, FBI Laboratory 3 (1981),
available at http://www.ncjrs.gov/App/publications/Abstract.aspx?id=78689.
7. See Anniversary Report, 40 Years of Distinguished Scientific Assistance to Law Enforcement, FBI Law
Enforcement Bull., Nov. 1972, at 4 (“During its first month of service, the FBI Laboratory examiners
handled 20 cases. In its first full year of operation, the volume increased to a total of 963 examinations.
By the next year that figure more than doubled.”).
8. See G. Louis Joughin & Edmund M. Morgan, The Legacy of Sacco & Vanzetti 15 (1948);
see also James E. Starrs, Once More Unto the Breech: The Firearms Evidence in the Sacco and Vanzetti Case
Revisited, Parts I & II, 31 J. Forensic Sci. 630, 1050 (1986).
9. See D. Michael Risinger et al., Exorcism of Ignorance as a Proxy for Rational Knowledge: The
Lessons of Handwriting Identification “Expertise,” 137 U. Pa. L. Rev. 731, 738 (1989).
10. See Shirley A. Graham, Anatomy of the Lindbergh Kidnapping, 42 J. Forensic Sci. 368 (1997).
The kidnapper had used a wooden ladder to reach the second-story window of the child’s bedroom.
Arthur Koehler, a wood technologist and identification expert for the Forest Products Laboratory of the
U.S. Forest Service, traced part of the ladder’s wood from its mill source to a lumberyard near the home
of the accused. Relying on plant anatomical comparisons, he also testified that a piece of the ladder
came from a floorboard in the accused’s attic.
11. See Joseph L. Peterson, The Crime Lab, in Thinking About Police 184, 185 (Carl Klockars
ed., 1983) (“[T]he Chicago Crime Laboratory has the distinction of being one of the oldest in the
country. Soon after, however, many other jurisdictions also built police laboratories in an attempt to
cope with the crimes of violence associated with the 1930s gangster era.”).
The number of laboratories gradually grew and then skyrocketed. The national campaign
against drug abuse led most crime laboratories to create forensic chemistry units,
and today the analysis of suspected contraband drugs constitutes more than 50%
of the caseload of many laboratories.12 By 2005, the nation’s crime laboratories
were handling approximately 2.7 million cases every year.13 According to a 2005
census, there are now 389 publicly funded crime laboratories in the United States:
210 state or regional laboratories, 84 county laboratories, 62 municipal laboratories, and 33 federal laboratories.14 Currently, these laboratories employ more than
11,900 full-time staff members.15
The establishment of crime laboratories represented a significant reform in
the types of evidence used in criminal trials. Previously, prosecutors had relied
primarily on eyewitness testimony and confessions. The reliability of physical evidence is often superior to that of other types of proof.16 However, the seeds of the
current controversies over forensic identification expertise were sown during this
period. Even though the various techniques became the stock-in-trade of crime
laboratories, many received their judicial imprimatur without a critical evaluation
of the supporting scientific research.17
This initial lack of scrutiny resulted, in part, from the deference that previous standards of admissibility accorded the community of specialists in the various
fields of expert testimony. In 1923, the D.C. Circuit adopted the “general acceptance” test for determining the admissibility of scientific evidence.
12. J. Peterson & M. Hickman, Bureau of Just. Stat. Bull. (Feb. 2005), NCJ 207205. In most
cases, the forensic chemist simply identifies the unknown as a particular drug. However, in some cases
the chemist attempts to individuate and establish that several drug samples originated from the same
production batch at a particular illegal drug laboratory. See Fabrice Besacier et al., Isotopic Analysis of
13C as a Tool for Comparison and Origin Assignment of Seized Heroin Samples, 42 J. Forensic Sci. 429
(1997); C. Sten et al., Computer Assisted Retrieval of Common-Batch Members in Leuckart Amphetamine
Profiling, 38 J. Forensic Sci. 1472 (1993).
13. Matthew R. Durose, Crime Labs Received an Estimated 2.7 Million Cases in 2005, Bureau of Just.
Stat. Bull. (July 2008), NCJ 222181, available at http://bjs.ojp.usdoj.gov/index.cfm?ty=pbdetail&iid=490
(summarizing statistics compiled by the Justice Department’s Bureau of Justice Statistics).
14. NRC Forensic Science Report, supra note 3, at 58.
15. Id. at 59.
16. For example, in 1927, Justice Frankfurter, then a law professor, sharply critiqued the eyewitness identifications in the Sacco and Vanzetti case. See Felix Frankfurter, The Case of Sacco and
Vanzetti 30 (1927) (“What is the worth of identification testimony even when uncontradicted? The
identification of strangers is proverbially untrustworthy.”). In 1936, the Supreme Court expressed
grave reservations about the trustworthiness of confessions wrung from a suspect by abusive interrogation techniques. See Brown v. Mississippi, 297 U.S. 278 (1936) (due process violated by beating
a confession out of a suspect).
17. “[F]ingerprints were accepted as an evidentiary tool without a great deal of scrutiny or
skepticism” of their underlying assumptions. Jennifer L. Mnookin, Fingerprint Evidence in an Age of DNA
Profiling, 67 Brook. L. Rev. 13, 17 (2001); see also Risinger et al., supra note 9, at 738 (“Our literature
search for empirical evaluation of handwriting identification turned up one primitive and flawed validity
study from nearly 50 years ago, one 1973 paper that raises the issue of consistency among examiners
but presents only uncontrolled impressionistic and anecdotal information not qualifying as data in any
rigorous sense, and a summary of one study in a 1978 government report. Beyond this, nothing.”).
The case, Frye
v. United States,18 involved a precursor of the modern polygraph. Although the
general acceptance test was applied mostly in polygraph cases for several decades, it
eventually became the majority pre-Daubert standard.19 Under that test,
scientific testimony is admissible if the underlying theory or technique is generally
accepted by the specialists within the expert’s field. The Frye test did not require
foundational proof of the empirical validity of the technique’s scientific premises.
III. Reappraisal of Forensic Identification
Expertise
The advent of DNA profiling in the late 1980s, quickly followed by the Supreme
Court’s 1993 Daubert decision (rejecting Frye), prompted a reassessment of identification expertise.20
A. DNA Profiling and Empirical Testing
In many ways, DNA profiling revolutionized the use of expert testimony in criminal cases.21 Population geneticists, often affiliated with universities, used statistical
techniques to define the extent to which a match of DNA markers individuated
the accused as the possible source of the crime scene sample.22 Typically, the
experts testified to a random-match probability, supporting their opinions by
pointing to extensive empirical testing.
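In rough outline, the random-match probability these experts reported is the product of the estimated population frequencies of the matching DNA markers, computed under a simplifying assumption that the loci are statistically independent (the “product rule”). A minimal sketch, with invented frequencies (the four-locus profile and all numbers are hypothetical, not drawn from any real database):

```python
# Hypothetical per-locus genotype frequencies in a reference population.
# These numbers are illustrative only and are not drawn from any real database.
locus_frequencies = [0.10, 0.05, 0.08, 0.12]

# Under the simplifying assumption that the loci are statistically independent,
# the random-match probability is the product of the per-locus frequencies
# (the "product rule").
rmp = 1.0
for f in locus_frequencies:
    rmp *= f

print(f"random-match probability: 1 in {round(1 / rmp):,}")
# → random-match probability: 1 in 20,833
```

Even this small hypothetical profile yields a probability on the order of one in tens of thousands, which is why profiles typed at many loci can produce the extremely small random-match probabilities reported in court.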
The fallout from the introduction of DNA analysis in criminal trials was significant in three ways. First, DNA profiling became the gold standard, regarded
as the most reliable of all forensic techniques.23 NRC issued two reports on the
18. 293 F. 1013 (D.C. Cir. 1923).
19. Frye was cited only five times in published opinions before World War II, mostly in polygraph cases. After World War II, it was cited 6 times before 1950, 20 times in the 1950s, and 21 times
in the 1960s. Bert Black et al., Science and the Law in the Wake of Daubert: A New Search for Scientific
Knowledge, 72 Tex. L. Rev. 715, 722 n.30 (1994).
20. See Michael J. Saks & Jonathan J. Koehler, The Coming Paradigm Shift in Forensic Identification
Science, 309 Science 892 (2005).
21. See People v. Wesley, 533 N.Y.S.2d 643, 644 (County Ct. 1988) (calling DNA evidence the
“single greatest advance in the ‘search for truth’ . . . since the advent of cross-examination”).
22. DNA Profiling is examined in detail in David H. Kaye & George Sensabaugh, Reference
Guide on DNA Identification Evidence, in this manual.
23. See Michael Lynch, God’s Signature: DNA Profiling, The New Gold Standard in Forensic Science, 27
Endeavour 2, 93 (2003); Joseph L. Peterson & Anna S. Leggett, The Evolution of Forensic Science: Progress
Amid the Pitfalls, 36 Stetson L. Rev. 621, 654 (2007) (“The scientific integrity and reliability of DNA testing have helped DNA replace fingerprinting and made DNA evidence the new ‘gold standard’ of forensic
evidence”); see also NRC Forensic Science Report, supra note 3, at 40–41 (the ascendancy of DNA).
subject, emphasizing the importance of certain practices: “No laboratory should
let its results with a new DNA typing method be used in court, unless it has
undergone . . . proficiency testing via blind trials.”24 Commentators soon pointed
out the broader implications of this development:
The increased use of DNA analysis, which has undergone extensive validation,
has thrown into relief the less firmly credentialed status of other forensic science
identification techniques (fingerprints, fiber analysis, hair analysis, ballistics, bite
marks, and tool marks). These have not undergone the type of extensive testing
and verification that is the hallmark of science elsewhere.25
Second, the DNA admissibility battles highlighted the absence of mandatory
regulation of crime laboratories.26 This situation began to change with the passage of the DNA Identification Act of 1994,27 the first federal statute regulating a
crime laboratory procedure. The Act authorized the creation of a national database
for the DNA profiles of convicted offenders as well as a database for unidentified
profiles from crime scenes: the Combined DNA Index System (CODIS). Bringing CODIS online was a major undertaking, and its successful operation required
an effective quality assurance program. As one government report noted, “the
integrity of the data contained in CODIS is extremely important since the DNA
matches provided by CODIS are frequently a key piece of evidence linking a
suspect to a crime.”28 The statute also established a DNA Advisory Board (DAB)
to assist in promulgating quality assurance standards29 and required proficiency
24. National Research Council, DNA Technology in Forensic Science 55 (1992) [hereinafter
NRC I], available at http://www.nap.edu/catalog.php?record_id=1866. A second report followed.
See National Research Council, The Evaluation of Forensic DNA Evidence (1996), available at http://
www.nap.edu/catalog.php?record_id=5141. The second report also recommended proficiency testing.
Id. at 88 (Recommendation 3.2: “Laboratories should participate regularly in proficiency tests, and the
results should be available for court proceedings.”).
25. Donald Kennedy & Richard A. Merrill, Assessing Forensic Science, 20 Issues in Sci. & Tech. 33,
34 (2003); see also Michael J. Saks & Jonathan J. Koehler, What DNA “Fingerprinting” Can Teach the Law
About the Rest of Forensic Science, 13 Cardozo L. Rev. 361, 372 (1991) (“[F]orensic scientists, like scientists
in all other fields, should subject their claims to methodologically rigorous empirical tests. The results
of these tests should be published and debated.”); Sandy L. Zabell, Fingerprint Evidence, 13 J.L. & Pol’y
143, 143 (2005) (“DNA identification has not only transformed and revolutionized forensic science, it
has also created a new set of standards that have raised expectations for forensic science in general.”).
26. In 1989, Eric Lander, a prominent molecular biologist who became enmeshed in the early
DNA admissibility disputes, wrote: “At present, forensic science is virtually unregulated—with the
paradoxical result that clinical laboratories must meet higher standards to be allowed to diagnose strep
throat than forensic labs must meet to put a defendant on death row.” Eric S. Lander, DNA Fingerprinting on Trial, 339 Nature 501, 505 (1989).
27. 42 U.S.C. § 14131 (2004).
28. Office of Inspector General, U.S. Department of Justice, Audit Report, The Combined
DNA Index System, ii (2001), available at http://www.justice.gov/oig/reports/FBI/a0126/final.pdf.
29. 42 U.S.C. § 14131(b). The legislation contained a “sunset” provision; DAB would expire
after 5 years unless extended by the Director of the FBI. The board was extended for several months
and then ceased to exist. The FBI had established the Technical Working Group on DNA Identification
Methods (TWGDAM) in 1988 to develop standards. TWGDAM functioned under DAB. It was
renamed the Scientific Working Group on DNA Analysis Methods (SWGDAM) in 1999 and replaced
DAB when the latter expired.
testing for FBI analysts as well as those in laboratories participating in the national
database or receiving federal funding.30
Third, the use of DNA evidence to exonerate innocent convicts led to a
reexamination of the evidence admitted to secure their original convictions.31
Some studies indicated that, after eyewitness testimony, forensic identification
evidence was one of the most common types of testimony that jurors relied on
at the earlier trials in returning erroneous verdicts.32 These studies suggested that
flawed forensic analyses may have contributed to the convictions.33
B. Daubert and Empirical Testing
The second major development prompting a reappraisal of forensic identification
evidence was the Daubert decision.34 Although there was some uncertainty about
the effect of the decision at the time Daubert was decided, the Court’s subsequent
cases, General Electric Co. v. Joiner35 and Kumho Tire Co. v. Carmichael,36 signaled
30. 42 U.S.C. § 14132(b)(2) (2004) (external proficiency testing for CODIS participation); id.
§ 14133(a)(1)(A) (2004) (FBI examiners). DAB Standard 13 implements this requirement. The Justice
for All Act, enacted in 2004, amended the statute, requiring all DNA labs to be accredited within
2 years “by a nonprofit professional association of persons actively involved in forensic science that is
nationally recognized within the forensic science community” and to “undergo external audits, not
less than once every 2 years, that demonstrate compliance with standards established by the Director
of the Federal Bureau of Investigation.” 42 U.S.C. § 14132(b)(2).
31. See Samuel R. Gross et al., Exonerations in the United States 1989 Through 2003, 95 J. Crim.
L. & Criminology 523, 543 (2005).
32. A study of 200 DNA exonerations found that expert testimony (55%) was the second leading type of evidence (after eyewitness identifications, 79%) used in the wrongful conviction cases.
Pre-DNA serology of blood and semen evidence was the most commonly used technique (79 cases).
Next came hair evidence (43 cases), soil comparison (5 cases), DNA tests (3 cases), bite mark evidence
(3 cases), fingerprint evidence (2 cases), dog scent (2 cases), spectrographic voice evidence (1 case),
shoe prints (1 case), and fibers (1 case). Brandon L. Garrett, Judging Innocence, 108 Colum. L. Rev. 55,
81 (2008). These data do not necessarily mean that the forensic evidence was improperly used. For
example, serological testing at the time of many of these convictions was simply not as discriminating as DNA profiling. Consequently, a person could be included using these serological tests but be
excluded by DNA analysis. Yet, some evidence was clearly misused. See also Paul C. Giannelli, Wrongful Convictions and Forensic Science: The Need to Regulate Crime Labs, 86 N.C. L. Rev. 163, 165–70,
172–207 (2007).
33. See Melendez-Diaz v. Massachusetts, 129 S. Ct. 2527, 2537 (2009) (citing Brandon L.
Garrett & Peter J. Neufeld, Invalid Forensic Science Testimony and Wrongful Convictions, 95 Va. L. Rev.
1, 34–84 (2009)). See also Brandon L. Garrett, Convicting the Innocent: Where Criminal Prosecutions
Go Wrong, ch. 4 (2011).
34. Daubert is discussed in detail in Margaret A. Berger, The Admissibility of Expert Testimony,
in this manual.
35. 522 U.S. 136 (1997).
36. 526 U.S. 137 (1999).
that the Daubert standard may often be more demanding than the traditional Frye
standard.37 Kumho extended the reliability requirement to all types of expert testimony, and in 2000, the Court characterized Daubert as imposing an “exacting”
standard for the admissibility of expert testimony.38
Daubert’s impact in civil cases is well documented.39 Although Daubert’s
effect on criminal litigation has been less pronounced,40 it nonetheless has partially changed the legal landscape. Defense attorneys invoked Daubert as the basis
for mounting attacks on forensic identification evidence, and a number of courts
view the Daubert trilogy as “inviting a reexamination even of ‘generally accepted’
venerable, technical fields.”41 Several courts have held that a forensic technique
is not exempt from Rule 702 scrutiny simply because it previously qualified for
admission under Frye’s general acceptance standard.42
In addition to enunciating a new reliability test, Daubert listed several factors
that trial judges may consider in assessing reliability. The first and most important Daubert factor is testability. Citing scientific authorities, the Daubert Court
noted that a hallmark of science is empirical testing. The Court quoted Hempel:
37. See United States v. Horn, 185 F. Supp. 2d 530, 553 (D. Md. 2002) (“Under Daubert, . . .
it was expected that it would be easier to admit evidence that was the product of new science or
technology. In practice, however, it often seems as though the opposite has occurred—application
of Daubert/Kumho Tire analysis results in the exclusion of evidence that might otherwise have been
admitted under Frye.”).
38. Weisgram v. Marley Co., 528 U.S. 440, 455 (2000).
39. See Lloyd Dixon & Brian Gill, Changes in the Standards of Admitting Expert Evidence in
Federal Civil Cases Since the Daubert Decision 25 (2002) (“[S]ince Daubert, judges have examined
the reliability of expert evidence more closely and have found more evidence unreliable as a result.”);
Margaret A. Berger, Upsetting the Balance Between Adverse Interests: The Impact of the Supreme Court’s
Trilogy on Expert Testimony in Toxic Tort Litigation, 64 Law & Contemp. Probs. 289, 290 (2001) (“The
Federal Judicial Center conducted surveys in 1991 and 1998 asking federal judges and attorneys about
expert testimony. In the 1991 survey, seventy-five percent of the judges reported admitting all proffered expert testimony. By 1998, only fifty-nine percent indicated that they admitted all proffered
expert testimony without limitation. Furthermore, sixty-five percent of plaintiff and defendant counsel
stated that judges are less likely to admit some types of expert testimony since Daubert.”).
40. See Jennifer L. Groscup et al., The Effects of Daubert on the Admissibility of Expert Testimony
in State and Federal Criminal Cases, 8 Psychol. Pub. Pol’y & L. 339, 364 (2002) (“[T]he Daubert decision did not impact on the admission rates of expert testimony at either the trial or the appellate
court levels.”); D. Michael Risinger, Navigating Expert Reliability: Are Criminal Standards of Certainty
Being Left on the Dock? 64 Alb. L. Rev. 99, 149 (2000) (“[T]he heightened standards of dependability
imposed on expertise proffered in civil cases has continued to expand, but . . . expertise proffered by
the prosecution in criminal cases has been largely insulated from any change in pre-Daubert standards
or approach.”).
41. United States v. Hines, 55 F. Supp. 2d 62, 67 (D. Mass. 1999) (handwriting comparison); see
also United States v. Hidalgo, 229 F. Supp. 2d 961, 966 (D. Ariz. 2002) (“Courts are now confronting
challenges to testimony, as here, whose admissibility had long been settled”; discussing handwriting
comparison).
42. See, e.g., United States v. Williams, 506 F.3d 151, 162 (2d Cir. 2007) (“Nor did [Daubert]
‘grandfather’ or protect from Daubert scrutiny evidence that had previously been admitted under
Frye.”); United States v. Starzecpyzel, 880 F. Supp. 1027, 1040 n.14 (S.D.N.Y. 1995).
“[T]he statements constituting a scientific explanation must be capable of empirical test,”43 and then Popper: “[T]he criterion of the scientific status of a theory
is its falsifiability, or refutability, or testability.”44 The other factors listed by the
Court are generally complementary. For example, the second factor, peer review
and publication, is a means to verify the results of the testing mentioned in the first
factor; and in turn, verification can lead to general acceptance of the technique
within the broader scientific community.45 These factors serve as circumstantial
evidence that other experts have examined the underlying research and found it
to be sound. Similarly, another factor, an error rate, is derived from testing.
IV. National Research Council Report on
Forensic Science
In 2005, the Science, State, Justice, Commerce, and Related Agencies Appropriations Act became law.46 The accompanying Senate report commented that,
“[w]hile a great deal of analysis exists of the requirements of the discipline of
DNA, there exists little or no analysis of the . . . needs of the [forensic] community outside of the area of DNA.”47 In the Act, Congress authorized the National
Academy of Sciences (NAS) to conduct a comprehensive study of the current
state of forensic science to develop recommendations. In fall 2006, the Academy
established the Committee on Identifying the Needs of the Forensic Science
Community within NRC to fulfill the task assigned by Congress. In February
2009, NRC released the report Strengthening Forensic Science in the United States:
A Path Forward.48
43. Carl G. Hempel, Philosophy of Natural Science 49 (1966).
44. Karl R. Popper, Conjectures and Refutations: The Growth of Scientific Knowledge 37
(5th ed. 1989).
45. In their amici brief in Daubert, the New England Journal of Medicine and other medical journals
observed:
“Good science” is a commonly accepted term used to describe the scientific community’s system of
quality control which protects the community and those who rely upon it from unsubstantiated scientific
analysis. It mandates that each proposition undergo a rigorous trilogy of publication, replication and
verification before it is relied upon.
Brief for the New England Journal of Medicine, Journal of the American Medical Association, and
Annals of Internal Medicine as Amici Curiae supporting Respondent at *2, Daubert v. Merrell Dow
Pharms., Inc., 509 U.S. 579 (1993) (No. 92-102), 1993 WL 13006387. Peer review’s “role is to promote the publication of well-conceived articles so that the most important review, the consideration
of the reported results by the scientific community, may occur after publication.” Id. at *3.
46. Pub. L. No. 109-108, 119 Stat. 2290 (2005).
47. S. Rep. No. 109-88, at 46 (2005).
48. NRC Forensic Science Report, supra note 3. The Supreme Court cited the report 3 months
later. Melendez-Diaz v. Massachusetts, 129 S. Ct. 2527 (2009).
In keeping with its congressional charge, the NRC committee did not address
admissibility issues. The NRC report stated: “No judgment is made about past
convictions and no view is expressed as to whether courts should reassess cases
that already have been tried.”49 When the report was released, the co-chair of the
NRC committee stated:
I want to make it clear that the committee’s report does not mean to offer any
judgments on any cases in the judicial system. The report does not assess past
criminal convictions, nor does it speculate about pending or future cases. And
the report offers no proposals for law reform. That was beyond our charge.
Each case in the criminal justice system must be decided on the record before
the court pursuant to the applicable law, controlling precedent, and governing
rules of evidence. The question whether forensic evidence in a particular case is
admissible under applicable law is not coterminous with the question whether
there are studies confirming the scientific validity and reliability of a forensic
science discipline.50
Yet, in one passage, the report remarked: “Much forensic evidence—including,
for example, bite marks and firearm and toolmark identifications—is introduced in
criminal trials without any meaningful scientific validation, determination of error
rates, or reliability testing to explain the limits of the discipline.”51 Moreover, the
report did discuss a number of forensic techniques and, where relevant, passages
from the report are cited throughout this chapter.
As the NRC report explained, its primary focus is forward-looking—to outline an “agenda for progress.”52 The report’s recommendations are wide-ranging,
covering diverse topics such as medical examiner systems,53 interoperability of the
automated fingerprint systems,54 education and training in the forensic sciences,55
codes of ethics,56 and homeland security issues.57 Some recommendations are
49. Id. at 85. The report goes on to state:
The report finds that the existing legal regime—including the rules governing the admissibility of forensic evidence, the applicable standards governing appellate review of trial court decisions, the limitations
of the adversary process, and judges and lawyers who often lack the scientific expertise necessary to
comprehend and evaluate forensic evidence—is inadequate to the task of curing the documented ills of
the forensic science disciplines.
Id.
50. Harry T. Edwards, Co-Chair, Forensic Science Committee, Opening Statement of Press
Conference (Feb. 18, 2009), transcript available at http://www.nationalacademies.org/includes/
OSEdwards.pdf.
51. NRC Forensic Science Report, supra note 3, at 107–08.
52. Id. at xix.
53. Recommendation 10 (urging the replacement of the coroner with medical examiner system
in medicolegal death investigation).
54. Recommendation 11.
55. Recommendation 2.
56. Recommendation 9.
57. Recommendation 12.
structural—that is, the creation of an independent federal entity (to be named the
National Institute of Forensic Sciences) to oversee the field58 and the removal of
crime laboratories from the “administrative” control of law enforcement agencies.59 The National Institute of Forensic Sciences would be responsible for
(1) establishing and enforcing best practices for forensic science professionals
and laboratories; (2) setting standards for the mandatory accreditation of crime
laboratories and the mandatory certification of forensic scientists; (3) promoting
scholarly, competitive, peer-reviewed research and technical development in
the forensic sciences; and (4) developing a strategy to improve forensic science
research. Congressional action would be needed to establish the institute. Several
other recommendations are discussed below.
A. Research
The NRC report urged funding for additional research “to address issues of
accuracy, reliability, and validity in the forensic science disciplines.”60 In the
report’s words, “[a]mong existing forensic methods, only nuclear DNA analysis
has been rigorously shown to have the capacity to consistently, and with a high
degree of certainty, demonstrate a connection between an evidentiary sample
and a specific individual or source.”61 In another passage, the report discussed
the need for further research into the premises underlying forensic disciplines
other than DNA:
A body of research is required to establish the limits and measures of performance and to address the impact of sources of variability and potential bias.
Such research is sorely needed, but it seems to be lacking in most of the forensic
disciplines that rely on subjective assessments of matching characteristics. These
disciplines need to develop rigorous protocols to guide these subjective interpretations and pursue equally rigorous research and evaluation programs.62
58. Recommendation 1.
59. Recommendation 4.
60. Id. at 22 (Recommendation 3).
61. Id. at 100; see also id. at 7 & 87.
62. Id. at 8; see also id. at 15 (“Of the various facets of underresourcing, the committee is most
concerned about the knowledge base. Adding more dollars and people to the enterprise might reduce
case backlogs, but it will not address fundamental limitations in the capabilities of forensic science disciplines to discern valid information from crime scene evidence.”); id. at 22 (“[S]ome forensic science
disciplines are supported by little rigorous systematic research to validate the discipline’s basic premises
and techniques. There is no evident reason why such research cannot be conducted.”).
B. Observer Effects
Another recommendation focuses on research to investigate observer bias and
other sources of human error in forensic examinations.63 According to the psychological theory of observer effects, external information provided to persons conducting analyses may taint their conclusions—a serious problem in techniques
with a subjective component.64 A growing body of modern research, noted in
the report,65 demonstrates that exposure to such information can affect forensic
science experts. For example, a handwriting examiner who is informed that an
exemplar belongs to the prime suspect in a case may be subconsciously influenced by this information.66
One of the first studies to document the biasing effect was a research project
involving hair analysts.67 Some recent studies involving fingerprints have found
biasing.68 Another study concluded that external information had an effect, but not in the direction of producing errors. Instead, these researchers found fewer definitive and
63. Recommendation 8:
Such programs might include studies to determine the effects of contextual bias in forensic practice
(e.g., studies to determine whether and to what extent the results of forensic analyses are influenced by
knowledge regarding the background of the suspect and the investigator’s theory of the case). In addition, research on sources of human error should be closely linked with research conducted to quantify
and characterize the amount of error.
64. See generally D. Michael Risinger et al., The Daubert/Kumho Implications of Observer Effects
in Forensic Science: Hidden Problems of Expectation and Suggestion, 90 Cal. L. Rev. 1 (2002).
65. NRC Forensic Science Report, supra note 3, at 139 n.23 & 185 n.2.
66. See L.S. Miller, Bias Among Forensic Document Examiners: A Need for Procedural Change, 12
J. Police Sci. & Admin. 407, 410 (1984) (“The conclusions and opinions reported by the examiners
supported the bias hypothesis.”). Confirmation bias is another illustration. The FBI noted the problem
in its internal investigation of the Mayfield case. A review by another examiner was not conducted
blind—that is, the reviewer knew that a positive identification had already been made—and thus was
subject to the influence of confirmation bias. Robert B. Stacey, A Report on the Erroneous Fingerprint
Individualization in the Madrid Train Bombing Case, 54 J. Forensic Identification 707 (2004).
67. See Larry S. Miller, Procedural Bias in Forensic Science Examinations of Human Hair, 11 Law
& Hum. Behav. 157 (1987). In the conventional method, the examiner is given hair samples from a
known suspect along with a report including other facts and information relating to the guilt of the
suspect. “The findings of the present study raise some concern regarding the amount of unintentional
bias among human hair identification examiners. . . . A preconceived conclusion that a questioned hair
sample and a known hair sample originated from the same individual may influence the examiner’s
opinion when the samples are similar.” Id. at 161.
68. See Itiel Dror & Robert Rosenthal, Meta-analytically Quantifying the Reliability and Biasability
of Forensic Experts, 53 J. Forensic Sci. 900 (2008); Itiel E. Dror et al., Contextual Information Renders
Experts Vulnerable to Making Erroneous Identifications, 156 Forensic Sci. Int’l 74 (2006); Itiel Dror et al.,
When Emotions Get the Better of Us: The Effect of Contextual Top-Down Processing on Matching Fingerprints,
19 App. Cognit. Psychol. 799 (2005).
erroneous judgments.69 In any event, forensic examinations should, to the extent
feasible, be conducted “blind.”70
C. Accreditation and Certification
The NRC report called for the mandatory accreditation of crime labs and the
certification of examiners.71 Accreditation and certification standards should be
based on recognized international standards, such as those published by the International Organization for Standardization (ISO). According to the report, no
person (public or private) ought to practice or testify as a forensic expert without
certification.72 In addition, laboratories should establish “quality assurance and
quality control procedures to ensure the accuracy of forensic analyses and the
work of forensic practitioners.”73
The American Society of Crime Lab Directors/Laboratory Accreditation
Board (ASCLD/LAB) is the principal accrediting organization in the United
States. Accreditation requirements generally include ensuring the integrity of
evidence, adhering to valid and generally accepted procedures, employing qualified examiners, and operating quality assurance programs—that is, proficiency
testing, technical reviews, audits, and corrective action procedures.74 Currently,
accreditation is mostly voluntary. Only a few states require accreditation of crime
69. Glenn Langenburg et al., Testing for Potential Contextual Bias Effects During the Verification Stage
of the ACE-V Methodology When Conducting Fingerprint Comparisons, 54 J. Forensic Sci. 571 (2009). As
the researchers acknowledge, the examiners knew that they were being tested.
70. See Mike Redmayne, Expert Evidence and Criminal Justice 16 (2001) (“To the extent that
we are aware of our vulnerability to bias, we may be able to control it. In fact, a feature of good scientific practice is the institution of processes—such as blind testing, the use of precise measurements,
standardized procedures, statistical analysis—that control for bias.”).
71. Recommendation 3; see also NRC Forensic Science Report, supra note 3, at 23 (“In short,
oversight and enforcement of operating standards, certification, accreditation, and ethics are lacking
in most local and state jurisdictions.”).
72. Id., Recommendation 7. The recommendation goes on to state:
Certification requirements should include, at a minimum, written examinations, supervised practice,
proficiency testing, continuing education, recertification procedures, adherence to a code of ethics, and
effective disciplinary procedures. All laboratories (public or private) should be accredited and all forensic
science professionals should be certified, when eligible, within a time period established by NIFS.
73. Id., Recommendation 8. The recommendation further comments: “Quality control procedures should be designed to: identify mistakes, fraud, and bias; confirm the continued validity and
reliability of standard operating procedures and protocols; ensure that best practices are being followed;
and correct procedures and protocols that are found to need improvement.”
74. See Jan S. Bashinski & Joseph L. Peterson, Forensic Sciences, in Local Government: Police
Management 559, 578 (William Geller & Darrel Stephens eds., 4th ed. 2004).
laboratories.75 New York mandated accreditation in 1994.76 Texas77 and Oklahoma78 followed after major crime laboratory failures.
D. Proficiency Testing
Several of the report’s recommendations referred to proficiency testing,79 of which
there are several types: internal or external, and blind or nonblind (declared).80
The results of the first Laboratory Proficiency Testing Program, sponsored by the
Law Enforcement Assistance Administration (LEAA), were reported in 1978.81
Voluntary proficiency testing continued after this study.82 The DNA Identification
Act of 1994 mandated proficiency testing for examiners at the FBI as well as for
75. The same is true for certification. NRC Forensic Science Report, supra note 3, at 6 (“[M]ost
jurisdictions do not require forensic practitioners to be certified, and most forensic science disciplines
have no mandatory certification program.”).
76. N.Y. Exec. Law § 995-b (McKinney 2003) (requiring accreditation by the state Forensic Science Commission); see also Cal. Penal Code § 297 (West 2004) (requiring accreditation of DNA units
by ASCLD/LAB or any certifying body approved by ASCLD/LAB); Minn. Stat. Ann. § 299C.156(2)
(4) (West Supp. 2006) (specifying that the Forensic Science Advisory Board should encourage accreditation by ASCLD/LAB or other accrediting body).
77. Tex. Code Crim. Proc. Ann. art. 38.35 (Vernon 2004) (requiring accreditation by the
Department of Public Safety). Texas also created a Forensic Science Commission. Id. art. 38.01 (2007).
78. Okla. Stat. Ann. tit. 74, § 150.37(D) (West 2004) (requiring accreditation by ASCLD/LAB
or the American Board of Forensic Toxicology).
79. Recommendations 6 & 7.
80. Proficiency testing does not automatically correlate with a technique’s “error rate.” There
is a question whether error rate should be based on the results of declared and/or blind proficiency
tests of simulated evidence administered to crime laboratories, or if this rate should be based on the
retesting of actual case evidence drawn randomly (1) from the files of crime laboratories or (2) from
evidence presented to courts in prosecuted and/or contested cases.
81. Joseph L. Peterson et al., Crime Laboratory Proficiency Testing Research Program (1978)
[hereinafter Laboratory Proficiency Test]. The report concluded: “A wide range of proficiency levels
among the nation’s laboratories exists, with several evidence types posing serious difficulties for the
laboratories. . . .” Id. at 3. Although the proficiency tests identified few problems in certain forensic
disciplines such as glass analysis, tests of other disciplines such as hair analysis produced very high rates
of “unacceptable proficiency.” According to the report, unacceptable proficiency was most often
caused by (1) misinterpretation of test results due to carelessness or inexperience, (2) failure to employ
adequate or appropriate methodology, (3) mislabeling or contamination of primary standards, and
(4) inadequate databases or standard spectra. Id. at 258.
82. See Joseph L. Peterson & Penelope N. Markham, Crime Laboratory Proficiency Testing
Results, 1978–1991, Part I: Identification and Classification of Physical Evidence, 40 J. Forensic Sci. 994
(1995); Joseph L. Peterson & Penelope N. Markham, Crime Laboratory Proficiency Testing Results,
1978–1991, Part II: Resolving Questions of Common Origin, 40 J. Forensic Sci. 1009 (1995). After
collaborating with the Forensic Sciences Foundation in the initial LEAA-funded crime laboratory proficiency testing research program, Collaborative Testing Services, Inc. (CTS) began in
1978 to offer a fee-based testing program. Today, CTS offers samples in many scientific evidence
testing areas to more than 500 forensic science laboratories worldwide. See test results at www.
collaborativetesting.com/.
analysts in laboratories that participate in the national DNA database or receive
federal funding.83
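When proficiency-test results are offered as a measure of a technique's error rate, the statistical uncertainty in the observed rate matters as much as the rate itself. The sketch below is purely illustrative: the counts are hypothetical and are not drawn from any actual proficiency program. It computes a Wilson score confidence interval for an observed error rate:

```python
import math

def error_rate_ci(errors, trials, z=1.96):
    """Wilson score 95% confidence interval for a proportion."""
    p = errors / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return p, center - half, center + half

# Hypothetical: 4 erroneous results reported in 200 declared proficiency tests.
rate, lo, hi = error_rate_ci(4, 200)
print(f"observed error rate {rate:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```

A small number of observed errors in a modest number of tests leaves a wide interval, and any such figure also depends heavily on whether the tests were blind or declared, internal or external.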
E. Standard Terminology
The NRC report voiced concern about the use of terms such as “match,” “consistent with,” “identical,” “similar in all respects tested,” and “cannot be excluded
as the source of.” These terms can have “a profound effect on how the trier of fact
in a criminal or civil matter perceives and evaluates scientific evidence.”84 Such
terms need to be defined and standardized, according to the report.
F. Laboratory Reports
A related recommendation concerns laboratory reports and the need for model
formats.85 The NRC report commented:
As a general matter, laboratory reports generated as the result of a scientific
analysis should be complete and thorough. They should contain, at minimum,
“methods and materials,” “procedures,” “results,” “conclusions,” and, as appropriate, sources and magnitudes of uncertainty in the procedures and conclusions
(e.g., levels of confidence). Some forensic science laboratory reports meet this
standard of reporting, but many do not. Some reports contain only identifying
and agency information, a brief description of the evidence being submitted,
a brief description of the types of analysis requested, and a short statement of
the results (e.g., “the greenish, brown plant material in item #1 was identified
as marijuana”), and they include no mention of methods or any discussion of
measurement uncertainties.86
In addition, reports “must include clear characterizations of the limitations of
the analyses, including measures of uncertainty in reported results and associated
estimated probabilities where possible.”87
83. 42 U.S.C. § 14131(c) (2005). The DNA Act authorized a study of the feasibility of blind
proficiency testing; that study raised questions about the cost and practicability of this type of examination, as well as its effectiveness when compared with other methods of quality assurance such as
accreditation and more stringent external case audits. Joseph L. Peterson et al., The Feasibility of External
Blind DNA Proficiency Testing. 1. Background and Findings, 48 J. Forensic Sci. 21, 30 (2003) (“In the
extreme, blind proficiency testing is possible, but fraught with problems (including costs), and it is
recommended that a blind proficiency testing program be deferred for now until it is more clear how
well implementation of the first two recommendations [accreditation and external case audits] are
serving the same purposes as blind proficiency testing.”).
84. NRC Forensic Science Report, supra note 3, at 21.
85. Id. at 22, Recommendation 2.
86. Id. at 21.
87. Id. at 21–22.
V. Specific Techniques
The broad field of forensic science includes disparate disciplines such as forensic
pathology, forensic anthropology, arson investigation, and gunshot residue testing.88 The NRC report explained:
Some of the forensic science disciplines are laboratory based (e.g., nuclear and
mitochondrial DNA analysis, toxicology and drug analysis); others are based on
expert interpretation of observed patterns (e.g., fingerprints, writing samples,
toolmarks, bite marks, and specimens such as hair). . . . There are also sharp
distinctions between forensic practitioners who have been trained in chemistry,
biochemistry, biology, and medicine (and who bring these disciplines to bear in
their work) and technicians who lend support to forensic science enterprises.89
The report devoted special attention to forensic disciplines in which the expert’s
final decision is subjective in nature: “In terms of scientific basis, the analytically
based disciplines generally hold a notable edge over disciplines based on expert
interpretation.”90 Moreover, many of the subjective techniques attempt to render
the most specific conclusions—that is, opinions concerning “individualization.”91
Following the report’s example, the remainder of this chapter focuses on “pattern
recognition” disciplines, each of which contains a subjective component. These
disciplines exemplify most of the issues that a trial judge may encounter in ruling
on the admissibility of forensic testimony. Each part describes the technique, the
available empirical research, and contemporary case law.
A. Terminology
Although courts often use the terms “validity” and “reliability” interchangeably, the terms have distinct meanings in scientific disciplines. “Validity” refers
to the ability of a test to measure what it is supposed to measure—its accuracy.
“Reliability” refers to whether the same results are obtained in each instance in
which the test is performed—its consistency. Validity includes reliability, but the
converse is not necessarily true. Thus, a reliable, invalid technique will consistently
88. Other examples include drug analysis, blood spatter examinations, fiber comparisons, toxicology, entomology, voice spectrometry, and explosives and bomb residue analysis. As the Supreme
Court noted in Melendez-Diaz v. Massachusetts, 129 S. Ct. 2527, 2537–38 (2009), errors can be
made when instrumental techniques, such as gas chromatography/mass spectrometry analysis, are used.
89. NRC Forensic Science Report, supra note 3, at 7.
90. Id.
91. “Often in criminal prosecutions and civil litigation, forensic evidence is offered to support
conclusions about ‘individualization’ (sometimes referred to as ‘matching’ a specimen to a particular
individual or other source) or about classification of the source of the specimen into one of several
categories. With the exception of nuclear DNA analysis, however, no forensic method has been rigorously shown to have the capacity to consistently, and with a high degree of certainty, demonstrate a
connection between evidence and a specific individual or source.” Id.
yield inaccurate results. The Supreme Court acknowledged this distinction in
Daubert, but the Court indicated that it was using the term “reliability” in a different sense. The Court wrote that its concern was “evidentiary reliability—that
is, trustworthiness. . . . In a case involving scientific evidence, evidentiary reliability
will be based upon scientific validity.”92
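The distinction can be illustrated with a simple simulation. In the sketch below, the measured quantity, the bias, and the noise levels are all invented for the example; it merely contrasts a reliable-but-invalid instrument, which consistently reports the wrong value, with a valid one:

```python
import random

random.seed(0)  # fixed seed so the illustration is reproducible
TRUE_VALUE = 100.0  # the quantity the test is supposed to measure (invented)

def biased_but_consistent():
    # Reliable but invalid: readings cluster tightly around the wrong value.
    return 120.0 + random.gauss(0, 0.5)

def valid_instrument():
    # Valid: readings cluster tightly around the true value.
    return TRUE_VALUE + random.gauss(0, 0.5)

def summarize(readings):
    mean = sum(readings) / len(readings)
    spread = max(readings) - min(readings)
    return mean, spread

mean_b, spread_b = summarize([biased_but_consistent() for _ in range(1000)])
mean_v, spread_v = summarize([valid_instrument() for _ in range(1000)])

# Both instruments are consistent (small spread), i.e., "reliable,"
# but only the second is accurate, i.e., "valid."
print(f"biased instrument: mean={mean_b:.1f}, spread={spread_b:.1f}")
print(f"valid instrument:  mean={mean_v:.1f}, spread={spread_v:.1f}")
```

Both instruments produce tightly clustered readings and are therefore "reliable" in the scientific sense, but only the second measures what it purports to measure.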
In forensic science, class and individual characteristics are distinguished. Class
characteristics are shared by a group of persons or objects (e.g., ABO blood
types).93 Individual characteristics are unique to an object or person. The term
“match” is ambiguous because it is sometimes used to indicate the “matching”
of individual characteristics, but on other occasions it is used to refer to “matching” class characteristics (e.g., blood type A at a crime scene “matches” suspect’s
type A blood). Expert opinions involving “individual” and “class” characteristics
raise different issues. In the former, the question is whether an individuation
determination rests on a firm scientific foundation.94 For the latter, the question
is determining the size of the class.95
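The evidential weight of shared class characteristics can be illustrated arithmetically. In the sketch below, the population frequencies are hypothetical, and the multiplication assumes the characteristics are statistically independent, itself an empirical premise that would require validation:

```python
# Hypothetical class-characteristic frequencies (illustrative only; real
# figures must come from validated population databases).
frequencies = {
    "blood type A": 0.40,
    "bullet of the same caliber": 0.25,
    "rifling with 6 grooves, right twist": 0.30,
}

# Probability that a random source shares ALL the class characteristics,
# assuming (critically) that the characteristics are independent.
combined = 1.0
for freq in frequencies.values():
    combined *= freq

population = 1_000_000  # potential sources in a hypothetical large city
expected_matches = combined * population

print(f"combined frequency under independence: {combined:.3f}")
print(f"expected sources sharing all characteristics: {expected_matches:,.0f}")
```

Even a combined frequency of 3% leaves tens of thousands of potential sources in a large city, which is why the size of the class matters to the trier of fact.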
VI. Fingerprint Evidence
Sir William Herschel, an Englishman serving in the Indian civil service, and
Henry Faulds, a Scottish physician serving as a missionary in Japan, were among
the first to suggest the use of fingerprints as a means of personal identification.
Since 1858, Herschel had been collecting the handprints of natives for that
purpose. In 1880, Faulds published an article entitled “On the Skin—Furrows
92. 509 U.S. at 590 n.9 (“We note that scientists typically distinguish between ‘validity’ (does
the principle support what it purports to show?) and ‘reliability’ (does application of the principle
produce consistent results?). . . .”).
93. See Bashinski & Peterson, supra note 74, at 566 (“The forensic scientist first investigates
whether items possess similar ‘class’ characteristics—that is, whether they possess features shared by all
objects or materials in a single class or category. (For firearms evidence, bullets of the same caliber,
bearing rifling marks of the same number, width, and direction of twist, share class characteristics. They
are consistent with being fired from the same type of weapon.) The forensic scientist then attempts to
determine an item’s ‘individuality’—the features that make one thing different from all others similar
to it, including those with similar class characteristics.”).
94. See Michael Saks & Jonathan Koehler, The Individualization Fallacy in Forensic Science Evidence,
61 Vand. L. Rev. 199 (2008).
95. See Margaret A. Berger, Procedural Paradigms for Applying the Daubert Test, 78 Minn. L.
Rev. 1345, 1356–57 (1994) (“We allow eyewitnesses to testify that the person fleeing the scene wore
a yellow jacket and permit proof that a defendant owned a yellow jacket without establishing the
background rate of yellow jackets in the community. Jurors understand, however, that others than
the accused own yellow jackets. When experts testify about samples matching in every respect, the
jurors may be oblivious to the probability concerns if no background rate is offered, or may be unduly
prejudiced or confused if the probability of a match is confused with the probability of guilt, or if a
background rate is offered that does not have an adequate scientific foundation.”).
of the Hand” in Nature.96 Sir Francis Galton authored the first textbook on the
subject.97 Individual ridge characteristics came to be known as “Galton details.”98
Subsequently, Edward Henry, the Inspector General of Police in Bengal, realized the potential of fingerprinting for law enforcement and helped establish the
Fingerprint Branch at Scotland Yard when he was recalled to England in 1901.99
English and American courts have accepted fingerprint identification testimony for just over a century. “The first English appellate endorsement of fingerprint identification testimony was the 1906 opinion in Rex v. Castleton. . . . In
1906 and 1908, Sergeant Joseph Faurot, a New York City detective who had in
1904 been posted to Scotland Yard to learn about fingerprinting, used his new
training to break open two celebrated cases: in each instance fingerprint identification led the suspect to confess. . . .”100 A 1911 Illinois Supreme Court decision,
People v. Jennings,101 is the first published American appellate opinion sustaining
the admission of fingerprint testimony.
Over the years, fingerprint analysis became the gold standard of forensic
identification expertise. In fact, proponents of new, emerging forensic techniques sometimes sought to borrow the prestige of fingerprint analysis for their methods. Thus, advocates of sound spectrography referred to it as
“voiceprint” analysis.102 Likewise, some early proponents of DNA typing alluded
to it as “DNA fingerprinting.”103 However, as previously noted, DNA analysis
has replaced fingerprint analysis as the gold standard.
A. The Technique
Even a cursory study of fingerprints establishes that there is “intense variability . . .
in even small areas of prints.”104 Given that variability, it is generally assumed that
an identification is possible if the comparison involves two sets of clear images of
all 10 fingerprints. These are known as “record” prints and are typically rolled
onto a fingerprint card or digitized and scanned into an electronic file. Two
complete fingerprint sets are available for comparison in some settings such as
96. Henry Faulds, On the Skin—Furrows of the Hand, 22 Nature 605 (1880). See generally Simon
Cole, Suspect Identities: A History of Fingerprint and Criminal Identification (2001).
97. Francis Galton, Fingerprints (1892).
98. See Andre A. Moenssens, Scientific Evidence in Civil and Criminal Cases § 10.02, at 621
(5th ed. 2007).
99. United States v. Llera Plaza, 188 F. Supp. 2d 549, 554 (E.D. Pa. 2002).
100. Id. at 572.
101. 96 N.E. 1077 (Ill. 1911).
102. Kenneth Thomas, Voiceprint—Myth or Miracle, in Scientific and Expert Evidence 1015 (2d
ed. 1981).
103. Colin Norman, Maine Case Deals Blow to DNA Fingerprinting, 246 Science 1556 (Dec. 22,
1989).
104. David A. Stoney, Scientific Status, in 4 David L. Faigman et al., Modern Scientific Evidence:
The Law and Science of Expert Testimony § 32:45, at 361 (2007–2008 ed.).
immigration matters. However, in the law enforcement setting, the task is more
challenging because only a partial impression (latent print) of a single finger may
be left by a criminal.
Fingerprint evidence is based on three assumptions: (1) the uniqueness of
each person’s friction ridges, (2) the permanence of those ridges throughout
a person’s life, and (3) the transferability of an impression of that uniqueness
to another surface. The last point raises the most significant issue of reliability
because a crime scene (latent) impression is often only a fifth of the size of
the record print. Furthermore, variations in pressure and skin elasticity almost
inevitably distort the impression.105 Consequently, fingerprint impressions from
the same person typically differ in some respects each time the impression is left
on an object.106
Although fingerprint analysis is based on physical characteristics, the final
step in the analysis—the formation of an opinion regarding individuation—is
subjective.107 Examiners lack population frequency data to quantify how rare or
common a particular type of fingerprint characteristic is.108 Rather, in making
that judgment, the examiner relies on personal experience and discussions with
colleagues. Although examiners in some countries must find a certain minimum
number of points of similarities between the latent and the known before declaring a match,109 neither the FBI nor New Scotland Yard requires any set number.110 A single inexplicable difference between the two impressions precludes
finding a match. Because there are frequently “dissimilarities” between the crime
scene and record prints, the examiner must decide whether there is a true dissimilarity, or whether the apparent dissimilarity can be discounted as an artifact or resulting from distortion.111
105. See United States v. Mitchell, 365 F.3d 215, 220–21 (3d Cir. 2004) (“Criminals generally do not leave behind full fingerprints on clean, flat surfaces. Rather, they leave fragments that are often distorted or marred by artifacts. . . . Testimony at the Daubert hearing suggested that the typical latent print is a fraction—perhaps 1/5th—of the size of a full fingerprint.”). “In the jargon, artifacts are generally small amounts of dirt or grease that masquerade as parts of the ridge impressions seen in a fingerprint, while distortions are produced by smudging or too much pressure in making the print, which tends to flatten the ridges on the finger and obscure their detail.” Id. at 221 n.1.
106. NRC Forensic Science Report, supra note 3, at 144 (“The impression left by a given finger will differ every time, because of inevitable variations in pressure, which change the degree of contact between each part of the ridge structure and the impression medium.”).
107. See Commonwealth v. Patterson, 840 N.E.2d 12, 15, 16–17 (Mass. 2005) (“These latent print impressions are almost always partial and may be distorted due to less than full, static contact with the object and to debris covering or altering the latent impression”; “In the evaluation stage, . . . the examiner relies on his subjective judgment to determine whether the quality and quantity of those similarities are sufficient to make an identification, an exclusion, or neither”); Zabell, supra note 25, at 158 (“In contrast to the scientifically-based statistical calculations performed by a forensic scientist in analyzing DNA profile frequencies, each fingerprint examiner renders an opinion as to the similarity of friction ridge detail based on his subjective judgment.”).
108. NRC Forensic Science Report, supra note 3, at 139–40 & 144.
109. Stoney, supra note 104, § 32:34, at 354–55.
110. United States v. Llera Plaza, 188 F. Supp. 2d 549, 566–71 (E.D. Pa. 2002).
Three levels of detail may be scrutinized: Level 1 details are general ridge flow patterns such as whorls, loops, and arches.112 Level 2 details are fine ridge characteristics, or minutiae, such as bifurcations, dots, islands, and ridge endings.113 These minutiae
are essentially ridge discontinuities.114 Level 3 details are “microscopic ridge
attributes such as the width of a ridge, the shape of its edge, or the presence of a
sweat pore near a particular ridge.”115 Within the fingerprint community there is
disagreement about the usefulness and reliability of Level 3 details.116
FBI examiners generally follow a procedure known as analysis, comparison,
evaluation, and verification (ACE-V). In the analysis stage, the examiner studies
the latent print to determine whether the quantity and quality of details in the
print are sufficient to permit further evaluation.117 The latent print may be so fragmentary or smudged that analysis is impossible. In the comparison stage, the examiner considers at least the Level 2 details, including “the type of minutiae (forks
or ridge endings), their direction (loss or production of a ridge) and their relative
position (how many intervening ridges there are between minutiae and how far
along the ridges it is from one minutiae to the next).”118 Again, if the examiner
finds a single, inexplicable difference between the two prints, the examiner concludes that there is no match.119 Alternatively, if the examiner concludes that there
is a match, the examiner seeks verification by a second examiner. “[T]he friction
ridge community actively discourages its members from testifying in terms of the
probability of a match; when a latent print examiner testifies that two impressions
111. Patterson, 840 N.E.2d at 17 (“There is a rule of examination, the ‘one-discrepancy’ rule,
that provides that a nonidentification finding should be made if a single discrepancy exists. However,
the examiner has the discretion to ignore a possible discrepancy if he concludes, based on his experience and the application of various factors, that the discrepancy might have been caused by distortions
of the fingerprint at the time it was made or at the time it was collected.”).
112. See id. at 16 (“Level one detail involves the general ridge flow of a fingerprint, that is, the
pattern of loops, arches, and whorls visible to the naked eye. The examiner compares this information
to the exemplar print in an attempt to exclude a print that has very clear dissimilarities.”).
113. See id. (“Level two details include ridge characteristics (or Galton Points) like islands,
dots, and forks, formed as the ridges begin, end, join or bifurcate.”). See generally FBI, The Science
of Fingerprints (1977).
114. Stoney, supra note 104, § 32:31, at 350.
115. See Patterson, 840 N.E.2d at 16.
116. See Office of the Inspector General, U.S. Dep’t of Justice, A Review of the FBI’s Handling
of the Brandon Mayfield Case, Unclassified Executive Summary 8 (Jan. 2006), available at www.justice.gov/oig/special/s0601/PDF list.htm (“Because Level 3 details are so small, the appearance of such
details in fingerprints is highly variable, even between different fingerprints made by the same finger. As
a result, the reliability of Level 3 details is the subject of some controversy within the latent fingerprint
community.”).
117. NRC Forensic Science Report, supra note 3, at 137–38.
118. Stoney, supra note 104, § 32:31, at 350–51.
119. NRC Forensic Science Report, supra note 3, at 140.
‘match,’ they are communicating the notion that the prints could not possibly
have come from two different individuals.”120 The typical fingerprint analyst will
give one of only three opinions: (1) the prints are unsuitable for analysis, (2) the
suspect is definitely excluded, or (3) the latent print is definitely that of the suspect.
B. The Empirical Record
At several points, the 2009 NRC report noted that there is room for human error
in fingerprint analysis. For example, the report stated that because “the ACE-V
method does not specify particular measurements or a standard test protocol,
. . . examiners must make subjective assessments throughout.”121 The report
further commented that the ACE-V method is too “broadly stated” to “qualify
as a validated method for this type of analysis.”122 The report added that “[t]he
latent print community in the United States has eschewed numerical scores and
corresponding thresholds” and consequently relies “on primarily subjective criteria” in making the ultimate attribution decision.123 In making the decision, the
examiner must draw on his or her personal experience to evaluate such factors as
“inevitable variations” in pressure, but to date these factors have not been “characterized, quantified, or compared.”124 At the conclusion of the section devoted
to fingerprint analysis, the report outlined an agenda for the research it considered
necessary “[t]o properly underpin the process of friction ridge identification.”125
The report noted that some of these research projects have already begun.126
Fingerprint analysis raises a number of scientific issues. For example, do the
salient features of fingerprints remain constant throughout a person’s life?127 Few
of the underlying scientific premises have been subjected to rigorous empirical
investigation,128 although some experiments have been conducted, and proficiency test results are available.
Two experimental studies were discussed at the 2000 trial in United States v.
Mitchell129:
One of the studies conducted by the government for the Daubert hearing [in
Mitchell] employed the two actual latent and the known prints that were at issue
in the case. These prints were submitted to 53 state law enforcement agency
120. Id. at 140–41.
121. Id. at 139.
122. Id. at 142.
123. Id. at 141.
124. Id. at 144.
125. Id.
126. Id.
127. Stoney, supra note 104, § 32:21, at 342.
128. See Zabell, supra note 25, at 164 (“Although there is a substantial literature on the uniqueness of fingerprints, it is surprising how little true scientific support for the proposition exists.”).
129. 365 F.3d 215 (3d Cir. 2004).
crime laboratories around the country for their evaluation. Though, of the 35
that responded, most concluded that the latent and known prints matched, eight
said that no match could be made to one of the prints and six said that no match
could be made to the other print.130
Although there were no false positives, a significant percentage of the participating
laboratories reported at best inconclusive findings.
Lockheed-Martin conducted the second test, the FBI-sponsored 50K study.
This was an empirical study of 50,000 fingerprint images taken from the FBI’s
Automated Fingerprint System, a computer database. The study
was an effort to obtain an estimate of the probability that one person’s fingerprints
would be mistaken for those of another person, at least to a computer system
designed to match fingerprints. The FBI asked Lockheed-Martin, the manufacturer of its . . . automated fingerprint identification system, . . . to help it run a
comparison of the images of 50,000 single fingerprints against the same 50,000
images, and produce a similarity score for each comparison. The point of this
exercise was to show that the similarity score for an image matched against itself
was far higher than the scores obtained when it was compared to the others.131
The comparisons between the two identical images yielded “extremely high
scores.”132 Nonetheless, some commentators disputed whether the Lockheed-Martin study demonstrated the validity of fingerprint analysis.133 The study compared a computerized image of a fingerprint impression against other computerized
images in the database. The study did not address the problem examiners encounter
in the real world; it did not attempt to match a partial fingerprint impression against
images in the database. As noted earlier, crime scene prints are typically distorted
from pressure and sometimes only one-fifth the size of record prints.134 Even the
same finger will not leave the exact impression each time: “The impression left by
a given finger will differ every time, because of inevitable variations in pressure,
which change the degree of contact between each part of the ridge structure and
the impression medium.”135 Thus, one scholar asserted that the “study addresses
the irrelevant question of whether one image of a fingerprint is immensely more
similar to itself than to other images—including those of the same finger.”136 Citing
130. Stoney, supra note 104, § 32:3, at 287.
131. Id. § 32:3, at 288.
132. Id. (quoting James L. Wayman, Director, U.S. National Biometric Test Center at the College of Engineering, San Jose State University).
133. E.g., David H. Kaye, Questioning a Courtroom Proof of the Uniqueness of Fingerprints, 71 Int’l
Statistical Rev. 521 (2003); S. Pankanti et al., On the Individuality of Fingerprints, 24 IEEE Trans. Pattern
Analysis Mach. Intelligence 1010 (2002).
134. See supra note 105 & accompanying text.
135. NRC Forensic Science Report, supra note 3, at 144.
136. Kaye, supra note 133, at 527–28. In another passage, he wrote: “[T]he study merely demonstrates the trivial fact that the same two-dimensional representation of the surface of a finger is far more similar to itself than to such representation of the source of finger from any other person in the data set.” Id. at 527.
this assertion, the 2009 NRC report stated that the Lockheed-Martin study “has several major design and analysis flaws.”137
1. Proficiency testing
In United States v. Llera Plaza,138 the district court described internal and external proficiency tests of FBI fingerprint analysts and their supervisors. Between 1995 and 2001, the supervisors participated in 16 external tests created by CTS.139 One false-positive result was reported among the 16 tests.140 During the same period, there was a total of 431 internal tests of FBI fingerprint personnel. These personnel committed no false-positive errors, but there were three false eliminations.141 Hence, the overall error rate was approximately 0.8%.142
Although these proficiency tests yielded impressive accuracy rates, the quality of the tests became an issue. First, the examinees participating in the tests knew that they were being tested and, for that reason, may have been more meticulous than in regular practice. Second, the rigor of proficiency testing was questioned. The Llera Plaza court concluded that the FBI’s internal proficiency tests were “less demanding than they should be.”143 In the judge’s words, “the FBI examiners got very high proficiency grades, but the tests they took did not.”144
137. NRC Forensic Science Report, supra note 3, at 144 n.35.
138. 188 F. Supp. 2d 549 (E.D. Pa. 2002).
139. Id. at 556.
140. However, a later inquiry led Stephen Meagher, Unit Chief of Latent Print Unit 3 of the
Forensic Analysis Section of the FBI Laboratory “to conclude that the error was not one of faulty evaluation but of faulty recording of the evaluation—i.e., a clerical error rather than a technical error.” Id.
141. Id.
142. Sharon Begley, Fingerprint Matches Come Under More Fire as Potentially Fallible, Wall St. J.,
Oct. 7, 2005, at B1.
143. Llera Plaza, 188 F. Supp. 2d at 565. A fingerprint examiner from New Scotland Yard with
25 years’ experience testified that the FBI tests were deficient:
Mr. Bayle had reviewed copies of the internal FBI proficiency tests. . . . He found the latent prints
utilized in those tests to be, on the whole, markedly unrepresentative of the latent prints that would be
lifted at a crime scene. In general, Mr. Bayle found the test latent prints to be far clearer than the prints
an examiner would routinely deal with. The prints were too clear—they were, according to Mr. Bayle,
lacking in the “background noise” and “distortion” one would expect in latent prints lifted at a crime
scene. Further, Mr. Bayle testified, the test materials were deficient in that there were too few latent
prints that were not identifiable; according to Mr. Bayle, at a typical crime scene only about ten percent
of the lifted latent prints will turn out to be matched. In Mr. Bayle’s view the paucity of non-identifiable
latent prints “makes the test too easy. It’s not testing their ability. . . . [I]f I gave my experts these tests,
they’d fall about laughing.”
Id. at 557–58.
144. Id. at 565; see also United States v. Crisp, 324 F.3d 261, 274 (4th Cir. 2003) (Michael, J.,
dissenting) (“Proficiency testing is typically based on a study of prints that are far superior to those
usually retrieved from a crime scene.”).
In an earlier proficiency study (1995), the examiners did not do as well,145
although many of the subjects were not certified FBI examiners. Of the 156 examiners who participated, only 44% reached the correct conclusion on all the identification tasks. Eighty-eight examiners, or 56%, provided divergent (wrong, incorrect, erroneous) answers. Six examiners failed to identify any of the latent prints. Forty-eight of the 156 examiners made erroneous identifications—representing 22% of the total identifications made by the examiners.
A 2006 study resurrected some of the questions raised by the 1995 test. In
that study, examiners were presented with sets of prints that they had previously
reviewed.146 The researchers found that “experienced examiners do not necessarily
agree with even their own past conclusions when the examination is presented in
a different context some time later.”147
These studies call into question the soundness of testimonial claims that
fingerprint analysis is infallible148 or has a zero error rate.149 In 2008, Haber and
Haber reviewed the literature describing the ACE-V technique and the supporting research.150 Although many practitioners professed to use the technique, Haber
and Haber found that the practitioners’ “descriptions [of their technique] differ, no
single protocol has been officially accepted by the profession and the standards upon
which the method’s conclusion rest[s] have not been specified quantitatively.”151
After considering the Haber study, the NRC concluded that the ACE-V “framework is not specific enough to qualify as a validated method for this type of analysis.”152
2. The Mayfield case
Like the empirical data, several reports of fingerprint misidentifications raised questions about the reliability of fingerprint analysis. The FBI misidentified Brandon
Mayfield as the source of the crime scene prints in the terrorist train bombing in
Madrid, Spain, on March 11, 2004.153 The mistake was attributed in part to several
types of cognitive bias. According to an FBI review, the “power” of the automated
145. See David L. Grieve, Possession of Truth, 46 J. Forensic Identification 521, 524–25 (1996);
James Starrs, Forensic Science on the Ropes: An Upper Cut to Fingerprinting, 20 Sci. Sleuthing Rev. 1
(1996).
146. Itiel E. Dror et al., Contextual Information Renders Experts Vulnerable to Making Erroneous
Identifications, 156 Forensic Sci. Int’l 74, 76 (2006) (Four of five examiners changed their opinions;
three directly contradicted their prior identifications, and the fourth concluded that data were insufficient to reach a definite conclusion); see also I. E. Dror & D. Charlton, Why Experts Make Errors, 56
J. Forensic Identification 600 (2006).
147. NRC Forensic Science Report, supra note 3, at 139.
148. Id. at 104.
149. Id. at 143–44.
150. Lyn Haber & Ralph Norman Haber, Scientific Validation of Fingerprint Evidence Under
Daubert, 7 Law, Probability & Risk 87 (2008).
151. NRC Forensic Science Report, supra note 3, at 143.
152. Id. at 142.
153. Id. at 46 & 105.
fingerprint correlation “was thought to have influenced the examiner’s initial judgment and subsequent examination.”154 Thus, the examiner was subject to confirmation bias.
Moreover, a second review by another examiner was not conducted blind—that
is, the reviewer knew that a positive identification had already been made and was
thus subject to expectation (context) bias. Indeed, a third expert from outside the
FBI, one appointed by the court, also erroneously confirmed the identification.155
In addition to the Bureau’s review, the Inspector General of the Department of
Justice investigated the case.156 The Mayfield case is not an isolated incident.157
The Mayfield case led to a more extensive FBI review of the scientific basis of
fingerprints.158 In January 2006, the FBI created a three-person review committee
to evaluate the fundamental basis of fingerprint analysis. The committee identified two possible approaches. One approach would be to “develop a quantifiable
minimum threshold based on objective criteria”—if possible.159 “Any minimum
threshold must consider both the clarity (quality) and the quantity of features and
include all levels of detail, not simply points or minutiae.”160 Apparently, some
FBI examiners use an unofficial seven-point cutoff, but this standard has never
been tested.161 As the FBI Review cautioned: “It is compelling to focus on a
quantifiable threshold; however, quality/clarity, that is, distortion and degradation
of prints, is the fundamental issue that needs to be addressed.”162
154. Stacey, supra note 66, at 713.
155. In addition, the culture at the laboratory was poorly suited to detect mistakes: “To disagree
was not an expected response.” Id.
156. See Office of the Inspector General, U.S. Dep’t of Justice, A Review of the FBI’s Handling
of the Brandon Mayfield Case, Unclassified Executive Summary 9 (Jan. 2006). The I.G. made several
recommendations that went beyond the FBI’s internal report:
These include recommendations that the Laboratory [1] develop criteria for the use of Level 3 details
to support identifications, [2] clarify the “one discrepancy rule” to assure that it is applied in a manner
consistent with the level of certainty claimed for latent fingerprint identifications, [3] require documentation of features observed in the latent fingerprint before the comparison phase to help prevent
circular reasoning, [4] adopt alternate procedures for blind verifications, [5] review prior cases in which
the identification of a criminal suspect was made on the basis of only one latent fingerprint searched
through IAFIS, and [6] require more meaningful and independent documentation of the causes of errors
as part of the Laboratory’s corrective action procedures.
157. In 2005, Professor Cole released an article identifying 23 cases of documented fingerprint
misidentifications. See Simon A. Cole, More Than Zero: Accounting for Error in Latent Fingerprint Identification, 95 J. Crim. L. & Criminology 985 (2005). The misidentification cases include some that
involved (1) verification by one or more other examiners, (2) examiners certified by the International
Association of Identification, (3) procedures using a 16-point standard, and (4) defense experts who
corroborated misidentifications made by prosecution experts.
158. See Bruce Budowle et al., Review of the Scientific Basis for Friction Ridge Comparisons as a
Means of Identification: Committee Findings and Recommendations, 8 Forensic Sci. Comm. (Jan. 2006)
[hereinafter FBI Review].
159. Id. at 5.
160. Id.
161. There is also a 12-point cutoff, under which a supervisor’s approval is required.
162. Id.
The second approach would treat the examiner as a “black box.” This methodology would be necessary if minimum criteria for rendering an identification
cannot be devised—in other words, there is simply too much subjectivity in the
process to formulate meaningful, quantitative guidelines. Under this approach,
it becomes critical to determine just how good a “black box” each examiner is:
“The examiner(s) can be tested with various inputs of a range of defined categories
of prints. This approach would demonstrate whether or not it is possible to obtain
a degree of accuracy (that is, assess the performance of the black-box examiner for
rendering an identification).”163 The review committee noted that this approach
would provide the greatest assurance of reliability if it incorporated blind technical review. According to the review committee’s report, “[t]o be truly blind,
the second examiner should have no knowledge of the interpretation by the first
examiner (to include not seeing notes or reports).”164
Although the FBI Review concluded that reliable identifications could be
made, it conceded that “there are scientific areas where improvements in the
practice can be made particularly regarding validation, more objective criteria for
certain aspects of the ACE-V process, and data collection.”165 Efforts to improve
fingerprint analysis appear to be under way. In 2008, the papers from a symposium on validity testing of fingerprint examinations were published.166 In late 2008, the National Institute of Standards and Technology formed the Expert Group on Human Factors in Latent Print Analysis, tasked with identifying the major sources of human error in fingerprint examination and developing strategies to minimize such errors.
C. Case Law Development
As noted earlier, the seminal American decision is the Illinois Supreme Court’s
1911 opinion in Jennings.167 Fingerprint testimony was routinely admitted in later
163. Id. at 4.
164. Id.
165. Id. at 10.
166. The lead article is Lyn Haber & Ralph Norman Haber, supra note 150. Other contributors are Christopher Champod, Fingerprint Examination: Towards More Transparency, 7 Law, Probability
& Risk 111 (2008); Simon A. Cole, Comment on “Scientific Validation of Fingerprint Evidence Under
Daubert,” 7 Law, Probability & Risk 119 (2008); Jennifer Mnookin, The Validity of Latent Fingerprint
Identification: Confessions of a Fingerprinting Moderate, 7 Law, Probability & Risk 127 (2008).
167. People v. Jennings, 96 N.E. 1077 (Ill. 1911); see Donald Campbell, Fingerprints: A Review,
[1985] Crim. L. Rev. 195, 196 (“Galton gave evidence to the effect that the chance of agreement
would be in the region of 1 in 64,000,000,000.”). As Professor Mnookin has noted, however, “fingerprints were accepted as an evidentiary tool without a great deal of scrutiny or skepticism.” Mnookin,
supra note 17, at 17. She elaborated:
Even if no two people had identical sets of fingerprints, this did not establish that no two people could have a single identical print, much less an identical part of a print. These are necessarily matters of probability, but neither the court in Jennings nor subsequent judges ever required that fingerprint identification be placed on a secure statistical foundation.
Id. at 19.
years. Some courts stated that fingerprint evidence was the strongest proof of a person’s identity.168
With the exception of one federal district court decision that was later withdrawn,169 the post-Daubert federal cases have continued to accept fingerprint testimony about individuation at least as sufficiently reliable nonscientific expertise.170
Two subsequent state court decisions also deserve mention. In one, a Maryland trial judge excluded fingerprint evidence under the Frye test, which still controls in that state.171 In the other case, Commonwealth v. Patterson,172 the Supreme Judicial
168. People v. Adamson, 165 P.2d 3, 12 (Cal. 1946), aff’d, 332 U.S. 46 (1947).
169. United States v. Llera Plaza, 179 F. Supp. 2d 492 (E.D. Pa.), vacated, mot. granted on recons.,
188 F. Supp. 2d 549 (E.D. Pa. 2002). The ruling was limited to excluding expert testimony that two
sets of prints “matched”—that is, a positive identification to the exclusion of all other persons:
Accordingly, this court will permit the government to present testimony by fingerprint examiners who,
suitabl[y] qualified as “expert” examiners by virtue of training and experience, may (1) describe how the
rolled and latent fingerprints at issue in this case were obtained, (2) identify and place before the jury the
fingerprints and such magnifications thereof as may be required to show minute details, and (3) point
out observed similarities (and differences) between any latent print and any rolled print the government
contends are attributable to the same person. What such expert witnesses will not be permitted to do
is to present “evaluation” testimony as to their “opinion” (Rule 702) that a particular latent print is in
fact the print of a particular person.
Id. at 516. On rehearing, however, the court reversed itself. A spate of legal articles followed. See, e.g.,
Simon A. Cole, Grandfathering Evidence: Fingerprint Admissibility Rulings from Jennings to Llera Plaza and
Back Again, 41 Am. Crim. L. Rev. 1189 (2004); Robert Epstein, Fingerprints Meet Daubert: The Myth
of Fingerprint “Science” Is Revealed, 75 S. Cal. L. Rev. 605 (2002); Kristin Romandetti, Recognizing and
Responding to a Problem with the Admissibility of Fingerprint Evidence Under Daubert, 45 Jurimetrics J. 41
(2004).
170. See, e.g., United States v. Baines, 573 F.3d 979, 990 (10th Cir. 2009) (“[U]nquestionably the
technique has been subject to testing, albeit less rigorous than a scientific ideal, in the world of criminal
investigation, court proceedings, and other practical applications, such as identification of victims of
disasters. Thus, while we must agree with defendant that this record does not show that the technique
has been subject to testing that would meet all of the standards of science, it would be unrealistic in the
extreme for us to ignore the countervailing evidence. Fingerprint identification has been used extensively
by law enforcement agencies all over the world for almost a century.”); United States v. Abreu, 406
F.3d 1304, 1307 (11th Cir. 2005) (“We agree with the decisions of our sister circuits and hold that the
fingerprint evidence admitted in this case satisfied Daubert.”); United States v. Janis, 387 F.3d 682, 690
(8th Cir. 2004) (finding fingerprint evidence to be reliable); United States v. Mitchell, 365 F.3d 215,
234–52 (3d Cir. 2004); United States v. Crisp, 324 F.3d 261, 268–71 (4th Cir. 2003); United States v.
Collins, 340 F.3d 672, 682 (8th Cir. 2002) (“Fingerprint evidence and analysis is generally accepted.”);
United States v. Hernandez, 299 F.3d 984, 991 (8th Cir. 2002); United States v. Sullivan, 246 F. Supp.
2d 700, 704 (E.D. Ky. 2003); United States v. Martinez-Cintron, 136 F. Supp. 2d 17, 20 (D.P.R. 2001).
171. State v. Rose, No. K06-0545, 2007 WL 5877145 (Cir. Ct. Baltimore, Md., Oct. 19, 2007).
See NRC Forensic Science Report, supra note 3, at 43 & 105. However, in a parallel federal case, the
evidence was admitted. United States v. Rose, 672 F. Supp. 2d 723 (D. Md. 2009).
172. 840 N.E.2d 12 (Mass. 2005).
Court of Massachusetts considered the reliability of applying the ACE-V methodology to simultaneous impressions. Simultaneous impressions “are two or more
friction ridge impressions from the fingers and/or palm on one hand that are determined to have been deposited at the same time.”173 The key is deciding whether
the impressions were left at the same time and therefore came from the same
person, rather than having been left by two different people at different times.174
Although the court found that the ACE-V method is generally accepted by the
relevant scientific community, the record did not demonstrate similar acceptance of
that methodology as applied to simultaneous impressions. The court consequently
remanded the case to the trial court.175
VII. Handwriting Evidence
The Lindbergh kidnapping trial showcased testimony by questioned document
examiners. Later, in the litigation over Howard Hughes’ alleged will, both sides
relied on handwriting comparison experts.176 Thanks in part to such cases, questioned document examination expertise has enjoyed widespread use and judicial
acceptance.
A. The Technique
Questioned document examiners are called on to perform a variety of tasks such
as determining the sequence of strokes on a page and whether a particular ink
formulation existed on the purported date of a writing.177 However, the most
common task performed is signature authentication—that is, deciding whether
to attribute the handwriting on a document to a particular person. Here, the
examiner compares known samples of the person’s writing to the questioned
173. FBI Review, supra note 158, at 7.
174. Patterson, 840 N.E.2d at 18 (“[T]he examiner apparently may take into account the distance
separating the latent impressions, the orientation of the impressions, the pressure used to make the
impression, and any other facts the examiner deems relevant. The record does not, however, indicate
that there is any approved standardized method for making the determination that two or more print
impressions have been made simultaneously.”).
175. The FBI review addressed this subject: “[I]f an item could only be held in a certain manner,
then the only way of explaining the evidence is that the multiple prints are from the single person. In
some cases, identifying simultaneous prints may infer, for example, the manner in which a knife was
held.” FBI Review, supra note 158, at 8. However, the review found that there was not agreement
on what constitutes a “simultaneous impression,” and therefore, more explicit guidelines were needed.
176. Irby Todd, Do Experts Frequently Disagree? 18 J. Forensic Sci. 455, 457–59 (1973).
177. Questioned document examinations cover a wide range of analyses: handwriting, hand
printing, typewriting, mechanical impressions, altered documents, obliterated writing, indented writing, and charred documents. See 2 Paul C. Giannelli & Edward J. Imwinkelried, Scientific Evidence
ch. 21 (4th ed. 2007).
document. In performing this comparison, examiners consider (1) class and
(2) individual characteristics. Of class characteristics, two types are weighed:
system178 and group. People exhibiting system characteristics would include,
for example, those who learned the Palmer method of cursive writing, taught
in many schools. Such people should manifest some of the characteristics of
that writing style. Examples of people exhibiting group characteristics include persons of certain nationalities who tend to have some writing mannerisms in common.179 The writing of arthritic or blind persons also tends to exhibit
some common general characteristics.180
Individual characteristics take several forms: (1) the manner in which the author
begins or ends the word, (2) the height of the letters, (3) the slant of the letters,
(4) the shading of the letters, and (5) the distance between the words. An identification rarely rests on a single characteristic. More commonly, a combination of
characteristics is the basis for an identification. As in fingerprint analysis, there is no
universally accepted number of points of similarity required for an individuation
opinion. As with fingerprints, the examiner’s ultimate judgment is subjective.
There is one major difference, though, between the approaches taken by
fingerprint analysts and questioned document examiners. As previously stated, the
typical fingerprint analyst will give one of only three opinions: (1) the prints are
unsuitable for analysis, (2) the suspect is definitely excluded, or (3) the latent print is
definitely that of the suspect. In contrast, questioned document examiners recognize
a wider range of permissible opinions: (1) definite identification, (2) strong probability of identification, (3) probable identification, (4) indication of identification,
(5) no conclusion, (6) indication of nonauthorship, (7) probability of nonauthorship,
(8) strong probability of nonauthorship, and (9) elimination.181 In short, in many
cases, a questioned document examiner explicitly acknowledges the uncertainty of
his or her opinion.182 Whether such a nine-level scale is justified is another matter.183
178. See James A. Kelly, Questioned Document Examination, in Scientific and Expert Evidence
695, 698 (2d ed. 1981).
179. See Nellie Chang et al., Investigation of Class Characteristics in English Handwriting of the Three
Main Racial Groups: Chinese, Malay, and Indian in Singapore, 50 J. Forensic Sci. 177 (2005); Robert
J. Muehlberger, Class Characteristics of Hispanic Writing in the Southeastern United States, 34 J. Forensic
Sci. 371 (1989); Sandra L. Ramsey, The Cherokee Syllabary, 39 J. Forensic Sci. 1039 (1994) (one of
the landmark questioned document cases, Hickory v. United States, 151 U.S. 303 (1894), involved
Cherokee writing); Marvin L. Simner et al., A Comparison of the Arabic Numerals One Through Nine,
Written by Adults from Native English-Speaking vs. Non-Native English-Speaking Countries, 15 J. Forensic
Doc. Examination (2003).
180. See Larry S. Miller, Forensic Examination of Arthritic Impaired Writings, 15 J. Police Sci. &
Admin. 51 (1987).
181. NRC Forensic Science Report, supra note 3, at 166.
182. See id. at 47.
183. See United States v. Starzecpyzel, 880 F. Supp. 1027, 1048 (S.D.N.Y. 1995) (“No showing has been made, however, that FDEs can combine their first stage observations into such accurate
conclusions as would justify a nine level scale.”).
Reference Guide on Forensic Identification Expertise
B. The Empirical Record
The 2009 NRC report included a section discussing questioned document examination. The report acknowledged that some tasks performed by examiners are
similar in nature “to other forensic chemistry work.”184 For example, some ink
and paper analyses use the same hardware and rely on criteria as objective as many
tests in forensic chemistry. In contrast, other analyses depend heavily on the examiner’s subjective judgment and do not have as “firm [a] scientific foundation” as
the analysis of inks and paper.185 In particular, the report focused on the typical
task of deciding common authorship. With respect to that task, the report stated:
The scientific basis for handwriting comparisons needs to be strengthened.
Recent studies have increased our understanding of the individuality and consistency of handwriting . . . and suggest that there may be a scientific basis for
handwriting comparison, at least in the absence of intentional obfuscation or
forgery. Although there has been only limited research to quantify the reliability
and replicability of the practices used by trained document examiners, the committee agrees that there may be some value in handwriting analysis.186
Until recently, the empirical record for signature authentication was sparse.
Even today there are no population frequency studies establishing, for example,
the incidence of persons who conclude their “w” with a certain lift. As a 1989
article commented,
our literature search for empirical evaluation of handwriting identification turned
up one primitive and flawed validity study from nearly 50 years ago, one 1973
paper that raises the issue of consistency among examiners but presents only
uncontrolled impressionistic and anecdotal information not qualifying as data in
any rigorous sense, and a summary of one study in a 1978 government report.
Beyond this, nothing.187
This 1989 article then surveyed five proficiency tests administered by CTS in
1975, 1984, 1985, 1986, and 1987. The article set out the results from each of
the tests188 and then aggregated the data by computing the means for the various
categories of answers: “A rather generous reading of the data would be that in
45% of the reports forensic document examiners reached the correct finding,
in 36% they erred partially or completely, and in 19% they were unable to draw
a conclusion.”189
The above studies were conducted prior to Daubert, which was decided in
1993. After the first post-Daubert admissibility challenge to handwriting evidence
184. NRC Forensic Science Report, supra note 3, at 164.
185. Id. at 167.
186. Id. at 166–67.
187. Risinger et al., supra note 9, at 747.
188. Id. at 744 (1975 test), at 745 (1984 and 1985 tests), at 746 (1986 test), and at 747 (1987 test).
189. Id. at 747.
in 1995,190 a number of research projects investigated two questions: (1) are experienced document examiners better at signature authentication than laypersons and
(2) do experienced document examiners reach correct signature authentication
decisions at a rate substantially above chance?
1. Comparison of experts and laypersons
Two Australian studies support the claim that experienced examiners are more
competent at signature authentication tasks than laypersons. The first study was
reported in 1999.191 In this study, document examiners chose the “inconclusive” option far more frequently than did the laypersons. However, in the cases
in which a conclusion was reached, the overall error rate for lay subjects was
28%, compared with 2% for experts. More specifically, the lay error rate for false
authentication was 7% while it was 0% for the experts. The second Australian
study was released in 2002.192 Excluding “inconclusive” findings, the error rate
for forensic document examiners was 5.8%; for laypersons, it was 23.5%.
In the United States, Dr. Moshe Kam, a computer scientist at Drexel University, has been the leading researcher in signature authentication. Dr. Kam and his
colleagues have published five articles reporting experiments comparing the signature authentication expertise of document examiners and laypersons. Although
the last study involved printing,193 the initial four were related to cursive writing.
In the first, excluding inconclusive findings, document examiners were correct
92.41% of the time and committed false elimination errors in 7.59% of their decisions.194 Lay subjects were correct 72.84% of the time and made false elimination
errors in 27.16% of their decisions. In the second through fourth studies, the
researchers provided the laypersons with incentives, usually monetary, for correct
decisions. In the fourth study, forgeries were called genuine only 0.5% of the time
by experts but 6.5% of the time by laypersons.195 Laypersons were 13 times more
likely to err in concluding that a simulated document was genuine.
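The rate computations used throughout this subsection can be made concrete with a brief sketch. The raw tallies below (146 correct and 12 erroneous conclusive answers for experts; 118 and 44 for laypersons) are invented for illustration, although they happen to reproduce the percentages reported for the first Kam study:

```python
# Sketch of how correct and error rates are computed once "inconclusive"
# responses are set aside. The counts here are hypothetical; the studies
# cited in the text report only the resulting percentages.

def rates_excluding_inconclusive(correct, erroneous, inconclusive):
    """Return (percent correct, percent erroneous) among conclusive answers."""
    conclusive = correct + erroneous  # inconclusive responses drop out
    if conclusive == 0:
        raise ValueError("no conclusive decisions to evaluate")
    return 100 * correct / conclusive, 100 * erroneous / conclusive

expert_correct, expert_error = rates_excluding_inconclusive(146, 12, 42)
lay_correct, lay_error = rates_excluding_inconclusive(118, 44, 38)

print(round(expert_correct, 2), round(expert_error, 2))  # 92.41 7.59
print(round(lay_correct, 2), round(lay_error, 2))        # 72.84 27.16
```

Note that the denominator contains only conclusive answers, which is why a study's reported error rate can shift substantially depending on how "inconclusive" responses are treated.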
Some critics of Dr. Kam’s research have asserted that the tasks performed
in the tests do not approximate the signature authentication challenges faced by
190. See United States v. Starzecpyzel, 880 F. Supp. 1027 (S.D.N.Y. 1995).
191. Bryan Found et al., The Development of a Program for Characterizing Forensic Handwriting
Examiners’ Expertise: Signature Examination Pilot Study, 12 J. Forensic Doc. Examination 69, 72–76 (1999).
192. Jodi Sita et al., Forensic Handwriting Examiners’ Expertise for Signature Comparison, 47 J.
Forensic Sci. 1117 (2002).
193. Moshe Kam et al., Writer Identification Using Hand-Printed and Non-Hand-Printed Questioned
Documents, 48 J. Forensic Sci. 1 (2003).
194. Moshe Kam et al., Proficiency of Professional Document Examiners in Writer Identification, 39
J. Forensic Sci. 5 (1994).
195. Moshe Kam et al., Signature Authentication by Forensic Document Examiners, 46 J. Forensic
Sci. 884 (2001); Moshe Kam et al., The Effects of Monetary Incentives on Performance of Nonprofessionals
in Document Examiners Proficiency Tests, 43 J. Forensic Sci. 1000 (1998); Moshe Kam et al., Writer
Identification by Professional Document Examiners, 42 J. Forensic Sci. 778 (1997).
examiners in real life.196 In addition, critics have claimed that even the monetary
incentives for the laypersons do not come close to equaling the powerful incentives that experts have to be careful in these tests.197 Yet by now the empirical
research record includes a substantial number of studies. With the exception of a
1975 German study,198 the studies uniformly conclude that professional examiners
are much more adept at signature authentication than laypersons.199
2. Proficiency studies comparing experts’ performance to chance
Numerous proficiency studies have been conducted in the United States200 and
Australia.201 Some of the American tests reported significant error rates. For
example, on a 2001 test, excluding inconclusive findings, the false authentication
rate was 22%, while the false elimination rate was 0%. Moreover, as previously
stated, on the five CTS proficiency tests mentioned in the 1989 article, 36% of
the participating examiners erred partially or completely.202 Further, critics have
claimed that some of the proficiency tests were far easier than the tasks encountered in actual practice,203 and that consequently, the studies tend to overstate
examiners’ proficiency.
196. D. Michael Risinger, Cases Involving the Reliability of Handwriting Identification Expertise Since
the Decision in Daubert, 43 Tulsa L. Rev. 477, 490 (2007).
197. Id.
198. The German study included 25 experienced examiners, laypersons with no handwriting
background, and some university students who had taken courses in handwriting psychology and comparison. On the one hand, the professional examiners outperformed the regular laypersons. The experts
had a 14.7% error rate compared with the 34.4% rate for laypersons without any training. On the other
hand, the university students had a lower aggregate error rate than the professional questioned document examiners. Wolfgang Conrad, Empirische Untersuchungen über die Urteilsgüte verschiedener Gruppen
von Laien und Sachverständigen bei der Unterscheidung authentischer und gefälschter Unterschriften [Empirical
Studies Regarding the Quality of Assessments of Various Groups of Lay Persons and Experts in Differentiating Between Authentic and Forged Signatures], 156 Archiv für Kriminologie 169–83 (1975).
199. See Roger Park, Signature Identification in the Light of Science and Experience, 59 Hastings L.J.
1101, 1135–36 (2008).
200. E.g., Collaborative Testing Service (CTS), Questioned Document Examination, Report
No. 92-6 (1992); CTS, Questioned Document Examination, Report No. 9406 (1994); CTS,
Questioned Document Examination, Report No. 9606 (1996); CTS, Forensic Testing Program,
Handwriting Examination, Report No. 9714 (1997); CTS, Forensic Testing Program, Handwriting
Examination, Report No. 9814 (1998); CTS, Forensic Testing Program, Handwriting Examination,
Test No. 99-524 (1999); CTS, Forensic Testing Program, Handwriting Examination, Test No. 00-524
(2000); CTS, Forensic Testing Program, Handwriting Examination, Test No. 01-524 (2001); CTS,
Forensic Testing Program, Handwriting Examination, Test No. 02-524 (2003); available at http://
www.ctsforensics.com/reports/main.aspx.
201. Bryan Found & Doug Rogers, The Probative Character of Forensic Handwriting Examiners’
Identification and Elimination Opinions on Questioned Signatures, 178 Forensic Sci. Int’l 54 (2008).
202. Risinger et al., supra note 9, at 747–48.
203. Risinger, supra note 196, at 485.
The CTS proficiency test results for the 1978–2005 period addressed the
comparison of known and questioned signatures and other writings to determine
authorship. Other exercises asked participants to examine a variety of mechanical impressions on paper, as well as the use of photocopying and inks.
• Between 1978 and 1999,204 fewer than 5% of the mechanical impression
comparisons were in error, but 10% of the replies were inconclusive where
the examiner should have excluded the impressions as having a common
source. With regard to handwriting comparisons, the examiners did very
well on the straightforward comparisons, with almost 100% of the comparisons correct. However, in more challenging tests, such as those involving multiple authors, as high as 25% of the replies were inconclusive and
nearly 10% of the author associations were incorrect.
• In the 2000–2005 time period, the participants generally performed
very well (some approaching 99% correct responses) in determining the
genuineness of documents where text in a document had been manipulated or where documents had been altered with various pens and inks.
The handwriting exercises were not as successful; in those exercises, comparisons of questioned and known writings were correct about 92% of the
time, inconclusive 7% of the time, and incorrect 1% of the time. Nearly
all incorrect responses occurred where participants reported handwriting
to be of common origin when it was not.
Some examiners characterized these tests as too easy, while others described them as realistic and very challenging.
Thus, the results of the most recent proficiency studies are encouraging.
Moreover, the data in the five proficiency tests discussed in the 1989 article205 can
be subject to differing interpretations. Critics of questioned document examination sometimes suggest that the results of the 1985 test in particular prove that
signature authentication has “a high error rate.”206 However,
[t]hese results can be characterized in different ways. [Another] way of viewing
the result would be to disaggregate the specific decisions made by the experts.
. . . [S]uppose that a teacher gives a multiple-choice test containing fifty questions. There are different ways that the results could be reported. One could
calculate the percentage of students who got any of the fifty questions wrong,
and report that as the error rate. A more customary approach would be to treat
204. John I. Thornton & Joseph L. Peterson, The General Assumptions and Rationale of Forensic
Identification, in 4 Modern Scientific Evidence: The Law and Science of Expert Testimony, supra note
104, § 29:40, at 54.
205. Risinger et al., supra note 9.
206. Park, supra note 199, at 1113.
each question as a separate task, and report the error rate as the mean percentage
of questions answered incorrectly.207
If the specific decisions made by the examiners were disaggregated, each examiner
had to make 66 decisions regarding whether certain pairs of signatures were written by the same person.208 Under this approach, the false authentication error rate
was 3.8%, and the false elimination error rate was 4.5%.209 In that light, even the
1985 study supports the contention that examiners perform signature authentication tasks at a validity rate considerably exceeding chance.
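The contrast between the two reporting conventions, the fraction of examiners who made any error versus the mean fraction of individual decisions answered incorrectly, can be illustrated with a short sketch. The response data below are invented for illustration and are not drawn from the 1985 test itself:

```python
# Hypothetical illustration of the two reporting conventions discussed
# above. Each inner list is one examiner's decisions; True = correct.
responses = [
    [True] * 66,                # an examiner with no errors
    [True] * 63 + [False] * 3,  # 3 errors out of 66 decisions
    [True] * 60 + [False] * 6,  # 6 errors out of 66 decisions
]

# Convention 1: fraction of examiners who made at least one error.
examiner_error_rate = sum(not all(r) for r in responses) / len(responses)

# Convention 2: fraction of all individual decisions that were wrong.
wrong = sum(r.count(False) for r in responses)
total = sum(len(r) for r in responses)
decision_error_rate = wrong / total

print(round(examiner_error_rate, 3))  # 0.667: 2 of 3 examiners erred
print(round(decision_error_rate, 3))  # 0.045: but only 9 of 198 decisions
```

As the teacher analogy in the quoted passage suggests, the same underlying performance can yield a 67% "error rate" under the first convention and under 5% under the second.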
C. Case Law Development
Although the nineteenth-century cases were skeptical of handwriting expertise,210
in the twentieth century the testimony in leading cases, such as the Lindbergh
prosecution, helped the discipline gain judicial acceptance. There was little dispute
that handwriting comparison testimony was admissible at the time the Federal
Rules of Evidence were enacted in 1975. Rule 901(b)(3) recognized that a document could be authenticated by an expert, and the drafters explicitly mentioned
handwriting comparison “testimony of expert witnesses.”211
The first significant admissibility challenge under Daubert was mounted in
United States v. Starzecpyzel.212 In that case, the district court concluded that
“forensic document examination, despite the existence of a certification program,
professional journals and other trappings of science, cannot, after Daubert, be
regarded as ‘scientific . . . knowledge.’”213 Nonetheless, the court did not exclude
handwriting comparison testimony. Instead, the court admitted the individuation
testimony as nonscientific “technical” evidence.214 Starzecpyzel prompted more
attacks that questioned the lack of empirical validation in the field.215
207. Id. at 1114.
208. Id. at 1115.
209. Id. at 1116.
210. See Strother v. Lucas, 31 U.S. 763, 767 (1832); Phoenix Fire Ins. Co. v. Philip, 13 Wend.
81, 82–84 (N.Y. Sup. Ct. 1834).
211. Fed. R. Evid. 901(b)(3) advisory committee’s note.
212. 880 F. Supp. 1027 (S.D.N.Y. 1995).
213. Id. at 1038.
214. Kumho Tire later called this aspect of the Starzecpyzel opinion into question because Kumho
held that the reliability requirement applies to all types of expertise—“scientific,” “technical,” or
“specialized.” Moreover, the Supreme Court indicated that the Daubert factors, including empirical
testing, may be applicable to technical expertise. Some aspects of handwriting can and have been tested.
215. See, e.g., United States v. Hidalgo, 229 F. Supp. 2d 961, 967 (D. Ariz. 2002) (“Because
the principle of uniqueness is without empirical support, we conclude that a document examiner
will not be permitted to testify that the maker of a known document is the maker of the questioned
document. Nor will a document examiner be able to testify as to identity in terms of probabilities.”).
As of the date of this publication, there is a three-way split of authority.
The majority of courts permit examiners to express individuation opinions.216 As
one court noted, “all six circuits that have addressed the admissibility of handwriting expert [testimony] . . . [have] determined that it can satisfy the reliability
threshold” for nonscientific expertise.217 In contrast, several courts have excluded
expert testimony,218 although one involved handprinting219 and another Japanese
handprinting.220 Many district courts have endorsed a third view. These courts
limit the reach of the examiner’s opinion, permitting expert testimony about
similarities and dissimilarities between exemplars but not an ultimate conclusion
that the defendant was the author (“common authorship” opinion) of the questioned document.221 The expert is allowed to testify about “the specific similarities
and idiosyncrasies between the known writings and the questioned writings, as
well as testimony regarding, for example, how frequently or infrequently in his
experience, [the expert] has seen a particular idiosyncrasy.”222 As the justification
for this limitation, these courts often state that the examiners’ claimed ability to
individuate lacks “empirical support.”223
216. See, e.g., United States v. Prime, 363 F.3d 1028, 1033 (9th Cir. 2004); United States v.
Crisp, 324 F.3d 261, 265–71 (4th Cir. 2003); United States v. Jolivet, 224 F.3d 902, 906 (8th Cir.
2000) (affirming the introduction of expert testimony that it was likely that the accused wrote the
questioned documents); United States v. Velasquez, 64 F.3d 844, 848–52 (3d Cir. 1995); United States
v. Ruth, 42 M.J. 730, 732 (A. Ct. Crim. App. 1995), aff’d on other grounds, 46 M.J. 1 (C.A.A.F. 1997);
United States v. Morris, No. 06-87-DCR, 2006 U.S. Dist. LEXIS 53983, *5 (E.D. Ky. July 20, 2006);
Orix Fin. Servs. v. Thunder Ridge Energy, Inc., No. 01 Civ. 4788 (RJH) (HBP), 2005 U.S. Dist.
LEXIS 41889 (S.D.N.Y. Dec. 29, 2005).
217. Prime, 363 F.3d at 1034.
218. United States v. Lewis, 220 F. Supp. 2d 548 (S.D. W. Va. 2002).
219. United States v. Saelee, 162 F. Supp. 2d 1097 (D. Alaska 2001).
220. United States v. Fujii, 152 F. Supp. 2d 939, 940 (N.D. Ill. 2000) (holding expert testimony
concerning Japanese handprinting inadmissible: “Handwriting analysis does not stand up well under
the Daubert standards. Despite its long history of use and acceptance, validation studies supporting its
reliability are few, and the few that exist have been criticized for methodological flaws.”).
221. See, e.g., United States v. Oskowitz, 294 F. Supp. 2d 379, 384 (E.D.N.Y. 2003) (“Many
other district courts have similarly permitted a handwriting expert to analyze a writing sample for
the jury without permitting the expert to offer an opinion on the ultimate question of authorship.”);
United States v. Rutherford, 104 F. Supp. 2d 1190, 1194 (D. Neb. 2000) (“[T]he Court concludes that
FDE Rauscher’s testimony meets the requirements of Rule 702 to the extent that he limits his testimony to identifying and explaining the similarities and dissimilarities between the known exemplars
and the questioned documents. FDE Rauscher is precluded from rendering any ultimate conclusions
on authorship of the questioned documents and is similarly precluded from testifying to the degree
of confidence or certainty on which his opinions are based.”); United States v. Hines, 55 F. Supp. 2d
62, 69 (D. Mass. 1999) (expert testimony concerning the general similarities and differences between
a defendant’s handwriting exemplar and a stick-up note was admissible while the specific conclusion
that the defendant was the author was not).
222. United States v. Van Wyk, 83 F. Supp. 2d 515, 524 (D.N.J. 2000).
223. United States v. Hidalgo, 229 F. Supp. 2d 961, 967 (D. Ariz. 2002).
VIII. Firearms Identification Evidence
The first written reference to firearms identification (popularly known as “ballistics”) in the United States is generally thought to have appeared in 1900.224 In the
1920s, the technique gained considerable attention because of the work of Calvin
Goddard225 and played a controversial role in the Sacco and Vanzetti case during the
same decade.226 Goddard also analyzed the bullet evidence in the St. Valentine’s Day
Massacre in 1929, in which five gangsters and two acquaintances were gunned down
in Chicago.227 In 1923, the Illinois Supreme Court wrote that positive identification
of a bullet was not only impossible but “preposterous.”228 Seven years later, however, that court did an about-face and became one of the first courts in this country
to admit firearms identification evidence.229 The technique subsequently gained
widespread judicial acceptance and was not seriously challenged until recently.
A. The Technique
1. Firearms
Typically, three types of firearms—rifles, handguns, and shotguns—are encountered in criminal investigations.230 The barrels of modern rifles and handguns are
rifled; that is, parallel spiral grooves are cut into the inner surface (bore) of the
barrel. The surfaces between the grooves are called lands. The lands and grooves twist in one of two directions: right twist or left twist. For each type of firearm produced,
the manufacturer specifies the number of lands and grooves, the direction of twist,
the angle of twist (pitch), the depth of the grooves, and the width of the lands
and grooves. As a bullet passes through the bore, the lands and grooves force the
224. See Albert Llewellyn Hall, The Missile and the Weapon, 39 Buff. Med. J. 727 (1900).
225. Calvin Goddard, often credited as the “father” of firearms identification, was responsible
for much of the early work on the subject. E.g., Calvin Goddard, Scientific Identification of Firearms and
Bullets, 17 J. Crim. L., Criminology & Police Sci. 254 (1926).
226. See Joughin & Morgan, supra note 8, at 15 (The firearms identification testimony was
“carelessly assembled, incompletely and confusedly presented, and . . . beyond the comprehension” of
the jury); Starrs, supra note 8, at 630 (Part I), 1050 (Part II).
227. See Calvin Goddard, The Valentine Day Massacre: A Study in Ammunition-Tracing, 1 Am.
J. Police Sci. 60, 76 (1930) (“Since two of the members of the execution squad had worn police
uniforms, and since it had been subsequently intimated by various persons that the wearers of the
uniforms might really have been policeman rather than disguised gangsters, it became a matter of no
little importance to ascertain, if possible, whether these rumors had any foundation in fact.”); Jim
Ritter, St. Valentine’s Hit Spurred Creation of Nation’s First Lab, Chicago Sun-Times, Feb. 9, 1997, at
40 (“Sixty-eight years ago this Friday, Al Capone’s hit men, dressed as cops, gunned down seven men
in the Clark Street headquarters of rival mobster Bugs Moran.”).
228. People v. Berkman, 139 N.E. 91, 94 (Ill. 1923).
229. People v. Fisher, 172 N.E. 743, 754 (Ill. 1930).
230. Other types of firearms, such as machine guns, tear gas guns, zip guns, and flare guns, may
also be examined.
bullet to rotate, giving it stability in flight and thus increased accuracy. Shotguns
are smooth-bore firearms; they do not have lands and grooves.
Rifles and handguns are classified according to their caliber, the diameter of the bore of the firearm, expressed in either hundredths or thousandths of an inch (e.g., .22, .45, .357 caliber) or millimeters (e.g.,
7.62 mm).231 The two major types of handguns are revolvers232 and semiautomatic
pistols. A major difference between the two is that when a semiautomatic pistol
is fired, the cartridge case is automatically ejected and, if recovered at the crime
scene, could help link the case to the firearm from which it was fired. In contrast,
when a revolver is discharged the case is not ejected.
2. Ammunition
Rifle and handgun cartridges consist of the projectile (bullet),233 case,234 propellant (powder), and primer. The primer contains a small amount of an explosive
mixture, which detonates when struck by the firing pin. When the firing pin
detonates the primer, an explosion occurs that ignites the propellant. The most
common modern propellant is smokeless powder.
3. Class characteristics
Firearms identifications may be based on either bullet or cartridge case examinations. Identifying features include class, subclass, and individual characteristics.
The class characteristics of a firearm result from design factors and are determined prior to manufacture. They include the following caliber and rifling specifications: (1) the land and groove diameters, (2) the direction of rifling (left or
right twist), (3) the number of lands and grooves, (4) the width of the lands and
grooves, and (5) the degree of the rifling twist.235 Generally, a .38-caliber bullet
with six land and groove impressions and with a right twist could have been fired
only from a firearm with these same characteristics. Such a bullet could not have
been fired from a .32-caliber firearm, or from a .38-caliber firearm with a different
number of lands and grooves or a left twist. In sum, if the class characteristics do
not match, the firearm could not have fired the bullet and is excluded.
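The exclusionary reasoning just described, under which mismatching class characteristics rule a firearm out while matching ones merely fail to exclude, can be sketched as follows. The field names and values are illustrative only and do not reflect any operational examination protocol:

```python
# Sketch of the class-characteristic screening step described above.
from dataclasses import dataclass

@dataclass(frozen=True)
class RiflingClass:
    caliber: str            # e.g. ".38"
    lands_and_grooves: int  # number of lands and grooves
    twist: str              # "right" or "left"

def excluded(evidence: RiflingClass, test_fired: RiflingClass) -> bool:
    """A mismatch on any class characteristic excludes the firearm;
    a match does not individuate, it only fails to exclude."""
    return evidence != test_fired

evidence_bullet = RiflingClass(".38", 6, "right")
suspect_firearm = RiflingClass(".32", 6, "right")

print(excluded(evidence_bullet, suspect_firearm))  # True: caliber mismatch
```

The asymmetry matters: a negative comparison is conclusive, while a positive one only narrows the candidate pool to all firearms sharing those class characteristics, which is why individuation rests on the subclass and individual characteristics discussed next.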
231. The caliber is measured from land to land in a rifled weapon. Typically, the designated
caliber is more an approximation than an accurate measurement. See 1 J. Howard Mathews, Firearms
Identification 17 (1962) (“‘nominal caliber’ would be a more proper term”).
232. Revolvers have a cylindrical magazine that rotates behind the barrel. The cylinder typically
holds five to nine cartridges, each within a separate chamber. When a revolver is fired, the cylinder
rotates and the next chamber is aligned with the barrel. A single-action revolver requires the manual
cocking of the hammer; in a double-action revolver the trigger cocks the hammer.
233. Bullets are generally composed of lead and small amounts of other elements (hardeners).
They may be completely covered (jacketed) with another metal or partially covered (semijacketed).
234. Cartridge cases are generally made of brass.
235. 1 Mathews, supra note 231, at 17.
4. Subclass characteristics
Subclass characteristics are produced at the time of manufacture and are shared
by a discrete subset of weapons in a production run or “batch.” According to the
Association of Firearm and Tool Mark Examiners (AFTE),236 subclass characteristics are discernible surface features that are more restrictive than class characteristics in that they are (1) “produced incidental to manufacture,” (2) “relate to
a smaller group source (a subset to which they belong),” and (3) can arise from
a source that changes over time.237 The AFTE states that “[c]aution should be
exercised in distinguishing subclass characteristics from class characteristics.”238
5. Individual characteristics
Bullet identification involves a comparison of the evidence bullet and a test bullet
fired from the firearm.239 The two bullets are examined by means of a comparison
microscope, which permits a split-screen view of the two bullets and manipulation
in order to attempt to align the striations (marks) on the two bullets.
Barrels are machined during the manufacturing process, and imperfections
in the tools used in the machining process are imprinted on the bore.240 The
subsequent use of the firearm adds further individual imperfections. For example,
mechanical action (erosion) caused by the friction of bullets passing through the
bore of the firearm produces accidental imperfections. Similarly, chemical action
(corrosion) caused by moisture (rust), as well as primer and propellant chemicals,
produces other imperfections.
When a bullet is fired, microscopic striations are imprinted on the bullet
surface as it passes through the bore of the firearm. These bullet markings are produced by the imperfections in the bore. Because these imperfections are randomly
produced, examiners assume that they are unique to each firearm.241 Although the
assumption is plausible, it lacks a statistical basis.242
236. AFTE is the leading professional organization in the field. There is also the Scientific Working Group for Firearms and Toolmarks (SWGGUN), which promulgates guidelines for examiners.
237. Theory of Identification, Association of Firearm and Toolmark Examiners, 30 AFTE J. 86, 88
(1998) [hereinafter AFTE Theory].
238. Id.
239. Test bullets are obtained by firing a firearm into a recovery box or bullet trap, which is
usually filled with cotton, or into a recovery tank, which is filled with water.
240. “No two barrels are microscopically identical, as the surfaces of their bores all possess
individual and characteristic markings.” Gerald Burrard, The Identification of Firearms and Forensic
Ballistics 138 (1962).
241. 1 Mathews, supra note 231, at 3 (“Experience has shown that no two firearms, even those
of the same make and model and made consecutively by the same tools, will produce the same markings on a bullet or a cartridge.”).
242. Alfred A. Biasotti, The Principles of Evidence Evaluation as Applied to Firearms and Tool Mark
Identification, 9 J. Forensic Sci. 428, 432 (1964) (“[W]e lack the fundamental statistical data needed to
Although an identification is based on objective data (the striations on the
bullet surface), the AFTE explains that the examiner’s individuation is essentially
a subjective judgment. The AFTE describes the traditional pattern recognition
methodology as “subjective in nature, founded on scientific principles and based
on the examiner’s training and experience.”243 There are no objective criteria
governing this determination: “Ultimately, unless other issues are involved, it
remains for the examiner to determine for himself the modicum of proof necessary
to arrive at a definitive opinion.”244
The condition of a firearm or evidence bullet may preclude an identification.
For example, there may be insufficient marks on the bullet or, because of mutilation, an insufficient amount of the bullet may have been recovered. Likewise, if
the bore of the firearm has changed significantly as a result of erosion or corrosion,
an identification may be impossible. (Unlike fingerprints, firearms change over
time.) In these situations, the examiner may render a “no conclusion” determination. Such a conclusion, however, may have some evidentiary value even if the
examiner cannot form an individuation opinion; that is, the firearm could have
fired the bullet if the class characteristics match.
6. Consecutive matching striae
In an attempt to make firearms identification more objective, some commentators
advocate a technique known as consecutive matching striae (CMS). As the name
implies, this method is based on finding a specified number of consecutive matching striae on two bullets. Other commentators have questioned this approach,245
and it remains a minority position.246
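The counting step at the heart of CMS can be sketched in code. This is a toy illustration only: the boolean representation of an aligned comparison and the threshold value are hypothetical, and published CMS criteria are more elaborate than a single run count.

```python
def longest_consecutive_matches(matches):
    """Return the longest run of consecutive matching striae.

    `matches` is a hypothetical list of booleans: True where a striation
    on the evidence bullet lines up with the corresponding striation on
    the test bullet after the two are aligned under the microscope."""
    best = run = 0
    for m in matches:
        run = run + 1 if m else 0  # extend the current run or reset it
        best = max(best, run)
    return best


def meets_cms_threshold(matches, threshold=6):
    """Apply an illustrative numerical threshold to the longest run."""
    return longest_consecutive_matches(matches) >= threshold
```

For example, a comparison whose longest run is three matching striae would fail an illustrative threshold of six but pass a threshold of three; the choice of threshold, not the counting itself, is where the dispute over CMS lies.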
7. Cartridge identification
Cartridge case identification is based on the same theory of random markings as
bullet identification.247 As with barrels, defects produced in the manufacturing
develop verifiable criteria.”); see also Alfred A. Biasotti, A Statistical Study of the Individual Characteristics
of Fired Bullets, 4 J. Forensic Sci. 34 (1959).
243. AFTE Theory, supra note 237, at 86.
244. Laboratory Proficiency Test, supra note 81, at 207; see also Alfred A. Biasotti, The Principles
of Evidence Evaluation as Applied to Firearms and Tool Mark Identification, supra note 242, at 429 (“In
general, the texts on firearms identification take the position that each practitioner must develop his
own intuitive criteria of identity gained through practical experience.”).
245. See Stephen G. Bunch, Consecutive Matching Striation Criteria: A General Critique, 45 J.
Forensic Sci. 955, 955 (2000) (finding the traditional methodology superior: “[P]resent-day firearm
identification, in the final analysis is subjective.”).
246. Ronald G. Nichols, Firearm and Toolmark Identification Criteria: A Review of the Literature, Part
II, 48 J. Forensic Sci. 318, 326 (2003) (CMS “has not been promoted as an alternative [to traditional
pattern recognition], but as a numerical threshold.”).
247. Burrard, supra note 240, at 107. However, bullet and cartridge case identifications differ in
several respects. Because the bullet is traveling through the barrel at the time it is imprinted with the
process leave distinctive characteristics on the breech face, firing pin, chamber,
extractor, and ejector. Subsequent use of the firearm produces additional defects.
When the trigger is pulled, the firing pin strikes the primer of the cartridge, causing the primer to detonate. This detonation ignites the propellant (powder). In the
process of combustion, the powder is converted rapidly into gases. The pressure
produced by this process propels the bullet from the weapon and also forces the
base of the cartridge case backward against the breech face, imprinting breech
face marks on the base of the cartridge case. Similarly, the firing pin, ejector, and
extractor may leave characteristic marks on a cartridge case.248
Cartridge case identification involves a comparison of the cartridge case
recovered at the crime scene and a test cartridge case obtained from the firearm
after it has been fired. Shotgun shell casings may be identified in this way, as well.
As in bullet identification, the comparison microscope is used in the examination.
According to AFTE, “interpretation of toolmark individualization and identification is still considered to be subjective in nature, based on one’s training and
experience.”249
8. Automated identification systems
“These ballistic imaging systems use the powerful searching capabilities of the
computer to match the images of recovered crime scene evidence against digitized
images stored in a computer database.”250 The current system is the Integrated
Ballistics Information System (IBIS).251 Automated systems “give[ ] firearms examiners the ability to screen virtually unlimited numbers of bullets and cartridge
casings for possible matches.”252 These systems identify a number of candidate
matches. They do not replace the examiner, who still must make the final comparison: “‘High Confidence’ candidates (likely hits) are referred to a firearms
examiner for examination on a comparison microscope.”253 The examiner need
bore imperfections, these marks are “sliding” imprints, called striated marks. In contrast, the cartridge
case receives “static” imprints, called impressed marks. Id. at 145.
248. Ejector and extractor marks by themselves may indicate only that the cartridge case had
been loaded in, not fired from, a particular firearm.
249. Eliot Springer, Toolmark Examinations—A Review of Its Development in the Literature, 40 J.
Forensic Sci. 964, 966–67 (1995).
250. Benchmark Evaluation Studies of the Bulletproof and Drugfire Ballistic Imaging Systems, 22 Crime
Lab. Digest 51 (1995); see also Jan De Kinder & Monica Bonfanti, Automated Comparisons of Bullet
Striations Based on 3D Topography, 101 Forensic Sci. Int’l 85, 86 (1999) (“[A]n automatic system will
cut the time demanding and tedious manual searches for one specific item in large open case files.”).
251. See Jan De Kinder et al., Reference Ballistic Imaging Database Performance, 140 Forensic Sci.
Int’l 207 (2004); Ruprecht Nennstiel & Joachim Rahm, An Experience Report Regarding the Performance
of the IBIS™ Correlator, 51 J. Forensic Sci. 24 (2006).
252. Richard E. Tontarski & Robert M. Thompson, Automated Firearms Evidence Comparison: A
Forensic Tool for Firearms Identification—An Update, 43 J. Forensic Sci. 641, 641 (1998).
253. Id.
not accept the highest ranked candidate identified by the system. For that matter,
the examiner may reject all the candidates.
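The screening workflow described above can be sketched as follows. Everything in this sketch is hypothetical (the function names, score scale, and threshold); real systems such as IBIS use proprietary correlation algorithms, and the sketch is meant only to show the division of labor between the computer and the examiner.

```python
def screen_candidates(similarity, evidence, database, top_k=10, high_confidence=0.9):
    """Rank stored exhibits against the evidence image and flag likely hits.

    `similarity` is a hypothetical function returning a score in [0, 1]
    for a pair of exhibit images."""
    ranked = sorted(database, key=lambda item: similarity(evidence, item),
                    reverse=True)
    # Only top-ranked, high-confidence candidates are referred to a
    # firearms examiner, who makes the final comparison under a
    # comparison microscope and is free to reject every candidate.
    return [item for item in ranked[:top_k]
            if similarity(evidence, item) >= high_confidence]
```

The design point the case law turns on is in the last line: the system narrows the pool, but it outputs candidates, not identifications.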
9. Toolmarks
Toolmark identifications rest on essentially the same theory as firearms identifications.254 Tools have both (1) class characteristics and (2) individual characteristics;
the latter are accidental imperfections produced by the machining process and subsequent use. When the tool is used, these characteristics are sometimes imparted
onto the surface of another object struck by the tool. Toolmarks may be impressions (compression marks), striations (friction or scrape marks), or a combination
of both.255 Fracture matches constitute another type of examination.
The marks may be left on a variety of different materials, such as wood or
metal. In some cases, only class characteristics can be matched. For example, it
may be possible to identify a mark (impression) left on a piece of wood as having
been produced by a hammer, punch, or screwdriver. A comparison of the mark
and the evidence tool may establish the size of the tool (another class characteristic). Unusual features of the tool, such as a chip, may permit a positive identification. Striations caused by scraping with a tool can also produce distinguishing
marks in much the same way that striations are imprinted on a bullet when a
firearm is discharged. This type of examination has the same limitations as firearms
identification: “[T]he characteristics of a tool will change with use.”256
Firearms identification could be considered a subspecialty of toolmark identification; the firearm (tool) imprints its individual characteristics on the bullet.
However, the markings on a bullet or cartridge case are imprinted in roughly the
same way every time a firearm is fired. In contrast, toolmark analysis can be more
complicated because a tool can be employed in a variety of different ways, each
producing a different mark: “[I]n toolmark work the angle at which the tool was
used must be duplicated in the test standard, pressures must be dealt with, and the
degree of hardness of metals and other materials must be taken into account.”257
The comparison microscope is also used in this examination. As with firearms identification testimony, toolmark identification testimony is based on the
subjective judgment of the examiner, who determines whether sufficient marks of
254. See Biasotti, The Principles of Evidence Evaluation as Applied to Firearms and Tool Mark Identification, supra note 242; see also Springer, supra note 249, at 964 (“The identification is based . . . on
a series of scratches, depressions, and other marks which the tool leaves on the object it comes into
contact with. The combination of these various marks ha[s] been termed toolmarks and the claim is
that every instrument can impart a mark individual to itself.”).
255. David Q. Burd & Roger S. Greene, Tool Mark Examination Techniques, 2 J. Forensic Sci.
297, 298 (1957).
256. Emmett M. Flynn, Toolmark Identification, 2 J. Forensic Sci. 95, 102 (1957).
257. Id. at 105.
similarity are present to permit an identification.258 There are no objective criteria
governing the determination of whether there is a match.259
B. The Empirical Record
In its 2009 report, the NRC summarized the state of the research as follows:
Because not enough is known about the variabilities among individual tools and
guns, we are not able to specify how many points of similarity are necessary for
a given level of confidence in the result. Sufficient studies have not been done
to understand the reliability and repeatability of the methods. The committee
agrees that class characteristics are helpful in narrowing the pool of tools that may
have left a distinctive mark. Individual patterns from manufacture or from wear
might, in some cases, be distinctive enough to suggest one particular source, but
additional studies should be performed to make the process of individualization
more precise and repeatable.260
The 1978 Crime Laboratory Proficiency Testing Program reported mixed
results on firearms identification tests. In one test, 5.3% of the participating laboratories misidentified firearms evidence, and in another test 13.6% erred. These tests
involved bullet and cartridge case comparisons. The Project Advisory Committee
considered these errors “particularly grave in nature” and concluded that they
probably resulted from carelessness, inexperience, or inadequate supervision.261 A
third test required the examination of two bullets and two cartridge cases to identify
the “most probable weapon” from which each was fired. The error rate was 28.2%.
In later tests,
[e]xaminers generally did very well in making the comparisons. For all fifteen
tests combined, examiners made a total of 2106 [bullet and cartridge case] comparisons and provided responses which agreed with the manufacturer responses
88% of the time, disagreed in only 1.4% of responses, and reported inconclusive
results in 10% of cases.262
258. See Springer, supra note 249, at 966–67 (“According to the Association of Firearms and
Toolmarks Examiners’ Criteria for Identification Committee, interpretation of toolmark individualization
and identification is still considered to be subjective in nature, based on one’s training and experience.”).
259. As one commentator has noted: “[I]t is not possible at present to categorically state the
number and percentage of the [striation] lines which must correspond.” Burd & Greene, supra note
255, at 310.
260. NRC Forensic Science Report, supra note 3, at 154.
261. Laboratory Proficiency Test, supra note 81, at 207–08.
262. Peterson & Markham, supra note 82, at 1018. The authors also stated:
The performance of laboratories in the firearms tests was comparable to that of the earlier LEAA study,
although the rate of successful identifications actually was slightly lower—88% vs. 91%. Laboratories cut
the rate of errant identifications by half (3% to 1.4%) but the rate of inconclusive responses doubled,
from 5% to 10%.
Id. at 1019.
Proficiency testing on toolmark examinations has also been reported.263
For the period 1978–1999, firearms examiners performed well on their CTS
proficiency tests, with only 2% to 3% of their comparisons incorrect, but with 10%
to 13% of their responses inconclusive.264 The scenarios that accompanied the test
materials asked examiners to compare test-fired bullets and/or cartridge cases with
evidence projectiles found at a crime scene. Between 2000 and 2005, participants,
again, performed very well, averaging less than 1% incorrect responses, but with
inconclusive results about 10% of the time. Most of the inconclusive results in
these tests occurred where bullets and/or cartridge cases were actually fired from
different weapons. Examiners frequently stated they were unable to reach the
proper conclusion because they did not have the actual weapon with which they
could perform their own test fires of ammunition.
In CTS toolmark proficiency comparisons, laboratories were asked to compare marks made with such tools as screwdrivers, bolt cutters, hammers, and handstamps. In some cases, tools were supplied to participants, but in most cases they
were given only test marks. Over the entire 1978–2005 period, fewer than 5%
of responses were in error, but individual test results varied substantially. In some
cases, 30% to 40% of replies were inconclusive, because laboratories were unsure
if the blade of the tool in question might have been altered between the time(s)
different markings had been made. During the final 6-year period reviewed
(2000–2005), laboratories averaged a 1% incorrect comparison rate for toolmarks.
Inconclusive responses remained high (30% and greater); toolmark and firearms testing are the evidence categories where comparisons have the highest rates of inconclusive responses.
Questions have arisen concerning the significance of these tests. First, such
testing is not required of all firearms examiners, only those working in laboratories voluntarily seeking accreditation by the ASCLD. In short, “the sample is
self-selecting and may not be representative of the complete universe of firearms
examiners.”265 Second, the examinations are not blind—that is, examiners know
when they are being tested. Thus, the examiner may be more meticulous and
careful than in ordinary case work. Third, the results of an evaluation can vary,
depending on whether an “inconclusive” answer is counted. Fourth, the rigor
of the examinations has been questioned. According to one witness, in a 2005
test involving cartridge case comparisons, none of the 255 test-takers nationwide
answered incorrectly. The court observed: “One could read these results to mean
that the technique is foolproof, but the results might instead indicate that the test
was somewhat elementary.”266
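The third point, how "inconclusive" answers are scored, can be shown with a toy calculation. The counts below are hypothetical and not drawn from any actual proficiency test; the sketch only illustrates how the same results yield different "error rates" under different scoring conventions.

```python
def error_rates(correct, incorrect, inconclusive):
    """Three common ways to score the same proficiency-test results."""
    total = correct + incorrect + inconclusive
    conclusive = correct + incorrect
    return {
        "errors / all responses": incorrect / total,
        "errors / conclusive responses only": incorrect / conclusive,
        "errors with inconclusives counted as misses":
            (incorrect + inconclusive) / total,
    }


# With hypothetical counts of 88 correct, 2 incorrect, and 10 inconclusive
# responses, the reported "error rate" ranges from 2% to 12% depending on
# which convention is used.
```

Because the three conventions diverge most when inconclusive rates are high, and firearms and toolmark tests have the highest inconclusive rates, the choice of convention matters most in precisely this evidence category.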
263. Id. at 1025 (“Overall, laboratories performed not as well on the toolmark tests as they did
on the firearms tests.”).
264. Thornton & Peterson, supra note 204, § 29:47, at 66.
265. United States v. Monteiro, 407 F. Supp. 2d 351, 367 (D. Mass. 2006).
266. Id.
In 2008, the NAS published a report on computer imaging of bullets.267 Although
firearms identification was not the primary focus of the investigation, a section
of the report commented on this subject.268 After surveying the literature on
the uniqueness, reproducibility, and permanence of individual characteristics, the
committee noted that “[m]ost of these studies are limited in scale and have been
conducted by firearms examiners (and examiners in training) in state and local
law enforcement laboratories as adjuncts to their regular casework.”269 The report
concluded: “The validity of the fundamental assumptions of uniqueness and reproducibility of firearms-related toolmarks has not yet been fully demonstrated.”270
This statement, however, was qualified:
There is one baseline level of credibility . . . that must be demonstrated lest any
discussion of ballistic imaging be rendered moot—namely, that there is at least
some “signal” that may be detected. In other words, the creation of toolmarks
must not be so random and volatile that there is no reason to believe that any
similar and matchable marks exist on two exhibits fired from the same gun. The
existing research, and the field’s general acceptance in legal proceedings for several decades, is more than adequate testimony to that baseline level. Beyond that
level, we neither endorse nor oppose the fundamental assumptions. Our review
in this chapter is not—and is not meant to be—a full weighing of evidence for
or against the assumptions, but it is ample enough to suggest that they are not
fully settled, mechanically or empirically.
Another point follows directly: Additional general research on the uniqueness and
reproducibility of firearms-related toolmarks would have to be done if the basic premises of
firearms identification are to be put on a more solid scientific footing.271
The 2008 report cautioned:
Conclusions drawn in firearms identification should not be made to imply the presence of
a firm statistical basis when none has been demonstrated. Specifically, . . . examiners
tend to cast their assessments in bold absolutes, commonly asserting that a match
can be made “to the exclusion of all other firearms in the world.” Such comments cloak an inherently subjective assessment of a match with an extreme
probability statement that has no firm grounding and unrealistically implies an
error rate of zero.272
267. National Research Council, Ballistic Imaging (2008), available at http://www.nap.edu/
catalog.php?record_id=12162.
268. The committee was asked to assess the feasibility, accuracy, reliability, and technical
capability of developing and using a national ballistic database as an aid to criminal investigations.
It concluded: (1) “A national reference ballistic image database of all new and imported guns is not
advisable at this time.” (2) “NIBIN can and should be made more effective through operational and
technological improvements.” Id. at 5.
269. Id. at 70.
270. Id. at 81.
271. Id. at 81–82.
272. Id. at 82.
The issue of the adequacy of the empirical basis of firearms identification
expertise remains in dispute,273 and research is ongoing. A recent study tested 10
consecutively rifled Ruger pistol barrels; in 463 tests during the study, no false
positives and 8 inconclusive results were reported.274
“But the capsule summaries [in this study] suggest a heavy reliance on the subjective findings of examiners rather than on the rigorous quantification and analysis
of sources of variability.”275
C. Case Law Development
Firearms identification developed in the early part of the last century, and by
1930, courts were admitting evidence based on this technique.276 Subsequent cases
followed these precedents, admitting evidence of bullet,277 cartridge case,278 and
shot shell279 identifications. A number of courts have also permitted an expert to
273. Compare Ronald G. Nichols, Defending the Scientific Foundations of the Firearms and Tool
Mark Identification Discipline: Responding to Recent Challenges, 52 J. Forensic Sci. 586 (2007), with Adina
Schwartz, Commentary on “Nichols, R.G., Defending the scientific foundations of the firearms and tool mark
identification discipline: Responding to recent challenges, J. Forensic Sci. 52(3):586-94 (2007),” 52 J. Forensic
Sci. 1414 (2007) (responding to Nichols). Moreover, AFTE disputed the Academy’s conclusions. See
The Response of the Association of Firearm and Tool Mark Examiners to the National Academy of Sciences 2008
Report Assessing the Feasibility, Accuracy, and Capability of a National Ballistic Database August 20, 2008, 40
AFTE J. 234 (2008) (concluding that underlying assumptions of uniqueness and reproducibility have
been demonstrated, and the implication that there is no statistical basis is unwarranted); see also Adina
Schwartz, A Systemic Challenge to the Reliability and Admissibility of Firearms and Toolmark Identification,
6 Colum. Sci. & Tech. L. Rev. 2 (2005).
274. James E. Hamby et al., The Identification of Bullets Fired from 10 Consecutively Rifled 9mm
Ruger Pistol Barrels—A Research Project Involving 468 Participants from 19 Countries, 41 AFTE J. 99
(Spring 2009).
275. NRC Forensic Science Report, supra note 3, at 155.
276. E.g., People v. Fisher, 172 N.E. 743 (Ill. 1930); Evans v. Commonwealth, 19 S.W.2d 1091
(Ky. 1929); Burchett v. State, 172 N.E. 555 (Ohio Ct. App. 1930).
277. E.g., United States v. Wolff, 5 M.J. 923, 926 (N.C.M.R. 1978); State v. Mack, 653 N.E.2d
329, 337 (Ohio 1995) (The examiner “compared the test shot with the morgue bullet recovered from
the victim, . . . and the spent shell casings recovered from the crime scene, concluding that all had
been discharged from appellant’s gun.”).
278. E.g., Bentley v. Scully, 41 F.3d 818, 825 (2d Cir. 1994) (“[A] ballistic expert found that the
spent nine millimeter bullet casing recovered from the scene of the shooting was fired from the pistol
found on the rooftop.”); State v. Samonte, 928 P.2d 1, 6 (Haw. 1996) (“Upon examining the striation
patterns on the casings, [the examiner] concluded that the casing she had fired matched six casings that
police had recovered from the house.”).
279. E.g., Williams v. State, 384 So. 2d 1205, 1210–11 (Ala. Crim. App. 1980); Burge v. State,
282 So. 2d 223, 229 (Miss. 1973); Commonwealth v. Whitacre, 878 A.2d 96, 101 (Pa. Super. Ct.
2005) (“no abuse of discretion in the trial court’s decision to permit admission of the evidence regarding comparison of the two shell casings with the shotgun owned by Appellant”).
testify that a bullet could have been fired from a particular firearm;280 that is, the class
characteristics of the bullet and the firearm are consistent.281
The early post-Daubert challenges to the admissibility of firearms identification evidence failed.282 This changed in 2005 in United States v. Green,283 where
the court ruled that the expert could describe only the ways in which the casings were similar but not that the casings came from a specific weapon “to the
exclusion of every other firearm in the world.”284 In United States v. Monteiro,285
the expert had not made any sketches or taken photographs and thus adequate
documentation was lacking: “Until the basis for the identification is described in
such a way that the procedure performed by [the examiner] is reproducible and
verifiable, it is inadmissible under Rule 702.”286
In 2007 in United States v. Diaz,287 the court found that the record did not
support the conclusion that identifications could be made to the exclusion of all
other firearms in the world. Thus, “the examiners who testify in this case may
only testify that a match has been made to a ‘reasonable degree of certainty in the
ballistics field.’”288 In 2008, United States v. Glynn289 ruled that the expert could
280. E.g., People v. Horning, 102 P.3d 228, 236 (Cal. 2004) (expert “opined that both bullets
and the casing could have been fired from the same gun . . . ; because of their condition he could not
say for sure”); Luttrell v. Commonwealth, 952 S.W.2d 216, 218 (Ky. 1997) (expert “testified only that
the bullets which killed the victim could have been fired from Luttrell’s gun”); State v. Reynolds, 297
S.E.2d 532, 539–40 (N.C. 1982); Commonwealth v. Moore, 340 A.2d 447, 451 (Pa. 1975).
281. This type of evidence has some probative value and satisfies the minimal evidentiary test
for logical relevancy. See Fed. R. Evid. 401. As one court commented, the expert’s “testimony,
which established that the bullet which killed [the victim] could have been fired from the same caliber and make of gun found in the possession of [the defendant], significantly advanced the inquiry.”
Commonwealth v. Hoss, 283 A.2d 58, 68 (Pa. 1971).
282. See United States v. Hicks, 389 F.3d 514, 526 (5th Cir. 2004) (ruling that “the matching of spent shell casings to the weapon that fired them has been a recognized method of ballistics
testing in this circuit for decades”); United States v. Foster, 300 F. Supp. 2d 375, 377 n.1 (D. Md.
2004) (“Ballistics evidence has been accepted in criminal cases for many years. . . . In the years since
Daubert, numerous cases have confirmed the reliability of ballistics identification.”); United States
v. Santiago, 199 F. Supp. 2d 101, 111 (S.D.N.Y. 2002) (“The Court has not found a single case in
this Circuit that would suggest that the entire field of ballistics identification is unreliable.”); State v.
Anderson, 624 S.E.2d 393, 397–98 (N.C. Ct. App. 2006) (no abuse of discretion in admitting bullet
identification evidence); Whitacre, 878 A.2d at 101 (“no abuse of discretion in the trial court’s decision
to permit admission of the evidence regarding comparison of the two shell casings with the shotgun
owned by Appellant”).
283. 405 F. Supp. 2d 104 (D. Mass. 2005).
284. Id. at 107. The court had followed the same approach in a handwriting case. See United
States v. Hines, 55 F. Supp. 2d 62, 67 (D. Mass. 1999) (expert testimony concerning the general similarities and differences between a defendant’s handwriting exemplar and a stick-up note was admissible
but not the specific conclusion that the defendant was the author).
285. 407 F. Supp. 2d 351 (D. Mass. 2006).
286. Id. at 374.
287. No. CR 05-00167 WHA, 2007 WL 485967 (N.D. Cal. Feb. 12, 2007).
288. Id. at *1.
289. 578 F. Supp. 2d 567 (S.D.N.Y. 2008).
not use the term “reasonable scientific certainty” in testifying. Rather, the expert
would be permitted to testify only that it was “more likely than not” that recovered bullets and cartridge cases came from a particular weapon.
Yet other courts continued to uphold admission.290 By way of example, in
United States v. Williams,291 the Second Circuit upheld the admissibility of firearms
identification evidence—bullets and cartridge casings. The opinion, however,
contained some cautionary language: “We do not wish this opinion to be taken
as saying that any proffered ballistic expert should be routinely admitted.”292
Several cases limited testimony after the 2009 NAS Report was published.293 In
the past, courts often have admitted toolmark identification evidence,294 includ-
290. See United States v. Natson, 469 F. Supp. 2d 1253, 1261 (M.D. Ga. 2007) (“According
to his testimony, these toolmarks were sufficiently similar to allow him to identify Defendant’s gun
as the gun that fired the cartridge found at the crime scene. He opined that he held this opinion to
a 100% degree of certainty. . . . The Court also finds [the examiner’s] opinions reliable and based
upon a scientifically valid methodology. Evidence was presented at the hearing that the toolmark
testing methodology he employed has been tested, has been subjected to peer review, has an
ascertainable error rate, and is generally accepted in the scientific community.”); Commonwealth
v. Meeks, Nos. 2002-10961, 2003-10575, 2006 WL 2819423, at * 50 (Mass. Super. Ct. Sept. 28,
2006) (“The theory and process of firearms identification are generally accepted and reliable, and
the process has been reliably applied in these cases. Accordingly, the firearms identification evidence, including opinions as to matches, may be presented to the juries for their consideration, but
only if that evidence includes a detailed statement of the reasons for those opinions together with
appropriate documentation.”).
291. 506 F.3d 151, 161–62 (2d Cir. 2007) (“Daubert did make plain that Rule 702 embodies a
more liberal standard of admissibility for expert opinions than did Frye. . . . But this shift to a more
permissive approach to expert testimony did not abrogate the district court’s gatekeeping function.
Nor did it ‘grandfather’ or protect from Daubert scrutiny evidence that had previously been admitted
under Frye.”) (citations omitted).
292. Id. at 161.
293. See United States v. Willock, 696 F. Supp. 2d 536, 546, 549 (D. Md. 2010) (holding,
based on a comprehensive magistrate’s report, that “Sgt. Ensor shall not opine that it is a ‘practical
impossibility’ for a firearm to have fired the cartridges other than the common ‘unknown firearm’
to which Sgt. Ensor attributes the cartridges.” Thus, “Sgt. Ensor shall state his opinions and conclusions without any characterization as to the degree of certainty with which he holds them.”); United
States v. Taylor, 663 F. Supp. 2d 1170, 1180 (D.N.M. 2009) (“[B]ecause of the limitations on the
reliability of firearms identification evidence discussed above, Mr. Nichols will not be permitted to
testify that his methodology allows him to reach this conclusion as a matter of scientific certainty.
Mr. Nichols also will not be allowed to testify that he can conclude that there is a match to the
exclusion, either practical or absolute, of all other guns. He may only testify that, in his opinion,
the bullet came from the suspect rifle to within a reasonable degree of certainty in the firearms
examination field.”).
294. In 1975, the Ninth Circuit noted that toolmark identification “rests upon a scientific basis
and is a reliable and generally accepted procedure.” United States v. Bowers, 534 F.2d 186, 193 (9th
Cir. 1976).
ing screwdrivers,295 crowbars,296 punches,297 knives,298 as well as other objects.299
An expert’s opinion is admissible even if the expert cannot testify to a positive
identification.300
IX. Bite Mark Evidence
Bite mark analysis has been used for more than 50 years to establish a connection
between a defendant and a crime.301 The specialty developed within the field of
forensic dentistry as an adjunct of dental identification, rather than originating in
295. E.g., State v. Dillon, 161 N.W.2d 738, 741 (Iowa 1968) (screwdriver and nail bar fit marks
on door frame); State v. Wessling, 150 N.W.2d 301 (Iowa 1967) (screwdriver); State v. Hazelwood,
498 P.2d 607, 612 (Kan. 1972) (screwdriver and imprint on window molding); State v. Wade, 465
S.W.2d 498, 499–500 (Mo. 1971) (screwdriver and pry marks on door jamb); State v. Brown, 291
S.W.2d 615, 618–19 (Mo. 1956) (crowbar and screwdriver marks on window sash and door); State v.
Eickmeier, 191 N.W.2d 815, 816 (Neb. 1971) (screwdriver and marks on door).
296. E.g., Brown, 291 S.W.2d at 618–19 (Mo. 1956) (crowbar and screwdriver marks on
window sash and door); State v. Raines, 224 S.E.2d 232, 234 (N.C. Ct. App. 1976).
297. E.g., State v. Montgomery, 261 P.2d 1009, 1011–12 (Kan. 1953) (punch marks on safe).
298. E.g., State v. Baldwin, 12 P. 318, 324–25 (Kan. 1886) (experienced carpenters could testify
that wood panel could have been cut by accused’s knife); Graves v. State, 563 P.2d 646, 650 (Okla. Crim.
App. 1977) (blade and knife handle matched); State v. Clark, 287 P. 18, 20 (Wash. 1930) (knife and cuts
on tree branches); State v. Bernson, 700 P.2d 758, 764 (Wash. Ct. App. 1985) (knife tip comparison).
299. E.g., United States v. Taylor, 334 F. Supp. 1050, 1056–57 (E.D. Pa. 1971) (impressions on
stolen vehicle and impressions made by dies found in defendant’s possession), aff’d, 469 F.2d 284 (3d
Cir. 1972); State v. McClelland, 162 N.W.2d 457, 462 (Iowa 1968) (pry bar and marks on “jimmied”
door); Adcock v. State, 444 P.2d 242, 243–44 (Okla. Crim. App. 1968) (tool matched pry marks on
door molding); State v. Olsen, 317 P.2d 938, 940 (Or. 1957) (hammer marks on the spindle of a safe).
300. For example, in United States v. Murphy, 996 F.2d 94 (5th Cir. 1993), an FBI expert gave
limited testimony “that the tools such as the screwdriver associated with Murphy ‘could’ have made
the marks on the ignitions but that he could not positively attribute the marks to the tools identified
with Murphy.” Id. at 99; see also State v. Genrich, 928 P.2d 799, 802 (Colo. App. 1996) (upholding
expert testimony that three different sets of pliers recovered from the accused’s house were used to
cut wire and fasten a cap found in the debris from pipe bombs: “The expert’s premise, that no two
tools make exactly the same mark, is not challenged by any evidence in this record. Hence, the lack
of a database and points of comparison does not render the opinion inadmissible.”).
Although most courts have been receptive to toolmark evidence, a notable exception was Ramirez
v. State, 810 So. 2d 836, 849–51 (Fla. 2001). In Ramirez, the Florida Supreme Court rejected the testimony of five experts who claimed general acceptance for a process of matching a knife with a cartilage
wound in a murder victim—a type of “toolmark” comparison. Although the court applied Frye, it
emphasized the lack of testing, the paucity of “meaningful peer review,” the absence of a quantified
error rate, and the lack of developed objective standards. In Sexton v. State, 93 S.W.3d 96 (Tex. Crim.
App. 2002), an expert testified that cartridge cases from unfired bullets found in the appellant’s apartment
had distinct marks that matched fired cartridge cases found at the scene of the offense. The court ruled
the testimony inadmissible: “This record qualifies Crumley as a firearms identification expert, but does
not support his capacity to identify cartridge cases on the basis of magazine marks only.” Id. at 101.
301. See E.H. Dinkel, The Use of Bite Mark Evidence as an Investigative Aid, 19 J. Forensic Sci.
535 (1973).
crime laboratories. Courts have admitted bite mark comparison evidence in homicide, rape, and child abuse cases. In virtually all the cases, the evidence was first
offered by the prosecution. The typical bite mark case has involved the identification of the defendant by matching his dentition with a mark left on the victim. In
several cases, however, the victim’s teeth have been compared with marks on the
defendant’s body. One bite mark case involved dentures302 and another braces.303
A few cases have entailed bite impressions on foodstuff found at a crime scene:
apple,304 piece of cheese,305 and sandwich.306 Still other cases involved dog bites.307
Bite marks occur primarily in sex-related crimes, child abuse cases, and offenses
involving physical altercations, such as homicide. A survey of 101 cases reported
these findings: “More than one bitemark was present in 48% of all the bite cases
studied. Bitemarks were found on adults in 81.3% of the cases and on children under
18 years-of-age in 16.7% of cases. Bitemarks were associated with the following
types of crimes: murder, including attempted murder (53.9%), rape (20.8%), sexual
assault (9.7%), child abuse (9.7%), burglary (3.3%), and kidnapping (12.6%).”308
A. The Technique
Bite mark identification is an offshoot of the dental identification of deceased
persons, which is often used in mass disasters. Dental identification is based on the
assumption that every person’s dentition is unique. The human adult dentition
consists of 32 teeth, each with 5 anatomic surfaces. Thus, there are 160 dental
surfaces that can contain identifying characteristics. Restorations, with varying
shapes, sizes, and restorative materials, may offer numerous additional points of
individuality. Moreover, the number of teeth, prostheses, decay, malposition,
302. See Rogers v. State, 344 S.E.2d 644, 647 (Ga. 1986) (“Bite marks on one of Rogers’ arms
were consistent with the dentures worn by the elderly victim.”).
303. See People v. Shaw, 664 N.E.2d 97, 101, 103 (Ill. App. Ct. 1996) (In a murder and aggravated sexual assault prosecution, the forensic odontologist opined that the mark on the defendant
was caused by the orthodontic braces on the victim’s teeth; “Dr. Kenney admitted that he was not a
certified toolmark examiner”; no abuse of discretion to admit evidence).
304. See State v. Ortiz, 502 A.2d 400, 401 (Conn. 1985).
305. See Doyle v. State, 263 S.W.2d 779, 779 (Tex. Crim. App. 1954); Seivewright v. State, 7
P.3d 24, 26 (Wyo. 2000) (“On the basis of his comparison of the impressions from the cheese with
Seivewright’s dentition, Dr. Huber concluded that Seivewright was the person who bit the cheese.”).
306. See Banks v. State, 725 So. 2d 711, 714–16 (Miss. 1997) (finding a due process violation
when prosecution expert threw away sandwich after finding the accused’s teeth consistent with the
sandwich bite).
307. See Davasher v. State, 823 S.W.2d 863, 870 (Ark. 1992) (expert testified that victim’s
dog could be eliminated as the source of mark found on defendant); State v. Powell, 446 S.E.2d 26,
27–28 (N.C. 1994) (“A forensic odontologist testified that dental impressions taken from Bruno and
Woody [accused’s dogs] were compatible with some of the lacerations in the wounds pictured in scale
photographs of Prevette’s body.”).
308. Iain A. Pretty & David J. Sweet, Anatomical Location of Bitemarks and Associated Findings in
101 Cases from the United States, 45 J. Forensic Sci. 812, 812 (2000).
malrotation, peculiar shapes, root canal therapy, bone patterns, bite relationship,
and oral pathology may also provide identifying characteristics.309 The courts have
accepted dental identification as a means of establishing the identity of a homicide
victim,310 with some cases dating back to the nineteenth century.311 According to
one court, “it cannot be seriously disputed that a dental structure may constitute
a means of identifying a deceased person . . . where there is some dental record
of that person with which the structure may be compared.”312
1. Theory of uniqueness
Identification of a suspect by matching his or her dentition with a bite mark
found on the victim of a crime rests on the theory that each person’s dentition
is unique. However, there are significant differences between the use of forensic
dental techniques to identify a decedent and the use of bite mark analysis to identify a perpetrator.313 In 1969, when bite mark comparisons were first studied, one
authority raised the following problems:
[Bite]marks can never be taken to reproduce accurately the dental features of
the originator. This is due partially to the fact that bite marks generally include
only a limited number of teeth. Furthermore, the material (whether food stuff or
human skin) in which the mark has been left is usually found to be a very unsatisfactory impression material with shrinkage and distortion characteristics that are
unknown. Finally, these marks represent only the remaining and fixed picture
of an action, the mechanism of which may vary from case to case. For instance,
there is as yet no precise knowledge of the possible differences between biting
off a morsel of food and using one’s teeth for purposes of attack or defense.314
309. The identification is made by comparing the decedent’s teeth with antemortem dental
records, such as charts and, more importantly, radiographs.
310. E.g., Wooley v. People, 367 P.2d 903, 905 (Colo. 1961) (dentist compared his patient’s
record with dentition of a corpse); Martin v. State, 636 N.E.2d 1268, 1272 (Ind. Ct. App. 1994)
(dentist qualified to compare X rays of one of his patients with skeletal remains of murder victim and
make a positive identification); Fields v. State, 322 P.2d 431, 446 (Okla. Crim. App. 1958) (murder
case in which victim was burned beyond recognition).
311. See Commonwealth v. Webster, 59 Mass. (5 Cush.) 295, 299–300 (1850) (remains of the
incinerated victim, including charred teeth and parts of a denture, were identified by the victim’s
dentist); Lindsay v. People, 63 N.Y. 143, 145–46 (1875).
312. People v. Mattox, 237 N.E.2d 845, 846 (Ill. App. Ct. 1968).
313. See Iain A. Pretty & David J. Sweet, The Scientific Basis for Human Bitemark Analyses—A
Critical Review, 41 Sci. & Just. 85, 88 (2001) (“A distinction must be drawn from the ability of a
forensic dentist to identify an individual from their dentition by using radiographs and dental records
and the science of bitemark analysis.”).
314. S. Keiser-Nielson, Forensic Odontology, 1 U. Tol. L. Rev. 633, 636 (1969); see also NRC
Forensic Science Report, supra note 3, at 174 (“[B]ite marks on the skin will change over time and can
be distorted by the elasticity of the skin, the unevenness of the surface bite, and swelling and healing.
These features may severely limit the validity of forensic odontology. Also, some practical difficulties,
such as distortions in photographs and changes over time in the dentition of suspects, may limit the
accuracy of the results.”).
Dental identifications of decedents do not pose any of these problems; the expert
can often compare all 32 teeth with X rays depicting all those teeth. However,
in the typical bite mark case, all 32 teeth cannot be compared; often only the 4 to 8
biting teeth can be compared. Similarly, not all five anatomic surfaces are
engaged in biting; only the edges of the front teeth come into play. In sum, bite
mark identification depends not only on the uniqueness of each person’s dentition
but also on “whether there is a [sufficient] representation of that uniqueness in the
mark found on the skin or other inanimate object.”315
2. Methods of comparison
Several methods of bite mark analysis have been reported. All involve three steps:
(1) registration of both the bite mark and the suspect’s dentition, (2) comparison
of the dentition and bite mark, and (3) evaluation of the points of similarity or
dissimilarity. The reproductions of the bite mark and the suspect’s dentition are
analyzed through a variety of methods.316 The comparison may be either direct or
indirect. A model of the suspect’s teeth is used in direct comparisons; the model
is compared to life-size photographs of the bite mark. Transparent overlays made
from the model are used in indirect comparisons.
Although the expert’s conclusions are based on objective data, the ultimate
opinion regarding individuation is essentially a subjective one.317 There is no
accepted minimum number of points of identity required for a positive identification.318 The experts who have appeared in published bite mark cases have
testified to a wide range of points of similarity, from a low of eight points to a
315. Raymond D. Rawson et al., Statistical Evidence for the Individuality of the Human Dentition,
29 J. Forensic Sci. 252 (1984).
316. See David J. Sweet, Human Bitemarks: Examination, Recovery, and Analysis, in Manual of
Forensic Odontology 162 (American Society of Forensic Odontology, 3d ed. 1997) [hereinafter
ASFO Manual] (“The analytical protocol for bitemark comparison is made up of two broad categories. Firstly, the measurement of specific traits and features called a metric analysis, and secondly,
the physical matching or comparison of the configuration and pattern of the injury called a pattern
association.”); see also David J. Sweet & C. Michael Bowers, Accuracy of Bite Mark Overlays: A Comparison of Five Common Methods to Produce Exemplars from a Suspect’s Dentition, 43 J. Forensic Sci. 362,
362 (1998) (“A review of the forensic odontology literature reveals multiple techniques for overlay
production. There is an absence of reliability testing or comparison of these methods to known or
reference standards.”).
317. See Roland F. Kouble & Geoffrey T. Craig, A Comparison Between Direct and Indirect
Methods Available for Human Bite Mark Analysis, 49 J. Forensic Sci. 111, 111 (2004) (“It is important
to remember that computer-generated overlays still retain an element of subjectivity, as the selection
of the biting edge profiles is reliant on the operator placing the ‘magic wand’ onto the areas to be
highlighted within the digitized image.”).
318. See Keiser-Nielson, supra note 314, at 637–38; see also Stubbs v. State, 845 So. 2d 656, 669
(Miss. 2003) (“There is little consensus in the scientific community on the number of points which
must match before any positive identification can be announced.”).
high of 52 points.319 Moreover, disagreements among experts in court appear
commonplace: “Although bite mark evidence has demonstrated a high degree of
acceptance, it continues to be hotly contested in ‘battles of the experts.’ Review
of trial transcripts reveals that distortion and the interpretation of distortion is
a factor in most cases.”320 Because of the subjectivity, some odontologists have
argued that “bitemark evidence should only be used to exclude a suspect. This
[argument] is supported by research which shows that the exclusion of non-biters
within a population of suspects is extremely accurate; far more so than the positive
identification of biters.”321
3. ABFO Guidelines
In an attempt to develop an objective method, in 1984 the American Board of
Forensic Odontology (ABFO) promulgated guidelines for bite mark analysis,
including a uniform scoring system.322 According to the drafting committee,
“[t]he scoring system . . . has demonstrated a method of evaluation that produced
a high degree of reliability among observers.”323 Moreover, the committee characterized “[t]he scoring guide . . . [as] the beginning of a truly scientific approach
to bite mark analysis.”324 In a subsequent letter, however, the drafting committee
wrote:
While the Board’s published guidelines suggest use of the scoring system, the
authors’ present recommendation is that all odontologists await the results of
further research before relying on precise point counts in evidentiary proceedings. . . . [T]he authors believe that further research is needed regarding the
quantification of bite mark evidence before precise point counts can be relied
upon in court proceedings.325
319. E.g., State v. Garrison, 585 P.2d 563, 566 (Ariz. 1978) (10 points); People v. Slone, 143
Cal. Rptr. 61, 67 (Cal. Ct. App. 1978) (10 points); People v. Milone, 356 N.E.2d 1350, 1356 (Ill.
App. Ct. 1976) (29 points); State v. Sager, 600 S.W.2d 541, 564 (Mo. Ct. App. 1980) (52 points);
State v. Green, 290 S.E.2d 625, 630 (N.C. 1982) (14 points); State v. Temple, 273 S.E.2d 273, 279
(N.C. 1981) (8 points); Kennedy v. State, 640 P.2d 971, 976 (Okla. Crim. App. 1982) (40 points);
State v. Jones, 259 S.E.2d 120, 125 (S.C. 1979) (37 points).
320. Raymond D. Rawson et al., Analysis of Photographic Distortion in Bite Marks: A Report of
the Bite Mark Guidelines Committee, 31 J. Forensic Sci. 1261, 1261–62 (1986). The committee noted:
“[P]hotographic distortion can be very difficult to understand and interpret when viewing prints of
bite marks that have been photographed from unknown angles.” Id. at 1267.
321. Iain A. Pretty, A Web-Based Survey of Odontologist’s Opinions Concerning Bitemark Analyses,
48 J. Forensic Sci. 1117, 1120 (2003) [hereinafter Web-Based Survey].
322. ABFO, Guidelines for Bite Mark Analysis, 112 J. Am. Dental Ass’n 383 (1986).
323. Raymond D. Rawson et al., Reliability of the Scoring System of the American Board of Forensic
Odontology for Human Bite Marks, 31 J. Forensic Sci. 1235, 1259 (1986).
324. Id.
325. Letter, Discussion of “Reliability of the Scoring System of the American Board of Forensic Odontology
for Human Bite Marks,” 33 J. Forensic Sci. 20 (1988).
B. The Empirical Record
The 2009 NRC report concluded:
More research is needed to confirm the fundamental basis for the science of bite
mark comparison. Although forensic odontologists understand the anatomy of
teeth and the mechanics of biting and can retrieve sufficient information from
bite marks on skin to assist in criminal investigations and provide testimony at
criminal trials, the scientific basis is insufficient to conclude that bite mark comparisons can result in a conclusive match.326
Moreover, “[t]here is no science on the reproducibility of the different methods
of analysis that lead to conclusions about the probability of a match.”327 Another
passage provides: “Despite the inherent weaknesses involved in bite mark comparison, it is reasonable to assume that the process can sometimes reliably exclude
suspects.”328
Although bitemark identifications are accepted by forensic dentists, only a
few empirical studies have been conducted329 and only a small number of forensic
dentists have addressed the empirical issue. In the words of one expert,
The research suggests that bitemark evidence, at least that which is used to identify biters, is a potentially valid and reliable methodology. It is generally accepted
within the scientific [dental] community, although the basis of this acceptance
within the peer-reviewed literature is thin. Only three studies have examined
the ability of odontologists to utilise bitemarks for the identification of biters,
and only two studies have been performed in what could be considered a contemporary framework of attitudes and techniques.330
326. NRC Forensic Science Report, supra note 3, at 175. See also id. at 176 (“Although the
majority of forensic odontologists are satisfied that bite marks can demonstrate sufficient detail for
positive identification, no scientific studies support this assessment, and no large population studies
have been conducted.”).
327. Id. at 174.
328. Id. at 176.
329. See C. Michael Bowers, Forensic Dental Evidence: An Investigator’s Handbook 189 (2004)
(“As a number of legal commentators have observed, bite mark analysis has never passed through
the rigorous scientific examination that is common to most sciences. The literature does not go far
in disputing that claim.”); Iain A. Pretty, Unresolved Issues in Bitemark Analysis, in Bitemark Evidence
547, 547 (Robert B.J. Dorion ed., 2005) (“As a general rule, case reports add little to the scientific
knowledge base, and therefore, if these, along with noncritical reviews, are discarded, very little new
empirical evidence has been developed in the past five years.”); id. at 561 (“[T]he final question in
the recent survey asked, ‘Should an appropriately trained individual positively identify a suspect from
a bitemark on skin’—70% of the respondents stated yes. However, it is the judicial system that must
assess validity, reliability, and a sound scientific base for expert forensic testimony. A great deal of
further research is required if odontology hopes to continue to be a generally accepted science.”).
330. Iain A. Pretty, Reliability of Bitemark Evidence, in Bitemark Evidence at 543 (Robert B.J.
Dorion ed., 2005).
Commentators have highlighted the following areas of controversy: “a) accuracy
of the bitemark itself, b) uniqueness of the human dentition, and c) analytical
techniques.”331
One part of a 1975 study involved identification of bites made on pigskin:
“Incorrect identification of the bites made on pigskin ranged from 24% incorrect
identifications under ideal laboratory conditions to as high as 91% incorrect identifications when the bites were photographed 24 hours after the bites [were] made.”332 A
1999 ABFO Workshop, “where ABFO diplomates attempted to match four bitemarks to seven dental models, resulted in 63.5% false positives.”333 A 2001 study
of bites on pigskin “found false positive identifications of 11.9–22.0% for various
groups of forensic odontologists (15.9% false positives for ABFO diplomates), with
some ABFO diplomates faring far worse.”334 Other commentators take a more
favorable view of these studies.335
1. DNA exonerations
In several cases, subsequent DNA testing has demonstrated the error in a prior bite
mark identification. In State v. Krone,336 two experienced experts concluded that
the defendant had made the bite mark found on a murder victim. The defendant,
however, was later exonerated through DNA testing.337 In Otero v. Warnick,338 a
forensic dentist testified that the “plaintiff was the only person in the world who
331. Pretty & Sweet, supra note 313, at 87. Commentators had questioned the lack of research
in the field as long ago as 1985. Two commentators wrote:
There is effectively no valid documented scientific data to support the hypothesis that bite marks are
demonstrably unique. Additionally, there is no documented scientific data to support the hypothesis
that a latent bite mark, like a latent fingerprint, is a true and accurate reflection of this uniqueness. To
the contrary, what little scientific evidence that does exist clearly supports the conclusion that crimerelated bite marks are grossly distorted, inaccurate, and therefore unreliable as a method of identification.
Allen P. Wilkinson & Ronald M. Gerughty, Bite Mark Evidence: Its Admissibility Is Hard to Swallow, 12
W. St. U. L. Rev. 519, 560 (1985).
332. C. Michael Bowers, Problem-Based Analysis of Bitemark Misidentifications: The Role of DNA,
159S Forensic Sci. Int’l S104, S106 (2006) (citing D.K. Whittaker, Some Laboratory Studies on the
Accuracy of Bite Mark Comparison, 25 Int’l Dent. J. 166 (1975)) [hereinafter Problem-Based Analysis].
333. Bowers, Problem-Based Analysis, supra note 332, at S106. But see Kristopher L. Arheart &
Iain A. Pretty, Results of the 4th ABFO Bitemark Workshop 1999, 124 Forensic Sci. Int’l 104 (2001).
334. Bowers, Problem-Based Analysis, supra note 332, at S106 (citing Iain A. Pretty & David J.
Sweet, Digital Bitemark Overlays—An Analysis of Effectiveness, 46 J. Forensic Sci. 1385, 1390 (2001)
(“While the overall effectiveness of overlays has been established, the variation in individual performance of odontologists is of concern.”)).
335. See Pretty, Reliability of Bitemark Evidence, in Bitemark Evidence, supra note 330, at 538–42.
336. 897 P.2d 621, 622, 623 (Ariz. 1995) (“The bite marks were crucial to the State’s case
because there was very little other evidence to suggest Krone’s guilt.”; “Another State dental expert,
Dr. John Piakis, also said that Krone made the bite marks. . . . Dr. Rawson himself said that Krone
made the bite marks. . . .”).
337. See Mark Hansen, The Uncertain Science of Evidence, A.B.A. J. 49 (2005) (discussing Krone).
338. 614 N.W.2d 177 (Mich. Ct. App. 2000).
could have inflicted the bite marks on [the murder victim’s] body. On January
30, 1995, the Detroit Police Crime Laboratory released a supplemental report that
concluded that plaintiff was excluded as a possible source of DNA obtained from
vaginal and rectal swabs taken from [the victim’s] body.”339 In Burke v. Town of
Walpole,340 the expert concluded that “Burke’s teeth matched the bite mark on
the victim’s left breast to a ‘reasonable degree of scientific certainty.’ That same
morning . . . DNA analysis showed that Burke was excluded as the source of
male DNA found in the bite mark on the victim’s left breast.”341 In the future,
the availability of nuclear DNA testing may reduce the need to rely on bite mark
identifications.342
C. Case Law Development
People v. Marx (1975)343 emerged as the leading bite mark case. After Marx,
bite mark evidence became widely accepted.344 By 1992, it had been introduced
or noted in 193 reported cases and accepted as admissible in 35 states.345 Some
courts described bite mark comparison as a “science,”346 and several cases took
judicial notice of its validity.347
339. Id. at 178.
340. 405 F.3d 66, 73 (1st Cir. 2005).
341. See also Bowers, Problem-Based Analysis, supra note 332, at S104 (citing several cases involving bitemarks and DNA exonerations: Gates, Bourne, Morris, Krone, Otero, Young, and Brewer); Mark
Hansen, Out of the Blue, A.B.A. J. 50, 51 (1996) (DNA analysis of skin taken from fingernail scrapings
of the victim conclusively excluded Bourne).
342. See Pretty, Web-Based Survey, supra note 321, at 1119 (“The use of DNA in the assessment
of bitemarks has been established for some time, although previous studies have suggested that the
uptake of this technique has been slow. It is encouraging to note that nearly half of the respondents
in this case have employed biological evidence in a bitemark case.”).
343. 126 Cal. Rptr. 350 (Cal. Ct. App. 1975). The court in Marx avoided applying the Frye
test, which requires acceptance of a novel technique by the scientific community as a prerequisite to
admissibility. According to the court, the Frye test “finds its rational basis in the degree to which the
trier of fact must accept, on faith, scientific hypotheses not capable of proof or disproof in court and
not even generally accepted outside the courtroom.” Id. at 355–56.
344. Two Australian cases, however, excluded bite mark evidence. See Lewis v. The Queen
(1987) 29 A. Crim. R. 267 (odontological evidence was improperly relied on, in that this method
has not been scientifically accepted); R v. Carroll (1985) 19 A. Crim. R. 410 (“[T]he evidence given
by the three odontologists is such that it would be unsafe or dangerous to allow a verdict based upon
it to stand.”).
345. Steven Weigler, Bite Mark Evidence: Forensic Odontology and the Law, 2 Health Matrix:
J.L.-Med. 303 (1992).
346. See People v. Marsh, 441 N.W.2d 33, 35 (Mich. Ct. App. 1989) (“the science of bite mark
analysis has been extensively reviewed in other jurisdictions”); State v. Sager, 600 S.W.2d 541, 569
(Mo. Ct. App. 1980) (“an exact science”).
347. See State v. Richards, 804 P.2d 109, 112 (Ariz. Ct. App. 1990) (“[B]ite mark evidence is
admissible without a preliminary determination of reliability. . . .”); People v. Middleton, 429 N.E.2d
100, 101 (N.Y. 1981) (“The reliability of bite mark evidence as a means of identification is sufficiently
1. Specificity of opinion
In some cases, experts testified only that a bite mark was “consistent with” the
defendant’s teeth.348 In other cases, they went further and opined that it is “highly
probable” or “very highly probable” that the defendant made the mark.349 In still
other cases, experts made positive identifications (to the exclusion of all other
persons).350 It is not unusual to find experts disagreeing in individual cases—often
over the threshold question of whether a wound was even a bite mark.351
established in the scientific community to make such evidence admissible in a criminal case, without
separately establishing scientific reliability in each case. . . .”); State v. Armstrong, 369 S.E.2d 870, 877
(W. Va. 1988) (judicially noticing the reliability of bite mark evidence).
348. E.g., Rogers v. State, 344 S.E.2d 644, 647 (Ga. 1986) (“Bite marks on one of Rogers’ arms
were consistent with the dentures worn by the elderly victim.”); People v. Williams, 470 N.E.2d 1140,
1150 (Ill. App. Ct. 1984) (“could have”); State v. Hodgson, 512 N.W.2d 95, 98 (Minn. 1994) (en
banc) (Board-certified forensic odontologist testified that “there were several similarities between the
bite mark and the pattern of [the victim’s] teeth, as revealed by known molds of his mouth.”); State
v. Routh, 568 P.2d 704, 705 (Or. Ct. App. 1977) (“similarity”); Williams v. State, 838 S.W.2d 952,
954 (Tex. Ct. App. 1992) (“One expert, a forensic odontologist, testified that Williams’s dentition
was consistent with the injury (bite mark) on the deceased.”); State v. Warness, 893 P.2d 665, 669
(Wash. Ct. App. 1995) (“[T]he expert testified that his opinion was not conclusive, but the evidence
was consistent with the alleged victim’s assertion that she had bitten Warness. . . . Its probative value
was therefore limited, but its relevance was not extinguished.”).
349. E.g., People v. Slone, 143 Cal. Rptr. 61, 67 (Cal. Ct. App. 1978); People v. Johnson, 289
N.E.2d 722, 726 (Ill. App. Ct. 1972).
350. E.g., Morgan v. State, 639 So. 2d 6, 9 (Fla. 1994) (“[T]he testimony of a dental expert at
trial positively matched the bite marks on the victim with Morgan’s teeth.”); Duboise v. State, 520 So.
2d 260, 262 (Fla. 1988) (Expert “testified at trial that within a reasonable degree of dental certainty
Duboise had bitten the victim.”); Brewer v. State, 725 So. 2d 106, 116 (Miss. 1998) (“Dr. West opined
that Brewer’s teeth inflicted the five bite mark patterns found on the body of Christine Jackson.”);
State v. Schaefer, 855 S.W.2d 504, 506 (Mo. Ct. App. 1993) (“[A] forensic dentist testified that the bite
marks on Schaefer’s shoulder matched victim’s dental impression, and concluded that victim caused
the marks.”); State v. Lyons, 924 P.2d 802, 804 (Or. 1996) (forensic odontologist “had no doubt
that the wax models were made from the same person whose teeth marks appeared on the victim’s
body”); State v. Cazes, 875 S.W.2d 253, 258 (Tenn. 1994) (A forensic odontologist “concluded to a
reasonable degree of dental certainty that Cazes’ teeth had made the bite marks on the victim’s body
at or about the time of her death.”).
351. E.g., Ege v. Yukins, 380 F. Supp. 2d 852, 878 (E.D. Mich. 2005) (“[T]he defense attempted
to rebut Dr. Warnick’s testimony with the testimony of other experts who opined that the mark on the
victim’s cheek was the result of livor mortis and was not a bite mark at all.”); Czapleski v. Woodward,
No. C-90-0847 MHP, 1991 U.S. Dist. LEXIS 12567, at *3–4 (N.D. Cal. Aug. 30, 1991) (dentist’s
initial report concluded that “bite” marks found on child were consistent with dental impressions of
mother; several experts later established that the marks on child’s body were postmortem abrasion
marks and not bite marks); Kinney v. State, 868 S.W.2d 463, 464–65 (Ark. 1994) (disagreement over
whether the marks were human bite marks); People v. Noguera, 842 P.2d 1160, 1165 n.1 (Cal. 1992) (“At trial,
extensive testimony by forensic ondontologists [sic] was presented by both sides, pro and con, as to
whether the wounds were human bite marks and, if so, when they were inflicted.”); State v. Duncan,
802 So. 2d 533, 553 (La. 2001) (“Both defense experts testified that these marks on the victim’s body
were not bite marks.”); Stubbs v. State, 845 So. 2d 656, 668 (Miss. 2003) (“Dr. Galvez denied the
impressions found on Williams were the results of bite marks.”).
111
Copyright © National Academy of Sciences. All rights reserved.
Reference Manual on Scientific Evidence: Third Edition
Reference Manual on Scientific Evidence
2. Post-Daubert cases
Although some commentators questioned the underlying basis for the technique
after Daubert,352 courts have continued to admit the evidence.353
X. Microscopic Hair Evidence
The first reported use of forensic hair analysis occurred more than 150 years
ago in 1861 in Germany.354 The first published American opinion was an 1882
Wisconsin decision, Knoll v. State.355 Based on a microscopic comparison, the
expert testified that the hair samples shared a common source. Hair and the
closely related fiber analysis played a prominent role in two of the most famous
twentieth-century American prosecutions: Ted Bundy in Florida and Wayne
Williams, the alleged Atlanta child killer.356 Although hair comparison evidence
has been judicially accepted for decades, it is another forensic identification discipline that is being reappraised today.
A. The Technique
Generally, after assessing whether a sample is a hair and not a fiber, an analyst may
be able to determine: (1) whether the hair is of human or animal origin, (2) the
part of the body that the hair came from, (3) whether the hair has been dyed,
(4) whether the hair was pulled or fell out as a result of natural causes or disease,357
and (5) whether the hair was cut or crushed.358
352. See Pretty & Sweet, supra note 313, at 86 (“Despite the continued acceptance of bitemark
evidence in European, Oceanic and North American Courts the fundamental scientific basis for bitemark analysis has never been established.”).
353. See State v. Timmendequas, 737 A.2d 55, 114 (N.J. 1999) (“Judicial opinion from other
jurisdictions establish that bite-mark analysis has gained general acceptance and therefore is reliable.
Over thirty states considering such evidence have found it admissible and no state has rejected bitemark evidence as unreliable.”) (citations omitted); Stubbs, 845 So. 2d at 670; Howard v. State, 853
So. 2d 781, 795–96 (Miss. 2003); Seivewright v. State, 7 P.3d 24, 30 (Wyo. 2000) (“Given the wide
acceptance of bite mark identification testimony and Seivewright’s failure to present evidence challenging the methodology, we find no abuse of discretion in the district court’s refusal to hold an evidentiary
hearing to analyze Dr. Huber’s testimony.”).
354. E. James Crocker, Trace Evidence, in Forensic Evidence in Canada 259, 265 (1991) (the
analyst was Rudolf Virchow, a Berliner).
355. 12 N.W. 369 (Wis. 1882).
356. Edward J. Imwinkelried, Forensic Hair Analysis: The Case Against the Underemployment of
Scientific Evidence, 39 Wash. & Lee L. Rev. 41, 43 (1982).
357. See Delaware v. Fensterer, 474 U.S. 15, 16–17 (1985) (FBI analyst testified that hair found at a
murder scene had been forcibly removed).
358. See 2 Giannelli & Imwinkelried, supra note 177, § 24-2.
Reference Guide on Forensic Identification Expertise
The most common subject for hair testimony involves an attempt to individuate the hair sample, at least to some degree. If the unknown is head hair,
the expert might gather approximately 50 hair strands from five different areas
of the scalp (the top, front, back, and both sides) from the known source.359
Before the microscopic analysis, the expert examines the hair macroscopically to
identify obvious features visible to the naked eye such as the color of the hair
and its form, that is, whether it is straight, wavy, or curved.360 The expert next
mounts the unknown hair and the known samples on microscope slides for a more
detailed examination of characteristics such as scale patterns, size, color, pigment
distribution, maximum diameter, shaft length, and scale count. Some of these
comparative judgments are subjective in nature: “Human hair characteristics (e.g.,
scale patterns, pigmentation, size) vary within a single individual. . . . Although
the examination procedure involves objective methods of analysis, the subjective
weights associated with the characteristics rest with the examiner.”361
Often the examiner determines only whether the hair samples from the crime
scene and the accused are “microscopically indistinguishable.” Although this finding is consistent with the hypothesis that the samples had the same source, its probative value would, of course, depend on whether only a hundred people or several million people have microscopically indistinguishable hair. As discussed below, experts
have often gone beyond this “consistent with” testimony.
B. The Empirical Record
The 2009 NRC report contained an assessment of hair analysis. The report began
the assessment by observing that there are neither “scientifically accepted [population] frequency” statistics for various hair characteristics nor “uniform standards on
the number of features on which hairs must agree before an examiner may declare
a ‘match.’”362 The report concluded,
[T]estimony linking microscopic hair analysis with particular defendants is highly
unreliable. In cases where there seems to be a morphological match (based on
microscopic examination), it must be confirmed using mtDNA analysis; microscopic studies are of limited probative value. The committee found no scientific
support for the use of hair comparisons for individualization in the absence of
nuclear DNA. Microscopy and mtDNA analysis can be used in tandem and add
to one another’s value for classifying a common source, but no studies have been
performed specifically to quantify the reliability of their joint use.363
359. NRC Forensic Science Report, supra note 3, at 157.
360. Id.
361. Miller, supra note 67, at 157–58.
362. NRC Forensic Science Report, supra note 3, at 160.
363. Id. at 8.
There is a general consensus that hair examination can yield reliable information about class characteristics of hair strands.364 Indeed, experts can identify major
as well as secondary characteristics. Major characteristics include such features as
color, shaft form, and hair diameter.365 Secondary characteristics are such features
as pigment size and shaft diameter.366 These characteristics can help narrow the
class of possible sources for the unknown hair sample.
There have been several major efforts to provide an empirical basis for individuation opinions in hair analysis. In the 1940s, Gamble and Kirk investigated
whether hair samples from different persons could be distinguished on the basis
of scale counts.367 However, they used a small database of only thirty-nine hair
samples, and a subsequent attempt to replicate the original experiment yielded
contradictory results.368
In the 1960s, neutron activation analysis was used in an effort to individuate
hair samples. The research focused on determining the occurrence of various trace
element concentrations in human hair.369 Again, subsequent research tended to
show that there are significant hair-to-hair variations in trace element concentration among the hairs of a single person.370
In the 1970s, two Canadian researchers, Gaudette and Keeping, attempted to
develop a “ballpark” estimate of the probability of a false match in hair analysis. They
published articles describing three studies: (1) a 1974 study involving scalp hair,371 (2) a
364. Id. at 157.
365. Id. at 5–23.
366. Id.
367. Their initial research indicated that: (1) the scale count of even a single hair strand is nearly
always representative of all scalp hairs; and (2) while the average or mean scale count is constant for
the individual, the count differs significantly from person to person. Lucy L. Gamble & Paul L. Kirk,
Human Hair Studies II. Scale Counts, 31 J. Crim. L. & Criminology 627, 629 (1941); Paul L. Kirk &
Lucy L. Gamble, Further Investigation of the Scale Count of Human Hair, 33 J. Crim. L. & Criminology
276, 280 (1942).
368. Joseph Beeman, The Scale Count of Human Hair, 32 J. Crim. L. & Criminology 572, 574
(1942).
369. Rita Cornelis, Is It Possible to Identify Individuals by Neutron Activation Analysis of Hair? 12
Med. Sci. & L. 188 (1972); Lima et al., Activation Analysis Applied to Forensic Investigation: Some Observations on the Problem of Human Hair Individualization, 1 Radio Chem. Methods of Analysis 119 (Int’l
Atomic Energy Agency 1965); A.K. Perkins, Individualization of Human Head Hair, in Proceedings of
the First Int’l Conf. on Forensic Activation Analysis 221 (V. Guin ed., 1967).
370. Rita Cornelis, Truth Has Many Facets: The Neutron Activation Analysis Story, 20 J. Forensic
Sci. 93, 95 (1980) (“I am convinced that irrefutable hair identification from its trace element composition still belongs to the realm of wishful thinking. . . . The state of the art can be said to be that nearly
all interest for trace elements present in hair, as a practical identification tool, has faded.”); Dennis S.
Karjala, Evidentiary Uses of Neutron Activation Analysis, 59 Cal. L. Rev. 977, 1039 (1971).
371. B.D. Gaudette & E.S. Keeping, An Attempt at Determining Probabilities in Human Scalp Hair
Comparison, 19 J. Forensic Sci. 599 (1974).
1976 study using pubic hair,372 and (3) a 1978 followup.373 In the two primary studies
(1974 and 1976), hair samples were analyzed to determine whether hairs from different
persons were microscopically indistinguishable. The analysts used 23 different characteristics such as color, pigment distribution, maximum diameter, shaft length, and scale
count.374 Based on those data, they estimated the probability of a false match in scalp
hair to be 1 in 4500 and the probability of a false match in pubic hair to be 1 in 800.
In the view of one commentator, Gaudette and Keeping’s probability estimates “are easily challenged.”375 One limitation was the relatively small database in
the study.376 Moreover, the studies involved samples from different individuals and
sought the probability that the samples from different persons would nonetheless
appear microscopically indistinguishable. In a criminal trial, the question is quite
different: Assuming the samples appear microscopically indistinguishable, what is
the probability that they came from the same person?377
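The gap between these two conditional probabilities can be made concrete with a simple Bayesian calculation. The sketch below, offered purely for illustration, uses Gaudette and Keeping's 1-in-4500 false-match figure, but the size of the pool of plausible donors is a hypothetical assumption, not a number drawn from the studies:

```python
# Illustrative Bayes calculation: a low false-match probability does not by
# itself give the probability that matching samples share a common source.
# The 1-in-4500 figure is Gaudette and Keeping's estimate; the donor-pool
# size is a hypothetical assumption for illustration only.

def prob_same_source(p_false_match, n_possible_donors):
    """P(same source | match), assuming one true donor among
    n_possible_donors equally likely candidates and that truly
    common-source samples always appear indistinguishable."""
    prior = 1 / n_possible_donors
    # P(match) = P(match | same source)*prior + P(match | different)*(1 - prior)
    p_match = 1.0 * prior + p_false_match * (1 - prior)
    return prior / p_match

# With a 1-in-4500 false-match rate but 100,000 plausible donors, the
# probability of a common source given a match is far from certainty.
print(round(prob_same_source(1 / 4500, 100_000), 3))  # prints 0.043
```

On these (hypothetical) assumptions, a match raises the probability of a common source from 1 in 100,000 to only about 4 percent, which is why transposing the two conditionals can be so misleading to a factfinder.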
Early in the twenty-first century, the Verma research team revisited the individualization issue and attempted to develop an objective, automated method for
identifying matches.378 The authors claimed that their “system accurately judged
whether two populations of hairs came from the same person or from different
persons 83% of the time.” However, a close inspection of the authors’ tabular
data indicates that (1) relying on this method, researchers characterized “9 of
73 different pairs as ‘same’ for a false positive rate of 9/73 = 12%”; and (2) the
372. B.D. Gaudette, Probabilities and Human Pubic Hair Comparisons, 21 J. Forensic Sci. 514,
514 (1976).
373. B.D. Gaudette, Some Further Thoughts on Probabilities in Human Hair Comparisons, 23 J.
Forensic Sci. 758 (1978); see also Ray A. Wickenhaiser & David G. Hepworth, Further Evaluation of
Probabilities in Human Scalp Hair Comparisons, 35 J. Forensic Sci. 1323 (1990).
374. They prescribed that, with respect to each characteristic, the analysts had to classify the hair
into one of a number of specified subcategories. For example, the length characteristic was subdivided into
five groups, depending on the strand’s length in inches. They computed the total number of
comparisons made by the analysts and recorded the number of instances in which the analysts reported
finding samples indistinguishable under the specified criteria.
375. D. Kaye, Science in Evidence 28 (1997); see also NRC Forensic Science Report, supra note
3, at 158 ([T]he “assignment of probabilities [by Gaudette and Keeping] has since been shown to be
unreliable.”); P.D. Barnett & R.R. Ogle, Probabilities and Human Hair Comparisons, 27 J. Forensic Sci.
272, 273–74 (1982); Dalva Moellenberg, Splitting Hairs in Criminal Trials: Admissibility of Hair Comparison Probability Estimates, 1984 Ariz. St. L.J. 521. See generally Nicholas Petraco et al., The Morphology
and Evidential Significance of Human Hair Roots, 33 J. Forensic Sci. 68, 68 (1988) (“Although many
instrumental techniques to the individualization of human hair have been tried in recent years, these
have not proved to be useful or reliable.”).
376. For example, the pubic hair study involved a total of 60 individuals. In addition, the experiments involved primarily Caucasians. While the scalp hair study included 92 Caucasians, there were
only 6 Asians and 2 African Americans in the study.
377. A. Tawshunsky, Admissibility of Mathematical Evidence in Criminal Trials, 21 Am. Crim. L.
Rev. 55, 57–66 (1983).
378. M.S. Verma et al., Hair-MAP: A Prototype Automated System for Forensic Hair Comparison and
Analysis, 129 Forensic Sci. Int’l 168 (2002).
researchers characterized “4 sets of hairs from the same person as ‘different’ for a
false negative rate of 4/9 = 44%.”379
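The error rates quoted from the report's inspection of the tabular data can be recomputed directly from the counts given above; the short Python script below simply restates that arithmetic:

```python
# Recomputing the error rates discussed in the NRC report's treatment of
# the Verma study, using the counts quoted in the text above.
different_pairs = 73    # hair pairs known to come from different people
false_positives = 9     # of those, pairs the system judged "same"
same_sets = 9           # hair sets known to come from the same person
false_negatives = 4     # of those, sets the system judged "different"

fp_rate = false_positives / different_pairs   # 9/73, about 12%
fn_rate = false_negatives / same_sets         # 4/9, about 44%
print(f"false positive rate: {fp_rate:.0%}")  # prints "false positive rate: 12%"
print(f"false negative rate: {fn_rate:.0%}")  # prints "false negative rate: 44%"
```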
The above studies do not provide the only data relevant to the validity of hair
analysis. There are also comparative studies of microscopic analysis and mtDNA,
proficiency tests, and DNA exoneration cases involving microscopic analysis.
1. Mitochondrial DNA380
An FBI study compared microscopic (“consistent with” testimony) and mtDNA
analysis of hair: “Of the 80 hairs that were microscopically associated, nine comparisons were excluded by mtDNA analysis.”381
2. Proficiency testing
Early proficiency tests indicated a high rate of laboratory error in microscopic
comparisons of hair samples. In the 1970s the LEAA conducted its Laboratory
Proficiency Testing Program.382 The crime laboratories’ performance on hair
analysis was the weakest. Fifty-four percent misanalyzed hair sample C and 67%
submitted unacceptable responses on hair sample D.383 Followup studies between
1980 and 1991 yielded similar results.384 Summarizing the results of this series of
tests, two commentators concluded: “Animal and human (body area) hair identifications are clearly the most troublesome of all categories tested.”385
In another series of hair tests, the examiners were asked to “include” or
“exclude” in comparing known and unknown samples: “Laboratories reported
inclusions and exclusions which agreed with the manufacturer in approximately
74% of their comparisons. About 18% of the responses were inconclusive, and 8%
in disagreement with the manufacturers’ information.”386
379. NRC Forensic Science Report, supra note 3, at 159.
380. For a detailed discussion of mitochondrial DNA, see David H. Kaye & George Sensabaugh,
Reference Guide on DNA Identification Evidence, Section V.A, in this manual.
381. Max M. Houck & Bruce Budowle, Correlation of Microscopic and Mitochondrial DNA Hair
Comparisons, 47 J. Forensic Sci. 964, 966 (2002).
382. Laboratory Proficiency Test, supra note 81.
383. Id. at 251. By way of comparison, 20% of the laboratories failed a paint analysis (test #5);
30% failed glass analysis (test #9).
384. Peterson & Markham, supra note 82, at 1007 (“In sum, laboratories were no more successful in identifying the correct species of origin of animal hair . . . than they were in the earlier
LEAA study.”).
385. Id.
386. Id. at 1023; see also id. at 1022 (“Examiners warned that they needed to employ particular
caution in interpreting the hair results given the virtual impossibility of achieving complete sample
homogeneity.”).
3. DNA exonerations
The publication of the Department of Justice study of the first 28 DNA exonerations spotlighted the significant role that hair analysis played in several of these
miscarriages of justice.387 For example, in the trial of Edward Honaker, an expert
testified that the crime scene hair sample “was unlikely to match anyone” else388—
a clear overstatement. Moreover, an exoneration in Canada triggered a judicial
inquiry, which recommended that “[t]rial judges should undertake a more critical
analysis of the admissibility of hair comparison evidence as circumstantial evidence
of guilt.”389 One study of 200 DNA exoneration cases reported that hair testimony
had been presented at 43 of the original trials.390 A subsequent examination of 137
trial transcripts in exoneration cases concluded: “Sixty-five of the trials examined
involved microscopic hair comparison analysis. Of those, 25—or 38%—had invalid
hair comparison testimony. Most (18) of these cases involved invalid individualizing
claims.”391 The other cases contained flawed probability testimony.
C. Case Law Development
Prior to Daubert, an overwhelming majority of courts accepted expert testimony
that hair samples are microscopically indistinguishable.392 Experts often conceded
that microscopic analysis did not permit a positive identification of the source.393
387. Edward Connors et al., Convicted by Juries, Exonerated by Science: Case Studies in the
Use of DNA Evidence to Establish Innocence After Trial (1996). See id. at 73 (discussing David
Vasquez case); id. at 64–65 (discussing Steven Linscott case).
388. Barry Scheck et al., Actual Innocence: Five Days to Execution and Other Dispatches from
the Wrongly Convicted 146 (2000).
389. Hon. Fred Kaufman, The Commission on Proceedings Involving Guy Paul Morin (Ontario
Ministry of the Attorney General 1998) (Recommendation 2). Morin was erroneously convicted
based, in part, on hair evidence.
390. Brandon L. Garrett, Judging Innocence, 108 Colum. L. Rev. 55, 81 (2008).
391. Garrett & Neufeld, supra note 33, at 47.
392. See, e.g., United States v. Hickey, 596 F.2d 1082, 1089 (1st Cir. 1979); United States v.
Brady, 595 F.2d 359, 362–63 (6th Cir. 1979); United States v. Cyphers, 553 F.2d 1064, 1071–73 (7th
Cir. 1977); Jent v. State, 408 So. 2d 1024, 1028–29 (Fla. 1981); Commonwealth v. Tarver, 345 N.E.2d
671, 676–77 (Mass. 1975); State v. White, 621 S.W.2d 287, 292–93 (Mo. 1981); State v. Smith, 637
S.W.2d 232, 236 (Mo. Ct. App. 1982); People v. Allweiss, 396 N.E.2d 735 (N.Y. 1979); State v.
Green, 290 S.E.2d 625, 629–30 (N.C. 1982); State v. Watley, 788 P.2d 375, 381 (N.M. 1989).
393. Moore v. Gibson, 195 F.3d 1152, 1167 (10th Cir. 1999); Butler v. State, 108 S.W.3d 18,
21 (Mo. Ct. App. 2003); see also Thompson v. State, 539 A.2d 1052, 1057 (Del. 1988) (“it is now
universally recognized that although fingerprint comparisons can result in the positive identification
of an individual, hair comparisons are not this precise”). But see People v. Kosters, 467 N.W.2d 311,
313 (Mich. 1991) (Cavanaugh, C.J., dissenting) (the “minuscule probative value” of such opinions is
“clearly . . . outweighed by the unfair prejudicial effect”); State v. Wheeler, 1981 WL 139588, at *4
(Wis. Ct. App. Feb. 8, 1981) (in an unpublished opinion, the appellate court held that the trial judge
did not err in finding that the expert’s opinion that the accused “could have been the source” of the
hair lacked probative value, because it “only include[d] defendant in a broad class of possible assailants”).
Nonetheless, the courts varied in how far they permitted the expert to go. In some
cases, analysts testified only that the samples matched394 or were similar395 and
thus consistent with the hypothesis that the samples had the same source.396 Other
courts permitted experts to directly opine that the accused was the source of the
crime scene sample.397 However, a 1990 decision held it error to admit testimony
that “it would be improbable that these hairs would have originated from another
individual.”398 In the court’s view, this testimony amounted “effectively, [to] a
positive identification of defendant. . . .”399
On the basis of the Gaudette and Keeping research, several courts admitted
opinions in statistical terms (e.g., 1 in 4500 chance of a false match).400 In
contrast, other courts, including a federal court of appeals, reached a contrary
conclusion.401
The most significant post-Daubert challenge to microscopic hair analysis came
in Williamson v. Reynolds,402 a habeas case decided in 1995. There, an expert testified that, after considering approximately 25 characteristics, he concluded that
the hair samples were “consistent microscopically.” He then elaborated: “In other
words, hairs are not an absolute identification, but they either came from this individual or there is—could be another individual somewhere in the world that would
have the same characteristics to their hair.”403 The district court was “unsuccessful in its attempts to locate any indication that expert hair comparison testimony
394. Garland v. Maggio, 717 F.2d 199, 207 n.9 (5th Cir. 1983).
395. United States v. Brady, 595 F.2d 359, 362–63 (6th Cir. 1979).
396. People v. Allen, 115 Cal. Rptr. 839, 842 (Cal. Ct. App. 1974).
397. In the 1986 Mississippi prosecution of Randy Bevill for murder, the expert testified that
“there was a transfer of hair from the Defendant to the body of” the victim. Clive A. Stafford Smith &
Patrick D. Goodman, Forensic Hair Comparison Analysis: Nineteenth Century Science or Twentieth Century
Snake Oil? 27 Colum. Hum. Rts. L. Rev. 227, 273 (1996).
398. State v. Faircloth, 394 S.E.2d 198, 202–03 (N.C. Ct. App. 1990).
399. Id. at 202.
400. United States v. Jefferson, 17 M.J. 728, 731 (N.M.C.M.R. 1983); People v. DiGiacomo,
388 N.E.2d 1281, 1283 (Ill. App. Ct. 1979); see also United States ex rel. DiGiacomo v. Franzen, 680
F.2d 515, 516 (7th Cir. 1982) (During its deliberations, the jury submitted the following question to
the judge: “Has it been established by sampling of hair specimens that the defendant was positively
proven to have been in the automobile?”).
401. United States v. Massey, 594 F.2d 676, 679–80 (8th Cir. 1979) (the expert testified that he
“had microscopically examined 2,000 cases and in only one or two cases was he ever unable to make
identification”; the expert cited a study for the proposition that there was a 1 in 4500 chance of a
random match; the expert added that “there was only ‘one chance in a 1,000’ that hair comparisons
could be in error”); State v. Carlson, 267 N.W.2d 170, 176 (Minn. 1978).
402. 904 F. Supp. 1529, 1554 (E.D. Okla. 1995), rev’d on this issue sub nom. Williamson v. Ward,
110 F.3d 1508, 1523 (10th Cir. 1997). The district court noted that the “expert did not explain which
of the ‘approximately’ 25 characteristics were consistent, any standards for determining whether the
samples were consistent, how many persons could be expected to share this same combination of
characteristics, or how he arrived at his conclusions.” Id. at 1554.
403. Id. (emphasis added).
meets any of the requirements of Daubert.”404 Finally, the prosecutor in closing
argument declared, “There’s a match.”405 Even the state court had misinterpreted
the evidence, writing that the “hair evidence placed [petitioner] at the decedent’s
apartment.”406 Although the Tenth Circuit did not fault the district judge’s reading of the empirical record relating to hair analysis and ultimately upheld habeas
relief, that court reversed the district judge on this issue. The Tenth Circuit ruled
that the district court had committed legal error because the due process (fundamental
fairness), not the more stringent Daubert (reliability), standard controls evidentiary
issues in habeas corpus proceedings.407 Before retrial, the defendant was exonerated by exculpatory DNA evidence.408
Post-Daubert, many cases have continued to admit testimony about microscopic hair analysis.409 In 1999, one state court judicially noticed the reliability
of hair evidence,410 implicitly finding this evidence to be not only admissible
but also based on a technique of indisputable validity.411 In contrast, a Missouri
court reasoned that, without the benefit of population frequency data, an expert
overreached in opining to “a reasonable degree of certainty that the unidentified
hairs were in fact from” the defendant.412 The NRC report commented that
there appears to be growing judicial support for the view that “testimony linking
microscopic hair analysis with particular defendants is highly unreliable.”413
404. Id. at 1558. The court also observed: “Although the hair expert may have followed procedures accepted in the community of hair experts, the human hair comparison results in this case were,
nonetheless, scientifically unreliable.” Id.
405. Id. at 1557.
406. Id. (quoting Williamson v. State, 812 P.2d 384, 387 (Okla. Crim. 1991)).
407. Williamson v. Ward, 110 F.3d 1508, 1523 (10th Cir. 1997).
408. Scheck et al., supra note 388, at 146 (hair evidence was shown to be “patently unreliable.”);
see also John Grisham, The Innocent Man (2006) (examining the Williamson case).
409. E.g., State v. Fukusaku, 946 P.2d 32, 44 (Haw. 1997) (“Because the scientific principles
and procedures underlying hair and fiber evidence are well-established and of proven reliability, the
evidence in the present case can be treated as ‘technical knowledge.’ Thus, an independent reliability
determination was unnecessary.”); McGrew v. State, 682 N.E.2d 1289, 1292 (Ind. 1997) (concluding
that hair comparison is “more a ‘matter of observation by persons with specialized knowledge’ than
‘a matter of scientific principles’”); see also NRC Forensic Science Report, supra note 3, at 161 n.88
(citing State v. West, 877 A.2d 787 (Conn. 2005), and Bookins v. State, 922 A.2d 389 (Del. Super.
Ct. 2007)).
410. See Johnson v. Commonwealth, 12 S.W.3d 258, 262 (Ky. 1999).
411. See Fed. R. Evid. 201(b); Daubert, 509 U.S. at 593 n.11 (“[T]heories that are so firmly
established as to have attained the status of scientific law, such as the laws of thermodynamics, properly
are subject to judicial notice under Federal Rule [of] Evidence 201.”).
412. Butler v. State, 108 S.W.3d 18, 21–22 (Mo. Ct. App. 2003).
413. NRC Forensic Science Report, supra note 3, at 161.
XI. Recurrent Problems
The discussions of specific techniques in this chapter, as well as the 2009 NRC
report, reveal several recurrent problems in the presentation of testimony about
forensic expertise.
A. Clarity of Testimony
As noted earlier, the report voiced concern about the use of terms such as “match,”
“consistent with,” “identical,” “similar in all respects tested,” and “cannot be
excluded as the source of.” These terms can have “a profound effect on how the trier
of fact in a criminal or civil matter perceives and evaluates scientific evidence.”414
The comparative bullet lead cases are illustrative of this point.415 The technique was used when conventional firearms identification was not possible because
the recovered bullet was so deformed that the striations were destroyed. In the
bullet lead cases, the phrasing of the experts’ opinions varied widely. In some,
experts testified only to the limited opinion that two exhibits were “analytically
indistinguishable.”416 In other cases, examiners concluded that samples could have
come from the same “source” or “batch.”417 In still others, they stated that the
samples came from the same source.418 In several cases, the experts went even
further and identified a particular “box” of ammunition (usually 50 loaded cartridges, sometimes 20) as the source of the bullet recovered at the crime scene.
For example, experts opined that two specimens:
• Could have come from the same box.419
• Could have come from the same box or a box manufactured on the same
day.420
414. Id. at 21.
415. The technique compared trace chemicals found in bullets at crime scenes with ammunition
found in the possession of a suspect. It was used when firearms (“ballistics”) identification could not be
employed. FBI experts used various analytical techniques (first, neutron activation analysis, and then
inductively coupled plasma-atomic emission spectrometry) to determine the concentrations of seven
elements—arsenic, antimony, tin, copper, bismuth, silver, and cadmium—in the bullet lead alloy of
both the crime-scene and suspect’s bullets. Statistical tests were then used to compare the elements in
each bullet and determine whether the fragments and suspect’s bullets were “analytically indistinguishable” for each of the elemental concentration means.
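As a rough illustration of the comparison logic described in this note, the sketch below implements a “2-SD overlap” criterion of the kind discussed in the 2004 NRC bullet lead report: two specimens are declared “analytically indistinguishable” only if, for each of the seven elements, the mean ± 2 standard deviation measurement intervals overlap. All concentration values are invented for illustration:

```python
# Sketch of a "2-SD overlap" comparison of elemental concentrations, one
# procedure discussed in the 2004 NRC bullet lead report. Every (mean, sd)
# value below is hypothetical, chosen only to illustrate the logic.

ELEMENTS = ["As", "Sb", "Sn", "Cu", "Bi", "Ag", "Cd"]

def interval(mean, sd, k=2):
    """Measurement interval of mean ± k standard deviations."""
    return (mean - k * sd, mean + k * sd)

def overlaps(a, b):
    """True if two (low, high) intervals overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def analytically_indistinguishable(specimen1, specimen2):
    """Declare a 'match' only if the 2-SD intervals overlap
    for every one of the seven measured elements."""
    return all(
        overlaps(interval(*specimen1[e]), interval(*specimen2[e]))
        for e in ELEMENTS
    )

# Hypothetical (mean, sd) concentrations, in ppm, for each element:
crime = {"As": (251, 6), "Sb": (702, 10), "Sn": (110, 4), "Cu": (58, 2),
         "Bi": (15, 1), "Ag": (31, 1), "Cd": (2.1, 0.2)}
suspect = {"As": (249, 5), "Sb": (695, 9), "Sn": (113, 4), "Cu": (60, 2),
           "Bi": (16, 1), "Ag": (30, 1), "Cd": (2.2, 0.2)}
print(analytically_indistinguishable(crime, suspect))  # prints True
```

As the text accompanying note 424 explains, even a true match under such a criterion says only that the specimens could have come from the same melt, which may comprise millions of bullets.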
416. See Wilkerson v. State, 776 A.2d 685, 689 (Md. Ct. Spec. App. 2001).
417. See State v. Krummacher, 523 P.2d 1009, 1012–13 (Or. 1974) (en banc).
418. See United States v. Davis, 103 F.3d 660, 673–74 (8th Cir. 1996); People v. Lane, 628
N.E.2d 682, 689–90 (Ill. App. Ct. 1993).
419. See State v. Jones, 425 N.E.2d 128, 131 (Ind. 1981); State v. Strain, 885 P.2d 810, 817
(Utah Ct. App. 1994).
420. See State v. Grube, 883 P.2d 1069, 1078 (Idaho 1994); People v. Johnson, 499 N.E.2d
1355, 1366 (Ill. 1986); Earhart v. State, 823 S.W.2d 607, 614 (Tex. Crim. App. 1991) (en banc) (“He
later modified that statement to acknowledge that analytically indistinguishable bullets which do not
come from the same box most likely would have been manufactured at the same place on or about
the same day; that is, in the same batch.”), vacated, 509 U.S. 917 (1993).
• Were consistent with their having come from the same box of ammunition.421
• Probably came from the same box.422
• Must have come from the same box or from another box that would have
been made by the same company on the same day.423
Moreover, these inconsistent statements were not supported by empirical research. According to a 2004 NRC report, the number of bullets that can
be produced from an “analytically indistinguishable” melt “can range from the
equivalent of as few as 12,000 to as many as 35 million 40 grain, .22 caliber long-rifle bullets.”424 Consequently, the report concluded, the “available
data do not support any statement that a crime bullet came from a particular box
of ammunition. [R]eferences to ‘boxes’ of ammunition in any form should be
excluded as misleading under Federal Rule of Evidence 403.”425
B. Limitations on Testimony
Some courts have limited the scope of the testimony, permitting expert testimony
about the similarities and dissimilarities between exemplars but not the specific
conclusion that the defendant was the author (“common authorship” opinion).426
Although the courts have used this approach most frequently in questioned doculater modified that statement to acknowledge that analytically indistinguishable bullets which do not
come from the same box most likely would have been manufactured at the same place on or about
the same day; that is, in the same batch.”), vacated, 509 U.S. 917 (1993).
421. See State v. Reynolds, 297 S.E.2d 532, 534 (N.C. 1982).
422. See Bryan v. State, 935 P.2d 338, 360 (Okla. Crim. App. 1997).
423. See Davis, 103 F.3d at 666–67 (“An expert testified that such a finding is rare and that the
bullets must have come from the same box or from another box that would have been made by the
same company on the same day.”); Commonwealth v. Daye, 587 N.E.2d 194, 207 (Mass. 1992); State
v. King, 546 S.E.2d 575, 584 (N.C. 2001) (Kathleen Lundy “opined that, based on her lead analysis,
the bullets she examined either came from the same box of cartridges or came from different boxes
of the same caliber, manufactured at the same time.”).
424. National Research Council, Forensic Analysis: Weighing Bullet Lead Evidence 6 (2004),
[hereinafter NRC Bullet Lead Evidence], available at http://www.nap.edu/catalog.php?record_id=10924.
425. Id.
426. See United States v. Oskowitz, 294 F. Supp. 2d 379, 384 (E.D.N.Y. 2003) (“Many other
district courts have similarly permitted a handwriting expert to analyze a writing sample for the jury
without permitting the expert to offer an opinion on the ultimate question of authorship.”); United
States v. Rutherford, 104 F. Supp. 2d 1190, 1194 (D. Neb. 2000) (“[T]he Court concludes that FDE
Rauscher’s testimony meets the requirements of Rule 702 to the extent that he limits his testimony
to identifying and explaining the similarities and dissimilarities between the known exemplars and
the questioned documents. FDE Rauscher is precluded from rendering any ultimate conclusions on
authorship of the questioned documents and is similarly precluded from testifying to the degree of
confidence or certainty on which his opinions are based.”); United States v. Hines, 55 F. Supp. 2d 62,
69 (D. Mass. 1999) (expert testimony concerning the general similarities and differences between a
defendant’s handwriting exemplar and a stick-up note was admissible but not the specific conclusion
that the defendant was the author).
121
Copyright © National Academy of Sciences. All rights reserved.
Reference Manual on Scientific Evidence: Third Edition
Reference Manual on Scientific Evidence
ment cases, they have sometimes applied the same approach to other types of
forensic expertise such as firearms examination as well.427
The NRC report criticized “exaggerated”428 testimony such as claims of perfect accuracy,429 infallibility,430 or a zero error rate.431 Several courts have barred
excessive expert claims for lack of empirical support. For example, in United States
v. Mitchell,432 the court commented: “Testimony at the Daubert hearing indicated
that some latent fingerprint examiners insist that there is no error rate associated
with their activities. . . . This would be out-of-place under Rule 702.”433 Similarly, in a firearms identification case, one court noted that
during the testimony at the hearing, the examiners testified to the effect that they
could be 100 percent sure of a match. Because an examiner’s bottom line opinion
as to an identification is largely a subjective one, there is no reliable statistical or
scientific methodology which will currently permit the expert to testify that it is a
‘match’ to an absolute certainty, or to an arbitrary degree of statistical certainty.434
Other courts have excluded the use of terms such as “science” or “scientific,”
because of the risk that jurors may bestow the aura of the infallibility of science on
the testimony.435
In particular, some courts are troubled by forensic experts’ use of the expression “reasonable scientific certainty.” Although the term appears frequently in cases, its legal meaning is ambiguous.436 Sometimes it is used in lieu of a confidence statement (e.g., “high degree of certainty”), in which case the expert could avoid the term altogether and testify directly to how confident he or she is in the opinion.
In other cases, courts have interpreted reasonable scientific certainty to mean
that the expert must testify that a sample probably came from the defendant and not
427. United States v. Green, 405 F. Supp. 2d 104, 124 (D. Mass. 2005).
428. NRC Forensic Science Report, supra note 3, at 4.
429. Id. at 47.
430. Id. at 104.
431. Id. at 142–43.
432. 365 F.3d 215 (3d Cir. 2004).
433. Id. at 246.
434. United States v. Monteiro, 407 F. Supp. 2d 351, 372 (D. Mass. 2006).
435. United States v. Starzecpyzel, 880 F. Supp. 1027, 1038 (S.D.N.Y. 1995).
436. James E. Hullverson, Reasonable Degree of Medical Certainty: A Tort et a Travers, 31 St. Louis
U. L.J. 577, 582 (1987) (“[T]here is nevertheless an undercurrent that the expert in federal court
express some basis for both the confidence with which his conclusion is formed, and the probability
that his conclusion is accurate.”); Edward J. Imwinkelried & Robert G. Scofield, The Recognition of
an Accused’s Constitutional Right to Introduce Expert Testimony Attacking the Weight of Prosecution Science
Evidence: The Antidote for the Supreme Court’s Mistaken Assumption in California v. Trombetta, 33 Ariz.
L. Rev. 59, 69 (1991) (“Many courts continue to exclude opinions which fall short of expressing a
probability or certainty. . . . These opinions have been excluded in jurisdictions which have adopted
the Federal Rules of Evidence.”).
that it possibly came from the defendant.437 However, experts frequently testify that
two samples “could have come from the same source.” Such testimony meets the
relevancy standard of Federal Rule of Evidence 401, and there is no requirement in Article VII
of the Federal Rules that an expert’s opinion be expressed in terms of “probabilities.” Thus, in United States v. Cyphers438 the expert testified that hair samples
found on items used in a robbery “could have come” from the defendants.439 The
defendants argued that the testimony was inadmissible because the expert did not
express his opinion in terms of reasonable scientific certainty. The court wrote:
“There is no such requirement.”440
In Burke v. Town of Walpole,441 a bite mark identification case, the court of
appeals had to interpret the term as used in an arrest warrant:
[W]e must assume that the magistrate who issued the arrest warrant assigned no
more than the commonly accepted meaning among lawyers and judges to the
term “reasonable degree of scientific certainty”—“a standard requiring a showing
that the injury was more likely than not caused by a particular stimulus, based on
the general consensus of recognized [scientific] thought.” Black’s Law Dictionary
1294 (8th ed. 2004) (defining “reasonable medical probability,” or “reasonable
medical certainty,” as used in tort actions). That standard, of course, is fully
consistent with the probable cause standard.442
The case involved the guidelines adopted by ABFO that recognized several levels
of certainty (“reasonable medical certainty,” “high degree of certainty,” and
“virtual certainty”). The guidelines described “reasonable medical certainty” as
“convey[ing] the connotation of virtual certainty or beyond reasonable doubt.”443
This is not the way that some courts use the term.
437. State v. Holt, 246 N.E.2d 365, 368 (Ohio 1969). The expert testified, based on neutron
activation analysis, that two hair samples were “similar and . . . likely to be from the same source”
(emphasis in original).
438. 553 F.2d 1064 (7th Cir. 1977).
439. Id. at 1072; see also United States v. Davis, 44 M.J. 13, 16 (C.A.A.F. 1996) (“Evidence was
also admitted that appellant owned sneakers which ‘could have’ made these prints.”).
440. Cyphers, 553 F.2d at 1072; see also United States v. Oaxaca, 569 F.2d 518, 526 (9th Cir.
1978) (expert’s opinion regarding hair comparison admissible even though expert was less than certain);
United States v. Spencer, 439 F.2d 1047, 1049 (2d Cir. 1971) (expert’s opinion regarding handwriting
comparison admissible even though expert did not make a positive identification); United States v.
Longfellow, 406 F.2d 415, 416 (4th Cir. 1969) (expert’s opinion regarding paint comparison admissible, even though expert did not make a positive identification); State v. Boyer, 406 So. 2d 143, 148
(La. 1981) (reasonable scientific certainty not required where expert testifies concerning the presence
of gunshot residue based on neutron activation analysis).
441. 405 F.3d 66 (1st Cir. 2005).
442. Id. at 91.
443. Id. at 91 n.30 (emphasis omitted).
Moreover, the term may be problematic for a different reason—misleading
the jury. One court ruled that the term “reasonable scientific certainty” could not
be used because of the subjective nature of the opinion.444
C. Restriction of Final Argument
In a number of cases, counsel in summation has overstated the content of the
expert testimony. In People v. Linscott,445 for example, “the prosecutor argued
that hairs found in the victim’s apartment and on the victim’s body were in fact
defendant’s hairs.”446 Reversing, the Illinois Supreme Court wrote: “With these
statements, the prosecutor improperly argued that the hairs removed from the
victim’s apartment were conclusively identified as coming from defendant’s head
and pubic region. There simply was no testimony at trial to support these statements. In fact, [the prosecution experts] and the defense hair expert . . . testified
that no such identification was possible.”447 DNA testing exculpated Linscott.448
Trial judges can police the attorneys’ descriptions of the testimony during closing
argument as well as the content of expert testimony presented.
XII. Procedural Issues
The Daubert standard operates in a procedural setting, not a vacuum. In Daubert,
the Supreme Court noted that “[v]igorous cross-examination, presentation of contrary evidence, and careful instruction on the burden of proof are the traditional
and appropriate means of attacking shaky but admissible evidence.”449 Adversarial
testing presupposes advance notice of the content of the expert’s testimony and
access to comparable expertise to evaluate that testimony. This section discusses
some of the procedural mechanisms that trial judges may use to assure that jurors
properly evaluate any expert testimony by forensic identification experts.
444. United States v. Glynn, 578 F. Supp. 2d 567, 568–75 (S.D.N.Y. 2008) (firearms identification case).
445. 566 N.E.2d 1355 (Ill. 1991).
446. Id. at 1358.
447. Id. at 1359.
448. See Connors et al., supra note 387, at 65 (“The State’s expert on the hair examination
testified that only 1 in 4,500 persons would have consistent hairs when tested for 40 different characteristics. He only tested between 8 and 12 characteristics, however, and could not remember which
ones. The appellate court ruled on July 29, 1987, that his testimony, coupled with the prosecution’s
use of it at closing arguments, constituted denial of a fair trial.”) (citation omitted).
449. 509 U.S. at 596 (citing Rock v. Arkansas, 483 U.S. 44, 61 (1987)).
A. Pretrial Discovery
Judges can monitor discovery in scientific evidence cases to ensure that disclosure
is sufficiently comprehensive.450 Federal Rule 16 requires discovery of laboratory
reports451 and a summary of the expert’s opinion.452 The efficacy of these provisions depends on the content of the reports and the summary. The Journal of Forensic
Sciences, the official publication of the American Academy of Forensic Sciences,
published a symposium on the ethical responsibilities of forensic scientists in 1989.
One symposium article described a number of unacceptable laboratory reporting
practices, including (1) “preparation of reports containing minimal information in
order not to give the ‘other side’ ammunition for cross-examination,” (2) “reporting of findings without an interpretation on the assumption that if an interpretation
is required it can be provided from the witness box,” and (3) “[o]mitting some
significant point from a report to trap an unsuspecting cross-examiner.”453
NRC has recommended extensive discovery in DNA cases: “All data and
laboratory records generated by analysis of DNA samples should be made freely
available to all parties. Such access is essential for evaluating the analysis.”454 The
NRC report on bullet lead contained similar comments about the need for a
thorough report in bullet lead cases:
The conclusions in laboratory reports should be expanded to include the limitations of compositional analysis of bullet lead evidence. In particular, a further
450. See Fed. R. Crim. P. 16 (1975) advisory committee’s note (“[I]t is difficult to test expert
testimony at trial without advance notice and preparation.”), reprinted in 62 F.R.D. 271, 312 (1974);
Paul C. Giannelli, Criminal Discovery, Scientific Evidence, and DNA, 44 Vand. L. Rev. 791 (1991). “Early
disclosure can have the following benefits: [1] Avoiding surprise and unnecessary delay. [2] Identifying the need for defense expert services. [3] Facilitating exoneration of the innocent and encouraging
plea negotiations if DNA evidence confirms guilt.” National Institute of Justice, President’s DNA
Initiative: Principles of Forensic DNA for Officers of the Court (2005), available at http://www.dna.
gov/training/otc.
451. Fed. R. Crim. P. 16(a)(1)(F).
452. Id. 16(a)(1)(G).
453. Douglas M. Lucas, The Ethical Responsibilities of the Forensic Scientist: Exploring the Limits, 34
J. Forensic Sci. 719, 724 (1989). Lucas was the Director of The Centre of Forensic Sciences, Ministry
of the Solicitor General, Toronto, Ontario.
454. National Research Council, DNA Technology in Forensic Science 146 (1992) (“The
prosecutor has a strong responsibility to reveal fully to defense counsel and experts retained by the
defendant all material that might be necessary in evaluating the evidence.”); see also id. at 105 (“Case
records—such as notes, worksheets, autoradiographs, and population databanks—and other data or
records that support examiners’ conclusions are prepared, retained by the laboratory, and made available for inspection on court order after review of the reasonableness of a request.”); National Research
Council, The Evaluation of Forensic DNA Evidence 167–69 (1996) (“Certainly, there are no strictly
scientific justifications for withholding information in the discovery process, and in Chapter 3 we
discussed the importance of full, written documentation of all aspects of DNA laboratory operations.
Such documentation would facilitate technical review of laboratory work, both within the laboratory
and by outside experts. . . . Our recommendations that all aspects of DNA testing be fully documented
is most valuable when this documentation is discoverable in advance of trial.”).
explanatory comment should accompany the laboratory conclusions to portray
the limitations of the evidence. Moreover, a section of the laboratory report
translating the technical conclusions into language that a jury could understand
would greatly facilitate the proper use of this evidence in the criminal justice
system. Finally, measurement data (means and standard deviations) for all of the
crime scene bullets and those deemed to match should be included.455
As noted earlier, the recent NRC report made similar comments:
Some reports contain only identifying and agency information, a brief description of the evidence being submitted, a brief description of the types of analysis
requested, and a short statement of the results (e.g., “the greenish, brown plant
material in item #1 was identified as marijuana”), and they include no mention
of methods or any discussion of measurement uncertainties.456
Melendez-Diaz v. Massachusetts457 illustrates the problem. The laboratory
report in that case “contained only the bare-bones statement that ‘[t]he substance
was found to contain: Cocaine.’ At the time of trial, petitioner did not know
what tests the analysts performed, whether those tests were routine, and whether
interpreting their results required the exercise of judgment or the use of skills that
the analysts may not have possessed.”458
1. Testifying beyond the report
Experts should generally not be allowed to testify beyond the scope of the report
without issuing a supplemental report. Troedel v. Wainwright,459 a capital murder
case, illustrates the problem. In that case, a report of a gunshot residue test based
on neutron activation analysis stated the opinion that swabs “from the hands of
Troedel and Hawkins contained antimony and barium in amounts typically found
on the hands of a person who has discharged a firearm or has had his hands in close
proximity to a discharging firearm.”460 An expert testified consistently with this
report at Hawkins’ trial but embellished his testimony at Troedel’s trial by adding
the more inculpatory opinion that “Troedel had fired the murder weapon.”461 In
contrast, at a deposition during federal habeas proceedings, the same expert testified that “he could not, from the results of his tests, determine or say to a scientific
certainty who had fired the murder weapon” and the “amount of barium and
antimony on the hands of Troedel and Hawkins were basically insignificant.”462
The district court found the trial testimony, “at the very least,” misleading and
455. See NRC Bullet Lead Evidence, supra note 424, at 110–11.
456. NRC Forensic Science Report, supra note 3, at 21.
457. 129 S. Ct. 2527 (2009).
458. Id. at 2537.
459. 667 F. Supp. 1456 (S.D. Fla. 1986), aff’d, 828 F.2d 670 (11th Cir. 1987).
460. Id. at 1458.
461. Id.
462. Id. at 1459.
granted relief.463 The expert claimed that the prosecutor had “pushed” him to
embellish his testimony, a claim the prosecutor substantiated.464
B. Defense Experts
In appropriate cases, trial judges can provide the opposition with access to expert
resources. Defense experts are often important in cases involving forensic identification expertise. Counsel will frequently need expert guidance to determine whether a research study is methodologically sound; if so, whether the data adequately support the specific opinion proffered; and what role, if any, subjective judgment played in forming the opinion.
The NAS 1992 DNA report stressed that experts are necessary for an adequate defense in many cases: “Defense counsel must have access to adequate
expert assistance, even when the admissibility of the results of analytical techniques
is not in question because there is still a need to review the quality of the laboratory work and the interpretation of results.”465 According to the President’s DNA
Initiative, “[e]ven if DNA evidence is admitted, there still may be disagreement
about its interpretation—what do the DNA results mean in a particular case?”466
The need for defense experts is not limited to cases involving DNA evidence.
In Ake v. Oklahoma,467 the Supreme Court recognized a due process right to a
defense expert under certain circumstances.468 In federal trials, the Criminal Justice
Act of 1964469 provides for expert assistance for indigent defendants.
463. “[T]he Court concludes that the opinion Troedel had fired the weapon was known by the
prosecution not to be based on the results of the neutron activation analysis tests, or on any scientific
certainty or even probability. Thus, the subject testimony was not only misleading, but also was used
by the State knowing it to be misleading.” Id. at 1459–60.
464. Id. at 1459 (“[A]s Mr. Riley candidly admitted in his deposition, he was ‘pushed’ further
in his analysis at Troedel’s trial than at Hawkins’ trial. . . . [At the] evidentiary hearing held before
this Court, one of the prosecutors testified that, at Troedel’s trial, after Mr. Riley had rendered his
opinion which was contained in his written report, the prosecutor pushed to ‘see if more could have
been gotten out of this witness.’”).
465. NRC I, supra note 24, at 149 (“Because of the potential power of DNA evidence, authorities must make funds available to pay for expert witnesses. . . .”).
466. President’s DNA Initiative, supra note 450.
467. 470 U.S. 68 (1985); see Paul C. Giannelli, Ake v. Oklahoma: The Right to Expert Assistance
in a Post-Daubert, Post-DNA World, 89 Cornell L. Rev. 1305 (2004).
468. Ake, 470 U.S. at 74.
469. 18 U.S.C. § 3006A.
Reference Guide on
DNA Identification Evidence
DAVID H. KAYE AND GEORGE SENSABAUGH
David H. Kaye, M.A., J.D., is Distinguished Professor of Law, Weiss Family Scholar, and
Graduate Faculty Member, Forensic Science Program, The Pennsylvania State University,
University Park, and Regents’ Professor Emeritus, Arizona State University Sandra Day
O’Connor College of Law and School of Life Sciences, Tempe.
George Sensabaugh, D.Crim., is Professor of Biomedical and Forensic Sciences, School of
Public Health, University of California, Berkeley.
CONTENTS
I. Introduction, 131
A. Summary of Contents, 131
B. A Brief History of DNA Evidence, 132
C. Relevant Expertise, 134
II. Variation in Human DNA and Its Detection, 135
A. What Are DNA, Chromosomes, and Genes? 136
B. What Are DNA Polymorphisms and How Are They Detected? 139
1. Sequencing, 139
2. Sequence-specific probes and SNP chips, 140
3. VNTRs and RFLP testing, 140
4. STRs, 141
5. Summary, 142
C. How Is DNA Extracted and Amplified? 143
D. How Is STR Profiling Done with Capillary Electrophoresis? 144
E. What Can Be Done to Validate a Genetic System for
Identification? 148
F. What New Technologies Might Emerge? 148
1. Miniaturized “lab-on-a-chip” devices, 148
2. High-throughput sequencing, 149
3. Microarrays, 150
4. What questions do the new technologies raise? 150
III. Sample Collection and Laboratory Performance, 151
A. Sample Collection, Preservation, and Contamination, 151
1. Did the sample contain enough DNA? 151
2. Was the sample of sufficient quality? 152
B. Laboratory Performance, 153
1. What forms of quality control and assurance should be
followed? 153
2. How should samples be handled? 156
IV. Inference, Statistics, and Population Genetics in Human Nuclear DNA
Testing, 159
A. What Constitutes a Match or an Exclusion? 159
B. What Hypotheses Can Be Formulated About the Source? 160
C. Can the Match Be Attributed to Laboratory Error? 161
D. Could a Close Relative Be the Source? 162
E. Could an Unrelated Person Be the Source? 163
1. Estimating allele frequencies from samples, 164
2. The product rule for a randomly mating population, 165
3. The product rule for a structured population, 166
F. Probabilities, Probative Value, and Prejudice, 167
1. Frequencies and match probabilities, 167
2. Likelihood ratios, 172
3. Posterior probabilities, 173
G. Verbal Expressions of Probative Value, 174
1. “Rarity” or “strength” testimony, 175
2. Source or uniqueness testimony, 175
V. Special Issues in Human DNA Testing, 176
A. Mitochondrial DNA, 176
B. Y Chromosomes, 181
C. Mixtures, 182
D. Offender and Suspect Database Searches, 186
1. Which statistics express the probative value of a match to a
defendant located by searching a DNA database? 186
2. Near-miss (familial) searching, 189
3. All-pairs matching within a database to verify estimated
random-match probabilities, 191
VI. Nonhuman DNA Testing, 193
A. Species and Subspecies, 193
B. Individual Organisms, 195
Glossary of Terms, 199
References on DNA, 210
I. Introduction
Deoxyribonucleic acid, or DNA, is a molecule that encodes the genetic information in all living organisms. Its chemical structure was elucidated in 1953. More
than 30 years later, samples of human DNA began to be used in the criminal
justice system, primarily in cases of rape or murder. The evidence has been the
subject of extensive scrutiny by lawyers, judges, and the scientific community. It
is now admissible in all jurisdictions, but there are many types of forensic DNA
analysis, and still more are being developed. Questions of admissibility arise as
advancing methods of analysis and novel applications of established methods are
introduced.1
This reference guide addresses technical issues that are important when considering the admissibility of and weight to be accorded analyses of DNA, and it
identifies legal issues whose resolution requires scientific information. The goal is
to present the essential background information and to provide a framework for
resolving the possible disagreements among scientists or technicians who testify
about the results and import of forensic DNA comparisons.
A. Summary of Contents
Section I provides a short history of DNA evidence and outlines the types of
scientific expertise that go into the analysis of DNA samples.
Section II provides an overview of the scientific principles behind DNA typing. It describes the structure of DNA and how this molecule differs from person
to person. These are basic facts of molecular biology. The section also defines
the more important scientific terms and explains at a general level how DNA
differences are detected. These are matters of analytical chemistry and laboratory
procedure. Finally, the section indicates how it is shown that these differences
permit individuals to be identified. This is accomplished with the methods of
probability and statistics.
Section III considers issues of sample quantity and quality as well as laboratory
performance. It outlines the types of information that a laboratory should produce
to establish that it can analyze DNA reliably and that it has adhered to established
laboratory protocols.
Section IV examines issues in the interpretation of laboratory results. To assist
the courts in understanding the extent to which the results incriminate the defendant, it enumerates the hypotheses that need to be considered before concluding
that the defendant is the source of the crime scene samples, and it explores the
1. For a discussion of other forensic identification techniques, see Paul C. Giannelli et al., Reference Guide on Forensic Identification Expertise, in this manual. See also David H. Kaye et al., The
New Wigmore, A Treatise on Evidence: Expert Evidence (2d ed. 2011).
issues that arise in judging the strength of the evidence. It focuses on questions of
statistics, probability, and population genetics.2
Section V describes special issues in human DNA testing for identification.
These include the detection and interpretation of mixtures, Y-STR testing,
mitochondrial DNA testing, and the evidentiary implications of DNA database
searches of various kinds.
Finally, Section VI discusses the forensic analysis of nonhuman DNA. It identifies questions that can be useful in judging whether a new method or application
of DNA science has the scientific merit and power claimed by the proponent of
the evidence.
A glossary defines selected terms and acronyms encountered in genetics,
molecular biology, and forensic DNA work.
B. A Brief History of DNA Evidence
“DNA evidence” refers to the results of chemical or physical tests that directly
reveal differences in the structure of the DNA molecules found in organisms as
diverse as bacteria, plants, and animals.3 The technology for establishing the identity of individuals became available to law enforcement agencies in the mid to
late 1980s.4 The judicial reception of DNA evidence can be divided into at least
five phases.5 The first phase was one of rapid acceptance. Initial praise for RFLP
(restriction fragment length polymorphism) testing in homicide, rape, paternity,
and other cases was effusive. Indeed, one judge proclaimed “DNA fingerprinting”
to be “the single greatest advance in the ‘search for truth’ . . . since the advent of
cross-examination.”6 In this first wave of cases, expert testimony for the prosecution rarely was countered, and courts readily admitted DNA evidence.
In a second wave of cases, however, defendants pointed to problems at two
levels—controlling the experimental conditions of the analysis and interpreting the
results. Some scientists questioned certain features of the procedures for extracting
and analyzing DNA employed in forensic laboratories, and it became apparent
2. For a broader discussion of statistics, see David H. Kaye & David A. Freedman, Reference
Guide on Statistics, in this manual.
3. Differences in DNA also can be revealed by differences in the proteins that are made according to the “instructions” in a DNA molecule. Blood group factors, serum enzymes and proteins,
and tissue types all reveal information about the DNA that codes for these chemical structures. Such
immunogenetic testing predates the “direct” DNA testing that is the subject of this chapter. On the
nature and admissibility of the “indirect” DNA testing, see, for example, David H. Kaye, The Double
Helix and the Law of Evidence 5–19 (2010); 1 McCormick on Evidence § 205(B) (Kenneth Broun
ed., 6th ed. 2006).
4. The first reported appellate opinion is Andrews v. State, 533 So. 2d 841 (Fla. Dist. Ct. App.
1988).
5. The description that follows is adapted from 1 McCormick on Evidence, supra note 3, § 205(B).
6. People v. Wesley, 533 N.Y.S.2d 643, 644 (Albany County Ct. 1988).
that declaring matches or nonmatches in the DNA variations being compared
was not always trivial. Despite these concerns, most cases continued to find the
DNA analyses to be generally accepted, and a number of states provided for
admissibility of DNA tests by legislation. Concerted attacks by defense experts of
impressive credentials, however, produced a few cases rejecting specific proffers
on the ground that the testing was not sufficiently rigorous.7
A different attack on DNA profiling begun in cases during this period proved
far more successful and led to a third wave of cases in which many courts held
that estimates of the probability of a coincidentally matching DNA profile were
inadmissible. These estimates relied on a simple population genetics model for the
frequencies of DNA profiles, and some prominent scientists claimed that the applicability of the mathematical model had not been adequately verified. A heated
debate on this point spilled over from courthouses to scientific journals and convinced the supreme courts of several states that general acceptance was lacking. A
1992 report of the National Academy of Sciences proposed a more “conservative”
computational method as a compromise,8 and this seemed to undermine the claim
of scientific acceptance of the less conservative procedure that was in general use.
In response to the population genetics criticism and the 1992 report came an
outpouring of critiques of the report and new studies of the distribution of the DNA
variations in many populations. Relying on the burgeoning literature, a second
National Academy panel concluded in 1996 that the usual method of estimating frequencies in broad racial groups generally was sound, and it proposed improvements
and additional procedures for estimating frequencies in subgroups within the major
population groups.9 In the corresponding fourth phase of judicial scrutiny of DNA
evidence, the courts almost invariably returned to the earlier view that the statistics
associated with DNA profiling are generally accepted and scientifically valid.
In the fifth phase of the judicial evaluation of DNA evidence, results obtained
with the newer “PCR-based methods” entered the courtroom. Once again,
courts considered whether the methods rested on a solid scientific foundation and
were generally accepted in the scientific community. The opinions are practically
unanimous in holding that the PCR-based procedures satisfy these standards.
Before long, forensic scientists settled on the use of one type of DNA variation
(known as short tandem repeats, or STRs) to include or exclude individuals as
the source of crime scene DNA.
7. Moreover, a minority of courts, perhaps concerned that DNA evidence might be conclusive
in the minds of jurors, added a “third prong” to the general-acceptance standard of Frye v. United
States, 293 F. 1013 (D.C. Cir. 1923). This augmented Frye test requires not only proof of the general
acceptance of the ability of science to produce the type of results offered in court, but also of the
proper application of an approved method on the particular occasion. For criticism of this approach,
see David H. Kaye et al., supra note 1, § 6.3.3(a)(2).
8. National Research Council, DNA Technology in Forensic Science (1992) [hereinafter NRC I].
9. National Research Council, The Evaluation of Forensic DNA Evidence (1996) [hereinafter
NRC II].
Throughout these phases, DNA tests also exonerated an increasing number
of men who had been convicted of capital and other crimes, posing a challenge
to traditional postconviction remedies and raising difficult questions of postconviction access to DNA samples.10 The value of DNA evidence in solving older
crimes also prompted extensions of some statutes of limitations.11
In sum, in little more than a decade, forensic DNA typing made the transition
from a novel set of methods for identification to a relatively mature and well-studied forensic technology. However, one should not lump all forms of DNA
identification together. New techniques and applications continue to emerge,
ranging from the use of new genetic systems and new analytical procedures to the
typing of DNA from plants and animals. Before admitting such evidence, courts
normally inquire into the biological principles and knowledge that would justify
inferences from these new technologies or applications. As a result, this guide
describes not only the predominant STR technology, but also newer analytical
techniques that can be used for forensic DNA identification.
C. Relevant Expertise
Human DNA identification can involve testimony about laboratory findings,
about the statistical interpretation of those findings, and about the underlying
principles of molecular biology. Consequently, expertise in several fields might be
required to establish the admissibility of the evidence or to explain it adequately to
the jury. The expert who is qualified to testify about laboratory techniques might
not be qualified to testify about molecular biology, to make estimates of population frequencies, or to establish that an estimation procedure is valid.12
10. See, e.g., District Attorney’s Office for the Third Judicial District v. Osborne, 129 S. Ct. 2308 (2009)
(narrowly rejecting a convicted offender’s claim of a due process right to DNA testing at his expense,
enforceable under 42 U.S.C. § 1983, to establish that he is probably innocent of the crime for which
he was convicted after a fair trial, when (1) the convicted offender did not seek extensive DNA testing
before trial even though it was available, (2) he had other opportunities to prove his innocence after a
final conviction based on substantial evidence against him, (3) he had no new evidence of innocence (only
the hope that more extensive DNA testing than that done before the trial would exonerate him), and
(4) even a finding that he was not the source of the DNA would not conclusively demonstrate his innocence);
Skinner v. Switzer, 131 S. Ct. 1289 (2011); Brandon L. Garrett, Judging Innocence, 108 Colum. L. Rev.
55 (2008); Brandon L. Garrett, Claiming Innocence, 92 Minn. L. Rev. 1629 (2008).
11. See, e.g., Veronica Valdivieso, DNA Warrants: A Panacea for Old, Cold Rape Cases? 90 Geo.
L.J. 1009 (2002).
12. Nonetheless, if previous cases establish that the testing and estimation procedures are legally
acceptable, and if the computations are essentially mechanical, then highly specialized statistical expertise might not be essential. Reasonable estimates of DNA characteristics in major population groups can
be obtained from standard references, and many quantitatively literate experts could use the appropriate
formulae to compute the relevant profile frequencies or probabilities. NRC II, supra note 9, at 170.
Limitations in the knowledge of a technician who applies a generally accepted statistical procedure
can be explored on cross-examination. See Kaye et al., supra note 1, § 2.2. Accord Roberson v. State,
16 S.W.3d 156, 168 (Tex. Crim. App. 2000).
Trial judges ordinarily are accorded great discretion in evaluating the qualifications of a proposed expert witness, and the decisions depend on the background
of each witness. Courts have noted the lack of familiarity of academic experts—
who have done respected work in other fields—with the scientific literature on
forensic DNA typing and the extent to which their research or teaching lies in
other areas.13 Although such concerns may affect the persuasiveness of particular
testimony, they rarely result in exclusion on the grounds that the witness simply
is not qualified as an expert.
The scientific and legal literature on the objections to DNA evidence is
extensive. By studying the scientific publications, or perhaps by appointing a special master or expert adviser to assimilate this material, a court can ascertain where
a party’s expert falls within the spectrum of scientific opinion. Furthermore, an
expert appointed by the court under Federal Rule of Evidence 706 could testify
about the scientific literature generally or even about the strengths or weaknesses
of the particular arguments advanced by the parties.
Given the great diversity of forensic questions to which DNA testing might
be applied, it is not feasible to list the specific scientific expertise appropriate to all
applications. Assessing the value of DNA analyses in a novel application involving unfamiliar species can be especially challenging. If the technology is novel,
expertise in molecular genetics or biotechnology might be necessary. If testing
has been conducted on a particular organism or category of organisms, expertise
in that area of biology may be called for. If a random-match probability has been
presented, one might seek expertise in statistics as well as the population biology
or population genetics that goes with the organism tested. Given the penetration
of molecular technology into all areas of biological inquiry, it is likely that individuals can be found who know both the technology and the population biology
of the organism in question. Finally, when samples come from crime scenes, the
expertise and experience of forensic scientists can be crucial. Just as highly focused
specialists may be unaware of aspects of an application outside their field of expertise, so too scientists who have not previously dealt with forensic samples can be
unaware of case-specific factors that can confound the interpretation of test results.
II. Variation in Human DNA and Its
Detection
DNA is a complex molecule that contains the “genetic code” of organisms as
diverse as bacteria and humans. Although the DNA molecules in human cells are
13. E.g., State v. Copeland, 922 P.2d 1304, 1318 n.5 (Wash. 1996) (noting that defendant’s
statistical expert “was also unfamiliar with publications in the area,” including studies by “a leading
expert in the field” whom he thought was “a ‘guy in a lab somewhere’”).
largely identical from one individual to another, there are detectable variations—
except for identical twins, every two human beings have some differences in the
detailed structure of their DNA. This section describes the basic features of DNA
and some ways in which it can be analyzed to detect these differences.
A. What Are DNA, Chromosomes, and Genes?
The DNA molecule is made of subunits that include four chemical structures
known as nucleotide bases. The names of these bases (adenine, thymine, guanine,
and cytosine) usually are abbreviated as A, T, G, and C. The physical structure of
DNA is often described as a double helix because the molecule has two spiraling
strands connected to each other by weak bonds between the nucleotide bases.
As shown in Figure 1, A pairs only with T and G only with C. Thus, the order
of the single bases on either strand reveals the order of the pairs from one end of
the molecule to the other, and the DNA molecule could be said to be like a long
sequence of As, Ts, Gs, and Cs.
Figure 1. Sketch of a small part of a double-stranded DNA molecule. Nucleotide
bases are held together by weak bonds. A pairs with T; C pairs with G.
Most human DNA is tightly packed into structures known as chromosomes, which come in different sizes and are located in the nuclei of cells. The
chromosomes are numbered (in descending order of size) 1 through 22, with the
remaining chromosome being an X or a much smaller Y. If the bases are like
letters, then each chromosome is like a book written in this four-letter alphabet,
and the nucleus is like a bookshelf in the interior of the cell. All the cells in one
individual contain identical copies of the same collection of books. The sequence
of the As, Ts, Gs, and Cs that constitutes the “text” of these books is referred to
as the individual’s nuclear genome.
All told, the genome comprises more than three billion “letters” (As, Ts, Gs,
and Cs). If these letters were printed in books, the resulting pile would be as high
as the Washington Monument. About 99.9% of the genome is identical between
any two individuals. This similarity is not really surprising—it accounts for the
common features that make humans an identifiable species (and for features that
we share with many other species as well). The remaining 0.1% is particular to an
individual. This variation makes each person (other than identical twins) genetically unique. This small percentage may not sound like a lot, but it adds up to
some three million sites for variation among individuals.
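For readers who wish to check the arithmetic, these figures reduce to a one-line calculation. The following Python sketch is purely illustrative; the 3-billion and 0.1% figures are the ones given in the text:

```python
# Approximate figures from the text.
genome_length = 3_000_000_000      # base pairs ("letters") in the nuclear genome
variable_fraction = 0.001          # the ~0.1% that differs between individuals

variable_sites = genome_length * variable_fraction
print(f"{variable_sites:,.0f} potentially variable sites")  # 3,000,000
```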
The process that gives rise to this variation among people starts with the production of special sex cells—sperm cells in males and egg cells in females. All the
nucleated cells in the body other than sperm and egg cells contain two versions of
each of the 23 chromosomes—two copies of chromosome 1, two copies of chromosome 2, and so on, for a total of 46 chromosomes. The X and Y chromosomes are
the sex-determining chromosomes. Cells in females contain two X chromosomes,
and cells in males contain one X and one Y chromosome. An egg cell, however,
contains only 23 chromosomes—one chromosome 1, one chromosome 2, . . . , and
one X chromosome—each selected at random from the woman’s full complement
of 23 chromosome pairs. Thus, each egg carries half the genetic information present
in the mother’s 23 chromosome pairs, and because the assortment of the chromosomes is random, each egg carries a different complement of genetic information.
The same situation exists with sperm cells. Each sperm cell contains a single copy
of each of the 23 chromosomes selected at random from a man’s 23 pairs, and each
sperm differs in the assortment of the 23 chromosomes it carries. Fertilization of an
egg by a sperm therefore restores the full number of 46 chromosomes, with the 46
chromosomes in the fertilized egg being a new combination of those in the mother
and father. The process resembles taking two decks of cards (a male and a female
deck) and shuffling a random half from the male deck into a random half from the
female deck, to produce a new deck.
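The deck-shuffling analogy can be made concrete with a short simulation. This Python sketch is illustrative only; the labels "M" and "P" are hypothetical stand-ins for the maternally and paternally inherited copy of each chromosome:

```python
import random

# One sex cell receives one chromosome, chosen at random, from each of the
# 23 pairs -- like dealing a random half from each "deck."
def make_sex_cell(n_pairs=23):
    return [random.choice("MP") for _ in range(n_pairs)]

cell = make_sex_cell()
print(cell)        # e.g., ['P', 'M', 'M', ...] -- a different mix each time

# Independent assortment alone allows 2 choices for each of the 23 pairs:
print(2 ** 23)     # 8388608 possible assortments, even before recombination
```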
During pregnancy, the fertilized cell divides to form two cells, each of which
has an identical copy of the 46 chromosomes. The two then divide to form four,
the four form eight, and so on. As gestation proceeds, various cells specialize
(“differentiate”) to form different tissues and organs. Although cell differentiation
yields many different kinds of cells, the process of cell division results in each progeny cell having the same genomic complement as the cell that divided. Thus, each
of the approximately 100 trillion cells in the adult human body has the same DNA
text as was present in the original 23 pairs of chromosomes from the fertilized egg,
one member of each pair having come from the mother and one from the father.
A second mechanism operating during the chromosome reduction process in
sperm and egg cells further shuffles the genetic information inherited from mother
and father. In the first stage of the reduction process, each chromosome of a
chromosome pair aligns with its partner. The maternally inherited chromosome 1
aligns with the paternally inherited chromosome 1, and so on through the 22 pairs;
X chromosomes align with each other as well, but X and Y chromosomes do not.
While the chromosome pairs are aligned, they exchange pieces to create new combinations. The recombined chromosomes are passed on in the sperm and eggs. As a
consequence, the chromosomes we inherit from our parents are not exact copies of
their chromosomes, but rather are mosaics of these parental chromosomes.
The swapping of material between chromosome pairs (as they align in the
emerging sex cells) and the random selection (of half of each parent’s 46 chromosomes) in making sex cells is called recombination. Recombination is the principal
source of diversity in individual human genomes.
The diverse variations occur both within the genes and in the regions of
DNA sequences between the genes. A gene can be defined as a segment of DNA,
usually from 1000 to 10,000 base pairs long, that “codes” for a protein. The cell
produces specific proteins that correspond to the order of the base pairs (the
“letters”) in the coding part of the gene.14 Human genes also contain noncoding
sequences that regulate the cell type in which a protein will be synthesized and
how much protein will be produced.15 Many genes contain interspersed noncoding, nonregulatory sequences that no longer participate in protein synthesis.
These sequences, which have no apparent function, constitute about 23% of the
base pairs within human genes.16 In terms of the metaphor of DNA as text, the
gene is like an important paragraph in the book, often with some gibberish in it.
Proteins perform all sorts of functions in the body and thus produce observable characteristics. For example, a tiny part of the sequence that directs the production of the human group-specific complement protein (a protein that binds to
vitamin D and transports it to certain tissues) is
G C A A A A T T G C C T G A T G C C A C A C C C A A G G A A C T G G C A.
14. The sequence in which the building blocks (amino acids) of a protein are arranged corresponds to the sequence of base pairs within a gene. (A sequence of three base pairs specifies a particular
one of the 20 possible amino acids in the protein. The mapping of a set of three nucleotide bases to a particular amino acid is the genetic code. The cell makes the protein through intermediate steps involving
coding RNA transcripts.) About 1.5% of the human genome codes for the amino acid sequences.
15. These noncoding but functional sequences include promoters, enhancers, and repressors.
16. This gene-related DNA consists of introns (which interrupt the coding sequences, called
exons, in genes and which are edited out of the RNA transcript for the protein), pseudogenes (evolutionary remnants of once-functional genes), and gene fragments. The idea of a gene as a block of
DNA (some of which is coding, some of which is regulatory, and some of which is functionless) is
an oversimplification, but it is useful enough here. See, e.g., Mark B. Gerstein et al., What Is a Gene,
Post-ENCODE? History and Updated Definition, 17 Genome Res. 669 (2007).
This gene always is located at the same position, or locus, on chromosome 4.
As we have seen, most individuals have two copies of each gene at a given locus—
one from the father and one from the mother.
A locus where almost all humans have the same DNA sequence is called
monomorphic (“of one form”). A locus where the DNA sequence varies among
significant numbers of individuals (more than 1% or so of the population possesses the variant) is called polymorphic (“of many forms”), and the alternative
forms are called alleles. For example, the GC protein gene sequence has three
common alleles that result from substitutions in a base at a given point. Where an
A appears in one allele, there is a C in another. The third allele has the A, but at
another point a G is swapped for a T. These changes are called single nucleotide
polymorphisms (SNPs, pronounced “snips”).
If a gene is like a paragraph in a book, a SNP is a change in a letter somewhere within that paragraph (a substitution, a deletion, or an insertion), and the
two versions of the gene that result from this slight change are the alleles. An
individual who inherits the same allele from both parents is called a homozygote.
An individual with distinct alleles is a heterozygote.
DNA sequences used for forensic analysis usually are not genes. They lie in
the vast regions between genes (about 75% of the genome is extragenic) or
in the apparently nonfunctional regions within genes. These extra- and intragenic
regions of DNA have been found to contain considerable sequence variation,
which makes them particularly useful in distinguishing individuals. Although
the terms “locus,” “allele,” “homozygous,” and “heterozygous” were developed
to describe genes, the nomenclature has been carried over to describe all DNA
variation—coding and noncoding alike. Both types are inherited from mother and
father in the same fashion.
B. What Are DNA Polymorphisms and How Are They
Detected?
By determining which alleles are present at strategically chosen loci, the forensic
scientist ascertains the genetic profile, or genotype, of an individual (at those loci).
Although the differences among the alleles arise from alterations in the order of
the ATGC letters, genotyping does not necessarily require “reading” the full DNA
sequence. Here we outline the major types of polymorphisms that are (or could
be) used in identity testing and the methods for detecting them.
1. Sequencing
Researchers are investigating radically new and efficient technologies to sequence
entire genomes, one base pair at a time, but the direct sequencing methods now in
existence are technically demanding, expensive, and time-consuming for whole-genome sequencing. Therefore, most genetic typing focuses on identifying only
those variations that define the alleles and does not attempt to “read out” each
and every base pair as it appears. The exception is mitochondrial DNA, described
in Section V. As next-generation sequencing technologies are perfected, however
(see infra Section II.F), this situation could change.
2. Sequence-specific probes and SNP chips
Simple sequence variation, such as that for the GC locus, is conveniently detected
using sequence-specific oligonucleotide (SSO) probes. A probe is a short, single
strand of DNA. With GC typing, for example, probes for the three common
alleles are attached to designated locations on a membrane. Copies of the variable
sequence region of the GC gene in the crime scene sample are made with the
polymerase chain reaction (PCR), which is discussed in the next section. These
copies (in the form of single strands) are poured onto the membrane. Whichever
allele is present in a single-stranded DNA fragment will cause the fragment to stick
to the corresponding, immobilized probe strands. To permit the fragments of this
type to be seen, a chemical “label” that catalyzes a color change at the spot where
the DNA binds to its probe can be attached when the copies are made. A colored
spot showing that the allele is present thus should appear on the membrane at the
location of the probe that corresponds to this particular allele. If only one allele
is present in the crime scene DNA (because of homozygosity), there will be no
change at the spots where the other probes are located. If two alleles are present
(heterozygosity), the corresponding two spots will change color.
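The spot-coloring logic just described can be modeled in a few lines. The sketch below is a toy illustration, not forensic software; the allele names are hypothetical:

```python
# Toy model of a membrane with one immobilized probe per allele.
PROBES = ["allele-1", "allele-2", "allele-3"]  # fixed spots on the membrane

def colored_spots(sample_alleles):
    # A spot changes color only if the sample contains the matching allele.
    return [probe for probe in PROBES if probe in sample_alleles]

print(colored_spots({"allele-2"}))               # homozygote: one spot colors
print(colored_spots({"allele-1", "allele-3"}))   # heterozygote: two spots color
```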
This approach can be miniaturized and automated by embedding probes
for many loci on a silicon chip. Commercially available “SNP chips” for disease
research incorporate enough different probes to detect on the order of a million
different known SNPs throughout the human genome. These chips have become
a basic tool in searches for genetic changes associated with human diseases. They
are described further in Section II.F.
3. VNTRs and RFLP testing
Another category of DNA variations comes from the insertion of a variable number of tandem repeats (VNTR) at a locus. These were the first polymorphisms to
find widespread use in identity testing and hence were the subject of most of the
court opinions on the admissibility of DNA in the late 1980s and early 1990s. The
core unit of a VNTR is a particular short DNA sequence that is repeated many
times end-to-end. The first VNTRs to be used in genetic and forensic testing
had core repeat sequences of 15–35 base pairs. In this testing, bacterial enzymes
(known as “restriction enzymes”) were used to cut the DNA molecule both
before and after the VNTR sequence. A small number of repeats in the VNTR
region gives rise to a small “restriction fragment,” and a large number of repeats
yields a large fragment. A substantial quantity of DNA from a crime scene sample
is required to give a detectable number of VNTR fragments with this procedure.
The detection is accomplished by applying a probe that binds when it encounters the repeated core sequence. A radioactive or fluorescent molecule attached to
the probe provides a way to mark the VNTR fragment. The probe ignores DNA
fragments that do not include the VNTR core sequence. (There are many of these
unwanted fragments, because the restriction enzymes chop up the DNA throughout the genome—not just at the VNTR loci.) The restriction fragments are sorted
by a process known as electrophoresis, which separates DNA fragments based on
size. Many early court opinions refer to this process as RFLP testing.17
4. STRs
Although RFLP-VNTR profiling is highly discriminating,18 it has several drawbacks. Not only does it require a substantial sample size, but it also is time-consuming and does not measure the fragment lengths to the nearest number
of repeats. The measurement error inherent in the form of electrophoresis used
(known as “gel electrophoresis”) is not a fundamental obstacle, but it complicates
the determination of which profiles match and how often other profiles in the
population would be declared to match.19 Consequently, forensic scientists have
moved from VNTRs to another form of repetitive DNA known as short tandem
repeats (STRs) or microsatellites. STRs have very short core repeats, two to seven
base pairs in length, and they typically extend for only some 50 to 350 base pairs.20
Like the larger VNTRs, which extend for thousands of base pairs, STR sequences
do not code for proteins, and the ones used in identity testing convey little or no
information about an individual’s propensity for disease.21 Because STR alleles
17. It would be clearer to call it RFLP-VNTR testing, because the fragments being measured
contain the VNTRs rather than some simpler polymorphisms that were used in genetic research
and disease testing. A more detailed exposition of the steps in RFLP-VNTR profiling (including gel
electrophoresis, Southern blotting, and autoradiography) can be found in the previous edition of this
guide and in many judicial opinions circa 1990.
18. Alleles at VNTR loci generally are too long to be measured precisely by electrophoretic
methods—alleles differing in size by only a few repeat units may not be distinguished. Although this
makes for complications in deciding whether two length measurements that are close together result
from the same allele, these loci are quite powerful for the genetic differentiation of individuals, because
they tend to have many alleles that occur relatively rarely in the population. At a locus with only 20
such alleles (and most loci typically have many more), there are 210 possible genotypes. With five such
loci, the number of possible genotypes is 210⁵, which is more than 400 billion.
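The counting in this footnote follows a standard formula: a locus with n alleles allows n homozygous genotypes plus n(n - 1)/2 heterozygous ones, for n(n + 1)/2 in all. A short Python check (illustrative only):

```python
# Genotypes at a locus with n alleles: n homozygous + n*(n-1)/2 heterozygous.
def genotype_count(n_alleles):
    return n_alleles * (n_alleles + 1) // 2

print(genotype_count(20))       # 210 genotypes at a 20-allele locus
print(genotype_count(20) ** 5)  # 408410100000 -- over 400 billion for 5 loci
```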
19. For a case reversing a conviction as a result of an expert’s confusion on this score, see People
v. Venegas, 954 P.2d 525 (Cal. 1998). More suitable procedures for match windows and probabilities
are described in NRC II, supra note 9.
20. The numbers, and the distinction between “minisatellites” (VNTRs) and microsatellites
(STRs), are not precise, but the mechanisms that give rise to the shorter tandem repeats differ from
those that produce the longer ones. See Benjamin Lewin, Genes IX 124–25 (9th ed. 2008).
21. See David H. Kaye, Please, Let’s Bury the Junk: The CODIS Loci and the Revelation of Private
Information, 102 Nw. U. L. Rev. Colloquy 70 (2007), available at http://www.law.northwestern.edu/
lawreview/colloquy/2007/25/.
are much smaller than VNTR alleles, however, they can be amplified with PCR
designed to copy only the locus of interest. This obviates the need for restriction enzymes, and it allows laboratories to analyze STR loci much more quickly.
Because the amplified fragments are shorter, electrophoretic detection permits the
exact number of base pairs in an STR to be determined, allowing alleles to be
defined as discrete entities. Figure 2 illustrates the nature of allelic variation at an
STR locus found on chromosome 16.
Figure 2. Three alleles of the D16S539 STR. The core sequence is GATA. The
first allele listed has 9 tandem repeats, the second has 10, and the third
has 11. The locus has other alleles (different numbers of repeats), shown
in Figure 4.
Nine-repeat allele:
GATAGATAGATAGATAGATAGATAGATAGATAGATA
Ten-repeat allele:
GATAGATAGATAGATAGATAGATAGATAGATAGATAGATA
Eleven-repeat allele:
GATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATA
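The allele designations in Figure 2 can be recovered mechanically by counting copies of the GATA core. A minimal Python sketch (the function name is ours, purely for illustration):

```python
# Count how many times the core sequence repeats at the start of an allele.
def repeat_count(sequence, core="GATA"):
    count = 0
    while sequence.startswith(core):
        count += 1
        sequence = sequence[len(core):]
    return count

alleles = [
    "GATAGATAGATAGATAGATAGATAGATAGATAGATA",          # nine-repeat allele
    "GATAGATAGATAGATAGATAGATAGATAGATAGATAGATA",      # ten-repeat allele
    "GATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATA",  # eleven-repeat allele
]
print([repeat_count(a) for a in alleles])  # [9, 10, 11]
```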
Although there are fewer alleles per locus for STRs than for VNTRs, there
are many STRs, and they can be analyzed simultaneously. Such “multiplex” systems now permit the simultaneous analysis of 16 loci. A subset of 13 is standard
in the United States (see infra Section II.D), and these are capable of distinguishing
among almost everyone in the population.22
5. Summary
DNA contains the genetic information of an organism. In humans, most of the
DNA is found in the cell nucleus, where it is organized into separate chromosomes. Each chromosome is like a book, and each cell has the same library
(genome) of books of various sizes and shapes. There are two copies of each book
of a particular size and shape, one that came from the father, the other from the
mother. Thus, there are two copies of the book entitled “Chromosome One,”
two copies of “Chromosome Two,” and so on. Genes are the most meaningful
paragraphs in the books. Other parts of the text appear to have no coherent message. Two individuals sometimes have different versions (alleles) of the same paragraph. Some alleles result from the substitution of one letter for another. These are
SNPs. Others come about from the insertion or deletion of single letters, and still
22. Usually, there are between 7 and 15 STR alleles per locus. Thirteen loci that have 10 STR
alleles each (and hence 55 possible genotypes per locus) can give rise to 55¹³, or 42 billion trillion, possible genotypes.
others represent a kind of stuttering repetition of a string of extra letters. These
are the VNTRs and STRs.23 The locations within a chromosome where these
interpersonal variations occur are called loci.
C. How Is DNA Extracted and Amplified?
DNA usually can be found in biological materials such as blood, bone, saliva, hair,
semen, and urine. A combination of routine chemical and physical methods permits DNA to be extracted from cell nuclei and isolated from the other chemicals
in a sample. PCR then is used to make exponentially large numbers of copies of
targeted regions of the extracted DNA. PCR might be applied to the double-stranded DNA segments extracted and purified from a forensic sample as follows:
First, the purified DNA is separated into two strands by heating it to near the boiling
point of water. This “denaturing” takes about a minute. Second, the single strands
are cooled, and “primers” attach themselves to the points at which the copying
will start and stop. (Primers are small, manmade pieces of DNA, usually between
15 and 30 nucleotides long, of known sequences. If a locus of interest starts near
the sequence ATCGAATCGGTAGCCATATG on one strand, a suitable primer
would have the complementary sequence TAGCTTAGCCATCGGTATAC.)
“Annealing” these primers takes about 45 seconds. Finally, the soup containing
the annealed DNA strands, the enzyme DNA polymerase, and lots of the four
nucleotide building blocks (A, C, G, and T) is warmed to a comfortable working
temperature for the polymerase to insert the complementary base pairs one at a
time, building a matching second strand bound to the original “template” and
thus replicating part of the DNA strand that was separated from its partner in the
first step. The same replication occurs with the separated partner as the template.
This “extension” step for both templates takes about 2 minutes. The result is two
identical double-stranded DNA segments, one made from each strand of the original DNA. The three-step cycle is repeated, usually 20 to 35 times in automated
machines known as thermocyclers. Ideally, the first cycle results in two double-stranded DNA segments. The second cycle produces four, the third eight, and
so on, until the number of copies of the original DNA is enormous. In practice,
there is some inefficiency in the doubling process, but the yield from a 30-cycle
amplification is generally about 1 million to 10 million copies of the targeted
sequence.24 In this way, PCR magnifies short sequences of interest in a small
number of DNA fragments into millions of exact copies. Machines that automate
the PCR process are commercially available.
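The arithmetic of the three-step cycle lends itself to a small numerical sketch (illustrative only; the 70% per-cycle efficiency is an assumed figure chosen to land in the 1 million to 10 million range the text reports):

```python
# Complementary-strand pairing used by the primers (A<->T, C<->G).
def complement(strand: str) -> str:
    return strand.translate(str.maketrans("ACGT", "TGCA"))

# Copies of the target after a number of cycles; each cycle multiplies
# the amount by (1 + efficiency), where efficiency = 1.0 is perfect doubling.
def pcr_copies(cycles: int, efficiency: float = 1.0) -> float:
    return (1 + efficiency) ** cycles

# The primer example from the text:
print(complement("ATCGAATCGGTAGCCATATG"))  # TAGCTTAGCCATCGGTATAC

print(f"{pcr_copies(30):.2e}")       # 1.07e+09 with ideal doubling
print(f"{pcr_copies(30, 0.7):.2e}")  # 8.19e+06 at an assumed 70% efficiency
```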
For PCR amplification to work properly and yield copies of only the desired
sequence, however, care must be taken to achieve the appropriate chemical conditions
and to avoid excessive contamination of the sample. A laboratory should
be able to demonstrate that it can amplify targeted sequences faithfully with the
equipment and reagents that it uses and that it has taken suitable precautions to
avoid or detect contamination from foreign DNA. With small samples, it is possible
that some alleles will be amplified and others missed (preferential amplification,
discussed infra Section III.A.1), and mutations in the region of a primer can
prevent the amplification of the allele downstream of the primer (null alleles).25
23. In addition to the 23 pairs of books in the cell nucleus, other scraps of text reside in each
of the mitochondria, the power plants of the cell. See infra Section V.
24. NRC II, supra note 9, at 69–70.
D. How Is STR Profiling Done with Capillary Electrophoresis?
In the most commonly used analytical method for detecting STRs, the STR fragments in the sample are amplified using primers with fluorescent tags. Each new
STR fragment made in a PCR cycle bears a fluorescent dye. When struck by a
source light, each dye glows with a particular color. The fragments are separated
according to their length by electrophoresis in automated “genetic analyzer”
machinery—a byproduct of the technology developed for the Human Genome
Project that first sequenced most of the human genome. In these machines, a long,
narrow tube (a “capillary”) is filled with an entangled polymer or comparable sieving medium, and an electric field is applied to pull DNA fragments placed at one
end of the tube through the medium. Shorter fragments slip through the medium
more quickly than larger, bulkier ones. A laser beam is sent through a small glass
window in the tube. The laser light excites the dye, causing it to fluoresce at a
characteristic wavelength as the tagged fragments pass under the light. The intensity of the light emitted by the dye is recorded by a kind of electronic camera and
transformed into a graph (an electropherogram), which shows a peak as an STR
flashes by. A shorter allele will pass by the window and fluoresce first; a longer
fragment will come by later, giving rise to another peak on the graph. Figure 3
provides a sketch of how the alleles with five and eight repeats of the GATA
sequence at the D16S539 STR locus might appear in an electropherogram.
Medical and human geneticists were interested in STRs as markers in family
studies to locate the genes that are associated with inherited diseases, and papers
on their potential for identity testing appeared in the early 1990s. Developmental
research to pick suitable loci moved into high gear in England, Europe, and
Canada. Britain’s Forensic Science Service applied a four-locus testing system in
1994. Then, in 1996, it introduced the “second generation multiplex” (SGM) for simultaneously typing six loci. These soon would be used to build England’s
National DNA Database. The database system allows a computer to check the
STR types of millions of known or suspected criminals against thousands of crime
25. A null allele will not lead to a false exclusion if the two DNA samples from the same individual are amplified with the same primer system, but it could lead to an exclusion at one locus when
searching a database of STR profiles if the database profile was determined with a different PCR kit
than the one used to analyze the crime scene DNA.
Figure 3. Sketch of an electropherogram for two D16S539 alleles. One allele has
five repeats of the sequence GATA; the other has eight. Each GATA
repeat is depicted as a small rectangle. Although only one copy of each
allele (with a fluorescent molecule, or “tag” attached) is shown here,
PCR generates a great many copies from the DNA sample with these
alleles at the D16S539 locus. These copies are drawn through the capillary tube, and the tags glow as the STR fragments move through the
laser beam. An electronic camera measures the colored light from the
tags. Finally, a computer processes the signal from the camera to produce
the electropherogram. Source: David H. Kaye, The Double Helix and
the Law of Evidence 189, fig. 9.1 (2010).
scene samples. A six-locus STR profile can be represented as a string of 12 digits;
each digit indicates the number of repeat units in the alleles at each locus. These
discrete, numerical DNA profiles are far easier to compare mechanically than the
complex patterns of fingerprints. In the United States, the FBI settled on 13 “core
loci” to use in the U.S. national DNA database system. These are often called
the “CODIS core loci,” and an additional 7 STR loci are under consideration.26
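The mechanical comparison of discrete profiles that the text describes can be sketched in a few lines (the locus names follow the text's examples, but the repeat numbers here are hypothetical):

```python
# A single-source profile: locus name -> the pair of repeat counts (alleles).
def same_profile(a: dict, b: dict) -> bool:
    """Profiles match if every locus typed in both has the same alleles
    (order within a pair does not matter)."""
    shared = a.keys() & b.keys()
    return bool(shared) and all(sorted(a[loc]) == sorted(b[loc]) for loc in shared)

crime_scene = {"D16S539": (5, 8), "TH01": (6, 9.3), "TPOX": (8, 8)}
suspect = {"D16S539": (8, 5), "TH01": (6, 9.3), "TPOX": (8, 8)}
someone_else = {"D16S539": (11, 12), "TH01": (7, 7), "TPOX": (8, 11)}

print(same_profile(crime_scene, suspect))       # True
print(same_profile(crime_scene, someone_else))  # False
```

Comparisons of this kind, repeated across millions of database records, are what make STR profiles so much easier to search mechanically than fingerprint patterns.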
Modern genetic analyzers produce electropherograms for many loci at once.
This “multiplexing” is accomplished by using dyes that fluoresce at distinct colors
to label the alleles from different groups of loci. A separate set of fragments of
known sizes that comigrate through the capillary functions as a kind of ruler (an
“internal-lane size standard”) to determine the lengths of the allelic fragments.
Software processes the raw data to generate an electropherogram of the separate
allele peaks of each color. By comparing the positions of the allele peaks to the
size standard, the program determines the number of repeats in each allele. The
plotted heights of the peaks (measured in relative fluorescent units, or RFUs) are
proportional to the amount of the PCR product.
26. Douglas R. Hares, Expanding the CODIS Core Loci in the United States, Forensic Sci. Int’l:
Genetics (forthcoming 2011). CODIS stands for “Combined DNA Index System.”
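How software turns peak positions into repeat counts can be illustrated with a toy calculation (all numbers are invented; real genotyping software uses more sophisticated calibration than straight-line interpolation):

```python
# Interpolate a fragment length (bp) from the migration times of the
# flanking size-standard fragments in the same lane.
def size_from_standard(peak_time: float, standard: list) -> float:
    standard = sorted(standard)
    for (t0, bp0), (t1, bp1) in zip(standard, standard[1:]):
        if t0 <= peak_time <= t1:
            frac = (peak_time - t0) / (t1 - t0)
            return bp0 + frac * (bp1 - bp0)
    raise ValueError("peak lies outside the size standard")

# Convert fragment length to repeat count for a 4-bp repeat such as GATA,
# given the (hypothetical) length of the non-repeat flanking sequence.
def repeats(bp: float, flank_bp: float, repeat_bp: int = 4) -> float:
    return (bp - flank_bp) / repeat_bp

# Hypothetical standard: (migration time in seconds, fragment length in bp).
ladder = [(100.0, 75), (140.0, 100), (180.0, 125), (220.0, 150)]
bp = size_from_standard(160.0, ladder)  # halfway between 100 and 125 bp
print(bp, repeats(bp, flank_bp=92.5))   # 112.5 5.0 -> a five-repeat allele
```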
Figure 4 is an electropherogram of all 203 major alleles at 15 STR loci that
can be typed in a single “multiplex” PCR reaction. (In addition, it shows the
two alleles of the gene used to determine the sex of the contributor of a DNA
sample.27)
Figure 4. Alleles of 15 STR loci and the amelogenin sex-typing test from the
AmpFlSTR Identifiler kit. The bottom panel is a “sizing standard”—a
set of peaks from DNA sequences of known lengths (in base pairs). The
numbers in the vertical axis in each panel are relative fluorescence units
(RFUs) that indicate the amount of light emitted after the laser beam
strikes the fluorescent tag on an STR fragment.
Note: Applied Biosystems makes the kit that produced these allelic ladders.
Source: John M. Butler, Forensic DNA Typing: Biology, Technology, and Genetics of STR Markers
128 (2d ed. 2005), Copyright Elsevier 2005, with the permission of Elsevier Academic Press. John
Butler supplied the illustration.
An electropherogram from an individual’s DNA would have only one
or two peaks at each of these 15 STR loci (depending on whether the person is
homozygous or heterozygous). These “allelic ladders” aid in deciding which allele
a peak from an unknown sample represents.
Figure 5 is an electropherogram from the vaginal epithelial cells of the body
of a girl who had been sexually assaulted and killed in People v. Pizarro.28 It was
produced for the retrial in 2008 of the defendant who was linked to the victim
by VNTR typing at his first trial in 1990.
Figure 5. Electropherogram for nine STR loci of the victim’s DNA in People v.
Pizarro. (The amelogenin locus and a sizing standard at the bottom also
are included.) Some STR loci have small peaks, indicating that there
was not much PCR product for those loci, likely because of DNA
degradation. All of the STR loci have two peaks, as would be expected
when the source is heterozygous at those loci.
Source: Steven Myers and Jeanette Wallin, California Department of Justice, provided the image.
27. The amelogenin gene, which is found on the X and the Y chromosomes, codes for a protein
that is a major component of tooth enamel matrix. The copy on the X chromosome is 112 bp long.
The copy on the Y chromosome has a string of six base pairs deleted, making it slightly shorter (106 bp).
A female (XX) will have one peak at 112 bp. A male (XY) will have two peaks (at 106 and 112 bp).
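The sex-typing logic in note 27 reduces to a simple rule, sketched here (peak sizes in base pairs are from the footnote; the function name is invented):

```python
# One peak at 112 bp -> two X copies; peaks at 106 and 112 bp -> XY.
def amelogenin_sex(peaks_bp: set) -> str:
    if peaks_bp == {112}:
        return "female (XX)"
    if peaks_bp == {106, 112}:
        return "male (XY)"
    return "inconclusive"

print(amelogenin_sex({112}))       # female (XX)
print(amelogenin_sex({106, 112}))  # male (XY)
```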
28. 12 Cal. Rptr. 2d 436 (Ct. App. 1992), after remand, 3 Cal. Rptr. 3d 21 (Ct. App. 2003),
review denied (Oct 15, 2003).
E. What Can Be Done to Validate a Genetic System for
Identification?
Regardless of the kind of genetic system used for typing—STRs, SNPs, or still
other polymorphisms—some general principles and questions can be applied
to each system that is offered for courtroom use. First, the nature of the polymorphism should be well characterized. Is it a simple sequence polymorphism or
a fragment length polymorphism? This information should be in the published
literature or in archival genome databanks.
Second, the published scientific literature can be consulted to verify claims that
a particular method of analysis can produce accurate profiles under various conditions. Although such validation studies have been conducted for all the systems
ordinarily used in forensic work, determining the point at which the empirical
validation of a particular system is sufficiently convincing to pass scientific muster
may well require expert assistance.
Finally, the population genetics of the system should be characterized. As
new systems are discovered, researchers typically analyze convenient collections of
DNA samples from various human populations and publish studies of the relative
frequencies of each allele in these population samples. These studies measure the
extent of genetic variability at the polymorphic locus in the various populations,
and thus of the potential probative power of the marker for distinguishing among
individuals.
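How such allele-frequency studies translate into probative power can be sketched with the standard product-rule arithmetic (the frequencies below are hypothetical, and the calculation assumes Hardy-Weinberg proportions and independence across loci):

```python
# Estimated genotype frequency at one locus: p^2 for a homozygote
# (one allele given), 2pq for a heterozygote.
def genotype_freq(p: float, q: float = None) -> float:
    return p * p if q is None else 2 * p * q

# Hypothetical allele frequencies at three loci; a single value marks
# a homozygous locus.
loci = [(0.1, 0.2), (0.05,), (0.3, 0.15)]

rmp = 1.0  # estimated random-match probability across the loci
for alleles in loci:
    rmp *= genotype_freq(*alleles)

print(f"{rmp:.2e}")  # 9.00e-06, roughly 1 in 110,000
```

The rarer and more numerous the alleles at each locus, the smaller this product becomes, which is why highly polymorphic loci are the most useful markers.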
At this point, the capability of PCR-based procedures to ascertain DNA
genotypes accurately cannot be doubted. Of course, the fact that scientists have
shown that it is possible to extract DNA, to amplify it, and to analyze it in ways
that bear on the issue of identity does not mean that a particular laboratory has
adopted a suitable protocol and is proficient in following it. These case-specific
issues are considered in Sections III and IV.
F. What New Technologies Might Emerge?
1. Miniaturized “lab-on-a-chip” devices
Miniaturized capillary electrophoresis (CE) devices have been developed for rapid
detection of STRs (described in Section II.D) and other genetic analyses. The
mini-CE systems consist of microchannels roughly the diameter of a hair etched
on glass wafers (“chips”) using technology borrowed from the computer industry.
The principles of electrophoretic separation are the same as with conventional CE
systems. With microfluidic technologies, it is possible to integrate DNA extraction and PCR amplification processes with the CE separation in a single device,
a so-called lab on a chip. Once a sample is added to the device, all the analytical
steps are performed on the chip without further human contact. These integrated
devices combine the benefits of simplified sample handling with rapid analysis
and are under active development for point-of-care medical diagnostics.29 Efforts
are under way to develop an integrated microdevice for STR analysis that would
improve the speed and efficiency of forensic DNA profiling. A portable device for
rapid and secure analysis of samples in the field is a distinct possibility.30
2. High-throughput sequencing
The initial success of the Human Genome Project and the promise of “personalized medicine” are driving research to develop technologies for DNA analysis
that are faster, cheaper, and less labor intensive. In 2004, the National Human
Genome Research Institute announced funding for research leading to the “$1000
genome,” an achievement that would permit sequencing an individual’s genome
for medical diagnosis and improved drug therapies. Advances in the years since
2004 suggest that this goal will be achieved before the target date of 2014,31 and
the successful innovations could provide major advances in forensic DNA testing.
However, it is too soon to identify which of the nascent sequencing technologies
might emerge from the pack.
As of 2009, three different next-generation sequencing technologies were
commercially available, and more instruments are in the pipeline.32 These new
technologies generate massive amounts of DNA sequence data (100 million to
1 billion base pairs per run) at very low cost (under $50 per megabase). They
do so by simultaneously sequencing millions of short fragments, then applying
bioinformatics software to assemble the sequences in the correct order. These
high-throughput sequencing technologies have demonstrated their usefulness in
research applications. Two of these applications, the analysis of highly degraded
DNA33 and the identification of microbial bioterrorism agents, are of forensic
relevance.34 As the speed and cost of sequencing diminish and the necessary
bioinformatics software becomes more accessible and effective, full-genome sequence
analysis or something approaching it could become a practical tool for human
identification.
29. P. Yager et al., Microfluidic Diagnostic Technologies for Global Public Health, 442 Nature 412
(2006).
30. K.M. Horsman et al., Forensic DNA Analysis on Microfluidic Devices: A Review, 52 J. Forensic
Sci. 784 (2007). As indicated in this review, there remain challenges to overcome before the forensic lab
on a chip comes to fruition. However, given the progress being made on multiple research fronts in chip
fabrication design and in microfluidic technology, these challenges seem surmountable.
31. R.F. Service, The Race for the $1000 Genome, 311 Science 1544 (2006).
32. Michael L. Metzker, Sequencing Technologies—The Next Generation, 11 Nature Rev. Genetics
31 (2010).
33. The next-generation technologies have been used to sequence highly degraded DNA from
Neanderthal bones and from the hair of the extinct woolly mammoth. R.E. Green et al., Analysis of
One Million Base Pairs of Neanderthal DNA, 444 Nature 330 (2006); W. Miller, Sequencing the Nuclear
Genome of the Extinct Woolly Mammoth, 456 Nature 387 (2008). The approaches used in these studies
are readily translatable to SNP typing of highly degraded DNA such as found in cases involving victims
of mass disasters.
34. By sequencing entire bacterial genomes, researchers can rapidly differentiate organisms that
have been genetically modified for biological warfare or terrorism from routine clinical and envi-
3. Microarrays
Hybridization microarrays are the third technological innovation with readily
foreseeable forensic application. A microarray consists of a two-dimensional grid
of many thousands of microscopic spots on a glass or plastic surface, each containing many copies of a short piece of single-stranded DNA tethered to the surface
at one end; each spot can be thought of as a dense cluster of tiny, single-stranded
DNA “whiskers” with their own particular sequence. A solution containing
single-stranded target DNA is washed over the microarray surface. The whiskers
on the array serve as probes to detect DNA (or RNA) with the corresponding
complementary sequence. The spots that capture target DNA are identified,
indicating the presence of that sequence in the target sample. (The hybridization
can be detected in several different ways.) Microarrays are commercially available for the detection of SNPs in the human genome and for sequencing human
mitochondrial DNA.35
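The hybridization idea can be caricatured in a few lines (a deliberate simplification: the probe names and sequences are invented, and real hybridization involves antiparallel, reverse-complement binding and tolerates imperfect matches):

```python
# Base-pairing partner of each letter (A<->T, C<->G).
def complement(strand: str) -> str:
    return strand.translate(str.maketrans("ACGT", "TGCA"))

probes = {"spot_1": "GATTACA", "spot_2": "CCCGGG", "spot_3": "ATATAT"}
target = "CTAATGT"  # single-stranded target DNA washed over the array

# A spot "lights up" when its probe is complementary to the target.
lit = {name for name, probe in probes.items() if complement(probe) == target}
print(lit)  # {'spot_1'}
```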
4. What questions do the new technologies raise?
As these or other emerging technologies are introduced in court, certain basic
questions will need to be answered. What is the principle of the new technology?
Is it simply an extension of existing technologies, or does it invoke entirely new
concepts? Is the new technology used in research or clinical applications independent of forensic science? Does the new technology have limitations that might
affect its application in the forensic sphere? Finally, what testing has been done and
with what outcomes to establish that the new technology is reliable when used
on forensic samples? For next-generation sequencing technologies and microarray
technologies, the questions may be directed as well to the bioinformatics methods
used to analyze and interpret the raw data. Obtaining answers to these questions
would likely require input both from experts involved in technology development
and application and from knowledgeable forensic experts.
ronmental strains. B. La Scola et al., Rapid Comparative Genomic Analysis for Clinical Microbiology: The
Francisella Tularensis Paradigm, 18 Genome Res. 742 (2008).
35. One study of 3000 Europeans used a commercial microarray with over half a million SNPs
“to infer [the individuals’] geographic origin with surprising accuracy—often to within a few hundred
kilometers.” John Novembre et al., Genes Mirror Geography Within Europe, 456 Nature 98, 98 (2008).
Microarrays also are used in studies of variation in the number of copies of certain genes in different
people’s genomes (copy number variation). Microarrays to detect pathogens and other targets also
have been developed.
III. Sample Collection and Laboratory
Performance
A. Sample Collection, Preservation, and Contamination
The primary determinants of whether DNA typing can be done on any particular
sample are (1) the quantity of DNA present in the sample and (2) the extent to
which it is degraded. Generally speaking, if a sufficient quantity of reasonable
quality DNA can be extracted from a crime scene sample, no matter what the
nature of the sample, DNA typing can be done without problem. Thus, DNA
typing has been performed on old blood stains, semen stains, vaginal swabs, hair,
bone, bite marks, cigarette butts, urine, and fecal material. This section discusses
what constitutes sufficient quantity and reasonable quality in the context of STR
typing. Complications from contaminants and inhibitors also are discussed. The
special technique of mitotyping and the treatment of samples that contain DNA
from two or more contributors are discussed in Section V.
1. Did the sample contain enough DNA?
Amounts of DNA present in some typical kinds of samples vary from a trillionth
or so of a gram for a hair shaft to several millionths of a gram for a postcoital
vaginal swab. Most PCR test protocols recommend samples on the order of
1 billionth to 5 billionths of a gram for optimum yields. Normally, the number
of amplification cycles for nuclear DNA is limited to 28 or so to ensure that there
is no detectable product for samples containing less than about 20 cell equivalents
of DNA.36
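The mass figure in note 36 is easy to reconcile with the cell count (a back-of-the-envelope sketch; the 6.6 picograms of nuclear DNA per diploid human cell is an assumed round number, not a figure from the text):

```python
PG_PER_DIPLOID_CELL = 6.6  # assumed picograms of nuclear DNA per cell

cell_equivalents = 20
total_pg = cell_equivalents * PG_PER_DIPLOID_CELL
print(total_pg)                # ~132 picograms (trillionths of a gram)
print(100 <= total_pg <= 200)  # True: inside the 100-200 pg range cited
```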
Procedures for typing still smaller samples—down to a single cell’s worth of
nuclear DNA—have been studied. These have been shown to work, to some
extent, with trace or contact DNA left on the surface of an object such as the
steering wheel of a car. The most obvious strategy is to increase the number of
amplification cycles. The danger is that chance effects might result in one allele
being amplified much more than another. Alleles then could drop out, small peaks
from unusual alleles at other loci might “drop in,” and a bit of extraneous DNA
could contribute to the profile. Other protocols have been developed for typing
such “low copy number” (LCN) or “low template” (LT) DNA.37 LT-STR
profiles have been admitted in courts in a few countries,38 and they are beginning
to appear in prosecutions in the United States.39
36. This is about 100 to 200 trillionths of a gram. A lower limit of about 10 to 15 cells’ worth
of DNA has been determined to give balanced amplification.
37. See, e.g., John Buckleton & Peter Gill, Low Copy Number, in Forensic DNA Evidence
Interpretation 275 (John S. Buckleton et al. eds., 2005); Pamela J. Smith & Jack Ballantyne, Simplified
Low-Copy-Number DNA Analysis by Post-PCR Purification, 52 J. Forensic Sci. 820 (2007).
Although there are tests to estimate the quantity of DNA in a sample,
whether a particular sample contains enough human DNA to allow typing cannot
always be predicted accurately. The best strategy is to try. If a result is obtained,
and if the controls (samples of known DNA and blank samples) have behaved
properly, then the sample had enough DNA. The appearance of the same peaks
in repeated runs helps assure that these alleles are present.40
2. Was the sample of sufficient quality?
The primary determinant of DNA quality for forensic analysis is the extent to
which the long DNA molecules are intact. Within the cell nucleus, each molecule
of DNA extends for millions of base pairs. Outside the cell, DNA spontaneously
degrades into smaller fragments at a rate that depends on temperature, exposure to
oxygen, and, most importantly, the presence of water.41 In dry biological samples,
protected from air, and not exposed to temperature extremes, DNA degrades very
slowly. STR testing has proved effective with old and badly degraded material
such as the remains of the family of Tsar Nicholas II (buried in 1918 and recovered
in 1991).42
38. E.g., R. v. Reed [2009] (CA Crim. Div.) EWCA Crim. 2698, ¶ 74 (reviewing expert
submissions and concluding that “Low Template DNA can be used to obtain profiles capable of reliable interpretation if the quantity of DNA that can be analysed is above the stochastic threshold [of]
between 100 and 200 picograms”).
39. People v. Megnath, 898 N.Y.S.2d 408 (N.Y. Sup. Ct. 2010) (reasoning that “LCN DNA
analysis” uses the same steps as STR analysis of larger samples and that the modifications in the procedure used by the laboratory in the case were generally accepted); cf. United States v. Davis, 602 F.
Supp. 2d 658 (D. Md. 2009) (avoiding “making a finding with regard to the dueling definitions of
LCN testing advocated by the parties” by finding that “the amount of DNA present in the evidentiary
samples tested in this case” was in the normal range). These cases and the admissibility of low-template
DNA analysis are discussed in Kaye et al., supra note 1, § 9.2.3(c).
40. John M. Butler & Cathy R. Hill, Scientific Issues with Analysis of Low Amounts of DNA, LCN Panel
on Scientific Issues with Low Amounts of DNA, Promega Int’l Symposium on Human Identification,
Oct. 15, 2009, available at http://www.cstl.nist.gov/strbase/pub_pres/Butler_Promega2009-LCNpanelfor-STRBase.pdf.
41. Other forms of chemical alteration to DNA are well studied, both for their intrinsic interest
and because chemical changes in DNA are a contributing factor in the development of cancers in living
cells. Some forms of DNA modification, such as that produced by exposure to ultraviolet radiation,
inhibit the amplification step in PCR-based tests, whereas other chemical modifications appear to
have no effect. C.L. Holt et al., TWGDAM Validation of AmpFlSTR PCR Amplification Kits for Forensic
DNA Casework, 47 J. Forensic Sci. 66 (2002); George F. Sensabaugh & Cecilia von Beroldingen, The
Polymerase Chain Reaction: Application to the Analysis of Biological Evidence, in Forensic DNA Technology
63 (Mark A. Farley & James J. Harrington eds., 1991).
42. Peter Gill et al., Identification of the Remains of the Romanov Family by DNA Analysis, 6 Nature
Genetics 130 (1994).
The extent to which degradation affects a PCR-based test depends on the
size of the DNA segment to be amplified. For example, in a sample in which
the bulk of the DNA has been degraded to fragments well under 1000 base
pairs in length, it may be possible to amplify a 100-base-pair sequence, but not
a 1000-base-pair target. Consequently, the shorter alleles may be detected in a
highly degraded sample, but the larger ones may be missed. Fortunately, the size
differences among STR alleles at a locus are quite small (typically no more than
50 base pairs). Therefore, if there is a degradation effect on STR typing, it is usually “locus dropout”—in cases involving severe degradation, loci yielding larger
products (greater than 200 base pairs) may not be detected.43
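The size dependence just described can be illustrated with a toy degradation model (the break rate is an assumed number; real degradation kinetics are more complex):

```python
# If strand breaks occur independently at a small per-base rate, the
# chance that a target of L bases is still intact is (1 - rate) ** L,
# so long amplicons fail long before short ones do.
def intact_fraction(length_bp: int, break_rate: float) -> float:
    return (1 - break_rate) ** length_bp

rate = 0.005  # assumed breaks per base in a badly degraded sample
print(round(intact_fraction(100, rate), 2))   # ~0.61: short targets often amplify
print(round(intact_fraction(1000, rate), 4))  # ~0.0067: long targets rarely do
```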
DNA can be exposed to a great variety of environmental insults without any
effect on its capacity to be typed correctly. Exposure studies have shown that
contact with a variety of surfaces, both clean and dirty, and with gasoline, motor
oil, acids, and alkalis either have no effect on DNA typing or, at worst, render
the DNA untypable.44
Although contamination with microbes generally does little more than
degrade the human DNA, other problems sometimes can occur. Therefore, the
validation of DNA typing systems should include tests for interference with a
variety of microbes to see if artifacts occur. If artifacts are observed, then control
tests should be applied to distinguish between the artifactual and the true results.
B. Laboratory Performance
1. What forms of quality control and assurance should be followed?
DNA profiling is valid and reliable, but confidence in a particular result depends
on the quality control and quality assurance procedures in the laboratory. Quality
control refers to measures to help ensure that a DNA-typing result (and its
interpretation) meets a specified standard of quality. Quality assurance refers to
monitoring, verifying, and documenting laboratory performance. A quality assurance program helps demonstrate that a laboratory is meeting its quality control
objectives and thus justifies confidence in the quality of its product.45
43. Holt et al., supra note 41. Special primers and very short STRs give better results with
extremely degraded samples. See Michael D. Coble & John M. Butler, Characterization of New MiniSTR
Loci to Aid Analysis of Degraded DNA, 50 J. Forensic Sci. 43 (2005).
44. Holt et al., supra note 41. Most of the effects of environmental insult readily can be
accounted for in terms of basic DNA chemistry. For example, some agents produce degradation or
damaging chemical modifications. Other environmental contaminants inhibit restriction enzymes or
PCR. (This effect sometimes can be reversed by cleaning the DNA extract to remove the inhibitor.)
But environmental insult does not result in the selective loss of an allele at a locus or in the creation
of a new allele at that locus.
45. For a review of the history of quality assurance in forensic DNA testing, see J.L. Peterson et
al., The Feasibility of External Blind DNA Proficiency Testing. I. Background and Findings, 48 J. Forensic
Sci. 21, 22 (2003).
Professional bodies within forensic science have described procedures for
quality assurance. Guidelines for DNA analysis have been prepared by FBI-appointed groups (the current incarnation is known as SWGDAM);46 a number
of states require forensic DNA laboratories to be accredited;47 and federal law
requires accreditation or other safeguards of laboratories that receive certain federal
funds48 or participate in the national DNA database system.49
a. Documentation
Quality assurance guidelines normally call for laboratories to document laboratory
organization and management, personnel qualifications and training, facilities, evidence control procedures, validation of methods and procedures, analytical procedures, equipment calibration and maintenance, standards for case documentation
and report writing, procedures for reviewing case files and testimony, proficiency
testing, corrective actions, audits, safety programs, and review of subcontractors.
46. The FBI established the Technical Working Group on DNA Analysis Methods (TWGDAM)
in 1988 to develop standards. The DNA Identification Act of 1994, 42 U.S.C. § 14131(a) & (c) (2006),
created a DNA Advisory Board (DAB) to assist in promulgating quality assurance standards, but the
legislation allowed the DAB to expire after 5 years (unless extended by the Director of the FBI). 42
U.S.C. § 14131(b) (2008). TWGDAM functioned under DAB, 42 U.S.C. § 14131(a) (2006), and was
renamed the Scientific Working Group on DNA Analysis Methods (SWGDAM) in 1999. When the
FBI allowed DAB to expire, SWGDAM replaced DAB. See Norah Rudin & Keith Inman, An Introduction to Forensic DNA Analysis 180 (2d ed. 2002); Paul C. Giannelli, Regulating Crime Laboratories:
The Impact of DNA Evidence, 15 J.L. & Pol’y 59, 82–83 (2007).
47. New York was the first state to impose this requirement. N.Y. Exec. Law § 995-b (McKinney
2006) (requiring accreditation by the state Forensic Science Commission).
48. The Justice for All Act, enacted in 2004, required DNA labs to be accredited within 2 years
“by a nonprofit professional association of persons actively involved in forensic science that is nationally
recognized within the forensic science community” and to “undergo external audits, not less than once
every 2 years, that demonstrate compliance with standards established by the Director of the Federal
Bureau of Investigation.” 42 U.S.C. § 14132(b)(2) (2006). Established in 1981, the American Society
of Crime Laboratory Directors–Laboratory Accreditation Board (ASCLD-LAB) accredits forensic
laboratories. Giannelli, supra note 46, at 75. The 2004 Act also requires applicants for federal funds
for forensic laboratories to certify that the laboratories use “generally accepted laboratory practices
and procedures, established by accrediting organizations or appropriate certifying bodies,” 42 U.S.C.
§ 3797k(2) (2004), and that “a government entity exists and an appropriate process is in place to
conduct independent external investigations into allegations of serious negligence or misconduct
substantially affecting the integrity of the forensic results committed by employees or contractors of
any forensic laboratory system, medical examiner’s office, coroner’s office, law enforcement storage
facility, or medical facility in the State that will receive a portion of the grant amount.” Id. § 3797k(4).
There have been problems in implementing the § 3797k(4) certification requirement. See Office of
the Inspector General, U.S. Dep’t of Justice, Review of the Office of Justice Programs’ Paul Coverdell
Forensic Science Improvement Grants Program, Evaluation and Inspections Report I-2008-001
(2008), available at http://www.usdoj.gov/oig/reports/OJP/e0801/index.htm.
49. See 42 U.S.C. § 14132(b)(2) (2006) (requiring, as of late 2006, that records in the database
come from laboratories that “have been accredited by a nonprofit professional association . . . and . . .
undergo external audits, not less than once every 2 years [and] that demonstrate compliance with
standards established by the Director of the Federal Bureau of Investigation. . . .”).
Reference Guide on DNA Identification Evidence
Of course, maintaining documentation and records alone does not guarantee the
correctness of results obtained in any particular case. Errors in analysis or interpretation might occur as a result of a deviation from an established procedure,
analyst misjudgment, or an accident. Although case review procedures within a
laboratory should be designed to detect errors before a report is issued, it is always
possible that some incorrect result will slip through. Accordingly, determination
that a laboratory maintains a strong quality assurance program does not eliminate
the need for case-by-case review.
b. Validation
The validation of procedures is central to quality assurance. Developmental validation is undertaken to determine the applicability of a new test to crime scene
samples; it defines conditions that give reliable results and identifies the limitations of the procedure. For example, a new genetic marker being considered for
use in forensic analysis will be tested to determine if it can be typed reliably in
both fresh samples and in samples typical of those found at crime scenes. The
validation would include testing samples originating from different tissues—blood,
semen, hair, bone, samples containing degraded DNA, samples contaminated
with microbes, samples containing DNA mixtures, and so on. Developmental
validation of a new set of loci also includes the generation of population databases
and the testing of alleles for statistical independence. Developmental validation
normally results in publication in the scientific literature, but a new procedure can
be validated in multiple laboratories well ahead of publication.
Internal validation, on the other hand, involves the capacity of a specific
laboratory to analyze the new loci. The laboratory should verify that it can reliably perform an established procedure that already has undergone developmental
validation. In particular, before adopting a new procedure, the laboratory should
verify its ability to use the system in a proficiency trial.50
c. Proficiency testing
Proficiency testing in forensic genetic testing is designed to ascertain whether an
analyst can correctly determine genetic types in a sample whose origin is unknown
to the analyst but is known to a tester. Proficiency is demonstrated by making
correct genetic typing determinations in repeated trials. The laboratory also can be
tested to verify that it correctly computes random-match probabilities or similar
statistics.
An internal proficiency trial is conducted within a laboratory. One person in the
laboratory prepares the sample and administers the test to another person in the laboratory.
50. Both forms of validation build on the accumulated body of knowledge and experience.
Thus, some aspects of validation testing need be repeated only to the extent required to verify that
previously established principles apply.
In an external trial, the test sample originates from outside the laboratory—
from another laboratory, a commercial vendor, or a regulatory agency. In a declared
(or open) proficiency trial, the analyst knows the sample is a proficiency sample.
The DNA Identification Act of 1994 requires proficiency testing for analysts in the
FBI as well as those in laboratories participating in the national database or receiving
federal funding,51 and the standards of accrediting bodies typically call for periodic
open, external proficiency testing.52
In a blind (or, more properly, “full blind”) trial, the sample is submitted so that
the analyst does not recognize it as a proficiency sample. A full-blind trial provides
a better indication of proficiency because it ensures that the analyst will not give the
trial sample any special attention, and it tests more steps in the laboratory’s processing
of samples. However, full-blind proficiency trials entail considerably more organizational effort and expense than open proficiency trials. Obviously, the “evidence”
samples prepared for the trial have to be sufficiently realistic that the laboratory does
not suspect the legitimacy of the submission. A police agency and prosecutor’s office
have to submit the “evidence” and respond to laboratory inquiries with information
about the “case.” Finally, the genetic profile from a proficiency test must not be
entered into regional and national databases. Consequently, although some forensic
DNA laboratories participate in full-blind testing, they are not required to do so.53
2. How should samples be handled?
Sample mishandling, mislabeling, or contamination, whether in the field or in the
laboratory, is more likely to compromise a DNA analysis than is an error in genetic
typing. For example, a sample mixup due to mislabeling reference blood samples
taken at the hospital could lead to incorrect association of crime scene samples to
a reference individual or to incorrect exclusions. Similarly, packaging two items
with wet bloodstains into the same bag could result in a transfer of stains between
the items, rendering it difficult or impossible to determine whose blood was
originally on each item. Contamination in the laboratory may result in artifactual
typing results or in the incorrect attribution of a DNA profile to an individual
or to an item of evidence. Procedures should be prescribed and implemented to
guard against such error.
51. 42 U.S.C. § 14132(b)(2) (requiring external proficiency testing of laboratories for participation in the national database); id. § 14133(a)(1)(A) (2006) (same for FBI examiners).
52. See Peterson et al., supra note 45, at 24 (describing the ASCLD-LAB standards). Certification
by the American Board of Criminalistics as a specialist in forensic biology DNA analysis requires one
proficiency trial per year. Accredited laboratories must maintain records documenting compliance with
required proficiency test standards.
53. The DNA Identification Act of 1994 required the director of the National Institute of Justice
to report to Congress on the feasibility of establishing an external blind proficiency testing program
for DNA laboratories. 42 U.S.C. § 14131(c) (2006). A National Forensic DNA Review Panel advised
the Director that “blind proficiency testing is possible, but fraught with problems” of the kind listed
above. Peterson et al., supra note 45, at 30. It “recommended that a blind proficiency testing program
be deferred for now until it is more clear how well implementation of the first two recommendations
[the promulgation of guidelines for accreditation, quality assurance, and external audits of casework]
are serving the same purposes as blind proficiency testing.” Id.
Mislabeling or mishandling can occur when biological material is collected in
the field, when it is transferred to the laboratory, when it is in the analysis stream
in the laboratory, when the analytical results are recorded, or when the recorded
results are transcribed into a report. Mislabeling and mishandling can happen with
any kind of physical evidence and are of great concern in all fields of forensic
science. Checkpoints should be established to detect mislabeling and mishandling
along the line of evidence flow. Investigative agencies should have guidelines for
evidence collection and labeling so that a chain of custody is maintained. Similarly,
there should be guidelines, produced with input from the laboratory, for handling
biological evidence in the field.
Professional guidelines and recommendations require documented procedures
to ensure sample integrity and to avoid sample mixups, labeling errors, recording
errors, and the like.54 They also mandate case review to identify inadvertent errors
before a final report is released. Finally, laboratories must retain, when feasible,
portions of the crime scene samples and extracts to allow reanalysis.55 However,
retention is not always possible. For example, retention of original items is not to
be expected when the items are large or immobile (e.g., a wall or sidewalk). In
such situations, a swabbing or scraping of the stain from the item would typically
be collected and retained. There also are situations where the sample is so small
that it will be consumed in the analysis.
Assuming that appropriate chain-of-custody and evidence-handling protocols
are in place, the critical question is whether there are deviations in the particular
case. This may require a review of the total case documentation as well as the
laboratory findings. In addition, the opportunity to retest original evidence items
or the material extracted from them is an important safeguard against error because
of mislabeling and mishandling. Should mislabeling or mishandling have occurred,
reanalysis of the original sample and the intermediate extracts should detect not
only the fact of the error but also the point at which it occurred.56
54. SWGDAM guidelines are published as FBI, Standards for Forensic DNA Testing Labs, available at http://www.fbi.gov/hq/lab/codis/forensic.htm (last visited Feb. 16, 2010).
55. Forensic laboratories have a professional responsibility to preserve retained evidence so as to
minimize degradation. See id., standard 7.2.1. Furthermore, failure to preserve potentially exculpatory
evidence has been treated as a denial of due process and grounds for suppression. People v. Nation,
604 P.2d 1051, 1054–55 (Cal. 1980). In Arizona v. Youngblood, 488 U.S. 51 (1988), however, the
Supreme Court held that a police agency’s failure to preserve evidence not known to be exculpatory
does not constitute a denial of due process unless “bad faith” can be shown. Ironically, DNA testing
that was not available at Youngblood’s trial established that he had been falsely convicted. Maurice
Possley, DNA Exonerates Inmate Who Lost Key Test Case: Prosecutors Ruined Evidence in Original Trial,
Chi. Trib., Aug. 10, 2000, at 6.
56. Of course, retesting cannot correct all errors that result from mishandling of samples, but
it is even possible in some cases to detect mislabeling at the point of sample collection if the genetic
typing results on a particular sample are inconsistent with an otherwise consistent reconstruction of
events. For example, a mislabeling of husband and wife samples in a paternity case might result in an
apparent maternal exclusion, a very unlikely event. The possibility of mislabeling could be confirmed
by testing the samples for gender and ultimately verified by taking new samples from each party under
better controlled conditions.
Contamination describes any situation in which foreign material is mixed
with a sample of DNA. As noted in Section III.A.2, contamination by nonbiological materials, such as gasoline or grit, can cause test failures, but they are
not a source of genetic typing errors. Similarly, contamination with nonhuman
biological materials, such as bacteria, fungi, or plant materials, is generally not a
problem. These contaminants may accelerate DNA degradation, but they do not
generate spurious human genetic types.
The contamination of greatest concern is that resulting from the addition of
human DNA. This sort of contamination can occur three ways. First, the crime
scene samples by their nature may contain a mixture of fluids or tissues from different individuals. Examples include vaginal swabs collected as sexual assault evidence and bloodstain evidence from scenes where several individuals shed blood.
Mixtures are the subject of Section V.C.
Second, the crime scene samples may be inadvertently contaminated in the
course of sample handling in the field or in the laboratory. Inadvertent contamination of crime scene DNA with DNA from a reference sample could lead to a
false inclusion.
Third, carryover contamination in PCR-based typing can occur if the amplification products of one typing reaction are carried over into the reaction mix
for a subsequent PCR reaction. If the carryover products are present in sufficient
quantity, they could be preferentially amplified over the target DNA. The primary
strategy used in most forensic laboratories to protect against carryover contamination is to keep PCR products away from sample materials and test reagents by
having separate work areas for pre-PCR and post-PCR sample handling, by preparing samples in controlled-air-flow biological safety hoods, by using dedicated
equipment (such as pipetters) for each of the various stages of sample analysis, by
decontaminating work areas after use (usually by wiping down or by irradiating
with ultraviolet light), and by having a one-way flow of sample from the pre-PCR
to post-PCR work areas. Additional protocols are used to detect any carryover
contamination.57
57. Standard protocols include the amplification of blank control samples—those to which no
DNA has been added. If carryover contaminants have found their way into the reagents or sample
tubes, these will be detected as amplification products. Outbreaks of carryover contamination can also
be recognized by monitoring test results. Detection of an unexpected and persistent genetic profile
in different samples indicates a contamination problem. When contamination outbreaks are detected,
appropriate corrective actions should be taken, and both the outbreak and the corrective action should
be documented.
In the end, whether a laboratory has conducted proper tests and whether it
conducted them properly depends both on the general standard of practice and
on the questions posed in the particular case. There is no universal checklist,
but the selection of tests and the adherence to the correct test procedures can
be reviewed by experts and by reference to professional standards such as the
SWGDAM guidelines.
IV. Inference, Statistics, and Population Genetics in Human Nuclear DNA Testing
The results of DNA testing can be presented in various ways. With discrete allele
systems, such as STRs, it is natural to speak of “matching” and “nonmatching”
profiles. If the genetic profile obtained from the biological sample taken from the
crime scene or the victim (the “trace evidence sample”) matches that of a particular individual, that individual is included as a possible source of the sample. But
other individuals also might possess a matching DNA profile. Accordingly, the
expert should be asked to provide some indication of how significant the match
is. If, on the other hand, the genetic profiles are different, then the individual is
excluded as the source of the trace evidence. Typically, proof tending to show
that the defendant is the source incriminates the defendant, whereas proof that
someone else is the source exculpates the defendant.58 This section elaborates on
these ideas, indicating issues that can arise in connection with an expert’s testimony interpreting the results of a DNA test.
A. What Constitutes a Match or an Exclusion?
When the DNA from the trace evidence clearly does not match the DNA sample
from the suspect, the DNA analysis demonstrates that the suspect’s DNA is not
in the forensic sample. Indeed, if the samples have been collected, handled, and
analyzed properly, then the suspect is excluded as a possible source of the DNA
in the forensic sample. As a practical matter, such exclusionary results normally
would keep charges from being filed against the excluded suspect.
At the other extreme, the genotypes at a large number of loci can be clearly
identical. In these cases, the DNA evidence is quite incriminating, and the challenge for the legal system lies in explaining just how probative it is. Naturally,
as with exclusions, inclusions are most powerful when the samples have been
58. Whether being the source of the forensic sample is incriminating and whether someone
else being the source is exculpatory depends on the circumstances. For example, a suspect who might
have committed the offense without leaving the trace evidence sample still could be guilty. In a rape
case with several rapists, a semen stain could fail to incriminate one assailant because insufficient semen
from that individual is present in the sample.
159
Copyright © National Academy of Sciences. All rights reserved.
Reference Manual on Scientific Evidence: Third Edition
Reference Manual on Scientific Evidence
collected, handled, and analyzed properly. But there is one logical difference
between exclusions and inclusions. If it is accepted that the samples have different
genotypes, then the conclusion that the DNA in them came from different individuals is essentially inescapable.59 In contrast, even if two samples have the same
genotype, there is a chance that the forensic sample came not from the defendant,
but from another individual who has the same genotype. This complication has
produced extensive arguments over the statistical procedures for assessing this
chance or related quantities. This problem of describing the significance of an
unequivocal match is the subject of the remaining parts of this section.
Some cases lie between the poles of a clear inclusion or a definite exclusion.
For example, when the trace evidence sample is small and extremely degraded,
STR profiling can be afflicted with allelic “drop-in” and “drop-out,” requiring judgments as to whether true peaks are missing and whether spurious peaks
are present. Experts then might disagree about whether a suspect is included or
excluded—or whether any conclusion can be drawn.60
B. What Hypotheses Can Be Formulated About the Source?
If the defendant is the source of DNA of sufficient quantity and quality found
at a crime scene, then a DNA sample from the defendant and the crime scene
sample should have the same profile. The inference required in assessing the
evidence, however, runs in the opposite direction. The forensic scientist reports
that the sample of DNA from the crime scene and a sample from the defendant
have the same genotype. The prosecution’s hypothesis is that the defendant is the
source of the crime scene sample.61
Conceivably, other hypotheses could account for the matching profiles.
One possibility is laboratory error—the genotypes are not actually the same even
though the laboratory thinks that they are. This situation could arise from mistakes
59. The legal implications of this fact are discussed in Kaye et al., supra note 1, § 13.3.2.
60. See, e.g., State v. Murray, 174 P.3d 407, 417–18 (Kan. 2008) (inconclusive Y-STR results
were presented as consistent with the defendant’s blood). Since the early days of DNA testing,
concerns have been expressed about subjective aspects of specific procedures that leave room for
“observer effects” in interpreting data. See William C. Thompson & Simon Ford, The Meaning of a
Match: Sources of Ambiguity in the Interpretation of DNA Prints, in Forensic DNA Technology (M. Farley
& J. Harrington eds., 1990); see generally D. Michael Risinger et al., The Daubert/Kumho Implications
of Observer Effects in Forensic Science: Hidden Problems of Expectation and Suggestion, 90 Calif. L. Rev. 1
(2002). A number of commentators have proposed that the analyst determine the profile of a trace
evidence sample before knowing the profile of any suspects. Dan E. Krane et al., Sequential Unmasking:
A Means of Minimizing Observer Effects in Forensic DNA Interpretation, 53 J. Forensic Sci. 1006 (2008).
61. That the defendant is the source does not necessarily mean that the defendant is guilty of
the offense charged. Aside from issues of intent or knowledge that have nothing to do with DNA,
there remains, for example, the possibility that the two samples match because someone framed the
defendant by putting a sample of defendant’s DNA at the crime scene or in the container of DNA
thought to have come from the crime scene.
160
Copyright © National Academy of Sciences. All rights reserved.
Reference Manual on Scientific Evidence: Third Edition
Reference Guide on DNA Identification Evidence
in labeling or handling samples or from cross-contamination of the samples. As the
1992 NRC report cautioned, “[e]rrors happen, even in the best laboratories, and
even when the analyst is certain that every precaution against error was taken.”62
Another possibility is that the laboratory analysis is correct—the genotypes are
truly identical—but the forensic sample came from another individual. In general,
the true source might be a close relative of the defendant63 or an unrelated person
who, as luck would have it, just happens to have the same profile as the defendant.
The former hypothesis we shall refer to as kinship, and the latter as coincidence.
To infer that the defendant is the source of the crime scene DNA, one must reject
these alternative hypotheses of laboratory error, kinship, and coincidence. Table 1
summarizes the logical possibilities.
Table 1. Hypotheses That Might Explain a Match Between Defendant’s DNA
and DNA at a Crime Scene*
IDENTITY: Same genotype, defendant’s DNA at crime scene
NONIDENTITY:
Lab error: Different genotypes mistakenly found to be the same
Kinship: Same genotype, relative’s DNA at crime scene
Coincidence: Same genotype, unrelated individual’s DNA
*Cf. N.E. Morton, The Forensic DNA Endgame, 37 Jurimetrics J. 477, 480 tbl. 1 (1997).
Some scientists have urged that probabilities associated with false-positive
error, kinship, or coincidence be presented to juries. Although it is not clear that
this goal is feasible, scientific knowledge and more conventional evidence can
help in assessing the plausibility of these alternative hypotheses. If laboratory error,
kinship, and coincidence are rejected as implausible, then only the hypothesis of
identity remains. We turn, then, to the considerations that affect the chances of a
match when the defendant is not the source of the trace evidence.
C. Can the Match Be Attributed to Laboratory Error?
Although many experts would concede that even with rigorous protocols, the
chance of a laboratory error exceeds that of a coincidental match, quantifying
the former probability is a formidable task. Some commentary proposes using the
proportion of false positives that the particular laboratory has experienced in blind
62. NRC I, supra note 8, at 89.
63. A close relative, for these purposes, would be a brother, uncle, nephew, etc. For relationships
more distant than second cousins, the probability of a chance match is nearly as small as for persons of
the same ethnic subgroup. Bernard Devlin & Kathryn Roeder, DNA Profiling: Statistics and Population
Genetics, in 1 Modern Scientific Evidence: The Law and Science of Expert Testimony § 18-3.1.3, at
724 (David L. Faigman et al. eds., 1997).
161
Copyright © National Academy of Sciences. All rights reserved.
Reference Manual on Scientific Evidence: Third Edition
Reference Manual on Scientific Evidence
proficiency tests or the rate of false positives on proficiency tests averaged across
all laboratories.64 Indeed, the 1992 NRC Report remarks that “proficiency tests
provide a measure of the false-positive and false-negative rates of a laboratory.”65
Yet the same report recognizes that “errors on proficiency tests do not necessarily
reflect permanent probabilities of false-positive or false-negative results,”66 and the
1996 NRC report suggests that a probability of a false-positive error that would
apply to a specific case cannot be estimated objectively.67 If the false-positive
probability were, say, 0.001, it would take tens of thousands of proficiency tests to
estimate that probability accurately, and the application of an historical industry-wide error rate to a particular laboratory at a later time would be debatable.68
Most commentators who urge the use of proficiency tests to estimate the
probability that a laboratory has erred in a particular case agree that blind proficiency testing cannot be done in sufficient numbers to yield an accurate estimate
of a small error rate. However, they maintain that proficiency tests, blind or
otherwise, should be used to provide a conservative estimate of the false-positive
error probability.69 For example, if there were no errors in 100 tests, a 95% confidence interval would include the possibility that the error rate could be almost
as high as 3%.70
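The 3% figure can be checked with a short calculation. This is a sketch, not the Report's own computation: it assumes the standard one-sided exact (Clopper-Pearson) upper confidence bound for zero observed errors, which solves (1 − p)^n = 0.05 for p.

```python
def upper_error_bound(n_tests: int, confidence: float = 0.95) -> float:
    """One-sided exact upper confidence limit on an error rate when
    zero errors are observed in n_tests independent trials.
    Solves (1 - p) ** n_tests = 1 - confidence for p."""
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_tests)

# Zero errors in 100 proficiency tests still leaves the true error
# rate possibly as high as about 3%.
print(round(upper_error_bound(100), 4))  # → 0.0295
```

With 0 errors in 100 tests the bound is about 2.95%, consistent with "almost as high as 3%" in the text. Accurately estimating a rate near 0.001, rather than merely bounding it, requires observing many errors, which is why the text speaks of tens of thousands of proficiency tests.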
Whether or not a case-specific probability of laboratory error can be estimated with proficiency tests, traditional legal and scientific procedures can help to
assess the possibilities of errors in handling or analyzing the samples. Scrutinizing
the chain of custody, examining the laboratory’s protocol, verifying that it adhered
to that protocol, and conducting confirmatory tests (including testing by the
defense) can help show that the profiles really do match.
D. Could a Close Relative Be the Source?
With enough loci to test, all individuals except identical twins should be distinguishable. With existing technology and small sample sizes of DNA recovered
from crime scenes, however, this ideal is not always attainable.
64. E.g., Jonathan J. Koehler, Error and Exaggeration in the Presentation of DNA Evidence at Trial,
34 Jurimetrics J. 21, 37–38 (1993).
65. NRC I, supra note 8, at 94.
66. Id. at 89.
67. NRC II, supra note 9, at 85–87.
68. Id. at 85–86; Devlin & Roeder, supra note 63, § 18-5.3, at 744–45. Such arguments have
not persuaded the proponents of estimating the probability of error from industry-wide proficiency
testing. E.g., Jonathan J. Koehler, Why DNA Likelihood Ratios Should Account for Error (Even When a
National Research Council Report Says They Should Not), 37 Jurimetrics J. 425 (1997).
69. E.g., Jonathan J. Koehler, DNA Matches and Statistics: Important Questions, Surprising Answers,
76 Judicature 222, 228 (1993); Richard Lempert, After the DNA Wars: Skirmishing with NRC II, 37
Jurimetrics J. 439, 447–48, 453 (1997).
70. See NRC II, supra note 9, at 86 n.1. For an explanation of confidence intervals, see David
H. Kaye & David A. Freedman, Reference Guide on Statistics, in this manual.
A thorough investigation might extend to all known relatives, but this is not feasible in every case,
and there is always the chance that some unknown relatives are in the suspect
population. Formulas are available for computing the probability that any person
with a specified degree of kinship to the defendant also possesses the incriminating
genotype. For example, the probability that an untested brother (or sister) would
match at four loci (with alleles that each occur in 10% of the population) is about
1/380;71 the probability that an aunt (or uncle) would match is about 1/100,000.72
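The kind of formula referred to can be sketched in a few lines. The expressions below are the standard identity-by-descent match probabilities for a heterozygous genotype (full sibling: (1 + p + q + 2pq)/4; uncle or aunt: (p + q + 4pq)/4). They are illustrative assumptions on our part; because the manual's 1/380 and 1/100,000 figures rest on the specific equations and assumptions in NRC II, a simplified sketch like this need not reproduce them exactly.

```python
def sibling_match(p: float, q: float) -> float:
    """P(an untested full sibling shares a heterozygous genotype whose
    alleles have population frequencies p and q), derived from
    identity-by-descent sharing probabilities (1/4, 1/2, 1/4)."""
    return (1 + p + q + 2 * p * q) / 4

def avuncular_match(p: float, q: float) -> float:
    """Same probability for an uncle/aunt or nephew/niece, who share
    one allele identical by descent with probability 1/2."""
    return (p + q + 4 * p * q) / 4

# Four loci, each allele carried by 10% of the population (illustrative).
loci = [(0.10, 0.10)] * 4
p_sib = p_unc = 1.0
for p, q in loci:
    p_sib *= sibling_match(p, q)
    p_unc *= avuncular_match(p, q)
print(p_sib, p_unc)
```

The per-locus probabilities here are 0.305 for a sibling and 0.06 for an uncle; multiplying across independent loci makes even a sibling match improbable, and more distant relatives rapidly approach the unrelated-person probabilities discussed next.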
E. Could an Unrelated Person Be the Source?
Another rival hypothesis is coincidence: The defendant is not the source of the
crime scene DNA but happens to have the same genotype as an unrelated individual who is the true source. Various procedures for assessing the plausibility of
this hypothesis are available. In principle, one could test all conceivable suspects.
If everyone except the defendant has a nonmatching profile, then the defendant
must be the source. But exhaustive, error-free testing of the population of conceivable suspects is almost never feasible. The suspect population normally defies
any enumeration, and in the typical crime where DNA evidence is found, the
population of possible perpetrators is so huge that even if all of its members could
be listed, they could not all be tested.73
An alternative procedure would be to take a sample of people from the
suspect population, find the relative frequency of the profile in this sample, and
use that statistic to estimate the frequency in the entire suspect population. The
smaller the frequency, the less likely it is that the defendant’s DNA would match if
the defendant were not the source of trace evidence. Again, however, the suspect
population is difficult to define, so some surrogate must be used. The procedure
commonly followed is to estimate the relative frequency of the incriminating
genotype in a large population.
71. For a case with conflicting calculations of the probability of an untested brother having a
matching genotype, see McDaniel v. Brown, 130 S. Ct. 665 (2010) (per curiam). The correct computation is given in David H. Kaye, “False, but Highly Persuasive”: How Wrong Were the Probability
Estimates in McDaniel v. Brown? 108 Mich. L. Rev. First Impressions 1 (2009), available at http://
www.michiganlawreview.org/assets/fi/108/kaye.pdf.
72. These figures follow from the equations in NRC II, supra note 9, at 113. The large discrepancy between two siblings on the one hand, and an uncle and nephew on the other, reflects the
fact that the siblings have far more shared ancestry. All their genes are inherited through the same two
parents. In contrast, a nephew and an uncle inherit from two unrelated mothers, and so will have few
maternal alleles in common. As for paternal alleles, the nephew inherits not from his uncle, but from
his uncle’s brother, who shares by descent only about one-half of his alleles with the uncle.
73. As the cost of DNA profiling drops, it will become technically and economically feasible to
have a comprehensive, population-wide DNA database that could be used to produce a list of nearly
everyone whose DNA profile is consistent with the trace evidence DNA. Whether such a system
would be constitutionally and politically acceptable is another question. See David H. Kaye & Michael
S. Smith, DNA Identification Databases: Legality, Legitimacy, and the Case for Population-Wide Coverage,
2003 Wis. L. Rev. 413.
But even this cannot be done directly because each
possible multilocus profile is so rare that it is not likely to show up in any sample
of a reasonable size. However, the frequencies of most alleles can be determined
accurately by sampling the population to construct databases that reveal how
often each allele occurs. Principles of population genetics then can be applied to
combine the estimated allele frequencies into an estimate of the probability that a
person born in the population will have the multilocus genotype. This probability
often is referred to as the random-match probability. This section describes how
the allele frequencies are estimated from samples and how the random-match
probability is computed from allele frequencies.
1. Estimating allele frequencies from samples
As we saw in Section II.B, the loci currently used in forensic testing have been
chosen partly because their alleles tend to be different in different people. For
example, 2% of the population might have the alleles with 7 and 10 repeats at a
particular STR locus; 1% might have the combination of 5 and 6; and so on. If
we take a DNA molecule’s view of the population, human beings are containers
for DNA molecules and machines for copying and propagating them to the next generation
of human beings. The different DNA molecules are swimming, so to speak, in
a huge pool of humanity. All the possible alleles (the fives, sixes, sevens, and so
on) form a large population, or pool, of alleles. Each allele constitutes a certain
proportion of the allele pool. Suppose, then, that a five-repeat allele represents 12%
of the allele pool, a six-repeat allele contributes 20%, and so on, for all the
alleles at a locus.
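The allele-pool bookkeeping described in this paragraph can be sketched in a few lines of code. This is a hypothetical illustration only: the locus and every frequency below are invented, not drawn from any forensic database. The substantive points are simply that the pool proportions must sum to 1 and that an estimate from a sample is just a count.

```python
from collections import Counter

# Hypothetical allele pool for one STR locus: repeat number -> proportion.
# All frequencies are invented for illustration.
pool = {5: 0.12, 6: 0.20, 7: 0.33, 8: 0.25, 10: 0.10}
assert abs(sum(pool.values()) - 1.0) < 1e-9  # proportions must sum to 1

def estimate_frequencies(sampled_alleles):
    """Estimate allele proportions by counting alleles in a sample.

    Each person contributes two alleles at the locus, so a sample of n
    people yields 2n alleles.
    """
    counts = Counter(sampled_alleles)
    total = sum(counts.values())
    return {allele: counts[allele] / total for allele in counts}

# A toy "sample" of 5 people (10 alleles):
sample = [5, 6, 6, 7, 7, 7, 8, 8, 10, 5]
est = estimate_frequencies(sample)
# est[7] is 3/10 = 0.3, a rough estimate of the (invented) true value 0.33
```

A real database would tabulate thousands of alleles per locus, but the arithmetic is the same.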
The first step in computing a random-match probability is to estimate these
allele frequencies. Ideally, a probability sample from the human population of
interest would be taken.74 We would start with a list of everyone who might
have left the trace evidence, take a random sample of these people, and count
the numbers of alleles of each length that are present in the sample. Unfortunately, a list of the people who make up the entire population of possible
suspects is almost never available; consequently, probability sampling from the
directly relevant population is impossible. Probability sampling from a comparable population (with regard to the individuals’ DNA) is possible, but it is
not the norm in studies of the distributions of genes in populations. Typically,
convenience samples (from blood banks or paternity cases) are used.75 Relatively small samples can produce fairly accurate estimates of individual allele frequencies.76
Once the allele frequencies have been estimated, the next step in arriving at a random-match probability is to combine them. This requires some knowledge of how DNA is copied and recombined in the course of sexual reproduction and how human beings choose their mates.
2. The product rule for a randomly mating population
All scientists use simplified models of a complex reality. Physicists solve equations of motion in the absence of friction. Economists model exchanges among rational agents who bargain freely with no transaction costs. Population geneticists compute genotype frequencies in an infinite population of individuals who choose their mates independently of their alleles at the loci in question. Although geneticists describe this situation as random mating, geneticists know that people do not choose their mates by a lottery. “Random mating” simply indicates that the choices are uncorrelated with the specific alleles that make up the genotypes in question.
In a randomly mating population, the expected frequency of a pair of alleles at any single locus depends on whether the two alleles are distinct. If the offspring happens to inherit the same allele from each parent, the expected single-locus genotype frequency is the square of the allele frequency (p2). If a different allele is inherited from each parent, the expected single-locus genotype frequency is twice the product of the two individual allele frequencies (often written as 2p1p2).77 These proportions are known as Hardy-Weinberg proportions. Even if two populations with distinct allele frequencies are thrown together, within the limits of chance variation, random mating produces Hardy-Weinberg equilibrium in a single generation.
74. Probability sampling is described in Kaye & Freedman, supra note 2, and Shari Seidman Diamond, Reference Guide on Survey Research, in this manual.
75. A few experts have testified that no meaningful conclusions can be drawn in the absence of random sampling. E.g., People v. Soto, 88 Cal. Rptr. 2d 34 (1999); State v. Anderson, 881 P.2d 29, 39 (N.M. 1994). The 1996 NRC report suggests that for the purpose of estimating allele frequencies, convenience sampling should give results comparable to random sampling, and it discusses procedures for estimating the random sampling error. NRC II, supra note 9, at 126–27, 146–48, 186. The courts generally have rejected the argument that random samples are essential to valid or generally accepted random-match probabilities. See D.H. Kaye, Bible Reading: DNA Evidence in Arizona, 28 Ariz. St. L.J. 1035 (1996).
76. In the formative years of forensic DNA testing, defendants frequently contended that
forensic databases were too small to give accurate estimates, but this argument generally proved unpersuasive. E.g., United States v. Shea, 957 F. Supp. 331, 341–43 (D.N.H. 1997); State v. Dishon, 687
A.2d 1074, 1090 (N.J. Super. Ct. App. Div. 1997); State v. Copeland, 922 P.2d 1304, 1321 (Wash.
1996). To the extent that the databases are comparable to random samples, confidence intervals are a
standard method for indicating the uncertainty resulting from sample size. Unfortunately, the meaning
of a confidence interval is subtle, and the estimate commonly is misconstrued. See Kaye & Freedman,
supra note 2.
77. Suppose that 10% of the sperm in the gene pool of the population carry allele 1 (A1), and
50% carry allele 2 (A2). Similarly, 10% of the eggs carry A1, and 50% carry A2. (Other sperm and eggs
carry other types.) With random mating, we expect 10% × 10% = 1% of all the fertilized eggs to be
A1A1, and another 50% × 50% = 25% to be A2A2. These constitute two distinct homozygote profiles.
Likewise, we expect 10% × 50% = 5% of the fertilized eggs to be A1A2 and another 50% × 10% =
5% to be A2A1. These two configurations produce indistinguishable profiles—a peak, band, or dot for
A1 and another mark for A2. So the expected proportion of heterozygotes A1A2 is 5% + 5% = 10%.
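The arithmetic in note 77 can be replayed directly. Using the note’s hypothetical frequencies (10% for A1 and 50% for A2), the Hardy-Weinberg proportions are the square of the allele frequency for each homozygote and twice the product for the heterozygote:

```python
def hardy_weinberg(p1, p2):
    """Expected single-locus genotype proportions under random mating."""
    hom1 = p1 * p1      # homozygote A1A1: p1 squared
    hom2 = p2 * p2      # homozygote A2A2: p2 squared
    het = 2 * p1 * p2   # heterozygote: A1A2 and A2A1 are indistinguishable
    return hom1, hom2, het

hom1, hom2, het = hardy_weinberg(0.10, 0.50)
# hom1 = 0.01 (1%), hom2 = 0.25 (25%), het = 0.10 (10%), matching note 77
```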
Once the proportion of the population that has each of the single-locus
genotypes for the forensic profile has been estimated, the proportion of the
population that is expected to share the combination of them—the multilocus
profile frequency—is given by multiplying all the single-locus proportions. This
multiplication is exactly correct when the single-locus genotypes are statistically
independent. In that case, the population is said to be in linkage equilibrium.
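Under linkage equilibrium, the multilocus profile frequency is just the product of the single-locus Hardy-Weinberg proportions. A sketch of the basic product rule, with invented allele frequencies for three hypothetical loci (real casework uses more loci and database-derived frequencies):

```python
# Each locus is given as the frequencies of the profile's two alleles;
# None for the second allele marks a homozygote. All numbers are invented.
profile = [
    (0.12, 0.20),   # locus 1: heterozygote, alleles at 12% and 20%
    (0.08, None),   # locus 2: homozygote, allele at 8%
    (0.25, 0.05),   # locus 3: heterozygote
]

def single_locus_proportion(p1, p2):
    """Hardy-Weinberg proportion for one locus: p1 squared or 2*p1*p2."""
    return p1 * p1 if p2 is None else 2 * p1 * p2

def random_match_probability(profile):
    """Basic product rule: multiply the single-locus proportions."""
    rmp = 1.0
    for p1, p2 in profile:
        rmp *= single_locus_proportion(p1, p2)
    return rmp

rmp = random_match_probability(profile)
# Locus proportions are 0.048, 0.0064, and 0.025; their product is
# 7.68e-06, i.e., roughly 1 person in 130,000 even with only three loci.
```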
Early estimates of DNA genotype frequencies assumed that alleles were inherited independently within and across loci (Hardy-Weinberg and linkage equilibrium, respectively). Because the frequencies of the VNTR loci then in use were
shown to vary across census groups (whites, blacks, Hispanics, Asians, and Native
Americans), it became common to present the estimated genotype frequencies
within each of these groups (in cases in which the “race” of the source of the
trace evidence was unknown) or only in a particular census group (if the “race”
of the source was known).78
3. The product rule for a structured population
Population geneticists understood that the equilibrium frequencies were only
approximations and that the major racial populations are composed of ethnic subpopulations whose members tend to mate among themselves. Within each ethnic
subpopulation, mating still can be random, but if, say, Italian Americans have allele
frequencies that are markedly different from the average for all whites, and if Italian
Americans only mate among themselves, then using the average frequencies for all
whites in the basic product formula could understate—or overstate—a multilocus
profile frequency for the subpopulation of Italian Americans. Similarly, using the
population frequencies could understate—or overstate—the profile frequencies in
the white population itself.
Consequently, if we want to know the frequency of an incriminating profile
among Italian Americans, the basic product rule applied to the allele frequencies
for whites in general could be in error; and there is even some chance that the rule
will understate the profile frequency in the white population as a whole. Experts
have disagreed, however, as to whether the major population groups are so severely
structured that the departures from equilibrium would be substantial. Courts applying the Daubert and Frye rules for scientific evidence issued conflicting opinions as
to the admissibility of basic product-rule estimates.79 A 1992 report from a committee of the National Academy of Sciences did not resolve the question, but a
second committee concluded in 1996 that the basic product rule provided reasonable estimates in most cases, and it described a modified version of the product rule
78. The use of a range of estimates conditioned on race is defended, and several alternatives are
discussed in Kaye, supra note 3, at 192–97; David H. Kaye, The Role of Race in DNA Evidence: What
Experts Say, What California Courts Allow, 37 Sw. U. L. Rev. 303 (2008).
79. These legal and scientific developments are chronicled in detail in Kaye, supra note 3.
to account for population structure.80 By the mid-1990s, the population-structure
objection to admitting random-match probabilities had lost its power.81
F. Probabilities, Probative Value, and Prejudice
Up to this point, we have described the random-match probabilities that commonly are presented in conjunction with the finding that the trace evidence
sample contains DNA of the same type as the defendant’s. We have concentrated
on the methods used to compute the probabilities. Assuming that these methods
meet Daubert’s demand for scientific validity and reliability (or, in many states,
Frye’s requirement of general acceptance in the scientific community) and thus
satisfy Federal Rule of Evidence 702, a further issue can arise under Rule 403: To
what extent will the presentation assist the jury in understanding the meaning of
a match so that the jury can give the evidence the weight that it deserves? This
question involves psychology and law, and we summarize the arguments about
probative value and prejudice that have been made in litigation and in the legal
and scientific literature. We take no position on how the legal issue of the admissibility of any particular statistic generally should be resolved under the balancing
standard of Rule 403. The answer may turn not only on the general features of the
evidence described here, but on the context and circumstances of particular cases.
1. Frequencies and match probabilities
a. Argument: Frequencies or probabilities are prejudicial because they are so small
The most common form of expert testimony about matching DNA involves
an explanation of how the laboratory ascertained that the defendant’s DNA has
the profile of the forensic sample plus an estimate of the profile frequency or
random-match probability. It has been suggested, however, that jurors do not
understand probabilities in general, and that infinitesimal match probabilities will
so bedazzle jurors that they will not appreciate the other evidence in the case or
any innocent explanations for the match.82 Empirical research into this hypothesis
has been limited,83 and commentators have noted that remedies short of exclusion
80. The 1996 committee’s recommendations for computing random-match probabilities with
broad populations and particular subpopulations are summarized in the previous edition of this guide.
The 1992 committee had proposed a more conservative (and less elegant) method of dealing with
variations across subpopulations (the “ceiling principle”), also described in the previous edition.
81. See, e.g., Kaye, supra note 3.
82. Cf. Gov’t of the Virgin Islands v. Byers, 941 F. Supp. 513, 527 (D.V.I. 1996) (“Vanishingly small probabilities of a random match may tend to establish guilt in the minds of jurors and are
particularly suspect.”).
83. This research is tabulated in David H. Kaye et al., Statistics in the Jury Box: Do Jurors Understand Mitochondrial DNA Match Probabilities? 4 J. Empirical Legal Stud. 797 (2007). The findings do not clearly support the argument that jurors will overweight the probability, but the details of how the probability is presented and countered may be important.
are available.84 Thus, although there once was a line of cases that excluded probability testimony in criminal matters, by the mid-1990s, no jurisdiction excluded DNA match probabilities on this basis.85 The opposite argument—that relatively large random-match probabilities are prejudicial—also has been advanced without success.86
b. Argument: Frequencies or probabilities are prejudicial because they might be transposed
A related concern is that the jury will misconstrue the random-match probability as the probability that the evidence DNA came from a random individual.87 The words are almost identical, but the probabilities can be quite different. The random-match probability is the probability that the suspect has the DNA genotype of the crime scene sample if he is not the true source of that sample (and is unrelated to the true source). The tendency to invert or transpose the probability—to go from a one-in-a-million chance if the suspect is not the source to a million-to-one chance that the suspect is the source—is known as the fallacy of the transposed conditional.88 To appreciate that the transposition is fallacious, consider the probability
84. According to the 1996 NRC committee, suitable cross-examination, defense experts, and
jury instructions might reduce the risk that small estimates of the match probability will produce an
unwarranted sense of certainty and lead a jury to disregard other evidence. NRC II, supra note 9, at 197.
85. E.g., United States v. Chischilly, 30 F.3d 1144 (9th Cir. 1994) (citing cases); State v. Weeks,
891 P.2d 477, 489 (Mont. 1995) (rejecting the argument that “the exaggerated opinion of the accuracy of DNA testing is prejudicial, as juries would give undue weight and deference to the statistical
evidence” and “that the probability aspect of the DNA analysis invades the province of the jury to
decide the guilt or innocence of the defendant”).
86. See United States v. Morrow, 374 F. Supp. 2d 51, 65 (D.D.C. 2005) (rejecting the argument
because “the DNA evidence remains probative, and helps to corroborate other evidence and support
the Government’s case as to the identity of the relevant perpetrators. Indeed, the low statistical significance actually benefits Defendants, as Defendants can argue that having random match probabilities
running between 1:12 and 1:1 means that hundreds, if not thousands, of others in the Washington,
D.C. area cannot be excluded as possible contributors as well.”).
87. Numerous opinions or experts present the random-match probability in this manner. E.g.,
State v. Davolt, 84 P.3d 456, 475 (Ariz. 2004) (stating that “the chance the saliva found on cigarette
remains in the house did not belong to [the defendant] was one in 280 quadrillion for the Caucasian
population”); Kaye et al., supra note 1, § 14.1.2(a) (collecting opinions reflecting this fallacy).
88. The transposition fallacy also is called the “prosecutor’s fallacy” in the legal literature—
despite the fact that it hardly is limited to prosecutors. Our description of the fallacy is imprecise.
In this context, the random-match probability is the chance that (A) the suspect has the crime scene
genotype given that (B) he is not the true source. The probability that the match is random is the
probability that (B) the individual tested has been selected at random given that (A) the individual
has the requisite genotype. In general, for two events A and B, the probability of A given B, which
we can write as P(A given B), does not equal P(B given A). See Kaye & Freedman, supra note 2. The
claim that the probabilities are necessarily equal is the transposition fallacy. Id. (also noting instances
of the fallacy in other types of litigation).
that a lawyer picked at random from all lawyers in the United States is an appellate
judge. This “random-judge probability” is practically zero. But the probability
that a person randomly selected from the current appellate judiciary is a lawyer
is one. The random-judge probability, P(judge given lawyer), does not equal the
transposed probability P(lawyer given judge). Likewise, the random-match probability P(genotype given unrelated source) does not necessarily equal P(unrelated
source given genotype).89
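The judge-lawyer illustration can be made concrete with a small count table. The numbers below are invented round figures, not actual counts of lawyers or judges; the only point is that P(judge given lawyer) and P(lawyer given judge) differ enormously.

```python
# Invented round numbers for illustration only.
n_lawyers = 1_000_000                 # hypothetical: all lawyers in the country
n_judges = 500                        # hypothetical: all appellate judges
n_judges_who_are_lawyers = n_judges   # per the text, every appellate judge is a lawyer

# P(judge given lawyer): pick a lawyer at random; chance he or she is a judge.
p_judge_given_lawyer = n_judges / n_lawyers                  # 0.0005

# P(lawyer given judge): pick an appellate judge at random; chance of a lawyer.
p_lawyer_given_judge = n_judges_who_are_lawyers / n_judges   # 1.0

# Transposing the conditional here changes the answer by a factor of 2,000.
```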
No federal court has excluded a random-match probability (or, for that
matter, an estimate of the small frequency of a DNA profile in the general
population) as unfairly prejudicial simply because the jury might misinterpret it
as a probability that the defendant is the source of the forensic DNA.90 Courts,
however, have noted the need to have the concept “properly explained,”91 and
prosecutorial or expert misrepresentations of the random-match probabilities for
89. To avoid this fallacious reasoning by jurors, some scientific and legal commentators have
urged the exclusion of random-match probabilities. In response, the 1996 NRC committee suggested
that “if the initial presentation of the probability figure, cross-examination, and opposing testimony
all fail to clarify the point, the judge can counter [the fallacy] by appropriate instructions to the jurors
that minimize the possibility of cognitive errors.” NRC II, supra note 9, at 198 (footnote omitted).
The committee suggested the following instruction to define the random-match probability:
In evaluating the expert testimony on the DNA evidence, you were presented with a number indicating
the probability that another individual drawn at random from the [specify] population would coincidentally have the same DNA profile as the [bloodstain, semen stain, etc.]. That number, which assumes
that no sample mishandling or laboratory error occurred, indicates how distinctive the DNA profile is.
It does not by itself tell you the probability that the defendant is innocent.
Id. at 198 n.93. An alternative adopted in England is to confine the prosecution to stating a frequency
rather than a probability. See Kaye et al., supra note 1, § 14.1.2(b); cf. D.H. Kaye, The Admissibility of
“Probability Evidence” in Criminal Trials—Part II, 27 Jurimetrics J. 160, 168 (1987) (similar proposal).
The NRC committee also noted the opposing “defendant’s fallacy” of dismissing or undervaluing
the matches with high likelihood ratios because other matches are to be expected in unrealistically
large populations of potential suspects. For example, defense counsel might argue that (1) with a
random-match probability of one in a million, we would expect to find three or four unrelated people
with the requisite genotypes in a major metropolitan area with a population of 3.6 million; (2) the
defendant just happens to be one of these three or four, which means that the chances are at least 2
out of 3 that someone unrelated to the defendant is the source; so (3) the DNA evidence does nothing
to incriminate the defendant. The problem with this argument is that in a case involving both DNA
and non-DNA evidence against the defendant, it is unrealistic to assume that there are 3.6 million
equally likely suspects. When juries are confronted with both fallacies, the defendant’s fallacy seems
to dominate. NRC II, supra note 9, at 198; cf. Jonathan J. Koehler, The Psychology of Numbers in the
Courtroom: How to Make DNA-Match Statistics Seem Impressive or Insufficient, 74 S. Cal. L. Rev. 1275
(2001) (discussing ways of framing the evidence that make it more or less persuasive).
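The arithmetic behind the defendant’s fallacy in note 89 is easy to reproduce. With the note’s hypothetical figures (a one-in-a-million random-match probability and a metropolitan population of 3.6 million), the expected number of unrelated people with the genotype is:

```python
rmp = 1e-6                # random-match probability (hypothetical figure from note 89)
population = 3_600_000    # metropolitan population (hypothetical figure from note 89)

expected_matches = rmp * population   # about 3.6 unrelated people expected

# If the defendant were merely one of, say, four equally likely matching
# people, the chance that someone else is the source would be 3/4 -- but, as
# the note explains, treating all 3.6 million residents as equally likely
# suspects ignores the non-DNA evidence in the case.
```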
90. See, e.g., United States v. Morrow, 374 F. Supp. 2d 51, 66 (D.D.C. 2005) (“careful oversight
by the district court and proper explanation can easily thwart this issue”).
91. United States v. Shea, 957 F. Supp. 331, 345 (D.N.H. 1997); see also United States v.
Chischilly, 30 F.3d 1144, 1158 (9th Cir. 1994) (stating that the government must be “careful to frame
the DNA profiling statistics presented at trial as the probability of a random match, not the probability
of the defendant’s innocence that is the crux of the prosecutor’s fallacy”).
DNA and other trace evidence have produced reversals or contributed to the
setting aside of verdicts.92
c. Argument: Random-match probabilities that are smaller than false-positive
error probabilities are irrelevant or prejudicial
Some scientists and lawyers have maintained that match probabilities are
logically irrelevant when they are far smaller than the probability of a frameup, a
blunder in labeling samples, cross-contamination, or other events that would yield
a false positive.93 The argument is that the jury should concern itself only with the
chance that the forensic sample is reported to match the defendant’s profile even
though the defendant is not the source. Match probabilities do not express this
chance unless the probability of a false-positive report (because of fraud or an error
in the collection, handling, or analysis of the DNA samples) is essentially zero.
The mathematical observation has led to the argument that because these other
possible explanations for a match are more probable than the very small random-match probabilities for most STR profiles, the latter probabilities are irrelevant.
Commentators have crafted theoretical, doctrinal, and practical rejoinders to this
claim.94 The essence of the counterargument is that it is logical to give jurors
information about kinship or random-match probabilities because, even if these
numbers do not give the whole picture, they address pertinent hypotheses about
the true source of the trace evidence.
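The logic of this argument can be expressed numerically. If a coincidental match and a false-positive report are treated as the only two innocent routes to a reported match, and the two are assumed independent, the chance of a reported match when the defendant is not the source is approximately their sum. The figures below are invented for illustration:

```python
rmp = 1e-9    # random-match probability (invented for illustration)
fpp = 1e-4    # false-positive probability from handling or labeling error (invented)

# Probability of a *reported* match given that the defendant is not the
# source, assuming the two error routes are independent:
p_reported_match = rmp + fpp - rmp * fpp

# The result is just over 1e-4: the false-positive risk dominates, which is
# the critics' point. The rejoinder in the text is that the two numbers
# address different hypotheses, and each can still inform the jury.
```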
It also has been argued that even if very small match probabilities are logically
relevant, they are unfairly prejudicial in that they will cause jurors to neglect the
probability of a match arising due to a false-positive laboratory error.95 A court
92. E.g., United States v. Massey, 594 F.2d 676, 681 (8th Cir. 1979) (explaining that in closing
argument about hair evidence, “the prosecutor ‘confuse[d] the probability of concurrence of the identifying marks with the probability of mistaken identification’”) (alteration in original). The Supreme
Court noted the transposition fallacy in the prosecution’s presentation of DNA evidence as a basis for
a federal writ of habeas corpus in McDaniel v. Brown, 130 S. Ct. 665 (2010) (per curiam). The Court
unanimously held that the prisoner had not properly raised the issue of whether this error amounted
to a violation of due process. For comments on that issue, see Kaye, supra note 71.
93. E.g., Jonathan J. Koehler et al., The Random Match Probability in DNA Evidence: Irrelevant
and Prejudicial? 35 Jurimetrics J. 201 (1995); Richard C. Lewontin & Daniel L. Hartl, Population
Genetics in Forensic DNA Typing, 254 Science 1745, 1749 (1991) (“[p]robability estimates like 1 in
738,000,000,000,000 . . . are terribly misleading because the rate of laboratory error is not taken into
account”).
94. See Kaye et al., supra note 1, § 14.1.1 (discussing the issue).
95. Some commentators believe that this prejudice is so likely and so serious that “jurors ordinarily should receive only the laboratory’s false positive rate. . . .” Richard Lempert, Some Caveats Concerning DNA as Criminal Identification Evidence: With Thanks to the Reverend Bayes, 13 Cardozo L. Rev. 303, 325 (1991) (emphasis added). The 1996 NRC committee was skeptical of this view, especially when the defendant has had a meaningful opportunity to retest the DNA at a laboratory of his or her choice, and it suggested that judicial instructions can be crafted to avoid this form of prejudice. NRC II, supra note 9, at 199. Pertinent psychological research includes Dale A. Nance & Scott B. Morris, Juror Understanding of DNA Evidence: An Empirical Assessment of Presentation Formats for Trace Evidence with a Relatively Small Random Match Probability, 34 J. Legal Stud. 395 (2005); Dale A. Nance & Scott B. Morris, An Empirical Assessment of Presentation Formats for Trace Evidence with a Relatively Large and Quantifiable Random Match Probability, 42 Jurimetrics J. 1 (2002); Jason Schklar & Shari Seidman Diamond, Juror Reactions to DNA Evidence: Errors and Expectancies, 23 Law & Hum. Behav. 159, 179 (1999) (concluding that separate figures for laboratory error and a random match to a correctly ascertained profile are desirable in that “[j]urors . . . may need to know the disaggregated elements that influence the aggregated estimate as well as how they were combined in order to evaluate the DNA test results in the context of their background beliefs and the other evidence introduced at trial”).
that shares this concern might require the expert who presents a random-match probability also to report a probability that the laboratory is mistaken about the profiles. Of course, for reasons given in Section III.B.2, some experts would deny that they can provide a meaningful statistic for the case at hand, but it has been pointed out that they could report the results of proficiency tests and leave it to the jury to use this figure as best it can in considering whether a false-positive error has occurred.96 In any event, the courts have been unreceptive to efforts to replace random-match probabilities with a blended figure that incorporates the risk of a false-positive error97 or to exclude random-match probabilities that are not accompanied by a separate false-positive error probability.98
96. Cf. Williams v. State, 679 A.2d 1106, 1120 (Md. 1996) (reversing because the trial court
restricted cross-examination about the results of proficiency tests involving other DNA analysts at the
same laboratory). But see United States v. Shea, 957 F. Supp. 331, 344 n.42 (D.N.H. 1997) (“The
parties assume that error rate information is admissible at trial. This assumption may well be incorrect. Even though a laboratory or industry error rate may be logically relevant, a strong argument
can be made that such evidence is barred by Fed. R. Evid. 404 because it is inadmissible propensity
evidence.”).
97. United States v. Ewell, 252 F. Supp. 2d 104, 113–14 (D.N.J. 2003) (stating that exclusion
of the random-match probability is not justified when “the defendant’s argument is not based on
evidence of actual errors by the laboratory, but instead has simply challenged the Government’s failure to quantify the rate of laboratory error,” while “the Government has demonstrated the scientific
method has a virtually zero rate of error, and that it employs sufficient procedures and controls to
limit laboratory error,” and the defendant had an expert who could testify to the probability of error);
United States v. Shea, 957 F. Supp. 331, 334–45 (D.N.H. 1997) (holding that separate figures for
match and error probabilities are not prejudicial); People v. Reeves, 109 Cal. Rptr. 2d 728, 753 (Ct.
App. 2001) (holding that probability of laboratory error need not be combined with random-match
probability); Armstead v. State, 673 A.2d 221, 245 (Md. 1996) (finding that the failure to combine a
random-match probability with an error rate on proficiency tests that was many orders of magnitude
greater (and that was placed before the jury) did not deprive the defendant of due process); State v.
Tester, 968 A.2d 895 (Vt. 2009).
98. United States v. Trala, 162 F. Supp. 2d 336, 350–51 (D. Del. 2001) (stating that presenting
a nonzero laboratory error rate is not a condition of admissibility, and Daubert does not require separate figures for match and error probabilities to be combined); United States v. Lowe, 954 F. Supp.
401, 415–16 (D. Mass. 1997), aff’d, 145 F.3d 45 (1st Cir. 1998) (finding that a “theoretical” error
rate need not be presented when quality assurance standards have been followed and defendant had
the opportunity to retest the sample); Roberts v. United States, 916 A.2d 922, 930–31 (D.C. 2007)
(finding that presenting a laboratory error rate is not a condition of admissibility); Roberson v. State,
16 S.W.3d 156, 168 (Tex. Crim. App. 2000) (finding that error rate not needed when laboratory
was accredited and underwent blind proficiency testing); Tester, 968 A.2d 895 (stating that when the
laboratory chemist stated that “[t]here is no error rate to report” because the number of proficiency
2. Likelihood ratios
Sufficiently small probabilities of a match for close relatives and unrelated members
of the suspect population undermine the hypotheses of kinship and coincidence.
Adequate safeguards and checks for possible laboratory error make that explanation
of the finding of matching genotypes implausible. The inference that the defendant
is the source of the crime scene DNA is then secure. But this mode of reasoning
by elimination is not the only way to analyze DNA evidence. This subsection and
the next describe alternatives—likelihoods and posterior probabilities—that some
statisticians prefer and that have been used in a growing number of court cases.
To choose between two competing hypotheses, one can compare how probable the evidence is under each hypothesis. Suppose that the probability of a
match in a well-run laboratory is close to 1 when the samples both contain only
the defendant’s DNA, while both the probability of a coincidental match and the
probability of a match to a close relative are close to 0. In these circumstances,
the DNA profiling strongly supports the claim that the defendant is the source,
because the observed outcome—the match—is many times more probable when
the defendant is the source than when someone else is. How many times more
probable? Suppose that there is a 1% chance that the laboratory would miss a true
match, so that the probability of its finding a match when the defendant is the
source is 0.99. Suppose further that p = 0.00001 is the random-match probability.
Then the match is 0.99/0.00001, or 99,000 times more likely to be seen if the
defendant is the source than if an unrelated individual is. Such a ratio is called a
likelihood ratio, and a likelihood ratio of 99,000 means that the DNA profiling
supports the claim of identity 99,000 times more strongly than it supports the
hypothesis of coincidence.99
Likelihood ratios have been presented in court in many cases. They are routinely introduced under the name “paternity index” in civil and criminal cases
that involve DNA testing for paternity.100 Experts also have used them in cases in
which the issue is whether two samples originated from the same individual. For
example, in one California case, an expert stated that “for the Caucasian population, the evidence DNA profile was approximately 1.9 trillion times more likely to
match appellant’s DNA profile if he was the contributor of that DNA rather than
some unknown, unrelated individual; for the Hispanic population, it was 2.6 trillion times more likely; and for the African-American population, it was about
9.1 trillion times more likely.”101 And, as explained below (Section V.C), likeli-
trials was insufficient, the random-match probability was admissible and preferable to presenting the
finding of a match with no accompanying statistic).
99. Another likelihood ratio would give the relative likelihood of the hypotheses of identity and
a falsely declared match arising from an error in the laboratory. See supra Section IV.F.1.
100. See Kaye, supra note 3; 1 McCormick on Evidence, supra note 3, § 211.
101. People v. Prince, 36 Cal. Rptr. 3d 300, 310 (Ct. App. 2005), review denied, 142 P.3d 1184
(Cal. 2006).
172
Copyright © National Academy of Sciences. All rights reserved.
Reference Manual on Scientific Evidence: Third Edition
Reference Guide on DNA Identification Evidence
hood ratios are especially useful for samples that are mixtures of DNA from several
people.
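The arithmetic behind a likelihood ratio is simple enough to sketch in a few lines. The figures below are the illustrative ones from the text (a 99% chance of detecting a true match and a random-match probability of 0.00001); they are not drawn from any particular case.

```python
# Likelihood ratio for a reported DNA match, using the text's
# illustrative figures (not data from any actual case).

p_match_if_source = 0.99   # P(lab reports a match | defendant is the source)
p_random_match = 0.00001   # P(lab reports a match | unrelated individual)

likelihood_ratio = p_match_if_source / p_random_match
print(likelihood_ratio)    # 99,000: the match is 99,000 times more
                           # probable if the defendant is the source
```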
The major objection to likelihoods is not statistical, but psychological.102 As
with random-match probabilities, they are easily transposed.103 With random-match probabilities, we saw that courts have reasoned that the possibility of transposition does not justify a blanket rule of exclusion. The same issue has not been
addressed directly for likelihood ratios.
3. Posterior probabilities
The likelihood ratio expresses the relative strength of two hypotheses, but the judge
or jury ultimately must assess a different type of quantity—the probability of the
hypotheses themselves. An elementary rule of probability theory known as Bayes’
theorem yields this probability. The theorem states that the odds in light of the data
(here, the observed profiles) are the odds as they were known prior to receiving the
data times the likelihood ratio. More succinctly, posterior odds = likelihood ratio
× prior odds.104 For example, if the relevant match probability105 were 1/100,000,
and if the chance that the laboratory would report a match between samples from
the same source were 0.99, then the likelihood ratio would be 99,000, and the
jury could be told how the DNA evidence raises various prior probabilities that
the defendant’s DNA is in the evidence sample.106
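Bayes’ rule in its odds form, together with the odds-probability conversions described in note 104, can be illustrated in a short sketch. The prior probability below is purely hypothetical; in practice it would have to come from the other evidence in the case.

```python
# Posterior odds = likelihood ratio x prior odds (Bayes' rule in odds form).
# The prior probability used here is hypothetical, chosen only for illustration.

def odds(p):
    """Convert a probability P to odds P/(1 - P)."""
    return p / (1.0 - p)

def probability(o):
    """Convert odds O back to a probability O/(O + 1)."""
    return o / (o + 1.0)

# Note 104's rain example: a probability of 2/3 corresponds to odds of 2 to 1.
print(round(odds(2 / 3), 6))                 # 2.0

likelihood_ratio = 0.99 / (1.0 / 100000.0)   # 99,000, as in the text

prior_p = 0.001                              # hypothetical prior probability
posterior_odds = likelihood_ratio * odds(prior_p)
posterior_p = probability(posterior_odds)
print(round(posterior_p, 4))                 # 0.99 even from a 1-in-1000 prior
```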
102. For legal commentary and additional cases upholding the admission of likelihood ratios over
objections based on Frye and Daubert, see Kaye et al., supra note 1, § 14.2.2.
103. United States v. Thomas, 43 M.J. 626 (A.F. Ct. Crim. App. 1995), provides an example.
In this murder case, a military court described testimony from a population geneticist that “conservatively, it was 76.5 times more likely that the samples . . . came from the victim than from someone
else in the Filipino population.” Id. at 635. Yet, this is not what the DNA testing showed. A more
defensible statement is that “the match between the bloodstains was 76.5 times more probable if the
stains came from the victim than from an unrelated Filipino” or “the match supports the hypothesis
that the stains came from the victim 76.5 times more than it supports the hypothesis that they came
from an unrelated Filipino woman.” Kaye et al., supra note 7, § 14.2.2.
104. Odds and probabilities are two ways to express chances quantitatively. If the probability of
an event is P, the odds are P/(1 − P ). If the odds are O, the probability is O/(O + 1). For instance,
if the probability of rain is 2/3, the odds of rain are 2 to 1 because (2/3)/(1 − 2/3) = (2/3)/(1/3) =
2. If the odds of rain are 2 to 1, then the probability is 2/(2 + 1) = 2/3.
105. By “relevant match probability,” we mean the probability of a match given a specified type
of kinship or the probability of a random match in the relevant suspect population. For relatives more
distantly related than second cousins, the probability of a chance match is nearly as small as for persons
of the same subpopulation. Devlin & Roeder, supra note 63, § 18-3.1.3, at 724.
106. If this procedure is followed, the analyst could explain that these calculations rest on many
premises, including the premise that the genotypes have been correctly determined. See, e.g., Richard
Lempert, The Honest Scientist’s Guide to DNA Evidence, 96 Genetica 119 (1995). If the jury accepted
these premises and also decided to accept the hypothesis of identity over those of kinship and coincidence, it still would be open to the defendant to offer explanations of how the forensic samples came
to include his or her DNA even though he or she is innocent.
One difficulty with this use of Bayes’ theorem is that the computations consider only one alternative to the claim of identity at a time. As indicated earlier,
however, several rival hypotheses might apply in a given case. If the DNA in the
crime scene sample is not the defendant’s, is it from his father, his brother, his
uncle, or another relative? Is the true source a member of the same subpopulation? Or is the source a member of a different subpopulation in the same general
population? In principle, the likelihood ratio can be generalized to a likelihood
function that takes on suitable values for every person in the world, and the prior
probability for each person can be cranked into a general version of Bayes’ rule to
yield the posterior probability that the defendant is the source. In this vein, some
commentators suggest that Bayes’ rule be used to combine the various likelihood
ratios for all possible degrees of kinship and subpopulations.107
As with likelihood ratios, Bayes’ rule is routine in cases involving parentage
testing. Some courts have held that the “probability of paternity” derived from
the formula is inadmissible in criminal cases, but most have reached the opposite
conclusion, at least when the prior odds used in the calculation are disclosed to
the jury.108 An extended literature has grown up on the subject of how posterior
probabilities might be useful in criminal cases.109
G. Verbal Expressions of Probative Value
Having surveyed the issues related to the value and dangers of probabilities
and statistics for DNA evidence, we turn to a related issue that can arise under
Rules 702 and 403: Should an expert be permitted to offer a nonnumerical
judgment about the DNA profiles? Many courts have held that a DNA match is
inadmissible unless the expert attaches a scientifically valid number to the match.
Indeed, some opinions state that this requirement flows from the nature of science
itself. However, this view has been challenged,110 and not all courts agree that
an expert must explain the power of a DNA match in purely numerical terms.
107. David J. Balding, Weight-of-Evidence for Forensic DNA Profiles (2005); David J. Balding
& Peter Donnelly, Inference in Forensic Identification, 158 J. Royal Stat. Soc’y Ser. A 21 (1995); cf.
Lempert, supra note 69, at 458 (describing a similar procedure).
108. Kaye et al., supra note 1, § 14.3.2.
109. See id.; 1 McCormick on Evidence, supra note 3, § 211; David H. Kaye, Rounding Up
the Usual Suspects: A Legal and Logical Analysis of DNA Database Trawls, 87 N.C. L. Rev. 425 (2009)
(defending a Bayesian presentation by a defendant identified by a “cold hit” in a DNA database).
110. See, e.g., Commonwealth v. Crews, 640 A.2d 395, 402 (Pa. 1994) (explaining that “[t]he
factual evidence of the physical testing of the DNA samples and the matching alleles, even without
statistical conclusions, tended to make appellant’s presence more likely than it would have been without the evidence, and was therefore relevant.”). The 1996 NRC committee wrote that science only
demands “underlying data that permit some reasonable estimate of how rare the matching characteristics actually are,” and “[o]nce science has established that a methodology has some individualizing
power, the legal system must determine whether and how best to import that technology into the trial
process.” NRC II, supra note 9, at 192.
1. “Rarity” or “strength” testimony
Instead of presenting numerical frequencies or match probabilities, a scientist
could characterize a 13-locus STR profile as “rare,” “extremely rare,” or the
like. Instead of quoting a numerical likelihood ratio, the analyst could refer to
the match as “powerful,” “very strong evidence,” and so on. At least one state
supreme court has endorsed this qualitative approach as a substitute for the presentation of quantitative estimates.111
2. Source or uniqueness testimony
The most extreme case of a purely verbal description of the infrequency of a profile occurs when that profile can be said to be unique. Of course, the uniqueness
of any object, from a snowflake to a fingerprint, in a population that cannot be
enumerated never can be proved directly. As with all sample evidence, one must
generalize from the sample to the entire population. There is always some probability that a census would prove the generalization to be false. Over a decade
ago, the second NRC committee therefore wrote that “[t]here is no ‘bright-line’
standard in law or science that can pick out exactly how small the probability of
the existence of a given profile in more than one member of a population must be
before assertions of uniqueness are justified. . . . There might already be cases in
which it is defensible for an expert to assert that, assuming that there has been no
sample mishandling or laboratory error, the profile’s probable uniqueness means
that the two DNA samples come from the same person.”112 Before concluding
that a DNA profile is unique in a given population, however, a careful expert
also should consider not only the random-match probability (which pertains to
unrelated individuals) but also the chance of a match to a close relative. Indeed,
the possible existence of an unknown, identical twin also means that a scientist
never can be absolutely certain that crime scene evidence could have come from
only the defendant.
Courts have accepted or approved of expert assertions of uniqueness or of
individual source identification.113 For these assertions to be justified, a large
111. State v. Bloom, 516 N.W.2d 159, 166–67 (Minn. 1994) (“Since it may be pointless to
expect ever to reach a consensus on how to estimate, with any degree of precision, the probability
of a random match and that, given the great difficulty in educating the jury as to precisely what that
figure means and does not mean, it might make sense to simply try to arrive at a fair way of explaining
the significance of the match in a verbal, qualitative, nonquantitative, nonstatistical way.”). A related
question is whether an expert should be allowed to declare a match without adding any information on
how common or rare the profile is. For discussion of such pure “defendant-not-excluded” testimony,
see United States v. Morrow, 374 F. Supp. 2d 51 (D.D.C. 2005); Kaye et al., supra note 1, § 15.4.
112. NRC II, supra note 9, at 194.
113. E.g., United States v. Davis, 602 F. Supp. 2d 658 (D. Md. 2009) (“the random match
probability figures . . . are sufficiently low so that the profile can be considered unique”); People v.
Baylor, 118 Cal. Rptr. 2d 518, 522 (Ct. App. 2002) (testimony that “defendant had a unique DNA
number of sufficiently polymorphic loci must have been tested, making the probabilities of matches to both relatives and unrelated individuals so tiny that the
probability of finding another person who could be the source within the relevant
population is negligible.114
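The relationship between a tiny random-match probability and the negligible chance of another matching person can be made concrete with a rough sketch. It assumes, unrealistically, that the N other members of the population are unrelated and match independently with probability p; both numbers below are hypothetical.

```python
# Rough illustration: even a very small random-match probability p does not
# by itself make a profile unique in a large population. Assumes n unrelated
# individuals, each matching independently with probability p (a simplification).

def prob_another_match(p, n):
    """P(at least one of n others shares the profile) = 1 - (1 - p)**n."""
    return 1.0 - (1.0 - p) ** n

n = 300_000_000                          # hypothetical population size

print(prob_another_match(1e-9, n))       # about 0.26: a one-in-a-billion match
                                         # probability leaves a sizable chance
                                         # that someone else shares the profile
print(prob_another_match(1e-15, n))      # about 3e-7: negligible
```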
V. Special Issues in Human DNA Testing
A. Mitochondrial DNA
Mitochondria are small structures, with their own membranes, found inside the
cell but outside its nucleus. Inside these organelles, molecules are broken down to
supply energy. Mitochondria have a small genome—a circle of 16,569 nucleotide
base pairs within the mitochondrion—that bears no relation to the comparatively
monstrous chromosomal genome in the cell nucleus.115
Mitochondrial DNA (mtDNA) has four features that make it useful for
forensic DNA testing. First, the typical cell, which has but one nucleus, contains
hundreds or thousands of nearly identical mitochondria. Hence, for every copy of
profile that ‘probably does not exist in anyone else in the world.’”); State v. Hauge, 79 P.3d 131 (Haw.
2003) (uniqueness); Young v. State, 879 A.2d 44, 46 (Md. 2005) (holding that “when a DNA method
analyzes genetic markers at sufficient locations to arrive at an infinitesimal random match probability,
expert opinion testimony of a match and of the source of the DNA evidence is admissible”; hence,
it was permissible to introduce a report providing no statistics but stating that “(in the absence of an
identical twin), Anthony Young (K1) is the source of the DNA obtained from the sperm fraction of
the Anal Swab (R1).”); State v. Buckner, 941 P.2d 667, 668 (Wash. 1997) (finding that in light of 1996
NRC Report, “we now conclude there should be no bar to an expert giving his or her expert opinion
that, based upon an exceedingly small probability of a defendant’s DNA profile matching that of
another in a random human population, the profile is unique.”).
114. We apologize for the length of this sentence, but there are three distinct probabilities that
arise in speaking of the uniqueness of DNA profiles. First, there is the probability of a match to a single,
randomly selected individual in the population. This is the random-match probability. Second, there is
the probability that the particular profile is unique. This probability involves pairing the profile with
every member of the population. Third, there is the probability that all profiles in the population are distinct from one another.
The first probability is larger than the second, which is many times larger than the third. Uniqueness
or source testimony need only establish that the one DNA profile in the trace evidence is unique—and
not that all DNA profiles are unique. Thus, it is the second probability, properly computed, that must
be quite small to warrant the conclusion that no one but the defendant (and any identical twins) could
be the source of the crime scene DNA. See David H. Kaye, Identification, Individuality, and Uniqueness:
What’s the Difference? 8 Law, Probability & Risk 85 (2009).
Formulas for estimating all these probabilities are given in NRC II, supra note 9, but DNA
analysts and judges sometimes infer uniqueness on the basis of incorrect intuitions about the size of
the random-match probability. See Balding, supra note 107, at 148 (2005) (describing “the uniqueness fallacy”); cf. State v. Lee, 976 So. 2d 109, 117 (La. 2008) (incorrect but harmless miscalculation).
115. Mitochondria probably started out as bacteria that were engulfed by cells eons ago. Some
of their genes have migrated to the chromosomes, but STR and other DNA sequences in the nucleus
are not physically or statistically associated with the sequences of the DNA in the mitochondria.
chromosomal DNA, there are hundreds or thousands of copies of mitochondrial
DNA. This means that it is possible to detect mtDNA in samples, such as bone
and hair shafts, that contain too little nuclear DNA for conventional typing.
Second, two “hypervariable” regions that tend to be different in different
individuals lie within the “control region” or “D-loop” (displacement loop) of
the mitochondrial genome.116 These regions extend for a bit more than 300 base
pairs each—short enough to be typable even in highly degraded samples such as
very old human remains.
Third, mtDNA comes solely from the egg cell.117 For this reason, mtDNA
is inherited maternally, with no fatherly contribution:118 Siblings, maternal half-siblings, and others related through maternal lineage normally possess the same
mtDNA sequence. This feature makes mtDNA particularly useful for associating
persons related through their maternal lineage. It has been exploited to identify the
remains of the last Russian tsar and other members of the royal family, of soldiers
missing in action, and of victims of mass disasters.
Finally, point mutations accumulate in the noncoding D-loop without altering how the mitochondrion functions. Hence, a single individual can develop
distinct internal populations of mitochondria.119 As discussed below, this phenomenon, known as heteroplasmy, complicates the interpretation of mtDNA
sequences. Yet, it is mutations that make mtDNA polymorphic and hence useful
in identifying individuals. Over time, mutations in egg cells can propagate to
later generations, producing more heterogeneity in mitochondrial genomes in the
human population.120 This polymorphism allows scientists to compare mtDNA
from crime scenes to mtDNA from given individuals to ascertain whether the
tested individuals are within the maternal line (or another coincidentally matching
maternal line) of people who could have been the source of the trace evidence.
The small mitochondrial genome can be analyzed with a PCR-based method
that gives the order of all the base pairs.121 The sequences of two samples—say,
DNA extracted from a hair shaft found at a crime scene and hairs plucked from
a suspect—then can be compared. Most analysts describe the results in terms of
116. A third, somewhat less polymorphic, region in the D-loop can be used for additional discrimination. The remainder of the control region, although noncoding, consists of DNA sequences
that are involved in the transcription of the mitochondrial genes. These control sequences are essentially the same in everyone (monomorphic).
117. The relatively few mitochondria in the spermatozoan that fertilizes the egg cell soon
degrade and are not replicated in the multiplying cells of the pre-embryo.
118. The possibility of paternal contributions to mtDNA in humans is discussed in, e.g., John
Buckleton et al., Nonautosomal Forensic Markers, in Forensic DNA Evidence Interpretation 299, 302
(John Buckleton et al. eds., 2005).
119. A single tissue has only one mitotype; another tissue from the same individual might have
another mitotype; a third might have both mitotypes.
120. Evolutionary studies suggest an average mutation rate for the mtDNA control region of as
little as one nucleotide difference every 300 generations, or one difference every 6000 years.
121. Other methods to ascertain the base-pair sequences also are available.
inclusions and exclusions, although, in principle, a likelihood ratio is better suited
to cases in which there are slight sequence differences.122 In the simplest case, the
two sequences show a good number of differences (a clear exclusion), or they are
identical (an inclusion). In such cases, mitotyping can exclude individuals as the
source of stray hairs even when the hairs are microscopically indistinguishable.123
As with nuclear DNA, to indicate the significance of the match, analysts usually estimate the frequency of the sequence in some population. The estimation
procedure is actually much simpler with mtDNA. It is not necessary to combine
any allele frequencies because the entire mtDNA sequence, whatever its internal
structure may be, is inherited as a single unit (a “haplotype”). In other words, the
sequence itself is like a single allele, and one can simply see how often it occurs
in a sample of unrelated people.124
Laboratories therefore refer to databases of mtDNA sequences to see how
often the type in question has been seen before. Often, the mtDNA sequence
from the crime scene is not represented in the database, indicating that it is a
relatively rare sequence. For example, in State v. Pappas,125 the reference database
consisted of 1219 mtDNA sequences from whites, and it did not include the
sequence that was present in the hairs near the crime scene and in the defendant.
Thus, this particular sequence was observed once (at the crime scene) out of 1220
times (adding the new sequence to the 1219 different sequences on file). This
would correspond to a population frequency of 0.082%. However, to account for
sampling error (the inevitable differences between random samples and the population from which they are drawn), a laboratory might use a slightly different estimate. In general, laboratories count the occurrences in the database and take the
upper end of a 95% confidence interval around the corresponding proportion.126
Applying this logic, an FBI analyst in Pappas testified to “the maximum match
probability . . . of three in 1000. . . . [O]ut of 1000 randomly selected persons, it
could be expected that three persons would share the same mtDNA type as the
defendant.”127 The basic idea is that even if 3/1000 people in the white population have the sequence, there still is a 5% chance that it would not show up in
a specific (randomly drawn) database of size 1219; hence, 3/1000 is a reasonable
122. The likelihood-ratio approach is developed in Buckleton et al., supra note 118.
123. The implications of this fact for the admissibility of microscopic hair analysis is discussed
in Kaye, supra note 3.
124. In this context, “unrelated people” means individuals with a different maternal lineage.
125. 776 A.2d 1091 (Conn. 2001).
126. The Reference Manual on Statistics discusses the meaning of a confidence interval. It has
been argued that instead of using x/N in forming the confidence interval, one should use the proportion (x + 1)/(N + 1), where x is the number of matching sequences in the database and N is the size
of the database. After all, following the testing, x + 1 is the number of times that the sequence has
been seen in N + 1 individuals. This is the reasoning that produced the point estimate of 1/1220 rather
than 0/1219. For large databases, this alteration will make little difference in the confidence interval.
127. 776 A.2d at 1111 (emphasis added).
upper estimate for the population frequency.128 If the population frequency of
the sequence in unrelated whites were much larger, the chance that the sequence
would have been missed in the database sampling would be even less than 5%.
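The “three in 1000” figure in Pappas reflects a standard rule of thumb: when a sequence is absent from a database of size N, the upper end of a 95% confidence interval for its population frequency is roughly 3/N (see note 128). A short sketch of that logic, using the Pappas database size:

```python
# Upper 95% bound on an mtDNA sequence frequency when the sequence is absent
# from a reference database of size N (the "rule of three"; see note 128).

N = 1219                       # size of the white reference database in Pappas

point_estimate = 1.0 / (N + 1)           # the (x+1)/(N+1) estimate of note 126
print(round(point_estimate, 5))          # 0.00082, the 0.082% figure in the text

upper_bound = 3.0 / N                    # rule-of-three approximation
print(round(upper_bound, 4))             # 0.0025, i.e., about 3 in 1000

# Check: if the true frequency really were 3/1000, the chance of seeing
# zero copies in a random database of 1219 would be under 5%.
p = 0.003
prob_zero_in_database = (1.0 - p) ** N
print(round(prob_zero_in_database, 3))   # about 0.026, below 0.05

# Exact solution of (1 - p)**N = 0.05 shows why 3/N works:
p_exact = 1.0 - 0.05 ** (1.0 / N)
print(round(p_exact * N, 2))             # just under 3 (ln 20 is about 3)
```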
Computations that rely on databases of major population groups (such as
whites) assume that the reference database is a representative sample of a population of unrelated individuals who might have committed the alleged crime. This
assumption is justified if there has been sufficient random mating within the racial
population. In principle, the adjustment that accounts for population structure (see
supra Section IV.E.3) could be used, but how large the adjustment should be is not
clear.129 Statistics derived from many databases from different locations also have
been proposed.130 An alternative is to develop local databases that would reflect
the proportion of all the people in the vicinity of the crime possessing each possible mitotype.131 Until these databases exist, an expert might give rather restricted
quantitative testimony. In Pappas, for example, the expert could have said that
the hairs and the defendant have the same mitotype and that this mitotype did
not appear in a group of 1219 other people in a national sample, and the expert
could have refrained from offering any estimate of the frequency in all whites.
This restricted presentation suggests that the match has some probative value, but
a court might need to consider whether it is sufficient to leave it to the jury to
decide how to weigh the fact of the match and the absence of the same sequence
in a convenience sample that might—or might not—be representative of the local
white population.
Another issue is heteroplasmy. The simple inclusion-exclusion approach must
be modified to account for the fact that the same individual can have detectably
different mitotypes in different tissues or even in different cells in the same tissue.
To understand the implications of heteroplasmy, we need to understand how it
comes into existence.132 Heteroplasmy can occur because of mtDNA mutations
during the division of adult cells, such as those at the roots of hair shafts. These
new mitotypes are confined to the individual. They will not be passed on to
future generations. Heteroplasmy also can result from a mutation contained in the
egg cell that grew into an individual. Such mutations can make their way into
succeeding generations, establishing new mitotypes in the population. But this is
128. In general, if the sequence does not exist in the database of size N, the upper 95% confidence limit is approximately 3/N. E.g., J.A. Hanley & A. Lipp-Hand, If Nothing Goes Wrong, Is
Everything All Right? Interpreting Zero Numerators, 249 JAMA 1743 (1983). In Pappas, 3/N is 3/1219 =
0.25%, which rounds off to the 3 per 1000 figure quoted by the FBI analyst.
129. See Buckleton et al., supra note 118.
130. T. Egeland & A. Salas, Statistical Evaluation of Haploid Genetic Evidence, 1 Open Forensic
Sci. J. 4 (2008).
131. Id.; see also F.A. Kaestle et al., Database Limitations on the Evidentiary Value of Forensic Mitochondrial DNA Evidence, 43 Am. Crim. L. Rev. 53 (2006).
132. An entertaining discussion can be found in Brian Sykes, The Seven Daughters of Eve: The
Science That Reveals Our Genetic Ancestry 55–57, 62, 77–78 (2001).
an uncertain process. Egg cells contain many mitochondria, and the mature egg
cell will not contain just the mutation—it will house a mixed population of the
old-style mitochondria and a number of the mutated ones (with DNA that usually
differs from the original at a single base pair). Figuratively speaking, the original
mtDNA sequence and the mutated version fight it out for several generations until
one of them becomes “fixed” in the population. In the interim, the progeny of
the mutated egg cell will harbor both strains of mitochondria.
When mtDNA from a crime scene sample is compared to a suspect’s sample,
there are three possibilities: (1) neither sample is detectably heteroplasmic; (2) one
sample displays heteroplasmy, but the other does not; (3) both samples display
heteroplasmy. In each scenario, the comparison can produce an exclusion or an
inclusion:
1. Neither sample heteroplasmic. In the first situation, if the sequence in the
crime scene sample is markedly different from the sequence in the suspect’s
sample, then the suspect is excluded. But heteroplasmy could be the reason for a difference of only a single base or so. For example, the sequence
in a hair shaft coming from the suspect could be a slight mutation of the
dominant sequence in the suspect. Therefore, the FBI treats a difference
at a single base pair as inconclusive.133 When the one mtDNA sequence
characteristic of each sample is identical, the issue becomes how to use the
reference database of mtDNA sequences, as discussed above.
2. Suspect’s sample heteroplasmic, crime scene sample not. One version of the second scenario arises when heteroplasmy is seen in the suspect’s tissues but
not in the crime scene sample. If the crime scene sequence is not close
to either of the suspect’s sequences, then the suspect is excluded. If it is
identical to one of the suspect’s sequences, then the suspect is included,
and a suitable reference database should indicate how infrequent such an
inclusion would be. If crime scene DNA is one base pair removed from
either of the suspect’s sequences, then the result is inconclusive.
133. Scientific Working Group on DNA Analysis Methods (SWGDAM), Guidelines for Mitochondrial DNA (mtDNA) Nucleotide Sequence Interpretation, Forensic Sci. Comm., Apr. 2003, available
at http://www.fbi.gov/hq/lab/fsc/backissu/april2003/swgdammitodna.htm. But see Vaughn v. State,
646 S.E.2d 212, 215 (Ga. 2007) (apparently transforming the statement that a suspect “cannot be
excluded” when “there is a single base pair difference” into “a match”). These inconclusive sequences
contribute to the number of people who would not be excluded. Therefore, in Pappas, it is misleading to conclude “that approximately 99.75% of the Caucasian population could be excluded as the
source of the mtDNA in the sample.” 776 A.2d 1091, 1104 (Conn. 2001) (footnote omitted). This
percentage neglects the individuals whose mtDNA sequences are off by one base pair. Along with
the 0.25% who are included because their mtDNA matches completely, these one-off people would
not be excluded. An analyst who speaks of the fraction of people who would not be excluded should
report a nonexclusion rate that accounts for these inconclusive cases. Of course, the difference may
be fairly small. In Pappas, a defense expert reported that the actual nonexclusion rate was still “99.3
percent of the Caucasian population.” Id. at 1105 (footnote omitted). See Kaye et al., supra note 83.
3. Both samples heteroplasmic. In this third scenario, multiple sequences are seen
in each sample. To keep track of things, we can call the sequences in the
crime scene sample C1 and C2, and those in the suspect’s sample S1 and
S2. If either C1 or C2 is very different from both S1 and S2, the suspect is
excluded. If C1 and C2 are the same as S1 and S2, the suspect is included.
Because detectable heteroplasmy is not very common, this inclusion is
stronger evidence of identity than the simple match in the first scenario.
Finally, in the middle range, where C1 is very close to S1 or S2, or C2 is
very close to S1 or S2, the result is inconclusive.
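The comparison logic running through the three scenarios can be summarized in a short sketch. Treating exactly one base-pair difference as inconclusive follows the FBI practice described for scenario 1; applying the same threshold throughout is a simplification, and the actual SWGDAM interpretation guidelines are more nuanced.

```python
# Illustrative classification of an mtDNA comparison, following the three
# scenarios in the text. A difference of 0 bases is treated as an inclusion,
# 1 base as inconclusive (possible heteroplasmy), and 2 or more as an exclusion.
# This simplifies the actual SWGDAM interpretation guidelines.

def base_differences(seq1, seq2):
    """Count differing positions between two aligned sequences of equal length."""
    return sum(a != b for a, b in zip(seq1, seq2))

def classify(crime_seqs, suspect_seqs):
    """Compare each crime scene sequence against the suspect's sequence(s)."""
    best = [min(base_differences(c, s) for s in suspect_seqs)
            for c in crime_seqs]
    if any(d >= 2 for d in best):
        return "exclusion"       # some crime scene sequence matches nothing
    if all(d == 0 for d in best):
        return "inclusion"       # every crime scene sequence is accounted for
    return "inconclusive"        # a one-base difference could be heteroplasmy

print(classify(["ACGT"], ["ACGT"]))          # inclusion (scenario 1)
print(classify(["ACGA"], ["ACGT"]))          # inconclusive (one base off)
print(classify(["ACGA", "TCGA"], ["ACGT"]))  # exclusion
```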
A number of courts have rejected objections that the methods for mtDNA
sequencing do not comport with Frye134 or Daubert135 and that the phenomenon
of heteroplasmy or the limitations in the statistical analysis preclude the forensic
use of this technology under either Rule 702 or Rule 403.136
B. Y Chromosomes
Y chromosomes contain genes that result in development as a male rather than a
female. Therefore, men are type XY and women are XX. A male child receives
an X chromosome from his mother and a Y from his father; females receive two
different X chromosomes, one from each parent. Like all chromosomes, the Y
chromosome contains STRs and SNPs.
Because there is limited recombination between Y and X chromosomes,
Y-STRs and Y-SNPs are inherited as a single block—a haplotype—from father
to son. This means that the issues of population genetics and statistics are similar to
those for mtDNA. No matter how many Y-STRs are in the haplotype, all the
men in the same paternal line (up to the last mutation giving rise to a new line in
the family tree) would match the crime scene sample.
134. E.g., Magaletti v. State, 847 So. 2d 523, 528 (Fla. Dist. Ct. App. 2003) (“[T]he mtDNA
analysis conducted [on hair] determined an exclusionary rate of 99.93 percent. In other words, the
results indicate that 99.93 percent of people randomly selected would not match the unknown hair
sample found in the victim’s bindings.”); People v. Sutherland, 860 N.E.2d 178, 271–72 (Ill. 2006);
People v. Holtzer, 660 N.W.2d 405, 411 (Mich. Ct. App. 2003); Wagner v. State, 864 A.2d 1037,
1043–49 (Md. Ct. Spec. App. 2005) (mtDNA sequencing admissible despite contamination and
heteroplasmy).
135. E.g., United States v. Beverly, 369 F.3d 516, 531 (6th Cir. 2004) (“The scientific basis
for the use of such DNA is well established.”); United States v. Coleman, 202 F. Supp. 2d 962, 967
(E.D. Mo. 2002) (“‘[a]t the most,’ seven out of 10,000 people would be expected to have that exact
sequence of As, Ts, Cs, and Gs.”), aff’d, 349 F.3d 1077 (8th Cir. 2003); Pappas, 776 A.2d at 1095;
State v. Underwood, 518 S.E.2d 231, 240 (N.C. Ct. App. 1999); State v. Council, 515 S.E.2d 508,
518 (S.C. 1999).
136. E.g., Beverly, 369 F.3d at 531 (“[T]he mathematical basis for the evidentiary power of the
mtDNA evidence was carefully explained, and was not more prejudicial than probative.”); Pappas,
776 A.2d 1091.
Consequently, multiplication of allele frequencies is inappropriate, and an
estimate of how many men might share the haplotype must be based on the
frequency of that one haplotype in a relevant population. Population structure
is a concern, and obtaining a suitable sample to estimate the frequency in a local
population could be a challenge. If such a database is not available, DNA analysts
might consider limiting their testimony on direct examination to the size of the
available database, the population sampled, and the number of individuals in
the database who share the crime scene haplotype. This presentation is less ambitious than a random-match probability, and courts must decide whether it gives
the jury sufficient information to fairly assess the probative value of the match,
which could be substantial.
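A counting presentation of the kind just described can be sketched numerically. The "rule of three" bound for unobserved haplotypes and the normal-approximation bound for observed ones are illustrative statistical choices, not a prescribed forensic standard.

```python
import math

def haplotype_frequency_report(k, n):
    """For a Y-STR haplotype seen k times in a database of n profiles,
    return the point estimate k/n and a rough upper 95% confidence bound."""
    point = k / n
    if k == 0:
        # an unseen haplotype may still have a frequency of roughly 3/n
        upper = 1 - 0.05 ** (1 / n)
    else:
        upper = (k + 1.96 * math.sqrt(k)) / n  # crude normal-approximation bound
    return point, upper
```

A haplotype absent from a database of 1,000 profiles thus has a point estimate of zero but an upper bound near 0.003, which illustrates why the size of the available database matters to the strength of the testimony.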
When a standard DNA profile (involving a reasonable number of STRs or
other polymorphisms of the other chromosomes) is available, there is little reason
to add a Y-STR test. The profiles already are extremely rare. In some cases, however, standard STR typing will fail. Consider, for example, what happens when
a PCR primer that targets an STR locus on, say, chromosome 16 is applied to a
sample that contains a small number of sperm (from, say, a vasectomized man) and
a huge number of cells from a woman who is a victim of sexual assault. Almost
never will the primer lock onto the man’s chromosome 16. Therefore, his alleles
on this chromosome will not produce a detectable peak in an electropherogram.
But a primer for a Y-STR will not bind to the victim’s chromosomes—her
chromosomes swamp the sample, but they are essentially invisible to the Y-STR
primer. Because this primer binds only to the Y chromosomes from the man, only
his STRs will be amplified. This is one example of how Y-STR profiling can be
valuable in dealing with a mixture of DNA from several individuals. The next
section provides other examples and describes other ways in which analysis of the
Y chromosome can be valuable in mixture cases.
Although the statistics and population genetics of Y-STRs are different from
the other STRs, the underlying technology for obtaining the profile is the same.
On this basis, some courts have upheld the admission of these markers.137
C. Mixtures
Samples of biological trace evidence recovered from crime scenes often contain
a mixture of fluids or tissues from different individuals. Examples include vaginal
swabs collected as sexual assault evidence and bloodstain evidence from scenes
where several individuals shed blood. However, not all mixed samples produce
mixed STR profiles.138 Consider a sample in which 99% of the DNA comes from
137. E.g., Shabazz v. State, 592 S.E.2d 876, 879 (Ga. Ct. App. 2004); Curtis v. State, 205
S.W.3d 656, 660–61 (Tex. Ct. App. 2006).
138. The discussion in this section is limited to electropherograms of STR alleles. A recent
paper reports a statistical technique that compares the known SNP genotypes (involving hundreds of
the defendant and 1% comes from a different individual. Even if some of the molecules from the minor contributor come in contact with the polymerase and an
STR is amplified, the resulting signal might be too small to be detected—the peak
in an electropherogram will blend into the background. Because the vast bulk of
the amplified STRs will come from the defendant’s DNA, the electropherogram
should show only one STR profile. In these situations, the interpretation of the
single DNA profile is the same as when 100% of the DNA molecules in the sample
are the defendant’s.
When the mixtures are more evenly balanced among contributors, however,
the STRs from multiple contributors can appear as “extra” peaks. As a rule, because
DNA from a single individual can have no more than two alleles at each locus,139
the presence of three or more peaks at several loci indicates that a mixture of DNA
is in the sample.140 Figure 6 shows another electropherogram from DNA recovered
in People v. Pizarro.141 The fact that there are as many as four alleles at some loci (and
that many of the peaks match the victim’s) suggests that the sample is a mixture of
the victim’s and another person’s DNA. Furthermore, a peak at the amelogenin
locus shows that male DNA is part of the mixture. Because all the peaks that do not
match the victim are part of the defendant’s STR profile, the mixture is consistent
with the state’s theory that the defendant raped the victim.
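The inference that the Figure 6 sample is a mixture rests on a simple counting rule: one person contributes at most two alleles per locus. A minimal sketch (the function name and input format are illustrative):

```python
import math

def min_contributors(allele_counts):
    """Lower bound on the number of contributors to a mixed sample:
    a locus showing c distinct alleles requires at least ceil(c / 2) people."""
    return max(math.ceil(c / 2) for c in allele_counts)
```

A profile showing up to four alleles at some loci, as in Pizarro, requires at least two contributors; loci showing only one or two alleles add nothing to the bound.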
Five approaches are available to cope with detectable mixtures. First, if a
laboratory has other samples that do not show evidence of mixing, it can avoid
the problem of deciphering the convoluted set of profiles. Even across a single
stain, the proportions of a mixture can vary, and it might be possible to extract a
DNA sample that does not produce a mixed signal.
Second, a chemical procedure exists to separate the DNA from sperm from
a rape victim’s vaginal epithelial cell DNA.142 When this procedure works, the
thousands of SNPs) of a set of individuals to the SNPs detected in complex mixtures. The report states
that the technique is able to discern “whether an individual is within a series of complex mixtures (2 to
200 individuals) when the individual contributes trace levels (at and below 1%) of the total genomic
DNA.” Nils Homer et al., Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex
Mixtures Using High-Density SNP Genotyping Microarrays, 4 PLoS Genetics No. 8 (2008), available at
http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1000167.
139. This follows from the fact that individuals inherit chromosomes in pairs, one from each parent. An individual who inherits the same allele from each parent (a homozygote) can contribute only that
one allele to a sample, and an individual who inherits a different allele from each parent (a heterozygote)
will contribute those two alleles. Finding three or more alleles at several loci therefore indicates a mixture.
140. On rare occasions, an individual exhibits a phenotype with three alleles at a locus. This
can be the result of a chromosome anomaly (such as a duplicated gene on one chromosome or a
mutation). A sample from such an individual is usually easily distinguished from a mixed sample. The
three-allele variant is seen at only the affected locus, whereas with mixtures, more than two alleles
typically are evident at several loci.
141. See supra Figure 5.
142. The nucleus of a sperm cell lies behind a protective structure that does not break down as
easily as the membrane in an epithelial cell. This makes it possible to disrupt the epithelial cells first
and extract their DNA, and then to use a harsher treatment to disrupt the sperm cells.
Figure 6. Electropherogram in People v. Pizarro that can be interpreted as a
mixture of DNA from the victim and the defendant.
Source: Steven Myers and Jeanette Wallin, California Department of Justice, provided the image.
laboratory can assign the DNA profiles to the different individuals because it has
created, in effect, two samples that are not mixtures.
Third, in sexual assault cases, Y chromosome testing can reveal the number
of men (from different paternal lines) whose DNA is being detected and whether
the defendant’s Y chromosome is consistent with his being in one of these paternal
lines.143 Because only males have Y chromosomes, the female DNA in a mixture
has no effect.
Fourth, a laboratory simply can report that a defendant’s profile is consistent
with the mixed profile, and it can provide an estimate of the proportion of the
relevant population that also cannot be excluded (or would be included).144 When
143. E.g., State v. Polizzi, 924 So. 2d 303, 308–09 (La. Ct. App. 2006) (comparing Y-STRs from
“the genital swab with the DNA profile from the Defendant’s buccal swab, . . . the Defendant or any
of his paternal relatives could not be excluded as having been a donor to the sample from the victim,”
while “99.7 percent of the Caucasian population, 99.8 percent of the African American population,
and 99.3 percent of the Hispanic population could be excluded as donors of the DNA in the sample.”).
144. E.g., State v. Roman Nose, 667 N.W.2d 386, 394 n.5 (Minn. 2003). If the laboratory
can explain why one or more of the defendant’s alleles do not appear in the mixed profile from the
an individual’s DNA—for example, the victim’s—is known to be in a two-person
crime scene sample, the profile of the unknown person is readily deduced. In
those situations, the analysis of a remaining single-person profile can proceed
in the ordinary fashion.
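The statistic used in the fourth approach, often called the combined probability of inclusion, can be sketched as follows, assuming Hardy-Weinberg proportions and independence across loci (the frequencies below are invented for illustration):

```python
def inclusion_probability(loci):
    """loci: one list per locus holding the population frequencies of the
    alleles observed in the mixture there. Returns the probability that a
    random, unrelated person's genotype is fully contained in the mixture."""
    cpi = 1.0
    for freqs in loci:
        s = sum(freqs)
        cpi *= s * s  # both of a random person's alleles fall in the observed set
    return cpi
```

With observed-allele frequencies summing to 0.45 at one locus and 0.15 at another, the combined inclusion probability is (0.45 ** 2) * (0.15 ** 2), or roughly 0.46% of the relevant population not excluded.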
Finally, a laboratory can try to determine (or make assumptions about) how
many contributors are present and then deduce which set of alleles is likely to
be from each contributor. To accomplish this, DNA analysts look to such clues
as the number of peaks in an expected allele-size range and the imbalance in the
heights of the peaks.145 A good deal of judgment can go into the determination
of which peaks are real, which are artifacts, which are “masked,” and which are
absent for some other reason.146 Courts generally have rejected arguments that
mixture analysis is so unreliable or so open to manipulation that the results are
inadmissible.147 In addition, expert computer systems have been devised for facilitating the analysis and for automatically “deconvoluting” mixtures.148 Once they
are validated, these systems can make the process more standardized.
The five approaches listed here are not mutually exclusive (and not all apply
to every case). When the number of contributors to a mixture is in doubt, for
example, a laboratory is not limited to giving the overall probability of excluding (or including) an individual as a possible contributor (the statistic mentioned
as part of the fourth method). The 1996 NRC report observed that “when the
contributors to a mixture are not known or cannot otherwise be distinguished, a
likelihood-ratio approach offers a clear advantage [over the simplistic exclusion-inclusion statistic] and is particularly suitable.”149 Despite the arguments of some
crime scene, it might be willing to declare a match notwithstanding this discrepancy. Of course, as
the number of alleles that must be present for there to be a match declines, the proportion of the
population that would be included goes up.
145. See, e.g., Roberts v. United States, 916 A.2d 922, 932–35 (D.C. 2007) (holding such
inferences to be admissible).
146. The proportion of the population included in a mixture and the likelihood ratios conditioned on a particular genotype do not take into account the other possible genotypes that the expert
eliminated in a subjective analysis. William C. Thompson, Painting the Target Around the Matching
Profile: The Texas Sharpshooter Fallacy in Forensic DNA Interpretation, 8 Law, Probability & Risk 257
(2009). Adhering to preestablished standards and protocols for interpreting mixtures reduces the range
of judgment in settling on the most likely set of genotypes to consider. Recent recommendations
appear in Bruce Budowle et al., Mixture Interpretation: Defining the Relevant Features for Guidelines for the
Assessment of Mixed DNA Profiles in Forensic Casework, 54 J. Forensic Sci. 810 (2009) (with commentary
at 55 J. Forensic Sci. 265 (2010)); Peter Gill et al., National Recommendations of the Technical UK DNA
Working Group on Mixture Interpretation for the NDNAD and for Court Going Purposes, 2 Forensic Sci.
Int’l Genetics 76 (2008).
147. Roberts, 916 A.2d at 932 n.9 (citing cases).
148. See, e.g., Tim Clayton & John Buckleton, Mixtures, in Forensic DNA Evidence Interpretation 217 (John Buckleton et al. eds., 2005); Mark W. Perlin et al., Validating TrueAllele® DNA Mixture
Interpretation, 56 J. Forensic Sci. (forthcoming 2011).
149. NRC II, supra note 9, at 129.
legal commentators that likelihood ratios are inherently prejudicial,150 and despite
objections based on Frye or Daubert, almost all courts have found likelihood ratios
admissible in mixture cases.151
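In the simplest setting, a locus showing four alleles where the victim's two alleles are known, the likelihood ratio reduces to a one-line formula. The setup (victim of type a/b, defendant of type c/d, an unknown unrelated alternative contributor, Hardy-Weinberg proportions) is an illustrative assumption:

```python
def mixture_lr_four_alleles(p_c, p_d):
    """Mixture shows alleles {a, b, c, d}; the victim is type a/b.
    Hp (victim + defendant of type c/d) explains the mixture with certainty;
    Hd (victim + unknown) requires the unknown to be type c/d, an event of
    probability 2 * p_c * p_d, so LR = 1 / (2 * p_c * p_d)."""
    return 1.0 / (2.0 * p_c * p_d)
```

If the two non-victim alleles have frequencies 0.1 and 0.05, the likelihood ratio is 100: the mixed profile is 100 times more probable if the defendant contributed to it than if an unknown person did.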
D. Offender and Suspect Database Searches
1. Which statistics express the probative value of a match to a defendant
located by searching a DNA database?
States and the federal government are amassing huge databases consisting of the
DNA profiles of suspected or convicted offenders.152 If the DNA profile from a
crime scene stain matches one of those on file, the person identified by this “cold
hit” will become the target of the investigation. Prosecution may follow.
These database-trawl cases can be contrasted with traditional “confirmation cases” in which the defendant already was a suspect and the DNA testing
provided additional evidence against him. In confirmation cases, statistics such as
the estimated frequency of the matching DNA profile in various populations, the
equivalent random-match probabilities, or the corresponding likelihood ratios can
be used to indicate the probative value of the DNA match.153
In trawl cases, however, an additional question arises—does the fact that the
defendant was selected for prosecution by trawling require some adjustment to
the usual statistics? The legal issues are twofold. First, is a particular quantity—be
it the unadjusted random-match probability or some adjusted probability—scientifically valid (or generally accepted) in the case of a database search? If not, it
must be excluded under the Daubert (or Frye) standards. Second, is the statistic
irrelevant or unduly misleading? If so, it must be excluded under the rules that
150. E.g., William C. Thompson, DNA Evidence in the O.J. Simpson Trial, 67 U. Colo. L. Rev.
827, 855–56 (1996); see also R.C. Lewontin, Population Genetic Issues in the Forensic Use of DNA, in 1
Modern Scientific Evidence: The Law and Science of Expert Testimony § 17-5.0, at 703–05 (Faigman
et al. eds, 1st ed. 1998).
151. E.g., State v. Garcia, 3 P.3d 999 (Ariz. Ct. App. 1999) (likelihood ratios admissible under
Frye to explain mixed sample); Commonwealth v. Gaynor, 820 N.E.2d 233, 252 (Mass. 2005) (“Likelihood ratio analysis is appropriate for test results of mixed samples when the primary and secondary
contributors cannot be distinguished. . . . It need not be applied when a primary contributor can
be identified.”) (citation omitted); People v. Coy, 669 N.W.2d 831, 835–39 (Mich. Ct. App. 2003)
(incorrectly treating mixed-sample likelihood ratios as a part of the statistics on single-source DNA
matches that had already been held to be generally accepted); State v. Ayers, 68 P.3d 768, 775 (Mont.
2003) (affirming trial court’s admission of expert testimony where expert used likelihood ratios to
explain DNA results from a sample known to contain a mixture of DNA); cf. Coy v. Renico, 414 F.
Supp. 2d 744, 762–63 (E.D. Mich. 2006) (stating that the use of likelihood ratio and other statistics
for a mixed stain in People v. Coy, supra, was sufficiently accepted in the scientific community to be
consistent with due process).
152. See supra Section II.E.
153. On the computation and admissibility of such statistics, see supra Section IV.
require all evidence to be relevant and not unfairly prejudicial. To clarify, we
summarize the statistical literature on this point. Then, we describe the emerging case law.
a. The statistical analyses of adjustment
All statisticians agree that, in principle, the search strategy affects the probative
value of a DNA match. One group describes and emphasizes the impact of the
database match on the hypothesis that the database does not contain the source of
the crime scene DNA. This is a “frequentist” view. It asks how frequently searches
of innocent databases—those for which the true source is someone outside the
database—will generate cold hits. From this perspective, trawling is a form of
“data mining” that produces a “selection effect” or “ascertainment bias.” If we
pick a lottery ticket at random, the probability p that we have the winning ticket is
negligible. But if we search through all the tickets, sooner or later we will find the
winning one. And even if we search through some smaller number N of tickets,
the probability of picking a winning ticket is no longer p, but Np.154 Likewise, if
DNA from N innocent people is examined to determine if any of them match the
crime scene DNA, then the probability of a match in this group is not p, but some
quantity that could be as large as Np. This type of reasoning led the 1996 NRC
committee to recommend that “[w]hen the suspect is found by a search of DNA
databases, the random-match probability should be multiplied by N, the number
of persons in the database.”155 The 1992 committee157 and the FBI’s former DNA
Advisory Board156 took a similar position.
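The frequentist adjustment is elementary arithmetic. A quick check with invented numbers confirms that Np is an upper bound on the chance that a trawl of a database containing no true source yields at least one coincidental hit:

```python
p = 1.0 / 50_000_000  # illustrative random-match probability
N = 10_000            # illustrative database size

exact = 1 - (1 - p) ** N  # chance of at least one coincidental match among N non-sources
bound = N * p             # the Np adjustment recommended by NRC II

# exact falls just below Np; when Np is small the two are nearly identical
```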
154. The analysis of the DNA database search is more complicated than the lottery example
suggests. In the simple lottery, there was exactly one winner. The trawl case is closer to a lottery in
which we hold a ticket with a winning number, but it might be counterfeit, and we are not sure how
many counterfeit copies of the winning ticket were in circulation when we bought our N tickets.
155. NRC II, supra note 9, at 161 (Recommendation 5.1).
156. Initially, the board explained that
Two questions arise when a match is derived from a database search: (1) What is the rarity of the DNA
profile? and (2) What is the probability of finding such a DNA profile in the database searched? These
two questions address different issues. That the different questions produce different answers should
be obvious. The former question addresses the random match probability, which is often of particular
interest to the fact finder. Here we address the latter question, which is especially important when a
profile found in a database search matches the DNA profile of an evidence sample.
DNA Advisory Board, Statistical and Population Genetics Issues Affecting the Evaluation of the Frequency of
Occurrence of DNA Profiles Calculated from Pertinent Population Database(s), 2 Forensic Sci. Comm., July
2000, available at http://www.fbi.gov/hq/lab/fsc/backissu/july2000/dnastat.htm. After a discussion of
the literature as of 2000, the Board wrote that “we continue to endorse the recommendation of the
NRC II Report for the evaluation of DNA evidence from a database search.”
157. The first NRC committee wrote that “[t]he distinction between finding a match between
an evidence sample and a suspect sample and finding a match between an evidence sample and one
of many entries in a DNA profile databank is important.” It used the same Np formula in a numerical example to show that “[t]he chance of finding a match in the second case is considerably higher,
No one questions the mathematics that show that when the database size N
is very small compared with the size of the population, Np is an upper bound on
the expected frequency with which searches of databases will incriminate innocent
individuals when the true source of the crime scene DNA is not represented in the
databases. The “Bayesian” school of thought, however, suggests that the frequency
with which innocent databases will be falsely accused of harboring the source of
the crime scene DNA is basically irrelevant. The question of interest to the legal
system is whether the one individual whose database DNA matches the trace-evidence DNA is the source of that trace. As the size of a database approaches that
of the entire population, finding one and only one matching individual should be
more, not less, convincing evidence against that person. Thus, instead of looking
at how surprising it would be to find a match in a large group of innocent suspects,
this school of thought asks how much the result of the database search enhances
the probability that the individual so identified is the source. The database search
is actually more probative than the confirmation search because the DNA evidence in the trawl case is much more extensive. Trawling through large databases
excludes millions of people, thereby reducing the number of people who might
have left the trace evidence if the suspect did not. This additional information
increases the likelihood that the defendant is the source, although the effect is
indirect and generally small.158
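The Bayesian argument can be illustrated with a toy posterior calculation (all numbers invented). With a flat prior over M possible sources, excluding the other database members raises the posterior probability that the lone matching person is the source, though only slightly when the database is small relative to the population:

```python
def posterior_source_probability(M, N, p):
    """Sketch of a Balding-Donnelly-style posterior that the single matching
    person is the source: flat prior over M possible sources, the trawl has
    excluded the other N - 1 database members, and each of the M - N untested
    people would match coincidentally with probability p."""
    return 1.0 / (1.0 + (M - N) * p)

confirmation = posterior_source_probability(1_000_000, 1, 1e-8)  # single-suspect case
trawl = posterior_source_probability(1_000_000, 900_000, 1e-8)   # large database trawl
```

Here the trawl posterior (about 0.999) slightly exceeds the single-suspect posterior (about 0.990), matching the observation that the effect is indirect and generally small.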
Of course, when the cold hit is the only evidence against the defendant, the
total package of evidence in the trawl case is less than in the confirmation case.
Nonetheless, the Bayesian treatment shows that the DNA part of the total evidence is more powerful in a cold-hit case because this part of the evidence is more
complete than when the search for matching DNA is limited to a single suspect.
This reasoning suggests that the random-match probability (or, equivalently, the
frequency p in the population) understates the probative value of the unique
DNA match in the trawl case. And if this is so, then the unadjusted random-match probability or frequency p can be used as a conservative indication of the
probative value of the finding that, of the many people in the database, only the
defendant matches.159
because one . . . fishes through the databank, trying out many hypotheses.” NRC I, supra note 8,
at 124. Rather than proposing a statistical adjustment to the match probability, however, that committee recommended using only a few loci in the databank search, then confirming the match with
additional loci, and presenting only “the statistical frequency associated with the additional loci. . . .”
Id. at 124 tbl. 1.1.
158. When the size of the database approaches the size of the entire population, the effect is
large. Finding that only one individual in a large database has a particular profile also raises the probability that this profile is very rare, further enhancing the probative value of the DNA evidence.
159. This analysis was developed by David Balding and Peter Donnelly. For informal expositions, see, for example, Peter Donnelly & Richard D. Friedman, DNA Database Searches and the Legal
Consumption of Scientific Evidence, 97 Mich. L. Rev. 931 (1999); Kaye, supra note 109; Simon Walsh
& John Buckleton, DNA Intelligence Databases, in Forensic DNA Evidence Interpretation 439 (John
Buckleton et al. eds., 2005). For a related analysis directed at the average probability that an individual
b. The judicial opinions on adjustment
The need for an adjustment has been vigorously debated in the statistical, and to
a lesser extent, the legal literature.160 The dominant view in the journal articles
is that the random-match probability or frequency need not be inflated to protect
the defendant. The major opinions to confront this issue agree that p is admissible against the defendant in a trawl case.161 They reason that all the statistics are
admissible under Frye and Daubert because there is no controversy over how they
are computed. They then assume that both p and Np are logically relevant and
not prejudicial in a trawl case.
But commentators have pointed out that if the frequentist position that
trawling degrades the probative value of the match is correct, then it is hard
to see what p offers the jury. Conversely, if the Bayesian position that trawling
enhances the probative value of the match is correct, then it is hard to see what
Np offers the jury.162 Thus, it has been argued that, to decide whether p should
be admissible when offered by the prosecution and whether Np (or some variant of it) should be admissible when offered by the defense, the law needs to
directly confront the schism between the frequentist and Bayesian perspectives
on characterizing the probative value of a cold hit.163
2. Near-miss (familial) searching
Normally, police trawl a DNA database to see if any recorded STR profiles match
a crime scene profile. It is not generally necessary to inform the jury that the
defendant was located in this manner. Indeed, the rules of evidence sometimes
prohibit this proof over the objection of the defendant.164 Another search pro-
identified through a database trawl is the source of a crime scene DNA sample, see Yun S. Song et
al., Average Probability That a “Cold Hit” in a DNA Database Search Results in an Erroneous Attribution,
54 J. Forensic Sci. 22, 23–24 (2009).
160. For citations to this literature, see Kaye, supra note 109; Walsh & Buckleton, supra note
159, at 464.
161. People v. Nelson, 185 P.3d 49 (Cal. 2008); United States v. Jenkins, 887 A.2d 1013 (D.C.
2005). The cases are analyzed in Kaye, supra note 109.
162. Furthermore, even if an adjustment is logically required, Np might be too extreme because
the offender databases include the profiles of individuals—those in prison at the time of the offense,
for instance—who could not have been the source of the crime scene sample. To that extent, Np
overstates the expected frequency of matches to innocent individuals in the database. Kaye, supra note
109; David H. Kaye, People v. Nelson: A Tale of Two Statistics, 7 Law, Probability & Risk 249 (2008).
163. Kaye, supra note 109; Kaye, supra note 162.
164. The common law of evidence and Federal Rule of Evidence 404 prevent the government
from proving that a defendant has committed other crimes when the only purpose of the revelation is
to suggest a general propensity toward criminality. See, e.g., 1 McCormick on Evidence, supra note 3,
§ 190. Proof that the defendant was identified through a database search is likely to suggest the existence of a criminal record, because it is widely known that existing law enforcement DNA databases
are largely filled with the profiles of convicted offenders. Nonetheless, where the bona fide and important purpose of the disclosure is “to complete the story” (id.) or to help the jury to understand an Np
cedure can lead to charges against a defendant who is not even in the database.
The clearest illustration is the case of identical twins, one of whom is a convicted
offender whose DNA type is on file and the other who has never been typed. If
the convicted twin was in prison when the crime under investigation was committed, and if the police realized that he had an identical twin, suspicion should
fall on the identical twin. Presumably, the police would seek a sample from this
twin, and at trial it would not be necessary for the prosecution to explain the
roundabout process through which he was identified.
In this example, the defendant was found because a relative’s DNA led the
police to him. More generally, the fact that close relatives share more alleles than
other members of the same subpopulation can be exploited as an investigative tool.
Rather than search for a match at all 13 loci in an STR profile, police could search
for a near miss—a partial match that is much more probable when the partly
matching profile in the database comes from a close relative than when it comes
from an unrelated person. (Analysis of Y-STRs or mtDNA then could determine
whether the offender who provided the partially matching DNA in the database
probably is in the same paternal or maternal lineage as the unknown individual
who left DNA at the scene of the crime.)
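A near-miss screen can be sketched as a shared-allele count across loci. The data structures and threshold below are illustrative; operational familial-search software ranks candidates with kinship likelihood ratios rather than a raw count.

```python
def shared_alleles(profile_a, profile_b):
    """Count alleles shared locus-by-locus between two STR profiles, each a
    dict mapping a locus name to a frozenset of (up to two) alleles."""
    return sum(len(profile_a[locus] & profile_b[locus]) for locus in profile_a)

def near_miss_candidates(scene, database, threshold):
    """Return names of database profiles sharing at least `threshold` alleles
    with the crime scene profile; close relatives tend to share far more
    alleles than unrelated members of the same subpopulation."""
    return [name for name, profile in database.items()
            if shared_alleles(scene, profile) >= threshold]
```

A full 13-locus profile has 26 allele slots, so a sibling sharing, say, 18 of them would stand out sharply against unrelated profiles sharing only a handful.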
Such “familial searching” raises technical and policy questions. The technical and practical challenge is to devise a search strategy that keeps the number of
false leads to a tolerable level. The policy question is whether exposing relatives
to the possibility of being investigated on the basis of genetic leads from their kin
is appropriate.165
In receiving the DNA evidence, courts might consider having the prosecution describe the match without revealing that the defendant’s close relative is a
known or suspected criminal. In addition, if database trawls degrade the probative value of a perfect match in the database—a theory discussed in the previous
subsection—then the usual random-match probability or estimated frequency
exaggerates the value of the match derived from a database search. From the
frequentist perspective, one must ask how often trawling databases for leads to
individuals (both within and outside the database) will produce false accusations.
statistic, Rule 404 itself arguably does not prevent the prosecution (and certainly not the defense) from revealing that the defendant was found through a DNA database trawl. In the absence of a categorical rule of exclusion (like the one in Rule 404), a case-by-case balancing of the value of the information for its legitimate purposes as against its potential prejudice to the defendant is required. See id.
165. See, e.g., Bruce Budowle et al., Clarification of Statistical Issues Related to the Operation of CODIS, in Genetic Identity Conference Proceedings: 18th Int’l Symposium on Human Identification (2006), available at http://www.promega.com/GENETICIDPROC/ussymp17proc/oralpresentations/budowle.pdf; Henry T. Greely et al., Family Ties: The Use of DNA Offender Databases to Catch Offenders’ Kin, 34 J.L. Med. & Ethics 248 (2006); Erica Haimes, Social and Ethical Issues in the Use of Familial Searching in Forensic Investigations: Insights from Family and Kinship Studies, 34 J.L. Med. & Ethics 262 (2006).
From the Bayesian perspective, however, the usual match probabilities and likelihoods can be used because, if anything, they understate the probative value of
the DNA information.166
3. All-pairs matching within a database to verify estimated random-match
probabilities
A third and final use of police intelligence databases has evidentiary implications.
Section IV.E explained how population genetics models and reference samples
for determining allele frequencies are used to estimate DNA genotype frequencies. Large databases can be used to check these theoretical computations. In
New Zealand, for example, researchers compared every six-locus STR profile
in the national database with every other profile.167 At the time, there were
10,907 profiles. This means that there were about 59 million distinct pairs.168
Because the theoretical random-match probability was about 1 in 50 million, if
all the individuals represented in the database were unrelated, one would expect
that an exhaustive comparison of the profiles for these 59 million pairs would
produce only about one match. In fact, the 59 million comparisons revealed 10
matches. The excess number of matches is evidence that not all the individuals in
the database were unrelated, that the true match probability was smaller than the
theoretical calculation, or both. In fact, eight of the pairs were twins or brothers.
The ninth was a duplicate (because one person gave a sample as himself and then again while pretending to be someone else). The tenth was apparently a match between
two unrelated people. This exercise thus confirmed the theoretical computation
of the random-match probability. On average, the theoretical match probability
was about 1/50,000,000, and the rate of matches in the unrelated pairs within the
database was 1/59,000,000.
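The pair counting behind these figures is easy to verify. The following sketch is purely illustrative; the database size, match counts, and probabilities are the ones quoted in the text, and the code is not part of the New Zealand study itself:

```python
# Illustrative check of the all-pairs figures in the text: a database of
# N profiles yields N*(N-1)/2 distinct pairs, and under a theoretical
# random-match probability p, about pairs * p matches are expected.

def distinct_pairs(n: int) -> int:
    """Number of unordered pairs of different individuals."""
    return n * (n - 1) // 2

n_profiles = 10_907                 # New Zealand database size
pairs = distinct_pairs(n_profiles)  # about 59 million pairs
theoretical_p = 1 / 50_000_000      # roughly 1 in 50 million

expected = pairs * theoretical_p    # about one match expected
observed = 10                       # matches actually found
empirical_p = observed / pairs      # about 1 in 5.9 million

print(f"{pairs:,} pairs; ~{expected:.1f} expected match(es); "
      f"empirical rate about 1 in {1 / empirical_p:,.0f}")
```

Run as is, the script reproduces the 59 million pair figure and shows why roughly one match was expected under the theoretical probability, while ten were observed before relatives and the duplicate were excluded.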
In the United States, defendants have sought discovery of the criminal offender databases to determine whether the number of matching and partially matching pairs exceeds the predictions made with the population genetics
model.169 An early report about partial matches in a state database in Arizona was
said to show extraordinarily large numbers of partial matches (without accounting
for the combinatorial explosion in the number of comparisons in an all-pairs database search).170 However, some scientists question the utility of the investigative databases for population genetics research.171 They observe that these databases contain an unknown number of relatives, that they might contain duplicates, and that the population in the offender databases is highly structured. These complicating factors would need to be considered in testing for an excess of matches or partial matches. Studies of offender databases in Australia and New Zealand that make adjustments for population structure and close relatives have shown substantial agreement between the expected and observed numbers of partial matches, at least up to the nine STR loci used in those databases.172
166. See Kaye, supra note 3; supra Section V.D.1.
167. Walsh & Buckleton, supra note 159, at 463.
168. Altogether, nearly 11,000 people were represented in the New Zealand database. Hence, about 10,907 × 10,907 ordered pairs such as (1,1), (1,2), (1,3), . . . , (1,10907), (2,1), (2,2), (2,3), . . . , (2,10907), . . . , (10907,1), (10907,2), (10907,3), . . . , (10907,10907) can be formed. This amounts to almost 119 million possible pairs. Of course, there is no point in checking the pairs (1,1), (2,2), . . . , (10907,10907). Thus, the number of ordered pairs with different individuals is 119 million minus a mere 10,907. The subtraction hardly changes anything. Finally, ordered pairs such as (1,5) and (5,1) involve the same two people. Therefore, the number of distinct pairs of people is about half of 119 million—the 59 million figure in the text.
169. Jason Felch & Maura Dolan, How Reliable Is DNA in Identifying Suspects? L.A. Times, July 20, 2008.
The existence of large databases also provides a means of estimating a
random-match probability without making any modeling assumptions. For the
New Zealand study, even ignoring the possibility of relatives and duplicates, there
were only 10 matches out of 59 million comparisons. The empirical estimate of the
random-match probability is therefore about 1 in 5.9 million. This is about 10 times
larger than the theoretical estimate, but still quite small. As this example indicates,
crude but simple empirical estimates from all-pairs comparisons in large databases
may well produce random-match probabilities that are larger than the theoretical
estimates (as expected when full siblings or other close relatives are in the databases),
but the estimated probabilities are likely to remain impressively small.
170. Id.; Kaye, supra note 3. As illustrated supra note 168, an all-pairs search in a large database
of size N will involve N(N – 1)/2, or about N²/2, comparisons. For example, a database of 6 million samples gives rise to some 18,000,000,000,000 comparisons. Even with no population structure,
relatives, and duplicates, and with random-match probabilities in the trillionths, one would expect
to find a large number of matches or near-matches. An analogy can be made to the famous “birthday problem” mentioned in the 1996 NRC Report, supra note 9, at 165. In its simplest form, the
birthday problem assumes that equal numbers of people are born every day of the year. The problem
is to determine the minimum number of people in a room such that the odds favor there being at least
two of them who were born on the same day of the same month. Focusing solely on the random-match probability of 1/365 for a specified birthday makes it appear that a huge number of people must
be in the room for a match to be likely. After all, the chance of a match between two individuals having a given birthday (say, January 1) is (ignoring leap years) a minuscule 1/365 × 1/365 = 1/133,225.
But because the matching birthday can be any one of the 365 days in the year and because there are
N(N – 1)/2 ways to have a match, it takes only N = 23 people before it is more likely than not that
at least two people share a birthday. The birthday problem thus shows that surprising coincidences
commonly occur even in relatively small databases. See, e.g., Persi Diaconis & Frederick Mosteller,
Methods for Studying Coincidences, 84 J. Am. Statistical Ass’n 853 (1989).
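The birthday calculation described in note 170 can be reproduced exactly. The sketch below is illustrative only and uses the idealized 365-equal-days model stated in the note:

```python
# Exact birthday-problem computation from note 170: the probability that
# at least two of n people share a birthday, with 365 equally likely days.

def p_shared_birthday(n: int) -> float:
    """Probability that some pair among n people shares a birthday."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (365 - i) / 365
    return 1.0 - p_all_distinct

# Find the smallest group size for which a shared birthday is more
# likely than not.
n = 1
while p_shared_birthday(n) <= 0.5:
    n += 1
print(n)  # 23, as stated in the note
```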
171. Bruce Budowle et al., Partial Matches in Heterogeneous Offender Databases Do Not Call into
Question the Validity of Random Match Probability Calculations, 123 Int’l J. Legal Med. 59 (2009).
172. James M. Curran et al., Empirical Support for the Reliability of DNA Evidence Interpretation in
Australia and New Zealand, 40 Australian J. Forensic Sci. 99, 102–06 (2008); Bruce S. Weir, The Rarity
of DNA Profiles, 1 Annals Applied Stat. 358 (2007); B.S. Weir, Matching and Partially-Matching DNA
Profiles, 49 J. Forensic Sci. 1009, 1013 (2004); cf. Laurence D. Mueller, Can Simple Population Genetic
Models Reconcile Partial Match Frequencies Observed in Large Forensic Databases? 87 J. Genetics (India) 101
(2008) (maintaining that excess partial matches in an Arizona offender database are not easily reconciled
with theoretical expectations). This literature is reviewed in David H. Kaye, Trawling DNA Databases
for Partial Matches: What Is the FBI Afraid of? 19 Cornell J.L. & Pub. Pol’y 145 (2009).
Reference Guide on DNA Identification Evidence
VI. Nonhuman DNA Testing
Most routine applications of DNA technology in the forensic setting involve the
identification of human beings—suspects in criminal cases, missing persons, or victims of mass disasters. However, inasmuch as DNA analysis might be informative
in any kind of case involving biological material, DNA analysis has found application in such diverse situations as identification of individual plants and animals
that link suspects to crime scenes, enforcement of endangered species and other
wildlife regulations, investigation of patent issues involving specific animal breeds
and plant cultivars, identification of fraudulently labeled foodstuffs, identification
of sources of bacterial and viral epidemic outbreaks, and identification of agents
of bioterrorism.173 These applications are directed either at identifying the species
origin of an item or at distinguishing among individuals (or subgroups) within a
species. In deciding whether the evidence is scientifically sound, it can be important to consider the novelty of the application, the validity of the underlying
scientific theory, the validity of any statistical interpretations, and the relevant
scientific community to consult in assessing the application. This section considers
these factors in the context of nonhuman DNA testing.
A. Species and Subspecies
Evolution is a branching process. Over time, populations may split into distinct
species. Ancestral species and some or all of their branches become extinct.
Phylogenetics uses DNA sequences to elucidate these evolutionary “trees.” This
information can help determine the species of the organism from which material
has been obtained. For example, the most desirable Russian black caviar originates
from three species of wild sturgeon inhabiting the Volga River–Caspian Sea basin.
But caviar from other sturgeon species is sometimes falsely labeled as originating
from these three species—in violation of food labeling laws. Moreover, the three
sturgeon species are listed as endangered, and trade in their caviar is restricted. A
test of caviar species based on DNA sequence variation in a mitochondrial gene
found that 23% of caviar products in the New York City area were mislabeled,174
and in United States v. Yazback, caviar species testing was used to convict an
173. See, e.g., R.G. Breeze et al., Microbial Forensics (2005); Laurel A. Neme, Animal Investigators: How the World’s First Wildlife Forensics Lab Is Solving Crimes and Saving Endangered Species
(2009). In still other situations, DNA testing has been used to establish the identity of a missing or
stolen animal. E.g., Augillard v. Madura, 257 S.W.3d 494 (Tex. App. 2008) (action for conversion to
recover dog lost during Hurricane Katrina); Guillermo Giovambattista et al., DNA Typing in a Cattle
Stealing Case, 46 J. Forensic Sci. 1484 (2001).
174. Rob DeSalle & Vadim J. Birstein, PCR Identification of Black Caviar, 381 Nature 197 (1996);
Vadim J. Birstein et al., Population Aggregation Analysis of Three Caviar-Producing Species of Sturgeons and
Implications for the Species Identification of Black Caviar, 12 Conservation Biology 766 (1998).
importer of gourmet foods for falsely labeling fish eggs from the environmentally
protected American paddlefish as the more prized Russian sevruga caviar.175
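The logic of sequence-based species identification can be reduced to a toy example. In the sketch below, the short "cytochrome b" fragments are invented for illustration (only the species names are real); an unknown sample is assigned to the reference species whose sequence it most closely matches:

```python
# Toy species assignment by sequence comparison. The 12-base fragments
# are invented for this example; real testing uses long cytochrome b
# sequences and curated reference databases.

def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

references = {  # hypothetical reference fragments, one per species
    "Huso huso (beluga)":             "ATGGCACTAAGC",
    "Acipenser stellatus (sevruga)":  "ATGGCGCTTAGC",
    "Polyodon spathula (paddlefish)": "ATAGTACTGAGT",
}

query = "ATAGTACTGAGC"  # fragment from an unlabeled caviar sample

closest = min(references, key=lambda sp: hamming(references[sp], query))
print(closest)  # the paddlefish reference differs at only one position
```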
Phylogenetic analysis also is used to study changes in populations of organisms
of the same species. In State v. Schmidt,176 a physician was convicted of attempting
to murder his former lover by injecting her with the HIV virus obtained from an
infected patient. The virus evolves rapidly—its sequence can change by as much
as 1% per year over the course of infection in a single individual. In time, an
infected individual will harbor new strains, but these will be more closely related
to the particular strain (or strains) that originally infected the individual than to the
diverse strains of the virus in the geographic area. The victim in Schmidt had fewer
strains of HIV than the patient—indicating a later infection—and all the victim’s
strains were closely related to a subset of the patient’s strains—indicating that the
victim’s strains originated from that subset then in the patient. This technique of
examining the genetic similarities and differences in two populations of viruses
has been used in other cases across the world.177
The FBI employed similar reasoning to conclude that the anthrax spores in
letters sent through the mail in 2001 came from the descendants of bacteria first
cultured from an infected cow in Texas in 1981. This “Ames strain” was disseminated to various research laboratories over the years, and the FBI also attempted to
associate the letter spores with particular collections of anthrax bacteria (all derived
from the one Ames strain) now housed in different laboratories.178
Both the caviar and the HIV cases exemplify the translation of established
scientific methods into a forensic application. The mitochondrial gene used for
species identification in Yazback was the cytochrome b gene. Having accumulated
mutations over time, this gene commonly is used for assessing species relationships
among vertebrates, and the database of cytochrome b sequences is extensive. In
particular, this gene sequence previously had been used to determine the evolutionary placement of sturgeons among other species of fish.179 Likewise, the use
of phylogenetic analysis for assessing relationships among HIV strains has provided
critical insights into the biology of this deadly virus.
175. Dep’t of Justice, Caviar Company and President Convicted in Smuggling Conspiracy, available at http://www.usdoj.gov/opa/pr/2002/January/02_enrd_052.htm. An earlier case is described in
Andrew Cohen, Sturgeon Poaching and Black Market Caviar: A Case Study, 48 J. Env’l Biology Fishes 423
(1997).
176. 699 So. 2d 448 (La. Ct. App. 1997) (holding that the evidence satisfied Daubert).
177. Edwin J. Bernard et al., The Use of Phylogenetic Analysis as Evidence in Criminal Investigation
of HIV Transmission, Feb. 2007, available at http://www.nat.org.uk/Media%20library/Files/PDF%20
Documents/HIV-Forensics.pdf.
178. See National Research Council, Committee on the Review of the Scientific Approaches
Used During the FBI’s Investigation of the 2001 Bacillus Anthracis Mailings, Review of the Scientific
Approaches Used During the FBI’s Investigation of the 2001 Anthrax Letters (2011).
179. Sturgeon Biodiversity and Conservation (Vadim J. Birstein et al. eds., 1997).
That said, how the phylogenetic analysis is implemented and the genes used
for the analysis may prompt questions in some cases.180 Both the computer algorithm used to align DNA sequences prior to the construction of the phylogenetic
tree and the computer algorithms used to build the tree contain assumptions that
can influence the outcomes. Consequently, alignments generated by different
software can result in different trees, and different tree-building algorithms can
yield different trees from the same alignments. Thus, phylogenetic analysis should
not be looked upon as a simple mechanical process. In Schmidt, the investigators
anticipated and addressed potential problem areas in their choice of sequence data
to collect and by using different algorithms for phylogenetic analysis. The results
from the multiple analyses were the same, supporting the overall conclusion.181
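The point that different tree-building algorithms can yield different trees from the same data can be made with a toy calculation. The distance matrix below is contrived for illustration (it is not real sequence data), and the clustering is a bare-bones agglomerative procedure rather than the methods actually used in Schmidt:

```python
# Toy demonstration that the same pairwise distances can yield different
# trees under different linkage rules. The distances are contrived.

def cluster(names, d, linkage):
    """Agglomerative clustering; linkage is min (single linkage) or
    max (complete linkage). Returns the tree as nested tuples."""
    def leaf_dist(x, y):
        return d[(x, y)] if (x, y) in d else d[(y, x)]
    clusters = [(name, frozenset([name])) for name in names]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dij = linkage(leaf_dist(x, y)
                              for x in clusters[i][1]
                              for y in clusters[j][1])
                if best is None or dij < best[0]:
                    best = (dij, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

names = ["A", "B", "C", "D"]
d = {("A", "B"): 1, ("B", "C"): 2, ("A", "C"): 5,
     ("C", "D"): 3, ("A", "D"): 6, ("B", "D"): 6}

single = cluster(names, d, min)
complete = cluster(names, d, max)
print(single)    # C joins the (A, B) cluster before D is attached
print(complete)  # C pairs with D instead
```

Real phylogenetic software differs in far subtler ways, but the sensitivity of the resulting tree to algorithmic choices is the same in kind.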
B. Individual Organisms
DNA analysis to determine that trace evidence originated from a particular
individual within a species requires both a valid analytical procedure for forensic
samples and at least a rough assessment of how rare the DNA types are in the
population. In human DNA testing, suitable reference databases permit reasonable estimates of allele frequencies among groups of human beings (see supra Section IV), but adequate databases will not always be available for other organisms.
Nonetheless, a match between the DNA at a crime scene and the organism that
could be the source of that trace evidence still may be informative. In these cases,
a court may consider admitting testimony about the matching features along with
circumscribed, qualitative explanations of the significance of the similarities.182
Such cases began appearing in the 1990s. In State v. Bogan,183 for example,
a woman’s body was found in the desert, near several Palo Verde trees. A detective noticed two Palo Verde seed pods in the bed of a truck that the suspect was
driving before the murder. However, genetic variation in Palo Verde tree DNA
had not been widely studied, and no one knew how much variation actually
180. One cannot assume that cytochrome b gene testing, for example, is automatically appropriate for all species identification. Mitochondria are maternally inherited, and one can ask whether
cross-breeding between different species of sturgeon could make a sturgeon of one species appear to
be another species because it carries mitochondrial DNA originating from the other species. Mitochondrial introgression has been detected in several vertebrate species. Coyote mtDNA in wolves
and cattle mtDNA in bison are notable examples. Introgression in sturgeon has been reported—some
individual sturgeon appearing to be of one of the prized Volga region caviar species were found to
carry cytochrome b genes from a lesser regarded non-Volga species. These examples indicate the need
for specialized knowledge of the basic biology and ecology of the species in question.
181. On the need for caution in the interpretation of HIV sequence similarities, see Bernard
et al., supra note 177.
182. See generally Kaye et al., supra note 1 (discussing various ways to explain “matches” in
forensic identification tests).
183. 905 P.2d 515 (Ariz. Ct. App. 1995) (holding that the admission of the DNA match was
proper under Frye).
existed within the species. Accordingly, a method for genetic analysis had to be
developed, assessed for what it revealed about genetic variability, and evaluated
for reliability. A university biologist chose random amplified polymorphic DNA
(RAPD) analysis, a PCR-based method then commonly used for the detection of
variation within species for which no genomic sequence information exists. This
approach employs a single short piece of DNA with a random, arbitrary sequence
as the PCR primer; the amplification products are DNA fragments of unknown
sequence and variable length that can be separated by electrophoresis into a
barcode-like “fingerprint” pattern. In a blind trial, the biologist was able to show
that the DNA from nearly 30 Palo Verde trees yielded distinct RAPD patterns.184
He testified that the two pods “were identical” and “matched completely with” a
particular tree and “didn’t match any of the [other] trees.” In fact, he went so far as
to say that he felt “quite confident in concluding that” the tree’s DNA would be
distinguishable from that of “any tree that might be furnished” to him. Numerical
estimates of the random-match probability were not introduced.185
The first example of an animal identification using STR typing involved linking evidence cat hairs to a particular cat. In R. v. Beamish, a woman disappeared
from her home on Prince Edward Island, on Canada’s eastern seaboard. Weeks
later a man’s brown leather jacket stained with blood was discovered in a plastic
bag in the woods. In the jacket’s lining were white cat hairs. After the missing
woman’s body was found in a shallow grave, her estranged common-law husband
was arrested and charged with murder. He lived with his parents and a white cat
named Snowball. A laboratory already engaged in the study of genetic diversity
in cats showed that the DNA profile of the evidence cat hairs matched Snowball
at 10 STR loci. Based on a survey of genetic variation in domestic cats generated
for this case, the probability of a chance match was offered as evidence in support
of the hypothesis that the hairs originated from Snowball.186
184. He analyzed samples from the nine trees near the body and another 19 trees from across
the county. He “was not informed, until after his tests were completed and his report written, which
samples came from” which trees. Id. at 521. Furthermore, unbeknownst to the experimenter, two
apparently distinct samples were prepared from the tree at the crime scene that appeared to have been
abraded by the defendant’s truck. The biologist correctly identified the two samples from the one tree
as matching, and he “distinguished the DNA from the seed pods in the truck bed from the DNA of
all twenty-eight trees except” that one. Id.
185. RAPD analysis does not provide systematic information about sequence variation at defined
loci. As a result, it is not possible to make a reliable estimate of allele or genotype frequencies at a
locus, nor can one make the assumption of genetic independence required to legitimately multiply
frequencies across multiple loci, as one can with STR markers. Furthermore, RAPD profile results
are not generally portable between laboratories. Often, profiles generated by different laboratories
will differ in their details. Therefore, RAPD profile data are not amenable to the generation of large
databases. Nonetheless, the state’s expert estimated a random match probability of 1 in 1,000,000, and
the defense expert countered with 1 in 136,000. The trial court excluded both estimates because of
the then-existing controversy (see Section IV) over analogous estimates for human RFLP genotypes.
186. See Marilyn A. Menotti-Raymond et al., Pet Cat Hair Implicates Murder Suspect, 386 Nature
774 (1997).
In Beamish, there was no preexisting population database characterizing STR
polymorphism in domestic cats, but the premise that cats exhibit substantial
genetic variation at STR loci was in accord with knowledge of STR variation in
other mammals. Moreover, testing done on two small cat populations provided
evidence that the STR loci chosen for analysis were polymorphic and behaved as
independent genetic characteristics, allowing allele frequency estimates to be used
for the calculation of random-match probabilities as is done with human STR
data. On this basis, the random-match probability for Snowball’s STR profile was
estimated to be one in many millions, and the trial court admitted this statistic.187
An animal-DNA random-match probability prompted a reversal, however,
in a Washington case. In State v. Leuluaialii,188 the prosecution offered testimony
of an STR match with a dog’s blood that linked the defendants to the victims’
bodies. The defendants objected, seeking a Frye hearing, but the trial court denied
this motion and admitted testimony that included the report that “the probability
of finding another dog with Chief’s DNA profile was 1 in 18 billion [or] 1 in
3 trillion.”189 The state court of appeals remanded the case for a hearing on general
acceptance, cautioning that “[b]ecause PE Zoogen has not yet published sufficient
data to show that its DNA markers and associated probability estimates are reliable, we would suggest that other courts tread lightly in these waters and closely
examine canine DNA results before accepting them at trial.”190
The scientific literature shows continued use of STR profiling191 (as well as
the use of SNP typing)192 to characterize individuals in plant and animal populations. STR databases have been established for domestic and agriculturally significant animals such as dogs, cats, cattle, and horses as well as for a number of
plant species.193 Critical to the use of these databases is an understanding of the
187. David N. Leff, Killer Convicted by a Hair: Unprecedented Forensic Evidence from Cat’s DNA
Convinced Canadian Jury, Bioworld Today, Apr. 24, 1997, available in 1997 WL 7473675 (“the frequency of the match came out to be on the order of about one in 45 million,” quoting Steven
O’Brien); All Things Considered: Cat DNA (NPR broadcast, Apr. 23, 1997), available in 1997 WL
12832754 (“it was less than one in two hundred million,” quoting Steven O’Brien).
188. 77 P.3d 1192 (Wash. Ct. App. 2003).
189. Id. at 1196.
190. Id. at 1201.
191. E.g., Kathleen J. Craft et al., Application of Plant DNA Markers in Forensic Botany: Genetic
Comparison of Quercus Evidence Leaves to Crime Scene Trees Using Microsatellites, 165 Forensic Sci. Int’l
64 (2007) (differentiation of oak tree leaves); Christine Kubik et al., Genetic Diversity in Seven Perennial
Ryegrass (Lolium perenne L.) Cultivars Based on SSR Markers, 41 Crop Sci. 1565 (2001) (210 ryegrass
samples correctly assigned to seven cultivars).
192. E.g., Bridgett M. vonHoldt et al., Genome-wide SNP and Haplotype Analyses Reveal a Rich
History Underlying Dog Domestication, 464 Nature 898 (2010) (48,000 SNPs in 912 dogs and 225
wolves).
193. Joy Halverson & Christopher J. Basten, A PCR Multiplex and Database for Forensic DNA
Identification of Dogs, 50 J. Forensic Sci. 352 (2005); Marilyn A. Menotti-Raymond et al., An STR
Forensic Typing System for Genetic Individualization of Domestic Cat (Felis catus) Samples, 50 J. Forensic
Sci. 1061 (2005); L.H.P. van de Goor et al., Population Studies of 16 Bovine STR Loci for Forensic
basic reproductive patterns of the species in question. The simple product rule
(Section IV) assumes that the sexually reproducing species mates at random with
regard to the STR loci used for typing. When that is the case, the usual STR
alleles and loci can be regarded as independent. But if mating is nonrandom,
as occurs when individuals within a species are selectively bred to obtain some
property such as coat color, body type, or behavioral repertoire, or as occurs
when a species exists in geographically distinct subpopulations, the inheritance of
loci may no longer be independent. Because it cannot be assumed a priori that a
crime scene sample originates from a mixed-breed animal, inbreeding normally
must be accounted for.194
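For a sexually reproducing species that does mate at random with respect to the typed loci, the product-rule arithmetic is simple. The sketch below uses made-up allele frequencies; real casework would use validated databases and, where needed, the structure corrections discussed above:

```python
# Simple product-rule sketch with hypothetical allele frequencies.
# Assumes Hardy-Weinberg proportions at each locus and independence
# across loci; inbreeding or population structure invalidates the
# straight multiplication, as the text explains.

def genotype_freq(p: float, q: float, heterozygous: bool) -> float:
    """Expected genotype frequency at one locus: 2pq if the profile
    is heterozygous there, p squared if homozygous."""
    return 2 * p * q if heterozygous else p * p

loci = [  # (allele freq 1, allele freq 2, heterozygous?) -- made up
    (0.10, 0.20, True),
    (0.05, 0.05, False),  # homozygous for an allele of frequency 0.05
    (0.15, 0.30, True),
]

profile_freq = 1.0
for p, q, het in loci:
    profile_freq *= genotype_freq(p, q, het)

print(profile_freq)  # about 9e-06, roughly 1 in 110,000 for three loci
```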
A different approach is called for if the species is not sexually reproducing.
For example, many plants, some simple animals, and bacteria reproduce asexually.
With asexual reproduction, most offspring are genetically identical to the parent. All the individuals that originate from a common parent constitute, collectively, a clone. The major source of genetic variation in asexually reproducing
species is mutation. When a mutation occurs, a new clonal lineage is created.
Individuals in the original clonal lineage continue to propagate, and two clonal
lineages now exist where before there was one. Thus, in species that reproduce
asexually, genetic testing distinguishes clones, not individuals; hence, the product
rule cannot be applied to estimate genotype frequencies for individuals. Rather,
the frequency of a particular clone in a population of clones must be determined
by direct observation. For example, if a rose thorn found on a suspect’s clothing
were to be identified as originating from a particular cultivar of rose, the relevant
question becomes how common that variety of rose bush is and where it is located
in the community.
In short, the approach for estimating a genotype frequency depends on the
reproductive pattern and population genetics of the species. In cases involving
unusual organisms, a court will need to rely on experts with sufficient knowledge
of the species to verify that the method for estimating genotype frequencies is
appropriate.
Purposes, 125 Int’l J. Legal Med. 111 (2009). But see People v. Sutherland, 860 N.E.2d 178 (Ill. 2006)
(conflicting expert testimony on the representativeness of dog databases); Barbara van Asch & Filipe
Pereira, State-of-the-Art and Future Prospects of Canine STR-Based Genotyping, 3 Open Forensic Sci. J.
45 (2010) (recommending collaborative efforts for standardization and additional development of population databases).
194. This can be done either by using the affinal model for a structured population or by using
the probability of a match to a littermate or other closely related animal in lieu of the general random-match probability. See Sutherland, 860 N.E.2d 178 (describing such testimony).
Glossary of Terms
adenine (A). One of the four bases, or nucleotides, that make up the DNA
molecule. Adenine binds only to thymine. See nucleotide.
affinal method. A method for computing the single-locus profile probabilities
for a theoretical subpopulation by adjusting the single-locus profile probability, calculated with the product rule from the mixed population database, by
the amount of heterogeneity across subpopulations. The model is appropriate
even if there is no database available for a particular subpopulation, and the
formula always gives more conservative probabilities than the product rule
applied to the same database.
allele. In classical genetics, an allele is one of several alternative forms of a gene.
A biallelic gene has two variants; others have more. Alleles are inherited
separately from each parent, and for a given gene, an individual may have
two different alleles (heterozygosity) or the same allele (homozygosity). In
DNA analysis, the term is applied to any DNA region (even if it is not a
gene) used for analysis.
allelic ladder. A mixture of all the common alleles at a given locus. Periodically
producing electropherograms of the allelic ladder aids in designating the alleles
detected in an unknown sample. The positions of the peaks for the unknown
can be compared to the positions in a ladder electropherogram produced near
the time when the unknown was analyzed. Peaks that do not match up with
the ladder require further analysis.
Alu sequences. A family of short interspersed elements (SINEs) distributed
throughout the genomes of primates.
amplification. Increasing the number of copies of a DNA region, usually by
PCR.
amplified fragment length polymorphism (AMP-FLP). A DNA identification technique that uses PCR-amplified DNA fragments of varying
lengths. The D1S80 locus is a VNTR whose alleles can be detected with
this technique.
antibody. A protein (immunoglobulin) molecule, produced by the immune
system, that recognizes a particular foreign antigen and binds to it; if the
antigen is on the surface of a cell, this binding leads to cell aggregation and
subsequent destruction.
antigen. A molecule (typically found on the surface of a cell) whose shape triggers
the production of antibodies that will bind to the antigen.
autoradiograph (autoradiogram, autorad). In RFLP analysis, the X-ray film
(or print) showing the positions of radioactively marked fragments (bands)
of DNA, indicating how far these fragments have migrated, and hence their
molecular weights.
autosome. A chromosome other than the X and Y sex chromosomes.
band. See autoradiograph.
band shift. Movement of DNA fragments in one lane of a gel at a different rate
than fragments of an identical length in another lane, resulting in the same
pattern “shifted” up or down relative to the comparison lane. Band shift does
not necessarily occur at the same rate in all portions of the gel.
base pair (bp). Two complementary nucleotides bonded together at the matching bases (A and T or C and G) along the double helix “backbone” of the
DNA molecule. The length of a DNA fragment often is measured in numbers
of base pairs (1 kilobase (kb) = 1000 bp); base-pair numbers also are used to
describe the location of an allele on the DNA strand.
Bayes’ theorem. A formula that relates certain conditional probabilities. It
can be used to describe the impact of new data on the probability that a
hypothesis is true. See the chapter on statistics in this manual.
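In symbols (a standard statement of the theorem, with H the hypothesis and E the new data):

```latex
P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E \mid H)\,P(H) + P(E \mid \bar{H})\,P(\bar{H})}
```

Here \bar{H} denotes the negation of the hypothesis H.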
bin, fixed. In VNTR profiling, a bin is a range of base pairs (DNA fragment
lengths). When a database is divided into fixed bins, the proportion of bands
within each bin is determined and the relevant proportions are used in estimating the profile frequency.
binning. Grouping VNTR alleles into sets of similar sizes because the alleles’
lengths are too similar to differentiate.
bins, floating. In VNTR profiling, a bin is a range of base pairs (DNA fragment
lengths). In a floating bin method of estimating a profile frequency, the bin is
centered on the base-pair length of the allele in question, and the width of the
bin can be defined by the laboratory’s matching rule (e.g., ±5% of band size).
blind proficiency test. See proficiency test.
capillary electrophoresis. A method for separating DNA fragments (including STRs) according to their lengths. A long, narrow tube is filled with an
entangled polymer or comparable sieving medium, and an electric field is
applied to pull DNA fragments placed at one end of the tube through the
medium. The procedure is faster and uses smaller samples than gel electrophoresis, and it can be automated.
ceiling principle. A procedure for setting a minimum DNA profile frequency
proposed in 1992 by a committee of the National Academy of Sciences. One
hundred persons from each of 15 to 20 genetically homogeneous populations
spanning the range of racial groups in the United States are sampled. For each
allele, the highest frequency among the groups sampled (or 5%, whichever is
larger) is used in calculating the profile frequency. Compare interim ceiling
principle.
chip. A miniaturized system for genetic analysis. One such chip mimics capillary electrophoresis and related manipulations. DNA fragments, pulled by
small voltages, move through tiny channels etched into a small block of glass,
silicon, quartz, or plastic. This system should be useful in analyzing STRs.
Another technique mimics reverse dot blots by placing a large array of oligonucleotide probes on a solid surface. Such hybridization arrays are useful in
identifying SNPs and in sequencing mitochondrial DNA.
chromosome. A rodlike structure composed of DNA, RNA, and proteins.
Most normal human cells contain 46 chromosomes, 22 autosomes and a sex
chromosome (X) inherited from the mother, and another 22 autosomes and
one sex chromosome (either X or Y) inherited from the father. The genes are
located along the chromosomes. See also homologous chromosomes.
coding and noncoding DNA. The sequence in which the building blocks
(amino acids) of a protein are arranged corresponds to the sequence of base
pairs within a gene. (A sequence of three base pairs specifies a particular
one of the 20 possible amino acids in the protein. The mapping of a set of
three nucleotide bases to a particular amino acid is the genetic code. The
cell makes the protein through intermediate steps involving coding RNA
transcripts.) About 1.5% of the human genome codes for the amino acid
sequences. Another 23.5% of the genome is classified as genetic sequence but
does not encode proteins. This portion of the noncoding DNA is involved
in regulating the activity of genes. It includes promoters, enhancers, and
repressors. Other gene-related DNA consists of introns (that interrupt the
coding sequences, called exons, in genes and that are edited out of the RNA
transcript for the protein), pseudogenes (evolutionary remnants of once-functional genes), and gene fragments. The remaining, extragenic DNA
(about 75% of the genome) also is noncoding.
CODIS (combined DNA index system). A collection of databases on STR
and other loci of convicted felons, maintained by the FBI.
complementary sequence. The sequence of nucleotides on one strand of DNA
that corresponds to the sequence on the other strand. For example, if one
sequence is CTGAA, the complementary bases are GACTT.
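The base-pairing rule in this entry (A with T, C with G) can be sketched as a simple lookup table; this is only an illustration of the pairing rule, not forensic software:

```python
# Each base maps to its complement: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complementary_sequence(sequence: str) -> str:
    """Return the base-by-base complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in sequence)

print(complementary_sequence("CTGAA"))  # prints GACTT, as in the example above
```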
control region. See D-loop.
cytoplasm. A jelly-like material (80% water) that fills the cell.
cytosine (C). One of the four bases, or nucleotides, that make up the DNA
double helix. Cytosine binds only to guanine. See nucleotide.
database. A collection of DNA profiles.
degradation. The breaking down of DNA by chemical or physical means.
denature, denaturation. The process of splitting, as by heating, two complementary strands of the DNA double helix into single strands in preparation
for hybridization with biological probes.
deoxyribonucleic acid (DNA). The molecule that contains genetic information. DNA is composed of nucleotide building blocks, each containing a
base (A, C, G, or T), a phosphate, and a sugar. These nucleotides are linked
together in a double helix—two strands of DNA molecules paired up at
complementary bases (A with T, C with G). See adenine, cytosine, guanine,
thymine.
diploid number. See haploid number.
D-loop. A portion of the mitochondrial genome known as the “control
region” or “displacement loop” instrumental in the regulation and initiation
of mtDNA gene products. Two short “hypervariable” regions within the
D-loop do not appear to be functional and are the sequences used in identity
or kinship testing.
DNA polymerase. The enzyme that catalyzes the synthesis of double-stranded
DNA.
DNA probe. See probe.
DNA profile. The alleles at each locus. For example, a VNTR profile is the
pattern of band lengths on an autorad. A multilocus profile represents the
combined results of multiple probes. See genotype.
DNA sequence. The ordered list of base pairs in a duplex DNA molecule or of
bases in a single strand.
DQ. The antigen that is the product of the DQA gene. See DQA, human
leukocyte antigen.
DQA. The gene that codes for a particular class of human leukocyte antigen
(HLA). This gene has been sequenced completely and can be used for forensic
typing. See human leukocyte antigen.
EDTA. A preservative added to blood samples.
electropherogram. The PCR products separated by capillary electrophoresis
can be labeled with a dye that glows at a given wavelength in response to
light shined on it. As the tagged fragments pass the light source, an electronic
camera records the intensity of the fluorescence. Plotting the intensity as a
function of time produces a series of peaks, with the shorter fragments producing peaks sooner. The intensity is measured in relative fluorescent units
and is proportional to the number of glowing fragments passing by the detector. The graph of the intensity over time is an electropherogram.
electrophoresis. See capillary electrophoresis, gel electrophoresis.
endonuclease. An enzyme that cleaves the phosphodiester bond within a
nucleotide chain.
environmental insult. Exposure of DNA to external agents such as heat, moisture, and ultraviolet radiation, or chemical or bacterial agents. Such exposure
can interfere with the enzymes used in the testing process or otherwise make
DNA difficult to analyze.
enzyme. A protein that catalyzes (speeds up) a reaction.
epigenetic. Heritable changes in phenotype (appearance) or gene expression
caused by mechanisms other than changes in the underlying DNA sequence.
Epigenetic marks are molecules attached to DNA that can determine whether
genes are active and used by the cell.
ethidium bromide. A molecule that can intercalate into DNA double helices
when the helix is under torsional stress. Used to identify the presence of DNA
in a sample by its fluorescence under ultraviolet light.
exon. See coding and noncoding DNA.
fallacy of the transposed conditional. See transposition fallacy.
false match. Two samples of DNA that have different profiles could be declared
to match if, instead of measuring the distinct DNA in each sample, there is
an error in handling or preparing samples such that the DNA from a single
sample is analyzed twice. The resulting match, which does not reflect the
true profiles of the DNA from each sample, is a false match. Some people
use “false match” more broadly, to include cases in which the true profiles of
each sample are the same, but the samples come from different individuals.
Compare true match. See also match, random match.
gel, agarose. A semisolid medium used to separate molecules by electrophoresis.
gel electrophoresis. In RFLP analysis, the process of sorting DNA fragments
by size by applying an electric current to a gel. The different-size fragments
move at different rates through the gel.
gene. A set of nucleotide base pairs on a chromosome that contains the “instructions” for controlling some cellular function such as making an enzyme. The
gene is the fundamental unit of heredity; each simple gene “codes” for a
specific biological characteristic.
gene frequency. The relative frequency (proportion) of an allele in a population.
genetic drift. Random fluctuation in a population’s allele frequencies from
generation to generation.
genetics. The study of the patterns, processes, and mechanisms of inheritance of
biological characteristics.
genome. The complete genetic makeup of an organism, including roughly
23,000 genes and many other DNA sequences in humans. The haploid human
genome contains over three billion nucleotide base pairs.
genotype. The particular forms (alleles) of a set of genes possessed by an organism (as distinguished from phenotype, which refers to how the genotype
expresses itself, as in physical appearance). In DNA analysis, the term is
applied to the variations within all DNA regions (whether or not they constitute genes) that are analyzed.
genotype, multilocus. The alleles that an organism possesses at several sites in
its genome.
genotype, single locus. The alleles that an organism possesses at a particular
site in its genome.
guanine (G). One of the four bases, or nucleotides, that make up the DNA
double helix. Guanine binds only to cytosine. See nucleotide.
haploid number. Human sex cells (egg and sperm) contain 23 chromosomes
each. This is the haploid number. When a sperm cell fertilizes an egg cell, the
number of chromosomes doubles to 46. This is the diploid number.
haplotype. A specific combination of linked alleles at several loci.
Hardy-Weinberg equilibrium. A condition in which the allele frequencies
within a large, random, interbreeding population are unrelated to patterns of
mating. In this condition, the occurrence of alleles from each parent will be
independent and have a joint frequency estimated by the product rule. See
independence, linkage disequilibrium.
heteroplasmy, heteroplasty. The condition in which some copies of mitochondrial DNA in the same individual have different base pairs at certain
points.
heterozygous. Having a different allele at a given locus on each of a pair of
homologous chromosomes. See allele. Compare homozygous.
homologous chromosomes. The 44 autosomes (nonsex chromosomes) in the
normal human genome are in homologous pairs (one from each parent) that
share an identical set of genes, but may have different alleles at the same loci.
homozygous. Having the same allele at a given locus on each of a pair of
homologous chromosomes. See allele. Compare heterozygous.
human leukocyte antigen (HLA). Antigen (foreign body that stimulates an
immune system response) located on the surface of most cells (excluding red
blood cells and sperm cells). HLAs differ among individuals and are associated
closely with transplant rejection. See DQA.
hybridization. Pairing up of complementary strands of DNA from different sources at the matching base-pair sites. For example, a primer with
the sequence AGGTCT would bond with the complementary sequence
TCCAGA on a DNA fragment.
independence. Two events are said to be independent if one is neither more
nor less likely to occur when the other does.
interim ceiling principle. A procedure proposed in 1992 by a committee of
the National Academy of Sciences for setting a minimum DNA profile frequency. For each allele, the highest frequency (adjusted upward for sampling
error) found in any major racial group (or 10%, whichever is higher), is used
in product-rule calculations. Compare ceiling principle.
intron. See coding and noncoding DNA.
kilobase (kb). A measure of DNA length (1000 bases).
likelihood ratio. A measure of the support that an observation provides for one
hypothesis as opposed to an alternative hypothesis. The likelihood ratio is
computed by dividing the conditional probability of the observation given
that one hypothesis is true by the conditional probability of the observation
given the alternative hypothesis. For example, the likelihood ratio for the
hypothesis that two DNA samples with the same STR profile originated
from the same individual (as opposed to originating from two unrelated
individuals) is the reciprocal of the random-match probability. Legal scholars
have introduced the likelihood ratio as a measure of the probative value of
evidence. Evidence that is 100 times more probable to be observed when
one hypothesis is true as opposed to another has more probative value than
evidence that is only twice as probable.
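As a rough sketch of the ratio described above (the random-match probability below is a made-up value, purely for illustration):

```python
def likelihood_ratio(p_obs_given_h1: float, p_obs_given_h2: float) -> float:
    """P(observation | hypothesis 1) divided by P(observation | hypothesis 2)."""
    return p_obs_given_h1 / p_obs_given_h2

# Same-source hypothesis: two samples from one person match with probability 1.
# Different-source hypothesis: an unrelated person matches with the
# random-match probability (the value below is hypothetical).
random_match_probability = 1e-6
lr = likelihood_ratio(1.0, random_match_probability)
print(lr)  # about one million: the reciprocal of the random-match probability
```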
linkage. The inheritance together of two or more genes on the same chromosome.
linkage equilibrium. A condition in which the occurrence of alleles at different
loci is independent.
locus. A location in the genome, that is, a position on a chromosome where a
gene or other structure begins.
mass spectroscopy. The separation of elements or molecules according to their
molecular weight. In the version being developed for DNA analysis, small
quantities of PCR-amplified fragments are irradiated with a laser to form
gaseous ions that traverse a fixed distance. Heavier ions have longer times of
flight, and the process is known as matrix-assisted laser desorption-ionization
time-of-flight mass spectroscopy. MALDI-TOF-MS, as it is abbreviated, may
be useful in analyzing STRs.
match. The presence of the same allele or alleles in two samples. Two DNA
profiles are declared to match when they are indistinguishable in genetic type.
For loci with discrete alleles, two samples match when they display the same
set of alleles. For RFLP testing of VNTRs, two samples match when the
pattern of the bands is similar and the positions of the corresponding bands
at each locus fall within a preset distance. See match window, false match,
true match.
match window. If two RFLP bands lie within a preset distance, called the
match window, that reflects normal measurement error, they can be declared
to match.
microsatellite. Another term for an STR.
minisatellite. Another term for a VNTR.
mitochondria. A structure (organelle) within nucleated (eukaryotic) cells that
is the site of the energy-producing reactions within the cell. Mitochondria
contain their own DNA (often abbreviated as mtDNA), which is inherited
only from mother to child.
molecular weight. The weight in grams of 1 mole (approximately 6.02 × 1023
molecules) of a pure, molecular substance.
monomorphic. A gene or DNA characteristic that is almost always found in
only one form in a population. Compare polymorphism.
multilocus probe. A probe that marks multiple sites (loci). RFLP analysis using
a multilocus probe will yield an autorad showing a striped pattern of 30 or
more bands. Such probes are no longer used in forensic applications.
multilocus profile. See profile.
multiplexing. Typing several loci simultaneously.
mutation. The process that produces a gene or chromosome set differing from
the type already in the population; the gene or chromosome set that results
from such a process.
nanogram (ng). A billionth of a gram.
nucleic acid. RNA or DNA.
nucleotide. A unit of DNA consisting of a base (A, C, G, or T) attached to
a phosphate and a sugar group; the basic building block of nucleic acids. See
deoxyribonucleic acid.
nucleus. The membrane-covered portion of a eukaryotic cell containing most
of the DNA and found within the cytoplasm.
oligonucleotide. A synthetic polymer made up of fewer than 100 nucleotides;
used as a primer or a probe in PCR. See primer.
paternity index. A number (technically, a likelihood ratio) that indicates the support that the paternity test results lend to the hypothesis that the alleged father
is the biological father as opposed to the hypothesis that another man selected
at random is the biological father. Assuming that the observed phenotypes correctly represent the phenotypes of the mother, child, and alleged father tested,
the number can be computed as the ratio of the probability of the phenotypes
under the first hypothesis to the probability under the second hypothesis. Large
values indicate substantial support for the hypothesis of paternity; values near
zero indicate substantial support for the hypothesis that someone other than
the alleged father is the biological father; and values near unity indicate that
the results do not help in determining which hypothesis is correct.
pH. A measure of the acidity of a solution.
phenotype. A trait, such as eye color or blood group, resulting from a genotype.
point mutation. See SNP.
polymarker. A commercially marketed set of PCR-based tests for protein
polymorphisms.
polymerase chain reaction (PCR). A process that mimics DNA’s own replication processes to make up to millions of copies of short strands of genetic
material in a few hours.
polymorphism. The presence of several forms of a gene or DNA characteristic
in a population.
population genetics. The study of the genetic composition of groups of
individuals.
population structure. When a population is divided into subgroups that do not
mix freely, that population is said to have structure. Significant structure can
lead to allele frequencies being different in the subpopulations.
primer. An oligonucleotide that attaches to one end of a DNA fragment and
provides a point for more complementary nucleotides to attach and replicate
the DNA strand. See oligonucleotide.
probe. In forensics, a short segment of DNA used to detect certain alleles. The
probe hybridizes, or matches up, to a specific complementary sequence.
Probes allow visualization of the hybridized DNA, either by a radioactive
tag (usually used for RFLP analysis) or a biochemical tag (usually used for
PCR-based analyses).
product rule. When alleles occur independently at each locus (Hardy-Weinberg
equilibrium) and across loci (linkage equilibrium), the proportion of the
population with a given genotype is the product of the proportion of each
allele at each locus, times factors of two for heterozygous loci.
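A minimal sketch of the calculation, using made-up allele proportions (they are not real population frequencies):

```python
def single_locus_proportion(p: float, q: float, heterozygous: bool) -> float:
    """Hardy-Weinberg proportion at one locus: 2pq if heterozygous, p*p if not."""
    return 2 * p * q if heterozygous else p * p

def profile_proportion(loci) -> float:
    """Multiply the single-locus proportions across loci (the product rule,
    which assumes linkage equilibrium)."""
    total = 1.0
    for p, q, heterozygous in loci:
        total *= single_locus_proportion(p, q, heterozygous)
    return total

# Hypothetical three-locus profile: two heterozygous loci, one homozygous.
profile = [(0.1, 0.2, True), (0.05, 0.3, True), (0.25, 0.25, False)]
print(profile_proportion(profile))  # about 7.5e-05
```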
proficiency test. A test administered at a laboratory to evaluate its performance.
In a blind proficiency study, the laboratory personnel do not know that they
are being tested.
prosecutor’s fallacy. See transposition fallacy.
protein. A class of biologically important molecules made up of a linear string
of building blocks called amino acids. The order in which these components
are arranged is encoded in the DNA sequence of the gene that expresses the
protein. See coding and noncoding DNA.
pseudogenes. Genes that have been so disabled by mutations that they can no
longer produce proteins. Some pseudogenes can still produce noncoding
RNA.
quality assurance. A program conducted by a laboratory to ensure accuracy
and reliability.
quality audit. A systematic and independent examination and evaluation of a
laboratory’s operations.
quality control. Activities used to monitor the ability of DNA typing to meet
specified criteria.
random match. A match in the DNA profiles of two samples of DNA, where one
is drawn at random from the population. See also random-match probability.
random-match probability. The chance of a random match. As it is usually
used in court, the random-match probability refers to the probability of a true
match when the DNA being compared to the evidence DNA comes from
a person drawn at random from the population. This random true match
probability reveals the probability of a true match when the samples of DNA
come from different, unrelated people.
random mating. The members of a population are said to mate randomly with
respect to particular genes or DNA characteristics when the choice of mates
is independent of the alleles.
recombination. In general, any process in a diploid or partially diploid cell that
generates new gene or chromosomal combinations not found in that cell or
in its progenitors.
reference population. The population to which the perpetrator of a crime is
thought to belong.
relative fluorescent unit (RFU). See electropherogram.
replication. The synthesis of new DNA from existing DNA. See polymerase
chain reaction.
restriction enzyme. Protein that cuts double-stranded DNA at specific base-pair sequences (different enzymes recognize different sequences). See restriction site.
restriction fragment length polymorphism (RFLP). Variation among people
in the length of a segment of DNA cut at two restriction sites.
restriction fragment length polymorphism (RFLP) analysis. Analysis of
individual variations in the lengths of DNA fragments produced by digesting
sample DNA with a restriction enzyme.
restriction site. A sequence marking the location at which a restriction enzyme
cuts DNA into fragments. See restriction enzyme.
reverse dot blot. A detection method used to identify SNPs in which DNA
probes are affixed to a membrane, and amplified DNA is passed over the
probes to see if it contains the complementary sequence.
ribonucleic acid (RNA). A single-stranded molecule “transcribed” from
DNA. “Coding” RNA acts as a template for building proteins according to
the sequences in the coding DNA from which it is transcribed. Other RNA
transcripts can be a sensor for detecting signals that affect gene expression, a
switch for turning genes off or on, or they may be functionless.
sequence-specific oligonucleotide (SSO) probe. Also, allele-specific oligonucleotide (ASO) probe. Oligonucleotide probes used in a PCR-associated
detection technique to identify the presence or absence of certain base-pair
sequences identifying different alleles. The probes are visualized by an array
of dots rather than by the electropherograms associated with STR analysis.
sequencing. Determining the order of base pairs in a segment of DNA.
short tandem repeat (STR). See variable number tandem repeat.
single-locus probe. A probe that only marks a specific site (locus). RFLP analysis using a single-locus probe will yield an autorad showing one band if the
individual is homozygous, two bands if heterozygous. Likewise, the probe
will produce one or two peaks in an STR electropherogram.
SNP (single nucleotide polymorphism). A substitution, insertion, or deletion
of a single base pair at a given point in the genome.
SNP chip. See chip.
Southern blotting. Named for its inventor, a technique by which processed
DNA fragments, separated by gel electrophoresis, are transferred onto a nylon
membrane in preparation for the application of biological probes.
thymine (T). One of the four bases, or nucleotides, that make up the DNA
double helix. Thymine binds only to adenine. See nucleotide.
transposition fallacy. Also called the prosecutor’s fallacy, the transposition
fallacy confuses the conditional probability of A given B [P(A|B)] with that
of B given A [P(B|A)]. Few people think that the probability that a person
speaks Spanish (A) given that he or she is a citizen of Chile (B) equals the
probability that a person is a citizen of Chile (B) given that he or she speaks
Spanish (A). Yet, many court opinions, newspaper articles, and even some
expert witnesses speak of the probability of a matching DNA genotype (A)
given that someone other than the defendant is the source of the crime scene
DNA (B) as if it were the probability of someone else being the source (B)
given the matching profile (A). Transposing conditional probabilities correctly
requires Bayes’ theorem.
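The Spanish/Chile example above can be made numerical. The figures below are rough, illustrative assumptions (not census data), chosen only to show how far apart the two conditional probabilities can be:

```python
# Assumed, illustrative figures: 18 million Chilean citizens, 99% of whom
# speak Spanish, out of roughly 480 million Spanish speakers worldwide.
chileans = 18_000_000
spanish_speakers = 480_000_000
p_spanish_given_chilean = 0.99  # P(A | B)

# Count Spanish-speaking Chileans, then condition the other way around.
p_chilean_given_spanish = chileans * p_spanish_given_chilean / spanish_speakers  # P(B | A)

print(p_spanish_given_chilean)            # 0.99
print(round(p_chilean_given_spanish, 3))  # 0.037 -- transposing changes the answer
```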
true match. Two samples of DNA that have the same profile should match
when tested. If there is no error in the labeling, handling, and analysis of the
samples and in the reporting of the results, a match is a true match. A true
match establishes that the two samples of DNA have the same profile. Unless
the profile is unique, however, a true match does not conclusively prove that
the two samples came from the same source. Some people use “true match”
more narrowly, to mean only those matches among samples from the same
source. Compare false match. See also match, random match.
variable number tandem repeat (VNTR). A class of RFLPs resulting from
multiple copies of virtually identical base-pair sequences, arranged in succession at a specific locus on a chromosome. The number of repeats varies from
individual to individual, thus providing a basis for individual recognition.
VNTRs are longer than STRs.
window. See match window.
X chromosome. See chromosome.
Y chromosome. See chromosome.
References on DNA
Forensic DNA Interpretation (John Buckleton et al. eds., 2005).
John M. Butler, Fundamentals of Forensic DNA Typing (2010).
Ian W. Evett & Bruce S. Weir, Interpreting DNA Evidence: Statistical Genetics
for Forensic Scientists (1998).
William Goodwin et al., An Introduction to Forensic Genetics (2d ed. 2011).
David H. Kaye, The Double Helix and the Law of Evidence (2010).
National Research Council Committee on DNA Forensic Science: An Update,
The Evaluation of Forensic DNA Evidence (1996).
National Research Council Committee on DNA Technology in Forensic Science,
DNA Technology in Forensic Science (1992).
The President’s DNA Initiative, Forensic DNA Resources for Specific Audiences,
available at www.dna.gov/audiences/.
Reference Guide on Statistics
DAVID H. KAYE AND DAVID A. FREEDMAN
David H. Kaye, M.A., J.D., is Distinguished Professor of Law and Weiss Family Scholar,
The Pennsylvania State University, University Park, and Regents’ Professor Emeritus, Arizona
State University Sandra Day O’Connor College of Law and School of Life Sciences, Tempe.
David A. Freedman, Ph.D., was Professor of Statistics, University of California, Berkeley.
[Editor’s Note: Sadly, Professor Freedman passed away during the production of this
manual.]
CONTENTS
I. Introduction, 213
A. Admissibility and Weight of Statistical Studies, 214
B. Varieties and Limits of Statistical Expertise, 214
C. Procedures That Enhance Statistical Testimony, 215
1. Maintaining professional autonomy, 215
2. Disclosing other analyses, 216
3. Disclosing data and analytical methods before trial, 216
II. How Have the Data Been Collected? 216
A. Is the Study Designed to Investigate Causation? 217
1. Types of studies, 217
2. Randomized controlled experiments, 220
3. Observational studies, 220
4. Can the results be generalized? 222
B. Descriptive Surveys and Censuses, 223
1. What method is used to select the units? 223
2. Of the units selected, which are measured? 226
C. Individual Measurements, 227
1. Is the measurement process reliable? 227
2. Is the measurement process valid? 228
3. Are the measurements recorded correctly? 229
D. What Is Random? 230
III. How Have the Data Been Presented? 230
A. Are Rates or Percentages Properly Interpreted? 230
1. Have appropriate benchmarks been provided? 230
2. Have the data collection procedures changed? 231
3. Are the categories appropriate? 231
4. How big is the base of a percentage? 233
5. What comparisons are made? 233
B. Is an Appropriate Measure of Association Used? 233
C. Does a Graph Portray Data Fairly? 236
1. How are trends displayed? 236
2. How are distributions displayed? 236
D. Is an Appropriate Measure Used for the Center of a Distribution? 238
E. Is an Appropriate Measure of Variability Used? 239
IV. What Inferences Can Be Drawn from the Data? 240
A. Estimation, 242
1. What estimator should be used? 242
2. What is the standard error? The confidence interval? 243
3. How big should the sample be? 246
4. What are the technical difficulties? 247
B. Significance Levels and Hypothesis Tests, 249
1. What is the p-value? 249
2. Is a difference statistically significant? 251
3. Tests or interval estimates? 252
4. Is the sample statistically significant? 253
C. Evaluating Hypothesis Tests, 253
1. What is the power of the test? 253
2. What about small samples? 254
3. One tail or two? 255
4. How many tests have been done? 256
5. What are the rival hypotheses? 257
D. Posterior Probabilities, 258
V. Correlation and Regression, 260
A. Scatter Diagrams, 260
B. Correlation Coefficients, 261
1. Is the association linear? 262
2. Do outliers influence the correlation coefficient? 262
3. Does a confounding variable influence the coefficient? 262
C. Regression Lines, 264
1. What are the slope and intercept? 265
2. What is the unit of analysis? 266
D. Statistical Models, 268
Appendix, 273
A. Frequentists and Bayesians, 273
B. The Spock Jury: Technical Details, 275
C. The Nixon Papers: Technical Details, 278
D. A Social Science Example of Regression: Gender Discrimination in
Salaries, 279
1. The regression model, 279
2. Standard errors, t-statistics, and statistical significance, 281
Glossary of Terms, 283
References on Statistics, 302
I. Introduction
Statistical assessments are prominent in many kinds of legal cases, including
antitrust, employment discrimination, toxic torts, and voting rights cases.1 This
reference guide describes the elements of statistical reasoning. We hope the explanations will help judges and lawyers to understand statistical terminology, to see
the strengths and weaknesses of statistical arguments, and to apply relevant legal
doctrine. The guide is organized as follows:
• Section I provides an overview of the field, discusses the admissibility
of statistical studies, and offers some suggestions about procedures that
encourage the best use of statistical evidence.
• Section II addresses data collection and explains why the design of a study is the most important determinant of its quality. This section compares experiments with observational studies and surveys with censuses, indicating when the various kinds of study are likely to provide useful results.
• Section III discusses the art of summarizing data. This section considers the
mean, median, and standard deviation. These are basic descriptive statistics,
and most statistical analyses use them as building blocks. This section also
discusses patterns in data that are brought out by graphs, percentages, and
tables.
• Section IV describes the logic of statistical inference, emphasizing foundations and disclosing limitations. This section covers estimation, standard errors and confidence intervals, p-values, and hypothesis tests.
• Section V shows how associations can be described by scatter diagrams,
correlation coefficients, and regression lines. Regression is often used to
infer causation from association. This section explains the technique, indicating the circumstances under which it and other statistical models are
likely to succeed—or fail.
• An appendix provides some technical details.
• The glossary defines statistical terms that may be encountered in litigation.
1. See generally Statistical Science in the Courtroom (Joseph L. Gastwirth ed., 2000); Statistics
and the Law (Morris H. DeGroot et al. eds., 1986); National Research Council, The Evolving Role
of Statistical Assessments as Evidence in the Courts (Stephen E. Fienberg ed., 1989) [hereinafter The
Evolving Role of Statistical Assessments as Evidence in the Courts]; Michael O. Finkelstein & Bruce
Levin, Statistics for Lawyers (2d ed. 2001); 1 & 2 Joseph L. Gastwirth, Statistical Reasoning in Law
and Public Policy (1988); Hans Zeisel & David Kaye, Prove It with Figures: Empirical Methods in
Law and Litigation (1997).
A. Admissibility and Weight of Statistical Studies
Statistical studies suitably designed to address a material issue generally will be
admissible under the Federal Rules of Evidence. The hearsay rule rarely is a
serious barrier to the presentation of statistical studies, because such studies may
be offered to explain the basis for an expert’s opinion or may be admissible under
the learned treatise exception to the hearsay rule.2 Because most statistical methods
relied on in court are described in textbooks or journal articles and are capable
of producing useful results when properly applied, these methods generally satisfy
important aspects of the “scientific knowledge” requirement in Daubert v. Merrell
Dow Pharmaceuticals, Inc.3 Of course, a particular study may use a method that is
entirely appropriate but that is so poorly executed that it should be inadmissible
under Federal Rules of Evidence 403 and 702.4 Or, the method may be inappropriate for the problem at hand and thus lack the “fit” spoken of in Daubert.5 Or
the study might rest on data of the type not reasonably relied on by statisticians or
substantive experts and hence run afoul of Federal Rule of Evidence 703. Often,
however, the battle over statistical evidence concerns weight or sufficiency rather
than admissibility.
B. Varieties and Limits of Statistical Expertise
For convenience, the field of statistics may be divided into three subfields: probability theory, theoretical statistics, and applied statistics. Probability theory is the
mathematical study of outcomes that are governed, at least in part, by chance.
Theoretical statistics is about the properties of statistical procedures, including
error rates; probability theory plays a key role in this endeavor. Applied statistics
draws on both of these fields to develop techniques for collecting or analyzing
particular types of data.
2. See generally 2 McCormick on Evidence §§ 321, 324.3 (Kenneth S. Broun ed., 6th ed. 2006).
Studies published by government agencies also may be admissible as public records. Id. § 296.
3. 509 U.S. 579, 589–90 (1993).
4. See Kumho Tire Co. v. Carmichael, 526 U.S. 137, 152 (1999) (suggesting that the trial court
should “make certain that an expert, whether basing testimony upon professional studies or personal
experience, employs in the courtroom the same level of intellectual rigor that characterizes the practice
of an expert in the relevant field.”); Malletier v. Dooney & Bourke, Inc., 525 F. Supp. 2d 558, 562–63
(S.D.N.Y. 2007) (“While errors in a survey’s methodology usually go to the weight accorded to the
conclusions rather than its admissibility, . . . ‘there will be occasions when the proffered survey is so
flawed as to be completely unhelpful to the trier of fact.’”) (quoting AHP Subsidiary Holding Co. v.
Stuart Hale Co., 1 F.3d 611, 618 (7th Cir.1993)).
5. Daubert, 509 U.S. at 591; Anderson v. Westinghouse Savannah River Co., 406 F.3d 248 (4th
Cir. 2005) (motion to exclude statistical analysis that compared black and white employees without
adequately taking into account differences in their job titles or positions was properly granted under
Daubert); Malletier, 525 F. Supp. 2d at 569 (excluding a consumer survey for “a lack of fit between the
survey’s questions and the law of dilution” and errors in the execution of the survey).
Statistical expertise is not confined to those with degrees in statistics. Because
statistical reasoning underlies many kinds of empirical research, scholars in a
variety of fields—including biology, economics, epidemiology, political science,
and psychology—are exposed to statistical ideas, with an emphasis on the methods
most important to the discipline.
Experts who specialize in using statistical methods, and whose professional
careers demonstrate this orientation, are most likely to use appropriate procedures
and correctly interpret the results. By contrast, forensic scientists often lack basic
information about the studies underlying their testimony. State v. Garrison6 illustrates the problem. In this murder prosecution involving bite mark evidence, a
dentist was allowed to testify that “the probability factor of two sets of teeth being
identical in a case similar to this is, approximately, eight in one million,” even
though “he was unaware of the formula utilized to arrive at that figure other than
that it was ‘computerized.’”7
At the same time, the choice of which data to examine, or how best to model
a particular process, could require subject matter expertise that a statistician lacks.
As a result, cases involving statistical evidence frequently are (or should be) “two
expert” cases of interlocking testimony. A labor economist, for example, may
supply a definition of the relevant labor market from which an employer draws
its employees; the statistical expert may then compare the race of new hires to
the racial composition of the labor market. Naturally, the value of the statistical
analysis depends on the substantive knowledge that informs it.8
C. Procedures That Enhance Statistical Testimony
1. Maintaining professional autonomy
Ideally, experts who conduct research in the context of litigation should proceed
with the same objectivity that would be required in other contexts. Thus, experts
who testify (or who supply results used in testimony) should conduct the analysis
required to address in a professionally responsible fashion the issues posed by the
litigation.9 Questions about the freedom of inquiry accorded to testifying experts,
6. 585 P.2d 563 (Ariz. 1978).
7. Id. at 566, 568. For other examples, see David H. Kaye et al., The New Wigmore: A Treatise
on Evidence: Expert Evidence § 12.2 (2d ed. 2011).
8. In Vuyanich v. Republic National Bank, 505 F. Supp. 224, 319 (N.D. Tex. 1980), vacated, 723
F.2d 1195 (5th Cir. 1984), defendant’s statistical expert criticized the plaintiffs’ statistical model for an
implicit, but restrictive, assumption about male and female salaries. The district court trying the case
accepted the model because the plaintiffs’ expert had a “very strong guess” about the assumption, and
her expertise included labor economics as well as statistics. Id. It is doubtful, however, that economic
knowledge sheds much light on the assumption, and it would have been simple to perform a less
restrictive analysis.
9. See The Evolving Role of Statistical Assessments as Evidence in the Courts, supra note 1, at
164 (recommending that the expert be free to consult with colleagues who have not been retained
as well as the scope and depth of their investigations, may reveal some of the
limitations to the testimony.
2. Disclosing other analyses
Statisticians analyze data using a variety of methods. There is much to be said for
looking at the data in several ways. To permit a fair evaluation of the analysis that
is eventually settled on, however, the testifying expert can be asked to explain
how that approach was developed. According to some commentators, counsel
who know of analyses that do not support the client’s position should reveal them,
rather than presenting only favorable results.10
3. Disclosing data and analytical methods before trial
The collection of data often is expensive and subject to errors and omissions.
Moreover, careful exploration of the data can be time-consuming. To minimize
debates at trial over the accuracy of data and the choice of analytical techniques,
pretrial discovery procedures should be used, particularly with respect to the quality of the data and the method of analysis.11
II. How Have the Data Been Collected?
The interpretation of data often depends on understanding “study design”—the
plan for a statistical study and its implementation.12 Different designs are suited to
answering different questions. Also, flaws in the data can undermine any statistical
analysis, and data quality is often determined by study design.
In many cases, statistical studies are used to show causation. Do food additives
cause cancer? Does capital punishment deter crime? Would additional disclosures
by any party to the litigation and that the expert receive a letter of engagement providing for these
and other safeguards).
10. Id. at 167; cf. William W. Schwarzer, In Defense of “Automatic Disclosure in Discovery,” 27
Ga. L. Rev. 655, 658–59 (1993) (“[T]he lawyer owes a duty to the court to make disclosure of core
information.”). The National Research Council also recommends that “if a party gives statistical data
to different experts for competing analyses, that fact be disclosed to the testifying expert, if any.” The
Evolving Role of Statistical Assessments as Evidence in the Courts, supra note 1, at 167.
11. See The Special Comm. on Empirical Data in Legal Decision Making, Recommendations
on Pretrial Proceedings in Cases with Voluminous Data, reprinted in The Evolving Role of Statistical
Assessments as Evidence in the Courts, supra note 1, app. F; see also David H. Kaye, Improving Legal
Statistics, 24 Law & Soc’y Rev. 1255 (1990).
12. For introductory treatments of data collection, see, for example, David Freedman et al.,
Statistics (4th ed. 2007); Darrell Huff, How to Lie with Statistics (1993); David S. Moore & William
I. Notz, Statistics: Concepts and Controversies (6th ed. 2005); Hans Zeisel, Say It with Figures (6th
ed. 1985); Zeisel & Kaye, supra note 1.
in a securities prospectus cause investors to behave differently? The design of
studies to investigate causation is the first topic of this section.13
Sample data can be used to describe a population. The population is the
whole class of units that are of interest; the sample is the set of units chosen for
detailed study. Inferences from the part to the whole are justified when the sample
is representative. Sampling is the second topic of this section.
Finally, the accuracy of the data will be considered. Because making and
recording measurements is an error-prone activity, error rates should be assessed
and the likely impact of errors considered. Data quality is the third topic of this
section.
A. Is the Study Designed to Investigate Causation?
1. Types of studies
When causation is the issue, anecdotal evidence can be brought to bear. So can
observational studies or controlled experiments. Anecdotal reports may be of
value, but they are ordinarily more helpful in generating lines of inquiry than in
proving causation.14 Observational studies can establish that one factor is associ-
13. See also Michael D. Green et al., Reference Guide on Epidemiology, Section V, in this
manual; Joseph Rodricks, Reference Guide on Exposure Science, Section E, in this manual.
14. In medicine, evidence from clinical practice can be the starting point for discovery of
cause-and-effect relationships. For examples, see David A. Freedman, On Types of Scientific Enquiry, in
The Oxford Handbook of Political Methodology 300 (Janet M. Box-Steffensmeier et al. eds., 2008).
Anecdotal evidence is rarely definitive, and some courts have suggested that attempts to infer causation from anecdotal reports are inadmissible as unsound methodology under Daubert v. Merrell Dow
Pharmaceuticals, Inc., 509 U.S. 579 (1993). See, e.g., McClain v. Metabolife Int’l, Inc., 401 F.3d 1233,
1244 (11th Cir. 2005) (“simply because a person takes drugs and then suffers an injury does not show
causation. Drawing such a conclusion from temporal relationships leads to the blunder of the post hoc
ergo propter hoc fallacy.”); In re Baycol Prods. Litig., 532 F. Supp. 2d 1029, 1039–40 (D. Minn. 2007)
(excluding a meta-analysis based on reports to the Food and Drug Administration of adverse events);
Leblanc v. Chevron USA Inc., 513 F. Supp. 2d 641, 650 (E.D. La. 2007) (excluding plaintiffs’ experts’
opinions that benzene causes myelofibrosis because the causal hypothesis “that has been generated by
case reports . . . has not been confirmed by the vast majority of epidemiologic studies of workers being
exposed to benzene and more generally, petroleum products.”), vacated, 275 Fed. App’x. 319 (5th
Cir. 2008) (remanding for consideration of newer government report on health effects of benzene);
cf. Matrixx Initiatives, Inc. v. Siracusano, 131 S. Ct. 1309, 1321 (2011) (concluding that adverse event
reports combined with other information could be of concern to a reasonable investor and therefore
subject to a requirement of disclosure under SEC Rule 10b-5, but stating that “the mere existence of
reports of adverse events . . . says nothing in and of itself about whether the drug is causing the adverse
events”). Other courts are more open to “differential diagnoses” based primarily on timing. E.g., Best v.
Lowe’s Home Ctrs., Inc., 563 F.3d 171 (6th Cir. 2009) (reversing the exclusion of a physician’s opinion
that exposure to propenyl chloride caused a man to lose his sense of smell because of the timing in this
one case and the physician’s inability to attribute the change to anything else); Kaye et al., supra note
7, §§ 8.7.2 & 12.5.1. See also Matrixx Initiatives, supra, at 1322 (listing “a temporal relationship” in a
single patient as one indication of “a reliable causal link”).
ated with another, but work is needed to bridge the gap between association and
causation. Randomized controlled experiments are ideally suited for demonstrating causation.
Anecdotal evidence usually amounts to reports that events of one kind are
followed by events of another kind. Typically, the reports are not even sufficient
to show association, because there is no comparison group. For example, some
children who live near power lines develop leukemia. Does exposure to electrical
and magnetic fields cause this disease? The anecdotal evidence is not compelling
because leukemia also occurs among children without exposure.15 It is necessary
to compare disease rates among those who are exposed and those who are not.
If exposure causes the disease, the rate should be higher among the exposed and
lower among the unexposed. That would be association.
The next issue is crucial: Exposed and unexposed people may differ in ways
other than the exposure they have experienced. For example, children who live
near power lines could come from poorer families and be more at risk from other
environmental hazards. Such differences can create the appearance of a cause-and-effect relationship. Other differences can mask a real relationship. Cause-and-effect
relationships often are quite subtle, and carefully designed studies are needed to
draw valid conclusions.
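To see how a lurking variable can manufacture an association all by itself, consider a small simulation. All of the probabilities below are invented purely for illustration: a background hazard makes both exposure and disease more likely, while exposure itself has no causal effect at all.

```python
import random

random.seed(0)

# Hypothetical numbers, chosen only for illustration: a lurking variable
# ("hazard") raises both the chance of exposure and the chance of disease.
# Exposure itself has NO causal effect on disease.
def simulate_person():
    hazard = random.random() < 0.5          # confounder present or not
    p_exposed = 0.7 if hazard else 0.2      # confounder drives exposure
    p_disease = 0.30 if hazard else 0.05    # confounder drives disease
    exposed = random.random() < p_exposed
    disease = random.random() < p_disease   # note: does not depend on exposure
    return exposed, disease

people = [simulate_person() for _ in range(100_000)]

def rate(group):
    """Fraction of the group with the disease."""
    return sum(d for _, d in group) / len(group)

exposed = [p for p in people if p[0]]
unexposed = [p for p in people if not p[0]]

print(f"disease rate, exposed:   {rate(exposed):.3f}")
print(f"disease rate, unexposed: {rate(unexposed):.3f}")
# The exposed group shows a markedly higher disease rate even though
# exposure has no effect -- the confounder alone creates the association.
```

The comparison of rates reveals an association, exactly as the text describes, yet the association is entirely the work of the confounder.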
An epidemiological classic makes the point. At one time, it was thought that
lung cancer was caused by fumes from tarring the roads, because many lung cancer
patients lived near roads that recently had been tarred. This is anecdotal evidence.
But the argument is incomplete. For one thing, most people—whether exposed
to asphalt fumes or unexposed—did not develop lung cancer. A comparison of
rates was needed. The epidemiologists found that exposed persons and unexposed
persons suffered from lung cancer at similar rates: Tar was probably not the causal
agent. Exposure to cigarette smoke, however, turned out to be strongly associated
with lung cancer. This study, in combination with later ones, made a compelling
case that smoking cigarettes is the main cause of lung cancer.16
A good study design compares outcomes for subjects who are exposed to
some factor (the treatment group) with outcomes for other subjects who are
15. See National Research Council, Committee on the Possible Effects of Electromagnetic Fields
on Biologic Systems (1997); Zeisel & Kaye, supra note 1, at 66–67. There are problems in measuring exposure to electromagnetic fields, and results are inconsistent from one study to another. For
such reasons, the epidemiological evidence for an effect on health is inconclusive. National Research
Council, supra; Zeisel & Kaye, supra; Edward W. Campion, Power Lines, Cancer, and Fear, 337 New
Eng. J. Med. 44 (1997) (editorial); Martha S. Linet et al., Residential Exposure to Magnetic Fields and Acute
Lymphoblastic Leukemia in Children, 337 New Eng. J. Med. 1 (1997); Gary Taubes, Magnetic Field-Cancer
Link: Will It Rest in Peace?, 277 Science 29 (1997) (quoting various epidemiologists).
16. Richard Doll & A. Bradford Hill, A Study of the Aetiology of Carcinoma of the Lung, 2 Brit.
Med. J. 1271 (1952). This was a matched case-control study. Cohort studies soon followed. See
Green et al., supra note 13. For a review of the evidence on causation, see 38 International Agency
for Research on Cancer (IARC), World Health Org., IARC Monographs on the Evaluation of the
Carcinogenic Risk of Chemicals to Humans: Tobacco Smoking (1986).
not exposed (the control group). Now there is another important distinction to
be made—that between controlled experiments and observational studies. In a
controlled experiment, the investigators decide which subjects will be exposed
and which subjects will go into the control group. In observational studies, by
contrast, the subjects themselves choose their exposures. Because of self-selection,
the treatment and control groups are likely to differ with respect to influential
factors other than the one of primary interest. (These other factors are called lurking variables or confounding variables.)17 With the health effects of power lines,
family background is a possible confounder; so is exposure to other hazards. Many
confounders have been proposed to explain the association between smoking and
lung cancer, but careful epidemiological studies have ruled them out, one after
the other.
Confounding remains a problem to reckon with, even for the best observational research. For example, women with herpes are more likely to develop cervical cancer than other women. Some investigators concluded that herpes caused
cancer: In other words, they thought the association was causal. Later research
showed that the primary cause of cervical cancer was human papilloma virus
(HPV). Herpes was a marker of sexual activity. Women who had multiple sexual
partners were more likely to be exposed not only to herpes but also to HPV.
The association between herpes and cervical cancer was due to other variables.18
What are “variables”? In statistics, a variable is a characteristic of units in a
study. With a study of people, the unit of analysis is the person. Typical variables include income (dollars per year) and educational level (years of schooling
completed): These variables describe people. With a study of school districts, the
unit of analysis is the district. Typical variables include average family income of
district residents and average test scores of students in the district: These variables
describe school districts.
When investigating a cause-and-effect relationship, the variable that represents the effect is called the dependent variable, because it depends on the causes.
The variables that represent the causes are called independent variables. With a
study of smoking and lung cancer, the independent variable would be smoking
(e.g., number of cigarettes per day), and the dependent variable would mark the
presence or absence of lung cancer. Dependent variables also are called outcome
variables or response variables. Synonyms for independent variables are risk factors,
predictors, and explanatory variables.
17. For example, a confounding variable may be correlated with the independent variable and
act causally on the dependent variable. If the units being studied differ on the independent variable,
they are also likely to differ on the confounder. The confounder—not the independent variable—could
therefore be responsible for differences seen on the dependent variable.
18. For additional examples and further discussion, see Freedman et al., supra note 12, at 12–28,
150–52; David A. Freedman, From Association to Causation: Some Remarks on the History of Statistics, 14
Stat. Sci. 243 (1999). Some studies find that herpes is a “cofactor,” which increases risk among women
who are also exposed to HPV. Only certain strains of HPV are carcinogenic.
2. Randomized controlled experiments
In randomized controlled experiments, investigators assign subjects to treatment
or control groups at random. The groups are therefore likely to be comparable,
except for the treatment. This minimizes the role of confounding. Minor imbalances will remain, due to the play of random chance; the likely effect on study
results can be assessed by statistical techniques.19 The bottom line is that causal
inferences based on well-executed randomized experiments are generally more
secure than inferences based on well-executed observational studies.
The following example should help bring the discussion together. Today, we
know that taking aspirin helps prevent heart attacks. But initially, there was some
controversy. People who take aspirin rarely have heart attacks. This is anecdotal
evidence for a protective effect, but it proves almost nothing. After all, few people
have frequent heart attacks, whether or not they take aspirin regularly. A good
study compares heart attack rates for two groups: people who take aspirin (the
treatment group) and people who do not (the controls). An observational study
would be easy to do, but in such a study the aspirin-takers are likely to be different from the controls. Indeed, they are likely to be sicker—that is why they
are taking aspirin. The study would be biased against finding a protective effect.
Randomized experiments are harder to do, but they provide better evidence. It
is the experiments that demonstrate a protective effect.20
In summary, data from a treatment group without a control group generally
reveal very little and can be misleading. Comparisons are essential. If subjects are
assigned to treatment and control groups at random, a difference in the outcomes
between the two groups can usually be accepted, within the limits of statistical
error (infra Section IV), as a good measure of the treatment effect. However, if
the groups are created in any other way, differences that existed before treatment
may contribute to differences in the outcomes or mask differences that otherwise
would become manifest. Observational studies succeed to the extent that the treatment and control groups are comparable—apart from the treatment.
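The contrast between the two designs can be illustrated with a simulation of the aspirin story. Every probability here is hypothetical, chosen only to mimic the text: aspirin halves the risk of a heart attack, but in the observational design the sicker people, who are at higher risk to begin with, are also the ones who choose to take it.

```python
import random

random.seed(1)

# Invented numbers for illustration: aspirin truly halves the attack risk.
def attack_probability(sick, takes_aspirin):
    base = 0.20 if sick else 0.04
    return base * (0.5 if takes_aspirin else 1.0)

def rate(outcomes):
    return sum(outcomes) / len(outcomes)

n = 200_000

# Observational study: subjects self-select (sick people choose aspirin more).
pop = [random.random() < 0.3 for _ in range(n)]   # True = sick
treated, controls = [], []
for sick in pop:
    takes = random.random() < (0.8 if sick else 0.1)   # self-selection
    outcome = random.random() < attack_probability(sick, takes)
    (treated if takes else controls).append(outcome)
obs_diff = rate(treated) - rate(controls)

# Randomized experiment: a coin flip decides who gets aspirin.
pop = [random.random() < 0.3 for _ in range(n)]
treated, controls = [], []
for sick in pop:
    takes = random.random() < 0.5                      # randomization
    outcome = random.random() < attack_probability(sick, takes)
    (treated if takes else controls).append(outcome)
exp_diff = rate(treated) - rate(controls)

print(f"observational difference (treated - control): {obs_diff:+.3f}")
print(f"randomized difference (treated - control):    {exp_diff:+.3f}")
# The randomized design recovers the protective effect (a negative
# difference); the observational comparison is biased against it because
# the aspirin-takers were sicker to begin with.
```

Under these invented numbers the observational comparison can even make aspirin look harmful, while randomization, by balancing the sick and healthy across the two groups, exposes the protective effect.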
3. Observational studies
The bulk of the statistical studies seen in court are observational, not experimental. Take the question of whether capital punishment deters murder. To
conduct a randomized controlled experiment, people would need to be assigned
randomly to a treatment group or a control group. People in the treatment
group would know they were subject to the death penalty for murder; the
19. Randomization of subjects to treatment or control groups puts statistical tests of significance
on a secure footing. Freedman et al., supra note 12, at 503–22, 545–63; see infra Section IV.
20. In other instances, experiments have banished strongly held beliefs. E.g., Scott M. Lippman
et al., Effect of Selenium and Vitamin E on Risk of Prostate Cancer and Other Cancers: The Selenium
and Vitamin E Cancer Prevention Trial (SELECT), 301 JAMA 39 (2009).
controls would know that they were exempt. Conducting such an experiment
is not possible.
Many studies of the deterrent effect of the death penalty have been conducted,
all observational, and some have attracted judicial attention. Researchers have catalogued differences in the incidence of murder in states with and without the death
penalty and have analyzed changes in homicide rates and execution rates over the
years. When reporting on such observational studies, investigators may speak of
“control groups” (e.g., the states without capital punishment) or claim they are “controlling for” confounding variables by statistical methods.21 However, association is
not causation. The causal inferences that can be drawn from analysis of observational
data—no matter how complex the statistical technique—usually rest on a foundation
that is less secure than that provided by randomized controlled experiments.
That said, observational studies can be very useful. For example, there is strong
observational evidence that smoking causes lung cancer (supra Section II.A.1). Generally, observational studies provide good evidence in the following circumstances:
• The association is seen in studies with different designs, on different kinds of
subjects, and done by different research groups.22 That reduces the chance
that the association is due to a defect in one type of study, a peculiarity in
one group of subjects, or the idiosyncrasies of one research group.
• The association holds when effects of confounding variables are taken into
account by appropriate methods, for example, comparing smaller groups
that are relatively homogeneous with respect to the confounders.23
• There is a plausible explanation for the effect of the independent variable;
alternative explanations in terms of confounding should be less plausible
than the proposed causal link.24
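The second bullet's idea of comparing smaller, relatively homogeneous groups can be sketched with hypothetical numbers: an invented confounder drives both exposure and disease, so a crude comparison shows a large gap, but comparing exposed and unexposed subjects separately within each level of the confounder makes the gap nearly vanish.

```python
import random

random.seed(2)

# Invented illustration: the confounder ("hazard") drives both exposure and
# disease; exposure itself has no effect within either stratum.
def simulate(n):
    people = []
    for _ in range(n):
        hazard = random.random() < 0.5
        exposed = random.random() < (0.7 if hazard else 0.2)
        disease = random.random() < (0.30 if hazard else 0.05)
        people.append((hazard, exposed, disease))
    return people

def rate(group):
    return sum(d for *_, d in group) / len(group) if group else float("nan")

people = simulate(100_000)

# Crude (unstratified) comparison: confounded.
exposed = [p for p in people if p[1]]
unexposed = [p for p in people if not p[1]]
print(f"crude: exposed {rate(exposed):.3f} vs unexposed {rate(unexposed):.3f}")

# Stratified comparison: hold the confounder nearly constant.
for hazard in (True, False):
    stratum = [p for p in people if p[0] == hazard]
    e = [p for p in stratum if p[1]]
    u = [p for p in stratum if not p[1]]
    print(f"stratum hazard={hazard}: "
          f"exposed {rate(e):.3f} vs unexposed {rate(u):.3f}")
# Within each stratum the two rates are close; the crude gap was driven
# entirely by the confounder.
```

This is the logic of stratification described in note 23: once like is compared with like, an association that was due to the confounder disappears.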
21. A procedure often used to control for confounding in observational studies is regression
analysis. The underlying logic is described infra Section V.D and in Daniel L. Rubinfeld, Reference
Guide on Multiple Regression, Section II, in this manual. But see Richard A. Berk, Regression
Analysis: A Constructive Critique (2004); Rethinking Social Inquiry: Diverse Tools, Shared Standards
(Henry E. Brady & David Collier eds., 2004); David A. Freedman, Statistical Models: Theory and
Practice (2005); David A. Freedman, Oasis or Mirage, Chance, Spring 2008, at 59.
22. For example, case-control studies are designed one way and cohort studies another, with
many variations. See, e.g., Leon Gordis, Epidemiology (4th ed. 2008); supra note 16.
23. The idea is to control for the influence of a confounder by stratification—making comparisons separately within groups for which the confounding variable is nearly constant and therefore has
little influence over the variables of primary interest. For example, smokers are more likely to get lung
cancer than nonsmokers. Age, gender, social class, and region of residence are all confounders, but
controlling for such variables does not materially change the relationship between smoking and cancer
rates. Furthermore, many different studies—of different types and on different populations—confirm
the causal link. That is why most experts believe that smoking causes lung cancer and many other
diseases. For a review of the literature, see International Agency for Research on Cancer, supra note 16.
24. A. Bradford Hill, The Environment and Disease: Association or Causation?, 58 Proc. Royal
Soc’y Med. 295 (1965); Alfred S. Evans, Causation and Disease: A Chronological Journey 187 (1993).
Plausibility, however, is a function of time and circumstances.
Thus, evidence for the causal link does not depend on observed associations alone.
Observational studies can produce legitimate disagreement among experts,
and there is no mechanical procedure for resolving such differences of opinion.
In the end, deciding whether associations are causal typically is not a matter of
statistics alone, but also rests on scientific judgment. There are, however, some
basic questions to ask when appraising causal inferences based on empirical studies:
• Was there a control group? Unless comparisons can be made, the study
has little to say about causation.
• If there was a control group, how were subjects assigned to treatment
or control: through a process under the control of the investigator (a
controlled experiment) or through a process outside the control of the
investigator (an observational study)?
• If the study was a controlled experiment, was the assignment made using
a chance mechanism (randomization), or did it depend on the judgment
of the investigator?
If the data came from an observational study or a nonrandomized controlled
experiment,
• How did the subjects come to be in treatment or in control groups?
• Are the treatment and control groups comparable?
• If not, what adjustments were made to address confounding?
• Were the adjustments sensible and sufficient?25
4. Can the results be generalized?
Internal validity is about the specifics of a particular study: Threats to internal validity include confounding and chance differences between treatment and control
groups. External validity is about using a particular study or set of studies to reach
more general conclusions. A careful randomized controlled experiment on a large
but unrepresentative group of subjects will have high internal validity but low
external validity.
Any study must be conducted on certain subjects, at certain times and places,
and using certain treatments. To extrapolate from the conditions of a study to
more general conditions raises questions of external validity. For example, studies
suggest that definitions of insanity given to jurors influence decisions in cases
of incest. Would the definitions have a similar effect in cases of murder? Other
studies indicate that recidivism rates for ex-convicts are not affected by providing
them with temporary financial support after release. Would similar results be
obtained if conditions in the labor market were different?
25. Many courts have noted the importance of confounding variables. E.g., People Who Care v.
Rockford Bd. of Educ., 111 F.3d 528, 537–38 (7th Cir. 1997) (educational achievement); Hollander
v. Sandoz Pharms. Corp., 289 F.3d 1193, 1213 (10th Cir. 2002) (stroke); In re Proportionality Review
Project (II), 757 A.2d 168 (N.J. 2000) (capital sentences).
Confidence in the appropriateness of an extrapolation cannot come from
the experiment itself. It comes from knowledge about outside factors that would
or would not affect the outcome.26 Sometimes, several studies, each having different limitations, all point in the same direction. This is the case, for example,
with studies indicating that jurors who approve of the death penalty are more
likely to convict in a capital case.27 Convergent results support the validity of
generalizations.
B. Descriptive Surveys and Censuses
We now turn to a second topic—choosing units for study. A census tries to measure
some characteristic of every unit in a population. This is often impractical. Then
investigators use sample surveys, which measure characteristics for only part of a
population. The accuracy of the information collected in a census or survey depends
on how the units are selected for study and how the measurements are made.28
1. What method is used to select the units?
By definition, a census seeks to measure some characteristic of every unit in
a whole population. It may fall short of this goal, in which case one must ask
whether the missing data are likely to differ in some systematic way from the data
that are collected.29 The methodological framework of a scientific survey is different.
With probability methods, a sampling frame (i.e., an explicit list of units in
the population) must be created. Individual units then are selected by an objective,
well-defined chance procedure, and measurements are made on the sampled units.
26. Such judgments are easiest in the physical and life sciences, but even here, there are problems.
For example, it may be difficult to infer human responses to substances that affect animals. First,
there are often inconsistencies across test species: A chemical may be carcinogenic in mice but not
in rats. Extrapolation from rodents to humans is even more problematic. Second, to get measurable
effects in animal experiments, chemicals are administered at very high doses. Results are extrapolated—
using mathematical models—to the very low doses of concern in humans. However, there are many
dose–response models to use and few grounds for choosing among them. Generally, different models
produce radically different estimates of the “virtually safe dose” in humans. David A. Freedman &
Hans Zeisel, From Mouse to Man: The Quantitative Assessment of Cancer Risks, 3 Stat. Sci. 3 (1988).
For these reasons, many experts—and some courts in toxic tort cases—have concluded that evidence
from animal experiments is generally insufficient by itself to establish causation. See, e.g., Bruce N.
Ames et al., The Causes and Prevention of Cancer, 92 Proc. Nat’l Acad. Sci. USA 5258 (1995); National
Research Council, Science and Judgment in Risk Assessment 59 (1994) (“There are reasons based
on both biologic principles and empirical observations to support the hypothesis that many forms of
biologic responses, including toxic responses, can be extrapolated across mammalian species, including
Homo sapiens, but the scientific basis of such extrapolation is not established with sufficient rigor to
allow broad and definitive generalizations to be made.”).
27. Phoebe C. Ellsworth, Some Steps Between Attitudes and Verdicts, in Inside the Juror 42, 46
(Reid Hastie ed., 1993). Nonetheless, in Lockhart v. McCree, 476 U.S. 162 (1986), the Supreme Court
held that the exclusion of opponents of the death penalty in the guilt phase of a capital trial does not
violate the constitutional requirement of an impartial jury.
28. See Shari Seidman Diamond, Reference Guide on Survey Research, Sections III, IV, in
this manual.
To illustrate the idea of a sampling frame, suppose that a defendant in a
criminal case seeks a change of venue: According to him, popular opinion is so
adverse that it would be difficult to impanel an unbiased jury. To prove the state
of popular opinion, the defendant commissions a survey. The relevant population consists of all persons in the jurisdiction who might be called for jury duty.
The sampling frame is the list of all potential jurors, which is maintained by court
officials and is made available to the defendant. In this hypothetical case, the fit
between the sampling frame and the population would be excellent.
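The chance-selection step in this hypothetical could be sketched in a few lines of code. This is a minimal illustration using Python's standard library; the frame size, sample size, and juror labels are invented for the example:

```python
import random

# Hypothetical sampling frame: the court's list of potential jurors.
sampling_frame = [f"juror-{i:04d}" for i in range(1, 5001)]

# An objective, well-defined chance procedure: a simple random sample
# drawn without replacement, so each unit on the frame has the same
# known, nonzero probability of selection (here 400/5000).
rng = random.Random(2024)  # fixed seed only to make the draw reproducible
sample = rng.sample(sampling_frame, k=400)

print(len(sample))       # 400 units selected for interviewing
print(len(set(sample)))  # 400: no juror is drawn twice
```

Because the draw is governed entirely by the random number generator, no discretion enters the selection, which is the point of probability methods.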
In other situations, the sampling frame is more problematic. In an obscenity
case, for example, the defendant can offer a survey of community standards.30
The population comprises all adults in the legally relevant district, but obtaining a full list of such people may not be possible. Suppose the survey is done by
telephone, but cell phones are excluded from the sampling frame. (This is usual
practice.) Suppose too that cell phone users, as a group, hold different opinions
from landline users. In this second hypothetical, the poll is unlikely to reflect the
opinions of the cell phone users, no matter how many individuals are sampled and
no matter how carefully the interviewing is done.
Many surveys do not use probability methods. In commercial disputes involving
trademarks or advertising, the population of all potential purchasers of a product
is hard to identify. Pollsters may resort to an easily accessible subgroup of the
population, for example, shoppers in a mall.31 Such convenience samples may be
biased by the interviewer’s discretion in deciding whom to approach—a form of
selection bias—and the refusal of some of those approached to participate—nonresponse
bias (infra Section II.B.2). Selection bias is acute when constituents write
their representatives, listeners call into radio talk shows, interest groups collect
information from their members, or attorneys choose cases for trial.32
29. The U.S. Decennial Census generally does not count everyone that it should, and it counts
some people who should not be counted. There is evidence that net undercount is greater in some
demographic groups than others. Supplemental studies may enable statisticians to adjust for errors and
omissions, but the adjustments rest on uncertain assumptions. See Lawrence D. Brown et al., Statistical
Controversies in Census 2000, 39 Jurimetrics J. 347 (2007); David A. Freedman & Kenneth W. Wachter,
Methods for Census 2000 and Statistical Adjustments, in Social Science Methodology 232 (Steven Turner
& William Outhwaite eds., 2007) (reviewing technical issues and litigation surrounding census adjustment
in 1990 and 2000); 9 Stat. Sci. 458 (1994) (symposium presenting arguments for and against
adjusting the 1990 census).
30. On the admissibility of such polls, see State v. Midwest Pride IV, Inc., 721 N.E.2d 458 (Ohio
Ct. App. 1998) (holding one such poll to have been properly excluded and collecting cases from
other jurisdictions).
31. E.g., Smith v. Wal-Mart Stores, Inc., 537 F. Supp. 2d 1302, 1333 (N.D. Ga. 2008) (treating
a small mall-intercept survey as entitled to much less weight than a survey based on a probability
sample); R.J. Reynolds Tobacco Co. v. Loew’s Theatres, Inc., 511 F. Supp. 867, 876 (S.D.N.Y. 1980)
(questioning the propriety of basing a “nationally projectable statistical percentage” on a suburban
mall intercept study).
There are procedures that attempt to correct for selection bias. In quota sampling, for example, the interviewer is instructed to interview so many women, so
many older people, so many ethnic minorities, and the like. But quotas still leave
discretion to the interviewers in selecting members of each demographic group
and therefore do not solve the problem of selection bias.33
Probability methods are designed to avoid selection bias. Once the population
is reduced to a sampling frame, the units to be measured are selected by a lottery
that gives each unit in the sampling frame a known, nonzero probability of being
chosen. Random numbers leave no room for selection bias.34 Such procedures
are used to select individuals for jury duty. They also have been used to choose
“bellwether” cases for representative trials to resolve issues in a large group of
similar cases.35
32. E.g., Pittsburgh Press Club v. United States, 579 F.2d 751, 759 (3d Cir. 1978) (tax-exempt
club’s mail survey of its members to show little sponsorship of income-producing uses of facilities was
held to be inadmissible hearsay because it “was neither objective, scientific, nor impartial”), rev’d on
other grounds, 615 F.2d 600 (3d Cir. 1980). Cf. In re Chevron U.S.A., Inc., 109 F.3d 1016 (5th Cir.
1997). In that case, the district court decided to try 30 cases to resolve common issues or to ascertain
damages in 3000 claims arising from Chevron’s allegedly improper disposal of hazardous substances.
The court asked the opposing parties to select 15 cases each. Selecting 30 extreme cases, however,
is quite different from drawing a random sample of 30 cases. Thus, the court of appeals wrote that
although random sampling would have been acceptable, the trial court could not use the results in
the 30 extreme cases to resolve issues of fact or ascertain damages in the untried cases. Id. at 1020.
Those cases, it warned, were “not cases calculated to represent the group of 3000 claimants.” Id. See
infra note 35.
A well-known example of selection bias is the 1936 Literary Digest poll. After successfully predicting the winner of every U.S. presidential election since 1916, the Digest used the replies from 2.4
million respondents to predict that Alf Landon would win the popular vote, 57% to 43%. In fact,
Franklin Roosevelt won by a landslide vote of 62% to 38%. See Freedman et al., supra note 12, at
334–35. The Digest was so far off, in part, because it chose names from telephone books, rosters of
clubs and associations, city directories, lists of registered voters, and mail order listings. Id. at 335, A-20
n.6. In 1936, when only one household in four had a telephone, the people whose names appeared on
such lists tended to be more affluent. Lists that overrepresented the affluent had worked well in earlier
elections, when rich and poor voted along similar lines, but the bias in the sampling frame proved fatal
when the Great Depression made economics a salient consideration for voters.
33. See Freedman et al., supra note 12, at 337–39.
34. In simple random sampling, units are drawn at random without replacement. In particular,
each unit has the same probability of being chosen for the sample. Id. at 339–41. More complicated
methods, such as stratified sampling and cluster sampling, have advantages in certain applications. In
systematic sampling, every fifth, tenth, or hundredth (in mathematical jargon, every nth) unit in the
sampling frame is selected. If the units are not in any special order, then systematic sampling is often
comparable to simple random sampling.
35. E.g., In re Simon II Litig., 211 F.R.D. 86 (E.D.N.Y. 2002), vacated, 407 F.3d 125 (2d Cir.
2005), dismissed, 233 F.R.D. 123 (E.D.N.Y. 2006); In re Estate of Marcos Human Rights Litig., 910
F. Supp. 1460 (D. Haw. 1995), aff’d sub nom. Hilao v. Estate of Marcos, 103 F.3d 767 (9th Cir. 1996);
Cimino v. Raymark Indus., Inc., 751 F. Supp. 649 (E.D. Tex. 1990), rev’d, 151 F.3d 297 (5th Cir.
1998); cf. In re Chevron U.S.A., Inc., 109 F.3d 1016 (5th Cir. 1997) (discussed supra note 32). Although
trials in a suitable random sample of cases can produce reasonable estimates of average damages, the
propriety of precluding individual trials raises questions of due process and the right to trial by jury. See
Thomas E. Willging, Mass Torts Problems and Proposals: A Report to the Mass Torts Working Group
(Fed. Judicial Ctr. 1999); cf. Wal-Mart Stores, Inc. v. Dukes, 131 S. Ct. 2541, 2560–61 (2011). The
cases and the views of commentators are described more fully in David H. Kaye & David A. Freedman,
Statistical Proof, in 1 Modern Scientific Evidence: The Law and Science of Expert Testimony §
6:16 (David L. Faigman et al. eds., 2009–2010).
2. Of the units selected, which are measured?
Probability sampling ensures that within the limits of chance (infra Section IV), the
sample will be representative of the sampling frame. The question remains regarding
which units actually get measured. When documents are sampled for audit,
all the selected ones can be examined, at least in principle. Human beings are less
easily managed, and some will refuse to cooperate. Surveys should therefore report
nonresponse rates. A large nonresponse rate warns of bias, although supplemental
studies may establish that nonrespondents are similar to respondents with respect
to characteristics of interest.36
In short, a good survey defines an appropriate population, uses a probability
method for selecting the sample, has a high response rate, and gathers accurate
information on the sample units. When these goals are met, the sample tends to
be representative of the population. Data from the sample can be extrapolated
to describe the characteristics of the population. Of course, surveys may be useful
even if they fail to meet these criteria. But then, additional arguments are needed
to justify the inferences.
36. For discussions of nonresponse rates and admissibility of surveys conducted for litigation,
see Johnson v. Big Lots Stores, Inc., 561 F. Supp. 2d 567 (E.D. La. 2008) (fair labor standards); United
States v. Dentsply Int’l, Inc., 277 F. Supp. 2d 387, 437 (D. Del. 2003), rev’d on other grounds, 399 F.3d
181 (3d Cir. 2005) (antitrust).
The 1936 Literary Digest election poll (supra note 32) illustrates the dangers in nonresponse. Only
24% of the 10 million people who received questionnaires returned them. Most of the respondents
probably had strong views on the candidates and objected to President Roosevelt’s economic program.
This self-selection is likely to have biased the poll. Maurice C. Bryson, The Literary Digest Poll: Making
of a Statistical Myth, 30 Am. Statistician 184 (1976); Freedman et al., supra note 12, at 335–36. Even
when demographic characteristics of the sample match those of the population, caution is indicated. See
David Streitfeld, Shere Hite and the Trouble with Numbers, 1 Chance 26 (1988); Chamont Wang, Sense
and Nonsense of Statistical Inference: Controversy, Misuse, and Subtlety 174–76 (1993).
In United States v. Gometz, 730 F.2d 475, 478 (7th Cir. 1984) (en banc), the Seventh Circuit
recognized that “a low rate of response to juror questionnaires could lead to the underrepresentation of
a group that is entitled to be represented on the qualified jury wheel.” Nonetheless, the court held that
under the Jury Selection and Service Act of 1968, 28 U.S.C. §§ 1861–1878 (1988), the clerk did not
abuse his discretion by failing to take steps to increase a response rate of 30%. According to the court,
“Congress wanted to make it possible for all qualified persons to serve on juries, which is different
from forcing all qualified persons to be available for jury service.” Gometz, 730 F.2d at 480. Although
it might “be a good thing to follow up on persons who do not respond to a jury questionnaire,” the
court concluded that Congress “was not concerned with anything so esoteric as nonresponse bias.” Id.
at 479, 482; cf. In re United States, 426 F.3d 1 (1st Cir. 2005) (reaching the same result with respect to
underrepresentation of African Americans resulting in part from nonresponse bias).
C. Individual Measurements
1. Is the measurement process reliable?
Reliability and validity are two aspects of accuracy in measurement. In statistics,
reliability refers to reproducibility of results.37 A reliable measuring instrument
returns consistent measurements. A scale, for example, is perfectly reliable if
it reports the same weight for the same object time and again. It may not be
accurate—it may always report a weight that is too high or one that is too low—
but the perfectly reliable scale always reports the same weight for the same object.
Its errors, if any, are systematic: They always point in the same direction.
Reliability can be ascertained by measuring the same quantity several times;
the measurements must be made independently to avoid bias. Given independence, the correlation coefficient (infra Section V.B) between repeated measurements can be used as a measure of reliability. This is sometimes called a test-retest
correlation or a reliability coefficient.
A courtroom example is DNA identification. An early method of identification required laboratories to determine the lengths of fragments of DNA. By
making independent replicate measurements of the fragments, laboratories determined the likelihood that two measurements differed by specified amounts.38 Such
results were needed to decide whether a discrepancy between a crime sample and
a suspect sample was sufficient to exclude the suspect.39
Coding provides another example. In many studies, descriptive information
is obtained on the subjects. For statistical purposes, the information usually has to
be reduced to numbers. The process of reducing information to numbers is called
“coding,” and the reliability of the process should be evaluated. For example, in
a study of death sentencing in Georgia, legally trained evaluators examined short
summaries of cases and ranked them according to the defendant’s culpability.40
37. Courts often use “reliable” to mean “that which can be relied on” for some purpose, such
as establishing probable cause or crediting a hearsay statement when the declarant is not produced
for confrontation. Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579, 590 n.9 (1993), for example,
distinguishes “evidentiary reliability” from reliability in the technical sense of giving consistent results.
We use “reliability” to denote the latter.
38. See National Research Council, The Evaluation of Forensic DNA Evidence 139–41 (1996).
39. Id.; National Research Council, DNA Technology in Forensic Science 61–62 (1992).
Current methods are discussed in David H. Kaye & George Sensabaugh, Reference Guide on DNA
Identification Evidence, Section II, in this manual.
40. David C. Baldus et al., Equal Justice and the Death Penalty: A Legal and Empirical Analysis
49–50 (1990).
Two different aspects of reliability should be considered. First, the “within-observer variability” of judgments should be small—the same evaluator should
rate essentially identical cases in similar ways. Second, the “between-observer
variability” should be small—different evaluators should rate the same cases in
essentially the same way.
2. Is the measurement process valid?
Reliability is necessary but not sufficient to ensure accuracy. In addition to reliability, validity is needed. A valid measuring instrument measures what it is supposed to. Thus, a polygraph measures certain physiological responses to stimuli,
for example, in pulse rate or blood pressure. The measurements may be reliable.
Nonetheless, the polygraph is not valid as a lie detector unless the measurements
it makes are well correlated with lying.41
When there is an established way of measuring a variable, a new measurement
process can be validated by comparison with the established one. Breathalyzer
readings can be validated against alcohol levels found in blood samples. LSAT
scores used for law school admissions can be validated against grades earned in law
school. A common measure of validity is the correlation coefficient between the
predictor and the criterion (e.g., test scores and later performance).42
Employment discrimination cases illustrate some of the difficulties. Thus,
plaintiffs suing under Title VII of the Civil Rights Act may challenge an employment test that has a disparate impact on a protected group, and defendants may
try to justify the use of a test as valid, reliable, and a business necessity.43 For
validation, the most appropriate criterion variable is clear enough: job performance. However, plaintiffs may then turn around and challenge the validity
of performance ratings. For reliability, administering the test twice to the same
group of people may be impractical. Even if repeated testing is practical, it may be
statistically inadvisable, because subjects may learn something from the first round
of testing that affects their scores on the second round. Such “practice effects” are
likely to compromise the independence of the two measurements, and independence
is needed to estimate reliability. Statisticians therefore use internal evidence
from the test itself. For example, if scores on the first half of the test correlate well
with scores from the second half, then that is evidence of reliability.
41. See United States v. Henderson, 409 F.3d 1293, 1303 (11th Cir. 2005) (“while the physical
responses recorded by a polygraph machine may be tested, ‘there is no available data to prove that
those specific responses are attributable to lying.’”); National Research Council, The Polygraph and
Lie Detection (2003) (reviewing the scientific literature).
42. As the discussion of the correlation coefficient indicates, infra Section V.B, the closer the
coefficient is to 1, the greater the validity. For a review of data on test reliability and validity, see Paul
R. Sackett et al., High-Stakes Testing in Higher Education and Employment: Appraising the Evidence for
Validity and Fairness, 63 Am. Psychologist 215 (2008).
43. See, e.g., Washington v. Davis, 426 U.S. 229, 252 (1976); Albemarle Paper Co. v. Moody,
422 U.S. 405, 430–32 (1975); Griggs v. Duke Power Co., 401 U.S. 424 (1971); Lanning v. S.E. Penn.
Transp. Auth., 308 F.3d 286 (3d Cir. 2002).
A further problem is that test-takers are likely to be a select group. The ones
who get the jobs are even more highly selected. Generally, selection attenuates
(weakens) the correlations. There are methods for using internal measures of reliability to estimate test-retest correlations; there are other methods that correct for
attenuation. However, such methods depend on assumptions about the nature of
the test and the procedures used to select the test-takers and are therefore open
to challenge.44
3. Are the measurements recorded correctly?
Judging the adequacy of data collection involves an examination of the process
by which measurements are taken. Are responses to interviews coded correctly?
Do mistakes distort the results? How much data are missing? What was done to
compensate for gaps in the data? These days, data are stored in computer files.
Cross-checking the files against the original sources (e.g., paper records), at least
on a sample basis, can be informative.
Data quality is a pervasive issue in litigation and in applied statistics more generally. A programmer moves a file from one computer to another, and half the data
disappear. The definitions of crucial variables are lost in the sands of time. Values
get corrupted: Social security numbers come to have eight digits instead of nine,
and vehicle identification numbers fail the most elementary consistency checks.
Everybody in the company, from the CEO to the rawest mailroom trainee, turns
out to have been hired on the same day. Many of the residential customers have
last names that indicate commercial activity (“Happy Valley Farriers”). These
problems seem humdrum by comparison with those of reliability and validity,
but—unless caught in time—they can be fatal to statistical arguments.45
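Checks of the kind just described are easy to automate. The sketch below runs three sanity checks over a hypothetical personnel file; the field names, records, and patterns are invented for illustration, not a general-purpose audit:

```python
import re
from collections import Counter

# Hypothetical records exhibiting the problems described in the text.
records = [
    {"name": "A. Smith",             "ssn": "123-45-6789", "hired": "2001-03-12"},
    {"name": "B. Jones",             "ssn": "987-65-432",  "hired": "2001-03-12"},
    {"name": "Happy Valley Farriers", "ssn": "111-22-3333", "hired": "2001-03-12"},
]

# Check 1: Social Security numbers should have nine digits.
bad_ssn = [r["name"] for r in records
           if not re.fullmatch(r"\d{3}-\d{2}-\d{4}", r["ssn"])]

# Check 2: a single hire date shared by every employee is a red flag.
hire_counts = Counter(r["hired"] for r in records)
everyone_same_day = max(hire_counts.values()) == len(records)

# Check 3: residential names that suggest commercial activity.
commercial = [r["name"] for r in records
              if re.search(r"\b(Farriers|Inc|LLC|Corp)\b", r["name"])]

print(bad_ssn)            # records with malformed SSNs
print(everyone_same_day)  # True if every record shares one hire date
print(commercial)         # names that look like businesses
```

Running such checks before analysis, and cross-checking flagged records against the paper sources, is how these humdrum errors get caught in time.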
44. See Thad Dunning & David A. Freedman, Modeling Selection Effects, in Social Science Methodology 225 (Steven Turner & William Outhwaite eds., 2007); Howard Wainer & David Thissen,
True Score Theory: The Traditional Method, in Test Scoring 23 (David Thissen & Howard Wainer eds.,
2001).
45. See, e.g., Malletier v. Dooney & Bourke, Inc., 525 F. Supp. 2d 558, 630 (S.D.N.Y. 2007)
(coding errors contributed “to the cumulative effect of the methodological errors” that warranted
exclusion of a consumer confusion survey); EEOC v. Sears, Roebuck & Co., 628 F. Supp. 1264, 1304,
1305 (N.D. Ill. 1986) (“[E]rrors in EEOC’s mechanical coding of information from applications in its
hired and nonhired samples also make EEOC’s statistical analysis based on this data less reliable.” The
EEOC “consistently coded prior experience in such a way that less experienced women are considered
to have the same experience as more experienced men” and “has made so many general coding errors
that its data base does not fairly reflect the characteristics of applicants for commission sales positions
at Sears.”), aff’d, 839 F.2d 302 (7th Cir. 1988). But see Dalley v. Mich. Blue Cross-Blue Shield, Inc.,
612 F. Supp. 1444, 1456 (E.D. Mich. 1985) (“although plaintiffs show that there were some mistakes
in coding, plaintiffs still fail to demonstrate that these errors were so generalized and so pervasive that
the entire study is invalid.”).
D. What Is Random?
In the law, a selection process sometimes is called “random,” provided that it does
not exclude identifiable segments of the population. Statisticians use the term
in a far more technical sense. For example, if we were to choose one person at
random from a population, in the strict statistical sense, we would have to ensure
that everybody in the population is chosen with exactly the same probability.
With a randomized controlled experiment, subjects are assigned to treatment or
control at random in the strict sense—by tossing coins, throwing dice, looking
at tables of random numbers, or more commonly these days, by using a random
number generator on a computer. The same rigorous definition applies to random sampling. It is randomness in the technical sense that provides assurance of
unbiased estimates from a randomized controlled experiment or a probability
sample. Randomness in the technical sense also justifies calculations of standard
errors, confidence intervals, and p-values (infra Sections IV–V). Looser definitions
of randomness are inadequate for statistical purposes.
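Random assignment in this strict sense is straightforward to implement with a computer's random number generator. A minimal sketch; the subject labels and group sizes are hypothetical:

```python
import random

# Twenty hypothetical subjects to be assigned to treatment or control.
subjects = [f"subject-{i:02d}" for i in range(1, 21)]

# A chance mechanism in the strict sense: the random number generator,
# not the investigator's judgment, decides who is treated.
rng = random.Random(7)  # fixed seed only to make the assignment reproducible
shuffled = subjects[:]
rng.shuffle(shuffled)
treatment, control = shuffled[:10], shuffled[10:]

print(len(treatment), len(control))   # 10 10
print(set(treatment) & set(control))  # set(): the groups do not overlap
```

Before the shuffle, every subject has the same probability of landing in the treatment group, which is what licenses the unbiased estimates and chance calculations discussed below.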
III. How Have the Data Been Presented?
After data have been collected, they should be presented in a way that makes
them intelligible. Data can be summarized with a few numbers or with graphical displays. However, the wrong summary can mislead.46 Section III.A discusses
rates or percentages and provides some cautionary examples of misleading summaries, indicating the kinds of questions that might be considered when summaries are presented in court. Percentages are often used to demonstrate statistical
association, which is the topic of Section III.B. Section III.C considers graphical
summaries of data, while Sections III.D and III.E discuss some of the basic descriptive statistics that are likely to be encountered in litigation, including the mean,
median, and standard deviation.
A. Are Rates or Percentages Properly Interpreted?
1. Have appropriate benchmarks been provided?
The selective presentation of numerical information is like quoting someone out
of context. Is the fact that “over the past three years,” a particular index fund of
large-cap stocks “gained a paltry 1.9% a year” indicative of poor management?
Considering that “the average large-cap value fund has returned just 1.3% a year,”
a growth rate of 1.9% is hardly an indictment.47 In this example and many others,
it is helpful to find a benchmark that puts the figures into perspective.
46. See generally Freedman et al., supra note 12; Huff, supra note 12; Moore & Notz, supra note
12; Zeisel, supra note 12.
2. Have the data collection procedures changed?
Changes in the process of collecting data can create problems of interpretation. Statistics on crime provide many examples. The number of petty larcenies
reported in Chicago more than doubled one year—not because of an abrupt crime
wave, but because a new police commissioner introduced an improved reporting
system.48 For a time, police officials in Washington, D.C., “demonstrated” the
success of a law-and-order campaign by valuing stolen goods at $49, just below
the $50 threshold then used for inclusion in the Federal Bureau of Investigation’s
Uniform Crime Reports.49 Allegations of manipulation in the reporting of crime
from one time period to another are legion.50
Changes in data collection procedures are by no means limited to crime
statistics. Indeed, almost all series of numbers that cover many years are affected
by changes in definitions and collection methods. When a study includes such
time-series data, it is useful to inquire about changes and to look for any sudden
jumps, which may signal such changes.
3. Are the categories appropriate?
Misleading summaries also can be produced by the choice of categories to be used
for comparison. In Philip Morris, Inc. v. Loew’s Theatres, Inc.,51 and R.J. Reynolds
Tobacco Co. v. Loew’s Theatres, Inc.,52 Philip Morris and R.J. Reynolds sought
an injunction to stop the maker of Triumph low-tar cigarettes from running
advertisements claiming that participants in a national taste test preferred Triumph to other brands. Plaintiffs alleged that claims that Triumph was a “national
taste test winner” or Triumph “beats” other brands were false and misleading.
An exhibit introduced by the defendant contained the data shown in Table 1.53
Only 14% + 22% = 36% of the sample preferred Triumph to Merit, whereas
47. Paul J. Lim, In a Downturn, Buy and Hold or Quit and Fold?, N.Y. Times, July 27, 2008.
48. James P. Levine et al., Criminal Justice in America: Law in Action 99 (1986) (referring to
a change from 1959 to 1960).
49. D. Seidman & M. Couzens, Getting the Crime Rate Down: Political Pressure and Crime Reporting, 8 Law & Soc’y Rev. 457 (1974).
50. Michael D. Maltz, Missing UCR Data and Divergence of the NCVS and UCR Trends, in
Understanding Crime Statistics: Revisiting the Divergence of the NCVS and UCR 269, 280 (James
P. Lynch & Lynn A. Addington eds., 2007) (citing newspaper reports in Boca Raton, Atlanta, New
York, Philadelphia, Broward County (Florida), and Saint Louis); Michael Vasquez, Miami Police: FBI:
Crime Stats Accurate, Miami Herald, May 1, 2008.
51. 511 F. Supp. 855 (S.D.N.Y. 1980).
52. 511 F. Supp. 867 (S.D.N.Y. 1980).
53. Philip Morris, 511 F. Supp. at 866.
29% + 11% = 40% preferred Merit to Triumph. By selectively combining categories, however, the defendant attempted to create a different impression. Because
24% found the brands to be about the same, and 36% preferred Triumph, the
defendant claimed that a clear majority (36% + 24% = 60%) found Triumph “as
good [as] or better than Merit.”54 The court resisted this chicanery, finding that
defendant’s test results did not support the advertising claims.55
Table 1. Data Used by a Defendant to Refute Plaintiffs’ False Advertising Claim

             Triumph       Triumph       Triumph      Triumph       Triumph
             Much Better   Somewhat      About the    Somewhat      Much Worse
             Than Merit    Better        Same         Worse         Than Merit
                           Than Merit    as Merit     Than Merit
Number          45            73            77           93            36
Percentage      14            22            24           29            11
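The arithmetic behind the competing claims can be checked directly. A minimal sketch (the category names are my own labels for the Table 1 percentages, not the court's):

```python
# Hypothetical re-tabulation of the Triumph vs. Merit taste-test percentages.
percentages = {
    "much_better": 14,
    "somewhat_better": 22,
    "about_the_same": 24,
    "somewhat_worse": 29,
    "much_worse": 11,
}

prefer_triumph = percentages["much_better"] + percentages["somewhat_better"]
prefer_merit = percentages["somewhat_worse"] + percentages["much_worse"]
# The defendant's framing folds the "about the same" group into Triumph's side:
as_good_or_better = prefer_triumph + percentages["about_the_same"]

print(f"Preferred Triumph: {prefer_triumph}%")          # 36%
print(f"Preferred Merit: {prefer_merit}%")              # 40%
print(f"'As good as or better': {as_good_or_better}%")  # 60%
```

The same counts thus support "a minority preferred Triumph" and "a majority found Triumph as good or better," depending on how the middle category is assigned.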
There was a similar distortion in claims for the accuracy of a home pregnancy
test. The manufacturer advertised the test as 99.5% accurate under laboratory conditions. The data underlying this claim are summarized in Table 2.
Table 2. Home Pregnancy Test Results

                          Actually Pregnant    Actually not Pregnant
Test says pregnant               197                     0
Test says not pregnant             1                     2
Total                            198                     2
Table 2 does indicate that only one error occurred in 200 assessments, or
99.5% overall accuracy, but the table also shows that the test can make two types
of errors: It can tell a pregnant woman that she is not pregnant (a false negative),
and it can tell a woman who is not pregnant that she is (a false positive). The
reported 99.5% accuracy rate conceals a crucial fact—the company had virtually
no data with which to measure the rate of false positives.56
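The two error rates can be computed separately from the counts in Table 2. A short sketch (the labels tp, fn, fp, tn are conventional shorthand, not from the source):

```python
# Counts from Table 2: rows are the test result, columns the true status.
tp, fn = 197, 1   # actually pregnant: test positive / test negative
fp, tn = 0, 2     # actually not pregnant: test positive / test negative

total = tp + fn + fp + tn
overall_accuracy = (tp + tn) / total      # 199/200 = 0.995
false_negative_rate = fn / (tp + fn)      # 1/198
# The false-positive rate rests on a sample of only two women:
false_positive_rate = fp / (fp + tn)      # 0/2
print(overall_accuracy, false_negative_rate, false_positive_rate)
```

The 99.5% figure is the overall accuracy; the false-positive rate is a separate quantity, and here it is estimated from far too little data to be meaningful.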
54. Id.
55. Id. at 856–57.
56. Only two women in the sample were not pregnant; the test gave correct results for both of
them. Although a false-positive rate of 0 is ideal, an estimate based on a sample of only two women
is not. These data are reported in Arnold Barnett, How Numbers Can Trick You, Tech. Rev., Oct.
1994, at 38, 44–45.
Reference Guide on Statistics
4. How big is the base of a percentage?
Rates and percentages often provide effective summaries of data, but these statistics can be misinterpreted. A percentage makes a comparison between two
numbers: One number is the base, and the other number is compared to that base.
Putting them on the same base (100) makes it easy to compare them.
When the base is small, however, a small change in absolute terms can generate a large percentage gain or loss. This could lead to newspaper headlines such
as “Increase in Thefts Alarming,” even when the total number of thefts is small.57
Conversely, a large base will make for small percentage increases. In these situations, actual numbers may be more revealing than percentages.
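A quick illustration of how the base drives the interpretation, using the two hypothetical scenarios suggested in note 57:

```python
# The same percentage increase can describe very different absolute changes.
def pct_increase(before, after):
    return 100 * (after - before) / before

print(pct_increase(17, 23))     # about 35.3%: an increase of just 6 thefts
print(pct_increase(850, 1150))  # about 35.3%: an increase of 300 thefts
```

Both headlines would read "35% increase"; only the underlying numbers reveal whether the change is 6 incidents or 300.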
5. What comparisons are made?
Finally, there is the issue of which numbers to compare. Researchers sometimes
choose among alternative comparisons. It may be worthwhile to ask why they
chose the one they did. Would another comparison give a different view? A
government agency, for example, may want to compare the amount of service
now being given with that of earlier years—but what earlier year should be the
baseline? If the first year of operation is used, a large percentage increase should
be expected because of startup problems. If last year is used as the base, was it
also part of the trend, or was it an unusually poor year? If the base year is not
representative of other years, the percentage may not portray the trend fairly. No
single question can be formulated to detect such distortions, but it may help to
ask for the numbers from which the percentages were obtained; asking about the
base can also be helpful.58
B. Is an Appropriate Measure of Association Used?
Many cases involve statistical association. Does a test for employee promotion
have an exclusionary effect that depends on race or gender? Does the incidence
of murder vary with the rate of executions for convicted murderers? Do consumer
purchases of a product depend on the presence or absence of a product warning?
This section discusses tables and percentage-based statistics that are frequently
presented to answer such questions.59
Percentages often are used to describe the association between two variables.
Suppose that a university alleged to discriminate against women in admitting
57. Lyda Longa, Increase in Thefts Alarming, Daytona News-J. June 8, 2008 (reporting a 35%
increase in armed robberies in Daytona Beach, Florida, in a 5-month period, but not indicating
whether the number had gone up by 6 (from 17 to 23), by 300 (from 850 to 1150), or by some other
amount).
58. For assistance in coping with percentages, see Zeisel, supra note 12, at 1–24.
59. Correlation and regression are discussed infra Section V.
students consists of only two colleges—engineering and business. The university
admits 350 out of 800 male applicants; by comparison, it admits only 200 out of
600 female applicants. Such data commonly are displayed as in Table 3.60
As Table 3 indicates, 350/800 = 44% of the males are admitted, compared
with only 200/600 = 33% of the females. One way to express the disparity is
to subtract the two percentages: 44% – 33% = 11 percentage points. Although
such subtraction is commonly seen in jury discrimination cases,61 the difference is
inevitably small when the two percentages are both close to zero. If the selection
rate for males is 5% and that for females is 1%, the difference is only 4 percentage
points. Yet, females have only one-fifth the chance of males of being admitted,
and that may be of real concern.
Table 3. Admissions by Gender

Decision      Male    Female    Total
Admit          350      200       550
Deny           450      400       850
Total          800      600      1400
For Table 3, the selection ratio (used by the Equal Employment Opportunity Commission in its “80% rule”) is 33/44 = 75%, meaning that, on average,
women have 75% the chance of admission that men have.62 However, the selection ratio has its own problems. In the last example, if the selection rates are 5%
and 1%, then the exclusion rates are 95% and 99%. The ratio is 99/95 = 104%,
meaning that females have, on average, 104% the risk of males of being rejected.
The underlying facts are the same, of course, but this formulation sounds much
less disturbing.
60. A table of this sort is called a “cross-tab” or a “contingency table.” Table 3 is “two-by-two”
because it has two rows and two columns, not counting rows or columns containing totals.
61. See, e.g., State v. Gibbs, 758 A.2d 327, 337 (Conn. 2000); Primeaux v. Dooley, 747 N.W.2d
137, 141 (S.D. 2008); D.H. Kaye, Statistical Evidence of Discrimination in Jury Selection, in Statistical
Methods in Discrimination Litigation 13 (David H. Kaye & Mikel Aickin eds., 1986).
62. A procedure that selects candidates from the least successful group at a rate less than 80% of
the rate for the most successful group “will generally be regarded by the Federal enforcement agencies
as evidence of adverse impact.” EEOC Uniform Guidelines on Employee Selection Procedures, 29
C.F.R. § 1607.4(D) (2008). The rule is designed to help spot instances of substantially discriminatory
practices, and the commission usually asks employers to justify any procedures that produce selection
ratios of 80% or less.
The analogous statistic used in epidemiology is called the relative risk. See Green et al., supra
note 13, Section III.A. Relative risks are usually quoted as decimals; for example, a selection ratio of
75% corresponds to a relative risk of 0.75.
The odds ratio is more symmetric. If 5% of male applicants are admitted,
the odds on a man being admitted are 5/95 = 1/19; the odds on a woman being
admitted are 1/99. The odds ratio is (1/99)/(1/19) = 19/99. The odds ratio for
rejection instead of acceptance is the same, except that the order is reversed.63
Although the odds ratio has desirable mathematical properties, its meaning may
be less clear than that of the selection ratio or the simple difference.
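The three measures discussed above can be computed side by side from Table 3. A sketch (the exact fractions round to the 44%, 33%, and 75% quoted in the text):

```python
# Measures of association for the hypothetical admissions data in Table 3.
male_admit, male_total = 350, 800
female_admit, female_total = 200, 600

p_male = male_admit / male_total        # 0.4375, about 44%
p_female = female_admit / female_total  # 0.3333..., about 33%

difference = p_male - p_female          # about 10 percentage points
selection_ratio = p_female / p_male     # about 0.76 (the "80% rule" statistic)

odds_male = p_male / (1 - p_male)       # 350/450
odds_female = p_female / (1 - p_female) # 200/400
odds_ratio = odds_female / odds_male    # 9/14, about 0.64

print(difference, selection_ratio, odds_ratio)
```

Each statistic summarizes the same two-by-two table, yet they can leave quite different impressions of the size of the disparity.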
Data showing disparate impact are generally obtained by aggregating—putting
together—statistics from a variety of sources. Unless the source material is fairly
homogeneous, aggregation can distort patterns in the data. We illustrate the problem with the hypothetical admission data in Table 3. Applicants can be classified
not only by gender and admission but also by the college to which they applied,
as in Table 4.
Table 4. Admissions by Gender and College

               Engineering          Business
Decision      Male    Female      Male    Female
Admit          300      100         50      100
Deny           300      100        150      300
The entries in Table 4 add up to the entries in Table 3. Expressed in a more
technical manner, Table 3 is obtained by aggregating the data in Table 4. Yet
there is no association between gender and admission in either college; men and
women are admitted at identical rates. Combining two colleges with no association produces a university in which gender is associated strongly with admission.
The explanation for this paradox is that the business college, to which most of the
women applied, admits relatively few applicants. It is easier to be accepted at the
engineering college, the college to which most of the men applied. This example
illustrates a common issue: Association can result from combining heterogeneous
statistical material.64
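The aggregation step can be reproduced directly. A sketch using the hypothetical counts from Tables 3 and 4 (the number of applicants per college is inferred as admit plus deny):

```python
# Each entry is (admitted, applied) for the hypothetical data in Table 4.
engineering = {"male": (300, 600), "female": (100, 200)}
business = {"male": (50, 200), "female": (100, 400)}

def rate(admitted, applied):
    return admitted / applied

# Within each college, men and women are admitted at identical rates:
for sex in ("male", "female"):
    print(sex, rate(*engineering[sex]), rate(*business[sex]))  # 0.5 and 0.25

# Aggregating the colleges produces a disparity that exists in neither one:
for sex in ("male", "female"):
    admitted = engineering[sex][0] + business[sex][0]
    applied = engineering[sex][1] + business[sex][1]
    print(sex, round(admitted / applied, 2))  # male 0.44, female 0.33
```

The aggregate disparity appears because most women applied to the college that admits a smaller fraction of its applicants, which is the essence of Simpson's Paradox.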
63. For women, the odds on rejection are 99 to 1; for men, 19 to 1. The ratio of these odds is
99/19. Likewise, the odds ratio for an admitted applicant being a man as opposed to a denied applicant
being a man is also 99/19.
64. Tables 3 and 4 are hypothetical, but closely patterned on a real example. See P.J. Bickel
et al., Sex Bias in Graduate Admissions: Data from Berkeley, 187 Science 398 (1975). The tables are an
instance of Simpson’s Paradox.
C. Does a Graph Portray Data Fairly?
Graphs are useful for revealing key characteristics of a batch of numbers, trends
over time, and the relationships among variables.
1. How are trends displayed?
Graphs that plot values over time are useful for seeing trends. However, the scales
on the axes matter. In Figure 1, the rate of all crimes of domestic violence in
Florida (per 100,000 people) appears to decline rapidly over the 10 years from
1998 through 2007; in Figure 2, the same rate appears to drop slowly.65 The
moral is simple: Pay attention to the markings on the axes to determine whether
the scale is appropriate.
Figure 1 and Figure 2. [Graphs not reproduced; both plot the same Florida domestic violence rate per 100,000 people for 1998 through 2007, with different vertical scales.]
2. How are distributions displayed?
A graph commonly used to display the distribution of data is the histogram. One
axis denotes the numbers, and the other indicates how often those fall within
65. Florida Statistical Analysis Center, Florida Department of Law Enforcement, Florida’s Crime
Rate at a Glance, available at http://www.fdle.state.fl.us/FSAC/Crime_Trends/domestic_violence/
index.asp. The data are from the Florida Uniform Crime Report statistics on crimes ranging from
simple stalking and forcible fondling to murder and arson. The Web page with the numbers graphed
in Figures 1 and 2 is no longer posted, but similar data for all violent crime is available at http://www.
fdle.state.fl.us/FSAC/Crime_Trends/Violent-Crime.aspx.
specified intervals (called “bins” or “class intervals”). For example, we flipped a
quarter 10 times in a row and counted the number of heads in this “batch” of 10
tosses. With 50 batches, we obtained the following counts:66
7 7 5 6 8    4 4 2 5 3
4 2 3 6 5    5 4 2 4 4
4 3 4 7 4    5 7 2 3 5
6 8 4 7 4    7 4 5 4 3
4 6 4 9 10   5 5 6 6 4
The histogram is shown in Figure 3.67 A histogram shows how the data are
distributed over the range of possible values. The spread can be made to appear
larger or smaller, however, by changing the scale of the horizontal axis. Likewise,
the shape can be altered somewhat by changing the size of the bins.68 It may be
worth inquiring how the analyst chose the bin widths.
Figure 3. Histogram showing how frequently various numbers of heads
appeared in 50 batches of 10 tosses of a quarter.
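The tallying that produces a histogram like Figure 3 can be sketched in a few lines (the 50 counts are those listed above):

```python
# Tally the 50 batch counts into bins of width 1 and print a text histogram.
from collections import Counter

heads = [7, 7, 5, 6, 8, 4, 4, 2, 5, 3,
         4, 2, 3, 6, 5, 5, 4, 2, 4, 4,
         4, 3, 4, 7, 4, 5, 7, 2, 3, 5,
         6, 8, 4, 7, 4, 7, 4, 5, 4, 3,
         4, 6, 4, 9, 10, 5, 5, 6, 6, 4]

bins = Counter(heads)
for k in range(11):
    # A Counter returns 0 for values that never occurred (e.g., 0 and 1 heads).
    print(f"{k:2d} heads: {'#' * bins[k]}")
```

As note 67 states, four batches had exactly 2 heads and five batches had exactly 3, so the bars over 2 and 3 have heights 4 and 5.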
66. The coin landed heads 7 times in the first 10 tosses; by coincidence, there were also 7 heads
in the next 10 tosses; there were 5 heads in the third batch of 10 tosses; and so forth.
67. In Figure 3, the bin width is 1. There were no 0s or 1s in the data, so the bars over 0 and 1
disappear. There is a bin from 1.5 to 2.5; the four 2s in the data fall into this bin, so the bar over the
interval from 1.5 to 2.5 has height 4. There is another bin from 2.5 to 3.5, which catches five 3s;
the height of the corresponding bar is 5. And so forth.
All the bins in Figure 3 have the same width, so this histogram is just like a bar graph. However,
data are often published in tables with unequal intervals. The resulting histograms will have unequal
bin widths; bar heights should be calculated so that the areas (height × width) are proportional to the
frequencies. In general, a histogram differs from a bar graph in that it represents frequencies by area,
not height. See Freedman et al., supra note 12, at 31–41.
68. As the width of the bins decreases, the graph becomes more detailed, but the appearance
becomes more ragged until finally the graph is effectively a plot of each datum. The optimal bin width
depends on the subject matter and the goal of the analysis.
D. Is an Appropriate Measure Used for the Center of a
Distribution?
Perhaps the most familiar descriptive statistic is the mean (or “arithmetic mean”).
The mean can be found by adding all the numbers and dividing the total by how
many numbers were added. By comparison, the median cuts the numbers into
halves: half the numbers are larger than the median and half are smaller.69 Yet
a third statistic is the mode, which is the most common number in the dataset.
These statistics are different, although they are not always clearly distinguished.70
The mean takes account of all the data—it involves the total of all the numbers;
however, particularly with small datasets, a few unusually large or small observations may have too much influence on the mean. The median is resistant to such
outliers.
Thus, studies of damage awards in tort cases find that the mean is larger than
the median.71 This is because the mean takes into account (indeed, is heavily
influenced by) the magnitudes of the relatively few very large awards, whereas
the median merely counts their number. If one is seeking a single, representative
number for the awards, the median may be more useful than the mean.72 Still, if
the issue is whether insurers were experiencing more costs from jury verdicts, the
mean is the more appropriate statistic: The total of the awards is directly related
to the mean, not to the median.73
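A toy batch of awards (the figures are invented, not taken from the studies cited) shows how a single large award separates the mean from the median:

```python
# One very large award pulls the mean far above the median.
import statistics

awards = [50_000, 75_000, 100_000, 150_000, 5_000_000]

print(statistics.median(awards))  # 100000: the "typical" award
print(statistics.mean(awards))    # 1075000: dominated by the one large award
```

The median is unmoved if the largest award grows tenfold, whereas the mean (and hence the total) tracks it directly, which is why the appropriate statistic depends on the question being asked.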
69. Technically, at least half the numbers are at the median or larger; at least half are at the
median or smaller. When the distribution is symmetric, the mean equals the median. The values
diverge, however, when the distribution is asymmetric, or skewed.
70. In ordinary language, the arithmetic mean, the median, and the mode seem to be referred to
interchangeably as “the average.” In statistical parlance, however, the average is the arithmetic mean.
The mode is rarely used by statisticians, because it is unstable: Small changes to the data often result
in large changes to the mode.
71. In a study using a probability sample of cases, the median compensatory award in wrongful
death cases was $961,000, whereas the mean award was around $3.75 million for the 162 cases in
which the plaintiff prevailed. Thomas H. Cohen & Steven K. Smith, U.S. Dep’t of Justice, Bureau
of Justice Statistics Bulletin NCJ 202803, Civil Trial Cases and Verdicts in Large Counties 2001, 10
(2004). In TXO Production Corp. v. Alliance Resources Corp., 509 U.S. 443 (1993), briefs portraying the
punitive damage system as out of control pointed to mean punitive awards. These were some 10 times
larger than the median awards described in briefs defending the system of punitive damages. Michael
Rustad & Thomas Koenig, The Supreme Court and Junk Social Science: Selective Distortion in Amicus Briefs,
72 N.C. L. Rev. 91, 145–47 (1993).
72. In passing on proposed settlements in class-action lawsuits, courts have been advised to look
to the magnitude of the settlements negotiated by the parties. But the mean settlement will be large
if a higher number of meritorious, high-cost cases are resolved early in the life cycle of the litigation.
This possibility led the court in In re Educational Testing Service Praxis Principles of Learning and Teaching,
Grades 7-12 Litig., 447 F. Supp. 2d 612, 625 (E.D. La. 2006), to regard the smaller median settlement
as “more representative of the value of a typical claim than the mean value” and to use this median
in extrapolating to the entire class of pending claims.
73. To get the total award, just multiply the mean by the number of awards; by contrast, the
total cannot be computed from the median. (The more pertinent figure for the insurance industry is
Research also has shown that there is considerable stability in the ratio of
punitive to compensatory damage awards, and the Supreme Court has placed
great weight on this ratio in deciding whether punitive damages are excessive
in a particular case. In Exxon Shipping Co. v. Baker,74 Exxon contended that an
award of $2.5 billion in punitive damages for a catastrophic oil spill in Alaska was
unreasonable under federal maritime law. The Court looked to a “comprehensive study of punitive damages awarded by juries in state civil trials [that] found
a median ratio of punitive to compensatory awards of just 0.62:1, but a mean
ratio of 2.90:1.”75 The higher mean could reflect a relatively small but disturbing
proportion of unjustifiably large punitive awards.76 Looking to the median ratio as
“the line near which cases like this one largely should be grouped,” the majority
concluded that “a 1:1 ratio, which is above the median award, is a fair upper limit
in such maritime cases [of reckless conduct].”77
E. Is an Appropriate Measure of Variability Used?
The location of the center of a batch of numbers reveals nothing about the variations exhibited by these numbers.78 Statistical measures of variability include the
range, the interquartile range, and the standard deviation. The range is the difference between the largest number in the batch and the smallest. The range seems
natural, and it indicates the maximum spread in the numbers, but the range is
unstable because it depends entirely on the most extreme values.79 The interquartile range is the difference between the 25th and 75th percentiles.80 The interquartile range contains 50% of the numbers and is resistant to changes in extreme
values. The standard deviation is a sort of mean deviation from the mean.81
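The three measures of spread can be computed for the first batch of numbers in note 78. A sketch (assuming the population convention for the standard deviation and the default quartile convention of the standard library):

```python
# Range, interquartile range, and standard deviation for 1, 2, 5, 8, 9.
import statistics

data = [1, 2, 5, 8, 9]

data_range = max(data) - min(data)            # 8
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1                                 # interquartile range: 7.0
sd = statistics.pstdev(data)                  # sqrt(10), about 3.16
print(data_range, iqr, sd)
```

For this batch the mean is 5, the squared deviations are 16, 9, 0, 9, and 16, their mean (the variance) is 10, and the standard deviation is the square root, about 3.16.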
not the total of jury awards, but actual claims experience including settlements; of course, even the
risk of large punitive damage awards may have considerable impact.)
74. 128 S. Ct. 2605 (2008).
75. Id. at 2625.
76. According to the Court, “the outlier cases subject defendants to punitive damages that
dwarf the corresponding compensatories,” and the “stark unpredictability” of these rare awards is the
“real problem.” Id. This perceived unpredictability has been the subject of various statistical studies
and much debate. See Anthony J. Sebok, Punitive Damages: From Myth to Theory, 92 Iowa L. Rev.
957 (2007).
77. 128 S. Ct. at 2633.
78. The numbers 1, 2, 5, 8, 9 have 5 as their mean and median. So do the numbers 5, 5, 5,
5, 5. In the first batch, the numbers vary considerably about their mean; in the second, the numbers
do not vary at all.
79. Moreover, the range typically depends on the number of units in the sample.
80. By definition, 25% of the data fall below the 25th percentile, 90% fall below the 90th percentile, and so on. The median is the 50th percentile.
81. When the distribution follows the normal curve, about 68% of the data will be within 1
standard deviation of the mean, and about 95% will be within 2 standard deviations of the mean. For
other distributions, the proportions will be different.
There are no hard and fast rules about which statistic is the best. In general,
the bigger the measures of spread are, the more the numbers are dispersed.82
Particularly in small datasets, the standard deviation can be influenced heavily by
a few outlying values. To assess the extent of this influence, the mean and the
standard deviation can be recomputed with the outliers discarded. Beyond this,
any of the statistics can (and often should) be supplemented with a figure that
displays much of the data.
IV. What Inferences Can Be Drawn from
the Data?
The inferences that may be drawn from a study depend on the design of the study
and the quality of the data (supra Section II). The data might not address the issue
of interest, might be systematically in error, or might be difficult to interpret
because of confounding. Statisticians would group these concerns together under
the rubric of “bias.” In this context, bias means systematic error, with no connotation of prejudice. We turn now to another concern, namely, the impact of
random chance on study results (“random error”).83
If a pattern in the data is the result of chance, it is likely to wash out when
more data are collected. By applying the laws of probability, a statistician can assess
the likelihood that random error will create spurious patterns of certain kinds.
Such assessments are often viewed as essential when making inferences from data.
Technically, the standard deviation is the square root of the variance; the variance is the mean
square deviation from the mean. For example, if the mean is 100, then 120 deviates from the mean
by 20, and the square of 20 is 20² = 400. If the variance (i.e., the mean of the squared deviations) is
900, then the standard deviation is the square root of 900, that is, √900 = 30. Taking the square root
gets back to the original scale of the measurements. For example, if the measurements are of length in
inches, the variance is in square inches; taking the square root changes back to inches.
82. In Exxon Shipping Co. v. Baker, 554 U.S. 471 (2008), along with the mean and median ratios
of punitive to compensatory awards of 0.62 and 2.90, the Court referred to a standard deviation of
13.81. Id. at 498. These numbers led the Court to remark that “[e]ven to those of us unsophisticated
in statistics, the thrust of these figures is clear: the spread is great, and the outlier cases subject defendants to punitive damages that dwarf the corresponding compensatories.” Id. at 499-500. The size of
the standard deviation compared to the mean supports the observation that ratios in the cases of jury
award studies are dispersed. A graph of each pair of punitive and compensatory damages offers more
insight into how scattered these figures are. See Theodore Eisenberg et al., The Predictability of Punitive
Damages, 26 J. Legal Stud. 623 (1997); infra Section V.A (explaining scatter diagrams).
83. Random error is also called sampling error, chance error, or statistical error. Econometricians
use the parallel concept of random disturbance terms. See Rubinfeld, supra note 21. Randomness and
cognate terms have precise technical meanings; it is randomness in the technical sense that justifies the
probability calculations behind standard errors, confidence intervals, and p-values (supra Section II.D,
infra Sections IV.A–B). For a discussion of samples and populations, see supra Section II.B.
Thus, statistical inference typically involves tasks such as the following, which will
be discussed in the rest of this guide.
• Estimation. A statistician draws a sample from a population (supra Section II.B) and estimates a parameter—that is, a numerical characteristic of
the population. (The average value of a large group of claims is a parameter
of perennial interest.) Random error will throw the estimate off the mark.
The question is, by how much? The precision of an estimate is usually
reported in terms of the standard error and a confidence interval.
• Significance testing. A “null hypothesis” is formulated—for example, that
a parameter takes a particular value. Because of random error, an estimated value for the parameter is likely to differ from the value specified
by the null—even if the null is right. (“Null hypothesis” is often shortened to “null.”) How likely is it to get a difference as large as, or larger
than, the one observed in the data? This chance is known as a p-value.
Small p-values argue against the null hypothesis. Statistical significance is
determined by reference to the p-value; significance testing (also called
hypothesis testing) is the technique for computing p-values and determining statistical significance.
• Developing a statistical model. Statistical inferences often depend on the validity of statistical models for the data. If the data are collected on the basis of
a probability sample or a randomized experiment, there will be statistical
models that suit the occasion, and inferences based on these models will be
secure. Otherwise, calculations are generally based on analogy: This group of
people is like a random sample; that observational study is like a randomized
experiment. The fit between the statistical model and the data collection
process may then require examination—how good is the analogy? If the
model breaks down, that will bias the analysis.
• Computing posterior probabilities. Given the sample data, what is the probability of the null hypothesis? The question might be of direct interest to
the courts, especially when translated into English; for example, the null
hypothesis might be the innocence of the defendant in a criminal case.
Posterior probabilities can be computed using a formula called Bayes’ rule.
However, the computation often depends on prior beliefs about the statistical model and its parameters; such prior beliefs almost necessarily require
subjective judgment. According to the frequentist theory of statistics,84
84. The frequentist theory is also called objectivist, by contrast with the subjectivist version of
Bayesian theory. In brief, frequentist methods treat probabilities as objective properties of the system
being studied. Subjectivist Bayesians view probabilities as measuring subjective degrees of belief. See
infra Section IV.D and Appendix, Section A, for discussion of the two positions. The Bayesian position
is named after the Reverend Thomas Bayes (England, c. 1701–1761). His essay on the subject was
published after his death: An Essay Toward Solving a Problem in the Doctrine of Chances, 53 Phil. Trans.
Royal Soc’y London 370 (1763–1764). For discussion of the foundations and varieties of Bayesian and
prior probabilities rarely have meaning and neither do posterior
probabilities.85
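The posterior-probability computation can be sketched with Bayes' rule. All of the numbers below, including the prior, are invented for illustration; nothing in the text supplies them, and the choice of prior is exactly the subjective step the frequentist school objects to:

```python
# Toy Bayes' rule calculation for a hypothesis H given a reported match.
prior = 0.01                 # assumed prior probability of H
p_match_given_h = 0.99       # assumed chance of a match if H is true
p_match_given_not_h = 0.02   # assumed chance of a match if H is false

# Total probability of observing a match, then Bayes' rule:
p_match = prior * p_match_given_h + (1 - prior) * p_match_given_not_h
posterior = prior * p_match_given_h / p_match
print(round(posterior, 3))   # 0.333
```

Even with a highly accurate test, a low prior keeps the posterior modest; change the prior and the posterior changes with it, which is why such calculations require care in court.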
Key ideas of estimation and testing will be illustrated by courtroom examples, with some complications omitted for ease of presentation and some details
postponed (see infra Section V.D on statistical models, and the Appendix on the
calculations).
The first example, on estimation, concerns the Nixon papers. Under the Presidential Recordings and Materials Preservation Act of 1974, Congress impounded
Nixon’s presidential papers after he resigned. Nixon sued, seeking compensation
on the theory that the materials belonged to him personally. Courts ruled in his
favor: Nixon was entitled to the fair market value of the papers, with the amount
to be proved at trial.86
The Nixon papers were stored in 20,000 boxes at the National Archives in
Alexandria, Virginia. It was plainly impossible to value this entire population of
material. Appraisers for the plaintiff therefore took a random sample of 500 boxes.
(From this point on, details are simplified; thus, the example becomes somewhat
hypothetical.) The appraisers determined the fair market value of each sample
box. The average of the 500 sample values turned out to be $2000. The standard
deviation (supra Section III.E) of the 500 sample values was $2200. Many boxes
had low appraised values whereas some boxes were considered to be extremely
valuable; this spread explains the large standard deviation.
A. Estimation
1. What estimator should be used?
With the Nixon papers, it is natural to use the average value of the 500 sample
boxes to estimate the average value of all 20,000 boxes comprising the population.
other forms of statistical inference, see, e.g., Richard M. Royall, Statistical Inference: A Likelihood
Paradigm (1997); James Berger, The Case for Objective Bayesian Analysis, 1 Bayesian Analysis 385 (2006),
available at http://ba.stat.cmu.edu/journal/2006/vol01/issue03/berger.pdf; Stephen E. Fienberg, Does
It Make Sense to be an “Objective Bayesian”? (Comment on Articles by Berger and by Goldstein), 1 Bayesian
Analysis 429 (2006); David Freedman, Some Issues in the Foundation of Statistics, 1 Found. Sci. 19
(1995), reprinted in Topics in the Foundation of Statistics 19 (Bas C. van Fraasen ed., 1997); see also
D.H. Kaye, What Is Bayesianism? in Probability and Inference in the Law of Evidence: The Uses and
Limits of Bayesianism (Peter Tillers & Eric Green eds., 1988), reprinted in 28 Jurimetrics J. 161 (1988)
(distinguishing between “Bayesian probability,” “Bayesian statistical inference,” “Bayesian inference
writ large,” and “Bayesian decision theory”).
85. Prior probabilities of repeatable events (but not hypotheses) can be defined within the frequentist framework. See infra note 122. When this happens, prior and posterior probabilities for these
events are meaningful according to both schools of thought.
86. Nixon v. United States, 978 F.2d 1269 (D.C. Cir. 1992); Griffin v. United States, 935 F.
Supp. 1 (D.D.C. 1995).
With the average value for each box having been estimated as $2000, the plaintiff
demanded compensation in the amount of
20,000 × $2,000 = $40,000,000.
In more complex problems, statisticians may have to choose among several
estimators. Generally, estimators that tend to make smaller errors are preferred;
however, “error” might be quantified in more than one way. Moreover, the
advantage of one estimator over another may depend on features of the population
that are largely unknown, at least before the data are collected and analyzed. For
complicated problems, professional skill and judgment may therefore be required
when choosing a sample design and an estimator. In such cases, the choices and
the rationale for them should be documented.
2. What is the standard error? The confidence interval?
An estimate based on a sample is likely to be off the mark, at least by a small
amount, because of random error. The standard error gives the likely magnitude
of this random error, with smaller standard errors indicating better estimates.87
In our example of the Nixon papers, the standard error for the sample average can be computed from (1) the size of the sample—500 boxes—and (2) the
standard deviation of the sample values; see infra Appendix. Bigger samples give
estimates that are more precise. Accordingly, the standard error should go down
as the sample size grows, although the rate of improvement slows as the sample
gets bigger. (“Sample size” and “the size of the sample” just mean the number
of items in the sample; the “sample average” is the average value of the items in
the sample.) The standard deviation of the sample comes into play by measuring
heterogeneity. The less heterogeneity in the values, the smaller the standard error.
For example, if all the values were about the same, a tiny sample would give an
accurate estimate. Conversely, if the values are quite different from one another,
a larger sample would be needed.
With a random sample of 500 boxes and a standard deviation of $2200, the
standard error for the sample average is about $100. The plaintiff’s total demand
was figured as the number of boxes (20,000) times the sample average ($2000).
Therefore, the standard error for the total demand can be computed as 20,000
times the standard error for the sample average88:
87. We distinguish between (1) the standard deviation of the sample, which measures the spread
in the sample data and (2) the standard error of the sample average, which measures the likely size of
the random error in the sample average. The standard error is often called the standard deviation, and
courts generally use the latter term. See, e.g., Castaneda v. Partida, 430 U.S. 482 (1977).
88. We are assuming a simple random sample. Generally, the formula for the standard error must
take into account the method used to draw the sample and the nature of the estimator. In fact, the
Nixon appraisers used more elaborate statistical procedures. Moreover, they valued the material as of
20,000 × $100 = $2,000,000.
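The arithmetic behind these figures can be sketched in a few lines of Python. This is purely an illustration using the numbers given in the text; as note 88 observes, the actual appraisers used more elaborate procedures.

```python
import math

# Figures from the Nixon-papers example in the text (illustrative only).
n = 500                  # sample size: boxes appraised
sd = 2200.0              # standard deviation of the sampled values, in dollars
N = 20000                # population size: all boxes
sample_average = 2000.0  # average appraised value per sampled box

# Standard error of the sample average for a simple random sample.
# (The finite-population correction is ignored; with 500 of 20,000
# boxes drawn, it changes the result by only about 1%.)
se_average = sd / math.sqrt(n)       # about $98, i.e., roughly $100

# The total demand scales both the estimate and its standard error by N.
total_estimate = N * sample_average  # $40,000,000
se_total = N * se_average            # about $2,000,000
```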
How is the standard error to be interpreted? Just by the luck of the draw, a
few too many high-value boxes may have come into the sample, in which case
the estimate of $40,000,000 is too high. Or, a few too many low-value boxes may
have been drawn, in which case the estimate is too low. This is random error.
The net effect of random error is unknown, because data are available only on
the sample, not on the full population. However, the net effect is likely to be
something close to the standard error of $2,000,000. Random error throws the
estimate off, one way or the other, by something close to the standard error. The
role of the standard error is to gauge the likely size of the random error.
The plaintiff’s argument may be open to a variety of objections, particularly
regarding appraisal methods. However, the sampling plan is sound, as is the
extrapolation from the sample to the population. And there is no need for a larger
sample: The standard error is quite small relative to the total claim.
Random errors larger in magnitude than the standard error are commonplace. Random errors larger in magnitude than two or three times the standard
error are unusual. Confidence intervals make these ideas more precise. Usually,
a confidence interval for the population average is centered at the sample average; the desired confidence level is obtained by adding and subtracting a suitable
multiple of the standard error. Statisticians who say that the population average
falls within 1 standard error of the sample average will be correct about 68% of
the time. Those who say “within 2 standard errors” will be correct about 95%
of the time, and those who say “within 3 standard errors” will be correct about
99.7% of the time, and so forth. (We are assuming a large sample; the confidence
levels correspond to areas under the normal curve and are approximations; the
“population average” just means the average value of all the items in the population.89) In summary,
• To get a 68% confidence interval, start at the sample average, then add and
subtract 1 standard error.
• To get a 95% confidence interval, start at the sample average, then add and
subtract twice the standard error.
1995, extrapolated backward to the time of taking (1974), and then added interest. The text ignores
these complications.
89. See infra Appendix. The area under the normal curve between –1 and +1 is close to 68.3%.
Likewise, the area between –2 and +2 is close to 95.4%. Many academic statisticians would use
±1.96 SE for a 95% confidence interval. However, the normal curve only gives an approximation to
the relevant chances, and the error in that approximation will often be larger than a few tenths of a
percent. For simplicity, we use ±1 SE for the 68% confidence level, and ±2 SE for 95% confidence.
The normal curve gives good approximations when the sample size is reasonably large; for small
samples, other techniques should be used. See infra notes 106–07.
• To get a 99.7% confidence interval, start at the sample average, then add
and subtract three times the standard error.
With the Nixon papers, the 68% confidence interval for plaintiff’s total
demand runs
from $40,000,000 − $2,000,000 = $38,000,000.
to $40,000,000 + $2,000,000 = $42,000,000.
The 95% confidence interval runs
from $40,000,000 − (2 × $2,000,000) = $36,000,000.
to $40,000,000 + (2 × $2,000,000) = $44,000,000.
The 99.7% confidence interval runs
from $40,000,000 − (3 × $2,000,000) = $34,000,000.
to $40,000,000 + (3 × $2,000,000) = $46,000,000.
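The three intervals can be generated mechanically from the estimate and its standard error. A minimal Python illustration, using the figures from the text:

```python
estimate, se = 40_000_000, 2_000_000

# Each interval is the estimate plus or minus a multiple of the SE.
for level, k in [("68%", 1), ("95%", 2), ("99.7%", 3)]:
    low, high = estimate - k * se, estimate + k * se
    print(f"{level} confidence interval: ${low:,} to ${high:,}")
```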
To write this more compactly, we abbreviate standard error as SE. Thus, 1
SE is one standard error, 2 SE is twice the standard error, and so forth. With a
large sample and an estimate like the sample average, a 68% confidence interval
is the range
estimate – 1 SE to estimate + 1 SE.
A 95% confidence interval is the range
estimate – 2 SE to estimate + 2 SE.
The 99.7% confidence interval is the range
estimate – 3 SE to estimate + 3 SE.
For a given sample size, increased confidence can be attained only by widening the interval. The 95% confidence level is the most popular, but some authors
use 99%, and 90% is seen on occasion. (The corresponding multipliers on the SE
are about 2, 2.6, and 1.6, respectively; see infra Appendix.) The phrase “margin of
error” generally means twice the standard error. In medical journals, “confidence
interval” is often abbreviated as “CI.”
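The multipliers come from the standard normal curve, and Python's standard library can reproduce them. This sketch computes the two-sided, large-sample multiplier for each confidence level mentioned above:

```python
from statistics import NormalDist

# k solves P(-k < Z < k) = level for a standard normal Z.
for level in (0.90, 0.95, 0.99):
    k = NormalDist().inv_cdf((1 + level) / 2)
    print(f"{level:.0%} confidence: about ±{k:.2f} SE")
# prints multipliers of about 1.64, 1.96, and 2.58
```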
The main point is that an estimate based on a sample will differ from the exact
population value, because of random error. The standard error gives the likely
size of the random error. If the standard error is small, random error probably has
little effect. If the standard error is large, the estimate may be seriously wrong.
Confidence intervals are a technical refinement, and bias is a separate issue to
consider (infra Section IV.A.4).
3. How big should the sample be?
There is no easy answer to this sensible question. Much depends on the level of
error that is tolerable and the nature of the material being sampled. Generally,
increasing the size of the sample will reduce the level of random error (“sampling
error”). Bias (“nonsampling error”) cannot be reduced that way. Indeed, beyond
some point, large samples are harder to manage and more vulnerable to nonsampling error. To reduce bias, the researcher must improve the design of the
study or use a statistical model more tightly linked to the data collection process.
If the material being sampled is heterogeneous, random error will be large;
a larger sample will be needed to offset the heterogeneity (supra Section IV.A.1).
A pilot sample may be useful to estimate heterogeneity and determine the final
sample size. Probability samples require some effort in the design phase, and it
will rarely be sensible to draw a sample with fewer than, say, two or three dozen
items. Moreover, with such small samples, methods based on the normal curve
(supra Section IV.A.2) will not apply.
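As a rough planning device, the pilot estimate of the standard deviation determines how large a sample is needed to reach a target standard error. The sketch below assumes simple random sampling and ignores the finite-population correction:

```python
import math

def sample_size(pilot_sd, target_se):
    """Rough sample size so that the SE of the sample average is at
    most target_se, given a pilot estimate of the standard deviation.
    (Simple random sampling; finite-population correction ignored.)"""
    return math.ceil((pilot_sd / target_se) ** 2)

# E.g., a pilot SD of $2,200 and a target SE of $100 per box:
print(sample_size(2200, 100))  # 484 boxes, close to the 500 drawn
```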
Population size (i.e., the number of items in the population) usually has little
bearing on the precision of estimates for the population average. This is surprising. On the other hand, population size has a direct bearing on estimated totals.
Both points are illustrated by the Nixon papers (see supra Section IV.A.2 and infra
Appendix). To be sure, drawing a probability sample from a large population may
involve a lot of work. Samples presented in the courtroom have ranged from 5
(tiny) to 1.7 million (huge).90
4. What are the technical difficulties?
To begin with, “confidence” is a term of art. The confidence level indicates the
percentage of the time that intervals from repeated samples would cover the true
value. The confidence level does not express the chance that repeated estimates
would fall into the confidence interval.91 With the Nixon papers, the 95% confidence interval should not be interpreted as saying that 95% of all random samples
will produce estimates in the range from $36 million to $44 million. Moreover,
the confidence level does not give the probability that the unknown parameter lies
within the confidence interval.92 For example, the 95% confidence level should
not be translated to a 95% probability that the total value of the papers is in the
range from $36 million to $44 million. According to the frequentist theory of
statistics, probability statements cannot be made about population characteristics:
Probability statements apply to the behavior of samples. That is why the different
term “confidence” is used.
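The frequentist reading can be checked by simulation. The sketch below (illustrative Python with hypothetical population values) draws many samples from a known population and counts how often the interval of plus or minus 2 SE covers the true average:

```python
import random
from statistics import mean, stdev

random.seed(12345)   # fixed seed so the illustration is reproducible

true_mean = 50.0     # known population average (hypothetical)
trials, n = 1000, 100
covered = 0

for _ in range(trials):
    sample = [random.gauss(true_mean, 10.0) for _ in range(n)]
    se = stdev(sample) / n ** 0.5
    if mean(sample) - 2 * se <= true_mean <= mean(sample) + 2 * se:
        covered += 1

# The coverage rate is close to 95%: it is the intervals that vary
# from sample to sample, not the population average.
coverage = covered / trials
```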
The next point to make is that for a given confidence level, a narrower
interval indicates a more precise estimate, whereas a broader interval indicates less
90. See Lebrilla v. Farmers Group, Inc., No. 00-CC-017185 (Cal. Super. Ct., Orange County,
Dec. 5, 2006) (preliminary approval of settlement), a class action lawsuit on behalf of plaintiffs who
were insured by Farmers and had automobile accidents. Plaintiffs alleged that replacement parts recommended by Farmers did not meet specifications: Small samples were used to evaluate these allegations. At the other extreme, it was proposed to adjust Census 2000 for undercount and overcount by
reviewing a sample of 1.7 million persons. See Brown et al., supra note 29, at 353.
91. Opinions reflecting this misinterpretation include In re Silicone Gel Breast Implants Prods.
Liab. Litig., 318 F. Supp. 2d 879, 897 (C.D. Cal. 2004) (“a margin of error between 0.5 and 8.0 at
the 95% confidence level . . . means that 95 times out of 100 a study of that type would yield a relative risk value somewhere between 0.5 and 8.0.”); United States ex rel. Free v. Peters, 806 F. Supp.
705, 713 n.6 (N.D. Ill. 1992) (“A 99% confidence interval, for instance, is an indication that if we
repeated our measurement 100 times under identical conditions, 99 times out of 100 the point estimate
derived from the repeated experimentation will fall within the initial interval estimate. . . .”), rev’d
in part, 12 F.3d 700 (7th Cir. 1993). The more technically correct statement in the Silicone Gel case,
for example, would be that “the confidence interval of 0.5 to 8.0 means that the relative risk in the
population could fall within this wide range and that in roughly 95 times out of 100, random samples
from the same population, the confidence intervals (however wide they might be) would include the
population value (whatever it is).”
92. See, e.g., Freedman et al., supra note 12, at 383–86; infra Section IV.B.1. Consequently, it is
misleading to suggest that “[a] 95% confidence interval means that there is a 95% probability that the
‘true’ relative risk falls within the interval” or that “the probability that the true value was . . . within
two standard deviations of the mean . . . would be 95 percent.” DeLuca v. Merrell Dow Pharms.,
Inc., 791 F. Supp. 1042, 1046 (D.N.J. 1992), aff’d, 6 F.3d 778 (3d Cir. 1993); SmithKline Beecham
Corp. v. Apotex Corp., 247 F. Supp. 2d 1011, 1037 (N.D. Ill. 2003), aff’d on other grounds, 403 F.3d
1331 (Fed. Cir. 2005).
precision.93 A high confidence level with a broad interval means very little, but a
high confidence level for a small interval is impressive, indicating that the random
error in the sample estimate is low. For example, take a 95% confidence interval
for a damage claim. An interval that runs from $34 million to $44 million is one
thing, but –$10 million to $90 million is something else entirely. Statements about
confidence without mention of an interval are practically meaningless.94
Standard errors and confidence intervals are often derived from statistical
models for the process that generated the data. The model usually has parameters—
numerical constants describing the population from which samples were drawn.
When the values of the parameters are not known, the statistician must work
backward, using the sample data to make estimates. That was the case here.95
Generally, the chances needed for statistical inference are computed from a model
and estimated parameter values.
If the data come from a probability sample or a randomized controlled experiment (supra Sections II.A–B), the statistical model may be connected tightly to
the actual data collection process. In other situations, using the model may be
tantamount to assuming that a sample of convenience is like a random sample,
or that an observational study is like a randomized experiment. With the Nixon
papers, the appraisers drew a random sample, and that justified the statistical
93. In Cimino v. Raymark Industries, Inc., 751 F. Supp. 649 (E.D. Tex. 1990), rev’d, 151 F.3d 297
(5th Cir. 1998), the district court drew certain random samples from more than 6000 pending asbestos
cases, tried these cases, and used the results to estimate the total award to be given to all plaintiffs
in the pending cases. The court then held a hearing to determine whether the samples were large
enough to provide accurate estimates. The court’s expert, an educational psychologist, testified that
the estimates were accurate because the samples matched the population on such characteristics as race
and the percentage of plaintiffs still alive. Id. at 664. However, the matches occurred only in the sense
that population characteristics fell within 99% confidence intervals computed from the samples. The
court thought that matches within the 99% confidence intervals proved more than matches within 95%
intervals. Id. This is backward. To be correct in a few instances with a 99% confidence interval is not
very impressive—by definition, such intervals are broad enough to ensure coverage 99% of the time.
94. In Hilao v. Estate of Marcos, 103 F.3d 767 (9th Cir. 1996), for example, “an expert on statistics . . . testified that . . . a random sample of 137 claims would achieve ‘a 95% statistical probability
that the same percentage determined to be valid among the examined claims would be applicable to
the totality of [9541 facially valid] claims filed.’” Id. at 782. There is no 95% “statistical probability”
that a percentage computed from a sample will be “applicable” to a population. One can compute
a confidence interval from a random sample and be 95% confident that the interval covers some
parameter. The computation can be done for a sample of virtually any size, with larger samples giving smaller intervals. What is missing from the opinion is a discussion of the widths of the relevant
intervals. For the same reason, it is meaningless to testify, as an expert did in Ayyad v. Sprint Spectrum,
L.P., No. RG03-121510 (Cal. Super. Ct., Alameda County) (transcript, May 28, 2008, at 730), that
a simple regression equation is trustworthy because the coefficient of the explanatory variable has “an
extremely high indication of reliability to more than 99% confidence level.”
95. With the Nixon papers, one parameter is the average value of all 20,000 boxes, and another
parameter is the standard deviation of the 20,000 values. These parameters can be used to approximate
the distribution of the sample average. See infra Appendix. Regression models and their parameters are
discussed infra Section V and in Rubinfeld, supra note 21.
calculations—if not the appraised values themselves. In many contexts, the choice
of an appropriate statistical model is less than obvious. When a model does not
fit the data collection process, estimates and standard errors will not be probative.
Standard errors and confidence intervals generally ignore systematic errors
such as selection bias or nonresponse bias (supra Sections II.B.1–2). For example,
after reviewing studies to see whether a particular drug caused birth defects, a
court observed that mothers of children with birth defects may be more likely to
remember taking a drug during pregnancy than mothers with normal children.
This selective recall would bias comparisons between samples from the two groups
of women. The standard error for the estimated difference in drug usage between
the groups would ignore this bias, as would the confidence interval.96
B. Significance Levels and Hypothesis Tests
1. What Is the p-value?
In 1969, Dr. Benjamin Spock came to trial in the U.S. District Court for Massachusetts. The charge was conspiracy to violate the Military Service Act. The jury
was drawn from a panel of 350 persons selected by the clerk of the court. The
panel included only 102 women—substantially less than 50%—although a majority of the eligible jurors in the community were female. The shortfall in women
was especially poignant in this case: “Of all defendants, Dr. Spock, who had given
wise and welcome advice on child-rearing to millions of mothers, would have
liked women on his jury.”97
Can the shortfall in women be explained by the mere play of random chance?
To approach the problem, a statistician would formulate and test a null hypothesis.
Here, the null hypothesis says that the panel is like 350 persons drawn at random
from a large population that is 50% female. The expected number of women drawn
would then be 50% of 350, which is 175. The observed number of women is 102.
The shortfall is 175 − 102 = 73. How likely is it to find a disparity this large or
larger, between observed and expected values? The probability is called p, or the
p-value.
96. Brock v. Merrell Dow Pharms., Inc., 874 F.2d 307, 311–12 (5th Cir.), modified, 884 F.2d
166 (5th Cir. 1989). In Brock, the court stated that the confidence interval took account of bias (in
the form of selective recall) as well as random error. 874 F.2d at 311–12. This is wrong. Even if the
sampling error were nonexistent—which would be the case if one could interview every woman who
had a child during the period that the drug was available—selective recall would produce a difference
in the percentages of reported drug exposure between mothers of children with birth defects and those
with normal children. In this hypothetical situation, the standard error would vanish. Therefore, the
standard error could disclose nothing about the impact of selective recall.
97. Hans Zeisel, Dr. Spock and the Case of the Vanishing Women Jurors, 37 U. Chi. L. Rev. 1
(1969). Zeisel’s reasoning was different from that presented in this text. The conviction was reversed
on appeal without reaching the issue of jury selection. United States v. Spock, 416 F.2d 165 (1st Cir.
1969).
The p-value is the probability of getting data as extreme as, or more extreme
than, the actual data—given that the null hypothesis is true. In the example, p
turns out to be essentially zero. The discrepancy between the observed and the
expected is far too large to explain by random chance. Indeed, even if the panel
had included 155 women, the p-value would only be around 0.02, or 2%.98 (If
the population is more than 50% female, p will be even smaller.) In short, the jury
panel was nothing like a random sample from the community.
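The figures in the text rest on an exact binomial calculation of the lower-tail probability under the null hypothesis. A sketch in Python:

```python
from math import comb

def p_at_most(k, n=350):
    """Exact chance of k or fewer women among n panelists drawn at
    random from a population that is 50% female (the null hypothesis)."""
    return sum(comb(n, j) for j in range(k + 1)) / 2 ** n

p_at_most(102)   # essentially zero (on the order of 10**-15)
p_at_most(155)   # roughly 0.02, matching the figure in the text
```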
Large p-values indicate that a disparity can easily be explained by the play
of chance: The data fall within the range likely to be produced by chance variation. On the other hand, if p is very small, something other than chance must
be involved: The data are far away from the values expected under the null
hypothesis. Significance testing often seems to involve multiple negatives. This is
because a statistical test is an argument by contradiction.
With the Dr. Spock example, the null hypothesis asserts that the jury panel is
like a random sample from a population that is 50% female. The data contradict
this null hypothesis because the disparity between what is observed and what is
expected (according to the null) is too large to be explained as the product of random chance. In a typical jury discrimination case, small p-values help a defendant
appealing a conviction by showing that the jury panel is not like a random sample
from the relevant population; large p-values hurt. In the usual employment context, small p-values help plaintiffs who complain of discrimination—for example,
by showing that a disparity in promotion rates is too large to be explained by
chance; conversely, large p-values would be consistent with the defense argument
that the disparity is just due to chance.
Because p is calculated by assuming that the null hypothesis is correct, p does
not give the chance that the null is true. The p-value merely gives the chance
of getting evidence against the null hypothesis as strong as or stronger than the
evidence at hand. Chance affects the data, not the hypothesis. According to the
frequency theory of statistics, there is no meaningful way to assign a numerical
probability to the null hypothesis. The correct interpretation of the p-value can
therefore be summarized in two lines:
p is the probability of extreme data given the null hypothesis.
p is not the probability of the null hypothesis given extreme data.99
98. With 102 women out of 350, the p-value is about 2/10^15, where 10^15 is 1 followed by
15 zeros, that is, a quadrillion. See infra Appendix for the calculations.
99. Some opinions present a contrary view. E.g., Vasquez v. Hillery, 474 U.S. 254, 259 n.3
(1986) (“the District Court . . . ultimately accepted . . . a probability of 2 in 1000 that the phenomenon
was attributable to chance”); Nat’l Abortion Fed. v. Ashcroft, 330 F. Supp. 2d 436 (S.D.N.Y. 2004),
aff’d in part, 437 F.3d 278 (2d Cir. 2006), vacated, 224 Fed. App’x. 88 (2d Cir. 2007) (“According to Dr.
Howell, . . . a ‘P value’ of 0.30 . . . indicates that there is a thirty percent probability that the results
of the . . . [s]tudy were merely due to chance alone.”). Such statements confuse the probability of the
To recapitulate the logic of significance testing: If p is small, the observed
data are far from what is expected under the null hypothesis—too far to be readily
explained by the operations of chance. That discredits the null hypothesis.
Computing p-values requires statistical expertise. Many methods are available,
but only some will fit the occasion. Sometimes standard errors will be part of the
analysis; other times they will not be. Sometimes a difference of two standard
errors will imply a p-value of about 5%; other times it will not. In general, the
p-value depends on the model, the size of the sample, and the sample statistics.
2. Is a difference statistically significant?
If an observed difference is in the middle of the distribution that would be
expected under the null hypothesis, there is no surprise. The sample data are of the
type that often would be seen when the null hypothesis is true. The difference is
not significant, as statisticians say, and the null hypothesis cannot be rejected. On
the other hand, if the sample difference is far from the expected value—according
to the null hypothesis—then the sample is unusual. The difference is significant,
and the null hypothesis is rejected. Statistical significance is determined by comparing p to a preset value, called the significance level.100 The null hypothesis is
rejected when p falls below this level.
In practice, statistical analysts typically use levels of 5% and 1%.101 The
5% level is the most common in social science, and an analyst who speaks of significant results without specifying the threshold probably is using this figure. An
unexplained reference to highly significant results probably means that p is less
kind of outcome observed, which is computed under some model of chance, with the probability that
chance is the explanation for the outcome—the “transposition fallacy.”
Instances of the transposition fallacy in criminal cases are collected in David H. Kaye et al., The
New Wigmore: A Treatise on Evidence: Expert Evidence §§ 12.8.2(b) & 14.1.2 (2d ed. 2011). In
McDaniel v. Brown, 130 S. Ct. 665 (2010), for example, a DNA analyst suggested that a random match
probability of 1/3,000,000 implied a .000033 probability that the DNA was not the source of the
DNA found on the victim’s clothing. See David H. Kaye, “False But Highly Persuasive”: How Wrong
Were the Probability Estimates in McDaniel v. Brown? 108 Mich. L. Rev. First Impressions 1 (2009).
100. Statisticians use the Greek letter alpha (α) to denote the significance level; α gives the
chance of getting a significant result, assuming that the null hypothesis is true. Thus, α represents the
chance of a false rejection of the null hypothesis (also called a false positive, a false alarm, or a Type I
error). For example, suppose α = 5%. If investigators do many studies, and the null hypothesis happens to be true in each case, then about 5% of the time they would obtain significant results—and
falsely reject the null hypothesis.
101. The Supreme Court implicitly referred to this practice in Castaneda v. Partida, 430 U.S.
482, 496 n.17 (1977), and Hazelwood School District v. United States, 433 U.S. 299, 311 n.17 (1977).
In these footnotes, the Court described the null hypothesis as “suspect to a social scientist” when a
statistic from “large samples” falls more than “two or three standard deviations” from its expected value
under the null hypothesis. Although the Court did not say so, these differences produce p-values of
about 5% and 0.3% when the statistic is normally distributed. The Court’s standard deviation is our
standard error.
than 1%. These levels of 5% and 1% have become icons of science and the legal
process. In truth, however, such levels are at best useful conventions.
Because the term “significant” is merely a label for a certain kind of p-value,
significance is subject to the same limitations as the underlying p-value. Thus,
significant differences may be evidence that something besides random error is at
work. They are not evidence that this something is legally or practically important. Statisticians distinguish between statistical and practical significance to make
the point. When practical significance is lacking—when the size of a disparity is
negligible—there is no reason to worry about statistical significance.102
It is easy to mistake the p-value for the probability of the null hypothesis given
the data (supra Section IV.B.1). Likewise, if results are significant at the 5% level,
it is tempting to conclude that the null hypothesis has only a 5% chance of being
correct.103 This temptation should be resisted. From the frequentist perspective,
statistical hypotheses are either true or false. Probabilities govern the samples, not
the models and hypotheses. The significance level tells us what is likely to happen
when the null hypothesis is correct; it does not tell us the probability that the
hypothesis is true. Significance comes no closer to expressing the probability that
the null hypothesis is true than does the underlying p-value.
3. Tests or interval estimates?
How can a highly significant difference be practically insignificant? The reason
is simple: p depends not only on the magnitude of the effect, but also on the
sample size (among other things). With a huge sample, even a tiny effect will be
102. E.g., Waisome v. Port Auth., 948 F.2d 1370, 1376 (2d Cir. 1991) (“though the disparity
was found to be statistically significant, it was of limited magnitude.”); United States v. Henderson,
409 F.3d 1293, 1306 (11th Cir. 2005) (regardless of statistical significance, excluding law enforcement
officers from jury service does not have a large enough impact on the composition of grand juries
to violate the Jury Selection and Service Act); cf. Thornburg v. Gingles, 478 U.S. 30, 53–54 (1986)
(repeating the district court’s explanation of why “the correlation between the race of the voter and
the voter’s choice of certain candidates was [not only] statistically significant,” but also “so marked
as to be substantively significant, in the sense that the results of the individual election would have
been different depending upon whether it had been held among only the white voters or only the
black voters.”).
103. E.g., Waisome, 948 F.2d at 1376 (“Social scientists consider a finding of two standard
deviations significant, meaning there is about one chance in 20 that the explanation for a deviation
could be random . . . .”); Adams v. Ameritech Serv., Inc., 231 F.3d 414, 424 (7th Cir. 2000) (“Two
standard deviations is normally enough to show that it is extremely unlikely (. . . less than a 5%
probability) that the disparity is due to chance”); Magistrini v. One Hour Martinizing Dry Cleaning,
180 F. Supp. 2d 584, 605 n.26 (D.N.J. 2002) (a “statistically significant . . . study shows that there
is only 5% probability that an observed association is due to chance.”); cf. Giles v. Wyeth, Inc., 500
F. Supp. 2d 1048, 1056 (S.D. Ill. 2007) (“While [plaintiff] admits that a p-value of .15 is three times
higher than what scientists generally consider statistically significant—that is, a p-value of .05 or
lower—she maintains that this “represents 85% certainty, which meets any conceivable concept of
preponderance of the evidence.”).
Reference Guide on Statistics
highly significant.104 For example, suppose that a company hires 52% of male job
applicants and 49% of female applicants. With a large enough sample, a statistician could compute an impressively small p-value. This p-value would confirm
that the difference does not result from chance, but it would not convert a trivial
difference (52% versus 49%) into a substantial one.105 In short, the p-value does
not measure the strength or importance of an association.
A “significant” effect can be small. Conversely, an effect that is “not significant” can be large. By inquiring into the magnitude of an effect, courts can avoid
being misled by p-values. To focus attention on more substantive concerns—the
size of the effect and the precision of the statistical analysis—interval estimates
(e.g., confidence intervals) may be more valuable than tests. Seeing a plausible
range of values for the quantity of interest helps describe the statistical uncertainty
in the estimate.
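The contrast between statistical and practical significance can be sketched in a few lines of Python. The 52%-versus-49% hiring figures come from the example in the text; the sample sizes, and the use of a pooled two-sample z-test, are illustrative assumptions, not part of the original discussion:

```python
import math

def two_prop_test(p1, n1, p2, n2):
    """Two-sample z-test for a difference in proportions.
    Returns the estimated difference and the two-tailed p-value
    (normal approximation with a pooled standard error)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # = 2 * P(Z > |z|)
    return p1 - p2, p_value

# 52% of male vs. 49% of female applicants hired: a 3-point difference.
# With 100 applicants per group it is far from significant; with
# 100,000 per group the same trivial difference is "highly significant."
for n in (100, 100_000):
    diff, p = two_prop_test(0.52, n, 0.49, n)
    print(f"n = {n:>7,} per group: difference = {diff:.2f}, p-value = {p:.2g}")
```

The difference itself never changes; only the p-value does, which is why the p-value cannot measure the strength or importance of the association.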
4. Is the sample statistically significant?
Many a sample has been praised for its statistical significance or blamed for its lack
thereof. Technically, this makes little sense. Statistical significance is about the
difference between observations and expectations. Significance therefore applies
to statistics computed from the sample, but not to the sample itself, and certainly
not to the size of the sample. Findings can be statistically significant. Differences
can be statistically significant (supra Section IV.B.2). Estimates can be statistically
significant (infra Section V.D.2). By contrast, samples can be representative or
unrepresentative. They can be chosen well or badly (supra Section II.B.1). They
can be large enough to give reliable results or too small to bother with (supra
Section IV.A.3). But samples cannot be “statistically significant,” if this technical
phrase is to be used as statisticians use it.
C. Evaluating Hypothesis Tests
1. What is the power of the test?
When a p-value is high, findings are not significant, and the null hypothesis is not
rejected. This could happen for at least two reasons:
104. See supra Section IV.B.2. Although some opinions seem to equate small p-values with
“gross” or “substantial” disparities, most courts recognize the need to decide whether the underlying
sample statistics reveal that a disparity is large. E.g., Washington v. People, 186 P.3d 594 (Colo. 2008)
(jury selection).
105. Cf. Frazier v. Garrison Indep. Sch. Dist., 980 F.2d 1514, 1526 (5th Cir. 1993) (rejecting
claims of intentional discrimination in the use of a teacher competency examination that resulted in
retention rates exceeding 95% for all groups); Washington, 186 P.3d 594 (although a jury selection
practice that reduced the representation of “African-Americans [from] 7.7 percent of the population
[to] 7.4 percent of the county’s jury panels produced a highly statistically significant disparity, the small
degree of exclusion was not constitutionally significant.”).
1. The null hypothesis is true.
2. The null is false—but, by chance, the data happened to be of the kind
expected under the null.
If the power of a statistical study is low, the second explanation may be plausible. Power is the chance that a statistical test will declare an effect when there
is an effect to be declared.106 This chance depends on the size of the effect and
the size of the sample. Discerning subtle differences requires large samples; small
samples may fail to detect substantial differences.
When a study with low power fails to show a significant effect, the results
may therefore be more fairly described as inconclusive than negative. The proof
is weak because power is low. On the other hand, when studies have a good
chance of detecting a meaningful association, failure to obtain significance can be
persuasive evidence that there is nothing much to be found.107
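Power can be approximated with the standard library's NormalDist. The setup here is hypothetical: the test is the two-sided 5% test that a coin is fair, and the true heads probability of 0.55 is an assumed alternative chosen for illustration:

```python
from statistics import NormalDist

def power_vs_fair_coin(p_true, n, alpha=0.05):
    """Approximate power of the two-sided level-alpha test of
    H0: p = 0.5 when the coin's true heads probability is p_true.
    Normal approximation; standard error taken under the null."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)  # 1.96 when alpha = 5%
    se = (0.25 / n) ** 0.5              # standard error under H0
    shift = abs(p_true - 0.5) / se      # how far the truth sits from H0
    # chance the test statistic lands beyond the cutoff
    # (the opposite tail contributes negligibly and is ignored)
    return 1 - nd.cdf(z_crit - shift)

for n in (100, 400, 1600):
    print(f"n = {n:>4}: power to detect p = 0.55 is {power_vs_fair_coin(0.55, n):.2f}")
```

With 100 tosses the test misses this moderate bias most of the time; with 1,600 tosses it almost always catches it. A nonsignificant result from the larger study is therefore far more persuasive than one from the smaller.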
2. What about small samples?
For simplicity, the examples of statistical inference discussed here (supra Sections IV.A–B) were based on large samples. Small samples also can provide useful
106. More precisely, power is the probability of rejecting the null hypothesis when the alternative hypothesis (infra Section IV.C.5) is right. Typically, this probability will depend on the values of
unknown parameters, as well as the preset significance level α. The power can be computed for any
value of α and any choice of parameters satisfying the alternative hypothesis. See infra Appendix for
an example. Frequentist hypothesis testing keeps the risk of a false positive to a specified level (such
as α = 5%) and then tries to maximize power.
Statisticians usually denote power by the Greek letter beta (β). However, some authors use β to
denote the probability of accepting the null hypothesis when the alternative hypothesis is true; this usage
is fairly standard in epidemiology. Accepting the null hypothesis when the alternative holds true is a
false negative (also called a Type II error, a missed signal, or a false acceptance of the null hypothesis).
The chance of a false negative may be computed from the power. Some commentators have
claimed that the cutoff for significance should be chosen to equalize the chance of a false positive and
a false negative, on the ground that this criterion corresponds to the more-probable-than-not burden
of proof. The argument is fallacious, because α and β do not give the probabilities of the null and
alternative hypotheses; see supra Sections IV.B.1–2; supra note 34. See also D.H. Kaye, Hypothesis Testing
in the Courtroom, in Contributions to the Theory and Application of Statistics: A Volume in Honor of
Herbert Solomon 331, 341–43 (Alan E. Gelfand ed., 1987).
107. Some formal procedures (meta-analysis) are available to aggregate results across studies.
See, e.g., In re Bextra and Celebrex Marketing Sales Practices and Prod. Liab. Litig., 524 F. Supp.
2d 1166, 1174, 1184 (N.D. Cal. 2007) (holding that “[a] meta-analysis of all available published and
unpublished randomized clinical trials” of certain pain-relief medicine was admissible). In principle,
the power of the collective results will be greater than the power of each study. However, these
procedures have their own weaknesses. See, e.g., Richard A. Berk & David A. Freedman, Statistical
Assumptions as Empirical Commitments, in Punishment and Social Control: Essays in Honor of Sheldon
Messinger 235, 244–48 (T.G. Blomberg & S. Cohen eds., 2d ed. 2003); Michael Oakes, Statistical
Inference: A Commentary for the Social and Behavioral Sciences (1986); Diana B. Petitti, Meta-Analysis, Decision Analysis, and Cost-Effectiveness Analysis: Methods for Quantitative Synthesis in
Medicine (2d ed. 2000).
information. Indeed, when confidence intervals and p-values can be computed,
the interpretation is the same with small samples as with large ones.108 The concern with small samples is not that they are beyond the ken of statistical theory,
but that
1. The underlying assumptions are hard to validate.
2. Because approximations based on the normal curve generally cannot be
used, confidence intervals may be difficult to compute for parameters of
interest. Likewise, p-values may be difficult to compute for hypotheses
of interest.109
3. Small samples may be unreliable, with large standard errors, broad confidence intervals, and tests having low power.
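Note 108 mentions Fisher's exact test as a standard small-sample technique. A bare-bones version for a 2×2 table needs nothing beyond exact binomial coefficients; the hiring numbers below are hypothetical, chosen only to show the calculation:

```python
from math import comb

def fisher_exact_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    the exact probability, with all margins held fixed, of a count in
    the upper-left cell at least as large as the one observed
    (hypergeometric distribution)."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        if 0 <= col1 - k <= row2:  # table must remain feasible
            p += comb(row1, k) * comb(row2, col1 - k) / denom
    return p

# Hypothetical data: 8 of 10 hired in one group, 2 of 10 in the other.
print(f"one-sided exact p-value = {fisher_exact_one_sided(8, 2, 2, 8):.4f}")  # 0.0115
```

No normal-curve approximation is involved, so the p-value is meaningful even for a sample of twenty.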
3. One tail or two?
In many cases, a statistical test can be done either one-tailed or two-tailed; the
second method often produces a p-value twice as big as the first method. The
methods are easily explained with a hypothetical example. Suppose we toss a coin
1000 times and get 532 heads. The null hypothesis to be tested asserts that the
coin is fair. If the null is correct, the chance of getting 532 or more heads is 2.3%.
That is a one-tailed test, whose p-value is 2.3%. To make a two-tailed test, the
statistician computes the chance of getting 532 or more heads—or 500 − 32 = 468
heads or fewer. This is 4.6%. In other words, the two-tailed p-value is 4.6%.
Because small p-values are evidence against the null hypothesis, the one-tailed test
seems to produce stronger evidence than its two-tailed counterpart. However,
the advantage is largely illusory, as the example suggests. (The two-tailed test may
seem artificial, but it offers some protection against possible artifacts resulting from
multiple testing—the topic of the next section.)
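The coin-tossing arithmetic in the example can be verified exactly from the binomial distribution (a sketch; the helper names are our own):

```python
from math import comb

def upper_tail(n, k):
    """Exact P(X >= k) for X ~ Binomial(n, 1/2)."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

def lower_tail(n, k):
    """Exact P(X <= k) for X ~ Binomial(n, 1/2)."""
    return sum(comb(n, j) for j in range(0, k + 1)) / 2 ** n

one_tailed = upper_tail(1000, 532)               # 532 or more heads
two_tailed = one_tailed + lower_tail(1000, 468)  # ...or 468 or fewer
print(f"one-tailed: {one_tailed:.1%}, two-tailed: {two_tailed:.1%}")  # 2.3% and 4.6%
```

Because a fair coin is symmetric, the lower tail exactly equals the upper tail, and the two-tailed p-value is exactly double the one-tailed value.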
Some courts and commentators have argued for one or the other type of test,
but a rigid rule is not required if significance levels are used as guidelines rather
than as mechanical rules for statistical proof.110 One-tailed tests often make it
108. Advocates sometimes contend that samples are “too small to allow for meaningful statistical
analysis,” United States v. New York City Bd. of Educ., 487 F. Supp. 2d 220, 229 (E.D.N.Y. 2007),
and courts often look to the size of samples from earlier cases to determine whether the sample data
before them are admissible or convincing. Id. at 230; Timmerman v. U.S. Bank, 483 F.3d 1106, 1116
n.4 (10th Cir. 2007). However, a meaningful statistical analysis yielding a significant result can be based
on a small sample, and reliability does not depend on sample size alone (see supra Section IV.A.3, infra
Section V.C.1). Well-known small-sample techniques include the sign test and Fisher’s exact test.
E.g., Michael O. Finkelstein & Bruce Levin, Statistics for Lawyers 154–56, 339–41 (2d ed. 2001); see
generally E.L. Lehmann & H.J.M. d’Abrera, Nonparametrics (2d ed. 2006).
109. With large samples, approximate inferences (e.g., based on the central limit theorem, see
infra Appendix) may be quite adequate. These approximations will not be satisfactory for small samples.
110. See, e.g., United States v. State of Delaware, 93 Fair Empl. Prac. Cas. (BNA) 1248, 2004
WL 609331, *10 n.4 (D. Del. 2004). According to formal statistical theory, the choice between one
easier to reach a threshold such as 5%, at least in terms of appearance. However,
if we recognize that 5% is not a magic line, then the choice between one tail
and two is less important—as long as the choice and its effect on the p-value are
made explicit.
4. How many tests have been done?
Repeated testing complicates the interpretation of significance levels. If enough
comparisons are made, random error almost guarantees that some will yield “significant” findings, even when there is no real effect. To illustrate the point, consider the problem of deciding whether a coin is biased. The probability that a fair
coin will produce 10 heads when tossed 10 times is (1/2)¹⁰ = 1/1024. Observing
10 heads in the first 10 tosses, therefore, would be strong evidence that the coin
is biased. Nonetheless, if a fair coin is tossed a few thousand times, it is likely that
at least one string of ten consecutive heads will appear. Ten heads in the first ten
tosses means one thing; a run of ten heads somewhere along the way to a few
thousand tosses of a coin means quite another. A test—looking for a run of ten
heads—can be repeated too often.
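The run-of-ten-heads illustration can be computed exactly with a small Markov chain whose state is the length of the current streak of heads (an illustrative sketch, not part of the original text):

```python
def prob_run_of_heads(n, run=10):
    """Exact probability that n tosses of a fair coin contain at least
    one run of `run` consecutive heads, via a Markov chain whose state
    is the length of the current streak of heads."""
    state = [1.0] + [0.0] * run          # state[run] = "run has occurred"
    for _ in range(n):
        nxt = [0.0] * (run + 1)
        for i in range(run):
            nxt[0] += state[i] / 2       # tails resets the streak
            nxt[i + 1] += state[i] / 2   # heads extends the streak
        nxt[run] += state[run]           # absorbing: run already seen
        state = nxt
    return state[run]

print(prob_run_of_heads(10))    # 1/1024: strong evidence of bias if seen at once
print(prob_run_of_heads(5000))  # but quite likely somewhere in 5,000 tosses
```

The same pattern that is one-in-a-thousand in a preplanned test becomes the expected outcome of a long enough search.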
Artifacts from multiple testing are commonplace. Because research that fails to
uncover significance often is not published, reviews of the literature may produce
an unduly large number of studies finding statistical significance.111 Even a single
researcher may examine so many different relationships that a few will achieve
statistical significance by mere happenstance. Almost any large dataset—even pages
from a table of random digits—will contain some unusual pattern that can be
uncovered by diligent search. Having detected the pattern, the analyst can perform
a statistical test for it, blandly ignoring the search effort. Statistical significance is
bound to follow.
There are statistical methods for dealing with multiple looks at the data,
which permit the calculation of meaningful p-values in certain cases.112 However,
no general solution is available, and the existing methods would be of little help
in the typical case where analysts have tested and rejected a variety of models
before arriving at the one considered the most satisfactory (see infra Section V on
regression models). In these situations, courts should not be overly impressed with
tail or two can sometimes be made by considering the exact form of the alternative hypothesis (infra
Section IV.C.5). But see Freedman et al., supra note 12, at 547–50. One-tailed tests at the 5% level
are viewed as weak evidence—no weaker standard is commonly used in the technical literature.
One-tailed tests are also called one-sided (with no pejorative intent); two-tailed tests are two-sided.
111. E.g., Philippa J. Easterbrook et al., Publication Bias in Clinical Research, 337 Lancet 867
(1991); John P.A. Ioannidis, Effect of the Statistical Significance of Results on the Time to Completion and
Publication of Randomized Efficacy Trials, 279 JAMA 281 (1998); Stuart J. Pocock et al., Statistical Problems
in the Reporting of Clinical Trials: A Survey of Three Medical Journals, 317 New Eng. J. Med. 426 (1987).
112. See, e.g., Sandrine Dudoit & Mark J. van der Laan, Multiple Testing Procedures with
Applications to Genomics (2008).
claims that estimates are significant. Instead, they should be asking how analysts
developed their models.113
5. What are the rival hypotheses?
The p-value of a statistical test is computed on the basis of a model for the data:
the null hypothesis. Usually, the test is made in order to argue for the alternative
hypothesis: another model. However, on closer examination, both models may
prove to be unreasonable. A small p-value means something is going on besides
random error. The alternative hypothesis should be viewed as one possible explanation, out of many, for the data.
In Mapes Casino, Inc. v. Maryland Casualty Co.,114 the court recognized the
importance of explanations that the proponent of the statistical evidence had failed
to consider. In this action to collect on an insurance policy, Mapes sought to quantify its loss from theft. It argued that employees were using an intermediary to cash
in chips at other casinos. The casino established that over an 18-month period,
the win percentage at its craps tables was 6%, compared to an expected value of
20%. The statistics proved that something was wrong at the craps tables—the discrepancy was too big to explain as the product of random chance. But the court
was not convinced by plaintiff’s alternative hypothesis. The court pointed to other
possible explanations (Runyonesque activities such as skimming, scamming, and
crossroading) that might have accounted for the discrepancy without implicating
the suspect employees.115 In short, rejection of the null hypothesis does not leave
the proffered alternative hypothesis as the only viable explanation for the data.116
113. Intuition may suggest that the more variables included in the model, the better. However,
this idea often turns out to be wrong. Complex models may reflect only accidental features of the data.
Standard statistical tests offer little protection against this possibility when the analyst has tried a variety
of models before settling on the final specification. See authorities cited, supra note 21.
114. 290 F. Supp. 186 (D. Nev. 1968).
115. Id. at 193. Skimming consists of “taking off the top before counting the drop,” scamming
is “cheating by collusion between dealer and player,” and crossroading involves “professional cheaters
among the players.” Id. In plainer language, the court seems to have ruled that the casino itself might
be cheating, or there could have been cheaters other than the particular employees identified in the
case. At the least, plaintiff’s statistical evidence did not rule out such possibilities. Compare EEOC v.
Sears, Roebuck & Co., 839 F.2d 302, 312 & n.9, 313 (7th Cir. 1988) (EEOC’s regression studies
showing significant differences did not establish liability because surveys and testimony supported the
rival hypothesis that women generally had less interest in commission sales positions), with EEOC v.
General Tel. Co., 885 F.2d 575 (9th Cir. 1989) (unsubstantiated rival hypothesis of “lack of interest”
in “nontraditional” jobs insufficient to rebut prima facie case of gender discrimination); cf. supra Section II.A (problem of confounding).
116. E.g., Coleman v. Quaker Oats Co., 232 F.3d 1271, 1283 (9th Cir. 2000) (a disparity with
a p-value of “3 in 100 billion” did not demonstrate age discrimination because “Quaker never contends that the disparity occurred by chance, just that it did not occur for discriminatory reasons. When
other pertinent variables were factored in, the statistical disparity diminished and finally disappeared.”).
D. Posterior Probabilities
Standard errors, p-values, and significance tests are common techniques for assessing random error. These procedures rely on sample data and are justified in terms
of the operating characteristics of statistical procedures.117 However, frequentist
statisticians generally will not compute the probability that a particular hypothesis
is correct, given the data.118 For example, a frequentist may postulate that a coin is
fair: There is a 50-50 chance of landing heads, and successive tosses are independent. This is viewed as an empirical statement—potentially falsifiable—about the
coin. It is easy to calculate the chance that a fair coin will turn up heads in the next
10 tosses: The answer (see supra Section IV.C.4) is 1/1024. Therefore, observing
10 heads in a row brings into serious doubt the initial hypothesis of fairness.
But what of the converse probability: If the coin does land heads 10 times,
what is the chance that it is fair?119 To compute such converse probabilities, it is
necessary to postulate initial probabilities that the coin is fair, as well as probabilities of unfairness to various degrees. In the frequentist theory of inference, such
postulates are untenable: Probabilities are objective features of the situation that
specify the chances of events or effects, not hypotheses or causes.
By contrast, in the Bayesian approach, probabilities represent subjective
degrees of belief about hypotheses or causes rather than objective facts about
observations. The observer must quantify beliefs about the chance that the coin
is unfair to various degrees—in advance of seeing the data.120 These subjective
probabilities, like the probabilities governing the tosses of the coin, are set up to
obey the axioms of probability theory. The probabilities for the various hypotheses
about the coin, specified before data collection, are called prior probabilities.
117. Operating characteristics include the expected value and standard error of estimators, probabilities of error for statistical tests, and the like.
118. In speaking of “frequentist statisticians” or “Bayesian statisticians,” we do not mean to suggest that all statisticians fall on one side of the philosophical divide or the other. These are archetypes.
Many practicing statisticians are pragmatists, using whatever procedure they think is appropriate for
the occasion, and not concerning themselves greatly with what the numbers they obtain really mean.
119. We call this a converse probability because it is of the form P(H0|data) rather than
P(data|H0); an equivalent phrase, “inverse probability,” also is used. Treating P(data|H0) as if it were
the converse probability P(H0|data) is the transposition fallacy. For example, most U.S. senators are
men, but few men are senators. Consequently, there is a high probability that an individual who is a
senator is a man, but the probability that an individual who is a man is a senator is practically zero.
For examples of the transposition fallacy in court opinions, see cases cited supra notes 98, 102. The
frequentist p-value, P(data|H0), is generally not a good approximation to the Bayesian P(H0|data); the
latter includes considerations of power and base rates.
120. For example, let p be the unknown probability that the coin lands heads. What is the
chance that p exceeds 0.1? 0.6? The Bayesian statistician must be prepared to answer such questions.
Bayesian procedures are sometimes defended on the ground that the beliefs of any rational observer
must conform to the Bayesian rules. However, the definition of “rational” is purely formal. See Peter
C. Fishburn, The Axioms of Subjective Probability, 1 Stat. Sci. 335 (1986); Freedman, supra note 84;
David Kaye, The Laws of Probability and the Law of the Land, 47 U. Chi. L. Rev. 34 (1979).
Prior probabilities can be updated, using Bayes’ rule, given data on how the
coin actually falls. (The Appendix explains the rule.) In short, a Bayesian statistician can compute posterior probabilities for various hypotheses about the coin,
given the data. These posterior probabilities quantify the statistician’s confidence
in the hypothesis that a coin is fair.121 Although such posterior probabilities relate
directly to hypotheses of legal interest, they are necessarily subjective, for they
reflect not just the data but also the subjective prior probabilities—that is, degrees
of belief about hypotheses formulated prior to obtaining data.
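A minimal numerical sketch of Bayes' rule for the coin. The prior is the subjective ingredient; here it is an assumed 50-50 split between a fair coin and one biased to land heads 75% of the time (both numbers are invented for illustration):

```python
# Hypothetical prior: 50% chance the coin is fair (p = 0.5),
# 50% chance it is biased toward heads (p = 0.75).
priors = {0.5: 0.5, 0.75: 0.5}

# Likelihood of the data (10 heads in 10 tosses) under each hypothesis.
likelihoods = {p: p ** 10 for p in priors}

# Bayes' rule: posterior is proportional to prior times likelihood.
evidence = sum(priors[p] * likelihoods[p] for p in priors)
posteriors = {p: priors[p] * likelihoods[p] / evidence for p in priors}

print(f"posterior probability the coin is fair: {posteriors[0.5]:.3f}")  # about 0.017
```

A different prior would yield a different posterior from the same ten tosses, which is precisely the frequentist objection to the procedure.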
Such analyses have rarely been used in court, and the question of their
forensic value has been aired primarily in the academic literature. Some statisticians favor Bayesian methods, and some commentators have proposed using these
methods in some kinds of cases.122 The frequentist view of statistics is more conventional; subjective Bayesians are a well-established minority.123
121. Here, confidence has the meaning ordinarily ascribed to it, rather than the technical interpretation applicable to a frequentist confidence interval. Consequently, it can be related to the burden
of persuasion. See D.H. Kaye, Apples and Oranges: Confidence Coefficients and the Burden of Persuasion,
73 Cornell L. Rev. 54 (1987).
122. See David H. Kaye et al., The New Wigmore: A Treatise on Evidence: Expert Evidence
§§ 12.8.5, 14.3.2 (2d ed. 2010); David H. Kaye, Rounding Up the Usual Suspects: A Legal and Logical
Analysis of DNA Database Trawls, 87 N.C. L. Rev. 425 (2009). In addition, as indicated in the Appendix, Bayes’ rule is crucial in solving certain problems involving conditional probabilities of related
events. For example, if the proportion of women with breast cancer in a region is known, along with
the probability that a mammogram of an affected woman will be positive for cancer and that the
mammogram of an unaffected woman will be negative, then one can compute the numbers of false-positive and false-negative mammography results that would be expected to arise in a population-wide
screening program. Using Bayes’ rule to diagnose a specific patient, however, is more problematic,
because the prior probability that the patient has breast cancer may not equal the population proportion. Nevertheless, to overcome the tendency to focus on a test result without considering the “base
rate” at which a condition occurs, a diagnostician can apply Bayes’ rule to plausible base rates before
making a diagnosis. Finally, Bayes’ rule also is valuable as a device to explicate the meaning of concepts
such as error rates, probative value, and transposition. See, e.g., David H. Kaye, The Double Helix
and the Law of Evidence (2010); Wigmore, supra, § 7.3.2; David H. Kaye & Jonathan J. Koehler, The
Misquantification of Probative Value, 27 Law & Hum. Behav. 645 (2003).
123. “Objective Bayesians” use Bayes’ rule without eliciting prior probabilities from subjective
beliefs. One strategy is to use preliminary data to estimate the prior probabilities and then apply Bayes’
rule to that empirical distribution. This “empirical Bayes” procedure avoids the charge of subjectivism at the cost of departing from a fully Bayesian framework. With ample data, however, it can be
effective and the estimates or inferences can be understood in frequentist terms. Another “objective”
approach is to use “noninformative” priors that are supposed to be independent of all data and prior
beliefs. However, the choice of such priors can be questioned, and the approach has been attacked by
frequentists and subjective Bayesians. E.g., Joseph B. Kadane, Is “Objective Bayesian Analysis” Objective,
Bayesian, or Wise?, 1 Bayesian Analysis 433 (2006), available at http://ba.stat.cmu.edu/journal/2006/
vol01/issue03/kadane.pdf; Jon Williamson, Philosophies of Probability, in Philosophy of Mathematics
493 (Andrew Irvine ed., 2009) (discussing the challenges to objective Bayesianism).
V. Correlation and Regression
Regression models are used by many social scientists to infer causation from
association. Such models have been offered in court to prove disparate impact in
discrimination cases, to estimate damages in antitrust actions, and for many other
purposes. Sections V.A, V.B, and V.C cover some preliminary material, showing
how scatter diagrams, correlation coefficients, and regression lines can be used to
summarize relationships between variables.124 Section V.D explains the ideas and
some of the pitfalls.
A. Scatter Diagrams
The relationship between two variables can be graphed in a scatter diagram (also
called a scatterplot or scattergram). We begin with data on income and education
for a sample of 178 men, ages 25 to 34, residing in Kansas.125 Each person in
the sample corresponds to one dot in the diagram. As indicated in Figure 5, the
horizontal axis shows education, and the vertical axis shows income. Person A
completed 12 years of schooling (high school) and had an income of $20,000.
Person B completed 16 years of schooling (college) and had an income of $40,000.
Figure 5. Plotting a scatter diagram. The horizontal axis shows educational level
and the vertical axis shows income.
124. The focus is on simple linear regression. See also Rubinfeld, supra note 21, and the Appendix, infra, and Section II, supra, for further discussion of these ideas with an emphasis on econometrics.
125. These data are from a public-use CD, Bureau of the Census, U.S. Department of Commerce, for the March 2005 Current Population Survey. Income and education are self-reported.
Income is censored at $100,000. For additional details, see Freedman et al., supra note 12, at A-11.
Both variables in a scatter diagram have to be quantitative (with numerical values) rather than qualitative (nonnumerical).
Figure 6 is the scatter diagram for the Kansas data. The diagram confirms an
obvious point. There is a positive association between income and education. In
general, persons with a higher educational level have higher incomes. However,
there are many exceptions to this rule, and the association is not as strong as one
might expect.
Figure 6. Scatter diagram for income and education: men ages 25 to 34 in Kansas.
B. Correlation Coefficients
Two variables are positively correlated when their values tend to go up or down
together, such as income and education in Figure 6. The correlation coefficient
(usually denoted by the letter r) is a single number that reflects the sign of an association and its strength. Figure 7 shows r for three scatter diagrams: In the first,
there is no association; in the second, the association is positive and moderate; in
the third, the association is positive and strong.
A correlation coefficient of 0 indicates no linear association between the
variables. The maximum value for the coefficient is +1, indicating a perfect linear
relationship: The dots in the scatter diagram fall on a straight line that slopes up.
Sometimes, there is a negative association between two variables: Large values
of one tend to go with small values of the other. The age of a car and its fuel
economy in miles per gallon illustrate the idea. Negative association is indicated by
negative values for r. The extreme case is an r of –1, indicating that all the points
in the scatter diagram lie on a straight line that slopes down.
Weak associations are the rule in the social sciences. In Figure 6, the correlation between income and education is about 0.4. The correlation between college
grades and first-year law school grades is under 0.3 at most law schools, while the
Figure 7. The correlation coefficient measures the sign of a linear association
and its strength.
correlation between LSAT scores and first-year grades is generally about 0.4.126
The correlation between heights of fraternal twins is about 0.5. By contrast, the
correlation between heights of identical twins is about 0.95.
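The coefficient itself is simply the covariance of the two variables divided by the product of their standard deviations. A short computation with hypothetical education-income pairs (not the Kansas data):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient: covariance of x and y
    divided by the product of their standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Invented pairs: years of schooling and income in $1000s for five people.
education = [8, 12, 12, 16, 16]
income = [18, 22, 30, 28, 40]
print(f"r = {pearson_r(education, income):.2f}")  # about 0.8
```

Because r is built from deviations around the means, it is unchanged by shifts of scale or units, which is what makes it a standardized measure of association.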
1. Is the association linear?
The correlation coefficient has a number of limitations, to be considered in turn.
The correlation coefficient is designed to measure linear association. Figure 8
shows a strong nonlinear pattern with a correlation close to zero. The correlation
coefficient is of limited use with nonlinear data.
2. Do outliers influence the correlation coefficient?
The correlation coefficient can be distorted by outliers—a few points that are far
removed from the bulk of the data. The left-hand panel in Figure 9 shows that
one outlier (lower right-hand corner) can reduce a perfect correlation to nearly
nothing. Conversely, the right-hand panel shows that one outlier (upper right-hand corner) can raise a correlation of zero to nearly one. If there are extreme
outliers in the data, the correlation coefficient is unlikely to be meaningful.
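The effect of an outlier is easy to demonstrate numerically. Ten points lying exactly on a line have r = 1; appending a single invented point in the lower right corner drags the coefficient down to nearly nothing, mirroring the left-hand panel of Figure 9:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Ten points exactly on a line: perfect correlation.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
print(pearson_r(xs, ys))  # 1.0

# One outlier in the lower right corner: r collapses toward zero.
print(pearson_r(xs + [50], ys + [0]))
```

A single point, if extreme enough, can dominate the sums of squares on which r is based; inspecting the scatter diagram before trusting the coefficient guards against this.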
3. Does a confounding variable influence the coefficient?
The correlation coefficient measures the association between two variables.
Researchers—and the courts—are usually more interested in causation. Causation is not the same as association. The association between two variables may
be driven by a lurking variable that has been omitted from the analysis (supra
126. Lisa Anthony Stilwell et al., Predictive Validity of the LSAT: A National Summary of the
2001–2002 Correlation Studies 5, 8 (2003).
Figure 8. The scatter diagram shows a strong nonlinear association with a correlation coefficient close to zero. The correlation coefficient only
measures the degree of linear association.
Figure 9. The correlation coefficient can be distorted by outliers.
Section II.A). For an easy example, there is an association between shoe size and
vocabulary among schoolchildren. However, learning more words does not cause
the feet to get bigger, and swollen feet do not make children more articulate. In
this case, the lurking variable is easy to spot—age. In more realistic examples, the
lurking variable is harder to identify.127
127. Green et al., supra note 13, Section IV.C, provides one such example.
In statistics, lurking variables are called confounders or confounding variables.
Association often does reflect causation, but a large correlation coefficient is not
enough to warrant causal inference. A large value of r only means that the dependent variable marches in step with the independent one: Possible reasons include
causation, confounding, and coincidence. Multiple regression is one method that
attempts to deal with confounders (infra Section V.D).128
C. Regression Lines
The regression line can be used to describe a linear trend in the data. The regression line for income on education in the Kansas sample is shown in Figure 10.
The height of the line estimates the average income for a given educational level.
For example, the average income for people with 8 years of education is estimated
at $21,100, indicated by the height of the line at 8 years. The average income for
people with 16 years of education is estimated at $34,700.
Figure 10. The regression line for income on education and its estimates.
Figure 11 combines the data in Figures 5 and 10: it shows the scatter diagram
for income and education, with the regression line superimposed. The line shows
the average trend of income as education increases. Thus, the regression line
indicates the extent to which a change in one variable (education) is associated with
a change in another variable (income).
128. See also Rubinfeld, supra note 21. The difference between experiments and observational
studies is discussed supra Section II.B.
Figure 11. Scatter diagram for income and education, with the regression line
indicating the trend.
1. What are the slope and intercept?
The regression line can be described in terms of its intercept and slope. Often, the
slope is the more interesting statistic. In Figure 11, the slope is $1700 per year. On
average, each additional year of education is associated with an additional $1700
of income. Next, the intercept is $7500. This is an estimate of the average income
for (hypothetical) persons with zero years of education.129 Figure 10 suggests this
estimate may not be especially good. In general, estimates based on the regression
line become less trustworthy as we move away from the bulk of the data.
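The two estimates quoted above follow directly from the intercept and slope. A minimal sketch, using the $7,500 intercept and $1,700-per-year slope given in the text:

```python
def predicted_income(years_of_education, intercept=7500.0, slope=1700.0):
    """Height of the regression line: estimated average income
    at a given educational level."""
    return intercept + slope * years_of_education

print(predicted_income(8))   # $21,100 for 8 years of education
print(predicted_income(16))  # $34,700 for 16 years of education
```

The same caution stated in the text applies to the code: a prediction at 0 years of education extrapolates far from the bulk of the data.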
The slope of the regression line has the same limitations as the correlation
coefficient: (1) The slope may be misleading if the relationship is strongly nonlinear and (2) the slope may be affected by confounders. With respect to (1), the
slope of $1700 per year in Figure 10 presents each additional year of education
as having the same value, but some years of schooling surely are worth more and
129. The regression line, like any straight line, has an equation of the form y = a + bx. Here,
a is the intercept (the value of y when x = 0), and b is the slope (the change in y per unit change in
x). In Figure 11, the intercept of the regression line is $7500 and the slope is $1700 per year. The line
estimates an average income of $34,700 for people with 16 years of education. This may be computed
from the intercept and slope as follows:
$7500 + ($1700 per year) × 16 years = $7500 + $27,200 = $34,700.
The slope b is the same anywhere along the line. Mathematically, that is what distinguishes straight
lines from other curves. If the association is negative, the slope will be negative too. The slope is
like the grade of a road, and it is negative if the road goes downhill. The intercept is like the starting
elevation of a road, and it is computed from the data so that the line goes through the center of the
scatter diagram, rather than being generally too high or too low.
others less. With respect to (2), the association between education and income is
no doubt causal, but there are other factors to consider, including family background. Compared to individuals who did not graduate from high school, people
with college degrees usually come from richer and better educated families. Thus,
college graduates have advantages besides education. As statisticians might say,
the effects of family background are confounded with the effects of education.
Statisticians often use the guarded phrases “on average” and “associated with”
when talking about the slope of the regression line. This is because the slope has
limited utility when it comes to making causal inferences.
2. What is the unit of analysis?
If association between characteristics of individuals is of interest, these characteristics should be measured on individuals. Sometimes individual-level data are not
to be had, but rates or averages for groups are available. “Ecological” correlations
are computed from such rates or averages. These correlations generally overstate
the strength of an association. For example, average income and average education
can be determined for men living in each state and in Washington, D.C. The correlation coefficient for these 51 pairs of averages turns out to be 0.70. However,
states do not go to school and do not earn incomes. People do. The correlation for
income and education for men in the United States is only 0.42. The correlation
for state averages overstates the correlation for individuals—a common tendency
for ecological correlations.130
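The tendency of ecological correlations to overstate individual-level association can be reproduced in a small simulation. This is a sketch with synthetic data; the 0.70 and 0.42 figures in the text come from the Current Population Survey, not from this code. Each "state" has a common state-level component, but individuals vary widely around the state average; averaging smoothes that individual variation away.

```python
import random
from math import sqrt

random.seed(0)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs)
                      * sum((y - my) ** 2 for y in ys))

# 51 "states," 100 individuals each.  Education and income share a
# state-level component but have large individual variation.
educ, income, educ_avg, income_avg = [], [], [], []
for _ in range(51):
    state_level = random.gauss(0, 1)
    e = [state_level + random.gauss(0, 2) for _ in range(100)]
    i = [state_level + random.gauss(0, 2) for _ in range(100)]
    educ += e
    income += i
    educ_avg.append(sum(e) / 100)
    income_avg.append(sum(i) / 100)

r_individual = pearson(educ, income)           # modest
r_ecological = pearson(educ_avg, income_avg)   # much larger
print(r_individual, r_ecological)
```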
Ecological analysis is often seen in cases claiming dilution in voting strength
of minorities. In this type of voting rights case, plaintiffs must prove three things:
(1) the minority group constitutes a majority in at least one district of a proposed
plan; (2) the minority group is politically cohesive, that is, votes fairly solidly for
its preferred candidate; and (3) the majority group votes sufficiently as a bloc to
defeat the minority-preferred candidate.131 The first requirement is compactness;
the second and third define polarized voting.
130. Correlations are computed from the March 2005 Current Population Survey for men
ages 25–64. Freedman et al., supra note 12, at 149. The ecological correlation uses only the average
figures, but within each state there is a lot of spread about the average. The ecological correlation
smoothes away this individual variation. Cf. Green et al., supra note 13, Section II.B.4 (suggesting
that ecological studies of exposure and disease are “far from conclusive” because of the lack of data on
confounding variables (a much more general problem) as well as the possible aggregation bias described
here); David A. Freedman, Ecological Inference and the Ecological Fallacy, in 6 Int’l Encyclopedia of the
Social and Behavioral Sciences 4027 (Neil J. Smelser & Paul B. Baltes eds., 2001).
131. See Thornburg v. Gingles, 478 U.S. 30, 50–51 (1986) (“First, the minority group must be
able to demonstrate that it is sufficiently large and geographically compact to constitute a majority in
a single-member district. . . . Second, the minority group must be able to show that it is politically
cohesive. . . . Third, the minority must be able to demonstrate that the white majority votes sufficiently
as a bloc to enable it . . . usually to defeat the minority’s preferred candidate.”). In subsequent cases,
the Court has emphasized that these factors are not sufficient to make out a violation of section 2 of
The secrecy of the ballot box means that polarized voting cannot be directly
observed. Instead, plaintiffs in voting rights cases rely on ecological regression,
with scatter diagrams, correlations, and regression lines to estimate voting behavior
by groups and demonstrate polarization. The unit of analysis typically is the precinct. For each precinct, public records can be used to determine the percentage of
registrants in each demographic group of interest, as well as the percentage of the
total vote for each candidate—by voters from all demographic groups combined.
Plaintiffs’ burden is to determine the vote by each demographic group separately.
Figure 12 shows how the argument unfolds. Each point in the scatter diagram
represents data for one precinct in the 1982 Democratic primary election for auditor in Lee County, South Carolina. The horizontal axis shows the percentage of
registrants who are white. The vertical axis shows the turnout rate for the white
candidate. The regression line is plotted too. The slope would be interpreted as
the difference between the white turnout rate and the black turnout rate for the
white candidate. Furthermore, the intercept would be interpreted as the black
turnout rate for the white candidate.132 The validity of such estimates is contested
in the statistical literature.133
the Voting Rights Act. E.g., Johnson v. De Grandy, 512 U.S. 997, 1011 (1994) (“Gingles . . . clearly
declined to hold [these factors] sufficient in combination, either in the sense that a court’s examination
of relevant circumstances was complete once the three factors were found to exist, or in the sense that
the three in combination necessarily and in all circumstances demonstrated dilution.”).
132. By definition, the turnout rate equals the number of votes for the candidate, divided by the
number of registrants; the rate is computed separately for each precinct. The intercept of the line in
Figure 12 is 4%, and the slope is 0.52. Plaintiffs would conclude that only 4% of the black registrants
voted for the white candidate, while 4% + 52% = 56% of the white registrants voted for the white
candidate, which demonstrates polarization.
133. For further discussion of ecological regression in this context, see D. James Greiner, Ecological Inference in Voting Rights Act Disputes: Where Are We Now, and Where Do We Want to Be?, 47
Jurimetrics J. 115 (2007); Bernard Grofman & Chandler Davidson, Controversies in Minority Voting: The Voting Rights Act in Perspective (1992); Stephen P. Klein & David A. Freedman, Ecological Regression in Voting Rights Cases, 6 Chance 38 (Summer 1993). The use of ecological regression
increased considerably after the Supreme Court noted in Thornburg v. Gingles, 478 U.S. 30, 53 n.20
(1986), that “[t]he District Court found both methods [extreme case analysis and bivariate ecological
regression analysis] standard in the literature for the analysis of racially polarized voting.” See, e.g.,
Cottier v. City of Martin, 445 F.3d 1113, 1118 (8th Cir. 2006) (ecological regression is one of the
“proven approaches to evaluating elections”); Bruce M. Clarke & Robert Timothy Reagan, Fed.
Judicial Ctr., Redistricting Litigation: An Overview of Legal, Statistical, and Case-Management Issues
(2002); Greiner, supra, at 117, 121. Nevertheless, courts have cautioned against “overreliance on
bivariate ecological regression” in light of the inherent limitations of the technique. Lewis v. Alamance
County, 99 F.3d 600, 604 n.3 (4th Cir. 1996); Johnson v. Hamrick, 296 F.3d 1065, 1080 n.4 (11th
Cir. 2002) (“as a general rule, homogenous precinct analysis may be more reliable than ecological
regression.”). However, there are problems with both methods. See, e.g., Greiner, supra, at 123–39
(arguing that homogeneous precinct analysis is fundamentally flawed and that courts need to be more
discerning in dealing with ecological regression).
Redistricting plans based predominantly on racial considerations are unconstitutional unless
narrowly tailored to meet a compelling state interest. Shaw v. Reno, 509 U.S. 630 (1993). Whether
compliance with the Voting Rights Act can be considered a compelling interest is an open ques-
Figure 12. Turnout rate for the white candidate plotted against the percentage
of registrants who are white. Precinct-level data, 1982 Democratic
Primary for Auditor, Lee County, South Carolina.
Source: Data from James W. Loewen & Bernard Grofman, Recent Developments in Methods Used in Vote
Dilution Litigation, 21 Urb. Law. 589, 591 tbl.1 (1989).
D. Statistical Models
Statistical models are widely used in the social sciences and in litigation. For
example, the census suffers an undercount, more severe in certain places than
others. If some statistical models are to be believed, the undercount can be
corrected—moving seats in Congress and millions of dollars a year in tax funds.134
Other models purport to lift the veil of secrecy from the ballot box, enabling the
experts to determine how minority groups have voted—a crucial step in voting
rights litigation (supra Section V.C). This section discusses the statistical logic of
regression models.
A regression model attempts to combine the values of certain variables (the
independent variables) to get expected values for another variable (the dependent
variable). The model can be expressed in the form of a regression equation. A
simple regression equation has only one independent variable; a multiple regression equation has several independent variables. Coefficients in the equation will
be interpreted as showing the effects of changing the corresponding variables. This
is justified in some situations, as the next example demonstrates.
tion, but efforts to sustain racially motivated redistricting on this basis have not fared well before the
Supreme Court. See Abrams v. Johnson, 521 U.S. 74 (1997); Shaw v. Hunt, 517 U.S. 899 (1996);
Bush v. Vera, 517 U.S. 952 (1996).
134. See Brown et al., supra note 29; supra note 89.
Hooke’s law (named after Robert Hooke, England, 1635–1703) describes
how a spring stretches in response to a load: Strain is proportional to stress. To
verify Hooke’s law experimentally, a physicist will make a number of observations
on a spring. For each observation, the physicist hangs a weight on the spring and
measures its length. A statistician could develop a regression model for these data:
length = a + b × weight + ε.    (1)
The error term, denoted by the Greek letter epsilon ε, is needed because measured
length will not be exactly equal to a + b × weight. If nothing else, measurement
error must be reckoned with. The model takes ε as “random error”—behaving
like draws made at random with replacement from a box of tickets. Each ticket
shows a potential error, which will be realized if that ticket is drawn. The average
of the potential errors in the box is assumed to be zero.
Equation (1) has two parameters, a and b. These constants of nature characterize the behavior of the spring: a is length under no load, and b is elasticity
(the increase in length per unit increase in weight). By way of numerical illustration, suppose a is 400 and b is 0.05. If the weight is 1, the length of the spring is
expected to be
400 + 0.05 = 400.05.
If the weight is 3, the expected length is
400 + 3 × 0.05 = 400 + 0.15 = 400.15.
In either case, the actual length will differ from expected, by a random error ε.
In standard statistical terminology, the ε’s for different observations on the
spring are assumed to be independent and identically distributed, with a mean of
zero. Take the ε’s for the first two observations. Independence means that the
chances for the second ε do not depend on outcomes for the first. If the errors are
like draws made at random with replacement from a box of tickets, as we assumed
earlier, that box will not change from one draw to the next—independence.
“Identically distributed” means that the chance behavior of the two ε’s is the
same: They are drawn at random from the same box. (See infra Appendix for
additional discussion.)
The parameters a and b in equation (1) are not directly observable, but they
can be estimated by the method of least squares.135
135. It might seem that a is observable; after all, we can measure the length of the spring with
no load. However, the measurement is subject to error, so we observe not a, but a + ε. See equation (1). The parameters a and b can be estimated, even estimated very well, but they cannot be
observed directly. The least squares estimates of a and b are the intercept and slope of the regression
Statisticians often denote estimates by hats. Thus, â is the estimate for a, and b̂ is the estimate for b. The values
of â and b̂ are chosen to minimize the sum of the squared prediction errors. These
errors are also called residuals. They measure the difference between the actual
length of the spring and the predicted length, the latter being â + bˆ × weight:
actual length = â + b̂ × weight + residual.    (2)
Of course, no one really imagines there to be a box of tickets hidden in the
spring. However, the variability of physical measurements (under many but by
no means all circumstances) does seem to be remarkably like the variability in
draws from a box.136 In short, the statistical model corresponds rather closely to
the empirical phenomenon.
Equation (1) is a statistical model for the data, with unknown parameters a
and b. The error term ε is not observable. The model is a theory—and a good
one—about how the data are generated. By contrast, equation (2) is a regression
equation that is fitted to the data: The intercept â, the slope b̂, and the residual can
all be computed from the data. The results are useful because â is a good estimate
for a, and b̂ is a good estimate for b. (Similarly, the residual is a good approximation to ε.) Without the theory, these estimates would be less useful. Is there a
theoretical model behind the data processing? Is the model justifiable? These questions can be critical when it comes to making statistical inferences from the data.
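The passage from model (1) to fitted equation (2) can be imitated in a few lines. This is a sketch: the data are simulated from the model with a = 400 and b = 0.05 (the illustrative values in the text) plus small Gaussian measurement error, and â and b̂ are then recovered by least squares.

```python
import random

random.seed(1)

a_true, b_true = 400.0, 0.05          # length under no load; elasticity
weights = [1.0, 2.0, 4.0, 6.0, 8.0, 10.0]

# Model (1): measured length = a + b * weight + epsilon.
lengths = [a_true + b_true * w + random.gauss(0, 0.01) for w in weights]

# Least squares estimates: slope and intercept of the regression line.
n = len(weights)
mw = sum(weights) / n
ml = sum(lengths) / n
b_hat = (sum((w - mw) * (l - ml) for w, l in zip(weights, lengths))
         / sum((w - mw) ** 2 for w in weights))
a_hat = ml - b_hat * mw

# Equation (2): residual = actual length - predicted length.
residuals = [l - (a_hat + b_hat * w) for w, l in zip(weights, lengths)]

print(a_hat, b_hat)  # close to the true parameters 400 and 0.05
```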
In social science applications, statistical models often are invoked without an
independent theoretical basis. We give an example involving salary discrimination
in the Appendix.137 The main ideas of such regression modeling can be captured
in a hypothetical exchange between a plaintiff seeking to prove salary discrimination and a company denying the allegation. Such a dialog might proceed as
follows:
1. Plaintiff argues that the defendant company pays male employees more
than females, which establishes a prima facie case of discrimination.
2. The company responds that the men are paid more because they are better
educated and have more experience.
3. Plaintiff refutes the company’s theory by fitting a regression equation that
includes a particular, presupposed relationship between salary (the dependent variable) and some measures of education and experience. Plaintiff’s
expert reports that even after adjusting for differences in education and
line. See supra Section V.C.1; Freedman et al., supra note 12, at 208–10. The method of least squares
was developed by Adrien-Marie Legendre (France, 1752–1833) and Carl Friedrich Gauss (Germany,
1777–1855) to fit astronomical orbits.
136. This is the Gauss model for measurement error. See Freedman et al., supra note 12, at
450–52.
137. The Reference Guide to Multiple Regression in this manual describes a comparable
example.
experience in this specific manner, men earn more than women. This
remaining difference in pay shows discrimination.
4. The company argues that the difference could be the result of chance, not
discrimination.
5. Plaintiff replies that because the coefficient for gender in the model is
statistically significant, chance is not a good explanation for the data.138
In step 3, the three explanatory variables are education (years of schooling
completed), experience (years with the firm), and a dummy variable for gender
(1 for men and 0 for women). These are supposed to predict salaries (dollars per
year). The equation is a formal analog of Hooke’s law (equation 1). According to
the model, an employee’s salary is determined as if by computing
a + (b × education) + (c × experience) + (d × gender),    (3)
and then adding an error ε drawn at random from a box of tickets.139 The
parameters a, b, c, and d are estimated from the data by the method of least squares.
In step 5, the estimated coefficient d for the dummy variable turns out to be
positive and statistically significant and is offered as evidence of disparate impact.
Men earn more than women, even after adjusting for differences in background
factors that might affect productivity. This showing depends on many assumptions built into the model.140 Hooke’s law—equation (1)—is relatively easy to test
experimentally. For the salary discrimination model, validation would be difficult.
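A minimal version of the model in expression (3) can be fitted to made-up personnel data. This is a sketch: the data, the $3,000 built-in pay gap, and the use of plain least squares via the normal equations are illustrative assumptions, not the method or the facts of any actual case.

```python
import random

random.seed(2)

# Hypothetical employees: education (years), experience (years), and a
# gender dummy (1 = man, 0 = woman).  Salaries are generated with a
# built-in $3,000 gap for the dummy variable.
rows, salaries = [], []
for _ in range(200):
    educ = random.randint(10, 18)
    exper = random.randint(0, 20)
    gender = random.randint(0, 1)
    salary = (20000 + 1500 * educ + 800 * exper
              + 3000 * gender + random.gauss(0, 2000))
    rows.append([1.0, educ, exper, gender])   # leading 1 for the intercept a
    salaries.append(salary)

# Least squares via the normal equations (X'X) beta = X'y.
k = 4
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * s for r, s in zip(rows, salaries)) for i in range(k)]

# Solve the 4-by-4 system by Gaussian elimination with partial pivoting.
for col in range(k):
    pivot = max(range(col, k), key=lambda r: abs(xtx[r][col]))
    xtx[col], xtx[pivot] = xtx[pivot], xtx[col]
    xty[col], xty[pivot] = xty[pivot], xty[col]
    for r in range(col + 1, k):
        f = xtx[r][col] / xtx[col][col]
        for c in range(col, k):
            xtx[r][c] -= f * xtx[col][c]
        xty[r] -= f * xty[col]
beta = [0.0] * k
for r in range(k - 1, -1, -1):
    beta[r] = (xty[r] - sum(xtx[r][c] * beta[c]
                            for c in range(r + 1, k))) / xtx[r][r]

a, b, c, d = beta
print(d)  # estimated coefficient for the gender dummy, near 3000 here
```

Because the data here are generated by the model, the fitted d recovers the built-in gap; the point of the surrounding text is that with real salary data the model's assumptions, unlike Hooke's law, cannot easily be validated.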
When expert testimony relies on statistical models, the court may well inquire,
what are the assumptions behind the model, and why do they apply to the case at
hand? It might then be important to distinguish between two situations:
• The nature of the relationship between the variables is known, and regression is being used to make quantitative estimates of parameters in that relationship, or
• The nature of the relationship is largely unknown, and regression is being used to determine the nature of the relationship—or indeed whether any relationship exists at all.
138. In some cases, the p-value has been interpreted as the probability that defendants are innocent of discrimination. However, as noted earlier, such an interpretation is wrong: p merely represents
the probability of getting a large test statistic, given that the model is correct and the true coefficient
for gender is zero (see supra Section IV.B, infra Appendix, Section D.2). Therefore, even if we grant
the model, a p-value less than 50% does not demonstrate a preponderance of the evidence against the
null hypothesis.
139. Expression (3) is the expected value for salary, given the explanatory variables (education,
experience, gender). The error term is needed to account for deviations from expected: Salaries are not
going to be predicted very well by linear combinations of variables such as education and experience.
140. See infra Appendix.
Regression was developed to handle situations of the first type, with Hooke’s law
being an example. The basis for the second type of application is analogical, and
the tightness of the analogy is an issue worth exploration.
In employment discrimination cases, and other contexts too, a wide variety
of models can be used. This is only to be expected, because the science does not
dictate specific equations. In a strongly contested case, each side will have its own
model, presented by its own expert. The experts will reach opposite conclusions
about discrimination. The dialog might continue with an exchange about which
model is better. Although statistical assumptions are challenged in court from time
to time, arguments more commonly revolve around the choice of variables. One
model may be questioned because it omits variables that should be included—for
example, skill levels or prior evaluations.141 Another model may be challenged
because it includes tainted variables reflecting past discriminatory behavior by
the firm.142 The court must decide which model—if either—fits the occasion.143
The frequency with which regression models are used is no guarantee that
they are the best choice for any particular problem. Indeed, from one perspective,
a regression or other statistical model may seem to be a marvel of mathematical
rigor. From another perspective, the model is a set of assumptions, supported only
by the say-so of the testifying expert. Intermediate judgments are also possible.144
141. E.g., Bazemore v. Friday, 478 U.S. 385 (1986); In re Linerboard Antitrust Litig., 497 F.
Supp. 2d 666 (E.D. Pa. 2007).
142. E.g., McLaurin v. Nat’l R.R. Passenger Corp., 311 F. Supp. 2d 61, 65–66 (D.D.C. 2004)
(holding that the inclusion of two allegedly tainted variables was reasonable in light of an earlier
consent decree).
143. E.g., Chang v. Univ. of R.I., 606 F. Supp. 1161, 1207 (D.R.I. 1985) (“it is plain to the
court that [defendant’s] model comprises a better, more useful, more reliable tool than [plaintiff’s]
counterpart.”); Presseisen v. Swarthmore College, 442 F. Supp. 593, 619 (E.D. Pa. 1977) (“[E]ach
side has done a superior job in challenging the other’s regression analysis, but only a mediocre job in
supporting their own . . . and the Court is . . . left with nothing.”), aff’d, 582 F.2d 1275 (3d Cir. 1978).
144. See, e.g., David W. Peterson, Reference Guide on Multiple Regression, 36 Jurimetrics J. 213,
214–15 (1996) (review essay); see supra note 21 for references to a range of academic opinion. More
recently, some investigators have turned to graphical models. However, these models have serious
weaknesses of their own. See, e.g., David A. Freedman, On Specifying Graphical Models for Causation,
and the Identification Problem, 26 Evaluation Rev. 267 (2004).
Appendix
A. Frequentists and Bayesians
The mathematical theory of probability consists of theorems derived from axioms
and definitions. Mathematical reasoning is seldom controversial, but there may be
disagreement as to how the theory should be applied. For example, statisticians
may differ on the interpretation of data in specific applications. Moreover, there
are two main schools of thought about the foundations of statistics: frequentist
and Bayesian (also called objectivist and subjectivist).145
Frequentists see probabilities as empirical facts. When a fair coin is tossed,
the probability of heads is 1/2; if the experiment is repeated a large number of
times, the coin will land heads about one-half the time. If a fair die is rolled, the
probability of getting an ace (one spot) is 1/6. If the die is rolled many times, an
ace will turn up about one-sixth of the time.146 Generally, if a chance experiment
can be repeated, the relative frequency of an event approaches (in the long run)
its probability. By contrast, a Bayesian considers probabilities as representing not
facts but degrees of belief: In whole or in part, probabilities are subjective.
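The frequentist long-run idea is easy to check by simulation (a sketch; the seed is fixed only to make the run reproducible):

```python
import random

random.seed(3)

# Toss a fair coin 10,000 times and record the relative frequency of heads.
tosses = [random.choice("HT") for _ in range(10000)]
freq_heads = tosses.count("H") / len(tosses)
print(freq_heads)  # close to 1/2 in a long run of tosses
```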
Statisticians of both schools use conditional probability—that is, the probability of one event given that another has occurred. For example, suppose a coin
is tossed twice. One event is that the coin will land HH. Another event is that at
least one H will be seen. Before the coin is tossed, there are four possible, equally
likely, outcomes: HH, HT, TH, TT. So the probability of HH is 1/4. However, if
we know that at least one head has been obtained, then we can rule out two tails
TT. In other words, given that at least one H has been obtained, the conditional
probability of TT is 0, and the first three outcomes have conditional probability
1/3 each. In particular, the conditional probability of HH is 1/3. This is usually
written as P(HH|at least one H) = 1/3. More generally, the probability of an event
C is denoted P(C); the conditional probability of D given C is written as P(D|C).
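The two-toss example can be verified by enumerating the four equally likely outcomes (a sketch):

```python
from itertools import product
from fractions import Fraction

# The four equally likely outcomes of two tosses: HH, HT, TH, TT.
outcomes = ["".join(p) for p in product("HT", repeat=2)]

# Condition on the event that at least one H is seen: TT is ruled out.
at_least_one_h = [o for o in outcomes if "H" in o]

p_hh = Fraction(outcomes.count("HH"), len(outcomes))                       # 1/4
p_hh_given_h = Fraction(at_least_one_h.count("HH"), len(at_least_one_h))  # 1/3
print(p_hh, p_hh_given_h)
```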
Two events C and D are independent if the conditional probability of D
given that C occurs is equal to the conditional probability of D given that C does
not occur. Statisticians use “~C” to denote the event that C does not occur. Thus
C and D are independent if P(D|C) = P(D|~C). If C and D are independent,
then the probability that both occur is equal to the product of the probabilities:
P(C and D) = P(C) × P(D).    (A1)
145. But see supra note 123 (on “objective Bayesianism”).
146. Probabilities may be estimated from relative frequencies, but probability itself is a subtler
idea. For example, suppose a computer prints out a sequence of 10 letters H and T (for heads and
tails), which alternate between the two possibilities H and T as follows: H T H T H T H T H T.
The relative frequency of heads is 5/10 or 50%, but it is not at all obvious that the chance of an H
at the next position is 50%. There are difficulties in both the subjectivist and objectivist positions. See
Freedman, supra note 84.
This is the multiplication rule (or product rule) for independent events. If events
are dependent, then conditional probabilities must be used:
P(C and D) = P(C) × P(D|C).    (A2)
This is the multiplication rule for dependent events.
Bayesian statisticians assign probabilities to hypotheses as well as to events;
indeed, for them, the distinction between hypotheses and events may not be
a sharp one. We turn now to Bayes’ rule. If H0 and H1 are two hypotheses147
that govern the probability of an event A, a Bayesian can use the multiplication
rule (A2) to find that
P(A and H0) = P(A|H0)P(H0)    (A3)
and
P(A and H1) = P(A|H1)P(H1).    (A4)
Moreover,
P(A) = P(A and H0) + P(A and H1).    (A5)
The multiplication rule (A2) also shows that
P(H1|A) = P(A and H1) / P(A).    (A6)
We use (A4) to evaluate P(A and H1) in the numerator of (A6), and (A3), (A4),
and (A5) to evaluate P(A) in the denominator:
P(H1|A) = P(A|H1)P(H1) / [P(A|H0)P(H0) + P(A|H1)P(H1)].    (A7)
This is a special case of Bayes’ rule. It yields the conditional probability of hypothesis H1 given that event A has occurred.
For a stylized example in a criminal case, H0 is the hypothesis that blood
found at the scene of a crime came from a person other than the defendant; H1 is
the hypothesis that the blood came from the defendant; A is the event that blood
from the crime scene and blood from the defendant are both type A. Then P(H0)
is the prior probability of H0, based on subjective judgment, while P(H0|A) is the
posterior probability—updated from the prior using the data.
147. H0 is read “H-sub-zero,” while H1 is “H-sub-one.”
Type A blood occurs in 42% of the population. So P(A|H0) = 0.42.148
Because the defendant has type A blood, P(A|H1) = 1. Suppose the prior probabilities are P(H0) = P(H1) = 0.5. According to (A7), the posterior probability
that the blood is from the defendant is
P(H1|A) = (1 × 0.5) / (0.42 × 0.5 + 1 × 0.5) = 0.70.    (A8)
Thus, the data increase the likelihood that the blood is the defendant’s. The probability went up from the prior value of P(H1) = 0.50 to the posterior value of
P(H1|A) = 0.70.
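Equation (A8) is a one-line computation. The sketch below applies equation (A7) to the blood-type figures in the text:

```python
def posterior(prior_h1, p_a_given_h1, p_a_given_h0):
    """Bayes' rule, equation (A7), for two competing hypotheses."""
    prior_h0 = 1 - prior_h1
    return (p_a_given_h1 * prior_h1) / (
        p_a_given_h0 * prior_h0 + p_a_given_h1 * prior_h1)

# P(A|H0) = 0.42 (population frequency of type A blood),
# P(A|H1) = 1 (the defendant has type A), priors 0.5 each.
p = posterior(prior_h1=0.5, p_a_given_h1=1.0, p_a_given_h0=0.42)
print(round(p, 2))  # 0.70, the posterior in equation (A8)
```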
More generally, H0 and H1 refer to parameters in a statistical model. For a stylized example in an employment discrimination case, H0 asserts equal selection rates
in a population of male and female applicants; H1 asserts that the selection rates are
not equal; A is the event that a test statistic exceeds 2 in absolute value. In such situations, the Bayesian proceeds much as before. However, the frequentist computes
P(A|H0), and rejects H0 if this probability falls below 5%. Frequentists have to stop
there, because they view P(H0|A) as poorly defined at best. In their setup, P(H0)
and P(H1) rarely make sense, and these prior probabilities are needed to compute
P(H1|A): See supra equation (A7).
Assessing probabilities, conditional probabilities, and independence is not
entirely straightforward, either for frequentists or Bayesians. Inquiry into the basis
for expert judgment may be useful, and casual assumptions about independence
should be questioned.149
B. The Spock Jury: Technical Details
The rest of this Appendix provides some technical backup for the examples in Sections IV and V, supra. We begin with the Spock jury case. On the null hypothesis,
a sample of 350 people was drawn at random from a large population that was
50% male and 50% female. The number of women in the sample follows the
binomial distribution. For example, the chance of getting exactly 102 women in
the sample is given by the binomial formula150
n!/(j! × (n − j)!) × f^j × (1 − f)^(n − j).   (A9)
148. Not all statisticians would accept the identification of a population frequency with P(A|H0).
Indeed, H0 has been translated into a hypothesis that the true donor has been selected from the population at random (i.e., in a manner that is uncorrelated with blood type). This step needs justification.
See supra note 123.
149. For problematic assumptions of independence in litigation, see, e.g., Wilson v. State, 803
A.2d 1034 (Md. 2002) (error to admit multiplied probabilities in a case involving two deaths of infants
in same family); 1 McCormick, supra note 2, § 210; see also supra note 29 (on census litigation).
150. The binomial formula is discussed in, e.g., Freedman et al., supra note 12, at 255–61.
In the formula, n stands for the sample size, and so n = 350; and j = 102. The
f is the fraction of women in the population; thus, f = 0.50. The exclamation
point denotes factorials: 1! = 1, 2! = 2 × 1 = 2, 3! = 3 × 2 × 1 = 6, and so forth.
The chance of 102 women works out to 10^−15. In the same way, we can compute the chance of getting 101 women, or 100, or any other particular number.
The chance of getting 102 women or fewer is then computed by addition. The
chance is p = 2 × 10^−15, as reported supra note 98. This is very bad news for the
null hypothesis.
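For readers who want to reproduce the calculation, formula (A9) and the tail probability can be computed directly. This is a sketch using only the Python standard library; `binom_prob` is our own helper name.

```python
from math import comb

n, f = 350, 0.5  # sample size and population fraction of women

def binom_prob(j):
    """Chance of exactly j women in the sample, per formula (A9)."""
    return comb(n, j) * f**j * (1 - f)**(n - j)

p_exactly_102 = binom_prob(102)  # on the order of 10^-15, per the text
# Chance of 102 women or fewer, computed by addition:
p_102_or_fewer = sum(binom_prob(j) for j in range(103))
print(p_exactly_102, p_102_or_fewer)
```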
With the binomial distribution given by (A9), the expected number of
women in the sample is

n × f = 350 × 0.5 = 175.   (A10)

The standard error is

√(n × f × (1 − f)) = √(350 × 0.5 × 0.5) = 9.35.   (A11)
The observed value of 102 is nearly 8 SEs below the expected value, which is a
lot of SEs.
Figure 13 shows the probability histogram for the number of women in the
sample.151 The graph is drawn so that the area between two values is proportional
to the chance that the number of women will fall in that range. For example, take
the rectangle over 175; its base covers the interval from 174.5 to 175.5. The area
of this rectangle is 4.26% of the total area. So the chance of getting exactly 175
women is 4.26%. Next, take the range from 165 to 185 (inclusive): 73.84% of the
area falls into this range. This means there is a 73.84% chance that the number of
women in the sample will be in the range from 165 to 185 (inclusive).
According to a fundamental theorem in statistics (the central limit theorem),
the histogram follows the normal curve.152 Figure 13 shows the curve for comparison: The normal curve is almost indistinguishable from the top of the histogram. For a numerical example, suppose the jury panel had included 155 women.
On the null hypothesis, there is about a 1.85% chance of getting 155 women or
fewer. The normal curve gives 1.86%. The error is nil. Ordinarily, we would just
report p = 2%, as in the text (supra Section IV.B.1).
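The comparison between the exact binomial chance and the normal-curve approximation can be sketched as follows (standard library only; the continuity correction at 155.5 follows the note to Figure 13):

```python
from math import comb, erfc, sqrt

n, f = 350, 0.5
mean = n * f                   # 175, per (A10)
se = sqrt(n * f * (1 - f))     # about 9.35, per (A11)

# Exact chance of 155 women or fewer, from the binomial distribution.
exact = sum(comb(n, j) * f**j * (1 - f)**(n - j) for j in range(156))

# Normal approximation: area under the normal curve to the left of 155.5.
z = (155.5 - mean) / se
approx = erfc(-z / sqrt(2)) / 2

print(exact, approx)  # about 1.85% and 1.86%, as in the text
```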
Finally, we consider power. Suppose we reject the null hypothesis when the
number of women in the sample is 155 or less. Let us assume a particular alternative hypothesis that quantifies the degree of discrimination against women: The
jury panel is selected at random from a population that is 40% female, rather than
50%. Figure 14 shows the probability histogram for the number of women, but
now the histogram is computed according to the alternative hypothesis. Again,
151. Probability histograms are discussed in, e.g., id. at 310–13.
152. The central limit theorem is discussed in, e.g., id. at 315–27.
Figure 13. Probability histogram for the number of women in a random sample
of 350 people drawn from a large population that is 50% female and
50% male. The normal curve is shown for comparison. About 2% of
the area under the histogram is to the left of 155 (marked by a heavy
vertical line).
[Histogram omitted. Horizontal axis: Number of Women, 150 to 200; vertical axis: 0 to 50 percent per standard unit.]
Note: The vertical line is placed at 155.5, and so the area to the left of it includes the rectangles over
155, 154, . . . ; the area represents the chance of getting 155 women or fewer. Cf. Freedman et al.,
supra note 12, at 317. The units on the vertical axis are “percent per standard unit”; cf. id. at 80, 315.
Figure 14. Probability histogram for the number of women in a random sample
of 350 people drawn from a large population that is 40% female and
60% male. The normal curve is shown for comparison. The area to
the left of 155 (marked by a heavy vertical line) is about 95%.
the histogram follows the normal curve. About 95% of the area is to the left of
155, and so power is about 95%. The area can be computed exactly by using the
binomial distribution, or to an excellent approximation using the normal curve.
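As a sketch, the exact power computation described here can be carried out with the binomial distribution under the alternative hypothesis (Python standard library; the variable names are ours):

```python
from math import comb

n = 350
f_alt = 0.4   # alternative hypothesis: the population is 40% female

# Power = chance, under the alternative, of 155 women or fewer,
# i.e., the chance of rejecting the null hypothesis.
power = sum(
    comb(n, j) * f_alt**j * (1 - f_alt)**(n - j) for j in range(156)
)
print(power)  # about 0.95, as in the text
```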
Figures 13 and 14 have the same shape: The central limit theorem is at work.
However, the histograms are centered differently. Figure 13 is centered at 175,
according to requirements of the null hypothesis. Figure 14 is centered at 140,
because the alternative hypothesis is used to determine the center, not the null
hypothesis. Thus, 155 is well to the left of center in Figure 13, and well to the
right in Figure 14: The figures have different centers. The main point of Figures 13
and 14 is that chances can often be approximated by areas under the normal curve,
justifying the large-sample theory presented supra Sections IV.A–B.
C. The Nixon Papers: Technical Details
With the Nixon papers, the population consists of 20,000 boxes. A random sample
of 500 boxes is drawn and each sample box is appraised. Statistical theory enables
us to make some precise statements about the behavior of the sample average.
• The expected value of the sample average equals the population average. Even more tersely, the sample average is an unbiased estimate of the
population average.
• The standard error for the sample average equals

√((N − n)/(N − 1)) × σ/√n.   (A12)
In (A12), the N stands for the size of the population, which is 20,000; and n stands
for the size of the sample, which is 500. The first factor in (A12), with the square
root, is the finite sample correction factor. Here, as in many other such examples,
the correction factor is so close to 1 that it can safely be ignored. (This is why the
size of the population usually has no bearing on the precision of the sample average as
an estimator for the population average.) Next, σ is the population standard deviation. This is unknown, but it can be estimated by the sample standard deviation,
which is $2200. The SE for the sample mean is therefore estimated from the data as
$2200/√500, which is nearly $100. Plaintiff’s total claim is 20,000 times the sample average. The SE for the total claim is therefore 20,000 × $100 = $2,000,000.
(Here, the size of the population comes into the formula.)
With a large sample, the probability histogram for the sample average follows
the normal curve quite closely. That is a consequence of the central limit theorem.
The center of the histogram is the population average. The SE is given by (A12),
and is about $100.
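The arithmetic behind (A12) and the SE for the total claim can be sketched as follows (standard library only; the numbers are from the text):

```python
from math import sqrt

N = 20_000   # population size (boxes)
n = 500      # sample size
sd = 2200    # sample standard deviation, estimating sigma (dollars)

fpc = sqrt((N - n) / (N - 1))   # finite sample correction factor
se_avg = fpc * sd / sqrt(n)     # SE for the sample average
se_total = N * se_avg           # SE for plaintiff's total claim

print(fpc)       # close to 1, so it can safely be ignored
print(se_avg)    # nearly $100
print(se_total)  # about $2,000,000
```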
• What is the chance that the sample average differs from the population average by 1 SE or less? This chance is equal to the area under the
probability histogram within 1 SE of average, which by the central limit
theorem is almost equal to the area under the standard normal curve
between –1 and 1; that normal area is about 68%.
• What is the chance that the sample average differs from the population
average by 2 SE or less? By the same reasoning, this chance is about equal
to the area under the standard normal curve between –2 and 2, which is
about 95%.
• What is the chance that the sample average differs from the population
average by 3 SE or less? This chance is about equal to the area under the
standard normal curve between –3 and 3, which is about 99.7%.
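The three areas in the bullets above can be read off the standard normal curve via the error function (a sketch; `normal_area` is our own helper name):

```python
from math import erf, sqrt

def normal_area(k):
    """Area under the standard normal curve between -k and +k."""
    return erf(k / sqrt(2))

print(normal_area(1))  # about 0.68
print(normal_area(2))  # about 0.95
print(normal_area(3))  # about 0.997
```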
To sum up, the probability histogram for the sample average is centered at
the population average. The spread is given by the standard error. The histogram
follows the normal curve. That is why confidence levels can be based on the standard error, with confidence levels read off the normal curve—for estimators that
are essentially unbiased, and obey the central limit theorem (supra Section IV.A.2,
Appendix Section B).153 These large-sample methods generally work for sums,
averages, and rates, although much depends on the design of the sample.
More technically, the normal curve is the density of a normal distribution.
The standard normal density has mean equal to 0 and standard error equal to 1.
Its equation is

y = e^(−x²/2) / √(2π),
where e = 2.71828. . . and π = 3.14159. . . . This density can be rescaled to have
any desired mean and standard error. The resulting densities are the famous
“normal curves” or “bell-shaped curves” of statistical theory. In Figure 13, the
density is scaled to match the probability histogram in terms of the mean and
standard error; likewise in Figure 14.
D. A Social Science Example of Regression: Gender Discrimination
in Salaries
1. The regression model
To illustrate social science applications of the kind that might be seen in litigation,
Section V referred to a stylized example on salary discrimination. A particular
153. See, e.g., id. at 409–24. On the standard deviation, see supra Section III.E; Freedman et al.,
supra note 12, at 67–72. The finite sample correction factor is discussed in id. at 367–70.
regression model was used to predict salaries (dollars per year) of employees in a
firm. It had three explanatory variables: education (years of schooling completed),
experience (years with the firm), and a dummy variable for gender (1 for men and
0 for women). The regression equation is
salary = a + b × education + c × experience + d × gender + ε.
(A13)
Equation (A13) is a statistical model for the data, with unknown parameters a, b, c,
and d. Here, a is the intercept and the other parameters are regression coefficients.
The ε at the end of the equation is an unobservable error term. In the right-hand
side of (A13) and similar expressions, by convention, the multiplications are done
before the additions.
As noted in Section V, the equation is a formal analog of Hooke’s law (1).
According to the model, an employee’s salary is determined as if by computing
a + b × education + c × experience + d × gender
(A14)
and then adding an error ε drawn at random from a box of tickets. Expression (A14) is the expected value for salary, given the explanatory variables (education, experience, gender). The error term is needed to account for deviations from
expected: Salaries are not going to be predicted very well by linear combinations
of variables such as education and experience.
The parameters are estimated from the data using least squares. If the estimated coefficient for the dummy variable turns out to be positive and statistically
significant, that would be evidence of disparate impact. Men earn more than
women, even after adjusting for differences in background factors that might affect
productivity. Suppose the estimated equation turns out as follows:
predicted salary = $7100 + $1300 × education + $2200
× experience + $700 × gender.
(A15)
According to (A15), the estimated value for the intercept a in (A14) is $7100; the
estimated value for the coefficient b is $1300, and so forth. According to equation
(A15), every extra year of education is worth $1300. Similarly, every extra year
of experience is worth $2200. And, most important, the company gives men a
salary premium of $700 over women with the same education and experience.
A male employee with 12 years of education (high school) and 10 years of
experience, for example, would have a predicted salary of
$7100 + $1300 × 12 + $2200 × 10 + $700 × 1
= $7100 + $15,600 + $22,000 + $700 = $45,400.   (A16)

A similarly situated female employee has a predicted salary of only

$7100 + $1300 × 12 + $2200 × 10 + $700 × 0
= $7100 + $15,600 + $22,000 + $0 = $44,700.
(A17)
Notice the impact of the gender variable in the model: $700 is added to equation
(A16), but not to equation (A17).
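Equations (A16) and (A17) amount to plugging values into the estimated equation (A15); as a sketch (the function name is ours):

```python
def predicted_salary(education, experience, gender):
    """Predicted salary under the estimated equation (A15).
    gender is the dummy variable: 1 for men, 0 for women."""
    return 7100 + 1300 * education + 2200 * experience + 700 * gender

male = predicted_salary(education=12, experience=10, gender=1)
female = predicted_salary(education=12, experience=10, gender=0)
print(male)           # 45400, per (A16)
print(female)         # 44700, per (A17)
print(male - female)  # 700, the estimated gender coefficient
```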
A major step in proving discrimination is showing that the estimated
coefficient of the gender variable—$700 in the numerical illustration—is statistically significant. This showing depends on the assumptions built into the model.
Thus, each extra year of education is assumed to be worth the same across all levels
of experience. Similarly, each extra year of experience is worth the same across all
levels of education. Furthermore, the premium paid to men does not depend systematically on education or experience. Omitted variables such as ability, quality
of education, or quality of experience do not make any systematic difference to
the predictions of the model.154 These are all assumptions made going into the
analysis, rather than conclusions coming out of the data.
Assumptions are also made about the error term—the mysterious ε at the end
of (A13). The errors are assumed to be independent and identically distributed
from person to person in the dataset. Such assumptions are critical when computing p-values and demonstrating statistical significance. Regression modeling that
does not produce statistically significant coefficients will not be good evidence
of discrimination, and statistical significance cannot be established unless stylized
assumptions are made about unobservable error terms.
The typical regression model, like the one sketched above, therefore involves a
host of assumptions. As noted in Section V, Hooke’s law—equation (1)—is relatively
easy to test experimentally. For the salary discrimination model—equation (A13)—
validation would be difficult. That is why we suggested that when expert testimony
relies on statistical models, the court may well inquire about the assumptions behind
the model and why they apply to the case at hand.
2. Standard errors, t-statistics, and statistical significance
Statistical proof of discrimination depends on the significance of the estimated
coefficient for the gender variable. Significance is determined by the t-test, using
the standard error. The standard error measures the likely difference between
the estimated value for the coefficient and its true value. The estimated value is
$700—the coefficient of the gender variable in equation (A15); the true value, d
in (A13), remains unknown. According to the model, the difference between
the estimated value and the true value is due to the action of the error term ε in
(A13). Without ε, observed values would line up perfectly with expected values,
154. Technically, these omitted variables are assumed to be independent of the error term in
the equation.
and estimated values for parameters would be exactly equal to true values. This
does not happen.
The t-statistic is the estimated value divided by its standard error. For example, in (A15), the estimate for d is $700. If the standard error is $325, then t is
$700/$325 = 2.15. This is significant—that is, hard to explain as the product
of random error. Under the null hypothesis that d is zero, there is only about a
5% chance that the absolute value of t is greater than 2. (We are assuming the
sample is large.) Thus, statistical significance is achieved (supra Section IV.B.2).
Significance would be taken as evidence that d—the true parameter in the model
(A13)—does not vanish. According to a viewpoint often presented in the social
science journals and the courtroom, here is statistical proof that gender matters
in determining salaries. On the other hand, if the standard error is $1400, then t
is $700/$1400 = 0.5. The difference between the estimated value of d and zero
could easily result from chance. So the true value of d could well be zero, in which
case gender does not affect salaries.
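The two scenarios can be sketched numerically; the large-sample two-sided p-value comes from the normal curve via the complementary error function (the helper name is ours):

```python
from math import erfc, sqrt

def two_sided_p(estimate, se):
    """t-statistic and large-sample two-sided p-value for the null d = 0."""
    t = estimate / se
    return t, erfc(abs(t) / sqrt(2))

t1, p1 = two_sided_p(700, 325)    # t about 2.15: statistically significant
t2, p2 = two_sided_p(700, 1400)   # t = 0.5: easily explained by chance
print(t1, p1)
print(t2, p2)
```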
Of course, the parameter d is only a construct in a model. If the model is
wrong, the standard error, t-statistic, and significance level are rather difficult to
interpret. Even if the model is granted, there is a further issue. The 5% is the
chance that the absolute value of t exceeds 2, given the model and given the null
hypothesis that d is zero. However, the 5% is often taken to be the chance of the
null hypothesis given the data. This misinterpretation is commonplace in the social
science literature, and it appears in some opinions describing expert testimony.155
For a frequentist statistician, the chance that d is zero given the data makes no
sense: Parameters do not exhibit chance variation. For a Bayesian statistician, the
chance that d is zero given the data makes good sense, but the computation via
the t-test could be seriously in error, because the prior probability that d is zero
has not been taken into account.156
The mathematical terminology in the previous paragraph may need to be
deciphered: The “absolute value” of t is the magnitude, ignoring sign. Thus, the
absolute value of both +3 and −3 is 3.
155. See supra Section IV.B & notes 102 & 116.
156. See supra Section IV & supra Appendix.
Glossary of Terms
The following definitions are adapted from a variety of sources, including Michael
O. Finkelstein & Bruce Levin, Statistics for Lawyers (2d ed. 2001), and David A.
Freedman et al., Statistics (4th ed. 2007).
absolute value. Size, neglecting sign. The absolute value of +2.7 is 2.7; so is the
absolute value of −2.7.
adjust for. See control for.
alpha (α). A symbol often used to denote the probability of a Type I error. See
Type I error; size. Compare beta.
alternative hypothesis. A statistical hypothesis that is contrasted with the null
hypothesis in a significance test. See statistical hypothesis; significance test.
area sample. A probability sample in which the sampling frame is a list of geographical areas. That is, the researchers make a list of areas, choose some at
random, and interview people in the selected areas. This is a cost-effective
way to draw a sample of people. See probability sample; sampling frame.
arithmetic mean. See mean.
average. See mean.
Bayes’ rule. In its simplest form, an equation involving conditional probabilities
that relates a “prior probability” known or estimated before collecting certain data to a “posterior probability” that reflects the impact of the data on
the prior probability. In Bayesian statistical inference, “the prior” expresses
degrees of belief about various hypotheses. Data are collected according to
some statistical model; at least, the model represents the investigator’s beliefs.
Bayes’ rule combines the prior with the data to yield the posterior probability,
which expresses the investigator’s beliefs about the parameters, given the data.
See Appendix A. Compare frequentist.
beta (β). A symbol sometimes used to denote power, and sometimes to denote
the probability of a Type II error. See Type II error; power. Compare alpha.
between-observer variability. Differences that occur when two observers
measure the same thing. Compare within-observer variability.
bias. Also called systematic error. A systematic tendency for an estimate to be
too high or too low. An estimate is unbiased if the bias is zero. (Bias does not
mean prejudice, partiality, or discriminatory intent.) See nonsampling error.
Compare sampling error.
bin. A class interval in a histogram. See class interval; histogram.
binary variable. A variable that has only two possible values (e.g., gender).
Called a dummy variable when the two possible values are 0 and 1.
binomial distribution. A distribution for the number of occurrences in repeated,
independent “trials” where the probabilities are fixed. For example, the number of heads in 100 tosses of a coin follows a binomial distribution. When
the probability is not too close to 0 or 1 and the number of trials is large, the
binomial distribution has about the same shape as the normal distribution. See
normal distribution; Poisson distribution.
blind. See double-blind experiment.
bootstrap. Also called resampling; Monte Carlo method. A procedure for estimating sampling error by constructing a simulated population on the basis of
the sample, then repeatedly drawing samples from the simulated population.
categorical data; categorical variable. See qualitative variable. Compare quantitative variable.
central limit theorem. Shows that under suitable conditions, the probability
histogram for a sum (or average or rate) will follow the normal curve. See
histogram; normal curve.
chance error. See random error; sampling error.
chi-squared (χ2). The chi-squared statistic measures the distance between the
data and expected values computed from a statistical model. If the chi-squared
statistic is too large to explain by chance, the data contradict the model. The
definition of “large” depends on the context. See statistical hypothesis; significance test.
class interval. Also, bin. The base of a rectangle in a histogram; the area of
the rectangle shows the percentage of observations in the class interval. See
histogram.
cluster sample. A type of random sample. For example, investigators might take
households at random, then interview all people in the selected households.
This is a cluster sample of people: A cluster consists of all the people in a
selected household. Generally, clustering reduces the cost of interviewing.
See multistage cluster sample.
coefficient of determination. A statistic (more commonly known as R-squared)
that describes how well a regression equation fits the data. See R-squared.
coefficient of variation. A statistic that measures spread relative to the mean:
SD/mean, or SE/expected value. See expected value; mean; standard deviation; standard error.
collinearity. See multicollinearity.
conditional probability. The probability that one event will occur given that
another has occurred.
confidence coefficient. See confidence interval.
confidence interval. An estimate, expressed as a range, for a parameter. For
estimates such as averages or rates computed from large samples, a 95% confidence interval is the range from about two standard errors below to two
standard errors above the estimate. Intervals obtained this way cover the true
value about 95% of the time, and 95% is the confidence level or the confidence coefficient. See central limit theorem; standard error.
confidence level. See confidence interval.
confounding variable; confounder. A confounder is correlated with the independent variable and the dependent variable. An association between the
dependent and independent variables in an observational study may not be
causal, but may instead be due to confounding. See controlled experiment;
observational study.
consistent estimator. An estimator that tends to become more and more accurate as the sample size grows. Inconsistent estimators, which do not become
more accurate as the sample gets larger, are frowned upon by statisticians.
content validity. The extent to which a skills test is appropriate to its intended
purpose, as evidenced by a set of questions that adequately reflect the domain
being tested. See validity. Compare reliability.
continuous variable. A variable that has arbitrarily fine gradations, such as a
person’s height. Compare discrete variable.
control for. Statisticians may control for the effects of confounding variables in
nonexperimental data by making comparisons for smaller and more homogeneous groups of subjects, or by entering the confounders as explanatory
variables in a regression model. To “adjust for” is perhaps a better phrase
in the regression context, because in an observational study the confounding factors are not under experimental control; statistical adjustments are an
imperfect substitute. See regression model.
control group. See controlled experiment.
controlled experiment. An experiment in which the investigators determine
which subjects are put into the treatment group and which are put into the
control group. Subjects in the treatment group are exposed by the investigators to some influence—the treatment; those in the control group are not so
exposed. For example, in an experiment to evaluate a new drug, subjects in
the treatment group are given the drug, and subjects in the control group are
given some other therapy; the outcomes in the two groups are compared to
see whether the new drug works.
Randomization—that is, randomly assigning subjects to each group—is
usually the best way to ensure that any observed difference between the two
groups comes from the treatment rather than from preexisting differences. Of
course, in many situations, a randomized controlled experiment is impractical,
and investigators must then rely on observational studies. Compare observational study.
convenience sample. A nonrandom sample of units, also called a grab sample.
Such samples are easy to take but may suffer from serious bias. Typically, mall
samples are convenience samples.
correlation coefficient. A number between –1 and 1 that indicates the extent of
the linear association between two variables. Often, the correlation coefficient
is abbreviated as r.
covariance. A quantity that describes the statistical interrelationship of two variables. Compare correlation coefficient; standard error; variance.
covariate. A variable that is related to other variables of primary interest in a
study; a measured confounder; a statistical control in a regression equation.
criterion. The variable against which an examination or other selection procedure is validated. See validity.
data. Observations or measurements, usually of units in a sample taken from a
larger population.
degrees of freedom. See t-test.
dependence. Two events are dependent when the probability of one is affected
by the occurrence or non-occurrence of the other. Compare independence;
dependent variable.
dependent variable. Also called outcome variable. Compare independent variable.
descriptive statistics. Like the mean or standard deviation, used to summarize
data.
differential validity. Differences in validity across different groups of subjects.
See validity.
discrete variable. A variable that has only a small number of possible values,
such as the number of automobiles owned by a household. Compare continuous variable.
distribution. See frequency distribution; probability distribution; sampling
distribution.
disturbance term. A synonym for error term.
double-blind experiment. An experiment with human subjects in which
neither the diagnosticians nor the subjects know who is in the treatment
group or the control group. This is accomplished by giving a placebo treatment to patients in the control group. In a single-blind experiment, the
patients do not know whether they are in treatment or control; the diagnosticians have this information.
dummy variable. Generally, a dummy variable takes only the values 0 or 1,
and distinguishes one group of interest from another. See binary variable;
regression model.
econometrics. Statistical study of economic issues.
epidemiology. Statistical study of disease or injury in human populations.
error term. The part of a statistical model that describes random error, i.e., the
impact of chance factors unrelated to variables in the model. In econometrics,
the error term is called a disturbance term.
estimator. A sample statistic used to estimate the value of a population parameter.
For example, the sample average commonly is used to estimate the population
average. The term “estimator” connotes a statistical procedure, whereas an
“estimate” connotes a particular numerical result.
expected value. See random variable.
experiment. See controlled experiment; randomized controlled experiment.
Compare observational study.
explanatory variable. See independent variable; regression model.
external validity. See validity.
factors. See independent variable.
Fisher’s exact test. A statistical test for comparing two sample proportions. For
example, take the proportions of white and black employees getting a promotion. An investigator may wish to test the null hypothesis that promotion does
not depend on race. Fisher’s exact test is one way to arrive at a p-value. The
calculation is based on the hypergeometric distribution. For details, see Michael
O. Finkelstein and Bruce Levin, Statistics for Lawyers 154–56 (2d ed. 2001).
See hypergeometric distribution; p-value; significance test; statistical hypothesis.
fitted value. See residual.
fixed significance level. Also alpha; size. A preset level, such as 5% or 1%; if
the p-value of a test falls below this level, the result is deemed statistically significant. See significance test. Compare observed significance level; p-value.
frequency; relative frequency. Frequency is the number of times that something occurs; relative frequency is the number of occurrences, relative to a
total. For example, if a coin is tossed 1000 times and lands heads 517 times,
the frequency of heads is 517; the relative frequency is 0.517, or 51.7%.
frequency distribution. Shows how often specified values occur in a dataset.
frequentist. Also called objectivist. Describes statisticians who view probabilities
as objective properties of a system that can be measured or estimated. Compare Bayesian. See Appendix.
Gaussian distribution. A synonym for the normal distribution. See normal
distribution.
general linear model. Expresses the dependent variable as a linear combination
of the independent variables plus an error term whose components may be
dependent and have differing variances. See error term; linear combination;
variance. Compare regression model.
grab sample. See convenience sample.
287
Copyright © National Academy of Sciences. All rights reserved.
Reference Manual on Scientific Evidence: Third Edition
Reference Manual on Scientific Evidence
heteroscedastic. See scatter diagram.
highly significant. See p-value; practical significance; significance test.
histogram. A plot showing how observed values fall within specified intervals,
called bins or class intervals. Generally, matters are arranged so that the area
under the histogram, but over a class interval, gives the frequency or relative frequency of data in that interval. With a probability histogram, the area
gives the chance of observing a value that falls in the corresponding interval.
homoscedastic. See scatter diagram.
hypergeometric distribution. Suppose a sample is drawn at random, without
replacement, from a finite population. How many times will items of a certain
type come into the sample? The hypergeometric distribution gives the probabilities. For more details, see 1 William Feller, An Introduction to Probability
Theory and Its Applications 41–42 (2d ed. 1957). Compare Fisher’s exact test.
hypothesis. See alternative hypothesis; null hypothesis; one-sided hypothesis;
significance test; statistical hypothesis; two-sided hypothesis.
hypothesis test. See significance test.
identically distributed. Random variables are identically distributed when they
have the same probability distribution. For example, consider a box of numbered tickets. Draw tickets at random with replacement from the box. The
draws will be independent and identically distributed.
independence. Also, statistical independence. Events are independent when
the probability of one is unaffected by the occurrence or non-occurrence
of the other. Compare conditional probability; dependence; independent
variable; dependent variable.
independent variable. Independent variables (also called explanatory variables,
predictors, or risk factors) represent the causes and potential confounders in
a statistical study of causation; the dependent variable represents the effect.
In an observational study, independent variables may be used to divide the
population up into smaller and more homogeneous groups (“stratification”).
In a regression model, the independent variables are used to predict the
dependent variable. For example, the unemployment rate has been used as the independent variable in a model for predicting the crime rate; in that model, the crime rate is the dependent variable. The distinction between independent and
dependent variables is unrelated to statistical independence. See regression
model. Compare dependent variable; dependence; independence.
indicator variable. See dummy variable.
internal validity. See validity.
interquartile range. Difference between 25th and 75th percentile. See percentile.
Reference Guide on Statistics
interval estimate. A confidence interval, or an estimate coupled with a standard
error. See confidence interval; standard error. Compare point estimate.
least squares. See least squares estimator; regression model.
least squares estimator. An estimator that is computed by minimizing the sum
of the squared residuals. See residual.
level. The level of a significance test is denoted alpha (α). See alpha; fixed significance level; observed significance level; p-value; significance test.
linear combination. To obtain a linear combination of two variables, multiply
the first variable by some constant, multiply the second variable by another
constant, and add the two products. For example, 2u + 3v is a linear combination of u and v.
list sample. See systematic sample.
loss function. Statisticians may evaluate estimators according to a mathematical
formula involving the errors—that is, differences between actual values and
estimated values. The “loss” may be the total of the squared errors, or the
total of the absolute errors, etc. Loss functions seldom quantify real losses, but
may be useful summary statistics and may prompt the construction of useful
statistical procedures. Compare risk.
lurking variable. See confounding variable.
mean. Also, the average; the expected value of a random variable. The mean
gives a way to find the center of a batch of numbers: Add the numbers and
divide by how many there are. Weights may be employed, as in “weighted
mean” or “weighted average.” See random variable. Compare median; mode.
measurement validity. See validity. Compare reliability.
median. The median, like the mean, is a way to find the center of a batch of
numbers. The median is the 50th percentile. Half the numbers are larger,
and half are smaller. (To be very precise: at least half the numbers are greater
than or equal to the median; at least half the numbers are less than or equal
to the median; for small datasets, the median may not be uniquely defined.)
Compare mean; mode; percentile.
meta-analysis. Attempts to combine information from all studies on a certain
topic. For example, in the epidemiological context, a meta-analysis may
attempt to provide a summary odds ratio and confidence interval for the effect
of a certain exposure on a certain disease.
mode. The most common value. Compare mean; median.
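The three measures of center can be computed with Python’s standard statistics module; the batch of numbers below is hypothetical.

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # hypothetical batch of numbers

print(statistics.mean(data))    # add and divide by the count: 30/6 = 5
print(statistics.median(data))  # average of the two middle values: (3+5)/2 = 4
print(statistics.mode(data))    # the most common value: 3
```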
model. See probability model; regression model; statistical model.
multicollinearity. Also, collinearity. The existence of correlations among the
independent variables in a regression model. See independent variable; regression model.
multiple comparison. Making several statistical tests on the same dataset.
Multiple comparisons complicate the interpretation of a p-value. For example,
if 20 divisions of a company are examined, and one division is found to have
a disparity significant at the 5% level, the result is not surprising; indeed, it
would be expected under the null hypothesis. Compare p-value; significance
test; statistical hypothesis.
multiple correlation coefficient. A number that indicates the extent to which
one variable can be predicted as a linear combination of other variables.
Its magnitude is the square root of R-squared. See linear combination;
R-squared; regression model. Compare correlation coefficient.
multiple regression. A regression equation that includes two or more independent variables. See regression model. Compare simple regression.
multistage cluster sample. A probability sample drawn in stages, usually after
stratification; the last stage will involve drawing a cluster. See cluster sample;
probability sample; stratified random sample.
multivariate methods. Methods for fitting models with multiple variables; in
statistics, multiple response variables; in other fields, multiple explanatory
variables. See regression model.
natural experiment. An observational study in which treatment and control
groups have been formed by some natural development; the assignment of
subjects to groups is akin to randomization. See observational study. Compare
controlled experiment.
nonresponse bias. Systematic error created by differences between respondents
and nonrespondents. If the nonresponse rate is high, this bias may be severe.
nonsampling error. A catch-all term for sources of error in a survey, other
than sampling error. Nonsampling errors cause bias. One example is selection
bias: The sample is drawn in a way that tends to exclude certain subgroups in
the population. A second example is nonresponse bias: People who do not
respond to a survey are usually different from respondents. A final example:
Response bias arises, for example, if the interviewer uses a loaded question.
normal distribution. Also, Gaussian distribution. When the normal distribution
has mean equal to 0 and standard error equal to 1, it is said to be “standard
normal.” The equation for the density is then
y = e^(−x²/2) / √(2π)
where e = 2.71828. . . and π = 3.14159. . . . The density can be rescaled to
have any desired mean and standard error, resulting in the famous “bell-shaped curves” of statistical theory. Terminology notwithstanding, there need
be nothing wrong with a distribution that differs from normal.
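The standard normal density is easy to evaluate directly; a short Python check of the formula:

```python
import math

def standard_normal_density(x):
    # y = e^(-x^2/2) / sqrt(2*pi)
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

print(round(standard_normal_density(0), 4))  # peak of the bell curve: 0.3989
print(round(standard_normal_density(1), 4))  # one SE from the center: 0.242
```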
null hypothesis. For example, a hypothesis that there is no difference between
two groups from which samples are drawn. See significance test; statistical
hypothesis. Compare alternative hypothesis.
objectivist. See frequentist.
observational study. A study in which subjects select themselves into groups;
investigators then compare the outcomes for the different groups. For example, studies of smoking are generally observational. Subjects decide whether
or not to smoke; the investigators compare the death rate for smokers to the
death rate for nonsmokers. In an observational study, the groups may differ
in important ways that the investigators do not notice; controlled experiments minimize this problem. The critical distinction is that in a controlled
experiment, the investigators intervene to manipulate the circumstances of
the subjects; in an observational study, the investigators are passive observers.
(Of course, running a good observational study is hard work, and may be
quite useful.) Compare confounding variable; controlled experiment.
observed significance level. A synonym for p-value. See significance test.
Compare fixed significance level.
odds. The probability that an event will occur divided by the probability that it
will not. For example, if the chance of rain tomorrow is 2/3, then the odds
on rain are (2/3)/(1/3) = 2/1, or 2 to 1; the odds against rain are 1 to 2.
odds ratio. A measure of association, often used in epidemiology. For example, if
10% of all people exposed to a chemical develop a disease, compared with 5%
of people who are not exposed, then the odds of the disease in the exposed
group are 10/90 = 1/9, compared with 5/95 = 1/19 in the unexposed group.
The odds ratio is (1/9)/(1/19) = 19/9 = 2.1. An odds ratio of 1 indicates no
association. Compare relative risk.
one-sided hypothesis; one-tailed hypothesis. Excludes the possibility that
a parameter could be, for example, less than the value asserted in the null
hypothesis. A one-sided hypothesis leads to a one-sided (or one-tailed) test.
See significance test; statistical hypothesis; compare two-sided hypothesis.
one-sided test; one-tailed test. See one-sided hypothesis.
outcome variable. See dependent variable.
outlier. An observation that is far removed from the bulk of the data. Outliers
may indicate faulty measurements and they may exert undue influence on
summary statistics, such as the mean or the correlation coefficient.
p-value. Result from a statistical test. The probability of getting, just by chance,
a test statistic as large as or larger than the observed value. Large p-values
are consistent with the null hypothesis; small p-values undermine the null
hypothesis. However, p does not give the probability that the null hypothesis
is true. If p is smaller than 5%, the result is statistically significant. If p is smaller
than 1%, the result is highly significant. The p-value is also called the observed
significance level. See significance test; statistical hypothesis.
parameter. A numerical characteristic of a population or a model. See probability model.
percentile. To get the percentiles of a dataset, array the data from the smallest
value to the largest. Take the 90th percentile by way of example: 90% of the
values fall below the 90th percentile, and 10% are above. (To be very precise:
At least 90% of the data are at the 90th percentile or below; at least 10% of the
data are at the 90th percentile or above.) The 50th percentile is the median:
50% of the values fall below the median, and 50% are above. On the LSAT,
a score of 152 places a test taker at the 50th percentile; a score of 164 is at
the 90th percentile; a score of 172 is at the 99th percentile. Compare mean;
median; quartile.
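One common convention that matches the glossary’s “at least” definition can be sketched in Python; the scores below are hypothetical, not actual LSAT data.

```python
import math

def percentile(data, p):
    """Smallest value v such that at least p percent of the data are
    at or below v (one common convention; as the glossary notes, the
    percentile need not be uniquely defined for small datasets)."""
    xs = sorted(data)
    k = math.ceil(p / 100 * len(xs))  # how many values must lie at or below v
    return xs[max(k, 1) - 1]

scores = [120, 130, 140, 150, 152, 155, 160, 164, 170, 172]  # hypothetical
print(percentile(scores, 50))  # 152: at least half the scores are at or below
print(percentile(scores, 90))  # 170: at least 90% are at or below
```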
placebo. See double-blind experiment.
point estimate. An estimate of the value of a quantity expressed as a single number. See estimator. Compare confidence interval; interval estimate.
Poisson distribution. A limiting case of the binomial distribution, when the
number of trials is large and the common probability is small. The parameter
of the approximating Poisson distribution is the number of trials times the
common probability, which is the expected number of events. When this
number is large, the Poisson distribution may be approximated by a normal
distribution.
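The approximation is easy to see numerically; the values of n and p below are hypothetical, chosen so that n is large and p is small.

```python
from math import comb, exp, factorial

n, p = 1000, 0.003   # many trials, small common probability
lam = n * p          # parameter of the approximating Poisson distribution: 3

def binom_pmf(k):
    # exact binomial probability of k events in n trials
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k):
    # Poisson approximation with parameter lam
    return exp(-lam) * lam**k / factorial(k)

for k in range(5):
    print(k, round(binom_pmf(k), 4), round(poisson_pmf(k), 4))
```

For these values the two columns agree to about three decimal places.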
population. Also, universe. All the units of interest to the researcher. Compare
sample; sampling frame.
population size. Also, size of population. Number of units in the population.
posterior probability. See Bayes’ rule.
power. The probability that a statistical test will reject the null hypothesis. To
compute power, one has to fix the size of the test and specify parameter values
outside the range given by the null hypothesis. A powerful test has a good
chance of detecting an effect when there is an effect to be detected. See beta;
significance test. Compare alpha; size; p-value.
practical significance. Substantive importance. Statistical significance does not
necessarily establish practical significance. With large samples, small differences can be statistically significant. See significance test.
practice effects. Changes in test scores that result from taking the same test
twice in succession, or taking two similar tests one after the other.
predicted value. See residual.
predictive validity. A skills test has predictive validity to the extent that test
scores are well correlated with later performance, or more generally with
outcomes that the test is intended to predict. See validity. Compare reliability.
predictor. See independent variable.
prior probability. See Bayes’ rule.
probability. Chance, on a scale from 0 to 1. Impossibility is represented by 0,
certainty by 1. Equivalently, chances may be quoted in percent; 100% corresponds to 1, 5% corresponds to .05, and so forth.
probability density. Describes the probability distribution of a random variable.
The chance that the random variable falls in an interval equals the area below
the density and above the interval. (However, not all random variables have
densities.) See probability distribution; random variable.
probability distribution. Gives probabilities for possible values or ranges of
values of a random variable. Often, the distribution is described in terms of a
density. See probability density.
probability histogram. See histogram.
probability model. Relates probabilities of outcomes to parameters; also, statistical model. The latter connotes unknown parameters.
probability sample. A sample drawn from a sampling frame by some objective
chance mechanism; each unit has a known probability of being sampled. Such
samples minimize selection bias, but can be expensive to draw.
psychometrics. The study of psychological measurement and testing.
qualitative variable; quantitative variable. Describes qualitative features of
subjects in a study (e.g., marital status—never-married, married, widowed,
divorced, separated). A quantitative variable describes numerical features
of the subjects (e.g., height, weight, income). This is not a hard-and-fast
distinction, because qualitative features may be given numerical codes, as
with a dummy variable. Quantitative variables may be classified as discrete
or continuous. Concepts such as the mean and the standard deviation apply
only to quantitative variables. Compare continuous variable; discrete variable;
dummy variable. See variable.
quartile. The 25th or 75th percentile. See percentile. Compare median.
R-squared (R2). Measures how well a regression equation fits the data. R-squared
varies between 0 (no fit) and 1 (perfect fit). R-squared does not measure the
extent to which underlying assumptions are justified. See regression model.
Compare multiple correlation coefficient; standard error of regression.
random error. Sources of error that are random in their effect, like draws made
at random from a box. These are reflected in the error term of a statistical
model. Some authors refer to random error as chance error or sampling error.
See regression model.
random variable. A variable whose possible values occur according to some
probability mechanism. For example, if a pair of dice are thrown, the total
number of spots is a random variable. The chance of two spots is 1/36, the
chance of three spots is 2/36, and so forth; the most likely number is 7, with
chance 6/36.
The expected value of a random variable is the weighted average of
the possible values; the weights are the probabilities. In our example, the
expected value is
(1/36) × 2 + (2/36) × 3 + (3/36) × 4 + (4/36) × 5 + (5/36) × 6 + (6/36) × 7
+ (5/36) × 8 + (4/36) × 9 + (3/36) × 10 + (2/36) × 11 + (1/36) × 12 = 7
In many problems, the weighted average is computed with respect to the
density; then sums must be replaced by integrals. The expected value need
not be a possible value for the random variable.
Generally, a random variable will be somewhere around its expected value,
but will be off (in either direction) by something like a standard error (SE)
or so. If the random variable has a more or less normal distribution, there is
about a 68% chance for it to fall in the range expected value – SE to expected
value + SE. See normal curve; standard error.
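The dice calculation can be checked by enumerating the probabilities in Python; exact fractions avoid any rounding.

```python
from fractions import Fraction

# Chance of each total when a pair of dice is thrown: 1/36 for 2 and 12,
# rising to 6/36 for 7
chances = {total: Fraction(6 - abs(total - 7), 36) for total in range(2, 13)}

# Expected value: the weighted average of the possible values,
# the weights being the probabilities
expected = sum(total * prob for total, prob in chances.items())
print(expected)  # 7
```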
randomization. See controlled experiment; randomized controlled experiment.
randomized controlled experiment. A controlled experiment in which subjects are placed into the treatment and control groups at random—as if by a
lottery. See controlled experiment. Compare observational study.
range. The difference between the biggest and the smallest values in a batch of
numbers.
rate. In an epidemiological study, the number of events, divided by the size of
the population; often cross-classified by age and gender. For example, the
death rate from heart disease among American men ages 55–64 in 2004 was
about three per thousand. Among men ages 65–74, the rate was about seven
per thousand. Among women, the rate was about half that for men. Rates
adjust for differences in sizes of populations or subpopulations. Often, rates
are computed per unit of time, e.g., per thousand persons per year. Data
source: Statistical Abstract of the United States tbl. 115 (2008).
regression coefficient. The coefficient of a variable in a regression equation.
See regression model.
regression diagnostics. Procedures intended to check whether the assumptions
of a regression model are appropriate.
regression equation. See regression model.
regression line. The graph of a (simple) regression equation.
regression model. A regression model attempts to combine the values of certain
variables (the independent or explanatory variables) in order to get expected
values for another variable (the dependent variable). Sometimes, the phrase
“regression model” refers to a probability model for the data; if no qualifications are made, the model will generally be linear, and errors will be assumed
independent across observations, with common variance. The coefficients in
the linear combination are called regression coefficients; these are parameters.
At times, “regression model” refers to an equation (“the regression equation”)
estimated from data, typically by least squares.
For example, in a regression study of salary differences between men and
women in a firm, the analyst may include a dummy variable for gender,
as well as statistical controls such as education and experience to adjust for
productivity differences between men and women. The dummy variable
would be defined as 1 for the men and 0 for the women. Salary would be
the dependent variable; education, experience, and the dummy would be the
independent variables. See least squares; multiple regression; random error;
variance. Compare general linear model.
relative frequency. See frequency.
relative risk. A measure of association used in epidemiology. For example, if
10% of all people exposed to a chemical develop a disease, compared to 5%
of people who are not exposed, then the disease occurs twice as frequently
among the exposed people: The relative risk is 10%/5% = 2. A relative risk of
1 indicates no association. For more details, see Leon Gordis, Epidemiology
(4th ed. 2008). Compare odds ratio.
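Using the rates from the glossary’s examples (10% of exposed people develop the disease, versus 5% of unexposed people), a few lines of Python reproduce both measures of association:

```python
p_exposed, p_unexposed = 0.10, 0.05  # disease rates from the glossary examples

relative_risk = p_exposed / p_unexposed
odds_ratio = (p_exposed / (1 - p_exposed)) / (p_unexposed / (1 - p_unexposed))

print(relative_risk)         # 2.0: disease occurs twice as frequently
print(round(odds_ratio, 1))  # 2.1: (1/9)/(1/19) = 19/9
```

When the disease is rare, the odds ratio is close to the relative risk, as here.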
reliability. The extent to which a measurement process gives the same results on
repeated measurement of the same thing. Compare validity.
representative sample. Not a well-defined technical term. A sample judged to
fairly represent the population, or a sample drawn by a process likely to give
samples that fairly represent the population, for example, a large probability
sample.
resampling. See bootstrap.
residual. The difference between an actual and a predicted value. The predicted
value comes typically from a regression equation, and is better called the fitted value, because there is no real prediction going on. See regression model;
independent variable.
response variable. See dependent variable.
risk. Expected loss. “Expected” means on average, over the various datasets that
could be generated by the statistical model under examination. Usually, risk
cannot be computed exactly but has to be estimated, because the parameters
in the statistical model are unknown and must be estimated. See loss function; random variable.
risk factor. See independent variable.
robust. A statistic or procedure that does not change much when data or assumptions are modified slightly.
sample. A set of units collected for study. Compare population.
sample size. Also, size of sample. The number of units in a sample.
sample weights. See stratified random sample.
sampling distribution. The distribution of the values of a statistic, over all possible samples from a population. For example, suppose a random sample is
drawn. Some values of the sample mean are more likely; others are less likely.
The sampling distribution specifies the chance that the sample mean will fall
in one interval rather than another.
sampling error. A sample is part of a population. When a sample is used to
estimate a numerical characteristic of the population, the estimate is likely to
differ from the population value because the sample is not a perfect microcosm of the whole. If the estimate is unbiased, the difference between the
estimate and the exact value is sampling error. More generally,
estimate = true value + bias + sampling error
Sampling error is also called chance error or random error. See standard error.
Compare bias; nonsampling error.
sampling frame. A list of units designed to represent the entire population as
completely as possible. The sample is drawn from the frame.
sampling interval. See systematic sample.
scatter diagram. Also, scatterplot; scattergram. A graph showing the relationship between two variables in a study. Each dot represents one subject. One
variable is plotted along the horizontal axis, the other variable is plotted along
the vertical axis. A scatter diagram is homoscedastic when the spread is more
or less the same inside any vertical strip. If the spread changes from one strip
to another, the diagram is heteroscedastic.
selection bias. Systematic error due to nonrandom selection of subjects for
study.
sensitivity. In clinical medicine, the probability that a test for a disease will give
a positive result given that the patient has the disease. Sensitivity is analogous
to the power of a statistical test. Compare specificity.
sensitivity analysis. Analyzing data in different ways to see how results depend
on methods or assumptions.
sign test. A statistical test based on counting and the binomial distribution. For
example, a Finnish study of twins found 22 monozygotic twin pairs where
1 twin smoked, 1 did not, and at least 1 of the twins had died. That sets up
a race to death. In 17 cases, the smoker died first; in 5 cases, the nonsmoker
died first. The null hypothesis is that smoking does not affect time to death,
so the chances are 50-50 for the smoker to die first. On the null hypothesis,
the chance that the smoker will win the race 17 or more times out of 22 is
8/1000. That is the p-value. The p-value can be computed from the binomial
distribution. For additional detail, see Michael O. Finkelstein & Bruce Levin,
Statistics for Lawyers 339–41 (2d ed. 2001); David A. Freedman et al.,
Statistics 262–63 (4th ed. 2007).
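The p-value in the twin example follows directly from the binomial distribution, as a short Python calculation shows:

```python
from math import comb

# Null hypothesis: smoking does not affect time to death, so each of the
# 22 twin pairs is a 50-50 "coin toss" for which twin dies first.
n = 22
# Chance that the smoker dies first in 17 or more of the 22 pairs
p_value = sum(comb(n, k) for k in range(17, n + 1)) / 2**n
print(round(p_value, 3))  # 0.008, i.e., about 8/1000
```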
significance level. See fixed significance level; p-value.
significance test. Also, statistical test; hypothesis test; test of significance. A significance test involves formulating a statistical hypothesis and a test statistic, computing a p-value, and comparing p to some preestablished value (α) to decide
if the test statistic is significant. The idea is to see whether the data conform
to the predictions of the null hypothesis. Generally, a large test statistic goes
with a small p-value; and small p-values would undermine the null hypothesis.
For example, suppose that a random sample of male and female employees
were given a skills test and the mean scores of the men and women were
different—in the sample. To judge whether the difference is due to sampling
error, a statistician might consider the implications of competing hypotheses
about the difference in the population. The null hypothesis would say that
on average, in the population, men and women have the same scores: The
difference observed in the data is then just due to sampling error. A one-sided
alternative hypothesis would be that on average, in the population, men score
higher than women. The one-sided test would reject the null hypothesis if
the sample men score substantially higher than the women—so much so that
the difference is hard to explain on the basis of sampling error.
In contrast, the null hypothesis could be tested against the two-sided
alternative that on average, in the population, men score differently than
women—higher or lower. The corresponding two-sided test would reject the
null hypothesis if the sample men score substantially higher or substantially
lower than the women.
The one-sided and two-sided tests would both be based on the same
data, and use the same t-statistic. However, if the men in the sample score
higher than the women, the one-sided test would give a p-value only half as
large as the two-sided test; that is, the one-sided test would appear to give
stronger evidence against the null hypothesis. (“One-sided” and “one-tailed”
are synonymous; so are “two-sided” and “two-tailed.”) See p-value; statistical
hypothesis; t-statistic.
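The halving of the p-value can be illustrated numerically. For simplicity the sketch below uses a normal (z) approximation to the test statistic rather than the exact t distribution; the value 1.96 is a hypothetical test statistic, with the sample men scoring higher.

```python
from math import erfc, sqrt

def one_sided_p(z):
    # P(Z >= z) for a standard normal test statistic
    return erfc(z / sqrt(2)) / 2

def two_sided_p(z):
    # P(|Z| >= z): exactly twice the one-sided p-value
    return erfc(z / sqrt(2))

z = 1.96  # hypothetical test statistic
print(round(one_sided_p(z), 3))  # about 0.025
print(round(two_sided_p(z), 3))  # about 0.05
```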
significant. See p-value; practical significance; significance test.
simple random sample. A random sample in which each unit in the sampling
frame has the same chance of being sampled. The investigators take a unit at
random (as if by lottery), set it aside, take another at random from what is
left, and so forth.
simple regression. A regression equation that includes only one independent
variable. Compare multiple regression.
size. A synonym for alpha (α).
skip factor. See systematic sample.
specificity. In clinical medicine, the probability that a test for a disease will give
a negative result given that the patient does not have the disease. Specificity
is analogous to 1 – α, where α is the significance level of a statistical test.
Compare sensitivity.
spurious correlation. When two variables are correlated, one is not necessarily
the cause of the other. The vocabulary and shoe size of children in elementary
school, for example, are correlated—but learning more words will not make
the feet grow. Such noncausal correlations are said to be spurious. (Originally,
the term seems to have been applied to the correlation between two rates with
the same denominator: Even if the numerators are unrelated, the common
denominator will create some association.) Compare confounding variable.
standard deviation (SD). Indicates how far a typical element deviates from the
average. For example, in round numbers, the average height of women age
18 and over in the United States is 5 feet 4 inches. However, few women
are exactly average; most will deviate from average, at least by a little. The
SD is sort of an average deviation from average. For the height distribution,
the SD is 3 inches. The height of a typical woman is around 5 feet 4 inches,
but is off that average value by something like 3 inches.
For distributions that follow the normal curve, about 68% of the elements
are in the range from 1 SD below the average to 1 SD above the average.
Thus, about 68% of women have heights in the range 5 feet 1 inch to 5 feet
7 inches. Deviations from the average that exceed 3 or 4 SDs are extremely
unusual. Many authors use standard deviation to also mean standard error.
See standard error.
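The SD and the 68% rule of thumb can be illustrated with Python’s standard statistics module; the small batch of heights below is hypothetical, clustered around 64 inches.

```python
import statistics

heights = [61, 62, 64, 64, 65, 68]  # hypothetical heights in inches

mean = statistics.mean(heights)     # 64
sd = statistics.pstdev(heights)     # SD of the data themselves

print(mean)
print(round(sd, 2))  # 2.24: a typical height is off the average by about this much
# For normal-shaped batches, about 68% fall within 1 SD of the average:
within_1_sd = [h for h in heights if mean - sd <= h <= mean + sd]
print(len(within_1_sd), "of", len(heights))
```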
standard error (SE). Indicates the likely size of the sampling error in an estimate. Many authors use the term standard deviation instead of standard error.
Compare expected value; standard deviation.
standard error of regression. Indicates how actual values differ (in some average sense) from the fitted values in a regression model. See regression model;
residual. Compare R-squared.
standard normal. See normal distribution.
standardization. See standardized variable.
standardized variable. Transformed to have mean zero and variance one. This
involves two steps: (1) subtract the mean; (2) divide by the standard deviation.
statistic. A number that summarizes data. A statistic refers to a sample; a parameter
or a true value refers to a population or a probability model.
statistical controls. Procedures that try to filter out the effects of confounding variables on non-experimental data, for example, by adjusting through statistical procedures such as multiple regression; the term also refers to the adjustment variables in a multiple regression equation. See multiple regression; confounding variable; observational study.
Compare controlled experiment.
statistical dependence. See dependence.
statistical hypothesis. Generally, a statement about parameters in a probability
model for the data. The null hypothesis may assert that certain parameters have
specified values or fall in specified ranges; the alternative hypothesis would
specify other values or ranges. The null hypothesis is tested against the data with
a test statistic; the null hypothesis may be rejected if there is a statistically significant difference between the data and the predictions of the null hypothesis.
Typically, the investigator seeks to demonstrate the alternative hypothesis;
the null hypothesis would explain the findings as a result of mere chance,
and the investigator uses a significance test to rule out that possibility. See
significance test.
statistical independence. See independence.
statistical model. See probability model.
statistical test. See significance test.
statistical significance. See p-value.
stratified random sample. A type of probability sample. The researcher divides
the population into relatively homogeneous groups called “strata,” and draws
a random sample separately from each stratum. Dividing the population into
strata is called “stratification.” Often the sampling fraction will vary from
stratum to stratum. Then sampling weights should be used to extrapolate
from the sample to the population. For example, if 1 unit in 10 is sampled
from stratum A while 1 unit in 100 is sampled from stratum B, then each unit
drawn from A counts as 10, and each unit drawn from B counts as 100. The
first kind of unit has weight 10; the second has weight 100. See Freedman et
al., Statistics 401 (4th ed. 2007).
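The extrapolation in the example above can be sketched in Python (the measurements are hypothetical; only the sampling fractions come from the example):

```python
# Extrapolating from a stratified sample to a population total.
# Stratum A: 1 unit in 10 sampled, so each sampled unit has weight 10.
# Stratum B: 1 unit in 100 sampled, so each sampled unit has weight 100.
sample_a = [12, 15, 9]   # hypothetical measurements from stratum A
sample_b = [40, 55]      # hypothetical measurements from stratum B

weight_a, weight_b = 10, 100
estimated_total = weight_a * sum(sample_a) + weight_b * sum(sample_b)
# Each unit drawn from A "counts as 10"; each unit from B "counts as 100."
```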
stratification. See independent variable; stratified random sample.
study validity. See validity.
subjectivist. See Bayesian.
systematic error. See bias.
systematic sample. Also, list sample. The elements of the population are numbered consecutively as 1, 2, 3, . . . . The investigators choose a starting point
and a “sampling interval” or “skip factor” k. Then, every kth element is
selected into the sample. If the starting point is 1 and k = 10, for example, the
sample would consist of items 1, 11, 21, . . . . Sometimes the starting point
is chosen at random from 1 to k: this is a random-start systematic sample.
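The selection rule can be sketched in Python (a minimal illustration; the function name is ours):

```python
import random

def systematic_sample(n, k, random_start=False):
    """1-based indices of a systematic (list) sample from a population
    of size n: every kth element, starting at 1 or, for a random-start
    systematic sample, at a random point in 1..k."""
    start = random.randint(1, k) if random_start else 1
    return list(range(start, n + 1, k))

# With start 1 and k = 10, the sample is items 1, 11, 21, ...
```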
t-statistic. A test statistic, used to make the t-test. The t-statistic indicates how
far away an estimate is from its expected value, relative to the standard error.
The expected value is computed using the null hypothesis that is being tested.
Some authors refer to the t-statistic, others to the z-statistic, especially when
the sample is large. With a large sample, a t-statistic larger than 2 or 3 in absolute value makes the null hypothesis rather implausible—the estimate is too
many standard errors away from its expected value. See statistical hypothesis;
significance test; t-test.
t-test. A statistical test based on the t-statistic. Large t-statistics are beyond the
usual range of sampling error. For example, if t is bigger than 2, or smaller
than –2, then the estimate is statistically significant at the 5% level; such values
of t are hard to explain on the basis of sampling error. The scale for t-statistics
is tied to areas under the normal curve. For example, a t-statistic of 1.5 is not
very striking, because 13% = 13/100 of the area under the normal curve is
outside the range from –1.5 to 1.5. On the other hand, t = 3 is remarkable:
Only 3/1000 of the area lies outside the range from –3 to 3. This discussion is
predicated on having a reasonably large sample; in that context, many authors
refer to the z-test rather than the t-test.
Consider testing the null hypothesis that the average of a population equals
a given value; the population is known to be normal. For small samples, the
t-statistic follows Student’s t-distribution (when the null hypothesis holds)
rather than the normal curve; larger values of t are required to achieve significance. The relevant t-distribution depends on the number of degrees of
freedom, which in this context equals the sample size minus one. A t-test is
not appropriate for small samples drawn from a population that is not normal.
See p-value; significance test; statistical hypothesis.
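The one-sample t-statistic described above can be computed as follows (a sketch using hypothetical data and only the standard library; the function is not from any statistics package):

```python
import math

def one_sample_t(data, mu0):
    """t-statistic for testing the null hypothesis that the population
    mean equals mu0: (sample mean - mu0) / SE, where SE = s / sqrt(n)
    and s is the sample SD (with n - 1 in the denominator)."""
    n = len(data)
    mean = sum(data) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return (mean - mu0) / (s / math.sqrt(n))

# Hypothetical small sample; null hypothesis: population mean is 50.
t = one_sample_t([52.1, 49.8, 53.0, 51.2, 50.6, 52.4], 50.0)
# Here t is about 3.1 with 6 - 1 = 5 degrees of freedom: the estimate is
# roughly three standard errors above the value asserted by the null.
```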
test statistic. A statistic used to judge whether data conform to the null hypothesis. The parameters of a probability model determine expected values for the
data; differences between expected values and observed values are measured
by a test statistic. Such test statistics include the chi-squared statistic (χ2) and
the t-statistic. Generally, small values of the test statistic are consistent with
the null hypothesis; large values lead to rejection. See p-value; statistical
hypothesis; t-statistic.
time series. A series of data collected over time, for example, the Gross National
Product of the United States from 1945 to 2005.
treatment group. See controlled experiment.
two-sided hypothesis; two-tailed hypothesis. An alternative hypothesis
asserting that the values of a parameter are different from—either greater than
or less than—the value asserted in the null hypothesis. A two-sided alternative hypothesis suggests a two-sided (or two-tailed) test. See significance test;
statistical hypothesis. Compare one-sided hypothesis.
two-sided test; two-tailed test. See two-sided hypothesis.
Type I error. A statistical test makes a Type I error when (1) the null hypothesis
is true and (2) the test rejects the null hypothesis, i.e., there is a false positive.
For example, a study of two groups may show some difference between
samples from each group, even when there is no difference in the population.
When a statistical test deems the difference to be significant in this situation,
it makes a Type I error. See significance test; statistical hypothesis. Compare
alpha; Type II error.
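A small simulation can illustrate the definition: when the null hypothesis is true, a test at the 5% level commits a Type I error about 5% of the time (a sketch; the sample size, the 1.96 cutoff, and the number of trials are our choices):

```python
import random

random.seed(12345)  # arbitrary seed, for reproducibility

# Simulate many studies in which the null hypothesis is TRUE: each study
# draws n observations with true mean 0 (SD 1), then runs a two-sided
# test of the null hypothesis "mean = 0" at the 5% level.
n, trials, rejections = 100, 2000, 0
for _ in range(trials):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    z = (sum(sample) / n) / (1.0 / n ** 0.5)  # SE of the mean = 1/sqrt(n)
    if abs(z) > 1.96:       # reject the (true) null hypothesis...
        rejections += 1     # ...a Type I error

type_i_rate = rejections / trials
# The long-run rate of Type I errors should be close to 0.05 (alpha).
```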
Type II error. A statistical test makes a Type II error when (1) the null hypothesis is false and (2) the test fails to reject the null hypothesis, i.e., there is a
false negative. For example, there may not be a significant difference between
samples from two groups when, in fact, the groups are different. See significance test; statistical hypothesis. Compare beta; Type I error.
unbiased estimator. An estimator that is correct on average, over the possible datasets. The estimates have no systematic tendency to be high or low.
Compare bias.
uniform distribution. For example, a whole number picked at random from 1
to 100 has the uniform distribution: All values are equally likely. Similarly, a
uniform distribution is obtained by picking a real number at random between
0.75 and 3.25: The chance of landing in an interval is proportional to the
length of the interval.
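The proportionality rule can be sketched in a few lines of Python (the endpoints come from the example above; the function name is ours):

```python
def interval_chance(low, high, a, b):
    """Chance that a uniform draw on [low, high] lands in [a, b],
    assuming [a, b] lies inside [low, high]."""
    return (b - a) / (high - low)

# A draw between 0.75 and 3.25 lands in [1.0, 2.0] with chance
# 1.0 / 2.5 = 0.4: proportional to the length of the interval.
chance = interval_chance(0.75, 3.25, 1.0, 2.0)
```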
validity. Measurement validity is the extent to which an instrument measures
what it is supposed to, rather than something else. The validity of a standardized test is often indicated by the correlation coefficient between the test
scores and some outcome measure (the criterion variable). See content validity; differential validity; predictive validity. Compare reliability.
Study validity is the extent to which results from a study can be relied
upon. Study validity has two aspects, internal and external. A study has high
internal validity when its conclusions hold under the particular circumstances
of the study. A study has high external validity when its results are generalizable. For example, a well-executed randomized controlled double-blind
experiment performed on an unusual study population will have high internal
validity because the design is good; but its external validity will be debatable
because the study population is unusual.
Validity is used also in its ordinary sense: assumptions are valid when they
hold true for the situation at hand.
variable. A property of units in a study, which varies from one unit to another,
for example, in a study of households, household income; in a study of
people, employment status (employed, unemployed, not in labor force).
variance. The square of the standard deviation. Compare standard error; covariance.
weights. See stratified random sample.
within-observer variability. Differences that occur when an observer measures
the same thing twice, or measures two things that are virtually the same.
Compare between-observer variability.
z-statistic. See t-statistic.
z-test. See t-test.
References on Statistics
General Surveys
David Freedman et al., Statistics (4th ed. 2007).
Darrell Huff, How to Lie with Statistics (1993).
Gregory A. Kimble, How to Use (and Misuse) Statistics (1978).
David S. Moore & William I. Notz, Statistics: Concepts and Controversies (2005).
Michael Oakes, Statistical Inference: A Commentary for the Social and Behavioral
Sciences (1986).
Statistics: A Guide to the Unknown (Roxy Peck et al. eds., 4th ed. 2005).
Hans Zeisel, Say It with Figures (6th ed. 1985).
Reference Works for Lawyers and Judges
David C. Baldus & James W.L. Cole, Statistical Proof of Discrimination (1980
& Supp. 1987) (continued as Ramona L. Paetzold & Steven L. Willborn,
The Statistics of Discrimination: Using Statistical Evidence in Discrimination
Cases (1994)) (updated annually).
David W. Barnes & John M. Conley, Statistical Evidence in Litigation: Methodology, Procedure, and Practice (1986 & Supp. 1989).
James Brooks, A Lawyer’s Guide to Probability and Statistics (1990).
Michael O. Finkelstein & Bruce Levin, Statistics for Lawyers (2d ed. 2001).
Modern Scientific Evidence: The Law and Science of Expert Testimony (David
L. Faigman et al. eds., Volumes 1 and 2, 2d ed. 2002) (updated annually).
David H. Kaye et al., The New Wigmore: A Treatise on Evidence: Expert Evidence § 12 (2d ed. 2011) (updated annually).
National Research Council, The Evolving Role of Statistical Assessments as Evidence in the Courts (Stephen E. Fienberg ed., 1989).
Statistical Methods in Discrimination Litigation (David H. Kaye & Mikel Aickin
eds., 1986).
Hans Zeisel & David Kaye, Prove It with Figures: Empirical Methods in Law and
Litigation (1997).
General Reference
Encyclopedia of Statistical Sciences (Samuel Kotz et al. eds., 2d ed. 2005).
Reference Guide on
Multiple Regression
DANIEL L. RUBINFELD
Daniel L. Rubinfeld, Ph.D., is Robert L. Bridges Professor of Law and Professor of Economics
Emeritus, University of California, Berkeley, and Visiting Professor of Law at New York
University Law School.
CONTENTS
I. Introduction and Overview, 305
II. Research Design: Model Specification, 311
A. What Is the Specific Question That Is Under Investigation by the
Expert? 311
B. What Model Should Be Used to Evaluate the Question at Issue? 311
1. Choosing the dependent variable, 312
2. Choosing the explanatory variable that is relevant to the
question at issue, 313
3. Choosing the additional explanatory variables, 313
4. Choosing the functional form of the multiple regression
model, 316
5. Choosing multiple regression as a method of analysis, 317
III. Interpreting Multiple Regression Results, 318
A. What Is the Practical, as Opposed to the Statistical, Significance of
Regression Results? 318
1. When should statistical tests be used? 319
2. What is the appropriate level of statistical significance? 320
3. Should statistical tests be one-tailed or two-tailed? 321
B. Are the Regression Results Robust? 322
1. What evidence exists that the explanatory variable causes
changes in the dependent variable? 322
2. To what extent are the explanatory variables correlated with
each other? 324
3. To what extent are individual errors in the regression model
independent? 325
4. To what extent are the regression results sensitive to individual
data points? 326
5. To what extent are the data subject to measurement error? 327
IV. The Expert, 328
A. Who Should Be Qualified as an Expert? 328
B. Should the Court Appoint a Neutral Expert? 329
V. Presentation of Statistical Evidence, 330
A. What Disagreements Exist Regarding Data on Which the Analysis Is
Based? 330
B. Which Database Information and Analytical Procedures Will Aid in
Resolving Disputes over Statistical Studies? 331
Appendix: The Basics of Multiple Regression, 333
A. Introduction, 333
B. Linear Regression Model, 336
1. Specifying the regression model, 337
2. Regression line, 337
C. Interpreting Regression Results, 339
D. Determining the Precision of the Regression Results, 340
1. Standard errors of the coefficients and t-statistics, 340
2. Goodness-of-fit, 344
3. Sensitivity of least squares regression results, 345
E. Reading Multiple Regression Computer Output, 346
F. Forecasting, 348
G. A Hypothetical Example, 350
Glossary of Terms, 352
References on Multiple Regression, 357
I. Introduction and Overview
Multiple regression analysis is a statistical tool used to understand the relationship
between or among two or more variables.1 Multiple regression involves a variable
to be explained—called the dependent variable—and additional explanatory variables that are thought to produce or be associated with changes in the dependent
variable.2 For example, a multiple regression analysis might estimate the effect of
the number of years of work on salary. Salary would be the dependent variable to
be explained; the years of experience would be the explanatory variable.
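The salary example can be sketched as an ordinary least squares fit of salary on years of experience (the data are hypothetical; a multiple regression would add further explanatory variables such as education):

```python
# Least squares fit of salary on years of work experience.
years  = [1, 3, 4, 6, 8, 10]        # hypothetical employees
salary = [41, 46, 48, 56, 60, 67]   # hypothetical salaries, in $1000s

n = len(years)
mean_x = sum(years) / n
mean_y = sum(salary) / n

# Slope = covariance(x, y) / variance(x); the fitted line passes
# through the point of averages (mean_x, mean_y).
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(years, salary))
         / sum((x - mean_x) ** 2 for x in years))
intercept = mean_y - slope * mean_x
# Fitted model: predicted salary = intercept + slope * years.
```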
Multiple regression analysis is sometimes well suited to the analysis of data
about competing theories for which there are several possible explanations for the
relationships among a number of explanatory variables.3 Multiple regression typically uses a single dependent variable and several explanatory variables to assess the
statistical data pertinent to these theories. In a case alleging sex discrimination in
salaries, for example, a multiple regression analysis would examine not only sex,
but also other explanatory variables of interest, such as education and experience.4
The employer-defendant might use multiple regression to argue that salary is a
function of the employee’s education and experience, and the employee-plaintiff
might argue that salary is also a function of the individual’s sex. Alternatively,
in an antitrust cartel damages case, the plaintiff’s expert might utilize multiple
regression to evaluate the extent to which the price of a product increased during the period in which the cartel was effective, after accounting for costs and
other variables unrelated to the cartel. The defendant’s expert might use multiple
1. A variable is anything that can take on two or more values (e.g., the daily temperature in
Chicago or the salaries of workers at a factory).
2. Explanatory variables in the context of a statistical study are sometimes called independent
variables. See David H. Kaye & David A. Freedman, Reference Guide on Statistics, Section II.A.1,
in this manual. The guide also offers a brief discussion of multiple regression analysis. Id., Section V.
3. Multiple regression is one type of statistical analysis involving several variables. Other types
include matching analysis, stratification, analysis of variance, probit analysis, logit analysis, discriminant
analysis, and factor analysis.
4. Thus, in Ottaviani v. State University of New York, 875 F.2d 365, 367 (2d Cir. 1989) (citations
omitted), cert. denied, 493 U.S. 1021 (1990), the court stated:
In disparate treatment cases involving claims of gender discrimination, plaintiffs typically use multiple
regression analysis to isolate the influence of gender on employment decisions relating to a particular
job or job benefit, such as salary.
The first step in such a regression analysis is to specify all of the possible “legitimate” (i.e., nondiscriminatory) factors that are likely to significantly affect the dependent variable and which could
account for disparities in the treatment of male and female employees. By identifying those legitimate
criteria that affect the decisionmaking process, individual plaintiffs can make predictions about what job
or job benefits similarly situated employees should ideally receive, and then can measure the difference
between the predicted treatment and the actual treatment of those employees. If there is a disparity
between the predicted and actual outcomes for female employees, plaintiffs in a disparate treatment
case can argue that the net “residual” difference represents the unlawful effect of discriminatory animus
on the allocation of jobs or job benefits.
regression to suggest that the plaintiff’s expert had omitted a number of price-determining variables.
More generally, multiple regression may be useful (1) in determining whether
a particular effect is present; (2) in measuring the magnitude of a particular effect;
and (3) in forecasting what a particular effect would be, but for an intervening
event. In a patent infringement case, for example, a multiple regression analysis
could be used to determine (1) whether the behavior of the alleged infringer
affected the price of the patented product, (2) the size of the effect, and (3) what
the price of the product would have been had the alleged infringement not
occurred.
Over the past several decades, the use of multiple regression analysis in court
has grown widely. Regression analysis has been used most frequently in cases of
sex and race discrimination,5 antitrust violations,6 and cases involving class
certification (under Rule 23).7 However, there is a range of other applications,
including census undercounts,8 voting rights,9 the study of the deterrent effect of
the death penalty,10 rate regulation,11 and intellectual property.12
5. Discrimination cases using multiple regression analysis are legion. See, e.g., Bazemore v.
Friday, 478 U.S. 385 (1986), on remand, 848 F.2d 476 (4th Cir. 1988); Csicseri v. Bowsher, 862 F.
Supp. 547 (D.D.C. 1994) (age discrimination), aff’d, 67 F.3d 972 (D.C. Cir. 1995); EEOC v. General
Tel. Co., 885 F.2d 575 (9th Cir. 1989), cert. denied, 498 U.S. 950 (1990); Bridgeport Guardians, Inc.
v. City of Bridgeport, 735 F. Supp. 1126 (D. Conn. 1990), aff’d, 933 F.2d 1140 (2d Cir.), cert. denied,
502 U.S. 924 (1991); Bickerstaff v. Vassar College, 196 F.3d 435, 448–49 (2d Cir. 1999) (sex discrimination); McReynolds v. Sodexho Marriott, 349 F. Supp. 2d 1 (D.D.C. 2004) (race discrimination); Hnot v. Willis Group Holdings Ltd., 228 F.R.D. 476 (S.D.N.Y. 2005) (gender discrimination);
Carpenter v. Boeing Co., 456 F.3d 1183 (10th Cir. 2006) (sex discrimination); Coward v. ADT
Security Systems, Inc., 140 F.3d 271, 274–75 (D.C. Cir. 1998); Smith v. Virginia Commonwealth
Univ., 84 F.3d 672 (4th Cir. 1996) (en banc); Hemmings v. Tidyman’s Inc., 285 F.3d 1174, 1184–86
(9th Cir. 2000); Mehus v. Emporia State University, 222 F.R.D. 455 (D. Kan. 2004) (sex discrimination); Gutierrez v. Johnson & Johnson, 2006 WL 3246605 (D.N.J. Nov. 6, 2006) (race discrimination);
Morgan v. United Parcel Service, 380 F.3d 459 (8th Cir. 2004) (racial discrimination). See also Keith
N. Hylton & Vincent D. Rougeau, Lending Discrimination: Economic Theory, Econometric Evidence, and
the Community Reinvestment Act, 85 Geo. L.J. 237, 238 (1996) (“regression analysis is probably the best
empirical tool for uncovering discrimination”).
6. E.g., United States v. Brown Univ., 805 F. Supp. 288 (E.D. Pa. 1992) (price fixing of college
scholarships), rev’d, 5 F.3d 658 (3d Cir. 1993); Petruzzi’s IGA Supermarkets, Inc. v. Darling-Delaware
Co., 998 F.2d 1224 (3d Cir.), cert. denied, 510 U.S. 994 (1993); Ohio v. Louis Trauth Dairy, Inc.,
925 F. Supp. 1247 (S.D. Ohio 1996); In re Chicken Antitrust Litig., 560 F. Supp. 963, 993 (N.D. Ga.
1980); New York v. Kraft Gen. Foods, Inc., 926 F. Supp. 321 (S.D.N.Y. 1995); Freeland v. AT&T,
238 F.R.D. 130 (S.D.N.Y. 2006); In re Pressure Sensitive Labelstock Antitrust Litig., 2007 U.S. Dist.
LEXIS 85466 (M.D. Pa. Nov. 19, 2007); In re Linerboard Antitrust Litig., 497 F. Supp. 2d 666 (E.D.
Pa. 2007) (price fixing by manufacturers of corrugated boards and boxes); In re Polypropylene Carpet
Antitrust Litig., 93 F. Supp. 2d 1348 (N.D. Ga. 2000); In re OSB Antitrust Litig., 2007 WL 2253418
(E.D. Pa. Aug. 3, 2007) (price fixing of Oriented Strand Board, also known as “waferboard”); In re
TFT-LCD (Flat Panel) Antitrust Litig., 267 F.R.D. 583 (N.D. Cal. 2010).
For a broad overview of the use of regression methods in antitrust, see ABA Antitrust Section,
Econometrics: Legal, Practical and Technical Issues (John Harkrider & Daniel Rubinfeld eds., 2005).
See also Jerry Hausman et al., Competitive Analysis with Differentiated Products, 34 Annales D’Économie
et de Statistique 159 (1994); Gregory J. Werden, Simulating the Effects of Differentiated Products Mergers:
A Practical Alternative to Structural Merger Policy, 5 Geo. Mason L. Rev. 363 (1997).
7. In antitrust, the circuits are currently split as to the extent to which plaintiffs must prove
that common elements predominate over individual elements. E.g., compare In re Hydrogen Peroxide
Antitrust Litig., 552 F.3d 305 (3d Cir. 2008), with In re Cardizem CD Antitrust Litig., 391 F.3d 812 (6th Cir.
2004). For a discussion of use of multiple regression in evaluating class certification, see Bret M. Dickey
2004). For a discussion of use of multiple regression in evaluating class certification, see Bret M. Dickey
& Daniel L. Rubinfeld, Antitrust Class Certification: Towards an Economic Framework, 66 N.Y.U. Ann.
Surv. Am. L. 459 (2010) and John H. Johnson & Gregory K. Leonard, Economics and the Rigorous
Analysis of Class Certification in Antitrust Cases, 3 J. Competition L. & Econ. 341 (2007).
8. See, e.g., City of New York v. U.S. Dep’t of Commerce, 822 F. Supp. 906 (E.D.N.Y. 1993)
(decision of Secretary of Commerce not to adjust the 1990 census was not arbitrary and capricious),
vacated, 34 F.3d 1114 (2d Cir. 1994) (applying heightened scrutiny), rev’d sub nom. Wisconsin v. City of
New York, 517 U.S. 565 (1996); Carey v. Klutznick, 508 F. Supp. 420, 432–33 (S.D.N.Y. 1980) (use
of reasonable and scientifically valid statistical survey or sampling procedures to adjust census figures
for the differential undercount is constitutionally permissible), stay granted, 449 U.S. 1068 (1980), rev’d
on other grounds, 653 F.2d 732 (2d Cir. 1981), cert. denied, 455 U.S. 999 (1982); Young v. Klutznick,
497 F. Supp. 1318, 1331 (E.D. Mich. 1980), rev’d on other grounds, 652 F.2d 617 (6th Cir. 1981), cert.
denied, 455 U.S. 939 (1982).
9. Multiple regression analysis was used in suits charging that at-large areawide voting was
instituted to neutralize black voting strength, in violation of section 2 of the Voting Rights Act, 42
U.S.C. § 1973 (1988). Multiple regression demonstrated that the race of the candidates and that of
the electorate were determinants of voting. See Williams v. Brown, 446 U.S. 236 (1980); Rodriguez
v. Pataki, 308 F. Supp. 2d 346, 414 (S.D.N.Y. 2004); United States v. Vill. of Port Chester, 2008
U.S. Dist. LEXIS 4914 (S.D.N.Y. Jan. 17, 2008); Meza v. Galvin, 322 F. Supp. 2d 52 (D. Mass.
2004) (violation of VRA with regard to Hispanic voters in Boston); Bone Shirt v. Hazeltine, 336
F. Supp. 2d 976 (D.S.D. 2004) (violations of VRA with regard to Native American voters in South
Dakota); Georgia v. Ashcroft, 195 F. Supp. 2d 25 (D.D.C. 2002) (redistricting of Georgia’s state and
federal legislative districts); Benavidez v. City of Irving, 638 F. Supp. 2d 709 (N.D. Tex. 2009) (challenge of city’s at-large voting scheme). For commentary on statistical issues in voting rights cases, see,
e.g., Statistical and Demographic Issues Underlying Voting Rights Cases, 15 Evaluation Rev. 659 (1991);
Stephen P. Klein et al., Ecological Regression Versus the Secret Ballot, 31 Jurimetrics J. 393 (1991); James
W. Loewen & Bernard Grofman, Recent Developments in Methods Used in Vote Dilution Litigation, 21
Urb. Law. 589 (1989); Arthur Lupia & Kenneth McCue, Why the 1980s Measures of Racially Polarized
Voting Are Inadequate for the 1990s, 12 Law & Pol’y 353 (1990).
10. See, e.g., Gregg v. Georgia, 428 U.S. 153, 184–86 (1976). For critiques of the validity of
the deterrence analysis, see National Research Council, Deterrence and Incapacitation: Estimating
the Effects of Criminal Sanctions on Crime Rates (Alfred Blumstein et al. eds., 1978); Richard O.
Lempert, Desert and Deterrence: An Assessment of the Moral Bases of the Case for Capital Punishment, 79
Mich. L. Rev. 1177 (1981); Hans Zeisel, The Deterrent Effect of the Death Penalty: Facts v. Faith, 1976
Sup. Ct. Rev. 317; and John Donohue & Justin Wolfers, Uses and Abuses of Statistical Evidence in the
Death Penalty Debate, 58 Stan. L. Rev. 787 (2005).
11. See, e.g., Time Warner Entertainment Co. v. FCC, 56 F.3d 151 (D.C. Cir. 1995) (challenge to FCC’s application of multiple regression analysis to set cable rates), cert. denied, 516 U.S.
1112 (1996); Appalachian Power Co. v. EPA, 135 F.3d 791 (D.C. Cir. 1998) (challenging the EPA’s
application of regression analysis to set nitrous oxide emission limits); Consumers Util. Rate Advocacy
Div. v. Ark. PSC, 99 Ark. App. 228 (Ark. Ct. App. 2007) (challenging an increase in nongas rates).
12. See Polaroid Corp. v. Eastman Kodak Co., No. 76-1634-MA, 1990 WL 324105, at *29,
*62–63 (D. Mass. Oct. 12, 1990) (damages awarded because of patent infringement), amended by No.
Multiple regression analysis can be a source of valuable scientific testimony
in litigation. However, when inappropriately used, regression analysis can confuse
important issues while having little, if any, probative value. In EEOC v. Sears,
Roebuck & Co.,13 in which Sears was charged with discrimination against women
in hiring practices, the Seventh Circuit acknowledged that “[m]ultiple regression
analyses, designed to determine the effect of several independent variables on a
dependent variable, which in this case is hiring, are an accepted and common
method of proving disparate treatment claims.”14 However, the court affirmed
the district court’s findings that the “E.E.O.C.’s regression analyses did not ‘accurately reflect Sears’ complex, nondiscriminatory decision-making processes’” and
that the “‘E.E.O.C.’s statistical analyses [were] so flawed that they lack[ed] any
persuasive value.’”15 Serious questions also have been raised about the use of multiple regression analysis in census undercount cases and in death penalty cases.16
The Supreme Court’s rulings in Daubert and Kumho Tire have encouraged
parties to raise questions about the admissibility of multiple regression analyses.17
Because multiple regression is a well-accepted scientific methodology, courts have
frequently admitted testimony based on multiple regression studies, in some cases
over the strong objection of one of the parties.18 However, on some occasions
courts have excluded expert testimony because of a failure to utilize a multiple
regression methodology.19 On other occasions, courts have rejected regression
76-1634-MA, 1991 WL 4087 (D. Mass. Jan. 11, 1991); Estate of Vane v. The Fair, Inc., 849 F.2d
186, 188 (5th Cir. 1988) (lost profits were the result of copyright infringement), cert. denied, 488 U.S.
1008 (1989); Louis Vuitton Malletier v. Dooney & Bourke, Inc., 525 F. Supp. 2d 576, 664 (S.D.N.Y.
2007) (trademark infringement and unfair competition suit). The use of multiple regression analysis to
estimate damages has been contemplated in a wide variety of contexts. See, e.g., David Baldus et al.,
Improving Judicial Oversight of Jury Damages Assessments: A Proposal for the Comparative Additur/Remittitur
Review of Awards for Nonpecuniary Harms and Punitive Damages, 80 Iowa L. Rev. 1109 (1995); Talcott
J. Franklin, Calculating Damages for Loss of Parental Nurture Through Multiple Regression Analysis, 52
Wash. & Lee L. Rev. 271 (1997); Roger D. Blair & Amanda Kay Esquibel, Yardstick Damages in Lost
Profit Cases: An Econometric Approach, 72 Denv. U. L. Rev. 113 (1994); Daniel Rubinfeld, Quantitative
Methods in Antitrust, in 1 Issues in Competition Law and Policy 723 (2008).
13. 839 F.2d 302 (7th Cir. 1988).
14. Id. at 324 n.22.
15. Id. at 348, 351 (quoting EEOC v. Sears, Roebuck & Co., 628 F. Supp. 1264, 1342, 1352
(N.D. Ill. 1986)). The district court commented specifically on the “severe limits of regression analysis
in evaluating complex decision-making processes.” 628 F. Supp. at 1350.
16. See David H. Kaye & David A. Freedman, Reference Guide on Statistics, Sections II.A.3,
B.1, in this manual.
17. Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579 (1993); Kumho Tire Co. v. Carmichael,
526 U.S. 137, 147 (1999) (expanding Daubert’s application to nonscientific expert testimony).
18. See Newport Ltd. v. Sears, Roebuck & Co., 1995 U.S. Dist. LEXIS 7652 (E.D. La. May
26, 1995). See also Petruzzi’s IGA Supermarkets, supra note 6, 998 F.2d at 1240, 1247 (finding that
the district court abused its discretion in excluding multiple regression-based testimony and reversing
the grant of summary judgment to two defendants).
19. See, e.g., In re Executive Telecard Ltd. Sec. Litig., 979 F. Supp. 1021 (S.D.N.Y. 1997).
studies that did not have an adequate foundation or research design with respect
to the issues at hand.20
In interpreting the results of a multiple regression analysis, it is important to
distinguish between correlation and causality. Two variables are correlated—that
is, associated with each other—when the events associated with the variables
occur more frequently together than one would expect by chance. For example,
if higher salaries are associated with a greater number of years of work experience,
and lower salaries are associated with fewer years of experience, there is a positive
correlation between salary and number of years of work experience. However, if
higher salaries are associated with less experience, and lower salaries are associated
with more experience, there is a negative correlation between the two variables.
A correlation between two variables does not imply that one event causes the
second. Therefore, in making causal inferences, it is important to avoid spurious
correlation.21 Spurious correlation arises when two variables are closely related but
bear no causal relationship because they are both caused by a third, unexamined
variable. For example, there might be a negative correlation between the age of
certain skilled employees of a computer company and their salaries. One should
not conclude from this correlation that the employer has necessarily discriminated
against the employees on the basis of their age. A third, unexamined variable, such
as the level of the employees’ technological skills, could explain differences in productivity and, consequently, differences in salary.22 Or, consider a patent infringement case in which increased sales of an allegedly infringing product are associated
with a lower price of the patented product.23 This correlation would be spurious
if the two products have their own noncompetitive market niches and the lower
price is the result of a decline in the production costs of the patented product.
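The age-and-salary example can be simulated (a sketch with invented coefficients; the third variable, skill, drives both age and salary, producing a negative age–salary correlation with no causal link between them):

```python
import random

random.seed(7)  # arbitrary seed, for reproducibility

def corr(xs, ys):
    """Ordinary (Pearson) correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

# A third, unexamined variable (technological skill) drives both age and
# salary; all coefficients here are invented for illustration.
skill = [random.gauss(0.0, 1.0) for _ in range(500)]
age = [45.0 - 2.0 * s + random.gauss(0.0, 1.0) for s in skill]
salary = [50.0 + 5.0 * s + random.gauss(0.0, 1.0) for s in skill]

r = corr(age, salary)
# r is strongly negative even though age does not cause salary (or vice
# versa); controlling for skill would make the association vanish.
```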
Pointing to the possibility of a spurious correlation will typically not be
enough to dispose of a statistical argument. It may be appropriate to give little
weight to such an argument absent a showing that the correlation is relevant.
For example, a statistical showing of a relationship between technological skills
20. See City of Tuscaloosa v. Harcros Chemicals, Inc., 158 F.3d 548 (11th Cir. 1998), in which
the court ruled plaintiffs’ regression-based expert testimony inadmissible and granted summary judgment to the defendants. See also American Booksellers Ass’n v. Barnes & Noble, Inc., 135 F. Supp.
2d 1031, 1041 (N.D. Cal. 2001), in which a model was said to contain “too many assumptions and
simplifications that are not supported by real-world evidence,” and Obrey v. Johnson, 400 F.3d 691
(9th Cir. 2005).
21. See David H. Kaye & David A. Freedman, Reference Guide on Statistics, Section V.B.3,
in this manual.
22. See, e.g., Sheehan v. Daily Racing Form Inc., 104 F.3d 940, 942 (7th Cir.) (rejecting plaintiff’s age discrimination claim because statistical study showing correlation between age and retention
ignored the “more than remote possibility that age was correlated with a legitimate job-related qualification”), cert. denied, 521 U.S. 1104 (1997).
23. In some particular cases, there are statistical tests that allow one to reject claims of causality.
For a brief description of these tests, which were developed by Jerry Hausman, see Robert S. Pindyck
& Daniel L. Rubinfeld, Econometric Models and Economic Forecasts § 7.5 (4th ed. 1997).
and worker productivity might be required in the age discrimination example,
above.24
Causality cannot be inferred by data analysis alone; rather, one must infer that
a causal relationship exists on the basis of an underlying causal theory that explains
the relationship between the two variables. Even when an appropriate theory has
been identified, causality can never be inferred directly. One must also look for
empirical evidence that there is a causal relationship. Conversely, the fact that two
variables are correlated does not guarantee the existence of a causal relationship; it could
be that the model—a characterization of the underlying causal theory—does not
reflect the correct interplay among the explanatory variables. In fact, the absence
of correlation does not guarantee that a causal relationship does not exist. Lack of
correlation could occur if (1) there are insufficient data, (2) the data are measured
inaccurately, (3) the data do not allow multiple causal relationships to be sorted
out, or (4) the model is specified wrongly because of the omission of a variable
or variables that are related to the variable of interest.
There is a tension between any attempt to reach conclusions with near
certainty and the inherently uncertain nature of multiple regression analysis. In
general, the statistical analysis associated with multiple regression allows for the
expression of uncertainty in terms of probabilities. The reality that statistical analysis generates probabilities concerning relationships rather than certainty should not
be seen in itself as an argument against the use of statistical evidence, or worse, as
a reason not to admit that there is uncertainty at all. The only alternative might
be to use less reliable anecdotal evidence.
This reference guide addresses a number of procedural and methodological issues that are relevant in considering the admissibility of, and weight to be
accorded to, the findings of multiple regression analyses. It also suggests some
standards of reporting and analysis that an expert presenting multiple regression
analyses might be expected to meet. Section II discusses research design—how the
multiple regression framework can be used to sort out alternative theories about a
case. The guide discusses the importance of choosing the appropriate specification
of the multiple regression model and raises the issue of whether multiple regression
is appropriate for the case at issue. Section III accepts the regression framework
and concentrates on the interpretation of the multiple regression results from both
a statistical and a practical point of view. It emphasizes the distinction between
regression results that are statistically significant and results that are meaningful
to the trier of fact. It also points to the importance of evaluating the robustness
24. See, e.g., Allen v. Seidman, 881 F.2d 375 (7th Cir. 1989) (judicial skepticism was raised when
the defendant did not submit a logistic regression incorporating an omitted variable—the possession of
a higher degree or special education; defendant’s attack on statistical comparisons must also include an
analysis that demonstrates that comparisons are flawed). The appropriate requirements for the defendant’s showing of spurious correlation could, in general, depend on the discovery process. See, e.g.,
Boykin v. Georgia Pac. Co., 706 F.2d 1384 (1983) (criticism of a plaintiff’s analysis for not including
omitted factors, when plaintiff considered all information on an application form, was inadequate).
of regression analyses, i.e., seeing the extent to which the results are sensitive to
changes in the underlying assumptions of the regression model. Section IV briefly
discusses the qualifications of experts and suggests a potentially useful role for
court-appointed neutral experts. Section V emphasizes procedural aspects associated with use of the data underlying regression analyses. It encourages greater
pretrial efforts by the parties to attempt to resolve disputes over statistical studies.
Throughout the main body of this guide, hypothetical examples are used as
illustrations. Moreover, the basic “mathematics” of multiple regression has been
kept to a bare minimum. To achieve that goal, the more formal description of the
multiple regression framework has been placed in the Appendix. The Appendix is
self-contained and can be read before or after the text. The Appendix also includes
further details with respect to the examples used in the body of this guide.
II. Research Design: Model Specification
Multiple regression allows the testifying economist or other expert to choose
among alternative theories or hypotheses and assists the expert in distinguishing
correlations between variables that are plainly spurious from those that may reflect
valid relationships.
A. What Is the Specific Question That Is Under Investigation
by the Expert?
Research begins with a clear formulation of a research question. The data to be
collected and analyzed must relate directly to this question; otherwise, appropriate inferences cannot be drawn from the statistical analysis. For example, if the
question at issue in a patent infringement case is what price the plaintiff’s product
would have been but for the sale of the defendant’s infringing product, sufficient
data must be available to allow the expert to account statistically for the important
factors that determine the price of the product.
B. What Model Should Be Used to Evaluate the Question at
Issue?
Model specification involves several steps, each of which is fundamental to the success of the research effort. Ideally, a multiple regression analysis builds on a theory
that describes the variables to be included in the study. A typical regression model
will include one or more dependent variables, each of which is believed to be causally related to a series of explanatory variables. Because we cannot be certain that
the explanatory variables are themselves unaffected or independent of the influence
of the dependent variable (at least at the point of initial study), the explanatory
variables are often termed covariates. Covariates are known to have an association
with the dependent or outcome variable, but causality remains an open question.
For example, the theory of labor markets might lead one to expect salaries in
an industry to be related to workers’ experience and the productivity of workers’
jobs. A belief that there is job discrimination would lead one to create a model
in which the dependent variable was a measure of workers’ salaries and the list of
covariates included a variable reflecting discrimination in addition to measures
of job training and experience.
In a perfect world, the analysis of the job discrimination (or any other) issue might be accomplished through a controlled experiment, in which employees would be randomly assigned to a variety of employers in an industry under study and asked to fill positions requiring identical experience and skills. In such an experiment, where the only difference in salaries could be a result of discrimination, it would be possible to draw clear and direct inferences from an analysis of salary data. Unfortunately, the opportunity to conduct controlled experiments of this kind is rarely available to experts in the context of legal proceedings. Instead, experts must do their best to interpret the results of real-world "quasi-experiments," in which it is impossible to control all factors that might affect worker salaries or other variables of interest.25
Models are often characterized in terms of parameters—numerical characteristics of the model. In the labor market discrimination example, one parameter
might reflect the increase in salary associated with each additional year of prior
job experience. Another parameter might reflect the reduction in salary associated
with a lack of current on-the-job experience. Multiple regression uses a sample,
or a selection of data, from the population (all the units of interest) to obtain estimates of the values of the parameters of the model. An estimate associated with a
particular explanatory variable is an estimated regression coefficient.
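The estimation step described above can be illustrated with a minimal ordinary-least-squares sketch. The salary figures, coefficients, and 0/1 sex indicator below are hypothetical, chosen so that the estimated regression coefficients can be compared with the known "true" parameters:

```python
import numpy as np

# Hypothetical population model: salary = 30 + 2*experience - 4*female + noise
rng = np.random.default_rng(1)
n = 500
experience = rng.uniform(0, 20, n)
female = rng.integers(0, 2, n)          # 0/1 indicator variable
salary = 30 + 2 * experience - 4 * female + rng.normal(0, 3, n)

# Design matrix: intercept, experience, sex indicator
X = np.column_stack([np.ones(n), experience, female])
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)
intercept, b_exper, b_female = coef     # estimated regression coefficients
```

With a sample of this size the estimates land close to the true parameters (2 and -4); the sampling variability of such estimates is the subject of the standard errors discussed later in this guide.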
Failure to develop the proper theory, failure to choose the appropriate variables, or failure to choose the correct form of the model can substantially bias the
statistical results—that is, create a systematic tendency for an estimate of a model
parameter to be too high or too low.
1. Choosing the dependent variable
The variable to be explained, the dependent variable, should be the appropriate
variable for analyzing the question at issue.26 Suppose, for example, that pay discrimination among hourly workers is a concern. One choice for the dependent variable is the hourly wage rate of the employees; another choice is the annual salary. The distinction is important, because annual salary differences may in part result from differences in hours worked. If the number of hours worked is the product of worker preferences and not discrimination, the hourly wage is a good choice. If the number of hours worked is related to the alleged discrimination, annual salary is the more appropriate dependent variable to choose.27

2. Choosing the explanatory variable that is relevant to the question at issue
The explanatory variable that allows the evaluation of alternative hypotheses must be chosen appropriately. Thus, in a discrimination case, the variable of interest may be the race or sex of the individual. In an antitrust case, it may be a variable that takes on the value 1 to reflect the presence of the alleged anticompetitive behavior and the value 0 otherwise.28

3. Choosing the additional explanatory variables
An attempt should be made to identify additional known or hypothesized explanatory variables, some of which are measurable and may support alternative substantive hypotheses that can be accounted for by the regression analysis. Thus, in a discrimination case, a measure of the skills of the workers may provide an alternative explanation—lower salaries may have been the result of inadequate skills.29

25. In the literature on natural and quasi-experiments, the explanatory variables are characterized as "treatments" and the dependent variable as the "outcome." For a review of natural experiments in the criminal justice arena, see David P. Farrington, A Short History of Randomized Experiments in Criminology, 27 Evaluation Rev. 218–27 (2003).
26. In multiple regression analysis, the dependent variable is usually a continuous variable that takes on a range of numerical values. When the dependent variable is categorical, taking on only two or three values, modified forms of multiple regression, such as probit analysis or logit analysis, are appropriate. For an example of the use of the latter, see EEOC v. Sears, Roebuck & Co., 839 F.2d 302, 325 (7th Cir. 1988) (EEOC used logit analysis to measure the impact of variables, such as age, education, job-type experience, and product-line experience, on the female percentage of commission hires).
27. In job systems in which annual salaries are tied to grade or step levels, the annual salary corresponding to the job position could be more appropriate.
28. Explanatory variables may vary by type, which will affect the interpretation of the regression
results. Thus, some variables may be continuous and others may be categorical.
29. In James v. Stockham Valves, 559 F. 2d 310 (5th Cir. 1977), the Court of Appeals rejected
the employer’s claim that skill level rather than race determined assignment and wage levels, noting
the circularity of defendant’s argument. In Ottaviani v. State University of New York, 679 F. Supp. 288,
306–08 (S.D.N.Y. 1988), aff’d, 875 F.2d 365 (2d Cir. 1989), cert. denied, 493 U.S. 1021 (1990), the
court ruled (in the liability phase of the trial) that the university showed that there was no discrimination in either placement into initial rank or promotions between ranks, and so rank was a proper
variable in multiple regression analysis to determine whether women faculty members were treated
differently than men.
However, in Trout v. Garrett, 780 F. Supp. 1396, 1414 (D.D.C. 1991), the court ruled (in the
damage phase of the trial) that the extent of civilian employees’ prehire work experience was not
an appropriate variable in a regression analysis to compute back pay in employment discrimination.
According to the court, including the prehire level would have resulted in a finding of no sex discrimination, despite a contrary conclusion in the liability phase of the action. Id. See also Stuart v. Roache,
951 F.2d 446 (1st Cir. 1991) (allowing only 3 years of seniority to be considered as the result of prior discrimination), cert. denied, 504 U.S. 913 (1992). Whether a particular variable reflects "legitimate" considerations or itself reflects or incorporates illegitimate biases is a recurring theme in discrimination cases. See, e.g., Smith v. Virginia Commonwealth Univ., 84 F.3d 672, 677 (4th Cir. 1996) (en banc) (suggesting that whether "performance factors" should have been included in a regression analysis was a question of material fact); id. at 681–82 (Luttig, J., concurring in part) (suggesting that the failure of the regression analysis to include "performance factors" rendered it so incomplete as to be inadmissible); id. at 690–91 (Michael, J., dissenting) (suggesting that the regression analysis properly excluded "performance factors"); see also Diehl v. Xerox Corp., 933 F. Supp. 1157, 1168 (W.D.N.Y. 1996).

Not all possible variables that might influence the dependent variable can be included if the analysis is to be successful; some cannot be measured, and others may make little difference.30 If a preliminary analysis shows the unexplained portion of the multiple regression to be unacceptably high, the expert may seek to discover whether some previously undetected variable is missing from the analysis.31

Failure to include a major explanatory variable that is correlated with the variable of interest in a regression model may cause an included variable to be credited with an effect that actually is caused by the excluded variable.32 In general, omitted variables that are correlated with the variable of interest reduce the probative value of the regression analysis. The importance of omitting a relevant variable depends on the strength of the relationship between the omitted variable and the dependent variable and the strength of the correlation between the omitted variable and the explanatory variables of interest. Other things being equal, the greater the correlation between the omitted variable and the variable of interest, the greater the bias caused by the omission. As a result, the omission of an important variable may lead to inferences made from regression analyses that do not assist the trier of fact.33
30. The summary effect of the excluded variables shows up as a random error term in the regression model, as does any modeling error. See Appendix, infra, for details. But see David W. Peterson,
Reference Guide on Multiple Regression, 36 Jurimetrics J. 213, 214 n.2 (1996) (review essay) (asserting
that “the presumption that the combined effect of the explanatory variables omitted from the model
are uncorrelated with the included explanatory variables” is “a knife-edge condition . . . not likely
to occur”).
31. A very low R-squared (R2) is one indication of an unexplained portion of the multiple
regression model that is unacceptably high. However, the inference that one makes from a particular
value of R2 will depend, of necessity, on the context of the particular issues under study and the
particular dataset that is being analyzed. For reasons discussed in the Appendix, a low R2 does not
necessarily imply a poor model (and vice versa).
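The R² measure mentioned in note 31 can be computed directly from the regression residuals. The data below are hypothetical, with a deliberately weak signal so that the unexplained portion is large:

```python
import numpy as np

# R-squared: the share of the variance of the dependent variable that the
# regression explains (hypothetical, noisy data).
rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(0, 2.0, n)   # weak signal relative to the noise

X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef
r_squared = 1 - resid.var() / y.var()        # here: low, most variance unexplained
```

As note 31 cautions, the low R² here reflects a noisy outcome, not a wrong model: the fitted slope is still an unbiased estimate of the true effect.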
32. Technically, the omission of explanatory variables that are correlated with the variable of
interest can cause biased estimates of regression parameters.
33. See Bazemore v. Friday, 751 F.2d 662, 671–72 (4th Cir. 1984) (upholding the district court’s
refusal to accept a multiple regression analysis as proof of discrimination by a preponderance of the
evidence, the court of appeals stated that, although the regression used four variable factors (race,
education, tenure, and job title), the failure to use other factors, including pay increases that varied by
county, precluded their introduction into evidence), aff’d in part, vacated in part, 478 U.S. 385 (1986).
Note, however, that in Sobel v. Yeshiva University, 839 F.2d 18, 33, 34 (2d Cir. 1988), cert. denied,
490 U.S. 1105 (1989), the court made clear that "a [Title VII] defendant challenging the validity of a multiple regression analysis [has] to make a showing that the factors it contends ought to have been included would weaken the showing of salary disparity made by the analysis," by making a specific attack and "a showing of relevance for each particular variable it contends . . . ought to [be] includ[ed]" in the analysis, rather than by simply attacking the results of the plaintiffs' proof as inadequate for lack of a given variable. See also Smith v. Virginia Commonwealth Univ., 84 F.3d 672 (4th Cir. 1996) (en banc) (finding that whether certain variables should have been included in a regression analysis is a question of fact that precludes summary judgment); Freeland v. AT&T, 238 F.R.D. 130, 145 (S.D.N.Y. 2006) ("Ordinarily, the failure to include a variable in a regression analysis will affect the probative value of the analysis and not its admissibility").
Also, in Bazemore v. Friday, the Court, declaring that the Fourth Circuit's view of the evidentiary value of the regression analyses was plainly incorrect, stated that "[n]ormally, failure to include variables will affect the analysis' probativeness, not its admissibility. Importantly, it is clear that a regression analysis that includes less than 'all measurable variables' may serve to prove a plaintiff's case." 478 U.S. 385, 400 (1986) (footnote omitted).
34. See also David H. Kaye & David A. Freedman, Reference Guide on Statistics, Section V.B.3, in this manual.

Omitting variables that are not correlated with the variable of interest is, in general, less of a concern, because the parameter that measures the effect of the variable of interest on the dependent variable is estimated without bias. Suppose, for example, that the effect of a policy introduced by the courts to encourage husbands to pay child support has been tested by randomly choosing some cases to be handled according to current court policies and other cases to be handled according to a new, more stringent policy. The effect of the new policy might be measured by a multiple regression using payment success as the dependent variable and a 0 or 1 explanatory variable (1 if the new program was applied; 0 if it was not). Failure to include an explanatory variable that reflected the age of the husbands involved in the program would not affect the court's evaluation of the new policy, because men of any given age are as likely to be affected by the old policy as they are the new policy. Randomly choosing the court's policy to be applied to each case has ensured that the omitted age variable is not correlated with the policy variable.

Bias caused by the omission of an important variable that is related to the included variables of interest can be a serious problem.34 Nonetheless, it is possible for the expert to account for bias qualitatively if the expert has knowledge (even if not quantifiable) about the relationship between the omitted variable and the explanatory variable. Suppose, for example, that the plaintiff's expert in a sex discrimination pay case is unable to obtain quantifiable data that reflect the skills necessary for a job, and that, on average, women are more skillful than men. Suppose also that a regression analysis of the wage rate of employees (the dependent variable) on years of experience and a variable reflecting the sex of each employee (the explanatory variable) suggests that men are paid substantially more than women with the same experience. Because differences in skill levels have not been taken into account, the expert may conclude reasonably that the wage difference measured by the regression is a conservative estimate of the true discriminatory wage difference.
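The qualitative reasoning above can be checked in a simulation. Everything here is hypothetical: a true discriminatory penalty of 5 units, a skill premium of 2, and women who average one unit more skill than men. Omitting skill then biases the estimated penalty toward zero, which is why the measured difference is conservative:

```python
import numpy as np

# Hypothetical omitted-skill scenario: women are, on average, more skilled,
# skill raises wages, and the true discriminatory penalty for women is -5.
rng = np.random.default_rng(3)
n = 2000
female = rng.integers(0, 2, n)
skill = female * 1.0 + rng.normal(size=n)    # women average one unit more skill
experience = rng.uniform(0, 20, n)
wage = 20 + 1.0 * experience + 2.0 * skill - 5.0 * female + rng.normal(0, 2, n)

def ols(cols, y):
    # Least-squares coefficients with an intercept prepended.
    X = np.column_stack([np.ones(len(y))] + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0]

full = ols([experience, skill, female], wage)   # female coefficient near -5
short = ols([experience, female], wage)         # skill omitted: near -3
# The short regression understates the true penalty because the omitted
# skill variable is positively correlated with the female indicator.
```

Had women instead averaged less skill than men, the same omission would have pushed the estimate the other way, overstating the penalty; the direction of the bias depends on the sign of the correlation, which is exactly the qualitative knowledge the text describes.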
The precision of the measure of the effect of a variable of interest on
the dependent variable is also important.35 In general, the more complete the
explained relationship between the included explanatory variables and the dependent variable, the more precise the results. Note, however, that the inclusion of
explanatory variables that are irrelevant (i.e., not correlated with the dependent
variable) reduces the precision of the regression results. This can be a source of
concern when the sample size is small, but it is not likely to be of great consequence when the sample size is large.
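The precision cost of irrelevant regressors can be seen directly in the design matrix: the sampling variance of a coefficient is the error variance times a factor taken from the inverse of XᵀX, and adding columns can only increase that factor for the coefficients that remain. The small hypothetical design below makes the point:

```python
import numpy as np

# Hypothetical small-sample design: one variable of interest plus five
# irrelevant explanatory variables.
rng = np.random.default_rng(4)
n = 15                               # small sample, where precision loss matters most
x = rng.normal(size=n)               # variable of interest
junk = rng.normal(size=(n, 5))       # five irrelevant regressors

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, junk])

def var_factor(X, j):
    # The variance of coefficient j equals sigma^2 times this
    # design-dependent factor from the inverse of X'X.
    return np.linalg.inv(X.T @ X)[j, j]

v_small = var_factor(X_small, 1)
v_big = var_factor(X_big, 1)         # never smaller than v_small
```

The inequality `v_big >= v_small` holds for any data, which is the formal content of the precision warning above; with a large sample both factors are tiny and the difference is of little consequence.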
4. Choosing the functional form of the multiple regression model
Choosing the proper set of variables to be included in the multiple regression
model does not complete the modeling exercise. The expert must also choose the
proper form of the regression model. The most frequently selected form is
the linear regression model (described in the Appendix). In this model, the magnitude of the change in the dependent variable associated with the change in any
of the explanatory variables is the same no matter what the level of the explanatory variables. For example, one additional year of experience might add $5000
to salary, regardless of the previous experience of the employee.
In some instances, however, there may be reason to believe that changes in
explanatory variables will have differential effects on the dependent variable as the
values of the explanatory variables change. In these instances, the expert should
consider the use of a nonlinear model. Failure to account for nonlinearities can
lead to either overstatement or understatement of the effect of a change in the
value of an explanatory variable on the dependent variable.
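The risk of misspecifying the functional form can be sketched by fitting a straight line to deliberately curved hypothetical data; a quadratic specification, which nests the linear one, necessarily fits at least as well:

```python
import numpy as np

# Hypothetical curved relationship: the true effect of x grows with x.
rng = np.random.default_rng(5)
x = np.linspace(0, 10, 100)
y = 1 + 0.2 * x ** 2 + rng.normal(0, 1, 100)   # truly quadratic outcome

lin = np.polyfit(x, y, 1)      # misspecified linear fit
quad = np.polyfit(x, y, 2)     # correctly specified quadratic fit

sse_lin = ((y - np.polyval(lin, x)) ** 2).sum()
sse_quad = ((y - np.polyval(quad, x)) ** 2).sum()
# The linear fit overstates the effect of x at low values and understates
# it at high values; its residual sum of squares is far larger.
```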
One particular type of nonlinearity involves the interaction among several
variables. An interaction variable is the product of two other variables that are
included in the multiple regression model. The interaction variable allows the
expert to take into account the possibility that the effect of a change in one variable on the dependent variable may change as the level of another explanatory
variable changes. For example, in a salary discrimination case, the inclusion of a
term that interacts a variable measuring experience with a variable representing
the sex of the employee (1 if a female employee; 0 if a male employee) allows
the expert to test whether the sex differential varies with the level of experience.
A significant negative estimate of the parameter associated with the sex variable
suggests that inexperienced women are discriminated against, whereas a significant
35. A more precise estimate of a parameter is an estimate with a smaller standard error. See
Appendix, infra, for details.
negative estimate of the interaction parameter suggests that the extent of discrimination increases with experience.36
Note that insignificant coefficients in a model with interactions may suggest a
lack of discrimination, whereas a model without interactions may suggest the contrary. It is especially important to account for interaction terms that could affect
the determination of discrimination; failure to do so may lead to false conclusions
concerning discrimination.
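A minimal sketch of the interaction specification just described, using hypothetical salary data in which the sex differential deepens with experience; the coefficient on the indicator captures the differential at zero experience, and the interaction coefficient captures how that differential changes per year:

```python
import numpy as np

# Hypothetical model: a base penalty of -2 for women that deepens by
# -0.3 for each additional year of experience.
rng = np.random.default_rng(6)
n = 1000
exper = rng.uniform(0, 20, n)
female = rng.integers(0, 2, n)          # 1 if a female employee; 0 if male
salary = 30 + 2 * exper - 2 * female - 0.3 * female * exper + rng.normal(0, 2, n)

# Interaction variable: the product of the sex indicator and experience.
X = np.column_stack([np.ones(n), exper, female, female * exper])
b = np.linalg.lstsq(X, salary, rcond=None)[0]
b_female, b_interact = b[2], b[3]
# b_female: estimated differential for inexperienced women;
# b_interact: estimated change in the differential per year of experience.
```

A model fit without the interaction column would average these two effects together, which is how the omission described in the text can mask or exaggerate discrimination at particular experience levels.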
5. Choosing multiple regression as a method of analysis
There are many multivariate statistical techniques other than multiple regression that are useful in legal proceedings. Some statistical methods are appropriate
when nonlinearities are important;37 others apply to models in which the dependent variable is discrete, rather than continuous.38 Still others have been applied
predominantly to respond to methodological concerns arising in the context of
discrimination litigation.39
It is essential that a valid statistical method be applied to assist with the analysis in each legal proceeding. Therefore, the expert should be prepared to explain
why any chosen method, including multiple regression, was more suitable than
the alternatives.
36. For further details concerning interactions, see the Appendix, infra. Note that in Ottaviani v.
State University of New York, 875 F.2d 365, 367 (2d Cir. 1989), cert. denied, 493 U.S. 1021 (1990), the
defendant relied on a regression model in which a dummy variable reflecting gender appeared as an
explanatory variable. The female plaintiff, however, used an alternative approach in which a regression
model was developed for men only (the alleged protected group). The salaries of women predicted by
this equation were then compared with the actual salaries; a positive difference would, according to
the plaintiff, provide evidence of discrimination. For an evaluation of the methodological advantages
and disadvantages of this approach, see Joseph L. Gastwirth, A Clarification of Some Statistical Issues in
Watson v. Fort Worth Bank and Trust, 29 Jurimetrics J. 267 (1989).
37. These techniques include, but are not limited to, piecewise linear regression, polynomial
regression, maximum likelihood estimation of models with nonlinear functional relationships, and
autoregressive and moving-average time-series models. See, e.g., Pindyck & Rubinfeld, supra note 23,
at 117–21, 136–37, 273–84, 463–601.
38. For a discussion of probit analysis and logit analysis, techniques that are useful in the analysis
of qualitative choice, see id. at 248–81.
39. The correct model for use in salary discrimination suits is a subject of debate among labor
economists. As a result, some have begun to evaluate alternative approaches, including urn models
(Bruce Levin & Herbert Robbins, Urn Models for Regression Analysis, with Applications to Employment
Discrimination Studies, Law & Contemp. Probs., Autumn 1983, at 247) and, as a means of correcting for measurement errors, reverse regression (Delores A. Conway & Harry V. Roberts, Reverse
Regression, Fairness, and Employment Discrimination, 1 J. Bus. & Econ. Stat. 75 (1983)). But see Arthur
S. Goldberger, Redirecting Reverse Regressions, 2 J. Bus. & Econ. Stat. 114 (1984); Arlene S. Ash, The
Perverse Logic of Reverse Regression, in Statistical Methods in Discrimination Litigation 85 (D.H. Kaye
& Mikel Aickin eds., 1986).
III. Interpreting Multiple Regression
Results
Multiple regression results can be interpreted in purely statistical terms, through
the use of significance tests, or they can be interpreted in a more practical, nonstatistical manner. Although an evaluation of the practical significance of regression
results is almost always relevant in the courtroom, tests of statistical significance
are appropriate only in particular circumstances.
A. What Is the Practical, as Opposed to the Statistical,
Significance of Regression Results?
Practical significance means that the magnitude of the effect being studied is
not de minimis—it is sufficiently important substantively for the court to be
concerned. For example, if the average wage rate is $10.00 per hour, a wage
differential between men and women of $0.10 per hour is likely to be deemed
practically insignificant because the differential represents only 1% ($0.10/$10.00)
of the average wage rate.40 That same difference could be statistically significant,
however, if a sufficiently large sample of men and women was studied.41 The
reason is that statistical significance is determined, in part, by the number of
observations in the dataset.
As a general rule, the statistical significance of the magnitude of a regression
coefficient increases as the sample size increases. Thus, a $1.00 per hour wage
differential between men and women that was determined to be insignificantly
different from zero with a sample of 20 men and women could be highly significant if the sample size were increased to 200.
Often, results that are practically significant are also statistically significant.42
40. There is no specific percentage threshold above which a result is practically significant. Practical significance must be evaluated in the context of a particular legal issue. See also David H. Kaye & David A. Freedman, Reference Guide on Statistics, Section IV.B.2, in this manual.
41. Practical significance also can apply to the overall credibility of the regression results. Thus, in McCleskey v. Kemp, 481 U.S. 279 (1987), coefficients on race variables were statistically significant, but the Court declined to find them legally or constitutionally significant.
42. In Melani v. Board of Higher Education, 561 F. Supp. 769, 774 (S.D.N.Y. 1983), a Title VII suit was brought against the City University of New York (CUNY) for allegedly discriminating against female instructional staff in the payment of salaries. One approach of the plaintiff's expert was to use multiple regression analysis. The coefficient on the variable that reflected the sex of the employee was approximately $1800 when all years of data were included. Practically (in terms of average wages at the time) and statistically (in terms of a 5% significance test), this result was significant. Thus, the court stated that "[p]laintiffs have produced statistically significant evidence that women hired as CUNY instructional staff since 1972 received substantially lower salaries than similarly qualified men." Id. at 781 (emphasis added). For a related analysis involving multiple comparison, see Csicseri v. Bowsher,
However, it is possible with a large dataset to find statistically significant coefficients that are practically insignificant. Similarly, it is also possible (especially when
the sample size is small) to obtain results that are practically significant but fail to
achieve statistical significance. Suppose, for example, that an expert undertakes a
damages study in a patent infringement case and predicts “but-for sales”—what
sales would have been had the infringement not occurred—using data that predate
the period of alleged infringement. If data limitations are such that only 3 or 4
years of preinfringement sales are known, the difference between but-for sales and
actual sales during the period of alleged infringement could be practically significant but statistically insignificant. Put differently, with only 3 or 4 data points, the
expert would be unable to detect an effect, even if one existed.
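A small forecasting sketch illustrates the point. All figures below (four years of preinfringement sales, the year-5 forecast) are hypothetical; with only four observations and two degrees of freedom, the forecast interval is so wide that even a large shortfall in actual sales would not be statistically significant:

```python
from math import sqrt

# Hypothetical preinfringement sales (units) for four years; all invented.
years = [1, 2, 3, 4]
sales = [100.0, 112.0, 105.0, 119.0]

n = len(years)
xbar = sum(years) / n
ybar = sum(sales) / n
sxx = sum((x - xbar) ** 2 for x in years)
slope = sum((x - xbar) * (y - ybar) for x, y in zip(years, sales)) / sxx
intercept = ybar - slope * xbar

# Residual standard error with n - 2 degrees of freedom
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(years, sales))
s = sqrt(sse / (n - 2))

# Standard error of a forecast for year 5, the first alleged infringement year
x0 = 5
se_forecast = s * sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)

but_for = intercept + slope * x0        # predicted "but-for" sales
half_width = 4.303 * se_forecast        # t critical value, 2 degrees of freedom
print(but_for, se_forecast, half_width)
```

Here the but-for forecast is 121.5 units, but the 95% forecast interval extends more than 40 units in each direction, so even a practically large shortfall in actual sales would fall inside it.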
1. When should statistical tests be used?
A test of a specific contention, a hypothesis test, often assists the court in determining whether a violation of the law has occurred in areas in which direct evidence
is inaccessible or inconclusive. For example, an expert might use hypothesis tests
in race and sex discrimination cases to determine the presence of a discriminatory
effect.
Statistical evidence alone never can prove with absolute certainty the worth
of any substantive theory. However, by providing evidence contrary to the view
that a particular form of discrimination has not occurred, for example, the multiple regression approach can aid the trier of fact in assessing the likelihood that
discrimination has occurred.43
Tests of hypotheses are appropriate in a cross-sectional analysis, in which the
data underlying the regression study have been chosen as a sample of a population
at a particular point in time, and in a time-series analysis, in which the data being
evaluated cover a number of time periods. In either analysis, the expert may want
to evaluate a specific hypothesis, usually relating to a question of liability or to the
determination of whether there is measurable impact of an alleged violation. Thus,
in a sex discrimination case, an expert may want to evaluate a null hypothesis of
no discrimination against the alternative hypothesis that discrimination takes a particular form.44
862 F. Supp. 547, 572 (D.D.C. 1994) (noting that plaintiff’s expert found “statistically significant instances of discrimination” in 2 of 37 statistical comparisons, but suggesting that “2 of 37 amounts to roughly 5% and is hardly indicative of a pattern of discrimination”), aff’d, 67 F.3d 972 (D.C. Cir. 1995).
43. See International Brotherhood of Teamsters v. United States, 431 U.S. 324 (1977) (the Court inferred discrimination from overwhelming statistical evidence by a preponderance of the evidence); Ryther v. KARE 11, 108 F.3d 832, 844 (8th Cir. 1997) (“The plaintiff produced overwhelming evidence as to the elements of a prima facie case, and strong evidence of pretext, which, when considered with indications of age-based animus in [plaintiff’s] work environment, clearly provide sufficient evidence as a matter of law to allow the trier of fact to find intentional discrimination.”); Paige v. California, 291 F.3d 1141 (9th Cir. 2002) (allowing plaintiffs to rely on aggregated data to show employment discrimination).
Alternatively, in an antitrust damages proceeding, the expert may
want to test a null hypothesis of no legal impact against the alternative hypothesis
that there was an impact. In either type of case, it is important to realize that
rejection of the null hypothesis does not in itself prove legal liability. It is possible
to reject the null hypothesis and believe that an alternative explanation other than
one involving legal liability accounts for the results.45
Often, the null hypothesis is stated in terms of a particular regression coefficient being equal to 0. For example, in a wage discrimination case, the null
hypothesis would be that there is no wage difference between sexes. If a negative
difference is observed (meaning that women are found to earn less than men, after
the expert has controlled statistically for legitimate alternative explanations), the
difference is evaluated as to its statistical significance using the t-test.46 The t-test
uses the t-statistic to evaluate the hypothesis that a model parameter takes on a
particular value, usually 0.
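A minimal sketch of the t-test on a regression coefficient, using an invented ten-person dataset (a sex indicator and hourly wages). The regression of wage on the indicator yields a negative coefficient whose t-statistic falls short of the 5% critical value (about 2.31 with 8 degrees of freedom):

```python
from math import sqrt

# Hypothetical hourly wages; the indicator is 1 for women, 0 for men.
female = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
wage   = [12.0, 14.0, 13.0, 15.0, 16.0, 11.0, 13.0, 12.0, 14.0, 12.0]

n = len(wage)
xbar = sum(female) / n
ybar = sum(wage) / n
sxx = sum((x - xbar) ** 2 for x in female)
slope = sum((x - xbar) * (y - ybar) for x, y in zip(female, wage)) / sxx
intercept = ybar - slope * xbar

# Standard error of the slope and the t-statistic for H0: slope = 0
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(female, wage))
se_slope = sqrt(sse / (n - 2)) / sqrt(sxx)
t_stat = slope / se_slope

print(slope, t_stat)  # slope = -1.6; |t| is about 1.84, below the critical value
```

The estimated wage difference is negative, but with so few observations the expert would fail to reject the null hypothesis of no difference.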
2. What is the appropriate level of statistical significance?
In most scientific work, the level of statistical significance required to reject the
null hypothesis (i.e., to obtain a statistically significant result) is set conventionally at 0.05, or 5%.47 The significance level measures the probability that the
null hypothesis will be rejected incorrectly. In general, the lower the percentage required for statistical significance, the more difficult it is to reject the null
hypothesis; therefore, the lower the probability that one will err in doing so.
Although the 5% criterion is typical, reporting of more stringent 1% significance
tests or less stringent 10% tests can also provide useful information.
In doing a statistical test, it is useful to compute an observed significance
level, or p-value. The p-value associated with the null hypothesis that a regression
coefficient is 0 is the probability that a coefficient of this magnitude or larger could
have occurred by chance if the null hypothesis were true. If the p-value were less
than or equal to 5%, the expert would reject the null hypothesis in favor of the
44. Tests are also appropriate when comparing the outcomes of a set of employer decisions with
those that would have been obtained had the employer chosen differently from among the available
options.
45. See David H. Kaye & David A. Freedman, Reference Guide on Statistics, Section IV.C.5,
in this manual.
46. The t-test is strictly valid only if a number of important assumptions hold. However, for
many regression models, the test is approximately valid if the sample size is sufficiently large. See
Appendix, infra, for a more complete discussion of the assumptions underlying multiple regression.
47. See, e.g., Palmer v. Shultz, 815 F.2d 84, 92 (D.C. Cir. 1987) (“‘the .05 level of significance
. . . [is] certainly sufficient to support an inference of discrimination’” (quoting Segar v. Smith, 738
F.2d 1249, 1283 (D.C. Cir. 1984), cert. denied, 471 U.S. 1115 (1985))); United States v. Delaware,
2004 U.S. Dist. LEXIS 4560 (D. Del. Mar. 22, 2004) (stating that .05 is the normal standard chosen).
alternative hypothesis; if the p-value were greater than 5%, the expert would fail
to reject the null hypothesis.48
3. Should statistical tests be one-tailed or two-tailed?
When the expert evaluates the null hypothesis that a variable of interest has no
linear association with a dependent variable against the alternative hypothesis that
there is an association, a two-tailed test, which allows for the effect to be either
positive or negative, is usually appropriate. A one-tailed test would usually be
applied when the expert believes, perhaps on the basis of other direct evidence
presented at trial, that the alternative hypothesis is either positive or negative, but
not both. For example, an expert might use a one-tailed test in a patent infringement case if he or she strongly believes that the effect of the alleged infringement
on the price of the infringed product was either zero or negative. (The sales of
the infringing product competed with the sales of the infringed product, thereby
lowering the price.) By using a one-tailed test, the expert is in effect stating that
prior to looking at the data it would be very surprising if the data pointed in the
direction opposite to the one posited by the expert.
Because using a one-tailed test produces p-values that are one-half the size of
p-values using a two-tailed test, the choice of a one-tailed test makes it easier for
the expert to reject a null hypothesis. Correspondingly, the choice of a two-tailed
test makes null hypothesis rejection less likely. Because there is some arbitrariness
involved in the choice of an alternative hypothesis, courts should avoid relying
solely on sharply defined statistical tests.49 Reporting the p-value or a confidence
interval should be encouraged because it conveys useful information to the court,
whether or not a null hypothesis is rejected.
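The halving relationship can be verified directly. This sketch again uses the normal approximation and, for simplicity, takes the absolute value of the t-statistic, so it assumes the estimate lies in the direction the one-tailed alternative posits:

```python
from math import erfc, sqrt

def p_two_tailed(t):
    """Two-sided p-value via the normal approximation."""
    return erfc(abs(t) / sqrt(2))

def p_one_tailed(t):
    """One-sided p-value: half the two-sided value, assuming the
    estimate points in the hypothesized direction."""
    return erfc(abs(t) / sqrt(2)) / 2

t = 1.75  # hypothetical t-statistic
print(round(p_two_tailed(t), 3))  # 0.08: not significant at 5%, two-tailed
print(round(p_one_tailed(t), 3))  # 0.04: significant at 5%, one-tailed
```

The same estimate can thus cross the 5% threshold under a one-tailed test but not under a two-tailed test, which is why the choice of test matters.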
48. The use of 1%, 5%, and, sometimes, 10% levels for determining statistical significance
remains a subject of debate. One might argue, for example, that when regression analysis is used in
a price-fixing antitrust case to test a relatively specific alternative to the null hypothesis (e.g., price
fixing), a somewhat lower level of confidence (a higher level of significance, such as 10%) might be
appropriate. Otherwise, when the alternative to the null hypothesis is less specific, such as the rather
vague alternative of “effect” (e.g., the price increase is caused by the increased cost of production,
increased demand, a sharp increase in advertising, or price fixing), a high level of confidence (associated
with a low significance level, such as 1%) may be appropriate. See, e.g., Vuyanich v. Republic Nat’l
Bank, 505 F. Supp. 224, 272 (N.D. Tex. 1980) (noting the “arbitrary nature of the adoption of the
5% level of [statistical] significance” to be required in a legal context); Cook v. Rockwell Int’l Corp.,
2006 U.S. Dist. LEXIS 89121 (D. Colo. Dec. 7, 2006).
49. Courts have shown a preference for two-tailed tests. See, e.g., Palmer v. Shultz, 815 F.2d
84, 95–96 (D.C. Cir. 1987) (rejecting the use of one-tailed tests, the court found that because some
appellants were claiming overselection for certain jobs, a two-tailed test was more appropriate in Title
VII cases); Moore v. Summers, 113 F. Supp. 2d 5, 20 (D.D.C. 2000) (reiterating the preference for a
two-tailed test). See also David H. Kaye & David A. Freedman, Reference Guide on Statistics, Section IV.C.2, in this manual; Csicseri v. Bowsher, 862 F. Supp. 547, 565 (D.D.C. 1994) (finding that
although a one-tailed test is “not without merit,” a two-tailed test is preferable).
B. Are the Regression Results Robust?
The issue of robustness—whether regression results are sensitive to slight modifications in assumptions (e.g., that the data are measured accurately)—is of vital
importance. If the assumptions of the regression model are valid, standard statistical
tests can be applied. However, when the assumptions of the model are violated,
standard tests can overstate or understate the significance of the results.
The violation of an assumption does not necessarily invalidate a regression analysis, however. In some instances in which the assumptions of multiple
regression analysis fail, there are other statistical methods that are appropriate.
Consequently, experts should be encouraged to provide additional information
that relates to the issue of whether regression assumptions are valid, and if they
are not valid, the extent to which the regression results are robust. The following
questions highlight some of the more important assumptions of regression analysis.
1. What evidence exists that the explanatory variable causes changes in
the dependent variable?
In the multiple regression framework, the expert often assumes that changes in
explanatory variables affect the dependent variable, but changes in the dependent
variable do not affect the explanatory variables—that is, there is no feedback.50
In making this assumption, the expert draws the conclusion that a correlation
between a covariate and the dependent outcome variable results from the effect of
the former on the latter and not vice versa. Were the causality reversed, so that the outcome variable affected the covariate, spurious correlation would be likely to cause the expert and the trier of fact to reach the wrong conclusion. Finally, it is possible in some cases that the outcome variable and the covariate each affect the other; if the expert does not take this more
complex relationship into account, the regression coefficient on the variable of
interest could be either too high or too low.51
Figure 1 illustrates this point. In Figure 1(a), the dependent variable, price, is
explained through a multiple regression framework by three covariate explanatory
variables—demand, cost, and advertising—with no feedback. Each of the three
covariates is assumed to affect price causally, while price is assumed to have no
effect on the three covariates. However, in Figure 1(b), there is feedback, because
price affects demand, and demand, cost, and advertising affect price. Cost and
advertising, however, are not affected by price. In this case both price and demand
are jointly determined; each has a causal effect on the other.
50. The assumption of no feedback is especially important in litigation, because it is possible for
the defendant (if responsible, for example, for price fixing or discrimination) to affect the values of
the explanatory variables and thus to bias the usual statistical tests that are used in multiple regression.
51. When both effects occur at the same time, this is described as “simultaneity.”
Figure 1. Feedback. [Figure: panel 1(a), No Feedback, shows arrows running from demand, cost, and advertising to price; panel 1(b), Feedback, adds an arrow from price back to demand.]
As a general rule, there are no basic direct statistical tests for determining the
direction of causality; rather, the expert, when asked, should be prepared to defend his or her assumption based on an understanding of the underlying behavioral evidence relating to the businesses or individuals involved.52
Although there is no single approach that is entirely suitable for estimating
models when the dependent variable affects one or more explanatory variables,
one possibility is for the expert to drop the questionable variable from the regression to determine whether the variable’s exclusion makes a difference. If it does
not, the issue becomes moot. Another approach is for the expert to expand the
multiple regression model by adding one or more equations that explain the relationship between the explanatory variable in question and the dependent variable.
Suppose, for example, that in a salary-based sex discrimination suit the defendant’s expert considers employer-evaluated test scores to be an appropriate explanatory variable for the dependent variable, salary. If the plaintiff were to provide
information that the employer adjusted the test scores in a manner that penalized
women, the assumption that salaries were determined by test scores and not that
test scores were affected by salaries might be invalid. If it is clearly inappropriate,
52. There are statistical time-series tests for particular formulations of causality; see Pindyck &
Rubinfeld, supra note 23, § 9.2.
the test-score variable should be removed from consideration. Alternatively, the
information about the employer’s use of the test scores could be translated into
a second equation in which a new dependent variable—test score—is related to
workers’ salary and sex. A test of the hypothesis that salary and sex affect test scores
would provide a suitable test of the absence of feedback.
2. To what extent are the explanatory variables correlated with each other?
It is essential in multiple regression analysis that the explanatory variable of interest
not be correlated perfectly with one or more of the other explanatory variables.
If there were perfect correlation between two variables, the expert could not
separate out the effect of the variable of interest on the dependent variable from
the effect of the other variable. In essence, there are two explanations for the
same pattern in the data. Suppose, for example, that in a sex discrimination suit, a
particular form of job experience is determined to be a valid source of high wages.
If all men had the requisite job experience and all women did not, it would be
impossible to tell whether wage differentials between men and women were the
result of sex discrimination or differences in experience.
When two or more explanatory variables are correlated perfectly—that is,
when there is perfect collinearity—one cannot estimate the regression parameters.
The existing dataset does not allow one to distinguish between alternative competing explanations of the movement in the dependent variable. However, when
two or more variables are highly, but not perfectly, correlated—that is, when there
is multicollinearity—the regression can be estimated, but some concerns remain.
The greater the multicollinearity between two variables, the less precise are the
estimates of individual regression parameters, and an expert is less able to distinguish among competing explanations for the movement in the outcome variable
(even though there is no problem in estimating the joint influence of the two
variables and all other regression parameters).53
Fortunately, the reported regression statistics take into account any multicollinearity that might be present.54 It is important to note as a corollary, however, that a failure to find a strong relationship between a variable of interest and
53. See Griggs v. Duke Power Co., 401 U.S. 424 (1971) (The court argued that an education
requirement was one rationalization of the data, but racial discrimination was another. If you had put
both race and education in the regression, it would have been asking too much of the data to tell
which variable was doing the real work, because education and race were so highly correlated in the
market at that time.).
54. See Denny v. Westfield State College, 669 F. Supp. 1146, 1149 (D. Mass. 1987) (The court
accepted the testimony of one expert that “the presence of multicollinearity would merely tend to
overestimate the amount of error associated with the estimate. . . . In other words, p-values will be
artificially higher than they would be if there were no multicollinearity present.”) (emphasis added);
In re High Fructose Corn Syrup Antitrust Litig., 295 F.3d 651, 659 (7th Cir. Ill. 2002) (refusing to
second-guess district court’s admission of regression analyses that addressed multicollinearity in different ways).
a dependent variable need not imply that there is no relationship.55 A relatively
small sample, or even a large sample with substantial multicollinearity, may not
provide sufficient information for the expert to determine whether there is a
relationship.
3. To what extent are individual errors in the regression model
independent?
If the expert calculated the parameters of a multiple regression model using as data
the entire population, the estimates might still measure the model’s population
parameters with error. Errors can arise for a number of reasons, including (1) the
failure of the model to include the appropriate explanatory variables, (2) the failure
of the model to reflect any nonlinearities that might be present, and (3) the inclusion of inappropriate variables in the model. (Of course, further sources of error
will arise if a sample, or subset, of the population is used to estimate the regression
parameters.)
It is useful to view the cumulative effect of all of these sources of modeling
error as being represented by an additional variable, the error term, in the multiple regression model. An important assumption in multiple regression analysis is
that the error term and each of the explanatory variables are independent of each
other. (If the error term and an explanatory variable are independent, they are not
correlated with each other.) To the extent this is true, the expert can estimate the
parameters of the model without bias; the magnitude of the error term will affect
the precision with which a model parameter is estimated, but will not cause that
estimate to be consistently too high or too low.
The assumption of independence may be inappropriate in a number of circumstances. In some instances, failure of the assumption makes multiple regression analysis an unsuitable statistical technique; in other instances, modifications
or adjustments within the regression framework can be made to accommodate
the failure.
The independence assumption may fail, for example, in a study of individual
behavior over time, in which an unusually high error value in one time period is
likely to lead to an unusually high value in the next time period. For example, if
an economic forecaster underpredicted this year’s Gross Domestic Product, he or
she is likely to underpredict next year’s as well; the factor that caused the prediction error (e.g., an incorrect assumption about Federal Reserve policy) is likely
to be a source of error in the future.
55. If an explanatory variable of concern and another explanatory variable are highly correlated,
dropping the second variable from the regression can be instructive. If the coefficient on the explanatory variable of concern becomes significant, a relationship between the dependent variable and the
explanatory variable of concern is suggested.
Alternatively, the assumption of independence may fail in a study of a group
of firms at a particular point in time, in which error terms for large firms are systematically higher than error terms for small firms. For example, an analysis of the
profitability of firms may not accurately account for the importance of advertising
as a source of increased sales and profits. To the extent that large firms advertise
more than small firms, the regression errors would be large for the large firms and
small for the small firms. A third possibility is that the dependent variable varies
at the individual level, but the explanatory variable of interest varies only at the
level of a group. For example, an expert might be viewing the price of a product
in an antitrust case as a function of a variable or variables that measure the marketing channel through which the product is sold (e.g., wholesale or retail). In this
case, errors within each of the marketing groups are likely not to be independent.
Failure to account for this could cause the expert to overstate the statistical significance of the regression parameters.
In some instances, there are statistical tests that are appropriate for evaluating
the independence assumption.56 If the assumption has failed, the expert should
ask first whether the source of the lack of independence is the omission of an
important explanatory variable from the regression. If so, that variable should be
included when possible, or the potential effect of its omission should be estimated
when inclusion is not possible. If there is no important missing explanatory variable, the expert should apply one or more procedures that modify the standard
multiple regression technique to allow for more accurate estimates of the regression parameters.57
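One such test, the Durbin-Watson statistic mentioned in footnote 56, is easy to compute from regression residuals. The two residual series below are invented; values near 2 are consistent with independent errors, values near 0 with positive serial correlation, and values near 4 with negative serial correlation:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: ratio of summed squared successive
    differences of the residuals to their summed squares."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residual series (invented for illustration).
persistent  = [1.0, 1.2, 0.9, 1.1, -1.0, -1.1, -0.9, -1.2]  # long same-sign runs
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # sign flips each period

print(round(durbin_watson(persistent), 2))   # well below 2: positive serial correlation
print(round(durbin_watson(alternating), 2))  # well above 2: negative serial correlation
```

The forecaster example in the text, where this year's underprediction foretells next year's, would produce a residual series like the first one.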
4. To what extent are the regression results sensitive to individual data
points?
Estimated regression coefficients can be highly sensitive to particular data points.
Suppose, for example, that one data point deviates greatly from its expected value,
as indicated by the regression equation, while the remaining data points show
56. In a time-series analysis, the correlation of error values over time, the “serial correlation,”
can be tested (in most instances) using a number of tests, including the Durbin-Watson test. The
possibility that some error terms are consistently high in magnitude and others are systematically low (heteroscedasticity) can also be tested in a number of ways. See, e.g., Pindyck & Rubinfeld, supra note
23, at 146–59. When serial correlation and/or heteroscedasticity are present, the standard errors associated with the estimated coefficients must be modified. For a discussion of the use of such “robust”
standard errors, see Jeffrey M. Wooldridge, Introductory Econometrics: A Modern Approach, ch. 8
(4th ed. 2009).
57. When serial correlation is present, a number of closely related statistical methods are appropriate, including generalized differencing (a type of generalized least squares) and maximum likelihood
estimation. When heteroscedasticity is the problem, weighted least squares and maximum likelihood estimation are appropriate. See, e.g., id. All these techniques are readily available in a number of statistical
computer packages. They also allow one to perform the appropriate statistical tests of the significance of
the regression coefficients.
little deviation. It would not be unusual in this situation for the coefficients in
a multiple regression to change substantially if the data point in question were
removed from the sample.
Evaluating the robustness of multiple regression results is a complex endeavor.
Consequently, there is no agreed set of tests for robustness that analysts should
apply. In general, it is important to explore the reasons for unusual data points. If
the source is an error in recording data, the appropriate corrections can be made.
If all the unusual data points have certain characteristics in common (e.g., they
all are associated with a supervisor who consistently gives high ratings in an equal
pay case), the regression model should be modified appropriately.
One generally useful diagnostic technique is to determine to what extent
the estimated parameter changes as each data point in the regression analysis is
dropped from the sample. An influential data point—a point that causes the estimated parameter to change substantially—should be studied further to determine
whether mistakes were made in the use of the data or whether important explanatory variables were omitted.58
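The leave-one-out diagnostic described above can be sketched in a few lines. The data are invented: seven points lying near a line plus one outlier, which turns out to be the influential observation:

```python
def slope(xs, ys):
    """Least squares slope of ys on xs."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx

# Hypothetical data: seven points near the line y = x, plus one outlier.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 2.0, 2.9, 4.2, 5.0, 5.9, 7.1, 20.0]

full = slope(x, y)
# Refit the regression with each observation deleted in turn
changes = []
for i in range(len(x)):
    xs = x[:i] + x[i + 1:]
    ys = y[:i] + y[i + 1:]
    changes.append(abs(slope(xs, ys) - full))

most_influential = changes.index(max(changes))
print(full, most_influential)  # deleting the outlier (index 7) moves the slope most
```

Deleting the outlier cuts the estimated slope roughly in half, flagging that observation for further scrutiny: is it a recording error, or does it share some characteristic that the model omits?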
5. To what extent are the data subject to measurement error?
In multiple regression analysis it is assumed that variables are measured accurately.59 If there are measurement errors in the dependent variable, estimates of
regression parameters will be less accurate, although they will not necessarily be
biased. However, if one or more independent variables are measured with error,
the corresponding parameter estimates are likely to be biased, typically toward
zero (and other coefficient estimates are likely to be biased as well).
To understand why, suppose that the dependent variable, salary, is measured
without error, and the explanatory variable, experience, is subject to measurement
error. (Seniority or years of experience should be accurate, but the type of experience is subject to error, because applicants may overstate previous job responsibilities.) As the measurement error increases, the estimated parameter associated with
the experience variable will tend toward zero, that is, eventually, there will be no
relationship between salary and experience.
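The attenuation effect can be demonstrated with invented data. True experience determines salary exactly; "reported" experience adds a fixed alternating error (used instead of random noise so the result is reproducible), and the estimated slope shrinks toward zero:

```python
def slope(xs, ys):
    """Least squares slope of ys on xs."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx

# Hypothetical relationship: salary rises $2 per unit of true experience.
experience = [float(i) for i in range(20)]
salary = [5.0 + 2.0 * x for x in experience]

# Reported experience = true experience plus alternating errors of +/- 3.
noise = [3.0 if i % 2 == 0 else -3.0 for i in range(20)]
reported = [x + u for x, u in zip(experience, noise)]

b_true = slope(experience, salary)   # recovers 2.0 exactly
b_noisy = slope(reported, salary)    # attenuated toward zero
print(b_true, b_noisy)
```

The slope estimated from the error-ridden regressor is noticeably below 2.0; as the measurement error grows relative to the true variation in experience, the estimate would shrink further toward zero, as the text describes.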
It is important for any source of measurement error to be carefully evaluated.
In some circumstances, little can be done to correct the measurement-error prob-
58. A more complete and formal treatment of the robustness issue appears in David A. Belsley et
al., Regression Diagnostics: Identifying Influential Data and Sources of Collinearity 229–44 (1980). For
a useful discussion of the detection of outliers and the evaluation of influential data points, see R.D.
Cook & S. Weisberg, Residuals and Influence in Regression (Monographs on Statistics and Applied
Probability No. 18, 1982). For a broad discussion of robust regression methods, see Peter J. Rousseeuw
& Annick M. Leroy, Robust Regression and Outlier Detection (2004).
59. Inaccuracy can occur not only in the precision with which a particular variable is measured,
but also in the precision with which the variable to be measured corresponds to the appropriate theoretical construct specified by the regression model.
lem; the regression results must be interpreted in that light. In other circumstances,
however, the expert can correct measurement error by finding a new, more reliable data source. Finally, alternative estimation techniques (using related variables
that are measured without error) can be applied to remedy the measurement-error
problem in some situations.60
IV. The Expert
Multiple regression analysis is taught to students in extremely diverse fields,
including statistics, economics, political science, sociology, psychology, anthropology, public health, and history. Nonetheless, the methodology is difficult to
master, necessitating a combination of technical skills (the science) and experience
(the art). This naturally raises two questions:
1. Who should be qualified as an expert?
2. When and how should the court appoint an expert to assist in the evaluation of statistical issues, including those relating to multiple regression?
A. Who Should Be Qualified as an Expert?
Any individual with substantial training in and experience with multiple regression
and other statistical methods may be qualified as an expert.61 A doctoral degree in
a discipline that teaches theoretical or applied statistics, such as economics, history,
and psychology, usually signifies to other scientists that the proposed expert meets
this preliminary test of the qualification process.
The decision to qualify an expert in regression analysis rests with the court.
Clearly, the proposed expert should be able to demonstrate an understanding of
the discipline. Publications relating to regression analysis in peer-reviewed journals, active memberships in related professional organizations, courses taught on
regression methods, and practical experience with regression analysis can indicate
a professional’s expertise. However, the expert’s background and experience with
the specific issues and tools that are applicable to a particular case should also be
considered during the qualification process. Thus, if the regression methods are
being utilized to evaluate damages in an antitrust case, the qualified expert should
have sufficient qualifications in economic analysis as well as statistics. An individual
whose expertise lies solely with statistics will be limited in his or her ability to
evaluate the usefulness of alternative economic models. Similarly, if a case involves
60. See, e.g., Pindyck & Rubinfeld, supra note 23, at 178–98 (discussion of instrumental variables
estimation).
61. A proposed expert whose only statistical tool is regression analysis may not be able to judge
when a statistical analysis should be based on an approach other than regression analysis.
Reference Guide on Multiple Regression
eyewitness identification, a background in psychology as well as statistics may
provide essential qualifying elements.
B. Should the Court Appoint a Neutral Expert?
There are conflicting views on the issue of whether court-appointed experts
should be used. In complex cases in which two experts are presenting conflicting
statistical evidence, the use of a “neutral” court-appointed expert can be advantageous. There are those who believe, however, that there is no such thing as a
truly “neutral” expert. In any event, if an expert is chosen, that individual should
have substantial expertise and experience—ideally, someone who is respected by
both plaintiffs and defendants.62
The appointment of such an expert is likely to influence the presentation of
the statistical evidence by the experts for the parties in the litigation. The neutral
expert will have an incentive to present a balanced position that relies on broad
principles for which there is consensus in the community of experts. As a result,
the parties’ experts can be expected to present testimony that confronts core issues
that are likely to be of concern to the court and that is sufficiently balanced to be
persuasive to the court-appointed expert.63
Rule 706 of the Federal Rules of Evidence governs the selection and instruction of court-appointed experts. In particular:
1. The expert should be notified of his or her duties through a written court
order or at a conference with the parties.
2. The expert should inform the parties of his or her findings orally or in
writing.
3. If deemed appropriate by the court, the expert should be available to testify
and may be deposed or cross-examined by any party.
4. The court must determine the expert’s compensation.64
5. The parties should be free to utilize their own experts.
Although not required by Rule 706, it will usually be advantageous for the
court to opt for the appointment of a neutral expert as early in the litigation process as possible. It will also be advantageous to minimize any ex parte contact with
62. Judge Posner notes in In re High Fructose Corn Syrup Antitrust Litig., 295 F.3d 651, 665 (7th
Cir. 2002), “the judge and jury can repose a degree of confidence in his testimony that it could not
repose in that of a party’s witness. The judge and the jury may not understand the neutral expert
perfectly but at least they will know that he has no axe to grind, and so, to a degree anyway, they will
be able to take his testimony on faith.”
63. For a discussion of the presentation of expert evidence generally, including the use of court-appointed experts, see Samuel R. Gross, Expert Evidence, 1991 Wis. L. Rev. 1113.
64. Although Rule 706 states that the compensation must come from public funds, complex
litigation may be sufficiently costly as to require that the parties share the costs of the neutral expert.
the neutral expert; this will diminish the possibility that one or both parties will
come to the view that the court’s ultimate opinion was unreasonably influenced
by the neutral expert.
Rule 706 does not offer specifics as to the process of appointing a court-appointed expert. One possibility is to have the parties offer a short list of possible appointees. If there were no common choice, the court could select from the combined list, perhaps after allowing each party to exercise one or more peremptory
challenges. Another possibility is to obtain a list of recommended experts from a
selection of individuals known to be experts in the field.
V. Presentation of Statistical Evidence
The costs of evaluating statistical evidence can be reduced and the precision of
that evidence increased if the discovery process is used effectively. In evaluating
the admissibility of statistical evidence, courts should consider the following issues:
1. Has the expert provided sufficient information to replicate the multiple
regression analysis?
2. Are the expert’s methodological choices reasonable, or are they arbitrary
and unjustified?
A. What Disagreements Exist Regarding Data on Which the
Analysis Is Based?
In general, a clear and comprehensive statement of the underlying research
methodology is a requisite part of the discovery process. The expert should be
encouraged to reveal both the nature of the experimentation carried out and the
sensitivity of the results to the data and to the methodology.
The following suggestions can substantially
improve the discovery process:
1. To the extent possible, the parties should be encouraged to agree to use
a common database. Even if disagreement about the significance of the
data remains, early agreement on a common database can help focus the
discovery process on the important issues in the case.
2. A party that offers data to be used in statistical work, including multiple
regression analysis, should be encouraged to provide the following to the
other parties: (a) a hard copy of the data when available and manageable
in size, along with the underlying sources; (b) computer disks or tapes on
which the data are recorded; (c) complete documentation of the disks or
tapes; (d) computer programs that were used to generate the data (in hard
copy if necessary, but preferably on a computer disk or tape, or both); and (e) documentation of such computer programs. The documentation should be sufficiently complete and clear so that the opposing expert can reproduce all of the statistical work.
3. A party offering data should make available the personnel involved in the compilation of such data to answer the other parties’ technical questions concerning the data and the methods of collection or compilation.
4. A party proposing to offer an expert’s regression analysis at trial should ask the expert to fully disclose (a) the database and its sources,65 (b) the method of collecting the data, and (c) the methods of analysis. When possible, this disclosure should be made sufficiently in advance of trial so that the opposing party can consult its experts and prepare cross-examination. The court must decide on a case-by-case basis where to draw the disclosure line.
5. An opposing party should be given the opportunity to object to a database or to a proposed method of analysis of the database to be offered at trial. Objections may be to simple clerical errors or to more complex issues relating to the selection of data, the construction of variables, and, on occasion, the particular form of statistical analysis to be used. Whenever possible, these objections should be resolved before trial.
6. The parties should be encouraged to resolve differences as to the appropriateness and precision of the data to the extent possible by informal conference. The court should make an effort to resolve differences before trial.
These suggestions are motivated by the objective of improving the discovery
process to make it more informative. The fact that these questions may raise some
doubts or concerns about a particular regression model should not be taken to
mean that the model does not provide useful information. It does, however, take
considerable skill for an expert to determine the extent to which information is
useful when the model being utilized has some shortcomings.
B. Which Database Information and Analytical Procedures
Will Aid in Resolving Disputes over Statistical Studies?66
To help resolve disputes over statistical studies, experts should follow the guidelines below when presenting database information and analytical procedures:
65. These sources would include all variables used in the statistical analyses conducted by the
expert, not simply those variables used in a final analysis on which the expert expects to rely.
66. For a more complete discussion of these requirements, see The Evolving Role of Statistical Assessments as Evidence in the Courts, app. F at 256 (Stephen E. Fienberg ed., 1989) (Recommended Standards on Disclosure of Procedures Used for Statistical Studies to Collect Data Submitted in Evidence in Legal Cases).
1. The expert should state clearly the objectives of the study, as well as the time frame to which it applies and the statistical population to which the results are being projected.
2. The expert should report the units of observation (e.g., consumers, businesses, or employees).
3. The expert should clearly define each variable.
4. The expert should clearly identify the sample for which data are being studied,67 as well as the method by which the sample was obtained.
5. The expert should reveal if there are missing data, whether caused by a lack of availability (e.g., in business data) or nonresponse (e.g., in survey data), and the method used to handle the missing data (e.g., deletion of observations).
6. The expert should report investigations into errors associated with the choice of variables and assumptions underlying the regression model.
7. If samples were chosen randomly from a population (i.e., probability sampling procedures were used),68 the expert should make a good-faith effort to provide an estimate of the sampling error, the measure of the difference between the sample estimate of a parameter (such as the mean of a dependent variable under study) and the (unknown) population parameter (the population mean of the variable).69
8. If probability sampling procedures were not used, the expert should report the set of procedures that was used to minimize sampling errors.
67. The sample information is important because it allows the expert to make inferences about
the underlying population.
68. In probability sampling, each representative of the population has a known probability of
being in the sample. Probability sampling is ideal because it is highly structured, and in principle, it
can be replicated by others. Nonprobability sampling is less desirable because it is often subjective,
relying to a large extent on the judgment of the expert.
69. Sampling error is often reported in terms of standard errors or confidence intervals. See
Appendix, infra, for details.
Appendix: The Basics of Multiple Regression
A. Introduction
This appendix illustrates, through examples, the basics of multiple regression
analysis in legal proceedings. Often, visual displays are used to describe the relationship between variables that are used in multiple regression analysis. Figure 2 is
a scatterplot that relates scores on a job aptitude test (shown on the x-axis) and job
performance ratings (shown on the y-axis). Each point on the scatterplot shows
where a particular individual scored on the job aptitude test and how his or her
job performance was rated. For example, the individual represented by Point A in
Figure 2 scored 49 on the job aptitude test and had a job performance rating of 62.
Figure 2. Scatterplot of scores on a job aptitude test relative to job performance
rating.
[Scatterplot of job performance rating (y-axis, 0–100) against job aptitude test score (x-axis, 0–100); Point A lies at a test score of 49 and a performance rating of 62.]
The relationship between two variables can be summarized by a correlation
coefficient, which ranges in value from –1 (a perfect negative relationship) to
+1 (a perfect positive relationship). Figure 3 depicts three possible relationships
between the job aptitude variable and the job performance variable. In Figure 3(a),
there is a positive correlation: In general, higher job performance ratings are
associated with higher aptitude test scores, and lower job performance ratings
are associated with lower aptitude test scores. In Figure 3(b), the correlation is
negative: Higher job performance ratings are associated with lower aptitude test
scores, and lower job performance ratings are associated with higher aptitude
test scores. Positive and negative correlations can be relatively strong or relatively
weak. If the relationship is sufficiently weak, there is effectively no correlation, as
is illustrated in Figure 3(c).
Figure 3. Correlation between the job aptitude variable and the job performance
variable: (a) positive correlation, (b) negative correlation, (c) weak relationship with no correlation.
[Three scatterplots of job performance rating against job aptitude test score: panel 3(a) shows a positive correlation, panel 3(b) a negative correlation, and panel 3(c) no correlation.]
Multiple regression analysis goes beyond the calculation of correlations; it is a
method in which a regression line is used to relate the average of one variable—the
dependent variable—to the values of other explanatory variables. As a result, regression analysis can be used to predict the values of one variable using the values of
others. For example, if average job performance ratings depend on aptitude test scores,
regression analysis can use information about test scores to predict job performance.
A regression line is the best-fitting straight line through a set of points in a
scatterplot. If there is only one explanatory variable, the straight line is defined
by the equation
Y = a + bX.
(1)
In equation (1), a is the intercept of the line with the y-axis when X equals 0,
and b is the slope—the change in the dependent variable associated with a 1-unit
change in the explanatory variable. In Figure 4, for example, when the aptitude test
score is 0, the predicted (average) value of the job performance rating is the intercept, 18.4. Also, for each additional point on the test score, the job performance
rating increases .73 units, which is given by the slope .73. Thus, the estimated
regression line is
Ŷ = 18.4 + .73X.
(2)
The regression line typically is estimated using the standard method of least
squares, where the values of a and b are calculated so that the sum of the squared
deviations of the points from the line are minimized. In this way, positive deviations and negative deviations of equal size are counted equally, and large deviations
are counted more than small deviations. In Figure 4 the deviation lines are vertical because the equation is predicting job performance ratings from aptitude test scores, not aptitude test scores from job performance ratings.
Figure 4. Regression line.
[Scatterplot of job performance rating (0–100) against job aptitude test score (0–100), with the estimated regression line drawn through the points.]
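The least squares computation just described can be sketched in a few lines of Python. The (score, rating) pairs are hypothetical, and the formulas for the slope and intercept are the standard closed-form least squares solutions for a single explanatory variable.

```python
from statistics import mean

# Hypothetical (aptitude score, performance rating) pairs, illustrative only.
x = [20, 35, 49, 60, 75, 90]
y = [30, 45, 62, 60, 78, 85]

xbar, ybar = mean(x), mean(y)

# Slope b and intercept a chosen to minimize the sum of squared vertical
# deviations of the points from the line Y = a + bX.
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum(
    (xi - xbar) ** 2 for xi in x
)
a = ybar - b * xbar
print(a, b)
```

A well-known property of the least squares fit is that the residuals sum to zero, which serves as a quick check on the computation.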
The important variables that systematically might influence the dependent variable, and for which data can be obtained, typically should be included
explicitly in a statistical model. All remaining influences, which should be small
individually, but can be substantial in the aggregate, are included in an additional
random error term.70 Multiple regression is a procedure that separates the systematic effects (associated with the explanatory variables) from the random effects
(associated with the error term) and also offers a method of assessing the success
of the process.
B. Linear Regression Model
When there are an arbitrary number of explanatory variables, the linear regression
model takes the following form:
Y = β0 + β1X1 + β2X2 + . . . + βkXk + ε
(3)
where Y represents the dependent variable, such as the salary of an employee,
and X1 . . . Xk represent the explanatory variables (e.g., the experience of each
employee and his or her sex, coded as a 1 or 0, respectively). The error term,
ε, represents the collective unobservable influence of any omitted variables. In a
linear regression, each of the terms being added involves unknown parameters,
β0, β1, . . . βk,71 which are estimated by “fitting” the equation to the data using
least squares.
Each estimated coefficient βk measures how the dependent variable Y
responds, on average, to a change in the corresponding covariate Xk, after “controlling for” all the other covariates. The informal phrase “controlling for” has
a specific statistical meaning. Consider the following three-step procedure. First,
we calculate the residuals from a regression of Y on all covariates other than Xk.
Second, we calculate the residuals of a regression of Xk on all the other covariates.
Third, and finally, we regress the first residual variable on the second residual
variable. The resulting coefficient will be identically equal to βk. Thus, the coefficient in a multiple regression represents the slope of the line “Y, adjusted for all covariates other than Xk, versus Xk adjusted for all the other covariates.”72
70. It is clearly advantageous for the random component of the regression relationship to be small relative to the variation in the dependent variable.
71. The variables themselves can appear in many different forms. For example, Y might represent the logarithm of an employee’s salary, and X1 might represent the logarithm of the employee’s years of experience. The logarithmic representation is appropriate when Y increases exponentially as X increases—for each unit increase in X, the corresponding increase in Y becomes larger and larger. For example, if an expert were to graph the growth of the U.S. population (Y) over time (t), the following equation might be appropriate:
log(Y) = β0 + β1log(t).
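The three-step “controlling for” procedure (the Frisch–Waugh–Lovell result noted in footnote 72) can be verified numerically. The sketch below uses NumPy with simulated data; the variable names and coefficient values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated data: Y depends on two correlated covariates, X1 and X2.
X1 = rng.normal(size=n)
X2 = 0.5 * X1 + rng.normal(size=n)
Y = 1.0 + 2.0 * X1 - 3.0 * X2 + rng.normal(size=n)

ones = np.ones(n)

# Full multiple regression of Y on a constant, X1, and X2.
beta = np.linalg.lstsq(np.column_stack([ones, X1, X2]), Y, rcond=None)[0]

# Step 1: residuals from a regression of Y on everything except X2.
Z = np.column_stack([ones, X1])
ry = Y - Z @ np.linalg.lstsq(Z, Y, rcond=None)[0]

# Step 2: residuals from a regression of X2 on the other covariates.
rx = X2 - Z @ np.linalg.lstsq(Z, X2, rcond=None)[0]

# Step 3: regress the first residual on the second; the slope matches the
# multiple regression coefficient on X2.
b_fwl = (rx @ ry) / (rx @ rx)
print(beta[2], b_fwl)
```

The two printed numbers agree to numerical precision, which is exactly what the three-step procedure in the text asserts.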
Most statisticians use the least squares regression technique because of its simplicity and its desirable statistical properties. As a result, it also is used frequently
in legal proceedings.
1. Specifying the regression model
Suppose an expert wants to analyze the salaries of women and men at a large publishing house to discover whether a difference in salaries between employees with
similar years of work experience provides evidence of discrimination.73 To begin
with the simplest case, Y, the salary in dollars per year, represents the dependent
variable to be explained, and X1 represents the explanatory variable—the number
of years of experience of the employee. The regression model would be written
Y = β0 + β1X1 + ε.
(4)
In equation (4), β0 and β1 are the parameters to be estimated from the data,
and ε is the random error term. The parameter β0 is the average salary of all
employees with no experience. The parameter β1 measures the average effect of
an additional year of experience on the average salary of employees.
2. Regression line
Once the parameters in a regression equation, such as equation (3), have been estimated, the fitted values for the dependent variable can be calculated. If we denote
the estimated regression parameters, or regression coefficients, for the model in
equation (3) by b0, b1, . . . bk, the fitted values for Y, denoted Ŷ, are given by
Ŷ = b0 + b1X1 + b2X2 + . . . + bkXk.
(5)
Figure 5 illustrates this for the example involving a single explanatory variable.
The data are shown as a scatter of points; salary is on the vertical axis, and years
of experience is on the horizontal axis. The estimated regression line is drawn
through the data points. It is given by
Ŷ = $15,000 + $2000X1.
(6)
72. In econometrics, this is known as the Frisch–Waugh–Lovell theorem.
73. The regression results used in this example are based on data for 1715 men and women,
which were used by the defense in a sex discrimination case against the New York Times that was
settled in 1978. Professor Orley Ashenfelter, Department of Economics, Princeton University, provided the data.
Figure 5. Goodness of fit.
[Scatterplot of salary in thousands of dollars (Y) against years of experience (X1, from 1 to 6), with the estimated regression line: the intercept b0 is marked on the vertical axis, the slope b1 = $2,000 is the rise in the line for a one-year increase in experience, and the residual (Yi − Ŷi) is the vertical distance between a data point (Point A) and the line (Point B).]
Thus, the fitted value for the salary associated with an individual’s years of experience X1i is given by
Ŷi = b0 + b1X1i (at Point B).
(7)
The intercept of the straight line is the average value of the dependent variable when
the explanatory variable or variables are equal to 0; the intercept b0 is shown on
the vertical axis in Figure 5. Similarly, the slope of the line measures the (average)
change in the dependent variable associated with a unit increase in an explanatory
variable; the slope b1 also is shown. In equation (6), the intercept $15,000 indicates
that employees with no experience earn $15,000 per year. The slope parameter
implies that each year of experience adds $2000 to an “average” employee’s salary.
Now, suppose that the salary variable is related simply to the sex of the employee.
The relevant indicator variable, often called a dummy variable, is X2, which is
equal to 1 if the employee is male, and 0 if the employee is female. Suppose the
regression of salary Y on X2 yields the following result: Y = $30,449 + $10,979X2.
The coefficient $10,979 measures the difference between the average salary of
men and the average salary of women.74
74. To understand why, note that when X2 equals 0, the average salary for women is
$30,449 + $10,979*0 = $30,449. Correspondingly, when X2 = 1, the average salary for men
is $30,449 + $10,979*1 = $41,428. The difference, $41,428 – $30,449, is $10,979.
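Footnote 74 observes that the coefficient on a 0/1 dummy variable equals the difference between the two group means; a small sketch confirms this. The salary figures here are made up for illustration and are not the data behind the $30,449 and $10,979 estimates.

```python
from statistics import mean

# Hypothetical salaries for the two groups defined by the dummy X2.
salaries_women = [28000, 31000, 32000]  # X2 = 0
salaries_men = [39000, 42000, 43000]    # X2 = 1

# In a regression of salary on a constant and X2, the intercept is the
# mean for the X2 = 0 group, and the X2 coefficient is the gap in means.
intercept = mean(salaries_women)
dummy_coefficient = mean(salaries_men) - mean(salaries_women)
print(intercept, dummy_coefficient)
```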
a. Regression residuals
For each data point, the regression residual is the difference between the actual
values and fitted values of the dependent variable. Suppose, for example, that we
are studying an individual with 3 years of experience and a salary of $27,000.
According to the regression line in Figure 5, the average salary of an individual
with 3 years of experience is $21,000. Because the individual’s salary is $6000
higher than the average salary, the residual (the individual’s salary minus the average salary) is $6000. In general, the residual e associated with a data point, such as
Point A in Figure 5, is given by ei = Yi − Ŷi. Each data point in the figure has a
residual, which is the error made by the least squares regression method for that
individual.
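The residual arithmetic in this example can be reproduced directly from the fitted line in equation (6):

```python
# Individual from the text: 3 years of experience, actual salary $27,000.
years = 3
actual_salary = 27000

# Fitted (average) salary from equation (6): Ŷ = 15000 + 2000·X1.
fitted_salary = 15000 + 2000 * years
residual = actual_salary - fitted_salary
print(fitted_salary, residual)  # 21000 and 6000, as in the text
```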
b. Nonlinearities
Nonlinear models account for the possibility that the effect of an explanatory
variable on the dependent variable may vary in magnitude as the level of the
explanatory variable changes. One useful nonlinear model uses interactions among
variables to produce this effect. For example, suppose that
S = β1 + β2SEX + β3EXP + β4(EXP)(SEX) + ε
(8)
where S is annual salary, SEX is equal to 1 for women and 0 for men, EXP represents years of job experience, and ε is a random error term. The coefficient β2
measures the difference in average salary between
men and women for employees with no experience. The coefficient β3 measures
the effect of experience on salary for men (when SEX = 0), and the coefficient
β4 measures the difference in the effect of experience on salary between men and
women. It follows, for example, that the effect of 1 year of experience on salary
for men is β3, whereas the comparable effect for women is β3 + β4.75
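A short sketch makes the interaction interpretation concrete. The coefficient values below are hypothetical, chosen only to show that a one-year increase in experience raises salary by β3 for men and by β3 + β4 for women.

```python
# Hypothetical coefficients for S = b1 + b2·SEX + b3·EXP + b4·EXP·SEX,
# with SEX = 1 for women and 0 for men, as in equation (8).
b1, b2, b3, b4 = 18000.0, -2000.0, 2100.0, -300.0

def predicted_salary(exp, sex):
    return b1 + b2 * sex + b3 * exp + b4 * exp * sex

# Effect of one additional year of experience for each group.
effect_men = predicted_salary(5, 0) - predicted_salary(4, 0)    # equals b3
effect_women = predicted_salary(5, 1) - predicted_salary(4, 1)  # equals b3 + b4
print(effect_men, effect_women)
```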
C. Interpreting Regression Results
To explain how regression results are interpreted, we can expand the earlier example associated with Figure 5 to consider the possibility of an additional explanatory
variable—the square of the number of years of experience, X3. The X3 variable is
designed to capture the fact that for most individuals, salaries increase with experience, but eventually salaries tend to level off. The estimated regression line using
this additional explanatory variable, as well as the first explanatory variable
for years of experience (X1) and the dummy variable for sex (X2), is
75. Estimating a regression in which there are interaction terms for all explanatory variables,
as in equation (8), is essentially the same as estimating two separate regressions, one for men and one
for women.
Ŷ = $14,085 + $2323X1 + $1675X2 − $36X3.
(9)
The importance of including relevant explanatory variables in a regression
model is illustrated by the change in the regression results after the X3 and X1
variables are added. The coefficient on the variable X2 measures the difference
in the salaries of men and women while controlling for the effect of experience.
The differential of $1675 is substantially lower than the previously measured differential of $10,979. Clearly, failure to control for job experience in this example
leads to an overstatement of the difference in salaries between men and women.
Now consider the interpretation of the explanatory variables for experience,
X1 and X3. The positive sign on the X1 coefficient shows that salary increases with
experience. The negative sign on the X3 coefficient indicates that the rate of salary increase decreases with experience. To determine the combined effect of the
variables X1 and X3, some simple calculations can be made. For example, consider
how the average salary of women (X2 = 0) changes with the level of experience.
As experience increases from 0 to 1 year, the average salary increases by $2251,
from $14,085 to $16,336. However, women with 2 years of experience earn only
$2179 more than women with 1 year of experience, and women with 3 years of
experience earn only $2107 more than women with 2 years. Furthermore, women
with 7 years of experience earn $28,582 per year, which is only $1855 more than
the $26,727 earned by women with 6 years of experience.76 Figure 6 illustrates
the results: The regression line shown is for women’s salaries; the corresponding
line for men’s salaries would be parallel and $1675 higher.
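The diminishing effect of experience can be traced through equation (9) directly. The sketch below uses the rounded coefficients as printed in equation (9); because the dollar figures in the text were evidently computed from unrounded estimates, the values printed here differ slightly from those quoted above.

```python
# Fitted salary for women (X2 = 0) from equation (9), with X3 = X1 squared.
def salary_women(years):
    return 14085 + 2323 * years - 36 * years ** 2

# Year-over-year salary increments for the first seven years of experience.
increments = [salary_women(y + 1) - salary_women(y) for y in range(7)]
print(increments)
```

Each increment is smaller than the one before it, reflecting the negative coefficient on X3.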
D. Determining the Precision of the Regression Results
Least squares regression provides not only parameter estimates that indicate the
direction and magnitude of the effect of a change in the explanatory variable on
the dependent variable, but also an estimate of the reliability of the parameter
estimates and a measure of the overall goodness of fit of the regression model.
Each of these factors is considered in turn.
1. Standard errors of the coefficients and t-statistics
Estimates of the true but unknown parameters of a regression model are numbers
that depend on the particular sample of observations under study. If a different
sample were used, a different estimate would be calculated.77 If the expert continued to collect more and more samples and generated additional estimates, as
might happen when new data became available over time, the estimates of each parameter would follow a probability distribution (i.e., the expert could determine the percentage or frequency of the time that each estimate occurs). This probability distribution can be summarized by a mean and a measure of dispersion around the mean, a standard deviation, which usually is referred to as the standard error of the coefficient, or the standard error (SE).78
76. These numbers can be calculated by substituting different values of X1 and X3 in equation (9).
77. The least squares formula that generates the estimates is called the least squares estimator, and its values vary from sample to sample.
Figure 6. Regression slope for women’s salaries and men’s salaries.
[Plot of salary (thousands of dollars) against years of experience (1 to 7), showing the estimated regression curve for women’s salaries.]
Suppose, for example, that an expert is interested in estimating the average
price paid for a gallon of unleaded gasoline by consumers in a particular geographic area of the United States at a particular point in time. The mean price for
a sample of 10 gas stations might be $1.25, while the mean for another sample
might be $1.29, and the mean for a third, $1.21. On this basis, the expert also
could calculate the overall mean price of gasoline to be $1.25 and the standard
deviation to be $0.04.
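The gasoline illustration can be reproduced with the standard library; the three numbers below are the sample means given in the text.

```python
from statistics import mean, stdev

# Mean prices from the three hypothetical samples of 10 gas stations.
sample_means = [1.25, 1.29, 1.21]

overall_mean = mean(sample_means)
dispersion = stdev(sample_means)  # sample standard deviation of the means
print(overall_mean, dispersion)   # approximately 1.25 and 0.04
```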
Least squares regression generalizes this result, by calculating means whose
values depend on one or more explanatory variables. The standard error of a
regression coefficient tells the expert how much parameter estimates are likely
to vary from sample to sample. The greater the variation in parameter estimates
from sample to sample, the larger the standard error and consequently the less
reliable the regression results. Small standard errors imply results that are likely to
78. See David H. Kaye & David A. Freedman, Reference Guide on Statistics, Section IV.A,
in this manual.
be similar from sample to sample, whereas results with large standard errors show
more variability.
Under appropriate assumptions, the least squares estimators provide “best”
determinations of the true underlying parameters.79 In fact, least squares has several desirable properties. First, least squares estimators are unbiased. Intuitively,
this means that if the regression were calculated repeatedly with different samples,
the average of the many estimates obtained for each coefficient would be the true
parameter. Second, least squares estimators are consistent; if the sample were very
large, the estimates obtained would come close to the true parameters. Third,
least squares is efficient, in that its estimators have the smallest variance among all
(linear) unbiased estimators.
If the further assumption is made that the probability distribution of each of
the error terms is known, statistical statements can be made about the precision
of the coefficient estimates. For relatively large samples (often, thirty or more
data points will be sufficient for regressions with a small number of explanatory
variables), the probability that the estimate of a parameter lies within an interval
of 2 standard errors around the true parameter is approximately .95, or 95%. A
frequent, although not always appropriate, assumption in statistical work is that the
error term follows a normal distribution, from which it follows that the estimated
parameters are normally distributed. The normal distribution has the property
that the area within 1.96 standard errors of the mean is equal to 95% of the total
area. Note that the normality assumption is not necessary for least squares to be
used, because most of the properties of least squares apply regardless of normality.
In general, for any parameter estimate b, the expert can construct an interval
around b such that there is a 95% probability that the interval covers the true
parameter. This 95% confidence interval80 is given by81
b ± 1.96 (SE of b).    (10)
The expert can test the hypothesis that a parameter is actually equal to 0 (often
stated as testing the null hypothesis) by looking at its t-statistic, which is defined as
t = b / (SE of b).    (11)
79. The necessary assumptions of the regression model include (a) the model is specified correctly, (b) errors associated with each observation are drawn randomly from the same probability
distribution and are independent of each other, (c) errors associated with each observation are independent of the corresponding observations for each of the explanatory variables in the model, and (d) no
explanatory variable is correlated perfectly with a combination of other variables.
80. Confidence intervals are used commonly in statistical analyses because the expert can never
be certain that a parameter estimate is equal to the true population parameter.
81. If the number of data points in the sample is small, the standard error must be multiplied
by a number larger than 1.96.
342
Copyright © National Academy of Sciences. All rights reserved.
Reference Manual on Scientific Evidence: Third Edition
Reference Guide on Multiple Regression
If the t-statistic is less than 1.96 in magnitude, the 95% confidence interval around
b must include 0.82 Because this means that the expert cannot reject the hypothesis
that β equals 0, the estimate, whatever it may be, is said to be not statistically
significant. Conversely, if the t-statistic is greater than 1.96 in absolute value, the
expert concludes that the true value of β is unlikely to be 0 (intuitively, b is “too
far” from 0 to be consistent with the true value of β being 0). In this case, the
expert rejects the hypothesis that β equals 0 and calls the estimate statistically significant. If the null hypothesis β equals 0 is true, using a 95% confidence level will
cause the expert to falsely reject the null hypothesis 5% of the time. Consequently,
results often are said to be significant at the 5% level.83
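The confidence-interval and t-statistic computations in equations (10) and (11) can be sketched in a few lines of code; the coefficient and standard error values below are hypothetical, chosen only to illustrate the mechanics:

```python
# Hypothetical coefficient estimate and its standard error
b = 2.5        # estimated coefficient
se_b = 1.0     # standard error of b

# Equation (10): large-sample 95% confidence interval
ci_low, ci_high = b - 1.96 * se_b, b + 1.96 * se_b

# Equation (11): t-statistic for the null hypothesis that the parameter is 0
t = b / se_b

# The estimate is statistically significant at the 5% level exactly when
# |t| > 1.96, which is equivalent to the 95% interval excluding 0.
significant = abs(t) > 1.96
```

Here |t| = 2.5 exceeds 1.96, so the interval (0.54, 4.46) excludes 0 and the hypothetical estimate would be called statistically significant at the 5% level.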
As an example, consider a more complete set of regression results associated
with the salary regression described in equation (9):
Ŷ = $14,085 + $2323X1 + $1675X2 − $36X3
         (1577)     (140)     (1435)    (3.4)
t =       8.9       16.5       1.2     −10.8.    (12)
The standard error of each estimated parameter is given in parentheses directly
below the parameter, and the corresponding t-statistics appear below the standard
error values.
Consider the coefficient on the dummy variable X2. It indicates that $1675
is the best estimate of the mean salary difference between men and women.
However, the standard error of $1435 is large in relation to its coefficient $1675.
Because the standard error is relatively large, the range of possible values for
measuring the true salary difference, the true parameter, is great. In fact, a 95%
confidence interval is given by
$1675 ± $1435 ∙ 1.96 = $1675 ± $2813.    (13)
In other words, the expert can have 95% confidence that the true value of the
coefficient lies between –$1138 and $4488. Because this range includes 0, the
effect of sex on salary is said to be insignificantly different from 0 at the 5% level.
The t value of 1.2 is equal to $1675 divided by $1435. Because this t-statistic is
less than 1.96 in magnitude (a condition equivalent to the inclusion of a 0 in the
above confidence interval), the sex variable again is said to be an insignificant
determinant of salary at the 5% level of significance.
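The interval in equation (13) and the t-value of 1.2 can be reproduced directly from the coefficient and standard error reported in equation (12):

```python
b, se = 1675.0, 1435.0   # X2 coefficient and its standard error, from equation (12)

half_width = 1.96 * se                  # ~ $2813
ci = (b - half_width, b + half_width)   # ~ (-$1138, $4488)
t = b / se                              # ~ 1.17, reported as 1.2

# Insignificant at the 5% level: |t| < 1.96, equivalently the interval includes 0
insignificant = abs(t) < 1.96 and ci[0] < 0 < ci[1]
```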
82. The t-statistic applies to any sample size. As the sample gets large, the underlying distribution,
which is the source of the t-statistic (Student’s t-distribution), approximates the normal distribution.
83. A t-statistic of 2.57 in magnitude or greater is associated with a 99% confidence level, or a
1% level of significance, that includes a band of 2.57 standard deviations on either side of the estimated
coefficient.
Note also that experience is a highly significant determinant of salary, because
both the X1 and the X3 variables have t-statistics substantially greater than 1.96 in
magnitude. More experience has a significant positive effect on salary, but the size
of this effect diminishes significantly with experience.
2. Goodness of fit
Reported regression results usually contain not only the point estimates of the
parameters and their standard errors or t-statistics, but also other information that
tells how closely the regression line fits the data. One statistic, the standard error of
the regression (SER), is an estimate of the overall size of the regression residuals.84
An SER of 0 would occur only when all data points lie exactly on the regression
line—an extremely unlikely possibility. Other things being equal, the larger the
SER, the poorer the fit of the data to the model.
For a normally distributed error term, the expert would expect approximately
95% of the data points to lie within 2 SERs of the estimated regression line, as
shown in Figure 7 (in Figure 7, the SER is approximately $5000).
Figure 7. Standard error of the regression. [Plot of salary (Y) against experience (X1), showing the estimated regression line with a band extending 2 SERs above and below it.]
84. More specifically, it is a measure of the standard deviation of the regression error ε. It sometimes is called the root mean squared error of the regression line.
R-squared (R2) is a statistic that measures the percentage of variation in the
dependent variable that is accounted for by all the explanatory variables.85 Thus,
R2 provides a measure of the overall goodness of fit of the multiple regression
equation. Its value ranges from 0 to 1. An R2 of 0 means that the explanatory
variables explain none of the variation of the dependent variable; an R2 of 1 means
that the explanatory variables explain all of the variation. The R2 associated with
equation (12) is .56. This implies that the three explanatory variables explain 56%
of the variation in salaries.
What level of R2, if any, should lead to a conclusion that the model is satisfactory? Unfortunately, there is no clear-cut answer to this question, because the
magnitude of R2 depends on the characteristics of the data being studied and, in
particular, whether the data vary over time or over individuals. Typically, an R2
is low in cross-sectional studies in which differences in individual behavior are
explained. It is likely that these individual differences are caused by many factors
that cannot be measured. As a result, the expert cannot hope to explain most of
the variation. In time-series studies, in contrast, the expert is explaining the movement of aggregates over time. Because most aggregate time series have substantial
growth, or trend, in common, it will not be difficult to “explain” one time series
using another time series, simply because both are moving together. It follows as
a corollary that a high R2 does not by itself mean that the variables included in
the model are the appropriate ones.
As a general rule, courts should be reluctant to rely solely on a statistic such as R2
to choose one model over another. Alternative procedures and tests are available.86
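The SER and R² described above can both be computed from the regression residuals. The sketch below fits a one-variable least squares line to a small synthetic dataset (the numbers are illustrative only, not drawn from the salary example):

```python
import numpy as np

# Illustrative data: Y roughly linear in X with some noise
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Least squares fit: minimize the sum of squared residuals
A = np.column_stack([np.ones_like(X), X])   # intercept column + X
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
residuals = Y - A @ coef

sse = np.sum(residuals**2)     # sum of squared errors
dfe = len(Y) - 2               # observations minus estimated parameters
ser = np.sqrt(sse / dfe)       # standard error of the regression

# R-squared: share of the variation in Y accounted for by the model
variation = np.sum((Y - Y.mean())**2)
r_squared = 1 - sse / variation
```

Because these points lie nearly on a line, the SER is small and R² is close to 1; noisier data would raise the SER and lower R².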
3. Sensitivity of least squares regression results
The least squares regression line can be sensitive to extreme data points. This
sensitivity can be seen most easily in Figure 8. Assume initially that there are only
three data points, A, B, and C, relating information about X1 to the variable Y.
The least squares line describing the best-fitting relationship between Points A, B,
and C is represented by Line 1. Point D is called an outlier because it lies far from
the regression line that fits the remaining points. When a new, best-fitting least
squares line is reestimated to include Point D, Line 2 is obtained. Figure 8 shows
that the outlier Point D is an influential data point, because it has a dominant effect
on the slope and intercept of the least squares line. Because least squares attempts
to minimize the sum of squared deviations, the sensitivity of the line to individual
points sometimes can be substantial.87
85. The variation is the square of the difference between each Y value and the average Y value,
summed over all the Y values.
86. These include F-tests and specification error tests. See Pindyck & Rubinfeld, supra note 23,
at 88–95, 128–36, 194–98.
87. This sensitivity is not always undesirable. In some instances it may be much more important
to predict Point D when a big change occurs than to measure the effects of small changes accurately.
Figure 8. Least squares regression. [Plot showing data points A, B, and C with their best-fitting Line 1, and the substantially different Line 2 obtained when outlier Point D is included.]
What makes the influential data problem even more difficult is that the effect
of an outlier may not be seen readily if deviations are measured from the final
regression line. The reason is that the influence of Point D on Line 2 is so substantial that its deviation from the regression line is not necessarily larger than the
deviation of any of the remaining points from the regression line.88 Although they
are not as popular as least squares, alternative estimation techniques that are less
sensitive to outliers, such as robust estimation, are available.
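The sensitivity illustrated in Figure 8 can be demonstrated numerically: adding a single extreme point to a small sample can change the fitted slope substantially. The points below are hypothetical stand-ins for A, B, C, and D:

```python
import numpy as np

def ls_slope(x, y):
    """Least squares slope of y on x (model with an intercept)."""
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1]

# Three points lying exactly on a line (like A, B, and C in Figure 8)
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
slope_without = ls_slope(x, y)    # slope is exactly 1

# Add one outlier far from that line (like Point D)
x2 = np.append(x, 10.0)
y2 = np.append(y, 30.0)
slope_with = ls_slope(x2, y2)     # the outlier more than triples the slope
```

Because the outlier sits at an extreme value of the explanatory variable, it dominates the squared-deviation sum, pulling the slope from 1.0 to 3.4.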
E. Reading Multiple Regression Computer Output
Statistical computer packages that report multiple regression analyses vary to some
extent in the information they provide and the form that the information takes.
Table 1 contains a sample of the basic computer output that is associated with
equation (9).
88. The importance of an outlier also depends on its location in the dataset. Outliers associated
with relatively extreme values of explanatory variables are likely to be especially influential. See, e.g.,
Fisher v. Vassar College, 70 F.3d 1420, 1436 (2d Cir. 1995) (court required to include assessment of
“service in academic community,” because concept was too amorphous and not a significant factor in
tenure review), rev’d on other grounds, 114 F.3d 1332 (2d Cir. 1997) (en banc).
Table 1. Regression Output
Dependent variable: Y

SSE   62346266124        F-test     174.71
DFE   561                Prob > F   0.0001
MSE   111134164          R2         0.556

Variable     DF   Parameter Estimate   Standard Error   t-Statistic   Prob > |t|
Intercept    1    14,084.89            1577.484          8.9287       .0001
X1           1    2323.17              140.70            16.5115      .0001
X2           1    1675.11              1435.422          1.1670       .2437
X3           1    −36.71               3.41              −10.7573     .0001

Note: SSE = sum of squared errors; DFE = degrees of freedom associated with the error term; MSE = mean squared error; DF = degrees of freedom; Prob = probability.
In the lower portion of Table 1, note that the parameter estimates, the standard
errors, and the t-statistics match the values given in equation (12).89 The variable
“Intercept” refers to the constant term b0 in the regression. The column “DF”
represents degrees of freedom. The “1” signifies that when the computer calculates
the parameter estimates, each variable that is added to the linear regression adds
an additional constraint that must be satisfied. The column labeled “Prob > |t|”
lists the two-tailed p-values associated with each estimated parameter; the p-value
measures the observed significance level—the probability of getting a test statistic as
extreme or more extreme than the observed number if the model parameter is in
fact 0. The very low p-values on the variables X1 and X3 imply that each variable
is statistically significant at less than the 1% level—both highly significant results.
In contrast, the X2 coefficient is only significant at the 24% level, implying that
it is insignificant at the traditional 5% level. Thus, the expert cannot reject with
confidence the null hypothesis that salaries do not differ by sex after the expert has
accounted for the effect of experience.
The top portion of Table 1 provides data that relate to the goodness of fit
of the regression equation. The sum of squared errors (SSE) measures the sum
of the squares of the regression residuals—the sum that is minimized by the least
squares procedure. The degrees of freedom associated with the error term (DFE)
are given by the number of observations minus the number of parameters that
were estimated. The mean squared error (MSE) measures the variance of the
error term (the square of the standard error of the regression). MSE is equal to
SSE divided by DFE.
89. Computer programs give results to more decimal places than are meaningful. This added
detail should not be seen as evidence that the regression results are exact.
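The relationship among SSE, DFE, MSE, and the SER in Table 1 can be checked directly:

```python
sse = 62_346_266_124   # sum of squared errors, from Table 1
dfe = 561              # degrees of freedom of the error term
mse = sse / dfe        # mean squared error = SSE / DFE

# The SER is the square root of the MSE (the estimated standard
# deviation of the regression error)
ser = mse ** 0.5
```

The computed MSE rounds to 111,134,164, matching the value reported in Table 1.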
The R2 of 0.556 indicates that 55.6% of the variation in salaries is explained
by the regression variables, X1, X2, and X3. Finally, the F-test is a test of the null
hypothesis that all regression coefficients (except the intercept) are jointly equal
to 0—that there is no linear association between the dependent variable and any of the
explanatory variables. This is equivalent to the null hypothesis that R2 is equal to 0. In
this case, the F-ratio of 174.71 is sufficiently high that the expert can reject the null
hypothesis with a very high degree of confidence (i.e., with a 1% level of significance).
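For a regression that includes an intercept, the F-statistic can be computed from R², the number of explanatory variables, and the error degrees of freedom. The sketch below uses hypothetical values rather than those of Table 1:

```python
r_squared = 0.40   # hypothetical R-squared
k = 3              # number of explanatory variables
n = 100            # number of observations
dfe = n - k - 1    # error degrees of freedom (intercept included)

# F-ratio: explained variation per explanatory variable, relative to
# unexplained variation per error degree of freedom
f_stat = (r_squared / k) / ((1 - r_squared) / dfe)
```

A large F-statistic (here about 21.3) leads the expert to reject the null hypothesis that all the regression coefficients (other than the intercept) are jointly 0.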
F. Forecasting
In general, a forecast is a prediction made about the values of the dependent variable using information about the explanatory variables. Often, ex ante forecasts
are performed; in this situation, values of the dependent variable are predicted
beyond the sample (e.g., beyond the time period in which the model has been
estimated). However, ex post forecasts are frequently used in damage analyses.90
An ex post forecast has a forecast period such that all values of the dependent and
explanatory variables are known; ex post forecasts can be checked against existing
data and provide a direct means of evaluation.
For example, to calculate the forecast for the salary regression discussed above,
the expert uses the estimated salary equation
Ŷ = $14,085 + $2323X1 + $1675X2 − $36X3.    (14)
To predict the salary of a man with 2 years’ experience, the expert calculates
Ŷ(2) = $14,085 + ($2323 ∙ 2) + $1675 − ($36 ∙ 4) = $20,262.    (15)
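Equation (15) can be reproduced in code. Here X3 is taken to be the square of experience, an assumption consistent with the $20,262 result and with the earlier observation that the effect of experience diminishes as experience grows; the function name is illustrative:

```python
def predict_salary(experience, is_male):
    """Forecast from the estimated salary equation (14).

    X1 = years of experience, X2 = 1 for men and 0 for women,
    X3 = experience squared (an assumption consistent with equation (15)).
    """
    x1 = experience
    x2 = 1 if is_male else 0
    x3 = experience ** 2
    return 14_085 + 2_323 * x1 + 1_675 * x2 - 36 * x3

forecast = predict_salary(2, True)   # $20,262, as in equation (15)
```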
The degree of accuracy of both ex ante and ex post forecasts can be calculated
provided that the model specification is correct and the errors are normally distributed and independent. The statistic is known as the standard error of forecast
(SEF). The SEF measures the standard deviation of the forecast error that is made
within a sample in which the explanatory variables are known with certainty.91 The
90. Frequently, in cases involving damages, the question arises of what the world would have been like
like had a certain event not taken place. For example, in a price-fixing antitrust case, the expert can
ask what the price of a product would have been had a certain event associated with the price-fixing
agreement not occurred. If prices would have been lower, the evidence suggests impact. If the expert
can predict how much lower they would have been, the data can help the expert develop a numerical
estimate of the amount of damages.
91. There are actually two sources of error implicit in the SEF. The first source arises because
the estimated parameters of the regression model may not be exactly equal to the true regression
parameters. The second source is the error term itself; when forecasting, the expert typically sets the
error equal to 0 when a turn of events not taken into account in the regression model may make it
appropriate to make the error positive or negative.
SEF can be used to determine how accurate a given forecast is. In equation (15),
the SEF associated with the forecast of $20,262 is approximately $5000. If a large
sample size is used, the probability is roughly 95% that the predicted salary will be
within 1.96 standard errors of the forecasted value. In this case, the appropriate
95% interval for the prediction is $10,822 to $30,422. Because the estimated model
does not explain salaries effectively, the SEF is large, as is the 95% interval. A more
complete model with additional explanatory variables would result in a lower SEF
and a smaller 95% interval for the prediction.
A danger exists when using the SEF, one that applies to the standard errors of
the estimated coefficients as well. The SEF is calculated on the assumption that the
model includes the correct set of explanatory variables and the correct functional
form. If the choice of variables or the functional form is wrong, the estimated forecast error may be misleading. In some instances, it may be smaller, perhaps substantially smaller, than the true SEF; in other instances, it may be larger, for example, if
the wrong variables happen to capture the effects of the correct variables.
The difference between the SEF and the SER is shown in Figure 9. The SER
measures deviations within the sample. The SEF is more general, because it calculates deviations both within and outside the sample period. In general, the difference
between the SEF and the SER increases as the values of the explanatory variables
increase in distance from the mean values. Figure 9 shows the 95% prediction
interval created by the measurement of two SEFs about the regression line.
Figure 9. Standard error of forecast. [Plot of salary (Y) against experience (X1), showing the regression line with an inner band of 2 SERs and a wider band of 2 SEFs, the latter forming the 95% prediction interval.]
G. A Hypothetical Example
Jane Thompson filed suit in federal court alleging that officials in the police
department discriminated against her and a class of other female police officers in
violation of Title VII of the Civil Rights Act of 1964, as amended. On behalf of
the class, Ms. Thompson alleged that she was paid less than male police officers
with equivalent skills and experience. Both plaintiff and defendant used expert
economists with econometric expertise to present statistical evidence to the court
in support of their positions.
Plaintiff’s expert pointed out that the mean salary of the 40 female officers was
$30,604, whereas the mean salary of the 60 male officers was $43,077. To show
that this difference was statistically significant, the expert put forward a regression
of salary (SALARY) on a constant term and a dummy indicator variable (FEM)
equal to 1 for each female and 0 for each male. The results were as follows:
SALARY = $43,077 − $12,373*FEM
Standard Error    ($1528)    ($2416)
p-value            <.01       <.01
R2 = .22
The −$12,373 coefficient on the FEM variable measures the mean difference
between male and female salaries. Because the standard error is approximately one-fifth of the value of the coefficient, this difference is statistically significant at the 5%
(and indeed at the 1%) level. If this is an appropriate regression model (in terms of its
implicit characterization of salary determination), one can conclude that it is highly
unlikely that the difference in salaries between men and women is due to chance.
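The plaintiff's first regression illustrates a general fact: regressing salary on a constant and a 0–1 dummy variable reproduces the two group means exactly, with the dummy coefficient equal to the difference in means. A small synthetic check (the numbers are illustrative, not the case data):

```python
import numpy as np

# Synthetic salaries for two groups
male = np.array([40_000.0, 44_000.0, 45_000.0])
female = np.array([30_000.0, 32_000.0])

salary = np.concatenate([male, female])
fem = np.array([0.0, 0.0, 0.0, 1.0, 1.0])   # dummy: 1 for each female

# Least squares regression of salary on a constant and the dummy
A = np.column_stack([np.ones_like(fem), fem])
(intercept, coef_fem), *_ = np.linalg.lstsq(A, salary, rcond=None)

# intercept equals the male mean; coef_fem equals female mean - male mean
```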
The defendant’s expert testified that the regression model put forward was the
wrong model because it failed to account for the fact that males (on average) had
substantially more experience than females. The relatively low R2 was an indication that there was substantial unexplained variation in the salaries of male and
female officers. An examination of data relating to years spent on the job showed
that the average male experience was 8.2 years, whereas the average for females
was only 3.5 years. The defense expert then presented a regression analysis that
added an additional explanatory variable (i.e., a covariate), the years of experience
of each police officer (EXP). The new regression results were as follows:
SALARY = $28,049 − $3860*FEM + $1833*EXP
Standard Error    ($2513)    ($2347)    ($265)
p-value            <.01       <.11      <.01
R2 = .47
Experience is itself a statistically significant explanatory variable, with a
p-value of less than .01. Moreover, the difference between male and female
salaries, holding experience constant, is only $3860, and this difference is not statistically significant at the 5% level. The defense expert was able to testify on this
basis that the court could not rule out alternative explanations for the difference
in salaries other than the plaintiff’s claim of discrimination.
The debate did not end here. On rebuttal, the plaintiff’s expert made three
distinct points. First, whether $3860 was statistically significant or not, it was practically significant, representing a salary difference of more than 10% of the mean
female officers’ salaries. Second, although the result was not statistically significant at
the 5% level, it was significant at the 11% level. If the regression model were valid,
there would be approximately an 11% probability that one would err by concluding
that the mean salary difference between men and women was a result of chance.
Third, and most importantly, the expert testified that the regression model
was not correctly specified. Further analysis by the expert showed that the value of
an additional year of experience was $2333 for males on average, but only $1521
for females. Based on supporting testimonial experience, the expert testified that
one could not rule out the possibility that the mechanism by which the police
department discriminated against females was by rewarding males more for their
experience than females. The expert made this point clear by running an additional regression in which a further covariate was added to the model. The new
variable was an interaction variable, INT, measured as the product of the FEM
and EXP variables. The regression results were as follows:
SALARY = $35,122 − $5250*FEM + $2333*EXP − $812*FEM*EXP
Standard Error    ($2825)    ($347)    ($265)    ($185)
p-value            <.01      <.11      <.01      <.04
R2 = .65
The plaintiff’s expert noted that for all males in the sample, FEM = 0, in which
case the regression results are given by the equation
SALARY = $35,122 + $2333*EXP
However, for females, FEM = 1, in which case the corresponding equation is
SALARY = $29,872 + $1521*EXP
It appears, therefore, that females are discriminated against not only when hired
(i.e., when EXP = 0), but also in the reward they get as they accumulate more
and more experience.
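The two fitted equations above can be derived mechanically from the estimated interaction model; the function name below is illustrative:

```python
def salary(exp, fem):
    """Fitted interaction model from the plaintiff's final regression."""
    return 35_122 - 5_250 * fem + 2_333 * exp - 812 * fem * exp

# Males (FEM = 0): intercept and slope of the fitted male equation
male_intercept = salary(0, 0)                 # $35,122
male_slope = salary(1, 0) - salary(0, 0)      # $2,333 per year of experience

# Females (FEM = 1): intercept and slope of the fitted female equation
female_intercept = salary(0, 1)               # $29,872
female_slope = salary(1, 1) - salary(0, 1)    # $1,521 per year of experience
```

The female intercept is the male intercept minus the FEM coefficient ($35,122 − $5,250), and the female experience slope is the male slope minus the interaction coefficient ($2,333 − $812).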
The debate between the experts continued, focusing less on the statistical interpretation of any one particular regression model and more on the choice of model itself, and not simply on statistical significance but also on practical significance.
Glossary
The following terms and definitions are adapted from a variety of sources, including A Dictionary of Epidemiology (John M. Last et al., eds., 4th ed. 2000) and
Robert S. Pindyck & Daniel L. Rubinfeld, Econometric Models and Economic
Forecasts (4th ed. 1998).
alternative hypothesis. See hypothesis test.
association. The degree of statistical dependence between two or more events or
variables. Events are said to be associated when they occur more frequently
together than one would expect by chance.
bias. Any effect at any stage of investigation or inference tending to produce
results that depart systematically from the true values (i.e., the results are
either too high or too low). A biased estimator of a parameter differs on
average from the true parameter.
coefficient. An estimated regression parameter.
confidence interval. An interval that contains a true regression parameter with
a given degree of confidence.
consistent estimator. An estimator that tends to become more and more accurate as the sample size grows.
correlation. A statistical means of measuring the linear association between variables. Two variables are correlated positively if, on average, they move in the
same direction; two variables are correlated negatively if, on average, they
move in opposite directions.
covariate. A variable that is possibly predictive of an outcome under study; an
explanatory variable.
cross-sectional analysis. A type of multiple regression analysis in which each
data point is associated with a different unit of observation (e.g., an individual
or a firm) measured at a particular point in time.
degrees of freedom (DF). The number of observations in a sample minus the
number of estimated parameters in a regression model. A useful statistic in
hypothesis testing.
dependent variable. The variable to be explained or predicted in a multiple
regression model.
dummy variable. A variable that takes on only two values, usually 0 and 1, with
one value indicating the presence of a characteristic, attribute, or effect (1),
and the other value indicating its absence (0).
efficient estimator. An estimator of a parameter that produces the greatest precision possible.
error term. A variable in a multiple regression model that represents the cumulative effect of a number of sources of modeling error.
estimate. The calculated value of a parameter based on the use of a particular
sample.
estimator. The sample statistic that estimates the value of a population parameter
(e.g., a regression parameter); its values vary from sample to sample.
ex ante forecast. A prediction about the values of the dependent variable that go
beyond the sample; consequently, the forecast must be based on predictions
for the values of the explanatory variables in the regression model.
explanatory variable. A variable that is associated with changes in a dependent
variable.
ex post forecast. A prediction about the values of the dependent variable made
during a period in which all values of the explanatory and dependent variables
are known. Ex post forecasts provide a useful means of evaluating the fit of
a regression model.
F-test. A statistical test (based on an F-ratio) of the null hypothesis that a group of
explanatory variables are jointly equal to 0. When applied to all the explanatory variables in a multiple regression model, the F-test becomes a test of the
null hypothesis that R2 equals 0.
feedback. When changes in an explanatory variable affect the values of the
dependent variable, and changes in the dependent variable also affect the
explanatory variable. When both effects occur at the same time, the two
variables are described as being determined simultaneously.
fitted value. The estimated value for the dependent variable; in a linear regression, this value is calculated as the intercept plus a weighted average of the
values of the explanatory variables, with the estimated parameters used as
weights.
heteroscedasticity. When the error associated with a multiple regression model
has a nonconstant variance; that is, the error values associated with some
observations are typically high, while the values associated with other observations are typically low.
hypothesis test. A statement about the parameters in a multiple regression model.
The null hypothesis may assert that certain parameters have specified values
or ranges; the alternative hypothesis would specify other values or ranges.
independence. When two variables are not correlated with each other (in the
population).
independent variable. An explanatory variable that affects the dependent variable but that is not affected by the dependent variable.
influential data point. A data point whose deletion from a regression sample
causes one or more estimated regression parameters to change substantially.
interaction variable. The product of two explanatory variables in a regression
model. Used in a particular form of nonlinear model.
intercept. The value of the dependent variable when each of the explanatory
variables takes on the value of 0 in a regression equation.
least squares. A common method for estimating regression parameters. Least
squares minimizes the sum of the squared differences between the actual
values of the dependent variable and the values predicted by the regression
equation.
linear regression model. A regression model in which the effect of a change in
each of the explanatory variables on the dependent variable is the same, no
matter what the values of those explanatory variables.
mean (sample). The average of the outcomes in a sample; the sum of the outcome values divided by the number of observations.
mean squared error (MSE). The estimated variance of the regression error,
calculated as the average of the sum of the squares of the regression residuals.
model. A representation of an actual situation.
multicollinearity. When two or more variables are highly correlated in a multiple regression analysis. Substantial multicollinearity can cause regression
parameters to be estimated imprecisely, as reflected in relatively high standard
errors.
multiple regression analysis. A statistical tool for understanding the relationship
between two or more variables.
nonlinear regression model. A model having the property that changes in
explanatory variables will have differential effects on the dependent variable
as the values of the explanatory variables change.
normal distribution. A bell-shaped probability distribution having the property
that about 95% of the distribution lies within 2 standard deviations of the
mean.
null hypothesis. In regression analysis the null hypothesis states that the results
observed in a study with respect to a particular variable are no different from
what might have occurred by chance, independent of the effect of that variable. See hypothesis test.
one-tailed test. A hypothesis test in which the alternative to the null hypothesis
that a parameter is equal to 0 is for the parameter to be either positive or
negative, but not both.
outlier. A data point that is more than some appropriate distance from a regression line that is estimated using all the other data points in the sample.
p-value. The significance level in a statistical test; the probability of getting a test
statistic as extreme or more extreme than the observed value. The larger the
p-value, the more likely that the null hypothesis is valid.
parameter. A numerical characteristic of a population or a model.
perfect collinearity. When two or more explanatory variables are correlated
perfectly.
population. All the units of interest to the researcher; also, universe.
practical significance. Substantive importance. Statistical significance does not
ensure practical significance, because, with large samples, small differences
can be statistically significant.
probability distribution. The process that generates the values of a random variable. A probability distribution lists all possible outcomes and the probability
that each will occur.
probability sampling. A process by which a sample of a population is chosen
so that each unit of observation has a known probability of being selected.
quasi-experiment (or natural experiment). A naturally occurring instance
of observable phenomena that yield data that approximate a controlled
experiment.
R-squared (R2). A statistic that measures the percentage of the variation in the
dependent variable that is accounted for by all of the explanatory variables in
a regression model. R-squared is the most commonly used measure of goodness of fit of a regression model.
random error term. A term in a regression model that reflects random error
(sampling error) that is the result of chance. As a consequence, the result
obtained in the sample differs from the result that would be obtained if the
entire population were studied.
regression coefficient. Also, regression parameter. The estimate of a population
parameter obtained from a regression equation that is based on a particular
sample.
regression residual. The difference between the actual value of a dependent
variable and the value predicted by the regression equation.
robust estimation. An alternative to least squares estimation that is less sensitive
to outliers.
robustness. A statistic or procedure that does not change much when data or
assumptions are slightly modified is robust.
sample. A selection of data chosen for a study; a subset of a population.
sampling error. A measure of the difference between the sample estimate of a
parameter and the population parameter.
scatterplot. A graph showing the relationship between two variables in a study;
each dot represents one subject. One variable is plotted along the horizontal
axis; the other variable is plotted along the vertical axis.
serial correlation. The correlation of the values of regression errors over time.
slope. The change in the dependent variable associated with a one-unit change
in an explanatory variable.
spurious correlation. When two variables are correlated, but one is not the
cause of the other.
standard deviation. The square root of the variance of a random variable. The
variance is a measure of the spread of a probability distribution about its mean;
it is calculated as a weighted average of the squares of the deviations of the
outcomes of a random variable from its mean.
standard error of forecast (SEF). An estimate of the standard deviation of the
forecast error; it is based on forecasts made within a sample in which the values
of the explanatory variables are known with certainty.
standard error of the coefficient; standard error (SE). A measure of the
variation of a parameter estimate or coefficient about the true parameter. The
standard error is a standard deviation that is calculated from the probability
distribution of estimated parameters.
standard error of the regression (SER). An estimate of the standard deviation
of the regression error; it is calculated as the square root of the average of the
squares of the residuals associated with a particular multiple regression analysis.
statistical significance. A test used to evaluate the degree of association between
a dependent variable and one or more explanatory variables. If the calculated
p-value is smaller than 5%, the result is said to be statistically significant (at
the 5% level). If p is greater than 5%, the result is statistically insignificant
(at the 5% level).
t-statistic. A test statistic that describes how far an estimate of a parameter is from
its hypothesized value (i.e., given a null hypothesis). If a t-statistic is sufficiently large (in absolute magnitude), an expert can reject the null hypothesis.
t-test. A test of the null hypothesis that a regression parameter takes on a particular value, usually 0. The test is based on the t-statistic.
time-series analysis. A type of multiple regression analysis in which each data
point is associated with a particular unit of observation (e.g., an individual or
a firm) measured at different points in time.
two-tailed test. A hypothesis test in which the alternative to the null hypothesis
that a parameter is equal to 0 is for the parameter to be either positive or
negative, or both.
variable. Any attribute, phenomenon, condition, or event that can have two or
more values.
variable of interest. The explanatory variable that is the focal point of a particular study or legal issue.
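Several of the quantities defined in this glossary (regression residual, R-squared, standard error of the regression, standard error of the coefficient, and the t-statistic) can be computed together in one short numerical sketch. The data below are invented purely for illustration and do not come from any case or study discussed in this manual:

```python
import math

# Hypothetical data: x is the explanatory variable, y the dependent variable.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least squares estimates of the slope and intercept.
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Regression residuals: actual values minus the values predicted by the equation.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# R-squared: the share of the variation in y accounted for by the regression.
ss_res = sum(e ** 2 for e in residuals)
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

# Standard error of the regression (SER), using n - 2 degrees of freedom.
ser = math.sqrt(ss_res / (n - 2))

# Standard error of the slope coefficient.
se_slope = ser / math.sqrt(sxx)

# t-statistic for the null hypothesis that the true slope is 0.
t_stat = slope / se_slope

print(f"slope = {slope:.3f}, R^2 = {r_squared:.3f}")
print(f"SER = {ser:.3f}, SE(slope) = {se_slope:.3f}, t = {t_stat:.1f}")
```

Because the t-statistic here is far larger than 2 in absolute value, an expert would reject the null hypothesis of a zero slope at conventional significance levels; with other data the conclusion could differ.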
References on Multiple Regression
Jonathan B. Baker & Daniel L. Rubinfeld, Empirical Methods in Antitrust: Review
and Critique, 1 Am. L. & Econ. Rev. 386 (1999).
Gerald V. Barrett & Donna M. Sansonetti, Issues Concerning the Use of Regression
Analysis in Salary Discrimination Cases, 41 Personnel Psychol. 503 (1988).
Thomas J. Campbell, Regression Analysis in Title VII Cases: Minimum Standards,
Comparable Worth, and Other Issues Where Law and Statistics Meet, 36 Stan. L.
Rev. 1299 (1984).
Catherine Connolly, The Use of Multiple Regression Analysis in Employment Discrimination Cases, 10 Population Res. & Pol’y Rev. 117 (1991).
Arthur P. Dempster, Employment Discrimination and Statistical Science, 3 Stat. Sci.
149 (1988).
Michael O. Finkelstein, The Judicial Reception of Multiple Regression Studies in Race
and Sex Discrimination Cases, 80 Colum. L. Rev. 737 (1980).
Michael O. Finkelstein & Hans Levenbach, Regression Estimates of Damages in Price-Fixing Cases, Law & Contemp. Probs., Autumn 1983, at 145.
Franklin M. Fisher, Multiple Regression in Legal Proceedings, 80 Colum. L. Rev.
702 (1980).
Franklin M. Fisher, Statisticians, Econometricians, and Adversary Proceedings, 81 J. Am.
Stat. Ass’n 277 (1986).
Joseph L. Gastwirth, Methods for Assessing the Sensitivity of Statistical Comparisons
Used in Title VII Cases to Omitted Variables, 33 Jurimetrics J. 19 (1992).
Note, Beyond the Prima Facie Case in Employment Discrimination Law: Statistical Proof
and Rebuttal, 89 Harv. L. Rev. 387 (1975).
Daniel L. Rubinfeld, Econometrics in the Courtroom, 85 Colum. L. Rev. 1048
(1985).
Daniel L. Rubinfeld & Peter O. Steiner, Quantitative Methods in Antitrust Litigation,
Law & Contemp. Probs., Autumn 1983, at 69.
Daniel L. Rubinfeld, Statistical and Demographic Issues Underlying Voting Rights
Cases, 15 Evaluation Rev. 659 (1991).
The Evolving Role of Statistical Assessments as Evidence in the Courts (Stephen
E. Fienberg ed., 1989).
Reference Guide on
Survey Research
SHARI SEIDMAN DIAMOND
Shari Seidman Diamond, J.D., Ph.D., is the Howard J. Trienens Professor of Law and
Professor of Psychology, Northwestern University, and a Research Professor, American Bar
Foundation, Chicago, Illinois.
CONTENTS
I. Introduction, 361
A. Use of Surveys in Court, 363
B. Surveys Used to Help Assess Expert Acceptance in the Wake of
Daubert, 367
C. Surveys Used to Help Assess Community Standards: Atkins v.
Virginia, 369
D. A Comparison of Survey Evidence and Individual Testimony, 372
II. Purpose and Design of the Survey, 373
A. Was the Survey Designed to Address Relevant Questions? 373
B. Was Participation in the Design, Administration, and Interpretation
of the Survey Appropriately Controlled to Ensure the Objectivity
of the Survey? 374
C. Are the Experts Who Designed, Conducted, or Analyzed the Survey
Appropriately Skilled and Experienced? 375
D. Are the Experts Who Will Testify About Surveys Conducted by
Others Appropriately Skilled and Experienced? 375
III. Population Definition and Sampling, 376
A. Was an Appropriate Universe or Population Identified? 376
B. Did the Sampling Frame Approximate the Population? 377
C. Does the Sample Approximate the Relevant Characteristics of the
Population? 380
D. What Is the Evidence That Nonresponse Did Not Bias the Results
of the Survey? 383
E. What Procedures Were Used to Reduce the Likelihood of a Biased
Sample? 385
F. What Precautions Were Taken to Ensure That Only Qualified
Respondents Were Included in the Survey? 386
IV. Survey Questions and Structure, 387
A. Were Questions on the Survey Framed to Be Clear, Precise, and
Unbiased? 387
B. Were Some Respondents Likely to Have No Opinion? If So, What
Steps Were Taken to Reduce Guessing? 389
C. Did the Survey Use Open-Ended or Closed-Ended Questions? How
Was the Choice in Each Instance Justified? 391
D. If Probes Were Used to Clarify Ambiguous or Incomplete Answers,
What Steps Were Taken to Ensure That the Probes Were Not
Leading and Were Administered in a Consistent Fashion? 394
E. What Approach Was Used to Avoid or Measure Potential Order or
Context Effects? 395
F. If the Survey Was Designed to Test a Causal Proposition, Did the
Survey Include an Appropriate Control Group or Question? 397
G. What Limitations Are Associated with the Mode of Data Collection
Used in the Survey? 401
1. In-person interviews, 402
2. Telephone interviews, 403
3. Mail questionnaires, 405
4. Internet surveys, 406
V. Surveys Involving Interviewers, 409
A. Were the Interviewers Appropriately Selected and Trained? 409
B. What Did the Interviewers Know About the Survey and Its
Sponsorship? 410
C. What Procedures Were Used to Ensure and Determine That the
Survey Was Administered to Minimize Error and Bias? 411
VI. Data Entry and Grouping of Responses, 412
A. What Was Done to Ensure That the Data Were Recorded
Accurately? 412
B. What Was Done to Ensure That the Grouped Data Were Classified
Consistently and Accurately? 413
VII. Disclosure and Reporting, 413
A. When Was Information About the Survey Methodology and Results
Disclosed? 413
B. Does the Survey Report Include Complete and Detailed
Information on All Relevant Characteristics? 415
C. In Surveys of Individuals, What Measures Were Taken to Protect the
Identities of Individual Respondents? 417
VIII. Acknowledgment, 418
Glossary of Terms, 419
References on Survey Research, 423
I. Introduction
Sample surveys are used to describe or enumerate the beliefs, attitudes, or behavior
of persons or other social units.1 Surveys typically are offered in legal proceedings
to establish or refute claims about the characteristics of those individuals or social
units (e.g., whether consumers are likely to be misled by the claims contained
in an allegedly deceptive advertisement;2 which qualities purchasers focus on in
making decisions about buying new computer systems).3 In a broader sense, a
survey can describe or enumerate the attributes of any units, including animals and
objects.4 We focus here primarily on sample surveys, which must deal not only
with issues of population definition, sampling, and measurement common to all
surveys, but also with the specialized issues that arise in obtaining information
from human respondents.
In principle, surveys may count or measure every member of the relevant
population (e.g., all plaintiffs eligible to join in a suit, all employees currently
working for a corporation, all trees in a forest). In practice, surveys typically
count or measure only a portion of the individuals or other units that the survey
is intended to describe (e.g., a sample of jury-eligible citizens, a sample of potential
job applicants). In either case, the goal is to provide information on the relevant
population from which the sample was drawn. Sample surveys can be carried out
using probability or nonprobability sampling techniques. Although probability
sampling offers important advantages over nonprobability sampling,5 experts in
some fields (e.g., marketing) regularly rely on various forms of nonprobability
sampling when conducting surveys. Consistent with Federal Rule of Evidence
703, courts generally have accepted such evidence.6 Thus, in this reference guide,
both the probability sample and the nonprobability sample are discussed. The
strengths of probability sampling and the weaknesses of various types of nonprobability sampling are described.
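The defining feature of probability sampling can be sketched briefly. In this hypothetical example (the population, the trait, and the sample size are all invented), every unit has the same known chance of selection, which is what permits the sample estimate to be evaluated against the true population value:

```python
import random

# Hypothetical sketch of a probability (simple random) sample, in which every
# unit of the population has a known chance of being selected.
random.seed(1)  # fixed seed so the sketch is reproducible

population = list(range(10_000))               # stand-in for 10,000 units
trait = {u: (u % 4 == 0) for u in population}  # exactly 25% of units have the trait

# Each unit's selection probability is known: 500 / 10,000 = 5%.
sample = random.sample(population, 500)
estimate = sum(trait[u] for u in sample) / len(sample)

print(f"sample estimate of trait prevalence: {estimate:.3f} (true value 0.250)")
```

A nonprobability sample (e.g., a mall-intercept or volunteer panel) lacks these known selection probabilities, which is why the margin of error of its estimates cannot be quantified in the same way.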
1. Sample surveys conducted by social scientists “consist of (relatively) systematic, (mostly)
standardized approaches to collecting information on individuals, households, organizations, or larger
organized entities through questioning systematically identified samples.” James D. Wright & Peter V.
Marsden, Survey Research and Social Science: History, Current Practice, and Future Prospects, in Handbook
of Survey Research 1, 3 (James D. Wright & Peter V. Marsden eds., 2d ed. 2010).
2. See Sanderson Farms v. Tyson Foods, 547 F. Supp. 2d 491 (D. Md. 2008).
3. See SMS Sys. Maint. Servs. v. Digital Equip. Corp., 188 F.3d 11, 30 (1st Cir. 1999). For other
examples, see notes 19–32 and accompanying text.
4. In J.H. Miles & Co. v. Brown, 910 F. Supp. 1138 (E.D. Va. 1995), clam processors and fishing
vessel owners sued the Secretary of Commerce for failing to use the unexpectedly high results from 1994
survey data on the size of the clam population to determine clam fishing quotas for 1995. The estimate of
clam abundance is obtained from surveys of the amount of fishing time the research survey vessels require
to collect a specified yield of clams in major fishing areas over a period of several weeks. Id. at 1144–45.
5. See infra Section III.C.
6. Fed. R. Evid. 703 recognizes facts or data “of a type reasonably relied upon by experts in
the particular field. . . .”
As a method of data collection, surveys have several crucial potential advantages over less systematic approaches.7 When properly designed, executed, and
described, surveys (1) economically present the characteristics of a large group of
respondents or other units and (2) permit an assessment of the extent to which
the measured respondents or other units are likely to adequately represent a relevant group of individuals or other units.8 All questions asked of respondents and
all other measuring devices used (e.g., criteria for selecting eligible respondents)
can be examined by the court and the opposing party for objectivity, clarity, and
relevance, and all answers or other measures obtained can be analyzed for completeness and consistency. The survey questions should not be the only focus of
attention. To make it possible for the court and the opposing party to closely scrutinize the survey so that its relevance, objectivity, and representativeness can be
evaluated, the party proposing to offer the survey as evidence should also describe
in detail the design, execution, and analysis of the survey. This should include
(1) a description of the population from which the sample was selected, demonstrating that it was the relevant population for the question at hand; (2) a description of how the sample was drawn and an explanation for why that sample design
was appropriate; (3) a report on response rate and the ability of the sample to
represent the target population; and (4) an evaluation of any sources of potential
bias in respondents’ answers.
The questions listed in this reference guide are intended to assist judges in
identifying, narrowing, and addressing issues bearing on the adequacy of surveys
either offered as evidence or proposed as a method for developing information.9
These questions can be (1) raised from the bench during a pretrial proceeding to
determine the admissibility of the survey evidence; (2) presented to the contending experts before trial for their joint identification of disputed and undisputed
issues; (3) presented to counsel with the expectation that the issues will be
addressed during the examination of the experts at trial; or (4) raised in bench trials
when a motion for a preliminary injunction is made to help the judge evaluate
7. This does not mean that surveys can be relied on to address all questions. For example, if
survey respondents had been asked in the days before the attacks of 9/11 to predict whether they
would volunteer for military service if Washington, D.C., were to be bombed, their answers may
not have provided accurate predictions. Although respondents might have willingly answered the
question, their assessment of what they would actually do in response to an attack simply may have
been inaccurate. Even the option of a “do not know” choice would not have prevented an error in
prediction if they believed they could accurately predict what they would do. Thus, although such a
survey would have been suitable for assessing the predictions of respondents, it might have provided
a very inaccurate estimate of what an actual response to the attack would be.
8. The ability to quantitatively assess the limits of the likely margin of error is unique to probability sample surveys, but an expert testifying about any survey should provide enough information
to allow the judge to evaluate how potential error, including coverage, measurement, nonresponse,
and sampling error, may have affected the obtained pattern of responses.
9. See infra text accompanying note 31.
what weight, if any, the survey should be given.10 These questions are intended
to improve the utility of cross-examination by counsel, where appropriate, not
to replace it.
All sample surveys, whether they measure individuals or other units, should
address the issues concerning purpose and design (Section II), population definition and sampling (Section III), accuracy of data entry (Section VI), and disclosure and reporting (Section VII). Questionnaire and interview surveys, whether
conducted in-person, on the telephone, or online, raise methodological issues
involving survey questions and structure (Section IV) and confidentiality (Section VII.C). Interview surveys introduce additional issues (e.g., interviewer training and qualifications) (Section V), and online surveys raise some new issues and
questions that are currently under study (Section IV.G.4). The sections of this reference guide are labeled to direct the reader to those topics that are relevant to the
type of survey being considered. The scope of this reference guide is necessarily
limited, and additional issues might arise in particular cases.
A. Use of Surveys in Court
Fifty years ago the question of whether surveys constituted acceptable evidence still
was unsettled.11 Early doubts about the admissibility of surveys centered on their
use of sampling12 and their status as hearsay evidence.13 Federal Rule of Evidence
10. Lanham Act cases involving trademark infringement or deceptive advertising frequently
require expedited hearings that request injunctive relief, so judges may need to be more familiar with
survey methodology when considering the weight to accord a survey in these cases than when presiding over cases being submitted to a jury. Even in a case being decided by a jury, however, the court
must be prepared to evaluate the methodology of the survey evidence in order to rule on admissibility.
See Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579, 589 (1993).
11. Hans Zeisel, The Uniqueness of Survey Evidence, 45 Cornell L.Q. 322, 345 (1960).
12. In an early use of sampling, Sears, Roebuck & Co. claimed a tax refund based on sales made
to individuals living outside city limits. Sears randomly sampled 33 of the 826 working days in the
relevant working period, computed the proportion of sales to out-of-city individuals during those days,
and projected the sample result to the entire period. The court refused to accept the estimate based on
the sample. When a complete audit was made, the result was almost identical to that obtained from
the sample. Sears, Roebuck & Co. v. City of Inglewood, tried in Los Angeles Superior Court in 1955, is
described in R. Clay Sprowls, The Admissibility of Sample Data into a Court of Law: A Case History, 4
UCLA L. Rev. 222, 226–29 (1956–1957).
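The projection described in the footnote above can be sketched numerically. Only the 33-of-826 day counts come from the case; the sales figures below are hypothetical:

```python
import math

# Hypothetical sketch of a Sears-style projection: sample some working days,
# compute the proportion of out-of-city sales, and project to the full period.
sampled_days = 33
total_days = 826

out_of_city_sales = 1_240      # hypothetical: out-of-city sales on sampled days
total_sampled_sales = 16_500   # hypothetical: all sales on sampled days

proportion = out_of_city_sales / total_sampled_sales

# Approximate standard error of the sample proportion (treating sales as
# independent draws; a day-level cluster analysis would be more faithful).
se = math.sqrt(proportion * (1 - proportion) / total_sampled_sales)

print(f"sampled {sampled_days} of {total_days} working days")
print(f"estimated out-of-city share: {proportion:.3f} +/- {1.96 * se:.3f} (95% CI)")
```

As the footnote reports, the later complete audit produced a result almost identical to the estimate projected from the sample.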
13. Judge Wilfred Feinberg’s thoughtful analysis in Zippo Manufacturing Co. v. Rogers Imports,
Inc., 216 F. Supp. 670, 682–83 (S.D.N.Y. 1963), provides two alternative grounds for admitting
opinion surveys: (1) Surveys are not hearsay because they are not offered in evidence to prove the
truth of the matter asserted; and (2) even if they are hearsay, they fall under one of the exceptions as
a “present sense impression.” In Schering Corp. v. Pfizer Inc., 189 F.3d 218 (2d Cir. 1999), the Second
Circuit distinguished between perception surveys designed to reflect the present sense impressions of
respondents and “memory” surveys designed to collect information about a past occurrence based on
the recollections of the survey respondents. The court in Schering suggested that if a survey is offered
to prove the existence of a specific idea in the public mind, then the survey does constitute hearsay
703 settled both matters for surveys by redirecting attention to the “validity of the
techniques employed.”14 The inquiry under Rule 703 focuses on whether facts or
data are “of a type reasonably relied upon by experts in the particular field in forming opinions or inferences upon the subject.”15 For a survey, the question becomes,
“Was the poll or survey conducted in accordance with generally accepted survey
principles, and were the results used in a statistically correct way?”16 This focus on
the adequacy of the methodology used in conducting and analyzing results from a
survey is also consistent with the Supreme Court’s discussion of admissible scientific
evidence in Daubert v. Merrell Dow Pharmaceuticals, Inc.17
Because the survey method provides an economical and systematic way to
gather information and draw inferences about a large number of individuals or
other units, surveys are used widely in business, government, and, increasingly,
evidence. As the court observed, Federal Rule of Evidence 803(3), creating “an exception to the
hearsay rule for such statements [i.e., state-of-mind expressions] rather than excluding the statements
from the definition of hearsay, makes sense only in this light.” Id. at 230 n.3. See also Playtex Prods.
v. Procter & Gamble Co., 2003 U.S. Dist. LEXIS 8913 (S.D.N.Y. May 28, 2003), aff’d, 126 Fed.
Appx. 32 (2d Cir. 2005). Note, however, that when survey respondents are shown a stimulus (e.g., a
commercial) and then respond to a series of questions about their impressions of what they viewed,
those impressions reflect both respondents’ initial perceptions and their memory for what they saw and
heard. Concerns about the impact of memory on the trustworthiness of survey responses appropriately
depend on the passage of time between exposure and testing and on the likelihood that distorting
events occurred during that interval.
Two additional exceptions to the hearsay exclusion can be applied to surveys. First, surveys may
constitute a hearsay exception if the survey data were collected in the normal course of a regularly
conducted business activity, unless “the source of information or the method or circumstances of
preparation indicate lack of trustworthiness.” Fed. R. Evid. 803(6); see also Ortho Pharm. Corp. v.
Cosprophar, Inc., 828 F. Supp. 1114, 1119–20 (S.D.N.Y. 1993) (marketing surveys prepared in the
course of business were properly excluded because they lacked foundation from a person who saw
the original data or knew what steps were taken in preparing the report), aff’d, 32 F.3d 690 (2d Cir.
1994). In addition, if a survey shows guarantees of trustworthiness equivalent to those in other hearsay
exceptions, it can be admitted if the court determines that the statement is offered as evidence of a
material fact, it is more probative on the point for which it is offered than any other evidence that the
proponent can procure through reasonable efforts, and admissibility serves the interests of justice. Fed.
R. Evid. 807; e.g., Schering, 189 F.3d at 232. Admissibility as an exception to the hearsay exclusion
thus depends on the trustworthiness of the survey. New Colt Holding v. RJG Holdings of Fla., 312
F. Supp. 2d 195, 223 (D. Conn. 2004).
14. Fed. R. Evid. 703 Advisory Committee Note.
15. Fed. R. Evid. 703.
16. Manual for Complex Litigation § 2.712 (1982). Survey research also is addressed in the
Manual for Complex Litigation, Second § 21.484 (1985) [hereinafter MCL 2d]; the Manual for Complex Litigation, Third § 21.493 (1995) [hereinafter MCL 3d]; and the Manual for Complex Litigation,
Fourth § 11.493 (2004) [hereinafter MCL 4th]. Note, however, that experts who collect survey data,
along with the professions that rely on those surveys, may differ in some of their methodological
standards and principles. An assessment of the precision of sample estimates and an evaluation of the
sources and magnitude of likely bias are required to distinguish methods that are acceptable from
methods that are not.
17. 509 U.S. 579 (1993); see also General Elec. Co. v. Joiner, 522 U.S. 136, 147 (1997).
administrative settings and judicial proceedings.18 Both federal and state courts
have accepted survey evidence on a variety of issues. In a case involving allegations of discrimination in jury panel composition, the defense team surveyed
prospective jurors to obtain their age, race, education, ethnicity, and income
distribution.19 Surveys of employees or prospective employees are used to support
or refute claims of employment discrimination.20 Surveys provide information on
the nature and similarity of claims to support motions for or against class certification.21 In ruling on the admissibility of scientific claims, courts have examined surveys of scientific experts to assess the extent to which the theory or technique has
received widespread acceptance.22 Some courts have admitted surveys in obscenity
cases to provide evidence about community standards.23 Requests for a change of
venue on grounds of jury pool bias often are backed by evidence from a survey
of jury-eligible respondents in the area of the original venue.24 The plaintiff in
an antitrust suit conducted a survey to assess what characteristics, including price,
affected consumers’ preferences. The survey was offered as one way to estimate
damages.25 In a Title IX suit based on allegedly discriminatory scheduling of girls’
18. Some sample surveys are so well accepted that they may not even be recognized as surveys.
For example, some U.S. Census Bureau data are based on sample surveys. Similarly, the Standard Table
of Mortality, which is accepted as proof of the average life expectancy of an individual of a particular
age and gender, is based on survey data.
19. United States v. Green, 389 F. Supp. 2d 29 (D. Mass. 2005), rev’d on other grounds, 426
F.3d 1 (1st Cir. 2005) (evaluating minority underrepresentation in the jury pool by comparing racial
composition of the voting-age population in the district with the racial breakdown indicated in juror
questionnaires returned to court); see also People v. Harris, 36 Cal. 3d 36, 679 P.2d 433 (Cal. 1984).
20. John Johnson v. Big Lots Stores, Inc., No. 04-321, 2008 U.S. Dist. LEXIS 35316, at *20
(E.D. La. Apr. 29, 2008); Stender v. Lucky Stores, Inc., 803 F. Supp. 259, 326 (N.D. Cal. 1992);
EEOC v. Sears, Roebuck & Co., 628 F. Supp. 1264, 1308 (N.D. Ill. 1986), aff’d, 839 F.2d 302 (7th
Cir. 1988).
21. John Johnson v. Big Lots Stores, Inc., 561 F. Supp. 2d 567 (E.D. La. 2008); Marlo v. United
Parcel Service, Inc., 251 F.R.D. 476 (C.D. Cal. 2008).
22. United States v. Scheffer, 523 U.S. 303, 309 (1998); United States v. Bishop, 64 F. Supp.
2d 1149 (D. Utah 1999); United States v. Varoudakis, No. 97-10158, 1998 WL 151238 (D. Mass.
Mar. 27, 1998); State v. Shively, 268 Kan. 573 (2000), aff’d, 268 Kan. 589 (2000) (all cases in which
courts determined, based on the inconsistent reactions revealed in several surveys, that the polygraph
test has failed to achieve general acceptance in the scientific community). Contra, see Lee v. Martinez,
136 N.M. 166, 179–81, 96 P.3d 291, 304–06 (N.M. 2004). People v. Williams, 830 N.Y.S.2d 452
(2006) (expert permitted to testify regarding scientific studies of factors affecting the perceptual ability
and memory of eyewitnesses to make identifications based in part on general acceptance demonstrated
in survey of experts who study eyewitness identification).
23. E.g., People v. Page Books, Inc., 601 N.E.2d 273, 279–80 (Ill. App. Ct. 1992); State v.
Williams, 598 N.E.2d 1250, 1256–58 (Ohio Ct. App. 1991).
24. E.g., United States v. Eagle, 586 F.2d 1193, 1195 (8th Cir. 1978); United States v. Tokars,
839 F. Supp. 1578, 1583 (D. Ga. 1993), aff’d, 95 F.3d 1520 (11th Cir. 1996); State v. Baumruk, 85
S.W.3d 644 (Mo. 2002); People v. Boss, 701 N.Y.S.2d 342 (App. Div. 1999).
25. Dolphin Tours, Inc. v. Pacifico Creative Servs., Inc., 773 F.2d 1506, 1508 (9th Cir. 1985).
See also SMS Sys. Maint. Servs., Inc. v. Digital Equip. Corp., 188 F.3d 11 (1st Cir. 1999); Benjamin
F. King, Statistics in Antitrust Litigation, in Statistics and the Law 49 (Morris H. DeGroot et al. eds.,
sports, a survey was offered for the purpose of establishing how girls felt about the
scheduling of girls’ and boys’ sports.26 A routine use of surveys in federal courts
occurs in Lanham Act27 cases, when the plaintiff alleges trademark infringement28
or claims that false advertising29 has confused or deceived consumers. The pivotal
legal question in such cases virtually demands survey research because it centers
on consumer perception and memory (i.e., is the consumer likely to be confused
about the source of a product, or does the advertisement imply a false or misleading message?).30 In addition, survey methodology has been used creatively to
assist federal courts in managing mass torts litigation. Faced with the prospect of
conducting discovery concerning 10,000 plaintiffs, the plaintiffs and defendants
in Wilhoite v. Olin Corp.31 jointly drafted a discovery survey that was administered
1986). Surveys have long been used in antitrust litigation to help define relevant markets. In United
States v. E.I. du Pont de Nemours & Co., 118 F. Supp. 41, 60 (D. Del. 1953), aff’d, 351 U.S. 377
(1956), a survey was used to develop the “market setting” for the sale of cellophane. In Mukand, Ltd.
v. United States, 937 F. Supp. 910 (Ct. Int’l Trade 1996), a survey of purchasers of stainless steel wire
rods was conducted to support a determination of competition and fungibility between domestic and
Indian wire rod.
26. Alston v. Virginia High Sch. League, Inc., 144 F. Supp. 2d 526, 539–40 (W.D. Va. 1999).
27. Lanham Act § 43(a), 15 U.S.C. § 1125(a) (1946) (amended 2006).
28. E.g., Herman Miller v. Palazzetti Imports & Exports, 270 F.3d 298, 312 (6th Cir. 2001)
(“Because the determination of whether a mark has acquired secondary meaning is primarily an empirical inquiry, survey evidence is the most direct and persuasive evidence.”); Simon Property Group v.
MySimon, 104 F. Supp. 2d 1033, 1038 (S.D. Ind. 2000) (“Consumer surveys are generally accepted
by courts as one means of showing the likelihood of consumer confusion.”). See also Qualitex Co. v.
Jacobson Prods. Co., No. CIV-90-1183HLH, 1991 U.S. Dist. LEXIS 21172 (C.D. Cal. Sept. 3, 1991),
aff’d in part & rev’d in part on other grounds, 13 F.3d 1297 (9th Cir. 1994), rev’d on other grounds, 514 U.S.
159 (1995); Union Carbide Corp. v. Ever-Ready, Inc., 531 F.2d 366 (7th Cir.), cert. denied, 429 U.S.
830 (1976). According to Neal Miller, Facts, Expert Facts, and Statistics: Descriptive and Experimental
Research Methods in Litigation, 40 Rutgers L. Rev. 101, 137 (1987), trademark law has relied on the
institutionalized use of statistical evidence more than any other area of the law.
29. E.g., Southland Sod Farms v. Stover Seed Co., 108 F.3d 1134, 1142–43 (9th Cir. 1997);
American Home Prods. Corp. v. Johnson & Johnson, 577 F.2d 160 (2d Cir. 1978); Rexall Sundown,
Inc. v. Perrigo Co., 651 F. Supp. 2d 9 (E.D.N.Y. 2009); Mutual Pharm. Co. v. Ivax Pharms. Inc., 459
F. Supp. 2d 925 (C.D. Cal. 2006); Novartis Consumer Health v. Johnson & Johnson-Merck Consumer
Pharms., 129 F. Supp. 2d 351 (D.N.J. 2000).
30. Courts have observed that “the court’s reaction is at best not determinative and at worst
irrelevant. The question in such cases is, what does the person to whom the advertisement is addressed
find to be the message?” American Brands, Inc. v. R.J. Reynolds Tobacco Co., 413 F. Supp. 1352,
1357 (S.D.N.Y. 1976). The wide use of surveys in recent years was foreshadowed in Triangle Publications, Inc. v. Rohrlich, 167 F.2d 969, 974 (2d Cir. 1948) (Frank, J., dissenting). Called on to determine
whether a manufacturer of girdles labeled “Miss Seventeen” infringed the trademark of the magazine
Seventeen, Judge Frank suggested that, in the absence of a test of the reactions of “numerous girls and
women,” the trial court judge’s finding as to what was likely to confuse was “nothing but a surmise, a
conjecture, a guess,” noting that “neither the trial judge nor any member of this court is (or resembles)
a teen-age girl or the mother or sister of such a girl.” Id. at 976–77.
31. No. CV-83-C-5021-NE (N.D. Ala. filed Jan. 11, 1983). The case ultimately settled before
trial. See Francis E. McGovern & E. Allan Lind, The Discovery Survey, Law & Contemp. Probs.,
Autumn 1988, at 41.
in person by neutral third parties, thus replacing interrogatories and depositions.
It resulted in substantial savings in both time and cost.
B. Surveys Used to Help Assess Expert Acceptance in the
Wake of Daubert
Scientists who offer expert testimony at trial typically present their own opinions.
These opinions may or may not be representative of the opinions of the scientific
community at large. In deciding whether to admit such testimony, courts applying the Frye test must determine whether the science being offered is generally
accepted by the relevant scientific community. Under Daubert as well, a relevant
factor used to decide admissibility is the extent to which the theory or technique
has received widespread acceptance. Properly conducted surveys can provide a
useful way to gauge acceptance, and courts recently have been offered assistance
from surveys that allegedly gauge relevant scientific opinion. As with any scientific research, the usefulness of the information obtained from a survey depends
on the quality of research design. Several critical factors have emerged that have
limited the value of some of these surveys: problems in defining the relevant target
population and identifying an appropriate sampling frame, response rates that raise
questions about the representativeness of the results, and a failure to ask questions
that assess opinions on the relevant issue.
Courts deciding on the admissibility of polygraph tests have considered results
from several surveys of purported experts. Surveys offered as providing evidence
of relevant scientific opinion have tested respondents from several populations:
(1) professional polygraph examiners,32 (2) psychophysiologists (members of the
Society for Psychophysiological Research),33 and (3) distinguished psychologists
(Fellows of the Division of General Psychology of the American Psychological
Association).34 Respondents in the first group expressed substantial confidence in
the scientific accuracy of polygraph testing, and those in the third group expressed
substantial doubts about it. Respondents in the second group were asked the same
question across three surveys that differed in other aspects of their methodology
(e.g., when testing occurred and what the response rate was). Although over 60%
of those questioned in two of the three surveys characterized the polygraph as a
useful diagnostic tool, one of the surveys was conducted in 1982 and the more
recent survey, published in 1984, achieved only a 30% response rate. The third
32. See plaintiff’s survey described in Meyers v. Arcudi, 947 F. Supp. 581, 588 (D. Conn. 1996).
33. Susan L. Amato & Charles R. Honts, What Do Psychophysiologists Think About Polygraph
Tests? A Survey of the Membership of SPR, 31 Psychophysiology S22 [abstract]; Gallup Organization,
Survey of Members of the Society for Psychophysiological Research Concerning Their Opinions of Polygraph Test
Interpretation, 13 Polygraph 153 (1984); William G. Iacono & David T. Lykken, The Validity of the Lie
Detector: Two Surveys of Scientific Opinion, 82 J. Applied Psychol. 426 (1997).
34. Iacono & Lykken, supra note 33.
survey, also conducted in 1984, achieved a response rate of 90% and found that
only 44% of respondents viewed the polygraph as a useful diagnostic tool. On the
basis of these inconsistent reactions from the several surveys, courts have determined that the polygraph has failed to achieve general acceptance in the scientific
community.35 In addition, however, courts have criticized the relevance of the
population surveyed by proponents of the polygraph. For example, in Meyers v.
Arcudi the court noted that the survey offered by proponents of the polygraph
was a survey of “practitioners who estimated the accuracy of the control question technique [of polygraph testing] to be between 86% and 100%.”36 The court
rejected the conclusions from this survey on the basis of a determination that the
population surveyed was not the relevant scientific community, noting that “many
of them . . . do not even possess advanced degrees and are not trained in the
scientific method.”37
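The reason a 30% response rate raises doubt about representativeness can be made concrete with a worst-case bounding exercise. The following sketch is purely illustrative (the function name is ours, and the figures simply echo the 60%/44% results discussed above, not any actual survey computation): when 70% of a sample never answers, the observed result is consistent with almost any true level of opinion in the population.

```python
def proportion_bounds(p_observed, response_rate):
    """Worst-case bounds on a population proportion under nonresponse.

    If a fraction `response_rate` of the sample responds and a fraction
    `p_observed` of responders agree, the true proportion lies between
    the extreme where no nonresponder agrees and the extreme where every
    nonresponder agrees.
    """
    low = p_observed * response_rate
    high = low + (1 - response_rate)
    return round(low, 3), round(high, 3)

# Hypothetical illustration using the figures discussed in the text:
# 60% agreement at a 30% response rate vs. 44% agreement at 90%.
print(proportion_bounds(0.60, 0.30))  # (0.18, 0.88): consistent with almost anything
print(proportion_bounds(0.44, 0.90))  # (0.396, 0.496): a much narrower range
```

The high-response survey pins the true proportion to a 10-point range; the low-response survey leaves a 70-point range, which is why courts and methodologists discount it.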
The link between specialized expertise and self-interest poses a dilemma in
defining the relevant scientific population. As the court in United States v. Orians
recognized, “The acceptance in the scientific community depends in large part on
how the relevant scientific community is defined.”38 In rejecting the defendants’
urging that the court consider as relevant only psychophysiologists whose work is
dedicated in large part to polygraph research, the court noted that Daubert “does
not require the court to limit its inquiry to those individuals that base their livelihood on the acceptance of the relevant scientific theory. These individuals are
often too close to the science and have a stake in its acceptance; i.e., their livelihood depends in part on the acceptance of the method.”39
To be relevant to a Frye or Daubert inquiry on general acceptance, the questions asked in a survey of experts should assess opinions on the quality of the
scientific theory and methodology, rather than asking whether or not the instrument should be used in a legal setting. Thus, a survey in which 60% of respondents agreed that the polygraph is “a useful diagnostic tool when considered with
other available information,” 1% viewed it as sufficiently reliable to be the sole
determinant, and the remainder thought it entitled to little or no weight, failed
to assess the relevant issue. As the court in United States v. Cordoba noted, because
“useful” and “other available information” could have many meanings, “there is
little wonder why [the response chosen by the majority of respondents] was most
frequently selected.”40
35. United States v. Scheffer, 523 U.S. 303, 309 (1998); United States v. Bishop, 64 F. Supp.
2d 1149 (D. Utah 1999); Meyers v. Arcudi, 947 F. Supp. 581, 588 (D. Conn. 1996); United States v.
Varoudakis, 48 Fed. R. Evid. Serv. 1187 (D. Mass. 1998).
36. Meyers v. Arcudi, 947 F. Supp. at 588.
37. Id.
38. 9 F. Supp. 2d 1168, 1173 (D. Ariz. 1998).
39. Id.
40. 991 F. Supp. 1199 (C.D. Cal. 1998), aff’d, 194 F.3d 1053 (9th Cir. 1999).
A similar flaw occurred in a survey conducted by experts opposed to the use
of the polygraph in trial proceedings. Survey respondents were asked whether they
would advocate that courts admit into evidence the outcome of a polygraph test.41
That question calls for more than an assessment of the accuracy of the polygraph,
and thus does not appropriately limit expert opinion to issues within the expert’s
competence, that is, to the accuracy of the information provided by the test
results. The survey also asked whether respondents agreed that the control question technique, the most common form of polygraph test, is accurate at least 85%
of the time in real-life applications for guilty and innocent subjects.42 Although
polygraph proponents frequently claim an accuracy level of 85%, it is up to the
courts to decide what accuracy level would be required to justify admissibility.
A better approach would be to ask survey respondents to estimate the level of
accuracy they believe the test is likely to produce.43
Surveys of experts are no substitute for an evaluation of whether the testimony an expert witness is offering will assist the trier of fact. Nonetheless, courts
can use an assessment of opinion in the relevant scientific community to aid in
determining whether a particular expert is proposing to use methods that would
be rejected by a representative group of experts to arrive at the opinion the expert
will offer. Properly conducted surveys can provide an economical way to collect
and present information on scientific consensus and dissensus.
C. Surveys Used to Help Assess Community Standards:
Atkins v. Virginia
In Atkins v. Virginia,44 the U.S. Supreme Court determined that the Eighth
Amendment’s prohibition of “cruel and unusual punishment” forbids the execution of mentally retarded persons.45 Following the interpretation advanced in
Trop v. Dulles46 that “The Amendment must draw its meaning from the evolving
standards of decency that mark the progress of a maturing society,”47 the Court
examined a variety of sources, including legislative judgments and public opinion
polls, to find that a national consensus had developed barring such executions.48
41. See Iacono & Lykken, supra note 33, at 430, tbl. 2 (1997).
42. Id.
43. At least two assessments should be made: an estimate of the accuracy for guilty subjects and
an estimate of the accuracy for innocent subjects.
44. 536 U.S. 304, 322 (2002).
45. Although some groups have recently moved away from the term “mental retardation” in
response to concerns that the term may have pejorative connotations, mental retardation was the name
used for the condition at issue in Atkins and it continues to be employed in federal laws, in cases
determining eligibility for the death penalty, and as a diagnosis by the medical profession.
46. 356 U.S. 86 (1958).
47. Id. at 101.
48. Atkins, 536 U.S. at 313–16.
In a vigorous dissent, Chief Justice Rehnquist objected to the use of the polls,
arguing that legislative judgments and jury decisions should be the sole indicators
of national opinion. He also objected to the particular polls cited in the majority
opinion, identifying what he viewed as serious methodological weaknesses.
The Court has struggled since Furman v. Georgia49 to develop an adequate
way to measure public standards regarding the application of the death penalty
to specific categories of cases. In relying primarily on surveys of state legislative
actions, the Court has ignored the forces that influence whether an issue emerges
on a legislative agenda, and the strong influence of powerful minorities on legislative actions.50 Moreover, the various members of the Court have disagreed about
whether states without any death penalty should be included in the count of states
that bar the execution of a particular category of defendant.
The Court has sometimes considered jury verdicts in assessing public standards. In Coker v. Georgia,51 the Court forbade the imposition of the death penalty
for rape. Citing Gregg v. Georgia52 for the proposition that “[t]he jury . . . is a
significant and reliable objective index of contemporary values because it is so
directly involved,” the Court noted that “in the vast majority of cases [of rape
in Georgia], at least 9 out of 10, juries have not imposed the death sentence.”53
In Atkins, Chief Justice Rehnquist complained about the absence of jury verdict
data.54 Had such data been available, however, they would have been irrelevant
because a “survey” of the jurors who have served in such cases would constitute a
biased sample of the public. A potential juror unwilling to impose the death penalty on a mentally retarded person would have been ineligible to serve in a capital
case involving a mentally retarded defendant because the juror would not have
been able to promise during voir dire that he or she would be willing to listen
to the evidence and impose the death penalty if the evidence warranted it. Thus,
the death-qualified jury in such a case would be composed only of representatives
from that subset of citizens willing to execute a mentally retarded defendant, an
unrepresentative and systematically biased sample.
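The selection effect described above can be sketched numerically. In this toy simulation (the numbers are entirely hypothetical; a 60% rate of opposition is assumed only for illustration), death-qualification screens out every prospective juror who opposes the penalty, so a "survey" of seated jurors registers zero opposition even though a majority of the community opposes it.

```python
import random

random.seed(0)  # fixed seed for a reproducible illustration

# Hypothetical community: 60 of 100 jury-eligible citizens oppose
# executing a mentally retarded defendant.
community = ["opposes"] * 60 + ["willing"] * 40

# Death-qualification removes every prospective juror who could not
# promise to impose the penalty if the evidence warranted it.
death_qualified = [c for c in community if c == "willing"]

jury = random.sample(death_qualified, 12)

community_opposition = sum(c == "opposes" for c in community) / len(community)
jury_opposition = sum(j == "opposes" for j in jury) / len(jury)

print(community_opposition)  # 0.6
print(jury_opposition)       # 0.0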
Public opinion surveys can provide an important supplementary source of
information about contemporary values.55 The Court in Atkins was presented with
data from 27 different polls and surveys,56 8 of them national and 19 statewide.
49. 408 U.S. 238 (1972).
50. See Stanford v. Kentucky, 492 U.S. 361 (1989), abrogated by Roper v. Simmons, 543 U.S.
551 (2005).
51. 433 U.S. 584, 596 (1977).
52. 428 U.S. 153, 181 (1976).
53. Coker v. Georgia, 433 U.S. at 596.
54. See Atkins, 536 U.S. at 323 (Rehnquist, C.J., dissenting).
55. See id. at 316 n.21 (“[T]heir consistency with the legislative evidence lends further support
to our conclusion that there is a consensus”).
56. The quality of any poll or survey depends on the methodology used, which should be fully
visible to the court and the opposing party. See Section VII, infra.
The information on the polling data appeared in an amicus brief filed by the
American Association on Mental Retardation.57 Respondents were asked in various ways how they felt about imposing the death penalty on a mentally retarded
defendant. In each poll, a majority of respondents expressed opposition to executing the mentally retarded. Chief Justice Rehnquist noted two weaknesses reflected
in the data presented to the Court. First, almost no information was provided
about the target populations from which the samples were drawn or the methodology of sample selection and data collection. Although further information was
available on at least some of the surveys (e.g., the nationwide telephone survey
of 1000 voters conducted in 1993 by the Tarrance Group used a sample based
on voter turnout in the last three presidential elections), that information apparently was not part of the court record. This omission violates accepted reporting
standards in survey research, and the information is needed if the decisionmaker
is to intelligently evaluate the quality of the survey. Its absence in this instance
occurred because the survey information was obtained from secondary sources.
A second objection raised by Chief Justice Rehnquist was that the wording of some of the questions required respondents to say merely whether they
favored or were opposed to the use of the death penalty when the defendant
is mentally retarded. It is unclear how a respondent who favors execution of a
mentally retarded defendant only in a rare case would respond to that question.
Some of the questions, however, did ask whether the respondent felt that it was
never appropriate to execute the mentally retarded or whether it was appropriate in some circumstances.58 In responses to these questions as well, a majority
of respondents said that they found the execution of mentally retarded persons
unacceptable under any circumstances. The critical point is that despite variations in wording of questions, the year in which the poll was conducted, who
conducted it, where it was conducted, and how it was carried out, a majority of respondents (between 56% and 83%) expressed opposition to executing
mentally retarded defendants. The Court thus was presented with a consistent
set of findings, providing striking reinforcement for the Atkins majority’s legislative analysis. Opinion poll data and legislative decisions have different strengths
and weaknesses as indicators of contemporary values. The value of a multiple-measure approach is that it avoids a potentially misleading reliance on a single
source or measure.
57. The data appear as an appendix to the Opinion of Chief Justice Rehnquist in Atkins.
58. Appendix to the Opinion of Chief Justice Rehnquist in Atkins. “Some people feel that there
is nothing wrong with imposing the death penalty on persons who are mentally retarded, depending
on the circumstances. Others feel that the death penalty should never be imposed on persons who are
mentally retarded under any circumstances. Which of these views comes closest to your own?” The
Tarrance Group, Death Penalty Poll, Q. 9 (Mar. 1993), citing Samuel R. Gross, Update: American Public
Opinion on the Death Penalty—It’s Getting Personal, 83 Cornell L. Rev. 1448, 1467 (1998).
D. A Comparison of Survey Evidence and Individual
Testimony
To illustrate the value of a survey, it is useful to compare the information that
can be obtained from a competently done survey with the information obtained
by other means. A survey is presented by a survey expert who testifies about the
responses of a substantial number of individuals who have been selected according
to an explicit sampling plan and asked the same set of questions by interviewers
who were not told who sponsored the survey or what answers were predicted
or preferred. Although parties presumably are not obliged to present a survey
conducted in anticipation of litigation by a nontestifying expert if it produced
unfavorable results,59 the court can and should scrutinize the method of respondent selection for any survey that is presented.
A party using a nonsurvey method generally identifies several witnesses who
testify about their own characteristics, experiences, or impressions. Although
the party has no obligation to select these witnesses in any particular way or to
report on how they were chosen, the party is not likely to select witnesses whose
attributes conflict with the party’s interests. The witnesses who testify are aware
of the parties involved in the case and have discussed the case before testifying.
Although surveys are not the only means of demonstrating particular facts,
presenting the results of a well-done survey through the testimony of an expert is
an efficient way to inform the trier of fact about a large and representative group
of potential witnesses. In some cases, courts have described surveys as the most
direct form of evidence that can be offered.60 Indeed, several courts have drawn
negative inferences from the absence of a survey, taking the position that failure
to undertake a survey may strongly suggest that a properly done survey would not
support the plaintiff’s position.61
59. In re FedEx Ground Package System, 2007 U.S. Dist. LEXIS 27086 (N.D. Ind. April 10,
2007); Loctite Corp. v. National Starch & Chem. Corp., 516 F. Supp. 190, 205 (S.D.N.Y. 1981)
(distinguishing between surveys conducted in anticipation of litigation and surveys conducted for nonlitigation purposes which cannot be reproduced because of the passage of time, concluding that parties
should not be compelled to introduce the former at trial, but may be required to provide the latter).
60. See, e.g., Morrison Entm’t Group v. Nintendo of Am., 56 Fed. App’x. 782, 785 (9th Cir.
Cal. 2003).
61. Ortho Pharm. Corp. v. Cosprophar, Inc., 32 F.3d 690, 695 (2d Cir. 1994); Henri’s Food
Prods. Co. v. Kraft, Inc., 717 F.2d 352, 357 (7th Cir. 1983); Medici Classics Productions LLC v.
Medici Group LLC, 590 F. Supp. 2d 548, 556 (S.D.N.Y. 2008); Citigroup v. City Holding Co.,
2003 U.S. Dist. LEXIS 1845 (S.D.N.Y. Feb. 10, 2003); Chum Ltd. v. Lisowski, 198 F. Supp. 2d 530
(S.D.N.Y. 2002).
II. Purpose and Design of the Survey
A. Was the Survey Designed to Address Relevant Questions?
The report describing the results of a survey should include a statement describing
the purpose or purposes of the survey. One indication that a survey offers probative evidence is that it was designed to collect information relevant to the legal
controversy (e.g., to estimate damages in an antitrust suit or to assess consumer
confusion in a trademark case). Surveys not conducted specifically in preparation
for, or in response to, litigation may provide important information,62 but they
frequently ask irrelevant questions63 or select inappropriate samples of respondents
for study.64 Nonetheless, surveys do not always achieve their stated goals. Thus,
the content and execution of a survey must be scrutinized whether or not the
survey was designed to provide relevant data on the issue before the court.65
Moreover, if a survey was not designed for purposes of litigation, one source of
bias is less likely: The party presenting the survey is less likely to have designed
and constructed the survey to provide evidence supporting its side of the issue in
controversy.
62. See, e.g., Wright v. Jeep Corp., 547 F. Supp. 871, 874 (E.D. Mich. 1982). Indeed, as courts
increasingly have been faced with scientific issues, parties have requested in a number of recent cases
that the courts compel production of research data and testimony by unretained experts. The circumstances under which an unretained expert can be compelled to testify or to disclose research data and
opinions, as well as the extent of disclosure that can be required when the research conducted by
the expert has a bearing on the issues in the case, are the subject of considerable current debate. See,
e.g., Joe S. Cecil, Judicially Compelled Disclosure of Research Data, 1 Cts. Health Sci. & L. 434 (1991);
Richard L. Marcus, Discovery Along the Litigation/Science Interface, 57 Brook. L. Rev. 381, 393–428
(1991); see also Court-Ordered Disclosure of Academic Research: A Clash of Values of Science and Law, Law
& Contemp. Probs., Summer 1996, at 1.
63. See Loctite Corp. v. National Starch & Chem. Corp., 516 F. Supp. 190, 206 (S.D.N.Y.
1981) (marketing surveys conducted before litigation were designed to test for brand awareness, while
the “single issue at hand . . . [was] whether consumers understood the term ‘Super Glue’ to designate
glue from a single source”).
64. In Craig v. Boren, 429 U.S. 190 (1976), the state unsuccessfully attempted to use its annual
roadside survey of the blood alcohol level, drinking habits, and preferences of drivers to justify prohibiting the sale of 3.2% beer to males under the age of 21 and to females under the age of 18. The
data were biased because it was likely that the male would be driving if both the male and female
occupants of the car had been drinking. As pointed out in 2 Joseph L. Gastwirth, Statistical Reasoning
in Law and Public Policy: Tort Law, Evidence, and Health 527 (1988), the roadside survey would
have provided more relevant data if all occupants of the cars had been included in the survey (and if
the type and amount of alcohol most recently consumed had been requested so that the consumption
of 3.2% beer could have been isolated).
65. See Merisant Co. v. McNeil Nutritionals, LLC, 242 F.R.D. 315 (E.D. Pa. 2007).
B. Was Participation in the Design, Administration, and
Interpretation of the Survey Appropriately Controlled to
Ensure the Objectivity of the Survey?
An early handbook for judges recommended that survey interviews be “conducted independently of the attorneys in the case.”66 Some courts interpreted this
to mean that any evidence of attorney participation is objectionable.67 A better
interpretation is that the attorney should have no part in carrying out the survey.68
However, some attorney involvement in the survey design is necessary to ensure
that relevant questions are directed to a relevant population.69 The 2010 amendments to Federal Rule of Civil Procedure 26(a)(2)70 no longer allow an inquiry
into the nature of communications between attorneys and experts, and so the role
of attorneys in constructing surveys may become less apparent. The key issues
for the trier of fact concerning the design of the survey are the objectivity and
relevance of the questions on the survey and the appropriateness of the definition
of the population used to guide sample selection. These aspects of the survey are
visible to the trier of fact and can be judged on their quality, irrespective of who
suggested them. In contrast, the interviews themselves are not directly visible, and
any potential bias is minimized by having interviewers and respondents blind to
the purpose and sponsorship of the survey and by excluding attorneys from any
part in conducting interviews and tabulating results.71
66. Judicial Conference of the United States, Handbook of Recommended Procedures for the
Trial of Protracted Cases 75 (1960).
67. See, e.g., Boehringer Ingelheim G.m.b.H. v. Pharmadyne Lab., 532 F. Supp. 1040, 1058
(D.N.J. 1980).
68. Upjohn Co. v. American Home Prods. Corp., No. 1-95-CV-237, 1996 U.S. Dist. LEXIS
8049, at *42 (W.D. Mich. Apr. 5, 1996) (objection that “counsel reviewed the design of the survey
carries little force with this Court because [opposing party] has not identified any flaw in the survey
that might be attributed to counsel’s assistance”). For cases in which attorney participation was linked
to significant flaws in the survey design, see Johnson v. Big Lots Stores, Inc., No. 04-321, 2008 U.S.
Dist. LEXIS 35316, at *20 (E.D. La. April 29, 2008); United States v. Southern Indiana Gas & Elec.
Co., 258 F. Supp. 2d 884, 894 (S.D. Ind. 2003); Gibson v. County of Riverside, 181 F. Supp. 2d
1057, 1069 (C.D. Cal. 2002).
69. See 6 J. Thomas McCarthy, McCarthy on Trademarks and Unfair Competition § 32:166
(4th ed. 2003).
70. www.uscourts.gov/News/TheThirdBranch/10-11-01/Rules_Recommendations_Take_Effect_December_1_2010.aspx.
71. Gibson, 181 F. Supp. 2d at 1068.
C. Are the Experts Who Designed, Conducted, or Analyzed
the Survey Appropriately Skilled and Experienced?
Experts prepared to design, conduct, and analyze a survey generally should have
graduate training in psychology (especially social, cognitive, or consumer psychology), sociology, political science, marketing, communication sciences, statistics,
or a related discipline; that training should include courses in survey research
methods, sampling, measurement, interviewing, and statistics. In some cases,
professional experience in teaching or conducting and publishing survey research
may provide the requisite background. In all cases, the expert must demonstrate an
understanding of foundational, current, and best practices in survey methodology,
including sampling,72 instrument design (questionnaire and interview construction), and statistical analysis.73 Publication in peer-reviewed journals, authored
books, fellowship status in professional organizations, faculty appointments, consulting experience, research grants, and membership on scientific advisory panels
for government agencies or private foundations are indications of a professional’s
area and level of expertise. In addition, some surveys involving highly technical
subject matter (e.g., the particular preferences of electrical engineers for various
pieces of electrical equipment and the bases for those preferences) or special populations (e.g., developmentally disabled adults with limited cognitive skills) may
require experts to have some further specialized knowledge. Under these conditions, the survey expert also should be able to demonstrate sufficient familiarity
with the topic or population (or assistance from an individual on the research
team with suitable expertise) to design a survey instrument that will communicate
clearly with relevant respondents.
D. Are the Experts Who Will Testify About Surveys
Conducted by Others Appropriately Skilled and Experienced?
Parties often call on an expert to testify about a survey conducted by someone else.
The secondary expert’s role is to offer support for a survey commissioned by the
party who calls the expert, to critique a survey presented by the opposing party, or
to introduce findings or conclusions from a survey not conducted in preparation
for litigation or by any of the parties to the litigation. The trial court should take
into account the exact issue that the expert seeks to testify about and the nature
of the expert’s field of expertise.74 The secondary expert who gives an opinion
72. The one exception is that sampling expertise would be unnecessary if the survey were
administered to all members of the relevant population. See, e.g., McGovern & Lind, supra note 31.
73. If survey expertise is being provided by several experts, a single expert may have general
familiarity but not special expertise in all these areas.
74. See Margaret A. Berger, The Admissibility of Expert Testimony, Section III.A, in this
manual.
about the adequacy and interpretation of a survey not only should have general
skills and experience with surveys and be familiar with all of the issues addressed
in this reference guide, but also should demonstrate familiarity with the following
properties of the survey being discussed:
1. Purpose of the survey;
2. Survey methodology,75 including
a. the target population,
b. the sampling design used in conducting the survey,
c. the survey instrument (questionnaire or interview schedule), and
d. (for interview surveys) interviewer training and instruction;
3. Results, including rates and patterns of missing data; and
4. Statistical analyses used to interpret the results.
III. Population Definition and Sampling
A. Was an Appropriate Universe or Population Identified?
One of the first steps in designing a survey or in deciding whether an existing
survey is relevant is to identify the target population (or universe).76 The target
population consists of all elements (i.e., individuals or other units) whose characteristics or perceptions the survey is intended to represent. Thus, in trademark
litigation, the relevant population in some disputes may include all prospective
and past purchasers of the plaintiff’s goods or services and all prospective and past
purchasers of the defendant’s goods or services. Similarly, the population for a discovery survey may include all potential plaintiffs or all employees who worked for
Company A between two specific dates. In a community survey designed to provide evidence for a motion for a change of venue, the relevant population consists
of all jury-eligible citizens in the community in which the trial is to take place.77
75. See A & M Records, Inc. v. Napster, Inc., 2000 U.S. Dist. LEXIS 20668 (N.D. Cal. Aug. 10,
2000) (holding that expert could not attest credibly that the surveys upon which he relied conformed
to accepted survey principles because of his minimal role in overseeing the administration of the survey
and limited expert report).
76. Identification of the proper target population or universe is recognized uniformly as a key
element in the development of a survey. See, e.g., Judicial Conference of the U.S., supra note 66; MCL
4th, supra note 16, § 11.493; see also 3 McCarthy, supra note 69, § 32:166; Council of Am. Survey
Res. Orgs., Code of Standards and Ethics for Survey Research § III.A.3 (2010).
77. A second relevant population may consist of jury-eligible citizens in the community where
the party would like to see the trial moved. By questioning citizens in both communities, the survey
can test whether moving the trial is likely to reduce the level of animosity toward the party requesting
the change of venue. See United States v. Haldeman, 559 F.2d 31, 140, 151, app. A at 176–79 (D.C.
Cir. 1976) (court denied change of venue over the strong objection of Judge MacKinnon, who cited
survey evidence that Washington, D.C., residents were substantially more likely to conclude, before
Reference Guide on Survey Research
The definition of the relevant population is crucial because there may be systematic differences in the responses of members of the population and nonmembers.
For example, consumers who are prospective purchasers may know more about
the product category than consumers who are not considering making a purchase.
The universe must be defined carefully. For example, a commercial for a toy
or breakfast cereal may be aimed at children, who in turn influence their parents’
purchases. If a survey assessing the commercial’s tendency to mislead were conducted based on a sample from the target population of prospective and actual
adult purchasers, it would exclude a crucial relevant population. The appropriate
population in this instance would include children as well as parents.78
B. Did the Sampling Frame Approximate the Population?
The target population consists of all the individuals or units that the researcher
would like to study. The sampling frame is the source (or sources) from which
the sample actually is drawn. The surveyor’s job generally is easier if a complete
list of every eligible member of the population is available (e.g., all plaintiffs in
a discovery survey), so that the sampling frame lists the identity of all members
of the target population. Frequently, however, the target population includes
members who are inaccessible or who cannot be identified in advance. As a
result, reasonable compromises are sometimes required in developing the sampling
frame. The survey report should contain (1) a description of the target population, (2) a description of the sampling frame from which the sample is to be
drawn, (3) a discussion of the difference between the target population and the
sampling frame, and, importantly, (4) an evaluation of the likely consequences of
that difference.
A survey that provides information about a wholly irrelevant population
is itself irrelevant.79 Courts are likely to exclude the survey or accord it little
trial, that the defendants were guilty); see also People v. Venegas, 31 Cal. Rptr. 2d 114, 117 (Cal. Ct.
App. 1994) (change of venue denied because defendant failed to show that the defendant would face
a less hostile jury in a different court).
78. See, e.g., Warner Bros., Inc. v. Gay Toys, Inc., 658 F.2d 76 (2d Cir. 1981) (surveying
children users of the product rather than parent purchasers). Children and some other populations
create special challenges for researchers. For example, very young children should not be asked about
sponsorship or licensing, concepts that are foreign to them. Concepts, as well as wording, should be
age appropriate.
79. A survey aimed at assessing how persons in the trade respond to an advertisement should
be conducted on a sample of persons in the trade and not on a sample of consumers. See Home Box
Office v. Showtime/The Movie Channel, 665 F. Supp. 1079, 1083 (S.D.N.Y.), aff’d in part and vacated
in part, 832 F.2d 1311 (2d Cir. 1987); J & J Snack Food Corp. v. Earthgrains Co., 220 F. Supp. 2d
358, 371–72 (D.N.J. 2002). But see Lon Tai Shing Co. v. Koch + Lowy, No. 90-C4464, 1990 U.S. Dist.
LEXIS 19123, at *50 (S.D.N.Y. Dec. 14, 1990), in which the judge was willing to find likelihood
of consumer confusion from a survey of lighting store salespersons questioned by a survey researcher
posing as a customer. The court was persuaded that the salespersons who were misstating the source
weight.80 Thus, when the plaintiff submitted the results of a survey to prove that
the green color of its fishing rod had acquired a secondary meaning, the court
gave the survey little weight in part because the survey solicited the views of fishing rod dealers rather than consumers.81 More commonly, however, the sampling
frame and the target population have some overlap, but the overlap is imperfect:
The sampling frame excludes part of the target population, that is, it is underinclusive, or the sampling frame includes individuals who are not members of
the target population, that is, it is overinclusive relative to the target population.
Coverage error is the term used to describe inconsistencies between a sampling
frame and a target population. If the coverage is underinclusive, the survey’s value
depends on the proportion of the target population that has been excluded from
the sampling frame and the extent to which the excluded population is likely to
respond differently from the included population. Thus, a survey of spectators
and participants at running events would be sampling a sophisticated subset of
those likely to purchase running shoes. Because this subset probably would consist
of the consumers most knowledgeable about the trade dress used by companies
that sell running shoes, a survey based on this sampling frame would be likely to
substantially overrepresent the strength of a particular design as a trademark, and
the extent of that overrepresentation would be unknown and not susceptible to
any reasonable estimation.82
Similarly, in a survey designed to project demand for cellular phones, the
assumption that businesses would be the primary users of cellular service led
surveyors to exclude potential nonbusiness users from the survey. The Federal
Communications Commission (FCC) found the assumption unwarranted and
concluded that the research was flawed, in part because of this underinclusive
coverage.83 With the growth in individual cell phone use over time, noncoverage
error would be an even greater problem for this survey today.
of the lamp, whether consciously or not, must have believed reasonably that the consuming public
would be likely to rely on the salespersons’ inaccurate statements about the name of the company that
manufactured the lamp they were selling.
80. See Wells Fargo & Co. v. WhenU.com, Inc., 293 F. Supp. 2d 734 (E.D. Mich. 2003).
81. See R.L. Winston Rod Co. v. Sage Mfg. Co., 838 F. Supp. 1396, 1401–02 (D. Mont. 1993).
82. See Brooks Shoe Mfg. Co. v. Suave Shoe Corp., 533 F. Supp. 75, 80 (S.D. Fla. 1981), aff’d,
716 F.2d 854 (11th Cir. 1983); see also Hodgdon Power Co. v. Alliant Techsystems, Inc., 512 F. Supp.
2d 1178 (D. Kan. 2007) (excluding survey on gunpowder brands distributed at plaintiff’s promotional
booth at a shooting tournament); Winning Ways, Inc. v. Holloway Sportswear, Inc., 913 F. Supp.
1454, 1467 (D. Kan. 1996) (survey flawed in failing to include sporting goods customers who constituted a major portion of customers). But see Thomas & Betts Corp. v. Panduit Corp., 138 F.3d 277,
294–95 (7th Cir. 1998) (survey of store personnel admissible because relevant market included both
distributors and ultimate purchasers).
83. See Gencom, Inc., 56 Rad. Reg. 2d (P&F) 1597, 1604 (1984). This position was affirmed on
appeal. See Gencom, Inc. v. FCC, 832 F.2d 171, 186 (D.C. Cir. 1987); see also Beacon Mut. Ins. Co.
v. Onebeacon Ins. Corp., 376 F. Supp. 2d 251, 261 (D.R.I. 2005) (sample included only defendant’s
insurance agents and lack of confusion among those agents was “nonstartling”).
In some cases, it is difficult to determine whether a sampling frame that
omits some members of the population distorts the results of the survey and, if
so, the extent and likely direction of the bias. For example, a trademark survey
was designed to test the likelihood of confusing an analgesic currently on the
market with a new product that was similar in appearance.84 The plaintiff’s survey
included only respondents who had used the plaintiff’s analgesic, and the court
found that the target population should have included users of other analgesics,
“so that the full range of potential customers for whom plaintiff and defendants
would compete could be studied.”85 In this instance, it is unclear whether users
of the plaintiff’s product would be more or less likely to be confused than users of
the defendants’ product or users of a third analgesic.86
An overinclusive sampling frame generally presents less of a problem for interpretation than does an underinclusive sampling frame.87 If the survey expert can
demonstrate that a sufficiently large (and representative) subset of respondents in
the survey was drawn from the appropriate sampling frame, the responses obtained
from that subset can be examined, and inferences about the relevant population
can be drawn based on that subset.88 If the relevant subset cannot be identified,
however, an overbroad sampling frame will reduce the value of the survey.89 If
the sampling frame does not include important groups in the target population,
there is generally no way to know how the unrepresented members of the target
population would have responded.90
84. See American Home Prods. Corp. v. Barr Lab., Inc., 656 F. Supp. 1058 (D.N.J.), aff’d, 834
F.2d 368 (3d Cir. 1987).
85. Id. at 1070.
86. See also Craig v. Boren, 429 U.S. 190 (1976).
87. See Schwab v. Philip Morris USA, Inc. 449 F. Supp. 2d 992, 1134–35 (E.D.N.Y. 2006)
(“Studies evaluating broadly the beliefs of low tar smokers generally are relevant to the beliefs of ‘light’
smokers more specifically.”).
88. See National Football League Props. Inc. v. Wichita Falls Sportswear, Inc. 532 F. Supp. 651,
657–58 (W.D. Wash. 1982).
89. See Leelanau Wine Cellars, Ltd. v. Black & Red, Inc., 502 F.3d 504, 518 (6th Cir. 2007)
(lower court was correct in giving little weight to survey with overbroad universe); Big Dog Motorcycles, L.L.C. v. Big Dog Holdings, Inc., 402 F. Supp. 2d 1312, 1334 (D. Kan. 2005) (universe composed of prospective purchasers of all t-shirts and caps overinclusive for evaluating reactions of buyers
likely to purchase merchandise at motorcycle dealerships). See also Schieffelin & Co. v. Jack Co. of
Boca, 850 F. Supp. 232, 246 (S.D.N.Y. 1994).
90. See, e.g., Amstar Corp. v. Domino’s Pizza, Inc., 615 F.2d 252, 263–64 (5th Cir. 1980) (court
found both plaintiff’s and defendant’s surveys substantially defective for a systematic failure to include
parts of the relevant population); Scott Fetzer Co. v. House of Vacuums, Inc., 381 F.3d 477 (5th Cir.
2004) (universe drawn from plaintiff’s customer list underinclusive and likely to differ in their familiarity with plaintiff’s marketing and distribution techniques).
C. Does the Sample Approximate the Relevant Characteristics
of the Population?
Identification of a survey population must be followed by selection of a sample
that accurately represents that population.91 The use of probability sampling techniques maximizes both the representativeness of the survey results and the ability
to assess the accuracy of estimates obtained from the survey.
Probability samples range from simple random samples to complex multistage
sampling designs that use stratification, clustering of population elements into
various groupings, or both. In all forms of probability sampling, each element
in the relevant population has a known, nonzero probability of being included in
the sample.92 In simple random sampling, the most basic type of probability sampling, every element in the population has a known, equal probability of being
included in the sample, and all possible samples of a given size are equally likely to
be selected.93 Other probability sampling techniques include (1) stratified random
sampling, in which the researcher subdivides the population into mutually exclusive and exhaustive subpopulations, or strata, and then randomly selects samples
from within these strata; and (2) cluster sampling, in which elements are sampled
in groups or clusters, rather than on an individual basis.94 Note that selection
probabilities do not need to be the same for all population elements; however, if
the probabilities are unequal, compensatory adjustments should be made in the
analysis.
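The sampling designs just described can be sketched in a few lines of Python. This is a hypothetical illustration, not part of the manual; the population, stratum labels, and sample sizes are invented:

```python
import random

random.seed(7)  # fixed seed so the illustration is reproducible

# Hypothetical population of 1,000 dentists, 60% labeled more experienced.
population = [{"id": i, "stratum": "experienced" if i < 600 else "newer"}
              for i in range(1000)]

# Simple random sampling: every element has the same selection probability,
# and every possible sample of size n is equally likely.
srs = random.sample(population, 100)

# Stratified random sampling: draw independently within each stratum.
def stratified_sample(pop, n_per_stratum):
    chosen = []
    for stratum, n in n_per_stratum.items():
        members = [e for e in pop if e["stratum"] == stratum]
        chosen.extend(random.sample(members, n))
    return chosen

# Proportionate allocation: 60/40, matching the strata's population shares.
strat = stratified_sample(population, {"experienced": 60, "newer": 40})
print(len(srs), len(strat))  # 100 100
```

Cluster sampling would instead draw whole groups (e.g., dental practices) at random and then survey elements within the chosen clusters.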
Probability sampling offers two important advantages over other types of
sampling. First, the sample can provide an unbiased estimate that summarizes the
responses of all persons in the population from which the sample was drawn; that
is, the expected value of the sample estimate is the population value being estimated. Second, the researcher can calculate a confidence interval that describes
explicitly how reliable the sample estimate of the population is likely to be. If
the sample is unbiased, the difference between the estimate and the exact value
is called the sampling error.95 Thus, suppose a survey collected responses from a
simple random sample of 400 dentists selected from the population of all dentists
91. MCL 4th, supra note 16, § 11.493. See also David H. Kaye & David A. Freedman, Reference
Guide on Statistics, Section II.B, in this manual.
92. The exception is that population elements omitted from the sampling frame have a zero
probability of being sampled.
93. Systematic sampling, in which every nth unit in the population is sampled and the starting
point is selected randomly, fulfills the first of these conditions. It does not fulfill the second, because
no systematic sample can include elements adjacent to one another on the list of population members
from which the sample is drawn. Except in unusual situations when periodicities occur, systematic
samples and simple random samples generally produce the same results. Thomas Piazza, Fundamentals
of Applied Sampling, in Handbook of Survey Research, supra note 1, at 139, 145.
94. Id. at 139, 150–63.
95. See David H. Kaye & David A. Freedman, supra note 91, Glossary, for a definition of
sampling error.
licensed to practice in the United States and found that 80, or 20%, of them
mistakenly believed that a new toothpaste, Goldgate, was manufactured by the
makers of Colgate. A survey expert could properly compute a confidence interval
around the 20% estimate obtained from this sample. If the survey were repeated
a large number of times, and a 95% confidence interval was computed each time,
95% of the confidence intervals would include the actual percentage of dentists
in the entire population who would believe that Goldgate was manufactured by
the makers of Colgate.96 In this example, the margin of error is ±4%, and so the
confidence interval is the range between 16% and 24%, that is, the estimate (20%)
plus or minus 4%.
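The dentist example can be verified with a short calculation. This sketch uses the standard normal-approximation margin of error for a proportion from a simple random sample (z = 1.96 for 95% confidence):

```python
import math

# The text's example: 80 of 400 sampled dentists (20%) hold the mistaken belief.
n = 400
p = 80 / n

# Normal-approximation 95% margin of error for a proportion (z = 1.96),
# valid for a simple random sample.
margin = 1.96 * math.sqrt(p * (1 - p) / n)
low, high = p - margin, p + margin

print(f"{p:.0%} +/- {margin:.1%} -> ({low:.1%}, {high:.1%})")
# about +/-3.9%, which rounds to the +/-4% (16%-24%) interval in the text
```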
All sample surveys produce estimates of population values, not exact measures
of those values. Strictly speaking, the margin of error associated with the sample
estimate assumes probability sampling. Assuming a probability sample, a confidence interval describes how stable the mean response in the sample is likely to
be. The width of the confidence interval depends on three primary characteristics:
1. Size of the sample (the larger the sample, the narrower the interval);
2. Variability of the response being measured; and
3. Confidence level the researcher wants to have.97
Traditionally, scientists adopt the 95% level of confidence, which means that if 100 samples of the same size were drawn, the confidence intervals computed for at least 95 of the samples would be expected to include the true population value.98
Stratified probability sampling can be used to obtain more precise response
estimates by using what is known about characteristics of the population that are
likely to be associated with the response being measured. Suppose, for example,
we anticipated that more-experienced and less-experienced dentists might respond
differently to Goldgate toothpaste, and we had information on the year in which
each dentist in the population began practicing. By dividing the population of
dentists into more- and less-experienced strata (e.g., in practice 15 years or more
versus in practice less than 15 years) and then randomly sampling within each experience stratum, we would be able to ensure that the sample contained precisely
96. Actually, because survey interviewers would be unable to locate some dentists and some
dentists would be unwilling to participate in the survey, technically the population to which this sample
would be projectable would be all dentists with current addresses who would be willing to participate
in the survey if they were asked. The expert should be prepared to discuss possible sources of bias due
to, for example, an address list that is not current.
97. When the sample design does not use a simple random sample, the confidence interval will
be affected.
98. To increase the likelihood that the confidence interval contains the actual population value (e.g., moving from a 95% to a 99% confidence level) without increasing the sample size, the confidence interval can be widened; a wider interval corresponds to a higher level of confidence. For
further discussion of confidence intervals, see David H. Kaye & David A. Freedman, Reference Guide
on Statistics, Section IV.A, in this manual.
proportionate representation from each stratum, in this case, more- and less-experienced dentists. That is, if 60% of dentists were in practice 15 years or more,
we could select 60% of the sample from the more-experienced stratum and 40%
from the less-experienced stratum and be sure that the sample would have proportionate representation from each stratum, reducing the likely sampling error.99
In proportionate stratified probability sampling, as in simple random sampling,
each individual member of the population has an equal chance of being selected.
Stratified probability sampling can also disproportionately sample from different
strata, a procedure that will produce more precise estimates if some strata are more
heterogeneous than others on the measure of interest.100 Disproportionate sampling may also be used to enable the survey to provide separate estimates for particular
subgroups. With disproportionate sampling, sampling weights must be used in
the analysis to accurately describe the characteristics of the population as a whole.
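The weighting adjustment for disproportionate stratified sampling can be sketched as follows. The numbers are hypothetical; each response is weighted by the inverse of its selection probability, as the text requires:

```python
# Hypothetical disproportionate stratified sample: the smaller "newer"
# stratum is oversampled, so inverse-probability weights are needed.
strata = {
    # stratum: (population size, sample size, number confused in sample)
    "experienced": (6000, 100, 30),  # sampled at 1 in 60
    "newer":       (4000, 200, 80),  # oversampled at 1 in 20
}

weighted_confused = 0.0
population_total = 0
for pop_n, samp_n, confused in strata.values():
    weight = pop_n / samp_n  # inverse of the selection probability
    weighted_confused += confused * weight
    population_total += pop_n

# Weighted estimate for the whole population; the unweighted figure
# (110/300 = 36.7%) would be biased toward the oversampled stratum.
print(f"{weighted_confused / population_total:.1%}")  # 34.0%
```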
Although probability sample surveys often are conducted in organizational
settings and are the recommended sampling approach in academic and government publications on surveys, probability sample surveys can be expensive when
in-person interviews are required, the target population is dispersed widely, or
members of the target population are rare. A majority of the consumer surveys
conducted for Lanham Act litigation present results from nonprobability convenience samples.101 They are admitted into evidence based on the argument that
nonprobability sampling is used widely in marketing research and that “results of
these studies are used by major American companies in making decisions of considerable consequence.”102 Nonetheless, when respondents are not selected randomly
from the relevant population, the expert should be prepared to justify the method
used to select respondents. Special precautions are required to reduce the likelihood
of biased samples.103 In addition, quantitative values computed from such samples
(e.g., percentage of respondents indicating confusion) should be viewed as rough
99. See Pharmacia Corp. v. Alcon Lab., 201 F. Supp. 2d 335, 365 (D.N.J. 2002).
100. Robert M. Groves et al., Survey Methodology, Stratification and Stratified Sampling,
106–18 (2004).
101. Jacob Jacoby & Amy H. Handlin, Non-Probability Sampling Designs for Litigation Surveys, 81
Trademark Rep. 169, 173 (1991). For probability surveys conducted in trademark cases, see James
Burrough, Ltd. v. Sign of Beefeater, Inc., 540 F.2d 266 (7th Cir. 1976); Nightlight Systems, Inc., v.
Nite Lights Franchise Sys., 2007 U.S. Dist. LEXIS 95565 (N.D. Ga. July 17, 2007); National Football
League Props., Inc. v. Wichita Falls Sportswear, Inc., 532 F. Supp. 651 (W.D. Wash. 1982).
102. National Football League Props., Inc. v. New Jersey Giants, Inc., 637 F. Supp. 507, 515
(D.N.J. 1986). A survey of members of the Council of American Survey Research Organizations,
the national trade association for commercial survey research firms in the United States, revealed that
95% of the in-person independent contacts in studies done in 1985 took place in malls or shopping
centers. Jacoby & Handlin, supra note 101, at 172–73, 176. More recently, surveys conducted over
the Internet have been administered to samples of respondents drawn from panels of volunteers; see
infra Section IV.G.4 for a discussion of online surveys. Although panel members may be randomly
selected from the panel population to complete the survey, the panel population itself is not usually
the product of a random selection process.
103. See infra Sections III.D–E.
indicators rather than as precise quantitative estimates.104 Confidence intervals technically should not be computed, although if the calculation shows a wide interval,
that may be a useful indication of the limited value of the estimate.
D. What Is the Evidence That Nonresponse Did Not Bias the
Results of the Survey?
Even when a sample is drawn randomly from a complete list of elements in the target population, responses or measures may be obtained on only part of the selected
sample. If this lack of response is distributed randomly, valid inferences about the
population can be drawn with assurance using the measures obtained from the available elements in the sample. The difficulty is that nonresponse often is not random,
so that, for example, persons who are single typically have three times the “not
at home” rate in U.S. Census Bureau surveys as do family members.105 Efforts to
increase response rates include making several attempts to contact potential respondents, sending advance letters,106 and providing financial or nonmonetary incentives
for participating in the survey.107
The key to evaluating the effect of nonresponse in a survey is to determine
as much as possible the extent to which nonrespondents differ from the respondents in the nature of the responses they would provide if they were present
in the sample. That is, the difficult question to address is the extent to which
nonresponse has biased the pattern of responses by undermining the representativeness of the sample and, if it has, the direction of that bias. It is incumbent
on the expert presenting the survey results to analyze the level and sources of
nonresponse, and to assess how that nonresponse is likely to have affected the
results. On some occasions, it may be possible to anticipate systematic patterns of
nonresponse. For example, a survey that targets a population of professionals may
encounter difficulty in obtaining the same level of participation from individuals
with high-volume practices that can be obtained from those with lower-volume
practices. To enable the researcher to assess whether response rate varies with the
volume of practice, it may be possible to identify in advance potential respondents
104. The court in Kinetic Concept, Inc. v. Bluesky Medical Corp., 2006 U.S. Dist. LEXIS
60187, at *14 (W.D. Tex. Aug. 11, 2006), found the plaintiff’s survey using a nonprobability sample to
be admissible and permitted the plaintiff’s expert to present results from a survey using a convenience
sample. The court then assisted the jury by providing an instruction on the differences between probability and convenience samples and the estimates obtained from each.
105. 2 Gastwirth, supra note 64, at 501. This volume contains a useful discussion of sampling,
along with a set of examples. Id. at 467.
106. Edith De Leeuw et al., The Influence of Advance Letters on Response in Telephone Surveys:
A Meta-analysis, 71 Pub. Op. Q. 413 (2007) (advance letters effective in increasing response rates in
telephone as well as mail and face-to-face surveys).
107. Erica Ryu et al., Survey Incentives: Cash vs. In-kind; Face-to-Face vs. Mail; Response Rate vs.
Nonresponse Error, 18 Int’l J. Pub. Op. Res. 89 (2005).
with varying years of experience. Even if it is not possible to know in advance
the level of experience of each potential member in the target population and
to design a sampling plan that will produce representative samples at each level
of experience, the survey itself can include questions about volume of practice
that will permit the expert to assess how experience level may have affected the
pattern of results.108
Although high response rates (i.e., 80% or higher)109 are desirable because
they generally eliminate the need to address the issue of potential bias from
nonresponse,110 such high response rates are increasingly difficult to achieve.
Survey nonresponse rates have risen substantially in recent years, along with the
costs of obtaining responses, and so the issue of nonresponse has attracted substantial attention from survey researchers.111 Researchers have developed a variety
of approaches to adjust for nonresponse, including weighting obtained responses
in proportion to known demographic characteristics of the target population,
comparing the pattern of responses from early and late responders to mail surveys,
or the pattern of responses from easy-to-reach and hard-to-reach responders in
telephone surveys, and imputing estimated responses to nonrespondents based on
known characteristics of those who have responded. All of these techniques can
only approximate the response patterns that would have been obtained if nonrespondents had responded. Nonetheless, they are useful for testing the robustness
of the findings based on estimates obtained from the simple aggregation of answers
to questions given by responders.
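The first adjustment technique above, weighting obtained responses in proportion to known characteristics of the target population, might look like this in outline (all figures are hypothetical):

```python
# Hypothetical survey: younger respondents are underrepresented relative to
# known population benchmarks, so their responses are weighted up.
population_share = {"under_40": 0.50, "over_40": 0.50}  # known benchmark
respondents      = {"under_40": 120,  "over_40": 280}   # obtained sample
answered_yes     = {"under_40": 60,   "over_40": 70}    # e.g., "confused"

n = sum(respondents.values())
weighted_yes = 0.0
for group in respondents:
    # Weight = population share / sample share for the group.
    weight = population_share[group] / (respondents[group] / n)
    weighted_yes += answered_yes[group] * weight

print(f"unweighted estimate: {sum(answered_yes.values()) / n:.1%}")  # 32.5%
print(f"weighted estimate:   {weighted_yes / n:.1%}")                # 37.5%
```

The gap between the two estimates illustrates the point in the text: such weighting only approximates the responses that would have been obtained had nonrespondents answered, and it assumes nonrespondents resemble respondents within each weighting group.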
To assess the general impact of the lower response rates, researchers have
conducted comparison studies evaluating the results obtained from surveys with
108. In People v. Williams, supra note 22, a published survey of experts in eyewitness research
was used to show general acceptance of various eyewitness phenomena. See Saul Kassin et al., On the
“General Acceptance” of Eyewitness Testimony Research: A New Survey of the Experts, 56 Am. Psychologist
405 (2001). The survey included questions on the publication activity of respondents and compared
the responses of those with high and low research productivity. Productivity levels in the respondent
sample suggested that respondents constituted a blue ribbon group of leading researchers. Williams, 830
N.Y.S.2d at 457 n.16. See also Pharmacia Corp. v. Alcon Lab., Inc., 201 F. Supp. 2d 335 (D.N.J. 2002).
109. Note that methods of computing response rates vary. For example, although response rate
can be generally defined as the number of complete interviews with reporting units divided by the
number of eligible reporting units in the sample, decisions on how to treat partial completions and
how to estimate the eligibility of nonrespondents can produce differences in measures of response
rate. E.g., American Association of Public Opinion Research, Standard Definitions: Final Dispositions
of Case Codes and Outcome Rates for Surveys (rev. 2008), available at www.aapor.org/uploads/
Standard_Definitions_07-08_Final.pdf.
110. Office of Management and Budget, Standards and Guidelines for Statistical Surveys (Sept.
2006), Guideline 1.3.4: Plan for a nonresponse bias analysis if the expected unit response rate is below
80%. See Albert v. Zabin, 2009 Mass. App. Unpub. LEXIS 572 (July 14, 2009) (reversing summary judgment that had excluded surveys with response rates of 27% and 31%, based on a thoughtful analysis of measures taken to assess potential nonresponse bias).
111. E.g., Richard Curtin et al., Changes in Telephone Survey Nonresponse Over the Past Quarter
Century, 69 Pub. Op. Q. 87 (2005); Survey Nonresponse (Robert M. Groves et al. eds., 2002).
varying response rates.112 Contrary to earlier assumptions, surprisingly comparable
results have been obtained in many surveys with varying response rates, suggesting
that surveys may achieve reasonable estimates even with relatively low response
rates. The key is whether nonresponse is associated with systematic differences in
response that cannot be adequately modeled or assessed.
Determining whether the level of nonresponse in a survey seriously impairs
inferences drawn from the results of a survey generally requires an analysis of the
determinants of nonresponse. For example, even a survey with a high response
rate may seriously underrepresent some portions of the population, such as the
unemployed or the poor. If a general population sample is used to chart changes
in the proportion of the population that knows someone with HIV, the survey
would underestimate the population value if some groups more likely to know
someone with HIV (e.g., intravenous drug users) are underrepresented in the
sample. The survey expert should be prepared to provide evidence on the potential impact of nonresponse on the survey results.
In surveys that include sensitive or difficult questions, particularly surveys
that are self-administered, some respondents may refuse to provide answers or
may provide incomplete answers (i.e., item rather than unit nonresponse).113
To assess the impact of nonresponse to a particular question, the survey expert
should analyze the differences between those who answered and those who did
not answer. Procedures to address the problem of missing data include recontacting respondents to obtain the missing answers and using the respondent’s other
answers to predict the missing response (i.e., imputation).114
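One simple form of imputation uses a respondent's answer to another question to predict the missing value. The sketch below fills a missing item with the mean answer among respondents in the same group; the data and variable names are invented for illustration, and real surveys typically use more careful models:

```python
# Hypothetical sketch of group-mean imputation for item nonresponse:
# a missing answer is replaced by the mean answer of respondents who
# gave the same response to an auxiliary question (here, age group).
from statistics import mean

# Each respondent: (age_group, satisfaction 1-5, or None if item nonresponse)
respondents = [
    ("young", 4), ("young", 5), ("young", None),
    ("older", 2), ("older", 3), ("older", None),
]

def impute(records):
    # Mean of observed answers within each level of the auxiliary variable
    group_means = {}
    for group in {g for g, _ in records}:
        observed = [v for g, v in records if g == group and v is not None]
        group_means[group] = mean(observed)
    # Replace each missing value with its group's mean
    return [(g, v if v is not None else group_means[g]) for g, v in records]

imputed = impute(respondents)
```

Note that any imputation rests on the assumption that nonrespondents resemble respondents in their group, which is itself a judgment the expert should be prepared to defend.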
E. What Procedures Were Used to Reduce the Likelihood of a
Biased Sample?
If it is impractical for a survey researcher to sample randomly from the entire target
population, the researcher still can apply probability sampling to some aspects of
respondent selection to reduce the likelihood of biased selection. For example,
in many studies the target population consists of all consumers or purchasers of
a product. Because it is impractical to randomly sample from that population,
research is often conducted in shopping malls where some members of the target
population may not shop. Mall locations, however, can be sampled randomly
from a list of possible mall sites. By administering the survey at several different
112. E.g., Daniel M. Merkle & Murray Edelman, Nonresponse in Exit Polls: A Comprehensive
Analysis, in Survey Nonresponse, supra note 111, at 243–57 (finding minimal nonresponse error associated with refusals to participate in in-person exit polls); see also Jon A. Krosnick, Survey Research, 50
Ann. Rev. Psychol. 537 (1999).
113. See Roger Tourangeau et al., The Psychology of Survey Response (2000).
114. See Paul D. Allison, Missing Data, in Handbook of Survey Research, supra note 1, at 630;
see also Survey Nonresponse, supra note 111.
malls, the expert can test for and report on any differences observed across sites.
To the extent that similar results are obtained in different locations using different
onsite interview operations, it is less likely that idiosyncrasies of sample selection or
administration can account for the results.115 Similarly, because the characteristics
of persons visiting a shopping center vary by day of the week and time of day, bias
in sampling can be reduced if the survey design calls for sampling time segments
as well as mall locations.116
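A comparison of results across sites can be as simple as a two-proportion test. The sketch below, with invented counts, computes a z statistic for whether two mall locations produced different rates of a key response; a study with several sites might instead use a chi-square test:

```python
# Hypothetical sketch of a cross-site check: a two-proportion z-test
# comparing the rate of a key response at two interview locations.
# All counts are invented for illustration.
from math import sqrt

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: the two sites share one underlying rate."""
    p_pool = (x1 + x2) / (n1 + n2)           # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / se

# Site A: 30 of 100 gave the response; Site B: 26 of 100
z = two_proportion_z(30, 100, 26, 100)
```

Here |z| falls well below the conventional 1.96 cutoff, so the observed site difference is consistent with sampling variability rather than an idiosyncrasy of one location.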
In mall intercept surveys, the organization that manages the onsite interview
facility generally employs recruiters who approach potential survey respondents in
the mall and ascertain if they are qualified and willing to participate in the survey.
If a potential respondent agrees to answer the questions and meets the specified
criteria, he or she is escorted to the facility where the survey interview takes
place. If recruiters are free to approach potential respondents without controls on
how an individual is to be selected for screening, shoppers who spend more time
in the mall are more likely to be approached than shoppers who visit the mall
only briefly. Moreover, recruiters naturally prefer to approach friendly looking
potential respondents, so that it is more likely that certain types of individuals
will be selected. These potential biases in selection can be reduced by providing
appropriate selection instructions and training recruiters effectively. Training that
reduces the interviewer’s discretion in selecting a potential respondent is likely to
reduce bias in selection, as are instructions to approach every nth person entering
the facility through a particular door.117
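The every-nth-person rule is a form of systematic sampling, and its value lies precisely in leaving the recruiter no discretion over whom to approach. A minimal sketch, in which the shopper list simulates one door's entrance stream:

```python
# Sketch of the every-nth-person instruction described above: a fixed
# skip interval replaces recruiter discretion. The entrance stream is
# simulated for illustration.

def select_every_nth(entrants, n, start=1):
    """Approach the start-th entrant, then every nth person after that."""
    return entrants[start - 1::n]

# Twenty shoppers entering through one door, in order of arrival
entrants = [f"shopper_{i}" for i in range(1, 21)]
approached = select_every_nth(entrants, n=5)
```

Because selection depends only on order of entry, shoppers who linger in the mall are no more likely to be approached than those passing through briefly, and friendly-looking shoppers receive no advantage.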
F. What Precautions Were Taken to Ensure That Only
Qualified Respondents Were Included in the Survey?
In a carefully executed survey, each potential respondent is questioned or measured on the attributes that determine his or her eligibility to participate in the survey. Thus, the initial questions screen potential respondents to determine if they
are members of the target population of the survey (e.g., Is she at least 14 years
old? Does she own a dog? Does she live within 10 miles?). The screening questions must be drafted so that they do not appeal to or deter specific groups within
the target population, or convey information that will influence the respondent’s
115. Note, however, that differences in results across sites may arise from genuine differences
in respondents across geographic locations or from a failure to administer the survey consistently
across sites.
116. Seymour Sudman, Improving the Quality of Shopping Center Sampling, 17 J. Marketing Res.
423 (1980).
117. In the end, even if malls are randomly sampled and shoppers are randomly selected within
malls, results from mall surveys technically can be used to generalize only to the population of mall
shoppers. The ability of the mall sample to describe the likely response pattern of the broader relevant population will depend on the extent to which a substantial segment of the relevant population
(1) is not found in malls and (2) would respond differently to the interview.
answers on the main survey. For example, if respondents must be prospective
and recent purchasers of Sunshine orange juice in a trademark survey designed
to assess consumer confusion with Sun Time orange juice, potential respondents
might be asked to name the brands of orange juice they have purchased recently
or expect to purchase in the next 6 months. They should not be asked specifically
if they recently have purchased, or expect to purchase, Sunshine orange juice,
because this may affect their responses on the survey either by implying who is
conducting the survey or by supplying them with a brand name that otherwise
would not occur to them.
The content of a screening questionnaire (or screener) can also set the context
for the questions that follow. In Pfizer, Inc. v. Astra Pharmaceutical Products, Inc.,118
physicians were asked a screening question to determine whether they prescribed
particular drugs. The survey question that followed the screener asked “Thinking
of the practice of cardiovascular medicine, what first comes to mind when you
hear the letters XL?” The court found that the screener conditioned the physicians to respond with the name of a drug rather than a condition (long-acting).119
The criteria for determining whether to include a potential respondent
in the survey should be objective and clearly conveyed, preferably using
written instructions addressed to those who administer the screening questions.
These instructions and the completed screening questionnaire should be made
available to the court and the opposing party along with the interview form for
each respondent.
IV. Survey Questions and Structure
A. Were Questions on the Survey Framed to Be Clear,
Precise, and Unbiased?
Although it seems obvious that questions on a survey should be clear and precise,
phrasing questions to reach that goal is often difficult. Even questions that appear
clear can convey unexpected meanings and ambiguities to potential respondents.
For example, the question “What is the average number of days each week you
have butter?” appears to be straightforward. Yet some respondents wondered
whether margarine counted as butter, and when the question was revised to
include the introductory phrase “not including margarine,” the reported frequency of butter use dropped dramatically.120
118. 858 F. Supp. 1305, 1321 & n.13 (S.D.N.Y. 1994).
119. Id. at 1321.
120. Floyd J. Fowler, Jr., How Unclear Terms Affect Survey Data, 56 Pub. Op. Q. 218, 225–26
(1992).
When unclear questions are included in a survey, they may threaten the
validity of the survey by systematically distorting responses if respondents are
misled in a particular direction, or by inflating random error if respondents guess
because they do not understand the question.121 If the crucial question is sufficiently ambiguous or unclear, it may be the basis for rejecting the survey. For
example, a survey was designed to assess community sentiment that would warrant
a change of venue in trying a case for damages sustained when a hotel skywalk
collapsed.122 The court found that the question “Based on what you have heard,
read or seen, do you believe that in the current compensatory damage trials, the
defendants, such as the contractors, designers, owners, and operators of the Hyatt
Hotel, should be punished?” could neither be correctly understood nor easily
answered.123 The court noted that the phrase “compensatory damages,” although
well-defined for attorneys, was unlikely to be meaningful for laypersons.124
A variety of pretest activities may be used to improve the clarity of communication with respondents. Focus groups can be used to find out how the
survey population thinks about an issue, facilitating the construction of clear and
understandable questions. Cognitive interviewing, which includes a combination
of think-aloud and verbal probing techniques, may be used for questionnaire
evaluation.125 Pilot studies involving a dress rehearsal for the main survey can also
detect potential problems.
Texts on survey research generally recommend pretests as a way to increase
the likelihood that questions are clear and unambiguous,126 and some courts have
recognized the value of pretests.127 In many pretests or pilot tests,128 the proposed
survey is administered to a small sample (usually between 25 and 75)129 of the
121. See id. at 219.
122. Firestone v. Crown Ctr. Redevelopment Corp., 693 S.W.2d 99 (Mo. 1985) (en banc).
123. See id. at 102, 103.
124. See id. at 103. When there is any question about whether some respondents will understand
a particular term or phrase, the term or phrase should be defined explicitly.
125. Gordon B. Willis et al., Is the Bandwagon Headed to the Methodological Promised Land? Evaluating the Validity of Cognitive Interviewing Techniques, in Cognitive and Survey Research 136 (Monroe G.
Sirken et al. eds., 1999). See also Tourangeau et al., supra note 113, at 326–27.
126. See Jon A. Krosnick & Stanley Presser, Questions and Questionnaire Design, in Handbook of
Survey Research, supra note 1, at 294 (“No matter how closely a questionnaire follows recommendations based on best practices, it is likely to benefit from pretesting. . .”). See also Jean M. Converse &
Stanley Presser, Survey Questions: Handcrafting the Standardized Questionnaire 51 (1986); Fred W.
Morgan, Judicial Standards for Survey Research: An Update and Guidelines, 54 J. Marketing 59, 64 (1990).
127. See, e.g., Zippo Mfg. Co. v. Rogers Imports, Inc., 216 F. Supp. 670 (S.D.N.Y. 1963); Scott
v. City of New York, 591 F. Supp. 2d 554, 560 (S.D.N.Y. 2008) (“[s]urvey went through multiple
pretests in order to insure its usefulness and statistical validity.”).
128. The terms pretest and pilot test are sometimes used interchangeably to describe pilot work
done in the planning stages of research. When they are distinguished, the difference is that a pretest
tests the questionnaire, whereas a pilot test generally tests proposed collection procedures as well.
129. Converse & Presser, supra note 126, at 69. Converse and Presser suggest that a pretest with
25 respondents is appropriate when the survey uses professional interviewers.
same type of respondents who would be eligible to participate in the full-scale
survey. The interviewers observe the respondents for any difficulties they may
have with the questions and probe for the source of any such difficulties so that
the questions can be rephrased if confusion or other difficulties arise.130 Attorneys
who commission surveys for litigation sometimes are reluctant to approve pilot
work or to reveal that pilot work has taken place because they are concerned that
if a pretest leads to revised wording of the questions, the trier of fact may believe
that the survey has been manipulated and is biased or unfair. A more appropriate
reaction is to recognize that pilot work is a standard and valuable way to improve
the quality of a survey131 and to anticipate that it often results in word changes
that increase clarity and correct misunderstandings. Thus, changes may indicate
informed survey construction rather than flawed survey design.132
B. Were Some Respondents Likely to Have No Opinion?
If So, What Steps Were Taken to Reduce Guessing?
Some survey respondents may have no opinion on an issue under investigation,
either because they have never thought about it before or because the question
mistakenly assumes a familiarity with the issue. For example, survey respondents
may not have noticed that the commercial they are being questioned about guaranteed the quality of the product being advertised and thus they may have no
opinion on the kind of guarantee it indicated. Likewise, in an employee survey,
respondents may not be familiar with the parental leave policy at their company
and thus may have no opinion on whether they would consider taking advantage
of the parental leave policy if they became parents. The following three alternative question structures will affect how those respondents answer and how their
responses are counted.
First, the survey can ask all respondents to answer the question (e.g., “Did
you understand the guarantee offered by Clover to be a 1-year guarantee, a 60-day
guarantee, or a 30-day guarantee?”). Faced with a direct question, particularly
one that provides response alternatives, the respondent obligingly may supply an
130. Methods for testing respondent understanding include concurrent and retrospective think-alouds, in which respondents describe their thinking as they arrive at, or after they have arrived at, an
answer, and paraphrasing (asking respondents to restate the question in their own words). Tourangeau
et al., supra note 113, at 326–27; see also Methods for Testing and Evaluating Survey Questionnaires
(Stanley Presser et al. eds., 2004).
131. See OMB Standards and Guidelines for Statistical Surveys, supra note 110, Standard 1.4, Pretesting Survey Systems (specifying that to ensure that all components of a survey function as intended,
pretests of survey components should be conducted unless those components have previously been successfully fielded); American Association for Public Opinion Research, Best Practices (2011) (“Because
it is rarely possible to foresee all the potential misunderstandings or biasing effects of different questions
or procedures, it is vital for a well-designed survey operation to include provision for a pretest.”).
132. See infra Section VII.B for a discussion of obligations to disclose pilot work.
answer even if (in this example) the respondent did not notice the guarantee (or
is unfamiliar with the parental leave policy). Such answers will reflect only what
the respondent can glean from the question, or they may reflect pure guessing.
The imprecision introduced by this approach will increase with the proportion of
respondents who are unfamiliar with the topic at issue.
Second, the survey can use a quasi-filter question to reduce guessing by providing “don’t know” or “no opinion” options as part of the question (e.g., “Did
you understand the guarantee offered by Clover to be for more than a year, a
year, or less than a year, or don’t you have an opinion?”).133 By signaling to the
respondent that it is appropriate not to have an opinion, the question reduces
the demand for an answer and, as a result, the inclination to hazard a guess just
to comply. Respondents are more likely to choose a “no opinion” option if it is
mentioned explicitly by the interviewer than if it is merely accepted when the
respondent spontaneously offers it as a response. The consequence of this change in
format is substantial. Studies indicate that, although the relative distribution of the
respondents selecting the listed choices is unlikely to change dramatically, presentation of an explicit “don’t know” or “no opinion” alternative commonly leads to
a 20% to 25% increase in the proportion of respondents selecting that response.134
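The pattern these studies describe can be illustrated numerically. In the sketch below (all figures invented), offering an explicit "no opinion" option draws off an additional 20% of respondents while leaving the relative distribution of the substantive answers unchanged:

```python
# Numeric sketch of the quasi-filter pattern described above. Figures
# are invented: an explicit "no opinion" option attracts an extra 20%
# of respondents, and the substantive answers are rescaled so their
# relative distribution is preserved.

without_filter = {"1-year": 0.50, "60-day": 0.30, "30-day": 0.15, "no opinion": 0.05}
shift = 0.20  # extra share choosing "no opinion" when offered explicitly

substantive = {k: v for k, v in without_filter.items() if k != "no opinion"}
scale = (1 - without_filter["no opinion"] - shift) / sum(substantive.values())
with_filter = {k: round(v * scale, 3) for k, v in substantive.items()}
with_filter["no opinion"] = without_filter["no opinion"] + shift
```

In this example the "no opinion" share rises from 5% to 25%, yet the ratio of "1-year" to "60-day" answers among those expressing an opinion stays the same, which is the empirical regularity the cited studies report.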
Finally, the survey can include full-filter questions, that is, questions that lay
the groundwork for the substantive question by first asking the respondent if he
or she has an opinion about the issue or happened to notice the feature that the
interviewer is preparing to ask about (e.g., “Based on the commercial you just
saw, do you have an opinion about how long Clover stated or implied that its
guarantee lasts?”).135 The interviewer then asks the substantive question only of
those respondents who have indicated that they have an opinion on the issue.
Which of these three approaches is used and the way it is used can affect the rate
of “no opinion” responses that the substantive question will evoke.136 Respondents
are more likely to say that they do not have an opinion on an issue if a full filter is
used than if a quasi-filter is used.137 However, in maximizing respondent expressions
of “no opinion,” full filters may produce an underreporting of opinions. There is
some evidence that full-filter questions discourage respondents who actually have
opinions from offering them by conveying the implicit suggestion that respondents
can avoid difficult followup questions by saying that they have no opinion.138
133. Norbert Schwarz & Hans-Jürgen Hippler, Response Alternatives: The Impact of Their Choice
and Presentation Order, in Measurement Errors in Surveys 41, 45–46 (Paul P. Biemer et al. eds., 1991).
134. Howard Schuman & Stanley Presser, Questions and Answers in Attitude Surveys: Experiments on Question Form, Wording and Context 113–46 (1981).
135. See, e.g., Johnson & Johnson–Merck Consumer Pharmas. Co. v. SmithKline Beecham
Corp., 960 F.2d 294, 299 (2d Cir. 1992).
136. Considerable research has been conducted on the effects of filters. For a review, see George
F. Bishop et al., Effects of Filter Questions in Public Opinion Surveys, 47 Pub. Op. Q. 528 (1983).
137. Schwarz & Hippler, supra note 133, at 45–46.
138. Id. at 46.
In general, then, a survey that uses full filters provides a conservative estimate of the number of respondents holding an opinion, while a survey that uses
neither full filters nor quasi-filters may overestimate the number of respondents
with opinions, if some respondents offering opinions are guessing. The strategy
of including a “no opinion” or “don’t know” response as a quasi-filter avoids
both of these extremes. Thus, rather than asking, “Based on the commercial, do
you believe that the two products are made in the same way, or are they made
differently?”139 or prefacing the question with a preliminary, “Do you have an
opinion, based on the commercial, concerning the way that the two products are
made?” the question could be phrased, “Based on the commercial, do you believe
that the two products are made in the same way, or that they are made differently,
or don’t you have an opinion about the way they are made?”
Recent research on the effects of including a “don’t know” option shows that
quasi-filters as well as full filters may discourage a respondent who would be able
to provide a meaningful answer from expressing it.140 The “don’t know” option
provides a cue that it is acceptable to avoid the work of trying to provide a more
substantive response. Respondents are particularly likely to be attracted to a “don’t
know” option when the question is difficult to understand or the respondent is
not strongly motivated to carefully report an opinion.141 One solution that some
survey researchers use is to provide respondents with a general instruction not to
guess at the beginning of an interview, rather than supplying a “don’t know” or
“no opinion” option as part of the options attached to each question.142 Another
approach is to eliminate the “don’t know” option and to add followup questions
that measure the strength of the respondent’s opinion.143
C. Did the Survey Use Open-Ended or Closed-Ended
Questions? How Was the Choice in Each Instance Justified?
The questions that make up a survey instrument may be open-ended, closed-ended, or a combination of both. Open-ended questions require the respondent
to formulate and express an answer in his or her own words (e.g., “What was
the main point of the commercial?” “Where did you catch the fish you caught
139. The question in the example without the “no opinion” alternative was based on a question rejected by the court in Coors Brewing Co. v. Anheuser-Busch Cos., 802 F. Supp. 965, 972–73
(S.D.N.Y. 1992). See also Procter & Gamble Pharms., Inc. v. Hoffmann-La Roche, Inc., 2006 U.S.
Dist. LEXIS 64363 (S.D.N.Y. Sept. 6, 2006).
140. Jon A. Krosnick et al., The Impact of "No Opinion" Response Options on Data Quality: Non-Attitude Reduction or Invitation to Satisfice? 66 Pub. Op. Q. 371 (2002).
141. Krosnick & Presser, supra note 126, at 284.
142. Anheuser-Busch, Inc. v. VIP Prods., LLC, No. 4:08cv0358, 2008 U.S. Dist. LEXIS 82258,
at *6 (E.D. Mo. Oct. 16, 2008).
143. Krosnick & Presser, supra note 126, at 285.
in these waters?”144). Closed-ended questions provide the respondent with an
explicit set of responses from which to choose; the choices may be as simple as
yes or no (e.g., “Is Colby College coeducational?”145) or as complex as a range of
alternatives (e.g., “The two pain relievers have (1) the same likelihood of causing
gastric ulcers; (2) about the same likelihood of causing gastric ulcers; (3) a somewhat different likelihood of causing gastric ulcers; (4) a very different likelihood
of causing gastric ulcers; or (5) none of the above.”146). When a survey involves
in-person interviews, the interviewer may show the respondent these choices on
a showcard that lists them.
Open-ended and closed-ended questions may elicit very different responses.147
Most responses are less likely to be volunteered by respondents who are asked
an open-ended question than they are to be chosen by respondents who are presented with a closed-ended question. The response alternatives in a closed-ended
question may remind respondents of options that they would not otherwise consider or which simply do not come to mind as easily.148
The advantage of open-ended questions is that they give the respondent
fewer hints about expected or preferred answers. Precoded responses on a closed-ended question, in addition to reminding respondents of options that they might
not otherwise consider,149 may direct the respondent away from or toward a
particular response. For example, a commercial reported that in shampoo tests
with more than 900 women, the sponsor’s product received higher ratings than
144. A relevant example from Wilhoite v. Olin Corp. is described in McGovern & Lind, supra
note 31, at 76.
145. Presidents & Trustees of Colby College v. Colby College–N.H., 508 F.2d 804, 809 (1st
Cir. 1975).
146. This question is based on one asked in American Home Products Corp. v. Johnson &
Johnson, 654 F. Supp. 568, 581 (S.D.N.Y. 1987), that was found to be a leading question by the
court, primarily because the choices suggested that the respondent had learned about aspirin’s and
ibuprofen’s relative likelihood of causing gastric ulcers. In contrast, in McNeilab, Inc. v. American
Home Products Corp., 501 F. Supp. 517, 525 (S.D.N.Y. 1980), the court accepted as nonleading the
question, “Based only on what the commercial said, would Maximum Strength Anacin contain more
pain reliever, the same amount of pain reliever, or less pain reliever than the brand you, yourself,
currently use most often?”
147. Howard Schuman & Stanley Presser, Question Wording as an Independent Variable in Survey
Analysis, 6 Soc. Methods & Res. 151 (1977); Schuman & Presser, supra note 134, at 79–112; Converse
& Presser, supra note 126, at 33.
148. For example, when respondents in one survey were asked, “What is the most important
thing for children to learn to prepare them for life?”, 62% picked “to think for themselves” from a list
of five options, but only 5% spontaneously offered that answer when the question was open-ended.
Schuman & Presser, supra note 134, at 104–07. An open-ended question presents the respondent with
a free-recall task, whereas a closed-ended question is a recognition task. Recognition tasks in general
reveal higher performance levels than recall tasks. Mary M. Smyth et al., Cognition in Action 25
(1987). In addition, there is evidence that respondents answering open-ended questions may be less
likely to report some information that they would reveal in response to a closed-ended question when
that information seems self-evident or irrelevant.
149. Schwarz & Hippler, supra note 133, at 43.
other brands.150 According to a competitor, the commercial deceptively implied
that each woman in the test rated more than one shampoo, when in fact each
woman rated only one. To test consumer impressions, a survey might have shown
the commercial and asked an open-ended question: “How many different brands
mentioned in the commercial did each of the 900 women try?”151 Instead, the
survey asked a closed-ended question; respondents were given the choice of
“one,” “two,” “three,” “four,” or “five or more.” The fact that four of the
five choices in the closed-ended question provided a response that was greater
than one implied that the correct answer was probably more than one.152 Note,
however, that the open-ended question also may suggest that the answer is more
than one.
By asking “how many different brands,” the question suggests (1) that the
viewer should have received some message from the commercial about the number of brands each woman tried and (2) that different brands were tried. Similarly,
an open-ended question that asks, “[W]hich company or store do you think puts
out this shirt?” indicates to the respondent that the appropriate answer is the
name of a company or store. The question would be leading if the respondent
would have considered other possibilities (e.g., an individual or Webstore) if the
question had not provided the frame of a company or store.153 Thus, the wording of a question, open-ended or closed-ended, can be leading or non-leading,
and the degree of suggestiveness of each question must be considered in evaluating
the objectivity of a survey.
Closed-ended questions have some additional potential weaknesses that arise
if the choices are not constructed properly. If the respondent is asked to choose
one response from among several choices, the response chosen will be meaningful
only if the list of choices is exhaustive—that is, if the choices cover all possible
answers a respondent might give to the question. If the list of possible choices
is incomplete, a respondent may be forced to choose one that does not express
his or her opinion.154 Moreover, if respondents are told explicitly that they are
150. See Vidal Sassoon, Inc. v. Bristol-Myers Co., 661 F.2d 272, 273 (2d Cir. 1981).
151. This was the wording of the closed-ended question in the survey discussed in Vidal Sassoon,
661 F.2d at 275–76, without the closed-ended options that were supplied in that survey.
152. Ninety-five percent of the respondents who answered the closed-ended question in the
plaintiff’s survey said that each woman had tried two or more brands. The open-ended question was
never asked. Vidal Sassoon, 661 F.2d at 276. Norbert Schwarz, Assessing Frequency Reports of Mundane
Behaviors: Contributions of Cognitive Psychology to Questionnaire Construction, in Research Methods in
Personality and Social Psychology 98 (Clyde Hendrick & Margaret S. Clark eds., 1990), suggests that
respondents often rely on the range of response alternatives as a frame of reference when they are asked
for frequency judgments. See, e.g., Roger Tourangeau & Tom W. Smith, Asking Sensitive Questions: The
Impact of Data Collection Mode, Question Format, and Question Context, 60 Pub. Op. Q. 275, 292 (1996).
153. Smith v. Wal-Mart Stores, Inc., 537 F. Supp. 2d 1302, 1331–32 (N.D. Ga. 2008).
154. See, e.g., American Home Prods. Corp. v. Johnson & Johnson, 654 F. Supp. 568, 581
(S.D.N.Y. 1987).
not limited to the choices presented, most respondents nevertheless will select an
answer from among the listed ones.155
One form of closed-ended question format that typically produces some
distortion is the popular agree/disagree, true/false, or yes/no question. Although
this format is appealing because it is easy to write and score these questions and
their responses, the format is also seriously problematic. With its simplicity comes
acquiescence: "[t]he tendency to endorse any assertion made in a question,
regardless of its content" is a systematic source of bias that has produced an inflation effect of 10% across a number of studies.156 Only when control groups or
control questions are added to the survey design can this question format provide
reasonable response estimates.157
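The control-group correction works by subtraction: the share answering "yes" in a control condition that could not have conveyed the message estimates the acquiescence baseline, and the difference estimates the genuine effect. A minimal sketch with invented figures:

```python
# Sketch of the control-group correction implied above (figures are
# invented). With yes/no questions, some "yes" answers reflect
# acquiescence rather than genuine belief; a control group establishes
# that baseline so it can be netted out.

def net_agreement(yes_test, yes_control):
    """Share answering "yes" attributable to the stimulus itself."""
    return yes_test - yes_control

yes_test = 0.45     # saw the challenged commercial; answered "yes"
yes_control = 0.12  # saw a control commercial; answered "yes" anyway

effect = net_agreement(yes_test, yes_control)
```

In this example the raw 45% "yes" rate overstates the message's effect by the 12 percentage points attributable to acquiescence and guessing, leaving a net effect of 33%.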
Although many courts prefer open-ended questions on the ground that they
tend to be less leading, the value of any open-ended or closed-ended question
depends on the information it conveys in the question and, in the case of closed-ended questions, in the choices provided. Open-ended questions are more appropriate when the survey is attempting to gauge what comes first to a respondent's
mind, but closed-ended questions are more suitable for assessing choices between
well-identified options or obtaining ratings on a clear set of alternatives.
D. If Probes Were Used to Clarify Ambiguous or Incomplete
Answers, What Steps Were Taken to Ensure That the
Probes Were Not Leading and Were Administered in a
Consistent Fashion?
When questions allow respondents to express their opinions in their own words,
some of the respondents may give ambiguous or incomplete answers, or may ask
for clarification. In such instances, interviewers may be instructed to record any
answer that the respondent gives and move on to the next question, or they may
be instructed to probe to obtain a more complete response or clarify the meaning
of the ambiguous response. They may also be instructed what clarification they
can provide. In all of these situations, interviewers should record verbatim both
what the respondent says and what the interviewer says in the attempt to get or
provide clarification. Failure to record every part of the exchange in the order in
which it occurs raises questions about the reliability of the survey, because neither
the court nor the opposing party can evaluate whether the probe affected the
views expressed by the respondent.
155. See Howard Schuman, Ordinary Questions, Survey Questions, and Policy Questions, 50 Pub.
Opinion Q. 432, 435–36 (1986).
156. Jon A. Krosnick, Survey Research, 50 Ann. Rev. Psychol. 537, 552 (1999).
157. See infra Section IV.F.
Reference Guide on Survey Research
If the survey is designed to allow for probes, interviewers must be given
explicit instructions on when they should probe and what they should say in
probing.158 Standard probes used to draw out all that the respondent has to say
(e.g., “Any further thoughts?” “Anything else?” “Can you explain that a little
more?” or “Could you say that another way?”) are relatively innocuous and noncontroversial in content, but persistent requests for further responses
to the same or nearly identical questions may convey the idea to the respondent
that he or she has not yet produced the “right” answer.159 Interviewers should
be trained in delivering probes to maintain a professional and neutral relationship with the respondent (as they should during the rest of the interview), which
minimizes any sense of passing judgment on the content of the answers offered.
Moreover, interviewers should be given explicit instructions on when to probe,
so that probes are administered consistently.
A more difficult type of probe to construct and deliver reliably is one that
requires a substantive question tailored to the answer given by the respondent.
The survey designer must provide sufficient instruction to interviewers so that
they avoid giving directive probes that suggest one answer over another. Those
instructions, along with all other aspects of interviewer training, should be made
available for evaluation by the court and the opposing party.
E. What Approach Was Used to Avoid or Measure Potential
Order or Context Effects?
The order in which questions are asked on a survey and the order in which
response alternatives are provided in a closed-ended question can influence the
answers.160 For example, although asking a general question before a more specific
question on the same topic is unlikely to affect the response to the specific question, reversing the order of the questions may influence responses to the general
question. As a rule, then, surveys are less likely to be subject to order effects if
the questions move from the general (e.g., “What do you recall being discussed
158. Floyd J. Fowler, Jr. & Thomas W. Mangione, Standardized Survey Interviewing: Minimizing Interviewer-Related Error 41–42 (1990).
159. See, e.g., Johnson & Johnson–Merck Consumer Pharms. Co. v. Rhone-Poulenc Rorer
Pharms., Inc., 19 F.3d 125, 135 (3d Cir. 1994); American Home Prods. Corp. v. Procter & Gamble
Co., 871 F. Supp. 739, 748 (D.N.J. 1994).
160. See Schuman & Presser, supra note 134, at 23, 56–74; Krosnick & Presser, supra note 126,
at 278–81. In R.J. Reynolds Tobacco Co. v. Loew’s Theatres, Inc., 511 F. Supp. 867, 875 (S.D.N.Y.
1980), the court recognized the biased structure of a survey that disclosed the tar content of the cigarettes being compared before questioning respondents about their cigarette preferences. Not surprisingly, respondents expressed a preference for the lower tar product. See also E. & J. Gallo Winery v.
Pasatiempos Gallo, S.A., 905 F. Supp. 1403, 1409–10 (E.D. Cal. 1994) (court recognized that earlier
questions referring to playing cards, board or table games, or party supplies, such as confetti, increased
the likelihood that respondents would include these items in answers to the questions that followed).
in the advertisement?”) to the specific (e.g., “Based on your reading of the advertisement, what companies do you think the ad is referring to when it talks about
rental trucks that average five miles per gallon?”).161
The mode of questioning can influence the form that an order effect takes.
When respondents are shown response alternatives visually, as in mail surveys and
other self-administered questionnaires or in face-to-face interviews when respondents are shown a card containing response alternatives, they are more likely to
select the first choice offered (a primacy effect).162 In contrast, when response
alternatives are presented orally, as in telephone surveys, respondents are more
likely to choose the last choice offered (a recency effect).163 Although these effects
are typically small, no general formula is available that can adjust values to correct
for order effects, because the size and even the direction of the order effects may
depend on the nature of the question being asked and the choices being offered.
Moreover, it may be unclear which order is most appropriate. For example, if
the respondent is asked to choose between two different products, and there is a
tendency for respondents to choose the first product mentioned,164 which order
of presentation will produce the more accurate response?165 To control for order
effects, the order of the questions and the order of the response choices in a survey should be rotated,166 so that, for example, one-third of the respondents have
Product A listed first, one-third of the respondents have Product B listed first,
and one-third of the respondents have Product C listed first. If the three different
orders167 are distributed randomly among respondents, no response alternative will
have an inflated chance of being selected because of its position, and the average
of the three will provide a reasonable estimate of response level.168
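The rotation just described can be sketched in code. The following is a hypothetical illustration (the product names and respondent count are invented, not drawn from any case); it randomly allocates respondents among the six possible orders of three alternatives:

```python
import itertools
import random

# Hypothetical response alternatives whose order will be rotated.
PRODUCTS = ["Product A", "Product B", "Product C"]

# All six possible orders of three alternatives: ABC, ACB, BAC, BCA, CAB, CBA.
ALL_ORDERS = list(itertools.permutations(PRODUCTS))

def assign_order(rng=random):
    """Randomly assign one of the six orders to a respondent, so that no
    alternative has an inflated chance of being listed first."""
    return list(rng.choice(ALL_ORDERS))

# Across many respondents, each product appears first about one-third of
# the time (one-sixth per full ordering).
first_position_counts = {p: 0 for p in PRODUCTS}
for _ in range(6000):
    first_position_counts[assign_order()[0]] += 1
```

Random allocation of respondents to orders, rather than a fixed sequence, is what removes the positional advantage; averaging across the orders then estimates the position-free response level.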
161. This question was accepted by the court in U-Haul Int’l, Inc. v. Jartran, Inc., 522 F. Supp.
1238, 1249 (D. Ariz. 1981), aff’d, 681 F.2d 1159 (9th Cir. 1982).
162. Krosnick & Presser, supra note 126, at 280.
163. Id.
164. Similarly, candidates in the first position on the ballot tend to attract extra votes. J.M.
Miller & Jon A. Krosnick, The Impact of Candidate Name Order on Election Outcomes, 62 Pub. Op. Q.
291 (1998).
165. See Rust Env’t & Infrastructure, Inc. v. Teunissen, 131 F.3d 1210, 1218 (7th Cir. 1997)
(survey did not pass muster in part because of failure to incorporate random rotation of corporate
names that were the subject of a trademark dispute).
166. See, e.g., Winning Ways, Inc. v. Holloway Sportswear, Inc., 913 F. Supp. 1454, 1465–67
(D. Kan. 1996) (failure to rotate the order in which the jackets were shown to the consumers led to
reduced weight for the survey); Procter & Gamble Pharms., Inc. v. Hoffmann-La Roche, Inc., 2006
U.S. Dist. LEXIS 64363, 2006-2 Trade Cas. (CCH) P75465 (S.D.N.Y. Sept. 6, 2006).
167. Actually, there are six possible orders of the three alternatives: ABC, ACB, BAC, BCA,
CAB, and CBA. Thus, the optimal survey design would allocate equal numbers of respondents to
each of the six possible orders.
168. Although rotation is desirable, many surveys are conducted with no attention to this potential bias. Because it is impossible to know in the abstract whether a particular question suffers much,
little, or not at all from an order bias, lack of rotation should not preclude reliance on the answer to
the question, but it should reduce the weight given to that answer.
F. If the Survey Was Designed to Test a Causal Proposition,
Did the Survey Include an Appropriate Control Group or
Question?
Many surveys are designed not simply to describe attitudes or beliefs or reported
behaviors, but to determine the source of those attitudes or beliefs or behaviors.
That is, the purpose of the survey is to test a causal proposition. For example,
how does a trademark or the content of a commercial affect respondents’ perceptions or understanding of a product or commercial? Thus, the question is not
merely whether consumers hold inaccurate beliefs about Product A, but whether
exposure to the commercial misleads the consumer into thinking that Product A
is a superior pain reliever. Yet if consumers already believe, before viewing the
commercial, that Product A is a superior pain reliever, a survey that simply records
consumers’ impressions after they view the commercial may reflect those preexisting beliefs rather than impressions produced by the commercial.
Surveys that merely record consumer impressions have a limited ability to
answer questions about the origins of those impressions. The difficulty is that the
consumer’s response to any question on the survey may be the result of information or misinformation from sources other than the trademark the respondent is
being shown or the commercial he or she has just watched.169 In a trademark survey attempting to show secondary meaning, for example, respondents were shown
a picture of the stripes used on Mennen stick deodorant and asked, “[W]hich
[brand] would you say uses these stripes on their package?”170 The court recognized that the high percentage of respondents selecting “Mennen” from an array
of brand names may have represented “merely a playback of brand share”;171 that
is, respondents asked to give a brand name may guess the one that is most familiar,
generally the brand with the largest market share.172
Some surveys attempt to reduce the impact of preexisting impressions on
respondents’ answers by instructing respondents to focus solely on the stimulus
as a basis for their answers. Thus, the survey includes a preface (e.g., “based on
the commercial you just saw”) or directs the respondent’s attention to the mark
at issue (e.g., “these stripes on the package”). Such efforts are likely to be only
partially successful. It is often difficult for respondents to identify accurately the
169. See, e.g., Procter & Gamble Co. v. Ultreo, Inc., 574 F. Supp. 2d. 339, 351–52 (S.D.N.Y.
2008) (survey was unreliable because it failed to control for the effect of preexisting beliefs).
170. Mennen Co. v. Gillette Co., 565 F. Supp. 648, 652 (S.D.N.Y. 1983), aff’d, 742 F.2d 1437
(2d Cir. 1984). To demonstrate secondary meaning, “the [c]ourt must determine whether the mark
has been so associated in the mind of consumers with the entity that it identifies that the goods sold
by that entity are distinguished by the mark or symbol from goods sold by others.” Id.
171. Id.
172. See also Upjohn Co. v. American Home Prods. Corp., No. 1-95-CV-237, 1996 U.S. Dist.
LEXIS 8049, at *42–44 (W.D. Mich. Apr. 5, 1996).
source of their impressions.173 The more routine the idea being examined in the
survey (e.g., that the advertised pain reliever is more effective than others on
the market; that the mark belongs to the brand with the largest market share),
the more likely it is that the respondent’s answer is influenced by (1) preexisting impressions; (2) general expectations about what commercials typically say
(e.g., the product being advertised is better than its competitors); or (3) guessing,
rather than by the actual content of the commercial message or trademark being
evaluated.
It is possible to adjust many survey designs so that causal inferences about
the effect of a trademark or an allegedly deceptive commercial become clear and
unambiguous. By adding one or more appropriate control groups, the survey
expert can test directly the influence of the stimulus.174 In the simplest version
of such a survey experiment, respondents are assigned randomly to one of two
conditions.175 For example, respondents assigned to the experimental condition
view an allegedly deceptive commercial, and respondents assigned to the control
condition either view a commercial that does not contain the allegedly deceptive
material or do not view any commercial.176 Respondents in both the experimental
and control groups answer the same set of questions about the allegedly deceptive
message. The effect of the commercial’s allegedly deceptive message is evaluated
by comparing the responses made by the experimental group members with those
of the control group members. If 40% of the respondents in the experimental
group gave responses indicating that they received the deceptive message (e.g., the
advertised product has fewer calories than its competitor), whereas only 8% of
the respondents in the control group gave that response, the difference between
40% and 8% (within the limits of sampling error177) can be attributed only to the
allegedly deceptive message. Without the control group, it is not possible to
determine how much of the 40% is attributable to respondents’ preexisting beliefs
173. See Richard E. Nisbett & Timothy D. Wilson, Telling More Than We Can Know: Verbal
Reports on Mental Processes, 84 Psychol. Rev. 231 (1977).
174. See Shari S. Diamond, Using Psychology to Control Law: From Deceptive Advertising to Criminal
Sentencing, 13 Law & Hum. Behav. 239, 244–46 (1989); Jacob Jacoby & Constance Small, Applied
Marketing: The FDA Approach to Defining Misleading Advertising, 39 J. Marketing 65, 68 (1975). See also
David H. Kaye & David A. Freedman, Reference Guide on Statistics, Section II.A, in this manual.
175. Random assignment should not be confused with random selection. When respondents
are assigned randomly to different treatment groups (e.g., respondents in each group watch a different commercial), the procedure ensures that within the limits of sampling error the two groups of
respondents will be equivalent except for the different treatments they receive. Respondents selected
for a mall intercept study, and not from a probability sample, may be assigned randomly to different treatment groups. Random selection, in contrast, describes the method of selecting a sample of
respondents in a probability sample. See supra Section III.C.
176. This alternative commercial could be a “tombstone” advertisement that includes only the
name of the product or a more elaborate commercial that does not include the claim at issue.
177. For a discussion of sampling error, see David H. Kaye & David A. Freedman, Reference
Guide on Statistics, Section IV.A, in this manual.
or other background noise (e.g., respondents who misunderstand the question
or misstate their responses). Both preexisting beliefs and other background noise
should have produced similar response levels in the experimental and control
groups. In addition, if respondents who viewed the allegedly deceptive commercial respond differently than respondents who viewed the control commercial, the
difference cannot be merely the result of a leading question, because both groups
answered the same question. The ability to evaluate the effect of the wording of a
particular question makes the control group design particularly useful in assessing
responses to closed-ended questions,178 which may encourage guessing or particular responses. Thus, the focus on the response level in a control group design
is not on the absolute response level, but on the difference between the response
level of the experimental group and that of the control group.179
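The experimental/control comparison described above can be expressed as a simple difference of proportions, with sampling error captured by an approximate confidence interval. The sketch below is illustrative only: the 40% and 8% response levels come from the text, but the sample sizes are hypothetical.

```python
from math import sqrt

def net_effect(p_exp, n_exp, p_ctrl, n_ctrl):
    """Difference between experimental and control response levels, with an
    approximate 95% confidence interval reflecting sampling error."""
    diff = p_exp - p_ctrl
    se = sqrt(p_exp * (1 - p_exp) / n_exp + p_ctrl * (1 - p_ctrl) / n_ctrl)
    return diff, (diff - 1.96 * se, diff + 1.96 * se)

# 40% of experimental respondents vs. 8% of controls
# (n = 200 per group is a hypothetical sample size).
diff, (lo, hi) = net_effect(0.40, 200, 0.08, 200)
# If the interval excludes zero, then within the limits of sampling error
# the 32-percentage-point difference is attributable to the stimulus.
```

The same subtraction underlies the net confusion figures courts have accepted (e.g., the 51.9% − 26.5% = 25.4% calculation noted in footnote 179).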
In designing a survey-experiment, the expert should select a stimulus for the
control group that shares as many characteristics with the experimental stimulus
as possible, with the key exception of the characteristic whose influence is being
assessed.180 Although a survey with an imperfect control group may provide
better information than a survey with no control group at all, the choice of an
appropriate control group requires some care and should influence the weight that
the survey receives. For example, a control stimulus should not be less attractive
than the experimental stimulus if the survey is designed to measure how familiar
the experimental stimulus is to respondents, because attractiveness may affect perceived familiarity.181 Nor should the control stimulus share with the experimental
stimulus the feature whose impact is being assessed. If, for example, the control
stimulus in a case of alleged trademark infringement is itself a likely source of
consumer confusion, reactions to the experimental and control stimuli may not
178. The Federal Trade Commission has long recognized the need for some kind of control for
closed-ended questions, although it has not specified the type of control that is necessary. See Stouffer
Foods Corp., 118 F.T.C. 746, No. 9250, 1994 FTC LEXIS 196, at *31 (Sept. 26, 1994).
179. See, e.g., Cytosport, Inc. v. Vital Pharms., Inc., 617 F. Supp. 2d 1051, 1075–76 (E.D. Cal.
2009) (net confusion level of 25.4% obtained by subtracting 26.5% in the control group from 51.9%
in the test group).
180. See, e.g., Skechers USA, Inc. v. Vans, Inc., No. CV-07-01703, 2007 WL 4181677, at
*8–9 (C.D. Cal. Nov. 20, 2007) (in trade dress infringement case, control stimulus should have
retained design elements not at issue); Procter & Gamble Pharms., Inc. v. Hoffman-LaRoche, Inc.,
No. 06-Civ-0034, 2006 U.S. Dist. LEXIS 64363, at *87 (S.D.N.Y. Sept. 6, 2006) (in false advertising
action, disclaimer was inadequate substitute for appropriate control group).
181. See, e.g., Indianapolis Colts, Inc. v. Metropolitan Baltimore Football Club L.P., 34 F.3d
410, 415–16 (7th Cir. 1994) (court recognized that the name “Baltimore Horses” was less attractive for a sports team than the name “Baltimore Colts.”); see also Reed-Union Corp. v. Turtle Wax,
Inc., 77 F.3d 909, 912 (7th Cir. 1996) (court noted that one expert’s choice of a control brand with
a well-known corporate source was less appropriate than the opposing expert’s choice of a control
brand whose name did not indicate a specific corporate source); Louis Vuitton Malletier v. Dooney
& Bourke, Inc., 525 F. Supp. 2d 576, 595 (S.D.N.Y. 2007) (underreporting of background “noise”
likely occurred because handbag used as control was quite dissimilar in shape and pattern to both
plaintiff and defendant’s bags).
differ because both cause respondents to express the same level of confusion.182 In
an extreme case, an inappropriate control may do nothing more than control for
the effect of the nature or wording of the survey questions (e.g., acquiescence).183
That may not be enough to rule out other explanations for different or similar
responses to the experimental and control stimuli. Finally, it may sometimes be
appropriate to have more than one control group to assess precisely what is causing
the response to the experimental stimulus (e.g., in the case of an allegedly deceptive ad, whether it is a misleading graph or a misleading claim by the announcer;
or in the case of allegedly infringing trade dress, whether it is the style of the font
used or the coloring of the packaging).
Explicit attention to the value of control groups in trademark and deceptive-advertising litigation is a relatively recent phenomenon, but courts have increasingly come to recognize the central role the control group can play in evaluating
claims.184 A LEXIS search using the terms “Lanham Act” and “control group” revealed only 4
federal district court cases before 1991 in which surveys with control groups were
discussed, 16 in the 9 years from 1991 to 1999, and 46 in the 9 years between
2000 and 2008, a rate of growth that far exceeds the growth in Lanham Act litigation. In addition, courts in other cases have described or considered surveys using
control group designs without labeling the comparison group a control group.185
Indeed, one reason why cases involving surveys with control groups may be
underrepresented in reported cases is that a survey with a control group produces
182. See, e.g., Western Publ’g Co. v. Publications Int’l, Ltd., No. 94-C-6803, 1995 U.S. Dist.
LEXIS 5917, at *45 (N.D. Ill. May 2, 1995) (court noted that the control product was “arguably
more infringing than” the defendant’s product) (emphasis omitted). See also Classic Foods Int’l Corp.
v. Kettle Foods, Inc., 2006 U.S. Dist. LEXIS 97200 (C.D. Cal. Mar. 2, 2006); McNeil-PPC, Inc. v.
Merisant Co., 2004 U.S. Dist. LEXIS 27733 (D.P.R. July 29, 2004).
183. See text accompanying note 156, supra.
184. See, e.g., SmithKline Beecham Consumer Healthcare, L.P. v. Johnson & Johnson-Merck,
2001 U.S. Dist. LEXIS 7061, at *37 (S.D.N.Y. June 1, 2001) (survey to assess implied falsity of a
commercial not probative in the absence of a control group); American Home Prods. Corp.
v. Procter & Gamble Co., 871 F. Supp. 739, 749 (D.N.J. 1994) (discounting survey results based on
failure to control for participants’ preconceived notions); ConAgra, Inc. v. Geo. A. Hormel & Co.,
784 F. Supp. 700, 728 (D. Neb. 1992) (“Since no control was used, the . . . study, standing alone,
must be significantly discounted.”), aff’d, 990 F.2d 368 (8th Cir. 1993).
185. Indianapolis Colts, Inc. v. Metropolitan Baltimore Football Club L.P., No. 94727-C, 1994
U.S. Dist. LEXIS 19277, at *10–11 (S.D. Ind. June 27, 1994), aff’d, 34 F.3d 410 (7th Cir. 1994). In
Indianapolis Colts, the district court described a survey conducted by the plaintiff’s expert in which
half of the interviewees were shown a shirt with the name “Baltimore CFL Colts” on it and half
were shown a shirt on which the word “Horses” had been substituted for the word “Colts.” Id. The
court noted that the comparison of reactions to the horse and colt versions of the shirt made it possible “to determine the impact from the use of the word ‘Colts.’” Id. at *11. See also Quality Inns
Int’l, Inc. v. McDonald’s Corp., 695 F. Supp. 198, 218 (D. Md. 1988) (survey revealed confusion
between McDonald’s and McSleep, but control survey revealed no confusion between McDonald’s
and McTavish). See also Simon Prop. Group L.P. v. MySimon, Inc., 104 F. Supp. 2d 1033 (S.D. Ind.
2000) (court criticized the survey design based on the absence of a control that could show that results
were produced by legally relevant confusion).
less ambiguous findings, which may lead to a resolution before a preliminary
injunction hearing or trial occurs.
A less common use of control methodology is a control question. Rather than
administering a control stimulus to a separate group of respondents, the survey asks
all respondents one or more control questions along with the question about the
product or service at issue. In a trademark dispute, for example, a survey indicated
that 7.2% of respondents believed that “The Mart” and “K-Mart” were owned by
the same individuals. The court found no likelihood of confusion based on survey
evidence that 5.7% of the respondents also thought that “The Mart” and “King’s
Department Store” were owned by the same source.186
Similarly, a standard technique used to evaluate whether a brand name is
generic is to present survey respondents with a series of product or service names
and ask them to indicate in each instance whether they believe the name is a brand
name or a common name. By showing that 68% of respondents considered Teflon
a brand name (a proportion similar to the 75% of respondents who recognized
the acknowledged trademark Jell-O as a brand name, and markedly different from
the 13% who thought aspirin was a brand name), the makers of Teflon retained
their trademark.187
Every measure of opinion or belief in a survey reflects some degree of error.
Control groups and, as a second choice, control questions are the most reliable
means for assessing response levels against the baseline level of error associated
with a particular question.
G. What Limitations Are Associated with the Mode of Data
Collection Used in the Survey?
Three primary methods have traditionally been used to collect survey data:
(1) in-person interviews, (2) telephone interviews, and (3) mail questionnaires.188
Recently, in the wake of increasing use of the Internet, researchers have added
Web-based surveys to their arsenal of tools. Surveys using in-person and telephone
interviews, too, now regularly rely on computerized data collection.189
186. S.S. Kresge Co. v. United Factory Outlet, Inc., 598 F.2d 694, 697 (1st Cir. 1979). Note
that the aggregate percentages reported here do not reveal how many of the same respondents were
confused by both names, an issue that may be relevant in some situations. See Joseph L. Gastwirth,
Reference Guide on Survey Research, 36 Jurimetrics J. 181, 187–88 (1996) (review essay).
187. E.I. du Pont de Nemours & Co. v. Yoshida Int’l, Inc., 393 F. Supp. 502, 526–27 & n.54
(E.D.N.Y. 1975); see also Donchez v. Coors Brewing Co., 392 F.3d 1211, 1218 (10th Cir. 2004)
(respondents evaluated eight brand and generic names in addition to the disputed name). A similar
approach is used in assessing secondary meaning.
188. Methods also may be combined, as when the telephone is used to “screen” for eligible
respondents, who then are invited to participate in an in-person interview.
189. Wright & Marsden, supra note 1, at 13–14.
The interviewer conducting a computer-assisted interview (CAI), whether by
telephone (CATI) or face-to-face (CAPI), follows the computer-generated script
for the interview and enters the respondent’s answers as the interview proceeds.
A primary advantage of CATI and other CAI procedures is that skip patterns can
be built into the program. If, for example, the respondent answers yes when asked
whether she has ever been the victim of a burglary, the computer will generate
further questions about the burglary; if she answers no, the program will automatically skip the followup burglary questions. Interviewer errors in following the skip
patterns are therefore avoided, making CAI procedures particularly valuable when
the survey involves complex branching and skip patterns.190 CAI procedures also
can be used to control for order effects by having the program rotate the order in
which the questions or choices are presented.191
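The skip-pattern logic described above can be sketched as follows (the question wording is hypothetical). The point of the sketch is that the branch is executed by the program rather than left to the interviewer:

```python
SCREEN = "Have you ever been the victim of a burglary?"
FOLLOWUPS = [
    "When did the burglary occur?",
    "Was the burglary reported to the police?",
]

def interview_script(was_victim: bool) -> list:
    """Return the sequence of questions a CAI program would present."""
    script = [SCREEN]
    if was_victim:            # a "yes" answer branches into the follow-ups
        script.extend(FOLLOWUPS)
    return script             # a "no" answer skips them automatically
```

Because the program, not the interviewer, applies the branch, interviewer errors in following the skip pattern cannot occur, although accurate question reading and answer entry must still be monitored.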
Recent innovations in CAI procedures include audio computer-assisted self-interviewing (ACASI), in which the respondent listens to recorded questions
over the telephone or reads questions from a computer screen while listening to
recorded versions of them through headphones. The respondent then answers
verbally or on a keypad. ACASI procedures are particularly useful for collecting
sensitive information (e.g., illegal drug use and other HIV risk behavior).192
All CAI procedures require additional planning to take advantage of the
potential for improvements in data quality. When a CAI protocol is used in a survey presented in litigation, the party offering the survey should supply for inspection the computer program that was used to generate the interviews. Moreover,
CAI procedures do not eliminate the need for close monitoring of interviews
to ensure that interviewers are accurately reading the questions in the interview
protocol and accurately entering the respondent’s answers.
The choice of any data collection method for a survey should be justified by
its strengths and weaknesses.
1. In-person interviews
Although costly, in-person interviews generally are the preferred method of data
collection, especially when visual materials must be shown to the respondent
under controlled conditions.193 When the questions are complex and the interviewers are skilled, in-person interviewing provides the maximum opportunity to
190. Willem E. Saris, Computer-Assisted Interviewing 20, 27 (1991).
191. See, e.g., Intel Corp. v. Advanced Micro Devices, Inc., 756 F. Supp. 1292, 1296–97 (N.D.
Cal. 1991) (survey designed to test whether the term 386 as applied to a microprocessor was generic
used a CATI protocol that tested reactions to five terms presented in rotated order).
192. See, e.g., N. Galai et al., ACASI Versus Interviewer-Administered Questionnaires for Sensitive
Risk Behaviors: Results of a Cross-Over Randomized Trial Among Injection Drug Users (abstract, 2004),
available at http://gateway.nlm.nih.gov/MeetingAbstracts/ma?f=102280272.html.
193. A mail survey also can include limited visual materials but cannot exercise control over
when and how the respondent views them.
clarify or probe. Unlike a mail survey, both in-person and telephone interviews
have the capability to implement complex skip sequences (in which the respondent’s answer determines which question will be asked next) and the power to
control the order in which the respondent answers the questions. Interviewers also
can directly verify who is completing the survey, a check that is unavailable in mail
and Web-based surveys. As described infra Section V.A, appropriate interviewer
training, as well as monitoring of the implementation of interviewing, is necessary
if these potential benefits are to be realized. Objections to the use of in-person
interviews arise primarily from their high cost or, on occasion, from evidence of
inept or biased interviewers. In recent years, technology has improved the quality of
in-person interviewing. Using computer-assisted personal interviewing (CAPI), the
interviewer reads the questions off the screen of a laptop computer and then enters
responses directly.194 This support makes it easier to follow complex skip patterns
and to promptly submit results via the Internet to the survey center.
2. Telephone interviews
Telephone surveys offer a comparatively fast and lower-cost alternative to in-person
surveys and are particularly useful when the population is large and geographically
dispersed. Telephone interviews (unless supplemented with mailed or e-mailed
materials) can be used only when it is unnecessary to show the respondent any
visual materials. Thus, an attorney may present the results of a telephone survey
of jury-eligible citizens in a motion for a change of venue in order to provide
evidence that community prejudice raises a reasonable suspicion of potential jury
bias.195 Similarly, potential confusion between a restaurant called McBagel’s and the
McDonald’s fast-food chain was established in a telephone survey. Over objections
from defendant McBagel’s that the survey did not show respondents the defendant’s
print advertisements, the court found likelihood of confusion based on the survey, noting that “by soliciting audio responses[, the telephone survey] was closely
related to the radio advertising involved in the case.”196 In contrast, when words
are not sufficient because, for example, the survey is assessing reactions to the trade
194. Wright & Marsden, supra note 1, at 13.
195. See, e.g., State v. Baumruk, 85 S.W.3d 644 (Mo. 2002) (overturning the trial court’s
decision to ignore a survey that found about 70% of county residents remembered the shooting that
led to the trial and that of those who had heard about the shooting, 98% believed that the defendant
was either definitely guilty or probably guilty); State v. Erickstad, 620 N.W.2d 136, 140 (N.D. 2000)
(denying change of venue motion based on media coverage, concluding that “defendants [need to]
submit qualified public opinion surveys, other opinion testimony, or any other evidence demonstrating community bias caused by the media coverage”). For a discussion of surveys used in motions for
change of venue, see Neal Miller, Facts, Expert Facts, and Statistics: Descriptive and Experimental Research
Methods in Litigation, Part II, 40 Rutgers L. Rev. 467, 470–74 (1988); National Jury Project, Jurywork:
Systematic Techniques (2d ed. 2008).
196. McDonald’s Corp. v. McBagel’s, Inc., 649 F. Supp. 1268, 1278 (S.D.N.Y. 1986).
403
Copyright © National Academy of Sciences. All rights reserved.
Reference Manual on Scientific Evidence: Third Edition
Reference Manual on Scientific Evidence
dress or packaging of a product that is alleged to promote confusion, a telephone
survey alone does not offer a suitable vehicle for questioning respondents.197
In evaluating the sampling used in a telephone survey, the trier of fact should
consider:
1. Whether (when prospective respondents are not business personnel) some
form of random-digit dialing198 was used instead of or to supplement
telephone numbers obtained from telephone directories, because a high
percentage of all residential telephone numbers in some areas may be
unlisted;199
2. Whether any attempt was made to include cell phone users, particularly
the growing subpopulation of individuals who rely solely on cell phones
for telephone services;200
3. Whether the sampling procedures required the interviewer to sample
within the household or business, instead of allowing the interviewer
to administer the survey to any qualified individual who answered the
telephone;201 and
4. Whether interviewers were required to call back multiple times at several
different times of the day and on different days to increase the likelihood
of contacting individuals or businesses with different schedules.202
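As a concrete illustration of item 1, random-digit dialing can be sketched in a few lines of Python: random four-digit suffixes are appended to known area-code/exchange prefixes, so listed and unlisted numbers are reached alike. The prefixes below are purely hypothetical; a real sampling frame would be built from telephone numbering-plan data for the surveyed region.

```python
import random

def rdd_sample(prefixes, n, seed=0):
    """Generate n random-digit-dial numbers by appending a random
    four-digit suffix to known area-code/exchange prefixes.
    Because suffixes are generated rather than drawn from a
    directory, unlisted numbers are covered as well."""
    rng = random.Random(seed)
    numbers = []
    for _ in range(n):
        area, exchange = rng.choice(prefixes)
        suffix = rng.randrange(10000)  # uniform over 0000-9999
        numbers.append(f"{area}-{exchange}-{suffix:04d}")
    return numbers

# Hypothetical prefixes for the region being surveyed.
frame = [("212", "555"), ("718", "555")]
sample = rdd_sample(frame, 5)
```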
197. See Thompson Med. Co. v. Pfizer Inc., 753 F.2d 208 (2d Cir. 1985); Incorporated Publ’g
Corp. v. Manhattan Magazine, Inc., 616 F. Supp. 370 (S.D.N.Y. 1985), aff’d without op., 788 F.2d 3
(2d Cir. 1986).
198. Random-digit dialing provides coverage of households with both listed and unlisted telephone numbers by generating numbers at random from the sampling frame of all possible telephone
numbers. James M. Lepkowski, Telephone Sampling Methods in the United States, in Telephone Survey
Methodology 81–91 (Robert M. Groves et al. eds., 1988).
199. Studies comparing listed and unlisted household characteristics show some important differences. Id. at 76.
200. According to a 2009 study, an estimated 26.5% of households cannot be reached by landline
surveys, because 2.0% have no phone service and 24.5% have only a cell phone. Stephen J. Blumberg
& Julian V. Luke, Wireless Substitution: Early Release of Estimates Based on the National Health
Interview Survey, July–December 2009 (2010), available at http://www.cdc.gov/nchs/data/nhis/
earlyrelease/wireless201005.pdf. People who can be reached only by cell phone tend to be younger
and are more likely to be African American or Hispanic and less likely to be married or to own their
home than individuals reachable on a landline. Although at this point the effect on estimates from
landline-only telephone surveys appears to be minimal on most topics, on some issues (e.g., voter registration) and within the population of young adults, the gap may warrant consideration. Scott Keeter
et al., What’s Missing from National RDD Surveys? The Impact of the Growing Cell-Only Population, Paper
presented at the 2007 Conference of AAPOR, May 2007.
201. This is a consideration only if the survey is sampling individuals. If the survey is seeking
information on the household, more than one individual may be able to answer questions on behalf
of the household.
202. This applies equally to in-person interviews.
Reference Guide on Survey Research
Telephone surveys that do not include these procedures may not provide
precise measures of the characteristics of a representative sample of respondents,
but may be adequate for providing rough approximations. The vulnerability of
the survey depends on the information being gathered. More elaborate procedures
are advisable for achieving a representative sample of respondents if the survey
instrument requests information that is likely to differ for individuals with listed
telephone numbers versus individuals with unlisted telephone numbers, individuals rarely at home versus those usually at home, or groups who are more versus
less likely to rely exclusively on cell phones.
The report submitted by a survey expert who conducts a telephone survey
should specify:
1. The procedures that were used to identify potential respondents, including
both the procedures used to select the telephone numbers that were called
and the procedures used to identify the qualified individual to question;
2. The number of telephone numbers for which no contact was made; and
3. The number of contacted potential respondents who refused to participate
in the survey.203
Like CAPI interviewing,204 computer-assisted telephone interviewing (CATI)
facilitates the administration and data entry of large-scale surveys.205 A computer
protocol may be used to generate and dial telephone numbers as well as to guide
the interviewer.
3. Mail questionnaires
In general, mail surveys tend to be substantially less costly than both in-person and
telephone surveys.206 Response rates tend to be lower for self-administered mail surveys than for telephone or face-to-face surveys, but higher than for their Web-based
equivalents.207 Procedures that raise response rates include multiple mailings, highly
personalized communications, prepaid return envelopes, incentives or gratuities,
assurances of confidentiality, first-class outgoing postage, and followup reminders.208
203. Additional disclosure and reporting features applicable to surveys in general are described
in Section VII.B, infra.
204. See text accompanying note 194, supra.
205. See Roger Tourangeau et al., The Psychology of Survey Response 289 (2000); Saris, supra
note 190.
206. See Chase H. Harrison, Mail Surveys and Paper Questionnaires, in Handbook of Survey
Research, supra note 1, at 498, 499.
207. See Mick Couper et al., A Comparison of Mail and E-Mail for a Survey of Employees in Federal
Statistical Agencies, 15 J. Official Stat. 39 (1999); Mick Couper, Web Surveys: A Review of Issues and
Approaches 464, 473 (2001).
208. See, e.g., Richard J. Fox et al., Mail Survey Response Rate: A Meta-Analysis of Selected
Techniques for Inducing Response, 52 Pub. Op. Q. 467, 482 (1988); Kenneth D. Hopkins & Arlen R.
A mail survey will not produce a high rate of return unless it begins with an
accurate and up-to-date list of names and addresses for the target population. Even
if the sampling frame is adequate, the sample may be unrepresentative if some
individuals are more likely to respond than others. For example, if a survey targets
a population that includes individuals with literacy problems, these individuals will
tend to be underrepresented. Open-ended questions are generally of limited value
on a mail survey because they depend entirely on the respondent to answer fully
and do not provide the opportunity to probe or clarify unclear answers. Similarly,
if eligibility to answer some questions depends on the respondent’s answers to
previous questions, such skip sequences may be difficult for some respondents
to follow. Finally, because respondents complete mail surveys without supervision,
survey personnel are unable to prevent respondents from discussing the questions
and answers with others before completing the survey and to control the order in
which respondents answer the questions. Although skilled design of questionnaire
format, question order, and the appearance of the individual pages of a survey can
minimize these problems,209 if it is crucial to have respondents answer questions in
a particular order, a mail survey cannot be depended on to provide adequate data.
4. Internet surveys
A more recent innovation in survey technology is the Internet survey in which
potential respondents are contacted and their responses are collected over the
Internet. Internet surveys in principle can reduce substantially the cost of reaching potential respondents. Moreover, they offer some of the advantages of in-person interviews by enabling the respondent to view pictures, videos, and lists
of response choices on the computer screen during the survey. A further advantage is that whenever a respondent answers questions presented on a computer
screen, whether over the Internet or in a dedicated facility, the survey can build
in a variety of controls. In contrast to a mail survey in which the respondent can
examine and/or answer questions out of order and may mistakenly skip questions,
a computer-administered survey can control the order in which the questions are
displayed so that the respondent does not see a later question before answering
an earlier one and so that the respondent cannot go back to change an answer
previously given to an earlier question in light of the questions that follow it.
The order of the questions or response options can be rotated easily to control
for order effects. In addition, the structure permits the survey to remind, or even
require, the respondent to answer a question before the next question is presented.
One advantage of computer-administered surveys over interviewer-administered
Gullickson, Response Rates in Survey Research: A Meta-Analysis of the Effects of Monetary Gratuities, 61 J.
Experimental Educ. 52, 54–57, 59 (1992); Eleanor Singer et al., Confidentiality Assurances and Response:
A Quantitative Review of the Experimental Literature, 59 Pub. Op. Q. 66, 71 (1995); see generally Don A.
Dillman, Internet Mail and Mixed-Mode Surveys: The Tailored Design Method (3d ed. 2009).
209. Dillman, supra note 208, at 151–94.
surveys is that they eliminate interviewer error because the computer presents the
questions and the respondent records her own answers.
Internet surveys do have limitations, and many questions remain about the
extent to which those limitations impair the quality of the data they provide. A
key potential limitation is that respondents accessible over the Internet may not
fairly represent the relevant population whose responses the survey was designed
to measure. Although Internet access has not approached the 95% penetration
achieved by the telephone, the proportion of individuals with Internet access has
grown at a remarkable rate, as has the proportion of individuals who regularly
use a computer. For example, according to one estimate, use of the Internet
among adults jumped from 22% in 1997 to 60% in 2003.210 Despite this rapid
expansion, a digital divide still exists, so that the “have-nots” are less likely to be
represented in surveys that depend on Internet access. The effect of this divide on
survey results will depend on the population the survey is attempting to capture.
For example, if the target population consists of computer users, any bias from
systematic underrepresentation is likely to be minimal. In contrast, if the target
population consists of owners of television sets, a proportion of whom may not
have Internet access, significant bias is more likely. The trend toward greater
access to the Internet is likely to continue, and the issue of underrepresentation
may disappear in time. At this point, a party presenting the results of a Web-based
survey should be prepared to provide evidence on how coverage limitations may
have affected the pattern of survey results.
Even if noncoverage error is not a significant concern, courts evaluating a
Web-based survey must still determine whether the sampling approach is adequate. That evaluation will depend on the type of Internet survey involved,
because Web-based surveys vary in fundamental ways.
At one extreme is the list-based Web survey. This Web survey is sent to a
closed set of potential respondents drawn from a list that consists of the e-mail
addresses of the target individuals (e.g., all students at a university or employees at
a company where each student or employee has a known e-mail address).
At the other extreme is the self-selected Web survey in which Web users in
general, or those who happen to visit a particular Web site, are invited to express
their views on a topic and they participate simply by volunteering. Whereas the
list-based survey enables the researcher to evaluate response rates and often to assess
the representativeness of respondents on a variety of characteristics, the self-selected
Web survey provides no information on who actually participates or how representative the participants are. Thus, it is impossible to evaluate nonresponse error or
even participation rates. Moreover, participants are very likely to self-select on the
basis of the nature of the topic. These self-selected pseudosurveys resemble reader
polls published in magazines and do not meet standard criteria for legitimate surveys
210. Jennifer C. Day et al., Computer and Internet Use in the United States: 2003, 8–9 (U.S.
Census Bureau 2005).
admissible in court.211 Occasionally, proponents of such polls tout the large number
of respondents as evidence of the weight the results should be given, but the size
of the sample cannot cure the likely participation bias in such voluntary polls.212
Between these two extremes is a large category of Web-based survey
approaches that researchers have developed to address concerns about sampling
bias and nonresponse error. For example, some approaches create a large database
of potential participants by soliciting volunteers through appeals on well-traveled
sites.213 Based on the demographic data collected from those who respond to the
appeals, a sample of these panel members is asked to participate in a particular
survey by invitation only. Responses are weighted to reduce selection bias.214 An
expert presenting the results from such a survey should be prepared to explain why
the particular weighting approach can be relied upon to achieve that purpose.215
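The weighting step can be illustrated with a minimal post-stratification sketch in Python: each demographic cell receives the weight population share divided by sample share, so underrepresented cells are weighted up and overrepresented cells down. The cells and shares below are hypothetical, and the weighting schemes used by commercial panels are considerably more elaborate.

```python
from collections import Counter

def poststratification_weights(sample_cells, population_shares):
    """Compute one weight per demographic cell so the weighted
    sample matches known population proportions:
    weight = population share / sample share."""
    counts = Counter(sample_cells)
    n = len(sample_cells)
    return {cell: population_shares[cell] / (counts[cell] / n)
            for cell in counts}

# Hypothetical example: ages 18-34 make up 30% of the population
# but only 20% of the self-selected panel sample.
sample = ["18-34"] * 20 + ["35+"] * 80
weights = poststratification_weights(sample, {"18-34": 0.30, "35+": 0.70})
# 18-34 respondents are weighted up (1.5); 35+ down (0.875).
```

After weighting, the weighted counts sum to the sample size while the weighted age distribution matches the assumed population shares.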
Another approach that is more costly uses probability sampling from the initial
contact with a potential respondent. Potential participants are initially contacted
by telephone using random-digit dialing procedures. Those who lack Internet
access are provided with the technology to participate. Members from the panel
are then invited to participate in a particular survey, and the researchers know
the characteristics of participants and nonparticipants from the initial telephone
contact.216 For all surveys that rely on preselected panels, whether nonrandomly
or randomly selected, questions have been raised about panel conditioning (i.e.,
the effect of having participants in earlier surveys respond to later surveys) and the
relatively low rate of response to survey invitations. An expert presenting results
from a Web-based survey should be prepared to address these issues and to discuss
how they may have affected the results.
Finally, the recent proliferation of Internet surveys has stimulated a growing
body of research on the influence of formatting choices in Web surveys. Evidence
from this research indicates that formatting decisions can significantly affect the
quality of survey responses.217
211. See, e.g., Merisant Co. v. McNeil Nutritionals, LLC, 242 F.R.D. 315 (E.D. Pa. 2007)
(report on results from AOL “instant poll” excluded).
212. See, e.g., Couper (2001), supra note 207, at 480–81 (a self-selected Web survey conducted
by the National Geographic Society through its Web site attracted 50,000 responses; a comparison
of the Canadian respondents with data from the Canadian General Social Survey telephone survey
conducted using random-digit dialing showed marked differences on a variety of response measures).
213. See, e.g., Ecce Panis, Inc. v. Maple Leaf Bakery, Inc., 2007 U.S. Dist. LEXIS 85780 (D.
Ariz. Nov. 7, 2007).
214. See, e.g., Philip Morris USA, Inc. v. Otamedia Limited, 2005 U.S. Dist. LEXIS 1259
(S.D.N.Y. Jan. 28, 2005).
215. See, e.g., A&M Records, Inc. v. Napster, Inc., 2000 WL 1170106 (N.D. Cal. Aug. 10,
2000) (court refused to rely on results from Internet panel survey when expert presenting the results
showed lack of familiarity with panel construction and weighting methods).
216. See, e.g., Price v. Philip Morris, Inc., 219 Ill. 2d 182, 848 N.E.2d 1 (2005).
217. See, e.g., Mick P. Couper et al., What They See Is What We Get: Response Options for Web
Surveys, 22 Soc. Sci. Computer Rev. 111 (2004) (comparing order effects with radio button and
A final approach to data collection does not depend on a single mode, but
instead involves a mixed-mode approach. By combining modes, the survey design
may increase the likelihood that all sampling members of the target population
will be contacted. For example, a person without a landline may be reached by
mail or e-mail. Similarly, response rates may be increased if members of the target
population are more likely to respond to one mode of contact versus another. For
example, a person unwilling to be interviewed by phone may respond to a written
or e-mail contact. If a mixed-mode approach is used, the questions and structure
of the questionnaires are likely to differ across modes, and the expert should be
prepared to address the potential impact of mode on the answers obtained.218
V. Surveys Involving Interviewers
A. Were the Interviewers Appropriately Selected and Trained?
A properly defined population or universe, a representative sample, and clear and
precise questions can be depended on to produce trustworthy survey results only if
“sound interview procedures were followed by competent interviewers.”219 Properly trained interviewers receive detailed written instructions on everything they
are to say to respondents, any stimulus materials they are to use in the survey, and
how they are to complete the interview form. These instructions should be made
available to the opposing party and to the trier of fact. Thus, interviewers should
be told, and the interview form on which answers are recorded should indicate,
which responses, if any, are to be read to the respondent. Moreover, interviewers
should be instructed to record verbatim the respondent’s answers, to indicate
explicitly whenever they repeat a question to the respondent, and to record any
statements they make to or supplementary questions they ask the respondent.
Interviewers require training to ensure that they are able to follow directions
in administering the survey questions. Some training in general interviewing
techniques is required for most interviews (e.g., practice in pausing to give the
respondent enough time to answer and practice in resisting invitations to express
the interviewer’s beliefs or opinions). Although procedures vary, there is evidence
that interviewer performance suffers with less than a day of training in general
interviewing skills and techniques for new interviewers.220
drop-box formats); Andy Peytchev et al., Web Survey Design: Paging Versus Scrolling, 70 Pub. Op. Q.
212 (2006) (comparing the effects of presenting survey questions in a multitude of short pages or in
long scrollable pages).
218. Don A. Dillman & Benjamin L. Messer, Mixed-Mode Surveys, in Wright & Marsden, supra
note 1, at 550, 553.
219. Toys “R” Us, Inc. v. Canarsie Kiddie Shop, Inc., 559 F. Supp. 1189, 1205 (E.D.N.Y. 1983).
220. Fowler & Mangione, supra note 158, at 117; Nora Cate Schaeffer et al., Interviewers and
Interviewing, in Handbook of Survey Research, supra note 1, at 437, 460.
The more complicated the survey instrument is, the more training and
experience the interviewers require. Thus, if the interview includes a skip pattern (where, e.g., Questions 4–6 are asked only if the respondent says yes to
Question 3, and Questions 8–10 are asked only if the respondent says no to Question 3), interviewers must be trained to follow the pattern. Note, however, that
in surveys conducted using CAPI or CATI procedures, the interviewer will be
guided by the computer used to administer the questionnaire.
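The skip pattern described above reduces to simple branching logic, which is exactly what CAPI/CATI software enforces automatically. A minimal sketch, with the question numbers taken from the example in the text:

```python
def skip_sequence(answer_q3):
    """Return the follow-up questions to administer after Question 3,
    following the example skip pattern: Questions 4-6 only if the
    respondent says yes to Question 3, Questions 8-10 only if no."""
    if answer_q3 == "yes":
        return [4, 5, 6]
    if answer_q3 == "no":
        return [8, 9, 10]
    raise ValueError("Question 3 must be answered yes or no")
```

Because the computer applies the routing rule, a CAPI or CATI interviewer cannot mistakenly ask (or skip) the wrong branch, which is one way these systems reduce interviewer error.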
If the questions require specific probes to clarify ambiguous responses, interviewers must receive instruction on when to use the probes and what to say. In
some surveys, the interviewer is responsible for last-stage sampling (i.e., selecting
the particular respondents to be interviewed), and training is especially crucial to
avoid interviewer bias in selecting respondents who are easiest to approach or
easiest to find.
Training and instruction of interviewers should include directions on the
circumstances under which interviews are to take place (e.g., question only one
respondent at a time outside the hearing of any other respondent). The trustworthiness of a survey is questionable if there is evidence that some interviews
were conducted in a setting in which respondents were likely to have been
distracted or in which others could overhear. Such evidence of careless administration of the survey was one ground used by a court to reject as inadmissible a
survey that purported to demonstrate consumer confusion.221
Some compromises may be accepted when surveys must be conducted swiftly.
In trademark and deceptive advertising cases, the plaintiff’s usual request is for a
preliminary injunction, because a delay means irreparable harm. Nonetheless,
careful instruction and training of interviewers who administer the survey, as well
as monitoring and validation to ensure quality control,222 and complete disclosure
of the methods used for all of the procedures followed are crucial elements that, if
compromised, seriously undermine the trustworthiness of any survey.
B. What Did the Interviewers Know About the Survey and Its
Sponsorship?
One way to protect the objectivity of survey administration is to avoid telling
interviewers who is sponsoring the survey. Interviewers who know the identity
of the survey’s sponsor may affect results inadvertently by communicating to
respondents their expectations or what they believe are the preferred responses of
the survey’s sponsor. To ensure objectivity in the administration of the survey, it is
standard interview practice in surveys conducted for litigation to do double-blind
221. Toys “R” Us, 559 F. Supp. at 1204 (some interviews apparently were conducted in a
bowling alley; some interviewees waiting to be interviewed overheard the substance of the interview
while they were waiting).
222. See Section V.C, infra.
research whenever possible: Both the interviewer and the respondent are blind
to the sponsor of the survey and its purpose. Thus, the survey instrument should
provide no explicit or implicit clues about the sponsorship of the survey or the
expected responses. Explicit clues could include a sponsor’s letterhead appearing
on the survey; implicit clues could include reversing the usual order of the yes and
no response boxes on the interviewer’s form next to a crucial question, thereby
potentially increasing the likelihood that no will be checked.223
Nonetheless, in some surveys (e.g., some government surveys), disclosure of
the survey’s sponsor to respondents (and thus to interviewers) is required. Such
surveys call for an evaluation of the likely biases introduced by interviewer or
respondent awareness of the survey’s sponsorship. In evaluating the consequences
of sponsorship awareness, it is important to consider (1) whether the sponsor has
views and expectations that are apparent and (2) whether awareness is confined to
the interviewers or involves the respondents. For example, if a survey concerning
attitudes toward gun control is sponsored by the National Rifle Association, it is
clear that responses opposing gun control are likely to be preferred. In contrast,
if the survey on gun control attitudes is sponsored by the Department of Justice,
the identity of the sponsor may not suggest the kinds of responses the sponsor
expects or would find acceptable.224 When interviewers are well trained, their
awareness of sponsorship may be a less serious threat than respondents’ awareness. The empirical evidence for the effects of interviewers’ prior expectations on
respondents’ answers generally reveals modest effects when the interviewers are
well trained.225
C. What Procedures Were Used to Ensure and Determine That
the Survey Was Administered to Minimize Error and Bias?
Three methods are used to ensure that the survey instrument was implemented
in an unbiased fashion and according to instructions. The first, monitoring the
interviews as they occur, is done most easily when telephone surveys are used.
A supervisor listens to a sample of interviews for each interviewer. Field settings
make monitoring more difficult, but evidence that monitoring has occurred provides an additional indication that the survey has been reliably implemented. Some
223. See Centaur Communications, Ltd. v. A/S/M Communications, Inc., 652 F. Supp. 1105,
1111 n.3 (S.D.N.Y. 1987) (pointing out that reversing the usual order of response choices, yes or no,
to no or yes may confuse interviewers as well as introduce bias), aff’d, 830 F.2d 1217 (2d Cir. 1987).
224. See, e.g., Stanley Presser et al., Survey Sponsorship, Response Rates, and Response Effects, 73
Soc. Sci. Q. 699, 701 (1992) (different responses to a university-sponsored telephone survey and a
newspaper-sponsored survey for questions concerning attitudes toward the mayoral primary, an issue
on which the newspaper had taken a position).
225. See, e.g., Seymour Sudman et al., Modest Expectations: The Effects of Interviewers’ Prior Expectations on Responses, 6 Soc. Methods & Res. 171, 181 (1977).
monitoring systems, both telephone and field, now use recordings, procedures that
may require permission from respondents.
Second, validation of interviews occurs when respondents in a sample are
recontacted to ask whether the initial interviews took place and to determine
whether the respondents were qualified to participate in the survey. Validation
callbacks may also collect data on a few key variables to confirm that the correct
respondent has been interviewed. The standard procedure for validation of in-person interviews is to telephone a random sample of about 10% to 15% of the
respondents.226 Some attempts to reach the respondent will be unsuccessful, and
occasionally a respondent will deny that the interview took place even though it
did. Because the information checked is typically limited to whether the interview
took place and whether the respondent was qualified, this validation procedure does
not determine whether the initial interview as a whole was conducted properly.
Nonetheless, this standard validation technique warns interviewers that their work
is being checked and can detect gross failures in the administration of the survey. In
computer-assisted interviews, further validation information can be obtained from
the timings that can be automatically recorded when an interview occurs.
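Drawing the validation subsample is a straightforward random-sampling step; the sketch below uses hypothetical respondent identifiers, with the 10% to 15% range taken from the standard practice described in the text.

```python
import random

def validation_sample(respondent_ids, fraction=0.10, seed=0):
    """Draw a simple random subsample of completed interviews
    (conventionally about 10%-15%) for validation callbacks."""
    if not 0.10 <= fraction <= 0.15:
        raise ValueError("validation fraction is conventionally 10%-15%")
    k = max(1, round(fraction * len(respondent_ids)))
    return random.Random(seed).sample(respondent_ids, k)

# Hypothetical interview IDs: validate 12% of 200 completed interviews.
callbacks = validation_sample([f"R{i:03d}" for i in range(200)], 0.12)
```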
A third way to verify that the interviews were conducted properly is to examine the work done by each individual interviewer. By reviewing the interviews
and individual responses recorded by each interviewer and comparing patterns
of response across interviewers, researchers can identify any response patterns or
inconsistencies that warrant further investigation.
When a survey is conducted at the request of a party for litigation rather than
in the normal course of business, a heightened standard for validation checks may
be appropriate. Thus, independent validation of a random sample of interviews by
a third party rather than by the field service that conducted the interviews increases
the trustworthiness of the survey results.227
VI. Data Entry and Grouping of Responses
A. What Was Done to Ensure That the Data Were Recorded
Accurately?
Analyzing the results of a survey requires that the data obtained on each sampled
element be recorded, edited, and often coded before the results can be tabulated
226. See, e.g., Davis v. Southern Bell Tel. & Tel. Co., No. 89-2839, 1994 U.S. Dist. LEXIS
13257, at *16 (S.D. Fla. Feb. 1, 1994); National Football League Properties, Inc. v. New Jersey Giants,
Inc., 637 F. Supp. 507, 515 (D.N.J. 1986).
227. In Rust Environment & Infrastructure, Inc. v. Teunissen, 131 F.3d 1210, 1218 (7th Cir. 1997),
the court criticized a survey in part because it “did not comport with accepted practice for independent
validation of the results.”
and processed. Procedures for data entry should include checks for completeness,
checks for reliability and accuracy, and rules for resolving inconsistencies. Accurate
data entry is maximized when responses are verified by duplicate entry and comparison, and when data-entry personnel are unaware of the purposes of the survey.
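Duplicate entry and comparison can be sketched as follows, with a hypothetical record layout: the same questionnaires are keyed twice by different operators, and every field on which the two passes disagree is flagged for resolution against the paper original.

```python
def compare_entries(entry_a, entry_b):
    """Compare two independently keyed versions of the same records;
    return (record_id, field, value_a, value_b) for every disagreement,
    each to be resolved against the original questionnaire."""
    discrepancies = []
    for rec_id in entry_a:
        for field in entry_a[rec_id]:
            va, vb = entry_a[rec_id][field], entry_b[rec_id][field]
            if va != vb:
                discrepancies.append((rec_id, field, va, vb))
    return discrepancies

# Hypothetical double-keyed batch: one keystroke error on R001, q2.
first_pass  = {"R001": {"q1": "yes", "q2": "3"}, "R002": {"q1": "no", "q2": "5"}}
second_pass = {"R001": {"q1": "yes", "q2": "8"}, "R002": {"q1": "no", "q2": "5"}}
flags = compare_entries(first_pass, second_pass)
```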
B. What Was Done to Ensure That the Grouped Data Were
Classified Consistently and Accurately?
Coding of answers to open-ended questions requires a detailed set of instructions
so that decision standards are clear and responses can be scored consistently and
accurately. Two trained coders should independently score the same responses
to check for the level of consistency in classifying responses. When the criteria
used to categorize verbatim responses are controversial or allegedly inappropriate,
those criteria should be sufficiently clear to reveal the source of disagreements. In
all cases, the verbatim responses should be available so that they can be recoded
using alternative criteria.228
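The consistency of the two coders' independent scoring is often summarized with Cohen's kappa, which corrects raw percent agreement for the agreement expected by chance. The sketch below uses hypothetical coded categories for a set of verbatim responses.

```python
from collections import Counter

def cohens_kappa(coder1, coder2):
    """Cohen's kappa for two coders categorizing the same responses:
    kappa = (p_observed - p_chance) / (1 - p_chance), where p_chance
    is the agreement expected from each coder's marginal frequencies."""
    n = len(coder1)
    p_obs = sum(a == b for a, b in zip(coder1, coder2)) / n
    c1, c2 = Counter(coder1), Counter(coder2)
    p_chance = sum(c1[k] * c2[k] for k in c1) / n ** 2
    return (p_obs - p_chance) / (1 - p_chance)

# Hypothetical coding of ten open-ended answers: 8 of 10 agree.
coder1 = ["confused"] * 5 + ["other"] * 5
coder2 = ["confused"] * 4 + ["other"] * 5 + ["confused"]
kappa = cohens_kappa(coder1, coder2)  # 0.80 observed, 0.50 by chance
```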
VII. Disclosure and Reporting
A. When Was Information About the Survey Methodology
and Results Disclosed?
Objections to the definition of the relevant population, the method of selecting
the sample, and the wording of questions generally are raised for the first time
when the results of the survey are presented. By that time it is often too late to
correct methodological deficiencies that could have been addressed in the planning stages of the survey. The plaintiff in a trademark case229 submitted a set of
proposed survey questions to the trial judge, who ruled that the survey results
228. See, e.g., Revlon Consumer Prods. Corp. v. Jennifer Leather Broadway, Inc., 858 F. Supp.
1268, 1276 (S.D.N.Y. 1994) (inconsistent scoring and subjective coding led court to find survey so
unreliable that it was entitled to no weight), aff’d, 57 F.3d 1062 (2d Cir. 1995); Rock v. Zimmerman,
959 F.2d 1237, 1253 n.9 (3d Cir. 1992) (court found that responses on a change-of-venue survey
incorrectly categorized respondents who believed the defendant was insane as believing he was
guilty); Coca-Cola Co. v. Tropicana Prods., Inc., 538 F. Supp. 1091, 1094–96 (S.D.N.Y.) (plaintiff’s
expert stated that respondents’ answers to the open-ended questions revealed that 43% of respondents
thought Tropicana was portrayed as fresh squeezed; the court’s own tabulation found no more than
15% believed this was true), rev’d on other grounds, 690 F.2d 312 (2d Cir. 1982); see also Cumberland
Packing Corp. v. Monsanto Co., 140 F. Supp. 2d 241 (E.D.N.Y. 2001) (court examined verbatim
responses that respondents gave to arrive at a confusion level substantially lower than the level reported
by the survey expert).
229. Union Carbide Corp. v. Ever-Ready, Inc., 392 F. Supp. 280 (N.D. Ill. 1975), rev’d, 531
F.2d 366 (7th Cir. 1976).
Copyright © National Academy of Sciences. All rights reserved.
Reference Manual on Scientific Evidence: Third Edition
Reference Manual on Scientific Evidence
would be admissible at trial while reserving the question of the weight the evidence would be given.230 The Seventh Circuit called this approach a commendable procedure and suggested that it would have been even more desirable if the
parties had “attempt[ed] in good faith to agree upon the questions to be in such
a survey.”231
The Manual for Complex Litigation, Second, recommended that parties be
required, “before conducting any poll, to provide other parties with an outline of
the proposed form and methodology, including the particular questions that will
be asked, the introductory statements or instructions that will be given, and other
controls to be used in the interrogation process.”232 The parties then were encouraged to attempt to resolve any methodological disagreements before the survey
was conducted.233 Although this passage in the second edition of the Manual has
been cited with apparent approval,234 the prior agreement that the Manual recommends has occurred rarely, and the Manual for Complex Litigation, Fourth,
recommends, but does not advocate requiring, prior disclosure and discussion of
survey plans.235 As the Manual suggests, however, early disclosure can enable the
parties to raise prompt objections that may permit corrective measures to be taken
before a survey is completed.236
Rule 26 of the Federal Rules of Civil Procedure requires extensive disclosure
of the basis of opinions offered by testifying experts. However, Rule 26 does
not produce disclosure of all survey materials, because parties are not obligated
to disclose information about nontestifying experts. Parties considering whether
to commission or use a survey for litigation are not obligated to present a survey
that produces unfavorable results. Prior disclosure of a proposed survey instrument
places the party that ultimately would prefer not to present the survey in the position of presenting damaging results or leaving the impression that the results are
not being presented because they were unfavorable. Anticipating such a situation,
230. Before trial, the presiding judge was appointed to the court of appeals, and so the case was
tried by another district court judge.
231. Union Carbide, 531 F.2d at 386. More recently, the Seventh Circuit recommended filing
a motion in limine, asking the district court to determine the admissibility of a survey based on an
examination of the survey questions and the results of a preliminary survey before the party undertakes
the expense of conducting the actual survey. Piper Aircraft Corp. v. Wag-Aero, Inc., 741 F.2d 925,
929 (7th Cir. 1984). On one recent occasion, the parties jointly developed a survey administered by
a neutral third-party survey firm. Scott v. City of New York, 591 F. Supp. 2d 554, 560 (S.D.N.Y.
2008) (survey design, including multiple pretests, negotiated with the help of the magistrate judge).
232. MCL 2d, supra note 16, § 21.484.
233. See id.
234. See, e.g., National Football League Props., Inc. v. New Jersey Giants, Inc., 637 F. Supp.
507, 514 n.3 (D.N.J. 1986).
235. MCL 4th, supra note 16, § 11.493 (“including the specific questions that will be asked,
the introductory statements or instructions that will be given, and other controls to be used in the
interrogation process.”).
236. See id.
Reference Guide on Survey Research
parties do not decide whether an expert will testify until after the results of the
survey are available.
Nonetheless, courts are in a position to encourage early disclosure and discussion even if they do not lead to agreement between the parties. In McNeilab,
Inc. v. American Home Products Corp.,237 Judge William C. Conner encouraged the
parties to submit their survey plans for court approval to ensure their evidentiary
value; the plaintiff did so and altered its research plan based on Judge Conner’s
recommendations. Parties can anticipate that changes consistent with a judicial
suggestion are likely to increase the weight given to, or at least the prospects of
admissibility of, the survey.238
B. Does the Survey Report Include Complete and Detailed
Information on All Relevant Characteristics?
The completeness of the survey report is one indicator of the trustworthiness of
the survey and the professionalism of the expert who is presenting the results of
the survey. A survey report generally should provide in detail:
1. The purpose of the survey;
2. A definition of the target population and a description of the sampling
frame;
3. A description of the sample design, including the method of selecting
respondents, the method of interview, the number of callbacks, respondent
eligibility or screening criteria and method, and other pertinent information;
4. A description of the results of sample implementation, including the
number of
a. potential respondents contacted,
b. potential respondents not reached,
c. noneligibles,
d. refusals,
e. incomplete interviews or terminations, and
f. completed interviews;
5. The exact wording of the questions used, including a copy of each version
of the actual questionnaire, interviewer instructions, and visual exhibits;239
237. 848 F.2d 34, 36 (2d Cir. 1988) (discussing with approval the actions of the district court).
See also Hubbard v. Midland Credit Mgmt, 2009 U.S. Dist. LEXIS 13938 (S.D. Ind. Feb. 23, 2009)
(court responded to plaintiff’s motions to approve survey methodology with a critique of the proposed
methodology).
238. Larry C. Jones, Developing and Using Survey Evidence in Trademark Litigation, 19 Memphis
St. U. L. Rev. 471, 481 (1989).
239. The questionnaire itself can often reveal important sources of bias. See Marria v. Broaddus,
200 F. Supp. 2d 280, 289 (S.D.N.Y. 2002) (court excluded survey sent to prison administrators based
6. A description of any special scoring (e.g., grouping of verbatim responses
into broader categories);
7. A description of any weighting or estimating procedures used;
8. Estimates of the sampling error, where appropriate (i.e., in probability
samples);
9. Statistical tables clearly labeled and identified regarding the source of the
data, including the number of raw cases forming the base for each table,
row, or column; and
10. Copies of interviewer instructions, validation results, and code books.240
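For item 8, the sampling error of a reported percentage can be approximated with the usual normal-approximation formula for a proportion from a simple random sample. A brief sketch with hypothetical numbers:

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a proportion
    estimated from a simple random sample of size n."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical: 43% of 400 respondents gave a particular answer
p_hat, n = 0.43, 400
moe = margin_of_error(p_hat, n)
print(f"{p_hat:.0%} \u00b1 {moe:.1%}")
```

More complex designs (clustering, stratification, weighting) require design-specific variance estimates; this formula applies only to simple random samples.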
Additional information to include in the survey report may depend on the nature
of sampling design. For example, reported response rates along with the time
each interview occurred may assist in evaluating the likelihood that nonresponse
biased the results. In a survey designed to assess the duration of employee preshift
activities, workers were approached as they entered the workplace; records were
not kept on refusal rates or the timing of participation in the study. Thus, it was
impossible to rule out the plausible hypothesis that individuals who arrived early
for their shift with more time to spend on preshift activities were more likely to
participate in the study.241
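Disposition records of the kind missing in that case reduce to a straightforward response-rate calculation. A sketch under hypothetical counts (this is one common convention; treatments of noneligibles and cases of unknown eligibility vary across standards):

```python
def response_rate(completed, refusals, terminations, not_reached, noneligible):
    """Completed interviews as a share of presumed-eligible sample members.
    Noneligibles are excluded from the denominator."""
    eligible = completed + refusals + terminations + not_reached
    return completed / eligible

# Hypothetical sample-disposition counts from a survey report
rr = response_rate(completed=412, refusals=95, terminations=23,
                   not_reached=170, noneligible=60)
print(f"{rr:.1%}")
```

Without the underlying disposition counts, a court cannot compute any such rate, which is why item 4 of the checklist above calls for them.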
Survey professionals generally do not describe pilot testing in their survey
reports. They would be more likely to do so if courts recognized that surveys are
improved by pilot work that maximizes the likelihood that respondents understand the questions they are being asked. Moreover, the Federal Rules of Civil
Procedure may require that a testifying expert disclose pilot work that serves as
a basis for the expert’s opinion. The situation is more complicated when a nontestifying expert conducts the pilot work and the testifying expert learns about the
pilot testing only indirectly through the attorney’s advice about the relevant issues
on questionnaire that began, “We need your help. We are helping to defend the NYS Department
of Correctional Service in a case that involves their policy on intercepting Five-Percenter literature.
Your answers to the following questions will be helpful in preparing a defense.”).
240. These criteria were adapted from the Council of American Survey Research Organizations, supra note 76, § III.B. Failure to supply this information substantially impairs a court’s ability
to evaluate a survey. In re Prudential Ins. Co. of Am. Sales Practices Litig., 962 F. Supp. 450, 532
(D.N.J. 1997) (citing the first edition of this manual). But see Florida Bar v. Went for It, Inc., 515
U.S. 618, 626–28 (1995), in which a majority of the Supreme Court relied on a summary of results
prepared by the Florida Bar from a consumer survey purporting to show consumer objections to
attorney solicitation by mail. In a strong dissent, Justice Kennedy, joined by three other Justices, found
the survey inadequate based on the document available to the court, pointing out that the summary
included “no actual surveys, few indications of sample size or selection procedures, no explanations
of methodology, and no discussion of excluded results . . . no description of the statistical universe
or scientific framework that permits any productive use of the information the so-called Summary of
Record contains.” Id. at 640.
241. See Chavez v. IBP, Inc., 2004 U.S. Dist. LEXIS 28838 (E.D. Wash. Aug. 18, 2004).
in the case. Some commentators suggest that attorneys are obligated to disclose
such pilot work.242
C. In Surveys of Individuals, What Measures Were Taken to
Protect the Identities of Individual Respondents?
The respondents questioned in a survey generally do not testify in legal proceedings and are unavailable for cross-examination. Indeed, one of the advantages of
a survey is that it avoids a repetitious and unrepresentative parade of witnesses.
To verify that interviews occurred with qualified respondents, standard survey
practice includes validation procedures,243 the results of which should be included
in the survey report.
Conflicts may arise when an opposing party asks for survey respondents’
names and addresses so that they can re-interview some respondents. The party
introducing the survey or the survey organization that conducted the research
generally resists supplying such information.244 Professional surveyors as a rule
promise confidentiality in an effort to increase participation rates and to encourage candid responses, although to the extent that identifying information is collected, such promises may not effectively prevent a lawful inquiry. Because failure
to extend confidentiality may bias both the willingness of potential respondents
to participate in a survey and their responses, the professional standards for survey researchers generally prohibit disclosure of respondents’ identities. “The
use of survey results in a legal proceeding does not relieve the Survey Research
Organization of its ethical obligation to maintain in confidence all Respondent-identifiable information or lessen the importance of Respondent anonymity.”245
Although no surveyor–respondent privilege currently is recognized, the need for
surveys and the availability of other means to examine and ensure their trustworthiness argue for deference to legitimate claims for confidentiality in order to avoid
seriously compromising the ability of surveys to produce accurate information.246
242. See Yvonne C. Schroeder, Pretesting Survey Questions, 11 Am. J. Trial Advoc. 195, 197–201
(1987).
243. See supra Section V.C.
244. See, e.g., Alpo Petfoods, Inc. v. Ralston Purina Co., 720 F. Supp. 194 (D.D.C. 1989), aff’d
in part and vacated in part, 913 F.2d 958 (D.C. Cir. 1990).
245. Council of Am. Survey Res. Orgs., supra note 76, § I.A.3.f. Similar provisions are contained
in the By-Laws of the American Association for Public Opinion Research.
246. United States v. Dentsply Int’l, Inc., 2000 U.S. Dist. LEXIS 6994, at *23 (D. Del. May 10,
2000) (Fed. R. Civ. P. 26(a)(1) does not require party to produce the identities of individual survey
respondents); Litton Indus., Inc., No. 9123, 1979 FTC LEXIS 311, at *13 & n.12 (June 19, 1979)
(Order Concerning the Identification of Individual Survey-Respondents with Their Questionnaires)
(citing Frederick H. Boness & John F. Cordes, The Researcher–Subject Relationship: The Need for Protection
and a Model Statute, 62 Geo. L.J. 243, 253 (1973)); see also Applera Corp. v. MJ Research, Inc., 389
F. Supp. 2d 344, 350 (D. Conn. 2005) (denying access to names of survey respondents); Lampshire
Copies of all questionnaires should be made available upon request so that the
opposing party has an opportunity to evaluate the raw data. All identifying information, such as the respondent’s name, address, and telephone number, should be
removed to ensure respondent confidentiality.
VIII. Acknowledgment
Thanks are due to Jon Krosnick for his research on surveys and his always sage
advice.
v. Procter & Gamble Co., 94 F.R.D. 58, 60 (N.D. Ga. 1982) (defendant denied access to personal
identifying information about women involved in studies by the Centers for Disease Control based
on Fed. R. Civ. P. 26(c) giving court the authority to enter “any order which justice requires to
protect a party or persons from annoyance, embarrassment, oppression, or undue burden or expense.”)
(citation omitted).
Glossary of Terms
The following terms and definitions were adapted from a variety of sources,
including Handbook of Survey Research (Peter H. Rossi et al. eds., 1st ed. 1983;
Peter V. Marsden & James D. Wright eds., 2d ed. 2010); Measurement Errors in
Surveys (Paul P. Biemer et al. eds., 1991); Willem E. Saris, Computer-Assisted
Interviewing (1991); Seymour Sudman, Applied Sampling (1976).
branching. A questionnaire structure that uses the answers to earlier questions
to determine which set of additional questions should be asked (e.g., citizens
who report having served as jurors on a criminal case are asked different
questions about their experiences than citizens who report having served as
jurors on a civil case).
CAI (computer-assisted interviewing). A method of conducting interviews
in which an interviewer asks questions and records the respondent’s answers
by following a computer-generated protocol.
CAPI (computer-assisted personal interviewing). A method of conducting
face-to-face interviews in which an interviewer asks questions and records the
respondent’s answers by following a computer-generated protocol.
CATI (computer-assisted telephone interviewing). A method of conducting
telephone interviews in which an interviewer asks questions and records the
respondent’s answers by following a computer-generated protocol.
closed-ended question. A question that provides the respondent with a list of
choices and asks the respondent to choose from among them.
cluster sampling. A sampling technique allowing for the selection of sample
elements in groups or clusters, rather than on an individual basis; it may
significantly reduce field costs and may increase sampling error if elements
in the same cluster are more similar to one another than are elements in different clusters.
confidence interval. An indication of the probable range of error associated with
a sample value obtained from a probability sample.
context effect. A previous question influences the way the respondent perceives
and answers a later question.
convenience sample. A sample of elements selected because they were readily
available.
coverage error. Any inconsistencies between the sampling frame and the target
population.
double-blind research. Research in which the respondent and the interviewer
are not given information that will alert them to the anticipated or preferred
pattern of response.
error score. The degree of measurement error in an observed score (see true
score).
full-filter question. A question asked of respondents to screen out those who
do not have an opinion on the issue under investigation before asking them
the question proper.
mall intercept survey. A survey conducted in a mall or shopping center in
which potential respondents are approached by a recruiter (intercepted) and
invited to participate in the survey.
multistage sampling design. A sampling design in which sampling takes place
in several stages, beginning with larger units (e.g., cities) and then proceeding
with smaller units (e.g., households or individuals within these units).
noncoverage error. The omission of eligible population units from the sampling
frame.
nonprobability sample. Any sample that does not qualify as a probability
sample.
open-ended question. A question that requires the respondent to formulate his
or her own response.
order effect. A tendency of respondents to choose an item based in part on the
order of response alternatives on the questionnaire (see primacy effect and
recency effect).
parameter. A summary measure of a characteristic of a population (e.g., average
age, proportion of households in an area owning a computer). Statistics are
estimates of parameters.
pilot test. A small field test replicating the field procedures planned for the
full-scale survey; although the terms pilot test and pretest are sometimes used
interchangeably, a pretest tests the questionnaire, whereas a pilot test generally
tests proposed collection procedures as well.
population. The totality of elements (individuals or other units) that have some
common property of interest; the target population is the collection of elements that the researcher would like to study. Also, universe.
population value, population parameter. The actual value of some characteristic in the population (e.g., the average age); the population value is
estimated by taking a random sample from the population and computing
the corresponding sample value.
pretest. A small preliminary test of a survey questionnaire. See pilot test.
primacy effect. A tendency of respondents to choose early items from a list of
choices; the opposite of a recency effect.
probability sample. A type of sample selected so that every element in the
population has a known nonzero probability of being included in the sample;
a simple random sample is a probability sample.
probe. A followup question that an interviewer asks to obtain a more complete
answer from a respondent (e.g., “Anything else?” “What kind of medical
problem do you mean?”).
quasi-filter question. A question that offers a “don’t know” or “no opinion”
option to respondents as part of a set of response alternatives; used to screen out
respondents who may not have an opinion on the issue under investigation.
random sample. See probability sample.
recency effect. A tendency of respondents to choose later items from a list of
choices; the opposite of a primacy effect.
sample. A subset of a population or universe selected so as to yield information
about the population as a whole.
sampling error. The estimated size of the difference between the result obtained
from a sample study and the result that would be obtained by attempting a
complete study of all units in the sampling frame from which the sample was
selected in the same manner and with the same care.
sampling frame. The source or sources from which the individuals or other
units in a sample are drawn.
secondary meaning. A descriptive term that becomes protectable as a trademark
if it signifies to the purchasing public that the product comes from a single
producer or source.
simple random sample. The most basic type of probability sample; each unit in
the population has an equal probability of being in the sample, and all possible
samples of a given size are equally likely to be selected.
skip pattern, skip sequence. A sequence of questions in which some should
not be asked (should be skipped) based on the respondent’s answer to a previous question (e.g., if the respondent indicates that he does not own a car, he
should not be asked what brand of car he owns).
stratified sampling. A sampling technique in which the researcher subdivides
the population into mutually exclusive and exhaustive subpopulations, or
strata; within these strata, separate samples are selected. Results can be combined to form overall population estimates or used to report separate within-stratum estimates.
survey-experiment. A survey with one or more control groups, enabling the
researcher to test a causal proposition.
survey population. See population.
systematic sampling. A sampling technique that consists of a random starting
point and the selection of every nth member of the population; it is generally analyzed as if it were a simple random sample and generally produces the
same results.
target population. See population.
trade dress. A distinctive and nonfunctional design of a package or product protected under state unfair competition law and the federal Lanham Act § 43(a),
15 U.S.C. § 1125(a) (1946) (amended 1992).
true score. The underlying true value, which is unobservable because there is always
some error in measurement; the observed score = true score + error score.
universe. See population.
References on Survey Research
Paul P. Biemer, Robert M. Groves, Lars E. Lyberg, Nancy A. Mathiowetz, &
Seymour Sudman (eds.), Measurement Errors in Surveys (2004).
Jean M. Converse & Stanley Presser, Survey Questions: Handcrafting the Standardized Questionnaire (1986).
Mick P. Couper, Designing Effective Web Surveys (2008).
Don A. Dillman, Jolene Smyth, & Leah M. Christian, Internet, Mail and Mixed-Mode Surveys: The Tailored Design Method (3d ed. 2009).
Robert M. Groves, Floyd J. Fowler, Jr., Mick P. Couper, James M. Lepkowski,
Eleanor Singer, & Roger Tourangeau, Survey Methodology (2004).
Sharon Lohr, Sampling: Design and Analysis (2d ed. 2010).
Questions About Questions: Inquiries into the Cognitive Bases of Surveys (Judith
M. Tanur ed., 1992).
Howard Schuman & Stanley Presser, Questions and Answers in Attitude Surveys:
Experiments on Question Form, Wording and Context (1981).
Monroe G. Sirken, Douglas J. Herrmann, Susan Schechter, Norbert Schwarz,
Judith M. Tanur, & Roger Tourangeau, Cognition and Survey Research
(1999).
Seymour Sudman, Applied Sampling (1976).
Survey Nonresponse (Robert M. Groves, Don A. Dillman, John L. Eltinge, &
Roderick J. A. Little eds., 2002).
Telephone Survey Methodology (Robert M. Groves, Paul P. Biemer, Lars E.
Lyberg, James T. Massey, & William L. Nicholls eds., 1988).
Roger Tourangeau, Lance J. Rips, & Kenneth Rasinski, The Psychology of
Survey Response (2000).
Reference Guide on
Estimation of Economic Damages
MARK A. ALLEN, ROBERT E. HALL, AND VICTORIA A. LAZEAR
Mark Allen, J.D., is Senior Consultant at Cornerstone Research, Menlo Park, California.
Robert Hall, Ph.D., is Robert and Carole McNeil Hoover Senior Fellow and Professor of
Economics, Stanford University, Stanford, California.
Victoria Lazear, M.S., is Vice President at Cornerstone Research, Menlo Park, California.
CONTENTS
I. Introduction, 429
II. Damages Experts’ Qualifications, 431
III. The Standard General Approach to Quantification of Damages, 432
A. Isolating the Effect of the Harmful Act, 432
B. The Damages Quantum Prescribed by Law, 433
C. Is There Disagreement About What Legitimate Conduct of the
Defendant Should Be Hypothesized in Projecting the Plaintiff’s
Earnings but for the Harmful Event? 439
D. Does the Damages Analysis Consider All the Differences in the
Plaintiff’s Situation in the But-For Scenario, or Does It Assume That
Many Aspects Would Be the Same as in Actuality? 440
IV. Valuation and Damages, 443
V. Quantifying Damages Using a Market Approach Based on Prices or
Values, 444
A. Is One of the Parties Using an Appraisal Approach to the
Measurement of Damages? 445
B. Are the Parties Disputing an Adjustment of an Appraisal for Partial
Loss? 445
C. Is One of the Parties Using the Assets and Liabilities Approach? 446
D. Are the Parties Disputing an Adjustment for Market Frictions? 446
E. Is One of the Parties Relying on Hypothetical Property in Its
Damages Analysis? 447
F. What Complications Arise When Anticipation of Damages Affects
Market Values? 448
VI. Quantifying Damages as the Sum of Discounted Lost Cash Flows, 448
A. Is There Disagreement About But-For Revenues in the Past? 449
B. Is There Disagreement About the Costs That the Plaintiff Would
Have Incurred but for the Harmful Event? 449
C. Is There Disagreement About the Plaintiff’s Actual Revenue After
the Harmful Event? 450
D. What Is the Role of Inflation? 451
1. Do the parties use constant dollars for future losses, or are such
losses stated in future dollars whose values will be diminished
by inflation? 451
2. Are the parties using a discount rate properly matched to the
projection? 452
3. Is one of the parties assuming that discounting and earnings
growth offset each other? 453
E. Are Losses Measured Before or After the Plaintiff’s Income Taxes? 454
F. Is There a Dispute About the Costs of Stock Options? 456
G. Is There a Dispute About Prejudgment Interest? 457
H. Is There Disagreement About the Interest Rate Used to Discount
Future Lost Value? 459
I. Is One of the Parties Using a Capitalization Factor? 459
VII. Limitations on Damages, 461
A. Is the Defendant Arguing That Plaintiff’s Damages Estimate Is Too
Uncertain and Speculative? 461
B. Are the Parties Disputing the Remoteness of Damages? 463
C. Are the Parties Disputing the Plaintiff’s Efforts to Mitigate Its
Losses? 464
D. Are the Parties Disputing Damages That May Exceed the Cost of
Avoidance? 466
E. Are the Parties Disputing a Liquidated Damages Clause? 467
VIII. Other Issues Arising in General in Damages Measurement, 468
A. Damages for a Startup Business, 468
1. Is the defendant challenging the fact of economic loss? 468
2. Is the defendant challenging the use of the expected value
approach? 468
3. Are the parties disputing the relevance and validity of the data
on the value of a startup? 469
B. Issues Specific to Damages from Loss of Personal Income, 470
1. Calculating losses over a person’s lifetime, 470
2. Calculation of fringe benefits, 471
3. Wrongful death, 473
4. Shortened life expectancy, 474
5. Damages other than lost income, 474
C. Damages with Multiple Challenged Acts: Disaggregation, 475
D. Is There a Dispute About Whether the Plaintiff Is Entitled to All the Damages? 477
E. Are the Defendants Disputing the Apportionment of Damages Among Themselves? 479
1. Are the defendants disputing apportionment among themselves despite full information about their roles in the harmful event? 479
2. Are the defendants disputing the apportionment because the wrongdoer is unknown? 480
F. Is There Disagreement About the Role of Subsequent Unexpected Events? 480
IX. Data Used to Measure Damages, 482
A. Types of Data, 482
1. Electronic data, 482
2. Paper data, 482
3. Sampling data, 482
4. Survey data, 483
B. Are the Parties Disputing the Validity of the Data? 483
1. Criteria for determining validity of data, 484
2. Quantitative methods for validation, 485
C. Are the Parties Disputing the Handling of Missing Data? 485
X. Standards for Disclosing Data to Opposing Parties, 486
A. Use of Formats, 487
B. Data Dictionaries, 487
C. Resolution of Problems, 488
D. Special Masters and Neutral Experts, 489
XI. Damages in Class Actions, 489
A. Class Certification, 489
B. Classwide Damages, 489
C. Damages of Individual Class Members, 490
D. Have the Defendant and the Class’s Counsel Proposed a Fair Settlement? 490
XII. Illustrations of General Principles, 491
A. Claim for Lost Personal Income, 491
1. Is there a dispute about projected earnings but for the harmful
event? 492
2. Are the parties disputing the valuation of benefits? 492
3. Is there disagreement about how earnings should be discounted
to present value? 495
4. Is there disagreement about subsequent unexpected events? 495
5. Is there disagreement about retirement and mortality? 495
6. Is there a dispute about mitigation? 496
7. Is there disagreement about how the plaintiff’s career path
should be projected? 496
B. Lost Profits for a Business, 497
1. Is there a dispute about projected revenues? 498
2. Are the parties disputing the calculation of marginal costs? 499
3. Is there a dispute about mitigation? 499
4. Is there disagreement about how profits should be discounted to present value? 500
5. Is there disagreement about subsequent unexpected events? 500
Glossary of Terms, 501
I. Introduction
This reference guide identifies areas of dispute that arise when economic losses
are at issue in a legal proceeding. Our focus is on explaining the issues in these
disputes rather than taking positions on their proper resolutions. We discuss the
application of economic analysis within established legal frameworks for damages. We cover topics in economics that arise in measuring damages and provide
citations to cases to illustrate the principles and techniques discussed in the text.
We begin by discussing the qualifications required of experts who quantify
damages. We then set forth the standard general approach to damages quantification, with particular focus on defining the harmful event and the alternative, often
called the but-for scenario. In principle, the difference between the plaintiff’s
economic value in the but-for scenario and in actuality measures the loss caused
by the harmful act of the defendant. We then consider damages estimation for two
cases: (1) a discrete loss of market value and (2) the loss of a flow of income over
time, where damages are the discounted value of the lost cash flow. Other topics
include the role of inflation, issues relating to income taxes and stock options,
adjustments for the time value of money, legal limitations on damages, damages
for a new business, disaggregation of damages when there are multiple challenged
acts, the role of random events occurring between the harmful act and trial, data
for damages measurement, standards for disclosing data to opposing parties, special
masters and neutral experts, liquidated damages, damages in class actions, and lost
earnings.1
Our discussion follows the structure of the standard damages study, as shown
in Figure 1. Damages quantification operates on the premise that the defendant
is liable for damages from the defendant’s harmful act. The plaintiff is entitled to
recover monetary damages for losses occurring before and possibly after the time
of the trial. The top line of Figure 1 measures the losses before trial; the bottom
line measures the losses after trial.2
The goal of damages measurement is to find the plaintiff’s loss of economic
value from the defendant’s harmful act. The loss of value may have a one-time
character, such as the diminished market value of a business or property, or it may
take the form of a reduced stream of profit or earnings. The losses are net of any
costs avoided because of the harmful act.
1. For a discussion of specific issues relating to estimating damages in antitrust, intellectual property, and securities litigation, see Mark A. Allen et al., Estimation of Economic Damages in Antitrust, Intellectual Property, and Securities Litigation (June 2011), available at http://www.stanford.edu/~rehall/DamagesEstimation.pdf.
2. Our scope here is limited to losses of actual dollar income. However, economists sometimes have a role in the measurement of nondollar damages, including pain and suffering and the hedonic value of life. See generally W. Kip Viscusi, Reforming Products Liability (1991).

Figure 1. Standard format for a damages study.

[Earnings before trial, had the harmful event not occurred] − [Actual earnings before trial] + [Prejudgment interest] = Damages before trial
[Projected earnings after trial, had the harmful event not occurred] − [Projected earnings after trial] − [Discounting] = Damages after trial
Damages before trial + Damages after trial = Total damages

The essential elements of a study of losses are the quantification of the reduction in economic value, the calculation of interest on past losses, and the application of financial discounting to future losses. The losses are the difference
between the value the plaintiff would have received if the harmful event had not
occurred and the value the plaintiff has or will receive, given the harmful event.
The plaintiff may be entitled to interest for losses occurring before trial. Losses
occurring after trial are usually discounted to the time of trial. The plaintiff may
be due interest on the judgment from the time of trial to the time the defendant
actually pays. The majority of damages studies fit this format; thus, we have used
such a format as the basic model for this reference guide.
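The arithmetic of the Figure 1 format can be made concrete in a short sketch. Everything below is illustrative only: the function name, the annual-period convention, and the use of simple compound interest and discounting are our assumptions, not a method the manual prescribes.

```python
def damages_study(butfor_pretrial, actual_pretrial, prejudgment_rate,
                  butfor_posttrial, projected_posttrial, discount_rate):
    """Damages in the Figure 1 format, stated as of the trial date.

    The pretrial lists hold annual earnings ending in the trial year;
    the posttrial lists hold projected annual earnings after trial.
    """
    # Losses before trial accrue prejudgment interest up to the trial date.
    n = len(butfor_pretrial)
    pretrial = sum(
        (b - a) * (1 + prejudgment_rate) ** (n - 1 - t)
        for t, (b, a) in enumerate(zip(butfor_pretrial, actual_pretrial))
    )
    # Losses after trial are discounted back to the trial date.
    posttrial = sum(
        (b - a) / (1 + discount_rate) ** t
        for t, (b, a) in enumerate(zip(butfor_posttrial, projected_posttrial), 1)
    )
    return pretrial, posttrial, pretrial + posttrial
```

For example, pretrial losses of $40,000 (one year before trial, at 5% prejudgment interest) and $30,000 (trial year), plus one posttrial year of $20,000 discounted at 10%, give total damages of roughly $42,000 + $30,000 + $18,182 ≈ $90,182.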
We use numerous brief examples to explain the disputes that can arise.
These examples are not full case descriptions; they are deliberately stylized. They
attempt to capture the types of disagreements about damages that arise in practical
experience, although they are purely hypothetical. In many examples, the dispute
involves factual as well as legal issues. We do not try to resolve the disputes in
these examples and hope that the examples will help clarify the legal and factual
disputes that need to be resolved before or at trial. We introduce many areas of
potential dispute with a question, because asking the parties these questions can
identify and clarify the majority of disputes over economic damages.
The reader with limited experience in the economic analysis of damages may
find it most helpful to begin with Sections II and III and then read Section XII.A,
which provides a straightforward application of the principles. Sections IV, V,
and VI may be particularly helpful for readers knowledgeable in accounting and
valuation. The other sections discuss specific issues relating to damages, and some
readers may find it useful to review only those specific to their needs. Section XII.B
discusses an application of some of these more specific issues in the context of a
damages analysis for a business.
II. Damages Experts’ Qualifications
Experts who quantify damages come from a variety of backgrounds. The expert
should be trained and experienced in quantitative analysis. For economists, the
common qualification is the Ph.D. Damages experts with business or accounting
backgrounds often have M.B.A. degrees or other advanced degrees, or C.P.A.
credentials. Both the method used and the substance of the damages claim dictate
the specific areas of specialization the expert needs. In some cases, participation
in original research and authorship of professional publications may add to the
qualifications of an expert. However, relevant research and publications are not
likely to be on the topic of damages measurement per se but rather on topics and
methods encountered in damages analysis. For example, a damages expert may
need to restate prices and quantities for a but-for market with more sellers than
are present in the actual market. For an expert undertaking this task, direct participation in research on the relation between market structure and performance
would be helpful.
Many damages studies use statistical regression analysis.3 Specific training is
required to apply regression analysis. Damages studies sometimes use field surveys.4 In this case, the damages expert should be trained in survey methods or
should work in collaboration with a qualified survey statistician. Because damages
estimation often makes use of accounting records, most damages experts need to
be able to interpret materials prepared by professional accountants. Some damages
issues may require assistance from a professional accountant.
Experts also benefit from professional training and experience in areas relevant
to the substance of the damages claim. For example, in antitrust, a background
in industrial organization may be helpful; in securities damages, a background in
finance may assist the expert; and in the case of lost earnings, an expert may benefit
from training in labor economics.
An analysis by even the most qualified expert may face a challenge under the
criteria associated with the Daubert and Kumho cases.5 These criteria are intended
to exclude testimony based on untested and unreliable theories. Relatively few economists serving as damages experts succumb to Daubert challenges, because most damages analyses operate in the familiar territory of measuring economic values using a combination of professional judgment and standard tools. But the circumstances of each damages analysis are unique, and a party may raise a Daubert challenge based on the proposition that the tools have never before been applied to these circumstances. Even if a Daubert challenge fails, it can be an effective way for the opposing party to probe the damages analysis prior to trial.

3. For a discussion of regression analysis, see generally Daniel L. Rubinfeld, Reference Guide on Multiple Regression, in this manual.
4. For a discussion of survey methods, see generally Shari Seidman Diamond, Reference Guide on Survey Research, in this manual.
5. Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579 (1993); Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999). For a discussion of emerging standards of scientific evidence, see Margaret A. Berger, The Admissibility of Expert Testimony, Section IV, in this manual.
III. The Standard General Approach to
Quantification of Damages
In this section, we review the elements of the standard loss measurement in the
format of Figure 1. For each element, there are several areas of potential dispute.
The sequence of issues discussed here should identify most of the areas of disagreement between the damages analyses of opposing parties.
A. Isolating the Effect of the Harmful Act
The first step in a damages study is the translation of the legal theory of the harmful event into an analysis of the economic impact of that event. In most cases,
the analysis considers the difference between the plaintiff’s economic position if
the harmful event had not occurred and the plaintiff’s actual economic position.
In almost all cases, the damages expert proceeds on the hypothesis that the
defendant committed the harmful act and that the act was unlawful. Accordingly,
throughout this discussion, we assume that the plaintiff is entitled to compensation
for losses sustained from a harmful act of the defendant. The characterization of
the harmful event begins with a clear statement of what occurred. The characterization also will include a description of the defendant’s proper actions in place
of its unlawful actions and a statement about the economic situation absent the
wrongdoing, with the defendant’s proper actions replacing the unlawful ones (the
but-for scenario). Damages measurement then determines the plaintiff’s hypothetical value in the but-for scenario. Economic damages are the difference between
that value and the actual value that the plaintiff achieved.
Because the but-for scenario differs from what actually happened only with
respect to the harmful act, damages measured in this way isolate the loss of value
caused by the harmful act and exclude any change in the plaintiff’s value arising from other sources. Thus, a proper construction of the but-for scenario and
measurement of the hypothetical but-for plaintiff’s value by definition includes
in damages only the loss caused by the harmful act. The damages expert using
the but-for approach does not usually testify separately about the causal relation
between damages and the harmful act, although variations may occur where there
are issues about the directness of the causal link.
B. The Damages Quantum Prescribed by Law
In most cases, the law prescribes a damages measure that falls into one of the
following five categories:
• Expectation: Plaintiff restored to the same financial position as if the defendant had performed as promised.
• Reliance: Plaintiff restored to the same position as if the relationship with
the defendant or the defendant’s misrepresentation (and resulting harm)
had not existed in the first place.
• Restitution: Plaintiff compensated by the amount of the defendant’s gain
from the unlawful conduct, also called compensation for unjust enrichment, disgorgement of ill-gotten gains, or compensation for unbargained-for benefits.6
• Statutory: Plaintiff’s compensation is a set amount per occurrence of
wrongdoing. This occurs in cases involving violations of state labor codes
and in copyright infringement.
• Punitive: Compensation rewards the plaintiff for detecting and prosecuting
wrongdoing to deter similar future wrongdoing.
Expectation damages7 often apply to breach of contract claims, where the
wrongdoing is the failure to perform as promised, and the but-for scenario
hypothesizes the absence of that wrongdoing, that is, proper performance by the
defendant. Expectation damages are an amount sufficient to give the plaintiff the
same economic value the plaintiff would have received if the defendant had fulfilled the promise or bargain.8
6. Courts and commentators often subsume unjust enrichment in defining restitution. Professor Farnsworth, for example, states: “[T]he object of restitution is not the enforcement of a promise,
but rather the prevention of unjust enrichment. . . . The party in breach is required to disgorge what
he has received in money or services. . . .” See, e.g., E. Allen Farnsworth, Contracts § 12.1, at 814
(1982). However, others have argued that restitution and unjust enrichment are different concepts. See,
e.g., James J. Edelman, Unjust Enrichment, Restitution, and Wrongs, 79 Tex. L. Rev. 1869 (2001); Peter
Birks, Unjust Enrichment and Wrongful Enrichment, 79 Tex. L. Rev. 1767 (2001); and Emily Sherwin,
Restitution and Equity: An Analysis of the Principle of Unjust Enrichment, 79 Tex. L. Rev. 2083 (2001).
Judge Posner discusses restitution (defined as returning the breaching party’s profits from the breach)
in relation to contract damages and unjust enrichment (defined as compensation for unbargained-for
benefits) in connection with implied contracts. See Richard A. Posner, Economic Analysis of Law 130,
151 (1998). See also Restatement (Third) of Restitution and Unjust Enrichment (2011).
7. See John R. Trentacosta, Damages in Breach of Contract Cases, 76 Mich. Bus. J. 1068, 1068
(1997) (describing expectation damages as damages that place the injured party in the same position
as if the breaching party completely performed the contract); Bausch & Lomb, Inc. v. Bressler, 977
F.2d 720, 728–29 (2d Cir. 1992) (defining expectation damages as damages that put the injured party
in the same economic position the party would have enjoyed if the contract had been performed).
8. See Restatement (Second) of Contracts § 344 cmt. a (1981). Expectation has been called “a queer kind of ‘compensation,’” because it gives the promisee something it never had, i.e., the benefit of its bargain. L.L. Fuller & William R. Perdue, Jr., The Reliance Interest in Contract Damages: 1, 46 Yale L.J. 52, 53 (1936). The policy underlying expectation damages is that they promote and facilitate reliance on business agreements. Id. at 61–62.

Reliance damages generally apply to torts and to some contract breaches. Such damages restore the plaintiff to the same financial position it would have enjoyed absent the defendant’s conduct and, in the case of torts, may include compensation for nonpecuniary losses such as pain and suffering.9 Reliance most often includes out-of-pocket costs, but may also include compensation for lost opportunities, when appropriate. In such cases, reliance damages may approach expectation damages. For a tort, reliance damages place the plaintiff in a position economically equivalent to the position absent the harmful act.10 For a breach of contract, measuring damages as the amount of compensation needed to place the plaintiff in the same position as if the contract had not been made in the first place will result in refunding the part of the plaintiff’s reliance investment that cannot be recovered in other ways.11 Thus, reliance damages may be appropriate when the plaintiff made an investment relying on the defendant’s performance.
9. Generally, the objective of reliance damages is to put the promisee or nonbreaching party back
to the position in which it would have been had the promise not been made. See E. Allan Farnsworth,
Legal Remedies for Breach of Contract, 70 Colum. L. Rev. 1145, 1148 (1979). See also Restatement
(Second) of Contracts § 344(b). Reliance damages include expenditures made in preparation for performance and performance itself. Restatement (Second) of Contracts § 349.
10. See, e.g., East River Steamship Corp. v. Transamerica Delaval Inc., 476 U.S. 858, 873 n.9
(1986) (“tort damages generally compensate the plaintiff for loss and return him to the position he
occupied before the injury”). The compensatory goal of tort damages is to make the plaintiff whole as
nearly as possible through an award of money damages. See Randall R. Bovbjerg et al., Valuing Life and
Limb in Tort: Scheduling “Pain and Suffering,” 83 Nw. U. L. Rev. 908, 910 (1989); John C.P. Goldberg,
Two Conceptions of Tort Damages: Fair v. Full Compensation, 5 DePaul L. Rev. 435 (2006). Often, the
damages expert is not asked to provide guidance relating to estimating damages for nonpecuniary losses
such as pain and suffering. However, hedonic analysis may sometimes be used.
11. Economists and legal scholars have debated contract damages and the concepts of expectation
and reliance for decades. Fuller and Perdue’s definition of reliance included the plaintiff’s foregone
lost opportunities in addition to his expenditures. But courts that award reliance damages typically
award only out-of-pocket expenditures. See, e.g., Michael B. Kelly, The Phantom Reliance Interest in
Contract Damages, 1992 Wis. L. Rev. 1755, 1771 (1992). Farnsworth has suggested that this is most
likely explained by difficulties in damages proof rather than any rule excluding lost opportunities from
reliance damages—that is, that the reason for barring the expectation measure (most often lack of proof
of damages with reasonable certainty) will apply equally to bar lost opportunities. E Allan Farnsworth,
Precontractual Liability and Preliminary Agreements: Fair Dealing and Failed Negotiations, 87 Colum. L.
Rev. 217, 225 (1987). Reliance damages including lost opportunities may be awarded in cases where
the expectation is unavailable because the agreement is illusory or too indefinite to be enforceable.
See, e.g., Grouse v. Group Health Plan, Inc., 306 N.W.2d 114 (Minn. 1981), where the plaintiff
employee resigned one job and turned down the offer of another in reliance on defendant’s promise
of employment, but the promised employment would have been at will. The court stated that the
proper measure of damages was not what the plaintiff would have earned in his employment with the
defendant, but what he lost in quitting his job and turning down an additional offer of employment.
Id. at 116. Finally, we note that in a competitive market, reliance damages including lost opportunities are generally equivalent to expectation damages. See, e.g., Robert Cooter & Melvin Aron Eisenberg, Damages for Breach of Contract, 73 Cal. L. Rev. 1432, 1445 (1985).

Example: Agent contracts with Owner for Agent to sell Owner’s farm. The asking price is $1,000,000, and the agreed fee is 6%. Agent incurs costs of $1,000 in listing the property. A potential buyer offers the asking price, but Owner withdraws the listing. Agent calculates damages as $60,000, the agreed fee for selling the property. Owner calculates damages as $1,000, the amount that Agent spent to advertise the property.

Comment: Under the expectation remedy, Agent is entitled to $60,000, the fee for selling the property. However, Agent has only partly performed under the contract, and thus it may be appropriate to limit damages to $1,000. Some states limit recovery in this situation by law to the $1,000, the reliance measure of damages, unless the property is actually sold.12

Restitution damages13 are often the same, from the perspective of quantification, as reliance damages. If the only loss to the plaintiff from the defendant’s harmful act arises from an expenditure that the plaintiff made that cannot otherwise be recovered, the plaintiff receives compensation equal to the amount of that expenditure.14

Interesting and often difficult issues arise in cases that involve elements of both contract and tort. Consider a contract for a product that turns out to be defective. Generally, under what has become known as the economic loss rule, if the defective product causes only economic or commercial loss, the dispute is a private matter between the parties, and the contract will likely control their dispute. But if the product causes personal injury or property damage (other than to the product itself), then tort law and tort damages will likely control.15
12. Compare Hollinger v. McMichael, 177 Mont. 144, 580 P.2d 927, 929 (1978) (broker earned
his commission when he “procured a purchaser able, ready and willing to purchase the seller’s property”) with Ellsworth Dobbs, Inc. v. Johnson, 50 N.J. 528, 236 A.2d 843, 855 (1967) (broker earns
commission only when the transaction is completed by closing the title in accordance with the provisions of the contract). See generally Steven K. Mulliken, When Does the Seller Owe the Broker a Commission? A Discussion of the Law and What It Teaches About Listing Agreements, 132 Mil. L. Rev. 265 (1991).
13. The objective of restitution damages is to put the promisor or breaching party back in the
position in which it would have been had the promise not been made. Note the traditional legal
distinction between restitution and reliance damages: Reliance damages seek to put the promisee or
nonbreaching party back in the position in which it would have been if the promise had not been
made. See E. Allan Farnsworth, Legal Remedies for Breach of Contract, 70 Colum. L. Rev. 1145, 1148
(1979). Both measures seek to restore the status quo ante. See also Restatement (Third) of Restitution
and Unjust Enrichment (2011).
14. See Restatement (Second) of Contracts § 344(c).
15. Judge Posner has advocated using the term “commercial” rather than “economic” loss
because, since personal injuries and property losses destroy values that can be monetized, they are
economic losses also. See Miller v. United States Steel Corp., 902 F.2d 573, 574 (7th Cir. 1990). See
generally Dan B. Dobbs, An Introduction to Non-Statutory Economic Loss Claims, 48 Ariz. L. Rev. 713 (2006); Richard A. Posner, Common-Law Economic Torts: An Economic and Legal Analysis, 48 Ariz. L. Rev. 735 (2006).

Fraud actions can present particularly difficult problems. For example, if the claim is that the defendant fraudulently induced the plaintiff to enter into an agreement that caused purely commercial losses, the economic loss rule may apply to limit the plaintiff’s recovery to only commercial losses for breach of contract, and thus not allow recovery of additional damages recoverable under fraud, such as punitive damages. Generally, courts have taken three approaches to this problem. Some courts have found that the economic loss rule applies to bar the tort claim completely, so that the plaintiff can proceed only under a breach of contract theory. Other courts have found that fraud is an exception to the economic loss doctrine, allowing fraud actions to proceed. A third approach allows a separate fraud action, but only if the fraud is “independent of” or “extraneous to” the contract promises.16

A plaintiff asserting fraud can generally recover either out-of-pocket costs or expectation damages,17 but courts today more commonly award expectation damages to place the plaintiff in the position it would have occupied had the fraudulent statement been true.18 Where the court interprets the fraudulent statement as an actual warranty, the appropriate remedy is expectation damages. Courts, though, have awarded expectation damages even when the fraudulent statement is not interpreted as an actual warranty. Some of these cases may be situations where a contract exists but is legally unenforceable for technical reasons.

As an alternative, the but-for analysis may consider the value the plaintiff would have received in the absence of the economically detrimental relationship created by the fraud. In this case, the but-for analysis for fraud may adopt the premise that the plaintiff would have entered into a valuable relationship with an entity other than the defendant. For example, if the defendant’s misrepresentations have caused the plaintiff to purchase property unsuited to the plaintiff’s planned use, then the but-for analysis might consider the value that the plaintiff would have received by purchasing a suitable property from another seller.19
16. See, e.g., Dan B. Dobbs, An Introduction to Non-Statutory Economic Loss Claims, 48 Ariz. L.
Rev. 713, 728–30 (2006); Ralph C. Anzivino, The Fraud in the Inducement Exception to the Economic
Loss Doctrine, 90 Marq. L. Rev. 921, 931–36 (2007); Richard A. Posner, Common-Law Economic Torts:
An Economic and Legal Analysis, 48 Ariz. L. Rev. 735 (2006); R. Joseph Barton, Note: Drowning in a
Sea of Contract: Application of the Economic Loss Rule to Fraud and Negligent Misrepresentation Claims, 41
Wm. & Mary L. Rev. 1789 (2000). See also Marvin Lumber and Cedar Co. v. PPG Industries, 34 F.
Supp. 2d 738 (D.C. Minn. 1999) aff’d, 223 F.3d 873 (7th Cir. 2000) (economic loss doctrine barred
fraud claim of merchant against manufacturer where facts supporting such claim were not independent
of those supporting its UCC contract claims).
17. See Restatement (Second) of Torts § 549 (1974). Under the Restatement, expectation damages are available only to “the recipient of a fraudulent misrepresentation in a business transaction,”
and only for intentional, not negligent, misrepresentation. Id. §§ 549(2), 552.
18. See, e.g., Richard Craswell, Against Fuller and Perdue, 67 U. Chi. L. Rev. 99, 148 (2000).
19. This measure is equivalent to the reliance interest with recovery for lost opportunities, which
can approach expectation damages. See supra note 11.
Plaintiffs cannot normally seek punitive damages in a claim for breach of
contract,20 but may seek them in addition to compensatory damages in connection
with a tort claim. Although punitive damages are rarely the subject of expert testimony, economists have advanced the concept that punitive damages compensate a
plaintiff who brings a case for a wrongdoing that is hard to detect or hard to prosecute. Thus, under this concept, punitive damages should be calculated so that the expected recovery for a randomly chosen victim equals the victim’s loss: actual damages are multiplied by a factor equal to the reciprocal of the probability of both detecting the harmful act and prosecuting the wrongdoer.21
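The reciprocal rule can be sketched in a few lines. The enforcement probability and dollar figures below are hypothetical; only the multiplication by the reciprocal of the enforcement probability comes from the concept described above.

```python
def total_award(actual_damages, p_enforcement):
    """Scale actual damages by the reciprocal of the probability that the
    harmful act is both detected and prosecuted."""
    return actual_damages / p_enforcement

def punitive_component(actual_damages, p_enforcement):
    """The portion of the award in excess of compensatory damages."""
    return total_award(actual_damages, p_enforcement) - actual_damages
```

If only one victim in four detects the wrongdoing and sues (p = 0.25), a $100,000 loss supports a $400,000 total award, of which $300,000 is punitive; the wrongdoer's expected payout per victim is then 0.25 × $400,000 = $100,000, the victim's loss.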
In some situations, the plaintiff may have a choice of remedies under different
legal theories. For example, in determining damages for fraud in connection with
a contract, damages may be awarded under tort law for deceit or under contract
law for breach.22
Example: Buyer purchases a condominium from Owner for $900,000. However,
the condominium is known by the Owner to be worth only $800,000
at the time of sale because of defects. Buyer chooses to compute damages under the expectation measure of damages as $100,000 and to
retain the condominium. Owner computes damages under the reliance
measure owed to Buyer as $900,000 and also seeks the return of the
condominium to Owner, despite the fact that the condominium is
now worth $1,200,000.
Comment: Owner’s application of the reliance remedy is incomplete. Absent the
fraud, Buyer would have purchased another condominium and enjoyed
the general appreciation in the market. Thus, correctly applied, the
two measures are likely to be similar.
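The comment can be checked numerically. This sketch is our reconstruction of the example; in particular, the assumption that a comparable non-defective condominium would have appreciated at the same 50% market rate as the defective unit is ours, not stated in the example.

```python
price_paid = 900_000      # what Buyer paid Owner
value_at_sale = 800_000   # true value at sale, given the defects
appreciation = 1.5        # assumed market factor: $800,000 grew to $1,200,000

# Expectation measure: the shortfall between the price paid and the
# value received at the time of sale.
expectation = price_paid - value_at_sale            # $100,000

# Reliance measure, correctly applied: but for the fraud, Buyer would have
# bought a comparable unit at a fair price and enjoyed market appreciation.
butfor_position = price_paid * appreciation         # $1,350,000
actual_position = value_at_sale * appreciation      # $1,200,000
reliance = butfor_position - actual_position        # $150,000
```

The two figures differ only by the appreciation applied to the $100,000 shortfall, so once interest or appreciation is taken into account the measures converge, as the comment suggests.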
20. Posner explains that most breaches are either involuntary, where performance is impossible
at a reasonable cost, or voluntary but efficient. The policy of contract law is not to compel adherence
to contracts, but only to require each party either to perform under the contract or compensate the
other party for any resulting injuries. See Richard A. Posner, Economic Analysis of Law, supra note 6, at
131. For an argument in favor of punitive damages in contracts, see William S. Dodge, The Case for
Punitive Damages in Contracts, 48 Duke L. J. 629 (1999).
21. See A. Mitchell Polinsky & Steven Shavell, Punitive Damages: An Economic Analysis, 111
Harv. L. Rev. 879 (1998).
22. This assumes that the economic loss rule does not apply. Generally, plaintiffs will prefer tort
remedies to contract remedies because such remedies are broader, affording the possibility of recovery for nonpecuniary losses and punitive damages. For fraud actions, most jurisdictions do not allow
recovery for nonpecuniary losses such as emotional distress, although some do if the distress is severe.
See, e.g., Nelson v. Progressive Corp., 976 P.2d 859, 868 (Alaska 1999). The Restatement advocates
restricting fraud recovery to pecuniary losses. See Restatement (Second) of Torts § 549.
A plaintiff may argue that a harmful act has caused significant losses for many
years. The defendant may reply that most of the losses that occurred from the
injury are the result of causes other than the harmful act. Thus, the defendant
may argue that the injury was caused by multiple factors, only one of which was the harmful act, or that the observed injury
over time was caused by subsequent events.
Example: Worker is the victim of a disease caused either by exposure to xerxium
or by smoking. Worker makes leather jackets tanned with xerxium.
Worker sues the producer of the xerxium, Xerxium Mine, and calculates damages as all lost wages. Defendant Xerxium Mine, in contrast,
attributes most of the losses to smoking and calculates damages as only
a fraction of lost wages.
Comment: The resolution of this dispute will turn on the legal question of comparative or contributory fault. If the law permits the division of damages into
parts attributable to exposure to xerxium and smoking, then medical
evidence on the likelihood of cause may be needed to make that division. We discuss this topic further in Section VIII.B. on disaggregation
of damages.
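If the law permits division, the resulting computation is a simple apportionment. The 30% attribution share below is a hypothetical placeholder for what medical evidence on the likelihood of cause would have to establish.

```python
def apportioned_damages(lost_wages, share_attributable):
    """Portion of lost wages attributable to the defendant's conduct,
    given an attribution share established by causation evidence."""
    return lost_wages * share_attributable

# With $500,000 in lost wages and medical evidence attributing 30% of the
# disease risk to xerxium exposure, the defendant's share is $150,000.
```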
Example: Real Estate Agent is wrongfully denied affiliation with Broker. Agent’s
damages study projects past earnings into the future at the rate of
growth of the previous 3 years. Broker’s study projects that earnings
would have declined even without the breach because the real estate
market has turned downward.
Comment: The difference between a damages study based on extrapolation from
the past, here used by Agent, and a study based on actual data after the
harmful act, here used by Broker, is one of the most common sources
of disagreement in damages. This is a factual dispute that hinges on Broker demonstrating a relationship between real estate market conditions and the earnings of agents. The example also illustrates how subsequent unexpected events can affect damages calculations, discussed in Section VIII.E.
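The two competing projections in this example can be sketched as follows. The earnings series, market index, and function names are hypothetical illustrations of the two approaches, not a prescribed method.

```python
def growth_extrapolation(past_earnings, years_ahead):
    """Agent's approach: extend the average annual growth rate
    (geometric mean) of the past earnings series."""
    periods = len(past_earnings) - 1
    g = (past_earnings[-1] / past_earnings[0]) ** (1 / periods) - 1
    return [past_earnings[-1] * (1 + g) ** t for t in range(1, years_ahead + 1)]

def market_adjusted(last_earnings, market_index):
    """Broker's approach: scale the last observed earnings by an index of
    real estate market conditions (first entry is the base year)."""
    base = market_index[0]
    return [last_earnings * level / base for level in market_index[1:]]

# Three years of 10% annual growth extrapolate upward...
agent_view = growth_extrapolation([100_000, 110_000, 121_000], 2)
# ...while a declining market index projects lower but-for earnings.
broker_view = market_adjusted(121_000, [100, 90, 80])
```

The gap between the two series is exactly the kind of factual dispute the comment describes: the extrapolation continues the past trend, while the market-adjusted projection turns down with the index.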
Frequently, the defendant will calculate damages on the premise that the
harmful act had no causal relationship to the plaintiff’s losses—that is, that the
plaintiff’s losses would have occurred without the harmful act. The defendant’s
but-for scenario will thus describe a situation in which the losses happen anyway.
This is equivalent to arguing that the harmful act occurred but the plaintiff suffered no losses.
Example: Contractors conspired to rig bids in a construction deal. State seeks
damages for subsequent higher prices. Contractors’ damages estimate is
438
Copyright © National Academy of Sciences. All rights reserved.
Reference Manual on Scientific Evidence: Third Edition
Reference Guide on Estimation of Economic Damages
zero because they assert that the only effect of the bid rigging was to
determine the winner of the contract and that prices were not affected.
Comment: This is a factual dispute about how much effect bid rigging has on the
ultimate price. The analysis should go beyond the mechanics of the bid-rigging system to consider how the bids would be different had there
been no collaboration among the bidders.
The defendant may also argue that the plaintiff has overstated the scope of
the harmful act. Here, the legal character of the harmful act may be critical; the
law may limit the scope to proximate effects if the harmful act was negligence,
but may require a broader scope if the harmful act was intentional.23
Example: Plaintiff Drugstore Network experiences losses because defendant
Superstore priced its products predatorily. Drugstore Network reduced
prices in all its stores because it has a policy of uniform national pricing.
Drugstore Network’s damages study considers the entire effect of
national price cuts on profits. Defendant Superstore argues that Network should have lowered prices only on the West Coast and its price
reductions elsewhere should not be included in damages.
Comment: Whether adherence to a policy of national pricing is the reasonable
response to predatory pricing in only part of the market is a question
of fact.
C. Is There Disagreement About What Legitimate Conduct of
the Defendant Should Be Hypothesized in Projecting the
Plaintiff’s Earnings but for the Harmful Event?
One party’s damages analysis may hypothesize the absence of any act of the
defendant that influenced the plaintiff, whereas the other’s damages analysis may
hypothesize an alternative, legal act. This type of disagreement is particularly common in antitrust and intellectual property disputes. Although disagreement over
the alternative scenario in a damages study is generally a legal question, opposing
experts may have been given different legal guidance and therefore made different
economic assumptions, resulting in major differences in their damages estimates.
Example: Defendant Copier Service’s long-term contracts with customers are
found to be unlawful because they create a barrier to entry that maintains
Copier Service’s monopoly power. Rival’s damages study hypothesizes
23. See generally Prosser and Keeton on the Law of Torts § 65, at 462 (Prosser et al. 5th ed.,
1984). Dean Prosser states that simple negligence and intentional wrongdoing differ “not merely in
degree but in the kind of fault . . . and in the social condemnation attached to it.” Id.
no contracts between Copier Service and its customers, so Rival would
face no contractual barrier to bidding those customers away from Copier
Service. Copier Service’s damages study hypothesizes medium-term
contracts with its customers and argues that these would not have been
found to be unlawful. Under Copier Service’s assumption, Rival would
have been much less successful in bidding away Copier Service’s customers, and damages are correspondingly lower.
Comment: Assessment of damages will depend greatly on the substantive law
governing the injury. The proper characterization of Copier Service’s
permissible conduct usually is an economic issue. However, the expert
must also have legal guidance as to the proper legal framework for damages. Counsel for plaintiff may instruct plaintiff’s damages expert to use
a different legal framework from that of counsel for the defendant.
D. Does the Damages Analysis Consider All the Differences
in the Plaintiff’s Situation in the But-For Scenario, or
Does It Assume That Many Aspects Would Be the Same
as in Actuality?
The analysis of some types of harmful events requires consideration of effects, such
as price erosion,24 that involve changes in the economic environment caused by
the harmful event. For a business, the main elements of the economic environment that may be affected by the harmful event are the prices charged by rivals,
the demand facing the seller, and the prices of inputs. For example, misappropriation of intellectual property can cause lower prices because products produced
with the misappropriated intellectual property compete with products sold by the
owner of the intellectual property. In contrast, some harmful events do not change
the plaintiff’s economic environment. The theft of some of the plaintiff’s products
would not change the market price of those products, nor would an injury to a
worker change the general level of wages in the labor market. A damages study
need not analyze changes in broader markets when the harmful act plainly has
minuscule effects in those markets. The plaintiff may assert that, absent the defendant’s wrongdoing, a higher price could have been charged and therefore that the
defendant’s harmful act has eroded the market price. The defendant may reply
that the higher price would lower the quantity sold. The parties may then dispute
how much the quantity would fall as a result of higher prices.
24. See, e.g., General Am. Transp. Corp. v. Cryo-Trans, Inc., 897 F. Supp. 1121, 1123–24 (N.D.
Ill. 1995), modified, 93 F.3d 766 (Fed. Cir. 1996); Rawlplug Co., Inc. v. Illinois Tool Works Inc., No.
91 Civ. 1781, 1994 WL 202600, at *2 (S.D.N.Y. May 23, 1994); Micro Motion, Inc. v. Exac Corp.,
761 F. Supp. 1420, 1430–31 (N.D. Cal. 1991) (holding in all three cases that the patentee is entitled
to recover lost profits due to past price erosion caused by the wrongdoer’s infringement).
Example: Valve Maker infringes patent of Rival. Rival calculates lost profits as
the profits Rival would have made plus a price-erosion effect. The
amount of price erosion is the difference between the higher price that
Rival would have been able to charge absent Valve Maker’s presence
in the market and the actual price. The price-erosion effect is that price
difference multiplied by the combined sales volume of Valve Maker
and Rival. Defendant Valve Maker counters that the volume would
have been lower had the price been higher and measures damages
using the lower volume.
Comment: Wrongful competition is likely to cause some price erosion25 and,
correspondingly, some enlargement of the total market because of
the lower price. The more elastic the demand, the lower the volume
would have been with a higher price. The actual magnitude of the
price-erosion effect could be determined by economic analysis.
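The two calculations in the example can be sketched numerically. The sketch below uses hypothetical prices, volumes, and an assumed constant price elasticity of demand of 1.5; none of these figures come from the manual, and a real analysis would estimate the elasticity from market data.

```python
# Sketch of the price-erosion dispute above, with hypothetical numbers.
# Rival claims the but-for price would have been higher; Valve Maker
# counters that volume falls when price rises (constant-elasticity demand).

def eroded_volume(actual_volume, actual_price, but_for_price, elasticity):
    """Volume at the higher but-for price under constant-elasticity demand.

    `elasticity` is the (positive) absolute price elasticity of demand;
    volume scales as (price ratio) ** -elasticity.
    """
    return actual_volume * (but_for_price / actual_price) ** -elasticity

actual_price, but_for_price = 100.0, 120.0   # per valve (hypothetical)
combined_volume = 50_000                     # Valve Maker + Rival units sold

# Plaintiff's calculation: price difference times actual combined volume.
plaintiff_erosion = (but_for_price - actual_price) * combined_volume

# Defendant's calculation: the same price difference, but applied to the
# lower volume implied by the assumed elasticity of 1.5.
volume_at_higher_price = eroded_volume(combined_volume, actual_price,
                                       but_for_price, elasticity=1.5)
defendant_erosion = (but_for_price - actual_price) * volume_at_higher_price

print(round(plaintiff_erosion))   # 1000000
print(round(defendant_erosion))   # smaller, because volume falls
```

The more elastic the assumed demand, the larger the gap between the two figures, which is why the elasticity estimate itself is usually contested.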
Price erosion is a common issue in quantifying intellectual property damages.
However, price erosion may be an issue in many other commercial disputes. For
example, a plaintiff may argue that the disparagement of its product due to false
advertising has eroded the product’s price.26
In more complicated situations, the damages analysis may need to focus on
how an entire industry would be affected by the defendant’s wrongdoing. For
example, one federal appeals court held that a damages analysis for exclusionary
conduct must consider that other firms besides the plaintiff would have enjoyed
the benefits of the absence of that conduct. Thus, prices would have been lower,
and the plaintiff’s profits correspondingly less than those posited in the plaintiff’s
damages analysis.27
Example: Computer Printer Maker has used unlawful means to exclude rival suppliers of ink cartridges. Rival calculates damages on the assumption that
it would have been the only additional seller in the market absent the
exclusionary conduct, and that Rival would have been able to sell its
cartridges at the same price actually charged by Printer Maker. Printer
Maker counters that other sellers would have entered the market and
driven the price down, and so Rival has overstated its damages.
25. See, e.g., Micro Motion, 761 F. Supp. at 1430 (citing Yale Lock Mfg. Co. v. Sargent, 117 U.S.
536, 553 (1886), in which the Micro Motion court stated that “[i]n most price erosion cases, a patent
owner has reduced the actual price of its patented product in response to an infringer’s competition”).
26. See, e.g., BASF Corp. v. Old World Trading Co., Inc., 41 F.3d 1081 (7th Cir. 1994) (finding that the plaintiff’s damages only consisted of lost profits before consideration of price erosion,
prejudgment interest, and costs due to the presence of other competitors who would keep prices low).
27. See Dolphin Tours, Inc. v. Pacifico Creative Servs., Inc., 773 F.2d 1506, 1512 (9th Cir.
1985).
Comment: Increased competition lowers price in all but the most unusual situations. Again, determination of the number of entrants attracted by
the elimination of exclusionary conduct and their effect on the price
probably requires a full economic analysis.
A comparison of the parties’ statements about the harmful event and the likely
impact of its absence will likely reveal differences in legal theories that can result
in large differences in damages claims.
Example: Client is the victim of unsuitable investment advice by Broker (all
of Client’s investments made by Broker are the result of Broker’s
negligence). Client’s damages study measures the sum of the losses
of the investments made by Broker, including only the investments
that incurred losses. Broker’s damages study measures the net loss by
including an offset for those investments that achieved gains.
Comment: Client is considering the harmful event to be the recommendation
of investments that resulted in losses, whereas Broker is considering
the harmful event to be the entire body of investment advice. Under
Client’s theory, Client would not have made the unsuccessful investments but would have made the successful ones, absent the unsuitable
advice. Under Broker’s theory, Client would not have made any
investments based on Broker’s advice.
A clear statement about the plaintiff’s situation but for the harmful event is
also helpful in avoiding double counting that can arise if a damages study confuses
or combines reliance28 and expectation damages.
Example: Marketer is the victim of defective products made by Manufacturer;
Marketer’s business fails as a result. Marketer’s damages study adds
together the out-of-pocket costs of creating the business in the
first place and the projected profits of the business had there been
no defects. Manufacturer’s damages study measures the difference
between the profit margin Marketer would have made absent the
defects and the profit margin Marketer actually made.
28. See Section III.B. Reliance damages are distinguished from expectation damages. Reliance damages are defined as damages that do not place the injured party in as good a position as if
the contract had been fully performed (expectation damages) but in the same position as if promises
were never made. Reliance damages reimburse the injured party for expenses incurred in reliance on
promises made. See, e.g., Satellite Broad. Cable, Inc. v. Telefonica de Espana, S.A., 807 F. Supp. 218
(D.P.R. 1992) (holding that under Puerto Rican law an injured party is entitled to reliance but not
expectation damages due to the wrongdoer’s willful and malicious termination or withdrawal from
precontractual negotiations).
Comment: Marketer has mistakenly added together damages from the reliance
principle and the expectation principle.29 Under the reliance principle,
Marketer is entitled to be put back to where it would have been had
it not started the business in the first place. Damages are total outlays
less the revenue actually received. Under the expectation principle, as
applied in Manufacturer’s damages study, Marketer is entitled to the
profit on the extra sales it would have received had there been no
product defects. Out-of-pocket expenses of starting the business would
have no effect on expectation damages because they would be present
in both the actual and the but-for cases and would offset each other in
the comparison of actual and but-for value.
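The double-counting point in the comment can be made concrete with hypothetical figures. In the sketch below, reliance and expectation are alternative measures of the same loss, and the startup outlays cancel out of the expectation comparison because they appear in both the actual and the but-for results.

```python
# Hypothetical figures for the Marketer example: reliance and expectation
# damages are alternative measures, not additive components.

startup_outlays = 500_000   # out-of-pocket cost of creating the business
actual_revenue  = 200_000   # revenue actually received before failure
but_for_profit  = 900_000   # projected profit had there been no defects
actual_profit   = actual_revenue - startup_outlays   # -300,000 as it happened

# Reliance principle: put Marketer back where it was before starting.
reliance_damages = startup_outlays - actual_revenue

# Expectation principle: but-for profit less actual profit.  The startup
# outlays are embedded in both profit figures, so they offset each other.
expectation_damages = but_for_profit - actual_profit

# Marketer's mistaken study adds outlays on top of projected profits,
# counting the outlays a second time.
double_counted = startup_outlays + but_for_profit
```

Under these figures reliance damages are $300,000 and expectation damages are $1,200,000; adding the outlays to the expectation figure would overstate the loss.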
IV. Valuation and Damages
Most damages measurements deal, one way or another, with the question of the
economic value of streams of profit or income. In this section, we introduce
some of the basic concepts of valuation. In the following two sections, we first
address market approaches that use current data on prices and values to estimate
value directly. Second, we address income approaches that start by estimating
future flows and then discounting them back to a reference date (often referred
to as discounting cash flows). The income approaches apply to losses of personal
earnings as well as to business losses from lost streams of profits or income, where
damages are calculated as the present value of a lost stream of earnings. Although
commonly called income approaches, the methods include discounting any form
of cash flow, such as revenues and costs as well as income, to arrive at an estimate
of damages.
The choice between the two types of approaches is a matter of expert judgment. In some cases, an expert will use both types of approaches. Much of our
discussion is stated in terms of business valuation, but the discussion also applies
to real estate and other assets.
Some of the ways experts implement a market approach, based on market
prices or values, to determine damages include
• Relying on comparables such as a similar business or property,
• Using balance sheet information such as assets and liabilities,
29. The injured party cannot recover both reliance and expectation damages if such recovery
would result in double counting. See, e.g., West Haven Sound Development Corp. v. City of West
Haven, 514 A.2d 734, 746–47 (Conn. 1986) (plaintiff could seek recovery of reliance expenditures
instead of lost profits, but not in addition to lost profits, because reliance expenditures were part of
the value of the business as a going concern). See also George M. Cohen, The Fault Lines in Contract
Damages, 80 Va. L. Rev. 1225, 1262 (1994).
• Using known ratios from valuing comparables to measure losses, and
• Multiplying existing valuations by changes in market values from publicly
available information.
Different methods that experts use to implement an income approach, based
on discounting cash flows, to determine damages include
• Projecting revenues and costs with and without the alleged bad act,
• Adjusting profit streams to present value using measures of inflation and
the real rate of interest, and
• Converting profit streams to present value implicitly using capitalization
rates.
Each approach presents challenges. The expert must identify the most appropriate
method and implement it properly.
Although these methods may seem different and may rely on different information about the firm, each should generate similar numbers. If not, then there
is usually an underlying difference in assumptions. Section V discusses the issues
and pitfalls frequently encountered when damages are computed from prices or
values, while Section VI discusses the issues and pitfalls frequently encountered
when damages are computed relying on discounted cash flows.
V. Quantifying Damages Using a Market
Approach Based on Prices or Values
An expert can sometimes measure damages as of the time of the wrongdoing
directly from market prices or values. For example, if the defendant’s negligence
causes the total destruction of the plaintiff’s cargo of wheat, worth $17 million at
the current market price, damages are simply that amount. The only task for the
expert is to restate the damages as the economic equivalent at time of trial, through
the calculation of prejudgment interest, a topic we consider in Section VI.G.
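Restating a past loss as its economic equivalent at trial amounts to compounding the loss forward at an interest rate. The sketch below applies this to the $17 million cargo figure from the example; the 5% rate, the three-year interval, and annual compounding are illustrative assumptions, since the legally required rate and compounding convention vary by jurisdiction.

```python
# Restating a loss at the time of harm to its equivalent at trial via
# compound prejudgment interest (simplified: annual compounding at a
# single assumed rate).

def value_at_trial(loss_at_harm, annual_rate, years):
    """Compound a past loss forward to the trial date."""
    return loss_at_harm * (1 + annual_rate) ** years

damages = value_at_trial(17_000_000, annual_rate=0.05, years=3)
print(round(damages))  # 19679625
```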
In many cases, the expert does not take a market price and apply it directly.
The price of the product or object at issue may not itself be known from a
market, but the expert can approximate the market value from the prices of
similar products or objects. Appraisers are experts whose task is to estimate
the fair market value of real estate, equipment, and works of art. Experts who
assess the value of businesses—some of whom specialize as business valuation
experts—often perform similar functions based on the known market values of
comparable businesses.
A. Is One of the Parties Using an Appraisal Approach to the
Measurement of Damages?
Damages analyses based on appraisals usually have two parts. The first is an
appraisal of the property, and the second is an application of that appraisal to
quantify the loss from the harmful act. The starting point for an appraisal is the
choice of comparable properties or businesses. For real estate, the comparables are
nearby similar properties. For businesses, the comparables are businesses similar in
as many ways as possible to the business at issue, based on characteristics such as
type of business, type of customers, size, type of location, and so forth. Only in
the case of publicly traded companies is there a known market value at virtually
all times. For real estate and private businesses, the comparables must have traded
hands at a known transaction price fairly recently. Numerous firms sell databases
of transaction prices and other data for the use of business valuation experts.
The second step in an appraisal is the adjustment of the comparables to account
for differences between each comparable business or property and the one at issue.
Business values are often restated as valuation ratios, such as the ratio of price to
revenue or to earnings. Real estate is restated as value per square foot of land or
interior space. Such ratios usually need to be specific to the type of business or real
estate. In particular, rapidly growing businesses and real estate in growing areas have
higher valuation ratios than those with zero or negative growth outlooks.
Example: Oil Company deprives Gas Station Operator of the benefits of Operator’s business. Oil Company’s damages study starts by calculating the
ratio of sales value to gasoline sales for five nearby gas station businesses
that have sold recently. The ratio is $0.26 per gallon of sales per year.
The Operator sells 1.6 million gallons per year, so the business was
worth $0.26 × 1,600,000 = $416,000, according to the Oil Company’s
expert. Gas Station Operator’s expert argues that the sales used by the
Oil Company occurred before a major business relocated nearby. Thus,
the ratio of sale value to gasoline sales should be increased to $0.30 to
reflect the expected increase in business. The expert calculates the business to be worth $0.30 × 1,600,000 = $480,000.
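The appraisal arithmetic in the example reduces to a valuation ratio taken from comparable sales and applied to the subject's volume. The comparable sales below are hypothetical, constructed to yield the $0.26 ratio used in the example.

```python
# The gas-station appraisal as a calculation: a valuation ratio from
# comparable sales, applied to the subject station's annual volume.
# Comparable figures are hypothetical.

comparable_sales = [          # (sale price, gallons sold per year)
    (390_000, 1_500_000),
    (312_000, 1_200_000),
    (520_000, 2_000_000),
]

# Ratio of sale value to annual gallons, averaged across comparables.
ratio = sum(p / g for p, g in comparable_sales) / len(comparable_sales)

value = ratio * 1_600_000     # Operator sells 1.6 million gallons per year
print(round(ratio, 2), round(value))   # 0.26 416000
```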
B. Are the Parties Disputing an Adjustment of an Appraisal
for Partial Loss?
In most cases where the appraisal approach is appropriate, the plaintiff has not suffered a total loss of property or business, but rather some impairment of its value.
In that case, the damages expert will adapt the appraisal to measure the loss from
the impairment. Here, again, the use of valuation ratios is common.
Example: Oil Company breaches an earlier agreement with Gas Station Operator
and opens another station near Operator’s station. Operator’s gasoline sales fall by 700,000 gallons per year. Oil Company’s damages
study applies the ratio of $0.26 per gallon of sales per year to the loss:
$0.26 × 700,000 = $182,000. Operator’s damages study uses a regression analysis of the valuation of recently sold businesses and finds that
each gallon of added sales raises value by $0.47 and so calculates damages as $0.47 × 700,000 = $329,000.
Comment: Because of fixed costs, the average valuation of gasoline sales will be
less than the marginal valuation, and the latter is the conceptually correct approach.
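The comment's point about average versus marginal valuation can be illustrated with a least-squares fit. The hypothetical comparables below are generated from a linear relationship with a fixed-cost intercept, so the regression slope recovers the $0.47 marginal value per gallon while the average ratio falls below it.

```python
# Why the marginal value per gallon exceeds the average ratio when there
# are fixed costs: hypothetical comparables obey
#   value = 0.47 * gallons - 340,000   (capitalized fixed cost).
# A least-squares fit recovers the slope (marginal value); the average
# ratio value/gallons is pulled down by the fixed-cost intercept.

comparables = [(1_000_000, 130_000), (1_500_000, 365_000),
               (2_000_000, 600_000)]   # (gallons per year, sale price)

n = len(comparables)
mean_g = sum(g for g, _ in comparables) / n
mean_v = sum(v for _, v in comparables) / n
slope = (sum((g - mean_g) * (v - mean_v) for g, v in comparables)
         / sum((g - mean_g) ** 2 for g, _ in comparables))

avg_ratio = sum(v / g for g, v in comparables) / n

lost_gallons = 700_000
print(round(slope, 2))               # 0.47 (marginal value per gallon)
print(round(slope * lost_gallons))   # 329000 (Operator's damages figure)
print(avg_ratio < slope)             # True: average is below marginal
```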
C. Is One of the Parties Using the Assets and Liabilities
Approach?
The assets and liabilities approach starts with the accounting balance sheet of a
company and adjusts assets and liabilities to approximate current market values. It
then nets the assets and the liabilities to compute the net asset value of the firm.
The asset values include the value of intangibles. Because these values are hard to
determine, the assets and liabilities method is not generally suited to the valuation
of businesses with substantial intangible assets.
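The method described above is simple arithmetic once the balance-sheet entries have been restated at market value. A minimal sketch, with hypothetical figures:

```python
# The assets-and-liabilities approach: restate balance-sheet entries at
# estimated market value, then net them.  All figures are hypothetical;
# note that intangibles are omitted here, which is exactly why the method
# suits businesses without substantial intangible assets.

assets_at_market = {
    "cash": 120_000,
    "inventory": 340_000,   # marked down from book value to market
    "equipment": 510_000,   # appreciated relative to book value
}
liabilities_at_market = {
    "accounts_payable": 150_000,
    "bank_loan": 320_000,
}

net_asset_value = (sum(assets_at_market.values())
                   - sum(liabilities_at_market.values()))
print(net_asset_value)   # 500000
```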
D. Are the Parties Disputing an Adjustment for Market
Frictions?
Purely competitive markets have what economists term a “frictionless” market
structure. These markets have (1) a large number of buyers and sellers of a single,
homogeneous product; (2) fully informed participants; and (3) the feature that
sellers can easily enter or exit from the market. A “friction” is anything that prevents the market from being purely competitive. The markets for businesses and
properties have frictions that may make transaction values depart from the usual
concept of the price negotiated by a willing seller and a willing buyer. In the case
of a forced sale and thus a less willing seller, the transaction price may understate
the value. Adverse selection, which occurs when one party knows more about a
property or business than the other, may cause severe understatements in some
markets.30 Because equipment with hidden defects is more likely to be offered for
sale than equipment in unusually good condition, and sales prices are lower as a
result, owners of the good equipment tend not to offer it on the market.
30. See, e.g., George A. Akerlof, The Market for “Lemons”: Quality Uncertainty and the Market
Mechanism, 84 Q. J. Econ. 488 (1970).
Example: Negligence of Tire Maker causes the total loss of a 747 aircraft. Tire
Maker’s damages analysis uses the prices of 747s of similar age in the
used airplane market to set a value of $23 million on the ruined airplane. However, Airline offers the testimony of an economist expert
who explains that only a small fraction of 747s are ever put up for sale
in the used airplane market. Rather, airlines tend to sell only defective
planes while continuing to fly nondefective 747s. He then adjusts for
the adverse selection of inferior airplanes in the used market and places
a value of $42 million on the plane.
Comment: Although merited in principle, the airline’s adjustment is challenging
to carry out and is likely to be the subject of expert disagreement.
A major source of friction in property and business markets is the capital gains
tax. Because capital gains are taxed only at realization, after-tax sales prices will
generally understate the value of a business or property to the existing owners
if they have no plans to sell except in the more distant future. The forced sale
implicit in any act that harms a business or property imposes a loss on the owners
in excess of their after-tax loss. We discuss this topic later in Section VI.E on taxes.
E. Is One of the Parties Relying on Hypothetical Property in
Its Damages Analysis?
Plaintiffs may argue that undeveloped land or a business opportunity yet to be
pursued was taken from them by the defendant’s harmful act and that the value
lost should include the value of the still hypothetical improvements. We consider
this topic in more detail in Section VIII.A in its most important form, damages
for harm to a startup business.
Example: Property Owner sues County for the value of undeveloped property
condemned for a rapid transit extension. Owner’s damages claim
is $18 million, the appraisal value of a hypothetical condominium
development on the property less the anticipated cost of building
the development. The County’s expert, an appraiser, argues that the
market value of the property is $2 million, based on comparable
undeveloped land nearby.
Comment: In principle, the current market value of undeveloped land and the
market value of the same land with proper development, less the cost
of that development, should be the same, because buyers would bid
based on the development potential of the undeveloped land. Property
Owner probably understated the development costs. But the value of the
nearby property may understate the value of the condemned property: it
may be for sale precisely because it lacks certain features, such as a view,
that would make it desirable to develop. On the other hand, the Property Owner’s
valuation does not reflect the probability that the Property Owner may
not succeed in building the condominium.
F. What Complications Arise When Anticipation of Damages
Affects Market Values?
For publicly traded companies, the harmful act may depress the market value of
the company itself. For example, suppose that a manufacturer of wood windows
treats its windows with a preservative that is defective, causing the windows to
rot. The window manufacturer sues the manufacturer of the preservative for damages from lost sales in addition to the cost of replacing the defective windows.
The window manufacturer’s expert may be tempted to use the decline in market value following
the harm as a measure of damages. In cases when the news of the harm reaches the
public discretely, say in a single day, the technique of an event study, commonly
used in securities fraud cases, can be used to isolate the special component of the
decline in market value.
The problem with using the plaintiff’s market value is that the market will
anticipate recovery in the form of damages, and this will offset at least some of the
decline in market value. In the extreme, if stock traders expect that the plaintiff
will receive exactly full compensation, the plaintiff’s market value will not change
at all when knowledge of the wrongdoing—including the fact that a damages
award will be made—hits the stock market. Thus, the use of the observed decline
in the value of the plaintiff company at the time of the injury understates the
actual amount of harm by an unknown amount, so the expert should consider
using other valuation techniques. Note that this understatement arises when the
publicly traded company itself stands to recover damages.
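The event-study technique mentioned above isolates the "special component" of a one-day decline by comparing the firm's return on the announcement day with the return predicted by its normal relationship to the market. The sketch below fits a simple market model by least squares; the daily returns, the event-day figures, and the market capitalization are all hypothetical, and real studies use much longer estimation windows and statistical significance tests.

```python
# Minimal market-model event study: regress the firm's daily return on
# the market return over a clean estimation window, then measure the
# abnormal return on the day the news reached the public.  All returns
# are hypothetical.

firm = [0.010, -0.004, 0.007, 0.002, -0.006, 0.012, 0.001, -0.003]
mkt  = [0.008, -0.002, 0.005, 0.001, -0.004, 0.009, 0.002, -0.002]

n = len(mkt)
mean_m = sum(mkt) / n
mean_f = sum(firm) / n
beta = (sum((m - mean_m) * (f - mean_f) for m, f in zip(mkt, firm))
        / sum((m - mean_m) ** 2 for m in mkt))
alpha = mean_f - beta * mean_m

event_firm_return = -0.085   # firm's return on the announcement day
event_mkt_return = -0.010    # market's return the same day

expected = alpha + beta * event_mkt_return
abnormal_return = event_firm_return - expected   # the "special component"

market_cap = 400_000_000
decline_attributable = -abnormal_return * market_cap
```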
Changes in market values have a different role in situations, such as fraud on
the market, where the public company is the defendant. If the release of previously
fraudulently concealed adverse information causes both a reduction in the value
of the company because of the information and a further reduction because the
market anticipates that the company will pay damages to investors who overpaid
for their shares during the period when the information was concealed, damages
may be overstated by an unknown amount.
VI. Quantifying Damages as the Sum of
Discounted Lost Cash Flows
The fundamental principle of economics governing the second approach to valuation is that the value of a business is the present value of its expected future cash
flows.31 In forming a present value, the expert multiplies each future year’s cash
flows by the value today for a dollar received in that future year. This price is the
discount factor. Thus, the discount factor reflects the decreased value for a dollar
received in the future compared to the value of a dollar received today.
In broad summary, the damages expert using the discounted cash flow
approach projects historical and future but-for revenue and cost, actual historical
revenue and cost, and projected actual revenue and cost. The difference between
revenue and cost is cash flow, and the difference between but-for and actual cash
flow is the loss of cash flow attributable to the harmful act. The expert then applies
discount rates to each year’s lost cash flow to determine damages.
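The recipe in the paragraph above can be written as a short function: lost cash flow each year is but-for cash flow minus actual cash flow, and each year's loss is multiplied by its discount factor. The yearly flows and the 8% discount rate below are hypothetical.

```python
# Damages as the present value of lost cash flows (hypothetical flows).

def damages_dcf(but_for_flows, actual_flows, rate):
    """Present value of lost cash flows at an annual discount rate.

    Flows are per-year amounts beginning one year after the reference date.
    """
    total = 0.0
    for year, (but_for, actual) in enumerate(
            zip(but_for_flows, actual_flows), start=1):
        discount_factor = 1 / (1 + rate) ** year  # value today of $1 in `year`
        total += (but_for - actual) * discount_factor
    return total

but_for = [100_000, 110_000, 120_000]   # revenue minus cost, no harmful act
actual  = [ 40_000,  50_000,  60_000]   # revenue minus cost, as it happened
print(round(damages_dcf(but_for, actual, rate=0.08)))   # 154626
```

With a zero rate the same flows give $180,000, so the discounting step alone accounts for the difference.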
A. Is There Disagreement About But-For Revenues in the Past?
A common source of disagreement about the likely profitability of a business is
the absence of a track record of earlier profitability. Whenever the plaintiff is a
startup business, the issue will arise of reconstructing the value of a business with
no historical benchmark.
Example: Plaintiff Xterm is a failed startup. Defendant VenFund has breached a
venture capital financing agreement. Xterm’s damages study projects
the profits it would have made under its business plan. VenFund’s damages estimate, which is much lower, is based on the value of the startup
as revealed by sales of Xterm equity made just before the breach.
Comment: Both sides confront factual issues to validate their damages estimates.
Xterm needs to show that its business plan was still a reasonable forecast as of the time of the breach. VenFund needs to show that the sale
of equity places a reasonable value on the firm, that is, that the equity
sale was at arm’s length and was not subject to discounts. This dispute
can also be characterized as whether the plaintiff is entitled to expectation damages or reliance damages. The jurisdiction may limit damages
for firms with no track record.
B. Is There Disagreement About the Costs That the Plaintiff
Would Have Incurred but for the Harmful Event?
Where the injury takes the form of lost sales volume, the plaintiff usually has
avoided the cost of production for the lost sales. Calculation of these avoided costs
is a common area of disagreement about damages. Conceptually, avoided cost is the
difference between the cost that would have been incurred at the higher volume of
31. This discussion follows that in Shannon Pratt, Business Valuation Body of Knowledge 85–95
(2d ed. 2003).
sales but for the harmful event and the cost actually incurred at the lower volume of
sales achieved. In the format of Figure 1, the avoided-cost calculation is done each
year. The following are some of the issues that arise in calculating avoided cost:
• For a firm operating at capacity, expansion of sales is cheaper in the long
run than in the short run, whereas, if there is unused capacity, expansion
may be cheaper in the short run.
• The costs that can be avoided if sales fall abruptly are smaller in the short
run than in the long run.
• Avoided costs may include marketing, selling, and administrative costs as
well as the cost of manufacturing.
• Some costs are fixed, at least in the short run, and are not avoided as a
result of the reduced volume of sales caused by the harmful act.
Sometimes putting costs into just two categories is useful: those that vary
with sales (variable costs) and those that do not vary with sales (fixed costs). This
breakdown is approximate, however, and does not do justice to important aspects
of avoided costs. In particular, costs that are fixed in the short run may be variable
in the longer run. Disputes frequently arise over whether particular costs are fixed
or variable. One side may argue that most costs are fixed and were not avoided by
losing sales volume, whereas the other side may argue that many costs are variable.
Certain accounting concepts relate to the calculation of avoided cost. Profit-and-loss statements frequently report the “cost of goods sold.”32 Costs in this
category are frequently, but not uniformly, avoided when sales volume is lower.
But costs in other categories, called “operating costs” or “overhead costs,” also
may be avoided, especially in the long run. One approach to the measurement
of avoided cost is based on an examination of all of a firm’s cost categories. The
expert determines how much of each category of cost was avoided.
An alternative approach uses regression analysis or some other statistical
method to determine how costs vary with sales as a general matter within the firm
or across similar firms. The results of such an analysis can be used to measure the
costs avoided by the decline in sales volume caused by the harmful act.
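The statistical approach described above can be sketched in a few lines of code. The following Python example is illustrative only — the sales and cost figures are hypothetical — and fits total cost to sales volume by ordinary least squares, so that the slope estimates variable cost per dollar of sales and the intercept approximates fixed cost:

```python
# Illustrative sketch only: estimate avoided (variable) cost by ordinary
# least squares. All sales and cost figures here are hypothetical.

def ols_slope_intercept(xs, ys):
    """Least-squares slope and intercept of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Hypothetical annual sales and total cost, in thousands of dollars.
sales = [1000, 1200, 1500, 1100, 1300]
cost = [900, 1020, 1200, 960, 1080]

# Slope = variable cost per dollar of sales; intercept = approximate fixed cost.
variable_rate, fixed_cost = ols_slope_intercept(sales, cost)

# Cost avoided by a hypothetical $200,000 decline in sales volume.
avoided_cost = variable_rate * 200
```

A damages expert would, of course, use the firm’s actual accounting data and test the regression specification; this sketch shows only the mechanics of the calculation.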
C. Is There Disagreement About the Plaintiff’s Actual Revenue
After the Harmful Event?
When the plaintiff has mitigated the adverse effects of the harmful act by making
an investment that has not yet paid off at the time of trial, disagreement may arise
about the value that the plaintiff has actually achieved.
32. See, e.g., United States v. Arnous, 122 F.3d 321, 323 (6th Cir. 1997) (finding that the district court erred when it relied on government’s theory of loss because the theory ignored the cost
of goods sold).
Reference Guide on Estimation of Economic Damages
Example: Manufacturer breaches agreement with Distributor. Distributor starts
a new business that shows no accounting profit as of the time of trial.
Distributor’s damages study makes no deduction for actual earnings
during the period from breach to trial. Manufacturer’s damages study
places a value on the new business as of the time of trial and deducts
that value from damages.
Comment: Some offset for economic value created by Distributor’s mitigation
efforts may be appropriate. Note that if Distributor made a good-faith
effort to create a new business, but was unsuccessful because of adverse
events outside its control, the issue of the treatment of unexpected
subsequent events will arise.33
D. What Is the Role of Inflation?
1. Do the parties use constant dollars for future losses, or are such losses
stated in future dollars whose values will be diminished by inflation?
Persistent inflation in the U.S. economy complicates projections of future losses.
Although inflation rates in the United States since 1987 have been only in the
range of 1% to 3% per year, the cumulative effect of inflation on future dollar quantities is pronounced. At 3% annual inflation, a dollar today buys what
$4.38 will buy 50 years from now. Under inflation, the unit of measurement of
economic values becomes smaller each year, and this shrinkage must be considered
if future losses are measured in the smaller dollars of the future. Calculations of
this type are often termed “escalation.” Dollar losses grow in the future because
of the use of the shrinking unit of measurement. For example, an expert might
project that revenues for a firm will rise at approximately 5% per year for the next
10 years—3% because of general inflation and 2% more because of the growth
of the firm.34
Alternatively, the expert may project future losses in constant dollars without
explicitly accounting for escalation for future inflation.35 The use of constant
dollars avoids the problems of dealing with a shrinking unit of measurement. In
the example just given, the expert might project that revenues will rise at 2% per
year in constant dollars. Constant dollars must be stated with respect to a base
year. Thus, a calculation in constant 2009 dollars means that the unit for future
measurement is the purchasing power of the dollar in 2009.
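The relation between escalated and constant-dollar projections can be verified numerically. The Python sketch below assumes a hypothetical $1,000,000 base-year revenue and uses the 3% inflation and 2% real growth figures from the example; deflating the escalated projection by cumulative inflation recovers the constant-dollar projection:

```python
# Hypothetical base-year revenue; rates taken from the example in the text.
base_revenue = 1_000_000
inflation = 0.03      # general inflation
real_growth = 0.02    # real growth of the firm
years = 10

# Escalated (future-dollar) projection: grows at roughly 5% per year.
escalated = [base_revenue * ((1 + inflation) * (1 + real_growth)) ** t
             for t in range(years + 1)]

# Constant-dollar projection: grows at 2% per year in base-year dollars.
constant = [base_revenue * (1 + real_growth) ** t for t in range(years + 1)]

# Deflating the escalated projection by cumulative inflation recovers
# the constant-dollar projection.
deflated = [amount / (1 + inflation) ** t for t, amount in enumerate(escalated)]
```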
33. See Section VIII.F.
34. See Section VI.D.2.
35. See, e.g., Willamette Indus., Inc. v. Commissioner, 64 T.C.M. (CCH) 202 (1992) (holding
expert witness erred in failing to take inflation escalation into account).
2. Are the parties using a discount rate properly matched to the projection?
For future losses, a damages study calculates the amount of compensation needed
at the time of trial to replace expected future lost income. The result is discounted
future losses;36 it is also sometimes referred to as the present value of future
losses.37 Discounting is conceptually separate from the adjustment for inflation
considered in the preceding section. Discounting is typically carried out in the
format shown in Table 1.
Table 1. Calculation of Discounted Loss at 5% Interest

Years in Future     Loss     Discount Factor     Discounted Loss^a
0                   $100     1.000               $100
1                    125     0.952                119
2                    130     0.907                118
Total                                            $337

^a Discounted Loss = Loss × Discount Factor.
“Loss” is the estimated future loss, in either escalated or constant-dollar form.
“Discount factor” is a factor that calculates the number of dollars needed at the
time of trial to compensate for a lost dollar in the future year. The discount factor is the ratio of the value today of a cash flow received at a future date to its
value at that future date. It is calculated from the discount rate, which is the interest rate that
values a cash flow at a future date. If the current 1-year interest rate is 5%, then
the discount rate is 1.05—the value of $1 will be $1.05 a year from now. The
discount factor will therefore be $1/$1.05. The 2-year discount rate is the square
of 1.05, and the discount factor will be 1/(1.05 × 1.05). Thus, the discount factor
is computed by compounding the discount rate forward from the base year to the
future year and then taking the reciprocal.
For example, in Table 1, the interest rate is 5%. As discussed, the discount
factor for the next year is calculated as the reciprocal of 1.05, and the discount factor for 2 years in the future is calculated as the reciprocal of 1.05 squared. Future
discounts would be obtained by multiplying by 1.05 a suitably larger number of
times and then taking the reciprocal. The discounted loss is the loss multiplied
by the discount factor for that year. The number of dollars at time of trial that
compensates for the loss is the sum of the discounted losses, $337 in this example.
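The Table 1 format can be reproduced in a few lines of code. The following Python sketch uses the same 5% interest rate and loss figures:

```python
# Losses 0, 1, and 2 years in the future, discounted at 5% (Table 1).
rate = 0.05
losses = [100, 125, 130]

# Discount factor = reciprocal of the discount rate compounded t years forward.
discount_factors = [1 / (1 + rate) ** t for t in range(len(losses))]
discounted = [loss * f for loss, f in zip(losses, discount_factors)]
total = sum(discounted)   # the award needed at time of trial
```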
36. See generally Michael A. Rosenhouse, Annotation, Effect of Anticipated Inflation on Damages for
Future Losses—Modern Cases, 21 A.L.R. 4th 21 (1981) (discussing discounted future losses extensively).
37. See generally George A. Schieren, Is There an Advantage in Using Time-Series to Forecast Lost
Earnings? 4 J. Legal Econ. 43 (1994) (discussing effects of different forecasting methods on present
discounted value of future losses). See, e.g., Wingad v. John Deere & Co., 523 N.W.2d 274, 277–79
(Wis. Ct. App. 1994) (calculating present discounted value of future losses).
To discount a future loss projected in escalated terms, one should use an
ordinary interest rate. For example, in Table 1, if the losses of $125 and $130 are
in dollars of those years, and not in constant dollars of the initial year, then the
use of a 5% discount rate is appropriate if 5% represents an accurate measure of
the current interest rate, also known as the time value of money. The ordinary
interest rate is often called the nominal interest rate to distinguish it from the
real interest rate.
To discount a future loss projected in constant dollars, one should use a real
interest rate as the discount rate. A real interest rate is an ordinary interest rate less
an assumed rate of future inflation.38 In Table 1, the use of a 5% discount rate for
discounting constant-dollar losses would be appropriate if the ordinary interest rate
was 8% and the rate of inflation was 3%.39 Then the real interest rate would be
8% minus 3%, or 5%. The deduction of the inflation rate from the discount rate
is the counterpart of the omission of escalation for inflation from the projection
of future losses.
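The arithmetic in this paragraph, and the exact formula given in footnote 39, can be checked with a short Python sketch using the 8% nominal rate and 3% inflation rate from the example:

```python
# Nominal 8% interest and 3% inflation, as in the text's example.
nominal = 0.08
inflation = 0.03

# Simple subtraction used in the text: 8% - 3% = 5%.
real_approx = nominal - inflation

# Exact formula from footnote 39: (1 + real) = (1 + nominal) / (1 + inflation),
# which gives 4.85% rather than 5%.
real_exact = (1 + nominal) / (1 + inflation) - 1
```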
3. Is one of the parties assuming that discounting and earnings growth
offset each other?
An expert might make the assumption that future growth of losses will occur at
the same rate as the discount rate. Table 2 illustrates the standard format for this
method of calculating discounted loss.
Table 2. Calculation of Discounted Loss When Growth and Discounting Offset Each Other

Years in Future     Loss     Discount Factor     Discounted Loss^a
0                   $100     1.000               $100
1                    105     0.952                100
2                    110     0.907                100
Total                                            $300

^a Discounted Loss = Loss × Discount Factor.
When growth and discounting exactly offset each other, the present discounted value is the number of years of lost future earnings multiplied by the
38. Some experts rely on the real interest rate inferred from the price of TIPS (Treasury Inflation Protected Securities).
39. Technically, the formula is: (1 + real rate of interest) = (1 + ordinary rate of interest)/
(1 + inflation). However, the difference is de minimis unless the ordinary rate of interest is high. Thus,
using this formula, the real interest rate is 4.85%.
current amount of lost earnings.40 In Table 2, the loss of $300 is exactly three
times the base year’s loss of $100. Thus the discounted value of future losses can
be calculated by a shortcut in this special case. The explicit projection of future
losses and the discounting back to the time of trial are unnecessary. However,
the parties may dispute whether the assumption that growth and discounting are
exactly offsetting is realistic in view of projected rates of growth of losses and
market interest rates at the time of trial.
In Jones & Laughlin Steel Corp. v. Pfeifer,41 the Supreme Court considered the
issue of escalated dollars with nominal discounting against constant dollars with
real discounting. It found both acceptable, although the Court seemed to express
a preference for the second format.
E. Are Losses Measured Before or After the Plaintiff’s Income
Taxes?
A damages award compensates the plaintiff for lost economic value. In principle,
the calculation of compensation should measure the plaintiff’s loss after taxes and
then calculate the magnitude of the pretax award needed to compensate the plaintiff fully, once taxation of the award is considered. In practice, the tax rates applied
to the original loss and to the compensation are frequently the same. When the
rates are the same, the two tax adjustments are a wash. In that case, the appropriate
pretax compensation is simply the pretax loss, and the damages calculation may be
simplified by the omission of tax considerations.42
In some damages analyses, explicit consideration of taxes is essential, and disagreements between the parties may arise about these tax issues. If the plaintiff’s
lost income would have been taxed as a capital gain (at a preferential rate), but
the damages award will be taxed as ordinary income, the plaintiff can be expected
to include an explicit calculation of the extra compensation needed to make up
for the loss of the tax advantage. Sometimes tax considerations are paramount in
damages calculations.43
40. Certain state courts have, in the past, required that the offset rule be used so as to avoid
speculation about future earnings growth. In Beaulieu v. Elliott, 434 P.2d 665, 671–72 (Alaska 1967),
the court ruled that discounting was exactly offset by wage growth. In Kaczkowski v. Bolubasz, 421 A.2d
1027, 1036–38 (Pa. 1980), the Pennsylvania Supreme Court ruled that no evidence on price inflation
was to be introduced and deemed that inflation was exactly offset by discounting.
41. 462 U.S. 523 (1983).
42. There is a separate issue about the effect of taxes on the interest rate for prejudgment interest
and discounting. See discussion infra Sections VI.G, VI.H.
43 See generally John H. Derrick, Annotation, Damages for Breach of Contract as Affected by Income
Tax Considerations, 50 A.L.R. 4th 452 (1987) (discussing a variety of state and federal cases in which
courts ruled on the propriety of tax considerations in damage calculations; courts have often been
reluctant to award difference in taxes as damages because it is calling for too much speculation).
Example: Trustee wrongfully sells Beneficiary’s property at full market value.
Beneficiary would have owned the property until death and deferred
the capital gains tax.
Comment: Damages are the difference between the actual capital gains tax and
the present value of the future capital gains tax that would have been
paid but for the wrongful sale, even though the property sold at its full
value.
In some cases, the law requires different tax treatment of loss and compensatory awards. Again, the tax adjustments do not offset each other, and consideration
of taxes may be a source of dispute.
Example: Driver injures Victim in a truck accident. A state law provides that
awards for personal injury are not taxable, even though the income
lost as a result of the injury would have been taxable. Victim calculates
damages as lost pretax earnings, but Driver calculates damages as lost
earnings after tax.44 Driver argues that the nontaxable award would
exceed actual economic loss if it were not adjusted for the taxation of
the lost income.
Comment: Under the principle that damages are to restore the plaintiff to the
economic equivalent of the plaintiff’s position absent the harmful act,
it may be recognized that the income to be replaced by the award
would have been taxed. However, the law in a particular jurisdiction
may not allow a jury instruction on the taxability of an award.45
Example: Worker is wrongfully deprived of tax-free fringe benefits by Employer.
Under applicable law, the award is taxable. Worker’s damages estimate
includes a factor so that the amount of the award, after tax, is sufficient
to replace the lost tax-free value.
Comment: Again, to achieve the goal of restoring plaintiff to a position economically equivalent absent the harmful act, an adjustment of this type is
44. See generally Brian C. Brush & Charles H. Breedon, A Taxonomy for the Treatment of Taxes in
Cases Involving Lost Earnings, 6 J. Legal Econ. 1 (1996) (discussing four general approaches for treating
tax consequences in cases involving lost future earnings or earning capacity based on the economic
objective and the tax treatment of the lump sum award). See, e.g., Myers v. Griffin-Alexander Drilling
Co., 910 F.2d 1252 (5th Cir. 1990) (holding loss of past earnings between the time of the accident
and the trial could not be based on pretax earnings).
45. See generally John E. Theuman, Annotation, Propriety of Taking Income Tax into Consideration
in Fixing Damages in Personal Injury or Death Action, 16 A.L.R. 4th 589 (1981) (discussing a variety of
state and federal cases in which the propriety of jury instructions regarding tax consequences is at issue).
See, e.g., Bussell v. DeWalt Prods. Corp., 519 A.2d 1379 (N.J. 1987) (holding that trial court hearing
a personal injury case must instruct jury, upon request, that personal injury damages are not subject to
state and federal income taxes); Gorham v. Farmington Motor Inn, Inc., 271 A.2d 94 (Conn. 1970)
(holding court did not err in refusing to instruct jury that personal injury damages were tax-free).
appropriate. The adjustment is often called “grossing up” damages.46
To accomplish grossing up, divide the lost tax-free value by one minus
the tax rate. For example, if the loss is $100,000 of tax-free income,
and the income tax rate is 25%, the award should be $100,000 divided
by 0.75, or $133,333.
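The grossing-up arithmetic can be sketched in Python, using the figures from the example above; dividing by one minus the tax rate guarantees that the award net of tax equals the lost tax-free value:

```python
# Grossing up a taxable award to replace lost tax-free value.
lost_tax_free_value = 100_000
tax_rate = 0.25

# Award = loss / (1 - tax rate), so the award net of tax equals the loss.
gross_award = lost_tax_free_value / (1 - tax_rate)
after_tax_value = gross_award * (1 - tax_rate)
```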
F. Is There a Dispute About the Costs of Stock Options?
In some firms, employee stock options are a significant part of total compensation. Stock options are often used by startup businesses because the options do
not require the business to pay out any cash. However, at a future date, the
options may be exercised and the option holder will pay only the price per share
at the time the options are received as opposed to the price per share at the time
the options are exercised. In this way, the firm transfers part of the compensation
costs incurred today to the firm’s shareholders.
The parties may dispute whether the value of options should be included in
the costs avoided by the plaintiff as a result of lost sales volume. The defendant
might argue that stock options should be included, because their issuance is costly
to the shareholders. The defendant might place a value on newly issued options
and amortize this value over the period from issuance to vesting. The plaintiff, in
contrast, might exclude options costs because the options cost the firm no cash
payout, even though they impose costs on the firm’s shareholders.
Example: Firm A pays its sales manager $2000 for every machine sold at $100,000
as well as options to purchase 1600 shares in a year at the existing
price of $10 per share. As a result of B’s disparagement of A, A asserts
that it lost $10,000,000 in sales (100 machines). In its damages analysis, A states that the lost sales represent lost profits of $5,800,000:
$10,000,000 less $4,000,000 in avoided production costs and $200,000
in avoided sales commissions. Defendant B calculates that each stock
option is worth $5 today based on an analysis using accepted financial models to value the options. Thus, B asserts that damages are
$5,000,000: $10,000,000 less $4,000,000 in avoided production costs
and $1,000,000 in sales commissions ($200,000 plus $5 × 100 × 1600).
Comment: The costs of the options will never show up on the profit-and-loss
statements for Firm A, even if exercised. However, Firm A will receive
a lower value for each share it sells either to an investor or through an
IPO to reflect the potential future dilution in its shares outstanding.
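The two damages figures in this example can be reproduced with a short Python sketch; the $5 per-option value is the defendant’s model-derived assumption, not a market quote:

```python
# Quantities from the example; B's $5 option value is model-derived.
machines_lost = 100
lost_revenue = machines_lost * 100_000          # $10,000,000
avoided_production_cost = 4_000_000
cash_commissions = machines_lost * 2_000        # $200,000
option_cost = machines_lost * 1_600 * 5         # B's valuation: $800,000

# A's study ignores option costs; B's study adds them to avoided costs.
damages_A = lost_revenue - avoided_production_cost - cash_commissions
damages_B = damages_A - option_cost
```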
46. See Cecil D. Quillen, Jr., Income, Cash, and Lost Profits Damages Awards in Patent Infringement
Cases, 2 Fed. Circuit B.J. 201, 207 (1992) (discussing the importance of taking tax consequences and
cash flows into account when estimating damages).
G. Is There a Dispute About Prejudgment Interest?47
The law may specify how to calculate interest for losses prior to a verdict on
liability, generally termed “prejudgment interest.” The law may exclude prejudgment interest, specify a statutory rate for it, or exclude
compounded interest. Table 3 illustrates these alternatives. With simple uncompounded interest, losses from 5 years before trial earn five times the specified
interest, and so compensation for a $100 loss from 5 years ago is $135 at 7%
interest. With compound interest, the plaintiff earns interest on past interest.
Compensation at 7% interest compounded is about $140 for a loss of $100 five
years before trial. The difference between simple and compound interest becomes
much larger if the time from loss to trial is greater or if the interest rate is higher.
Because interest receipts in practice do earn further interest, economic analysis
generally supports the use of compound interest.
Table 3. Calculation of Prejudgment Interest (in Dollars)

Years Before    Loss Without    Loss with Compound    Loss with Simple
Trial           Interest        Interest at 7%        Uncompounded Interest at 7%
10              100             197                   170
9               100             184                   163
8               100             172                   156
7               100             161                   149
6               100             150                   142
5               100             140                   135
4               100             131                   128
3               100             123                   121
2               100             114                   114
1               100             107                   107
0               100             100                   100
Total           1100            1579                  1485
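The columns of Table 3 can be reproduced with a short Python sketch: a $100 loss in each of the 11 years before trial, brought forward at 7% interest and rounded to whole dollars as in the table:

```python
# A $100 loss in each of the 11 years before trial, brought forward
# at 7% interest (Table 3).
rate = 0.07
years_before_trial = range(10, -1, -1)

# Compound: interest earns further interest. Simple: it does not.
compound = {t: round(100 * (1 + rate) ** t) for t in years_before_trial}
simple = {t: round(100 * (1 + rate * t)) for t in years_before_trial}
```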
47. See generally Michael S. Knoll, A Primer on Prejudgment Interest, 75 Tex. L. Rev. 293 (1996)
(discussing prejudgment interest extensively). See, e.g., Ford v. Rigidply Rafters, Inc., 984 F. Supp.
386, 391–92 (D. Md. 1997) (specifying a method of calculating prejudgment interest in an employment discrimination case to ensure plaintiff is fairly compensated rather than given a windfall);
Acron/Pacific Ltd. v. Coit, No. C-81-4264-VRW, 1997 WL 578673, at *2 (N.D. Cal. Sept. 8,
1997) (reviewing supplemental interest calculations and applying California state law to determine the
appropriate amount of prejudgment interest to be awarded); Prestige Cas. Co. v. Michigan Mut. Ins.
Co., 969 F. Supp. 1029 (E.D. Mich. 1997) (analyzing Michigan state law to determine the appropriate
prejudgment interest award).
Where the law does not prescribe the form of interest for past losses, the
experts will normally apply a reasonable interest rate to bring those losses forward.
The parties may disagree on whether the interest rate should be measured before
or after tax. The before-tax interest rate is the normally quoted rate. To calculate
the corresponding after-tax rate, one subtracts the amount of income tax the
recipient would have to pay on the interest. Thus, the after-tax rate depends on
the tax situation of the plaintiff. The format for calculation of the after-tax interest
rate is shown in the following example:
1. Interest rate before tax: 9%
2. Tax rate: 30%
3. Tax on interest (line 1 times line 2): 2.7%
4. After-tax interest rate (line 1 less line 3): 6.3%
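The same format, expressed as a brief Python sketch:

```python
# After-tax interest rate: pretax rate less the tax on the interest.
pretax_rate = 0.09
tax_rate = 0.30

tax_on_interest = pretax_rate * tax_rate        # 2.7%
after_tax_rate = pretax_rate - tax_on_interest  # 6.3%
```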
Even where damages are calculated on a pretax basis, economic considerations
suggest that the prejudgment interest rate should be on an after-tax basis: Had a
taxpaying plaintiff actually received the lost earnings in the past and invested the
earnings at the assumed rate, income tax would have been due on the interest. The
plaintiff’s accumulated value would be the amount calculated by compounding
past losses at the after-tax interest rate.
Where there is economic disparity between the parties, there may be a disagreement about whose interest rate should be used—the borrowing rate of the
defendant or the lending rate of the plaintiff, or some other rate. There may also
be disagreements about adjustment for risk.48
Example: Crop Insurance Company disputes payment of insurance to Farmer.
Farmer calculates damages as the payment due plus the large amount of
interest charged by a personal finance company; no bank was willing
to lend to her, given her precarious financial condition. Crop Insurer
calculates damages as a lower payment plus the interest on the late
payment at the normal bank loan rate.
Comment: The law may limit claims for prejudgment interest to a specified interest rate, and a court may hold that this situation falls within the limit.
Economic analysis does support the idea that delays in payments are
more costly to people with higher borrowing rates and that the actual
rate incurred may be considered damages.
48. See generally James M. Patell et al., Accumulating Damages in Litigation: The Roles of Uncertainty and Interest Rates, 11 J. Legal Stud. 341 (1982) (extensive discussion of interest rates in damages
calculations).
H. Is There Disagreement About the Interest Rate Used to
Discount Future Lost Value?
Discount calculations should use a reasonable interest rate drawn from current
data at the time of trial for losses projected to occur after trial. The interest rate
might be obtained from the rates that could be earned in the bond market from
a bond of maturity comparable to the lost stream of receipts. As in the case of
prejudgment interest, there is an issue as to whether the interest rate should be
on a before- or after-tax basis. The parties may also disagree about adjusting the
interest rate for risk. A common approach for determining the interest on lost
business profit is to use the Capital Asset Pricing Model (CAPM)49 to calculate
the risk-adjusted discount rate. The CAPM is the standard method in financial
economics to analyze the relation between risk and discounting. In the CAPM
method, the expert first measures the firm’s “beta”—the ratio of the percent variation in one firm’s value to the percent variation in the value of all businesses. That
is, if the index of value for a representative set of firms50 increases by 10% over a
year and the firm has a beta of 1.5, then its value is expected to increase by 15%
over a year. Then the risk-adjusted discount rate is the risk-free rate from a U.S.
Treasury security plus the beta multiplied by the historical average risk premium
for the stock market.51 The calculation may be presented in the following format:
1. Risk-free interest rate: 4.0%
2. Beta for this firm: 1.2
3. Market equity premium: 6.0%
4. Equity premium for this firm (line 2 times line 3): 7.2%
5. Discount rate for this firm (line 1 plus line 4): 11.2%
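The CAPM format above can be sketched in code; the risk-free rate, beta, and market equity premium are the illustrative figures from the format, not market data:

```python
# CAPM risk-adjusted discount rate with the figures from the format above.
risk_free_rate = 0.04
beta = 1.2
market_equity_premium = 0.06

firm_equity_premium = beta * market_equity_premium     # 7.2%
discount_rate = risk_free_rate + firm_equity_premium   # 11.2%
```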
I. Is One of the Parties Using a Capitalization Factor?
Another approach to discounting a stream of losses uses a market capitalization
factor. A capitalization factor is the ratio of the value of a future stream of income
to the current amount of the stream; for example, if a firm is worth $1 million
and its current earnings are $100,000, its capitalization factor is ten.
The capitalization factor generally is obtained from the market values of
comparable assets or businesses. For example, the expert might locate a comparable business traded in the stock market and compute the capitalization factor as
49. See, e.g., Cede & Co. v. Technicolor, Inc., No. Civ.A.7129, 1990 WL 161084 (Del. Ch.
Oct. 19, 1990) (Mem.) (assessing the propriety of using CAPM to determine the discount rate); Gilbert
v. MPM Enters., Inc., No. 14416, 1997 WL 633298, at *8 (Del. Ch. Oct. 9, 1997) (finding that
petitioner’s expert witnesses’ use of CAPM is appropriate).
50. For example, the S&P 500.
51. Richard A. Brealey et al., Principles of Corporate Finance 213–22 (9th ed. 2008).
the ratio of stock market value to operating income. In addition to capitalization
factors derived from markets, experts sometimes use rule-of-thumb capitalization
factors. For example, the value of a dental practice might be taken as 2 years’
gross revenue (the capitalization factor for revenue is 2). Often the parties dispute
whether there is reliable evidence that the capitalization factor accurately measures
value for the specific asset or business.
Once the capitalization factor is determined, the calculation of the discounted
value of the loss is straightforward: It is the current annual loss in operating profit
multiplied by the capitalization factor. A capitalization factor approach to valuing
future losses may be formatted in the following way:
1. Ratio of market value to current annual earnings in comparable publicly
traded firms: 13
2. Plaintiff’s lost earnings over past year: $200,000
3. Value of future lost earnings (line 1 times line 2): $2,600,000
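The capitalization-factor format above reduces to a single multiplication, sketched here in Python with the format’s illustrative figures:

```python
# Capitalization-factor valuation with the figures from the format above.
capitalization_factor = 13       # market value / current annual earnings
lost_annual_earnings = 200_000   # plaintiff's lost earnings over past year

value_of_future_lost_earnings = capitalization_factor * lost_annual_earnings
```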
The capitalization factor approach might also be applied to revenue, cash
flow, accounting profit, or other measures. The expert might adjust market values
for any differences between the valuation principles relevant for damages and
those that the market applies. For example, the value in the stock market may
be considered the value placed on a business for a minority interest, whereas the
plaintiff’s loss relates to a controlling interest. In this case, the expert would adjust
the capitalization factor upward to account for the value of the control rights. The
parties may dispute almost every element of the capitalization calculation.
Example: Lender is responsible for failure of Auto Dealer. Plaintiff Auto Dealer’s
damages study projects rapid growth of future profits based on the
current year’s profit but for Lender’s misconduct. The study uses a
discount rate calculated as the after-tax interest rate on Treasury bills.
As a result, the application of the discount rate to the future stream
of earnings implies a capitalization rate of 12 times the current pretax
profit. The resulting estimate of lost value is $10 million. Defendant
Lender’s damages study uses data on the actual sale prices of similar
dealerships in various parts of the country. The data show that the
typical sales price of a dealership is six times its 5-year average annual
pretax profit. Lender’s damages study multiplies the capitalization factor of six by the 5-year average annual pretax profit of Auto Dealer of
$500,000 to estimate lost value as $3 million.
Comment: Part of the difference between the two damages studies comes from
the higher implied capitalization factor used by Auto Dealer. Another
reason for the differences may be that the 5-year average pretax profit
is less than the current-year profit.
VII. Limitations on Damages
The law imposes four important limitations on a plaintiff’s ability to recover
losses as damages: (1) a plaintiff must prove its damages with reasonable certainty,
(2) a plaintiff may not recover damages that are too remote, (3) a plaintiff has a
duty to mitigate its damages, and (4) a liquidated damages clause may limit the
amount of damages by prior agreement.
A. Is the Defendant Arguing That Plaintiff’s Damages
Estimate Is Too Uncertain and Speculative?
In general, damages law holds that a plaintiff may not recover damages beyond an
amount proven with reasonable certainty.52 This rule permits damages estimates
that are not mathematically certain but excludes those that are speculative.53
Failure to prove damages to a reasonable certainty is a common defense. The
determination of what constitutes speculation is increasingly a matter of law to be
determined prior to trial in a Daubert proceeding.54
Courts and commentators have long recognized the difficulties in defining
what constitutes reasonable certainty or speculation in a damages analysis. The
exclusion of damages on grounds of excessive uncertainty regarding the amount of
damages may result in an award of zero damages when it is likely that the plaintiff
suffered significant damages, even though the actual amount is quite uncertain.
There are three contexts in which reasonable certainty or speculation can
arise: (1) where the outcome is uncertain, (2) where it is argued that the expert
has not used the best method or data, or (3) where the damages suffered by a
specific plaintiff are uncertain.
Traditionally, damages are calculated without reference to uncertainty about
outcomes in the but-for scenario. Outcomes are taken as actually occurring if
they are the expected outcome and as not occurring if they are not expected to
occur. This approach may overcompensate some plaintiffs and undercompensate
others. For example, suppose that a drug company was deprived of the opportunity to bring to market a drug that had a 90% chance of receiving Food and Drug
52. See, e.g., Restatement (Second) of Contracts § 352 (“Damages are not recoverable for loss
beyond an amount that the evidence permits to be established with reasonable certainty”).
53. Comment a to Restatement (Second) of Contracts § 352 states, in pertinent part: “Damages
need not be calculable with mathematical accuracy and are often at best approximate.”
54. See, e.g., Cole v. Homier Distributing Co., Inc., 599 F.3d 856, 866 (8th Cir. 2010) (expert
testimony on lost profits excluded under Daubert standard because it “failed to rise above the level of
speculation”). See also Webb v. Braswell, 930 So. 2d 387 (Miss. 2006). In Webb, the plaintiff’s expert
sought to testify as to future damages resulting from unplanted crops, without establishing that the
crops would have been profitable. The court excluded the testimony based on Mississippi’s adoption
of the Daubert standard, stating that “damages for breach of contract must be proven with reasonable
certainty and not based merely on speculation and conjecture.” Id. at 398.
Administration (FDA) approval, at a profit of $2 billion, and a 10% chance of not
receiving FDA approval, with losses of $1 billion. The court may treat 90% as
near enough to certainty and ignore the 10% risk of failure and award damages
of $2 billion.
By contrast, economists quantify losses of uncertain outcomes in terms of
expected values, where the value in each outcome is weighted by its probability.
Under that approach, economic losses in our example should be calculated as the
$2 billion economic loss assuming FDA approval times 90% plus the $1 billion economic loss times 10%, or (0.9) × ($2 billion) + (0.1) × (−$1 billion) = $1.7 billion.
The plaintiff would be overcompensated by $300 million under the approach that
ignored the small probability of failure.
Now suppose the drug has only a 40% chance of FDA approval
with the same economic payoffs. The plaintiff may recover no damages on
grounds of uncertainty and speculation even though the economic loss is
(0.4) × ($2 billion) + (0.6) × (−$1 billion) = $200 million. This issue also arises
with respect to new businesses and is discussed further in Section VIII.A.
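The expected-value arithmetic in the drug example can be reproduced in a few lines (a Python sketch; the probabilities and payoffs are the hypothetical figures from the example above):

```python
def expected_damages(outcomes):
    """Expected-value loss: each (probability, payoff) pair weighted and summed."""
    return sum(p * payoff for p, payoff in outcomes)

# 90% chance of FDA approval ($2 billion profit), 10% chance of failure ($1 billion loss)
likely = expected_damages([(0.9, 2.0e9), (0.1, -1.0e9)])    # ≈ $1.7 billion
# The same payoffs with only a 40% chance of approval
unlikely = expected_damages([(0.4, 2.0e9), (0.6, -1.0e9)])  # ≈ $0.2 billion

print(f"${likely / 1e9:.1f} billion vs. ${unlikely / 1e9:.1f} billion")
```

The second figure shows the stakes of the all-or-nothing treatment: a 40% success probability can swing the award between zero and $200 million depending on how the court handles uncertainty.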
The second context in which speculation arises is when a damages expert fails to conduct his analysis in accordance with the principles discussed in Daubert.55
In general, the expert should provide all available information about the degree of uncertainty in an estimate of damages, particularly when the claim is that inadequate data are available.
Example:
A fire destroyed Broker’s business, including its business records. Defendant Smoke Detector Manufacturer argues that determining the profitability of Broker’s business is impossible without the business records; therefore, damages are speculative and should not be awarded. Broker argues that the information would have been available absent the failure of Smoke Detector Manufacturer’s product, and so Broker should be permitted wide latitude to measure damages from fragmentary records.
This issue also arises in labor cases where the defendant has failed to maintain
the records as required by law.
Example:
A class of workers was denied lunch breaks as required by state law.
The class estimates damages assuming that no lunch breaks were ever
taken. Defendant Can Maker argues that lunch breaks were often
taken and provides testimony by a few employees as proof. The class
argues that it is entitled to damages on the hypothesis that no lunch
breaks were ever taken because Can Maker failed to keep proper
records.
55. See Margaret A. Berger, The Admissibility of Expert Testimony, in this manual.
Disputes about what constitutes a reasonable damages analysis can range from the plaintiff’s assertion that lack of records entitles it to damages under the worst-case scenario to the defendant’s assertion that damages are zero because any calculation is speculative. Furthermore, the latitude afforded plaintiffs sometimes appears to depend on the egregiousness of the defendants’ improper actions. The difficulties in finding a middle ground are greater when the defendant fails to make an affirmative estimate of damages but only attacks the plaintiff’s quantification as speculative. Defendants frequently avoid offering a jury an affirmative damages analysis for fear that the jury will take the affirmative analysis as a concession of fault.
The question of speculative damages also arises in a third context, when damages for a specific plaintiff are not knowable at the time of trial.
Example:
Vaccine Maker’s duck flu vaccine given to children has been proven to harm one-quarter of the children who receive it, but determining which children will be affected is impossible. The harm is the onset of dementia at age 50, with economic losses of $1 million per person. Trial occurs well before any of the vaccinated children has reached this age. The expert for the class measures damages as $250,000 per recipient of the vaccine. The expert for Vaccine Maker argues that damages are zero because it is more likely than not that any given child was not harmed.
Comment: The class might not recover damages even though the average class member’s economic loss is the expected value of $250,000. The case might be resolved at an early stage by denial of class certification because it is not possible to define a class in which all members were proven to be harmed.56 Note that a possible solution would be to create a trust with $250,000 per class member, let it earn market returns, and pay out that amount plus the returns to each class member who develops dementia.
This difficulty in determining the probability of damages may be part of a challenge to class certification and is discussed further in Section XI.
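The trust suggested in the comment can be checked with back-of-the-envelope arithmetic (a Python sketch; the class size, market return, and time horizon are invented for illustration and are not figures from the text):

```python
n_members = 1_000        # hypothetical class size
deposit   = 250_000      # per-member funding, the expert's expected-value figure
harm_rate = 0.25         # one-quarter of recipients develop dementia
loss      = 1_000_000    # economic loss per harmed person, realized at age 50
r, years  = 0.05, 30     # assumed market return and years until the harm manifests

# Trust value when the harm manifests, and the payout available per harmed member
fund = n_members * deposit * (1 + r) ** years
per_harmed = fund / (n_members * harm_rate)

# Each harmed member can be paid the $1 million loss compounded forward,
# because the deposits for the three unharmed members cover each harmed one.
print(f"${per_harmed:,.0f}")  # ≈ $4.3 million = $1 million × 1.05^30
```

Under these assumptions the trust exactly funds full compensation in future-value terms, even though no individual class member could prove harm at trial.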
B. Are the Parties Disputing the Remoteness of Damages?
A second legal limitation on damages is that a plaintiff may not recover damages
that are too remote. In tort cases, this restriction is expressed in terms of proximate cause,57 which often is equivalent to reasonable foreseeability. In contract
56. See, e.g., In Re New Motor Vehicles Canadian Export Antitrust Litig., 522 F.3d 6 (1st Cir.
2008).
57. See William L. Prosser, Palsgraf Revisited, 52 U. Mich. L. Rev. 1 (1953); Osborne M.
Reynolds, Jr., Limits on Negligence Liability: Palsgraf at 50, 32 Okla. L. Rev. 63 (1979).
cases, the limitation is similarly embodied in the idea of foreseeability—a party
may not recover damages that were not reasonably foreseeable by the parties at
the time of the agreement.58 The foreseeability rule has two parts. First, a party
is liable for what are known as direct or general damages—those damages that
arise naturally from the breach itself. Second, a defendant may also be liable for
consequential or special damages—damages apart from those arising naturally
from the breach—if such damages were reasonably foreseeable at the time of the
agreement.59 Although sometimes there are differences between proximate cause
in torts and foreseeability in contracts, the general concept is the same: The law
imposes a limit on damages that are too remote.60
The rule is often at issue in cases in which the injured party’s loss greatly
exceeds the benefit the breaching party received in return.
Example:
Manufacturer hires Repairman to replace a part in a machine in its
plant. Repairman negligently performs the service, causing Manufacturer’s plant to cease production for two weeks. Manufacturer’s damages demand includes a claim for two weeks of lost profits. Repairman
counters that, although he may be liable for the cost of proper repairs,
the foreseeability rule bars a claim for lost profits because such damages
were not a probable consequence reasonably foreseeable at the time of
the agreement.
Similar examples involve cases in which a package delivery firm or courier service
is sued for remote consequential damages resulting from its failure to deliver a
package.61
These limitations on damages are closely related to mitigation and the proper
protection from losses resulting from the failure of agents or counterparties. A
responsible company would not risk large losses from the failings of a repairman
or delivery service. Rather, the company would use redundancy or other standard
measures to limit the chances that such a failure would cause huge losses.
C. Are the Parties Disputing the Plaintiff’s Efforts to Mitigate
Its Losses?
A third limitation on damages is that a party may not recover for losses it could have avoided; this rule is often expressed by stating that the injured party has a duty to mitigate, or lessen, its damages. The economic justification for the mitigation rule
58. See E. Allan Farnsworth, Legal Remedies for Breach of Contract, 70 Colum. L. Rev. 1145, 1199–1210 (1970).
59. Id.
60. See, e.g., Richard A. Posner, Economic Analysis of Law 203–04 (1998).
61. See Hampton by Hampton v. Fed. Express Corp., 917 F. 2d 1124 (8th Cir. 1990).
is that the injured party should not cause economic waste by needlessly increasing
its losses.62
In a dispute about mitigation, the law places the burden of proof on the
defendant to show that the plaintiff failed to take reasonable steps to mitigate.63
The defendant will propose that the proper offset is the earnings the plaintiff should
have achieved, under proper mitigation, rather than actual earnings. In some cases,
the defendant may presume the ability of the plaintiff to mitigate in certain ways
unless the defendant has specific knowledge to the contrary at the time of a breach.
For example, the defendant might presume that the plaintiff could mitigate by
locating another source of supply in the event of a breach of a supply agreement.
Damages are limited to the difference between the contract price and the current
market price in that situation.
For personal injuries, the issue of mitigation often arises because the defendant
believes that the plaintiff’s failure to work after the injury is a withdrawal from
the labor force or retirement rather than the result of the injury.64 For commercial torts, mitigation issues can be more subtle. Where the plaintiff believes that
the harmful act destroyed a company, the defendant may argue that the company
could have been put back together and earned profit, possibly in a different line
of business.65 The defendant will then treat the hypothetical profits as an offset
to damages.66
Alternatively, where the plaintiff continues to operate the business after the
harmful act and includes subsequent losses in damages, the defendant may argue
that the proper mitigation was to shut down after the harmful act.67
Example: Franchisee Soil Tester starts up a business based on Franchiser’s proprietary technology, which Franchiser represents as meeting government
standards. During the startup phase, Franchiser notifies Soil Tester that
the technology has failed. Soil Tester continues to develop the business but sues Franchiser for profits it would have made from successful
technology. Franchiser calculates much lower damages on the theory
that Soil Tester should have mitigated by terminating the startup.
62. See E. Allan Farnsworth, Legal Remedies for Breach of Contract, 70 Colum. L. Rev. 1145, 1183–84 (1970).
63. See, e.g., Broadnax v. City of New Haven, 415 F.3d 265, 268 (2d Cir. 2005) (defendant
employer seeking to avoid a claim of lost wages bears the burden of proving that the plaintiff failed to
mitigate his damages by, among other things, taking reasonable steps to obtain alternate employment).
64. See William T. Paulk, Commentary, Mitigation Through Employment in Personal Injury Cases:
The Application of the “Reasonable” Standard and the Wealth Effects of Remedies, 58 Ala. L. Rev. 647–64
(2007).
65. See Seahorse Marine Supplies v. Puerto Rico Sun Oil, 295 F.3d 68, 84–85 (1st Cir. 2002).
66. Id. at 84.
67. Id. at 85. Also see In re First New England Dental Ctrs., Inc., 291 B.R. 240 (D. Mass. 2003).
Comment: This is primarily a factual dispute about mitigation. If the failure of
the technology was unambiguous, it would appear that Soil Tester was
deliberately trying to increase damages by continuing its business. On
the other hand, Soil Tester might argue that the notification overstated
the defects of the technology and was an attempt by Franchiser to
avoid its obligations under the contract.
Disagreements about mitigation may be hidden within the frameworks of the
plaintiff’s and the defendant’s damages studies.
Example: Defendant Board Maker has breached an agreement to supply circuit
boards. Plaintiff Computer Maker’s damages study is based on the loss
of profits on the computers to be made from the circuit boards. Board
Maker’s damages study is based on the difference between the contract
price for the boards and the market price at the time of the breach.
Comment: There is an implicit disagreement about Computer Maker’s duty to
mitigate by locating alternative sources for the boards not supplied by
the defendant. The Uniform Commercial Code spells out the principles for resolving these legal issues under the contracts it governs.68
D. Are the Parties Disputing Damages That May Exceed the
Cost of Avoidance?
An important consideration in capping damages may be the costs of steps that the
plaintiff could have taken that would have eliminated damages. This argument is
closely related to mitigation, but has an important difference: The defendant may
argue that the plaintiff’s failure to undertake a costly step that would have avoided
losses was reasonable, but that the failure to take that step shows that the plaintiff
knew that damages were much smaller than its later damages claim.
Example: Insurance Company suffered a business interruption because a fire
made its offices unusable for a period of time. Insurance Company’s
damages claim for $10 million includes not only the lost business until
the offices were usable but also damages for permanent loss of business from customers who found other sources during the period the
68. See, e.g., Aircraft Guaranty Corp. v. Strato-Lift, Inc., 991 F. Supp. 735, 738–39 (E.D. Pa.
1998) (Mem.) (finding that according to the Uniform Commercial Code, plaintiff-buyer had a duty
to mitigate if the duty was reasonable in light of all the facts and circumstances, but that failure to
mitigate does not preclude recovery); S.J. Groves & Sons Co. v. Warner Co., 576 F.2d 524 (3d Cir.
1978) (holding that the duty to mitigate is a tool to lessen plaintiff’s recovery and is a question of fact);
Thomas Creek Lumber & Log Co. v. United States, 36 Fed. Cl. 220 (1996) (finding that under federal
common law the U.S. government had a duty to mitigate in breach-of-contract cases).
offices were unusable. Defendant argues that the plaintiff’s failure to relocate to temporary quarters shows that its losses were less than the $350,000 cost of that relocation.
Comment: Defendant’s argument has the unstated premise that Insurance Company could have carried on its business and avoided any of its later
losses by relocating. Insurance Company will likely argue that a decision not to relocate was commercially appropriate because relocation
would not have avoided much of the lost business.
E. Are the Parties Disputing a Liquidated Damages Clause?
In addition to legally imposed limitations on damages, the parties themselves may
have agreed to impose limits on damages should a dispute arise. Such clauses are
common in many types of agreements. Once litigation has begun, the parties
may dispute whether these provisions are legally enforceable. The law may limit
enforcement of liquidated damages provisions to those that bear a reasonable relation to the actual damages. In particular, the defendant may attack the amount
of liquidated damages as an unenforceable penalty. The parties may disagree on
whether the harmful event falls within the class intended by the contract provision.
Changes in economic conditions may be an important source of disagreement
about the reasonableness of a liquidated damages provision. One party may seek
to overturn a liquidated damages provision on the grounds that new conditions
make it unreasonable.
Example: Scrap Iron Supplier breaches supply agreement and pays only the specified liquidated damages. Buyer seeks to set aside the liquidated damages
provision because the price of scrap iron has risen, and the liquidated
damages are a small fraction of actual damages under the expectation
principle.
Comment: There may be conflict between the date for judging the reasonableness
of a liquidated damages provision and the date for measuring expectation damages, as in this example. Generally, the date for evaluating the
reasonableness of liquidated damages is the date the contract is made.
In contrast, the date for measuring expectation damages is the date of
the breach. The conflict may be resolved by the substantive law of the
jurisdiction. Enforcement of the liquidated damages provision in this
example will induce inefficient breach.
VIII. Other Issues Arising in General in
Damages Measurement
A. Damages for a Startup Business
Failure rates for startups are high even without any actionable harm. More than
two-thirds of venture-funded startups return nothing to their founding entrepreneurs, although the expected value of venture outcomes is several million dollars
per entrepreneur.69 Thus, a damages calculation for harm to a startup puts particular stress on the treatment of uncertainty in damages, as we discussed earlier
in Section VII.A. At one time, legal principles barred recovery because damages
were too speculative, but today most courts will allow a new business to recover
damages for lost profits if such damages can be proven with reasonable certainty.70
Whether a court will award damages for an injured startup if the plaintiff’s damages expert testifies that the likelihood was less than 50% that the company would
have become profitable is still unresolved.
1. Is the defendant challenging the fact of economic loss?
Expert testimony on damages does not usually include separate consideration of
the fact of damages, because an opinion that damages are positive amounts to an
opinion about the fact, and a zero-damages opinion amounts to an opinion against
the fact. Damages for startups may be an exception. Analysis by the plaintiff’s
expert may conclude that there is a significant probability that the startup would
not have been profitable and, in that contingency, damages would have turned out
to be zero (or even negative, in the sense that the defendant’s action prevented the
plaintiff from incurring a loss). Thus, a defendant may argue that the plaintiff has
not proven the fact of damages. In cases involving an existing business, the fact of economic loss is often self-evident, but in a case involving a new business, the fact of economic loss may be at issue.
2. Is the defendant challenging the use of the expected value approach?
The expected value approach to uncertain damages weights each outcome by its
probability of occurring. The expected value can be positive, indicating damages,
even if the odds favor a company making a loss. Application of the expected value
approach involves studying the various outcomes of the new business in relation
to risk factors. Risks can be categorized as idiosyncratic (i.e., risks specific to the
69. See Robert E. Hall & Susan E. Woodward, The Burden of the Nondiversifiable Risk of Entrepreneurship, 100 Am. Econ. Rev. 1163 (2010).
70. See Mark A. Allen & Victoria A. Lazear, Valuing Losses in New Businesses, in Litigation
Services Handbook: The Role of the Financial Expert §§ 11.1–.26 (Roman L. Weil et al. eds., 4th
ed. 2007).
venture) or systematic (i.e., risks that affect the venture in the same way as other
businesses). Idiosyncratic risks include whether the venture will succeed, the firm’s
ability to obtain financing, whether a competitor will develop a similar product,
and risks related to the pricing of the product or competitive products, such as the
price of inputs. Examples of systematic risks are financial crisis, inflation, collapse
of the stock market, and recession.
The expert proceeds first by identifying the idiosyncratic risks associated with
the venture and creating an appropriate model. Analyses usually model these types
of risk as different scenarios, each with a specific probability of occurring. The
expert computes the lost profits for each scenario, multiplies the lost profits by the
probability of the event occurring, and then sums the weighted profits to arrive at
expected lost profits. The result is a stream of future lost profits before adjustment
for general economic variables such as inflation, stock market fluctuations, or wage
growth. Because the expert has adjusted for idiosyncratic factors, the remaining
risks of lost profits for a new business are the same as those for a similar, existing business. Then experts usually adjust lost profits for systematic risks using the
CAPM (see Section VI.H) to estimate the cost of capital.
The actual calculation of expected damages is usually straightforward. Damages are the stream of expected lost profits discounted to present value. However, sometimes the alternatives and interactions between the possible outcomes
become so complex that other methods are required. In such cases, experts often
generate hundreds or thousands of possible outcomes using techniques such as
Monte Carlo or bootstrap simulation. These techniques generate random values
for the variables that change with different outcomes.71 Expected damages are
then the average of lost profits across all outcomes.
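When the possible outcomes interact too much for a closed-form scenario sum, the averaging described above can be carried out by simulation. A minimal Monte Carlo sketch (Python; the distributions, means, and margins are invented for illustration, not taken from any case):

```python
import random

random.seed(7)  # fix the seed so the simulated average is reproducible

def simulate_lost_profits():
    """One simulated outcome: lost unit sales times profit margin per unit."""
    lost_units = random.gauss(10_000, 2_000)   # draw from a hypothesized distribution
    margin = random.uniform(40.0, 60.0)        # dollars of profit per lost unit
    return max(lost_units, 0.0) * margin

n_trials = 100_000
expected = sum(simulate_lost_profits() for _ in range(n_trials)) / n_trials
print(f"Expected lost profits ≈ ${expected:,.0f}")  # near 10,000 × $50 = $500,000
```

A bootstrap variant would replace the parametric draws with repeated resampling from observed data, as footnote 71 describes.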
An alternative to calculating new business damages based on lost profits uses
market valuations of the firm. For a publicly traded business, the valuation is
implicit in the stock price—it is the market capitalization of the firm. For a new
venture, the valuation is implicit in financing decisions. Startup firms are often
financed by venture capitalists who invest funds in exchange for ownership in the
venture. The valuation at the time of financing is the amount of financing divided
by the ownership transferred. For example, if venture investors pay $4 million for 10% of the firm, the total value of the firm is $4 million divided by 0.10, or $40 million.
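The implied-valuation arithmetic is a one-line calculation (a Python sketch of the example’s figures):

```python
def implied_valuation(amount_invested, ownership_fraction):
    """Firm value implied by a financing round: dollars in divided by share received."""
    return amount_invested / ownership_fraction

# Venture investors pay $4 million for 10% of the firm
value = implied_valuation(4_000_000, 0.10)
print(f"${value / 1e6:.0f} million")  # $40 million
```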
3. Are the parties disputing the relevance and validity of the data on the
value of a startup?
The expert seeking to establish economic loss on behalf of a new business will
often face a lack of data and therefore will need to use additional resources for the
71. Monte Carlo relies on random draws from the hypothesized distributions for the variables.
Bootstrap takes the observed variables as the population of outcomes and relies on repeated random
draws from this population.
analysis. Although the expert may have access to third-party data on factors such as
overall success rates for comparable ventures, the expert will often need to rely on
the plaintiff and other experts to refine the probability of success. If success reflects
consumer preferences, then the expert may use market research techniques such
as surveys. For example, a survey could evaluate the desirability of a new feature
for a product and the premium that consumers will pay for it. Other sources of
information include studies of success rates on behalf of venture capitalists. Such
studies typically show the success rates for new businesses at different stages of
investment and the actual returns that the venture capitalists have realized.
B. Issues Specific to Damages from Loss of Personal Income
As with all cases, many of the disputes that arise in estimating damages for lost personal income can be resolved by carefully applying the basic damages framework.
Damages are the difference between the but-for and actual worlds, where the
actual world reflects any mitigating factors. Estimating such damages also involves
issues that are unique, such as calculating losses over a person’s lifetime, valuing
fringe benefits, estimating lost income in wrongful death cases, and calculating
damages for economic losses other than lost wages. We discuss these issues below.
1. Calculating losses over a person’s lifetime
In nearly all cases involving lost income, the effects continue past trial and sometimes until the plaintiff’s death. Therefore, quantifying damages for loss of personal
income necessarily involves projecting the plaintiff’s work history and retirement.
Conceptually, the estimate of income for each year, either but-for or actual, is the
expected income multiplied by the probability that the person will be working
for that year. The probability that the person will be working for that year is the product of the probability that the person will survive the year, the probability that the person will be in the labor force, and the probability that the person will be employed, given that the person worked in the prior year.72 We refer to this as
the standard framework for calculating personal losses.
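The standard framework’s per-year expectation is a simple product of conditional probabilities (a Python sketch; the salary and probabilities are illustrative, not drawn from any table):

```python
def expected_income(income, p_survive, p_in_labor_force, p_employed):
    """One year's expected income: pay times the probability of working that year."""
    p_working = p_survive * p_in_labor_force * p_employed
    return income * p_working

# Illustrative year: $80,000 salary, 99% survival, 95% labor force
# participation, 96% employment conditional on participation
year_loss = expected_income(80_000, 0.99, 0.95, 0.96)
print(f"${year_loss:,.2f}")  # $72,230.40
```

Damages for that year are then the difference between this expectation computed in the but-for world and in the actual world.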
In many cases, such as those involving wrongful termination, the projection
that the plaintiff was working is the same for both the but-for world and the actual
world. However, in wrongful death cases and some personal injury cases, these
projections may differ,73 and the expert will need to compute separate projections
for the but-for and actual worlds before taking the difference between the two.
72. Except for wrongful death cases, the probability that the person will be working is usually 1 for both the but-for and actual cases. In wrongful death cases the probability is still usually 1 in the but-for case, but 0 in the actual.
73. This situation may arise in a personal injury case if, as a result of the accident, the injured
person is less likely to be able to work. If so, then the person may have an increased likelihood of
These projections usually rely on data from the Bureau of Labor Statistics
(BLS) and include tables on survival, labor force participation, and employment.
The expert usually needs to manipulate this information in order to generate the
conditional probabilities needed.
To simplify the calculations, sometimes the expert uses the person’s expected
lifespan and retirement age based on his or her age at trial using standard tables
from the BLS. Then the expert need only sum the discounted losses for each
year until the expected age at death. This method, often referred to as the life
expectancy method, will considerably simplify the calculations associated with
determining lost retirement benefits. However, the standard and the life expectancy methods will usually generate the same estimate of losses only if the expert
is assuming that the expected discounted income in each year is the same—
that is, that the expected increase in income is offset by the discount rate (see
Section VI.D.3).
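The equivalence noted above can be illustrated numerically: if discounted income is the same in every year, weighting each year by the probability of working (the standard framework) matches summing whole years up to the expected working lifetime (the life expectancy method). A sketch under those assumptions (Python; the probabilities are invented):

```python
# Probability the plaintiff is alive and working in each future year (illustrative)
p_working = [1.0, 0.8, 0.7, 0.5]    # probabilities sum to 3.0 expected working years
d_income = 60_000                   # discounted income, assumed constant each year

# Standard framework: probability-weight each year's discounted income
standard = sum(p * d_income for p in p_working)

# Life expectancy method: whole years of income up to the expected working lifetime
expected_years = sum(p_working)     # 3.0
life_expectancy = expected_years * d_income

print(standard, life_expectancy)    # both ≈ 180,000
```

If expected income growth did not exactly offset the discount rate, the two sums would diverge, which is the caveat stated in the text.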
BLS data are generally only available by age and sex. Other data specify these
statistics by race, location, or broad occupation categories but only by groups of
ages. More specific tables are commercially available, but the reliability of these
data may be disputed because of questions about the methodology used to generate the tables.
2. Calculation of fringe benefits
Fringe benefits are often a component of lost pay and may include medical insurance and retirement benefits such as social security. Although sick days and vacation are also fringe benefits, they are already included in lost-pay calculations. An
exception occurs if these days are accrued but not taken, where, for example, the
employee lost the cash payout that would have occurred at a normal termination
or lost the benefit of future days off with pay.
a. Medical insurance benefits
In the following discussion, we assume that the plaintiff no longer has the benefit
of the employer-provided insurance as a result of the actions of the defendant.
Such situations typically arise in wrongful termination or wrongful death cases.
Calculating damages for lost medical insurance is straightforward if the plaintiff can purchase insurance under COBRA, or from his current employer, or on
the open market. Then the value of the lost medical insurance is the employee’s
portion of the premium. If insurance coverage available to the plaintiff differs
significantly between the but-for and the actual worlds, then the expert will need
to project the impact of the difference in policy coverage.
leaving the labor force for each age compared with the likelihood prior to the incident. Similarly, the
injured person may be more likely to retire at an earlier age.
If the plaintiff chooses not to purchase insurance even though the option is available, or the plaintiff is unable to purchase insurance,74 then the plaintiff may
argue that the value of insurance is the sum of his actual expenses less the premium
in the but-for world. The defendant will likely respond that the plaintiff assumed
the risk that the incurred medical expenses could exceed the plaintiff’s portion of
the premium, and therefore that the defendant’s responsibility should be limited
to the plaintiff’s portion of the premium foregone. The plaintiff may counter that
his pay was insufficient to afford the insurance.
b. Retirement benefits
For lost retirement benefits, the issues are similar to those involving lost medical
benefits, but the calculations are more complex. There are basically two types of
retirement plans: defined benefit plans and defined contribution plans. Defined
benefit plans are those where the benefits paid out after retirement are guaranteed to be a definite amount. In contrast, defined contribution plans
are those where the employer makes a predefined contribution for the employee
but the benefits paid out depend on the return earned on the money invested.
The expert can calculate losses under both types of retirement plans on the basis of either the amounts the employer paid in or the benefits paid out. If the expert uses
the amount the employer paid for the benefits, which may be a function of the
amount the plaintiff earned, then the calculation is analogous to computing the
loss in plaintiff’s earnings. The disadvantage of this approach is that the amounts
paid in may not adequately predict the benefits paid out, particularly when the
plan is a defined benefit plan. We discuss this topic below in connection with
social security benefits, where the problem is particularly acute.
(1) Defined benefit plan
To determine the present value of the benefits received under a defined benefit
plan, the calculation is simplified if the expert uses the life expectancy method
to calculate the plaintiff’s losses. In this situation, the expert must determine the
number of years that the plaintiff would have worked at the firm before retirement,
his retirement age, his expected lifespan, and his salary at the firm over time. These
factors must be consistent with the expert’s belief about the projected trajectory
of plaintiff’s employment in both the but-for and actual worlds.
However, if the expert instead uses probability tables for each year, then the
calculation is more complex. For each year in which the plaintiff may cease to
be in the labor force for reasons other than death, the expert must determine the
likelihood that the plaintiff would be receiving benefits from the plan because of
74. For example, a preexisting condition may make it impossible to purchase insurance on the
open market or may limit the plaintiff’s coverage to exclude a preexisting condition either permanently
or for a period of time.
472
Copyright © National Academy of Sciences. All rights reserved.
Reference Manual on Scientific Evidence: Third Edition
Reference Guide on Estimation of Economic Damages
disability (if the plan permits) or retirement. Complicating this determination is
that the payout from the plan may depend on the age of retirement. Thus, the
calculation must incorporate the probability for each possible payout. Depending
on the plan, defining possible outcomes can be extremely complex.
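The probability-table calculation can be sketched as a probability-weighted discounting sum. The payouts and probabilities below are invented for illustration; a real analysis would derive them from the plan terms and actuarial tables:

```python
# Probability-table variant: weight each future year's payout by the
# probability that the plaintiff is receiving benefits that year (because
# of retirement or, if the plan permits, disability), then discount.
# All probabilities and payouts below are invented for illustration.

def expected_pv(payout_by_age, prob_receiving_by_age, current_age, r):
    """Expected present value of probability-weighted plan payouts."""
    total = 0.0
    for age, payout in payout_by_age.items():
        p = prob_receiving_by_age.get(age, 0.0)
        total += p * payout / (1 + r) ** (age - current_age)
    return total

payouts = {66: 28_000, 67: 28_000, 68: 28_000}  # hypothetical benefit levels
probs = {66: 0.60, 67: 0.70, 68: 0.75}          # P(receiving benefits at age)
epv = expected_pv(payouts, probs, 45, 0.04)
print(round(epv, 2))
```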
A special and common example of a defined benefit plan is Social Security.75
Determining benefits from Social Security can be forbiddingly complex because
the number of potential outcomes is so large. For example, a person can retire at
almost any age, and disability payments are made if he is unable to work. In addition, calculating the benefit at any age depends on the person’s average salary over
the most recent 35 years. If Social Security benefits are critical to the magnitude
of damages, the expert may choose to simplify the calculation by relying on the
life expectancy method.
(2) Defined contribution plan
For a defined contribution plan, the expert’s task is to project the employer’s
contribution, the number of years that the employee would have worked at the
firm, as well as the employee’s age at retirement. Generally, this determination is
straightforward because it is based on the same factors the expert uses to project
the employee’s salary in the but-for and actual worlds. The present value of the
employer’s contributions will be the expected payouts from the plan.
3. Wrongful death
Traditionally, under common law, the right of recovery ended with a person’s
death and thus damages for wrongful death were not recoverable. Today, states
have remedied this situation through the passage of wrongful death and survival
statutes. A wrongful death action focuses on the impact of the decedent’s death on
persons other than the decedent. In contrast, a survival action continues the action
the decedent could have maintained had he lived and compensates the decedent’s
estate for damages the decedent sustained. Some states have separate wrongful
death and survival statutes; others have hybrid statutes that combine elements of
both actions. Rules for recovery vary widely by state.
Generally, calculation of economic damages for wrongful death depends on
whether the claimant is a relative of the decedent or is the estate. If the claimant
is a relative of the decedent, economic damages are limited to the economic value
that the relative would have received had the decedent lived. If the relative is a
dependent, the recovery may be substantial, whereas if the decedent is a child or
unmarried and childless and the only relatives are parents, the recovery may be
small, because most parents receive little economic value from their children. In
75. Social Security is generally regarded as a defined benefit plan although it has some elements
of a defined contribution plan.
contrast, if the beneficiary is the estate, the recovery may include all of the lost
economic value.
Where the claimant is a relative, damages from lost wages may be reduced to
reflect the decedent’s own consumption spending had he continued living. Such
expenses may be relatively small if the claimant is a spouse with children under
the theory that much of the decedent’s income would have been spent to support
the dependents. If the decedent had no children or if there is another earner in
the family, the offset for the spending of the decedent on himself may be higher.
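A minimal arithmetic sketch of the consumption offset, using invented figures (the $1 million earnings projection and the consumption shares are hypothetical):

```python
# Consumption-offset sketch for a wrongful death claim by a relative: lost
# support equals the decedent's projected earnings less the share the
# decedent would have consumed personally. All figures are hypothetical.

def lost_support(projected_earnings, personal_consumption_share):
    """Earnings net of the decedent's own consumption spending."""
    return projected_earnings * (1 - personal_consumption_share)

# A spouse with children implies a smaller offset than no dependents.
print(round(lost_support(1_000_000, 0.20)))  # smaller consumption offset
print(round(lost_support(1_000_000, 0.40)))  # larger consumption offset
```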
4. Shortened life expectancy
An important issue is whether a plaintiff may recover compensation for shortened
life expectancy caused by an injury. This issue may arise, for example, in medical
malpractice cases in which a doctor fails to diagnose and treat a condition or where
a surgeon fails to remove a medical device used during surgery. Some states allow
such a recovery; others do not.76
A related issue is whether dependents in a wrongful death action may recover
economic damages for support the decedent would have provided had the decedent lived—that is, whether such damages can be recovered over the remainder
of the decedent’s expected lifetime, had he lived. Again, rules regarding such
recovery vary widely by state, but quantifying such damages requires a projection
of the decedent’s life expectancy using the methods discussed above.
5. Damages other than lost income
a. Loss of services
Economic damages may include loss of services in addition to lost wages. For
example, in a case involving the death or disability of a housewife, the husband
may seek recovery for money necessary to hire someone to take care of the children and the home.
b. Medical expenses
Damages for wrongful death or injury may also include past and future medical
expenses. Recoverable medical expenses may include compensation for someone
to provide medical assistance to the plaintiff, such as nursing care, or expenses for
special equipment necessary for living. These expenses are usually calculated by an
expert in this area. The role of the economic damages expert is usually confined
to computing the present value of these expenses.
76. See, e.g., Dillon v. Evanston Hospital, 771 N.E.2d 357 (Ill. 2002); Swain v. Curry, 595 So.
2d 168 (Fla. Dist. Ct. App. 1992).
c. Expenses not incurred
If the plaintiff is not employed, she may not be incurring certain expenses that she
otherwise would have incurred had the wrongful termination or personal injury
never taken place. Applying the but-for world analysis, the defendant may argue
that these expenses should be an offset in the calculation of plaintiff’s economic
damages.77 Examples of such expenses are union dues or transportation costs. See
Section VIII.D for a general discussion. Legal standards vary by jurisdiction.
d. Other damages
Sometimes an expert may be asked to opine on damages for pain and suffering
and other diminution in the quality of life. There has been some development in
using hedonic models to estimate such losses, but the research is still preliminary.
In general, the expert relies upon whoever is best positioned to place an economic
value on the diminution of life suffered by the plaintiff. The expert may then be
asked to calculate the present value of the estimate.
C. Damages with Multiple Challenged Acts: Disaggregation
Plaintiffs sometimes challenge a number of a defendant’s acts and offer an estimate
of the combined effect of those acts. If the court determines that only some of the
challenged acts are illegal, the damages analysis needs to be adjusted to consider
only those acts. This issue seems to arise most often in antitrust cases, but can arise
in any type of case. Ideally the damages testimony would equip the factfinder to
determine damages for any combination of the challenged acts, but that may be
tedious. If there are, say, 10 challenged acts, it would take more than 1000 separate
studies to determine damages for every possible combination of findings about the
unlawfulness of the acts.
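The combinatorial burden is simple arithmetic: each challenged act can independently be found lawful or unlawful, so the number of distinct damages studies equals the number of nonempty combinations of acts.

```python
# Why full disaggregation can be tedious: with n challenged acts, a
# separate damages study is needed for every nonempty combination of acts
# that the factfinder might deem unlawful.

def combinations_of_findings(n_acts):
    """Number of nonempty subsets of n challenged acts."""
    return 2 ** n_acts - 1

print(combinations_of_findings(10))  # 1023, i.e., "more than 1000"
```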
There have been several cases where the jury has found partially for the plaintiff, but the jury lacked assistance from the damages experts on how the damages
should be calculated for the combination of acts the jury found to be unlawful.
Although juries have attempted to resolve the issue, appeals courts have sometimes rejected damages found by juries without supporting expert testimony.78
77. In wrongful death actions, these expenses may be included in the deduction for the amount
the decedent would have spent on himself. See supra Section VIII.B.3.
78. See e.g., Litton Sys., Inc. v. Honeywell, Inc., 1996 U.S. Dist. LEXIS 14662 (C.D. Cal.
July 26, 1996) (granting new trial on damages only “[b]ecause there is no rational basis on which the
jury could have reduced Litton’s ‘lump sum’ damage estimate to account for Litton’s losses attributable
to conduct excluded from the jury’s consideration, . . .”); Image Technical Servs., Inc. v. Eastman
Kodak Co., 125 F.3d 1195, 1224 (9th Cir. 1997), cert. denied, 118 S. Ct. 1560 (1998) (plaintiffs
“must segregate damages attributable to lawful competition from damages attributable to Kodak’s
monopolizing conduct”).
One solution to this problem is to make the determination of the illegal acts
before damages testimony is heard, termed “bifurcation” of liability and damages.
The damages experts can adjust their testimony to consider only the acts found
to be illegal.
In some situations, damages are the sum of separate damages for the various
illegal acts. For example, there may be one injury in New York and another in
Oregon. Then, the damages testimony may consider the acts separately, and disaggregation is not challenging.
When the challenged acts have effects that interact, it is not possible to
consider damages separately and add up damages for each individual act. This is
an area of great confusion. When the harmful acts substitute for each other, the
sum of damages attributable to each separately is less than their combined effect.
As an example, suppose that the defendant has used exclusionary contracts and
anticompetitive acquisitions to ruin the plaintiff’s business. However, the plaintiff’s
business could not have survived if either the contracts or the acquisitions alone
remained in place. Damages for the combination of acts are the value of the business,
which would have thrived absent both the contracts and the acquisitions. Now
consider damages if only the contracts but not the acquisitions are illegal. In the
but-for analysis, the acquisitions are hypothesized to occur because they are not
illegal, but the contracts are not. But the plaintiff's business cannot function in that but-for situation because the acquisitions alone were sufficient to ruin the business.
Hence damages—the difference in value of the plaintiff’s business in the but-for
and actual situations—are zero. The same would be true for a separate damages
measurement for the acquisitions, with the contracts taken to be legal but not
the acquisitions. Thus, the sum of damages for the individual acts is zero, but the
damages if both acts are illegal are the value of the business.
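The substitute-acts logic can be sketched as a toy but-for comparison. The $5 million business value is an invented figure, not drawn from the example:

```python
# Substitute-acts sketch of the contracts/acquisitions example: either act
# alone ruins the business, so isolated damages are zero while damages for
# both acts together equal the full business value. The value is invented.

BUSINESS_VALUE = 5_000_000  # hypothetical but-for value of the business

def but_for_value(contracts_illegal, acquisitions_illegal):
    """Business value when only the acts found illegal are removed."""
    contracts_remain = not contracts_illegal
    acquisitions_remain = not acquisitions_illegal
    if contracts_remain or acquisitions_remain:
        return 0             # either remaining act still ruins the business
    return BUSINESS_VALUE    # both removed: the business would have thrived

def damages(contracts_illegal, acquisitions_illegal):
    actual_value = 0  # in the actual world both acts occurred; business ruined
    return but_for_value(contracts_illegal, acquisitions_illegal) - actual_value

print(damages(True, False))   # 0
print(damages(False, True))   # 0
print(damages(True, True))    # 5000000
```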
When the effects of the challenged conduct are complementary, the sum of
damages for each type of conduct by itself will be more than damages for all types
of conduct together. For example, suppose a party claims that a contract is exclusionary based on the combined effect of the contract’s duration and its liquidated
damages clause that includes an improper penalty provision. The actual amount of
the penalty would cause little exclusion if the duration were brief, but substantial
exclusion if the duration were long. Similarly, the actual duration of the contract
would cause little exclusion if the penalty were small but substantial exclusion if
the penalty were large. A damages analysis for the penalty provision in isolation
compares but-for—without the penalty provision but with long duration—to
actual, where both provisions are in effect. Damages are large. Similarly, a damages
estimate for the duration in isolation gives large damages. The sum of the two
estimates is nearly double the damages from the combined use of both provisions.
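The complementary-acts logic admits a similar toy sketch; the harm units below are invented, chosen only to show why the two isolated estimates nearly double-count the combined harm:

```python
# Complementary-acts sketch of the duration/penalty example: exclusion is
# large only when both provisions operate together, so isolated damages
# estimates nearly double-count the combined harm. Harm units are invented.

def exclusion_harm(long_duration, large_penalty):
    """Stylized harm from the contract in a given but-for configuration."""
    if long_duration and large_penalty:
        return 100   # substantial exclusion from the combination
    if long_duration or large_penalty:
        return 10    # either provision alone causes little exclusion
    return 0

actual = exclusion_harm(True, True)
dmg_penalty_alone = actual - exclusion_harm(True, False)     # 90
dmg_duration_alone = actual - exclusion_harm(False, True)    # 90
dmg_combined = actual - exclusion_harm(False, False)         # 100
print(dmg_penalty_alone + dmg_duration_alone, dmg_combined)  # 180 100
```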
Thus, a request that the damages expert disaggregate damages for different
combinations of challenged acts is far more than a request that the total damages
estimate be broken down into components that add up to the damages attributable to the combination of all the challenged acts. In principle, a separate damages
analysis—with its own carefully specified but-for scenario and analysis—needs to
be done for every possible combination of illegal acts.
Example: Hospital challenges Glove Maker for illegally obtaining market power
through the use of long-term contracts and the use of a discount program that gives discounts to consortiums of hospitals if they purchase
exclusively from Glove Maker. The jury finds that Glove Maker has
attempted to monopolize the market with its discount programs, but
that the long-term contracts were legal because of efficiencies. Hospital
states that its damages are the same as in the case in which both acts
were unlawful because either act was sufficient to achieve the observed
level of market power. Glove Maker argues that damages are zero
because the lawful long-term contracts would have been enough to
allow it to dominate the market.
Comment: The appropriate damages analysis is based on a careful new comparison
of the market with and without the discount program. The but-for
analysis should include the presence of the long-term contracts because
they were found to be legal.
Apportionment, sometimes referred to as disaggregation, can arise in a different setting. A damages measure may be challenged as encompassing more
than the harm caused by the defendant’s harmful act. The expert may be asked
to apportion his estimate of damages between the harm caused by the defendant
and the harm caused by factors other than the defendant’s misconduct. In this case,
the expert is being asked to correct an improperly inclusive damages estimate,
not to disaggregate damages among multiple challenged acts. If the expert uses
the standard format and thus properly isolates the effects of only the defendant’s
wrongful actions, no modification of the expert’s estimate of damages is needed.
In the standard format, the but-for analysis differs from the actual world only by
hypothesizing the absence of the harmful act committed by the defendant. The
comparison of the but-for world with the actual world automatically isolates
the causal effects of the harmful act. No disaggregation of damages caused by the
harmful act is needed once the standard format is applied.
D. Is There a Dispute About Whether the Plaintiff Is Entitled
to All the Damages?
When the plaintiff is in some sense a conduit to other parties, the defendant may
argue that the plaintiff is entitled to only those damages that it would have retained
in the but-for scenario. In the following example, a regulated utility is arguably a
conduit to the ratepayers:
Example: Generator Maker overcharges Utility. Generator Maker argues that
the overcharge would have been part of Utility’s rate base, and
so Utility’s regulator had set higher prices because of the overcharge.
Utility, therefore, did not lose anything from the overcharge. Instead,
the ratepayers paid the overcharge. Utility argues that it stands in for
all the ratepayers and that the damages award will accrue to the ratepayers by the same principle—the regulator will set lower rates because
the award will count as revenue for rate-making purposes.
Comment: In addition to the legal issue of whether Utility does stand in for ratepayers, there are two factual issues: Was the overcharge actually passed
on to ratepayers? Will the award be passed on to ratepayers?
Similar issues can arise in the context of employment law.
Example: Plaintiff Sales Representative sues for wrongful denial of a commission.
Sales Representative has subcontracted with another individual to do
the actual selling and pays a portion of any commission to that individual as compensation. The subcontractor is not a party to the suit.
Defendant Manufacturer argues that damages should be Sales Representative’s lost profit measured as the commission less costs, including
the payout to the subcontractor. Sales Representative argues that she
is entitled to the entire commission.
Comment: Given that the subcontractor is not a plaintiff, and Sales Representative avoided the subcontractor’s commission, the literal application
of standard principles for damages measurement would appear to call
for the lost-profit measure. The subcontractor, however, may be able
to claim his share of the damages award. In that case, damages would
equal the entire lost commission, so that, after paying off the subcontractor, Sales Representative receives exactly what she would have
received absent the breach. Note that the second approach would place
the subcontractor in exactly the same position as the Internal Revenue
Service in our discussion of adjustments for taxes in Section VI.E.
The issue also arises acutely in the calculation of damages on behalf of a nonprofit corporation. When the corporation is entitled to damages for lost profits,
the defendant may argue that the corporation intentionally operates its business
without profit. The actual losers in such a case are the people who would have
enjoyed the benefits from the nonprofit that would have been financed from the
profits at issue.
E. Are the Defendants Disputing the Apportionment of
Damages Among Themselves?
When the defendants are not jointly liable for the harmful acts, but rather each is
responsible for its own harmful act, the damages expert needs to quantify damages
separately for each defendant. The issues in apportionment among defendants are
similar to those discussed above for disaggregation among the harmful acts.
1. Are the defendants disputing apportionment among themselves despite
full information about their roles in the harmful event?
In the simplest case, there are no interactions among the harmful acts of different
defendants, and the expert can proceed as if there were separate trials with separate
damages analyses.
However, if there are interactions among the harmful acts, then apportionment among defendants involves puzzles that cannot be resolved by economic
principles. If either of the harmful acts of two defendants would have caused all
the harm that occurred, then either defendant can argue for zero damages on
the ground that the harm would have occurred anyway, because of the other
defendant’s act.
Example: Tire Maker supplies a faulty tire and Landing Gear Maker supplies
faulty landing gear. Either one would have resulted in the loss of
the airplane upon landing. Airline measures damages as the value
of the airplane and proposes that the two defendants split the amount
equally. But Tire Maker asserts that the damages it owes the plaintiff
are zero because the crash would have occurred anyway because of
the Landing Gear Maker’s faulty landing gear. Similarly, the Landing
Gear Maker asserts the damages it owes the plaintiff to be zero because
the crash would have occurred anyway because of the Tire Maker’s
faulty tire.
The issue also arises when the interaction is more complicated.
Example: Teenager drives through a red light and injures Driver. The injury
is more serious than it would have been otherwise because Driver’s
airbag failed to deploy. Airbag Maker argues that it should pay nothing
because there would have been no harm if Teenager had obeyed the
red light. Teenager argues that Airbag Maker should pay the difference between the actual harm to Driver and the harm if the airbag had
worked properly.
2. Are the defendants disputing the apportionment because the wrongdoer
is unknown?
A second issue in apportioning damages arises when the harmful product is
known, but more than one defendant made the product, and it is not known
which made the product that caused the injury. One approach is to determine
the probability that each defendant made the product that caused the plaintiff’s
loss. In some cases, a reasonable assumption may be that the probability that the
defendant caused the plaintiff’s losses may be determined from its market share.
Thus, for example, a drug manufacturer’s responsibility would be proportional to
the likelihood that a plaintiff consumed one of its pills.
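A minimal sketch of the market-share approach, with invented shares and a hypothetical damages total:

```python
# Market-share apportionment sketch: when the injurious product's maker is
# unknown, each defendant's share of damages is set to the probability
# (here, its market share) that it made the product. Figures hypothetical.

def apportion_by_market_share(total_damages, market_shares):
    """market_shares maps each defendant to its share; shares sum to 1."""
    return {maker: total_damages * share
            for maker, share in market_shares.items()}

alloc = apportion_by_market_share(1_000_000, {"A": 0.5, "B": 0.3, "C": 0.2})
print({maker: round(amount) for maker, amount in alloc.items()})
```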
F. Is There Disagreement About the Role of Subsequent Unexpected
Events?
Random events occurring after the harmful event can affect the plaintiff’s actual
loss. The effect might be either to amplify the economic loss from what might
have been expected at the time of the harmful event or to reduce the loss.
Example: Housepainter uses faulty paint, which begins to peel a month after
the paint job. Owner measures damages as the cost of repainting.
Painter disputes on the ground that a hurricane that actually occurred
3 months after the paint job would have ruined a proper paint job
anyway.
Comment: This dispute will need to be resolved on legal rather than economic
grounds. Both sides can argue that their approach to damages will, on
average over many applications, result in the right incentives for proper
house painting.
The issue of subsequent random events should be distinguished from the legal
principle of supervening events.79 The subsequent events occur after the harmful
act; there is no ambiguity about who caused the damage, only an issue of quantification of damages. Under the theory of a supervening event, there is precisely
a dispute about who caused an injury. In the example above, there would be an
79. See, e.g., Derdiarian v. Felix Contracting Corp., 414 N.E.2d 666 (N.Y. 1980) (interpreting
state law to hold that a jury could find that the defendant is ultimately liable to plaintiff for negligence,
even though a third person’s negligence was a supervening event); Lavin v. Emery Air Freight Corp.,
980 F. Supp. 93 (D. Conn. 1997) (holding that under Connecticut law, a party seeking to be excused
from a promised performance as a result of a supervening event must show the performance was made
impracticable, non-occurrence was an assumption at the time the contract was made, impracticability
did not arise from the party’s actions, and the party seeking to be excused did not assume a greater
liability than the law imposed).
issue of the role of a supervening event if the paint had not begun to peel until
after the hurricane.
Disagreements about the role of subsequent random events are particularly
likely when the harmful event is fraud.
Example: Seller of property misstates condition of property. Buyer shows that
he would not have purchased the property absent the misstatement.
Property values in general decline sharply between the fraud and the
trial. Buyer measures damages as the difference between the purchase
price and the market value of the property at the time of trial. Seller
measures damages as the difference between the purchase price and the
market value at the time of purchase, assuming full disclosure.
Comment: Buyer may be able to argue that retaining the property was the reasonable course of action after uncovering the fraud; in other words, there
may be no issue of mitigation here. In that sense, Seller’s fraud caused
not only an immediate loss, as measured by Seller’s damages analysis,
but also a subsequent loss. Seller, however, did not cause the decline
in property values. The dispute needs to be resolved as a matter of law.
As a general matter, it is preferable to exclude the effects of random subsequent events, especially if the effects are large in relation to the original loss.80
The reason is that plaintiffs choose which cases to bring, which may influence the
approach to damages. If random subsequent events are always included in damages, then plaintiffs will bring the cases that happen to have amplified damages
and will not pursue those where the random later event makes damages negative.
Such selection of cases will overcompensate plaintiffs. Similarly, if plaintiffs can
choose whether to include the effects of random subsequent events, plaintiffs
will choose to include those effects when they are positive and exclude them
when they are negative. Again, the result will be to overcompensate plaintiffs.81
If random subsequent events are always excluded, then the plaintiff is compensated for his loss, however temporary, and the defendant pays for the damages
he actually caused.
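The selection argument can be illustrated with a small simulation; the base loss and the shock distribution are invented parameters, not from the text. When plaintiffs may count a random subsequent shock only when it raises damages, average recovery exceeds the average true loss:

```python
# Selection sketch: if plaintiffs include a random subsequent shock only
# when it helps them, the average recovery exceeds the average base loss,
# overcompensating plaintiffs. Distribution parameters are invented.

import random

random.seed(0)
BASE_LOSS = 100.0
trials = 10_000
recoveries = []
for _ in range(trials):
    shock = random.gauss(0, 50)  # random event after the harmful act
    recoveries.append(max(BASE_LOSS + shock, BASE_LOSS))  # keep if helpful
avg_recovery = sum(recoveries) / trials
print(round(avg_recovery, 1))  # noticeably above the base loss of 100
```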
80. See Franklin M. Fisher & R. Craig Romaine, Janis Joplin’s Yearbook and the Theory of Damages,
in Industrial Organization, Economics, and the Law 392, 399–402 (John Monz ed., 1991); Fishman v.
Estate of Wirtz, 807 F.2d 520, 563 (7th Cir. 1986) (Easterbrook, J., dissenting in part).
81. See William B. Tye et al., How to Value a Lost Opportunity: Defining and Measuring Damages from Market Foreclosure, 17 Res. L. & Econ. 83 (1995). For a discussion of disclosure of expert
reports under Federal Rule of Civil Procedure 26(a)(2), see Margaret A. Berger, The Admissibility
of Expert Testimony, Section V.B.1, in this manual. For a discussion of disclosure of data supporting
expert testimony, see Daniel L. Rubinfeld, Reference Guide on Multiple Regression, Section V, in
this manual.
IX. Data Used to Measure Damages
A. Types of Data
1. Electronic data
Electronic data have three general formats: (1) proprietary, (2) electronic character, and (3) scanned.
Examples of proprietary formats are those used by the SAS statistical software,
the Oracle database system, and Microsoft’s Access and Excel. Although these
formats are proprietary, the ones we have listed are de facto industry standards and
are the most convenient ways to transmit data among experts and from parties to
experts. All of these software systems can create data files in Excel format, which
is the effective universal standard for sharing smaller bodies of data.
Electronic character representations are almost always in Adobe’s Portable
Document Format (PDF), a public domain standard. Essentially any computer
software can produce a PDF document. The PDF format is convenient for the
electronic sharing of documents formatted for visual presentation (as opposed to
files formatted to be read by computers), but is not a useful way to move data,
especially in large volumes. Reading data from a PDF document into analytical
software requires endless human intervention.
Scanned documents are represented internally as pixels, not as characters.
The automatic reading of scanned numerical documents into analytical software
is close to impossible, because optical character recognition software is unreliable with numerical material and requires large amounts of human intervention,
character by character.
Much confusion exists between electronic character documents and scanned
documents, because both are part of the PDF standard. It is easy to tell them
apart. Under magnification, an electronic character document still shows perfectly crisp characters, while a scanned document reveals its granular pixels.
2. Paper data
Although the overwhelming majority of business records are kept in computer
form today, historical data may be available only in paper form. The data usually
reach the expert as scanned document images. Then, the expert needs to deal with
the problems of accurately reading scanned data.
3. Sampling data
In some instances, the expert is faced with more information than is possible to
process. This situation is most likely to arise if human review of each data record
is part of the processing. Even if processing all of the data may be ultimately necessary, processing a sample of the data for preliminary analyses may be appropriate.
If the expert elects to study a sample of the data, the expert needs to have
carefully considered how the information will be used to ensure that the data
sample is large enough and contains sufficient information. Usually, this requires
that the expert has constructed a model of damages and a related sampling
plan that includes an estimate of the sampling error. Unless the expert is a trained
statistician, the expert should seek outside help in designing the sampling plan.
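A hedged sketch of estimating a damages total from a random sample of records, with simulated per-record losses (a real sampling plan would also justify the sample size and design in advance):

```python
# Sampling-plan sketch: estimate total damages from a simple random sample
# of records and report the sampling error of that estimate. Per-record
# losses are simulated for illustration.

import random
import statistics

random.seed(1)
population = [random.uniform(0, 200) for _ in range(50_000)]  # record losses
sample = random.sample(population, 500)

sample_mean = statistics.mean(sample)
se_mean = statistics.stdev(sample) / len(sample) ** 0.5
estimated_total = sample_mean * len(population)
se_total = se_mean * len(population)  # sampling error of the estimated total
print(round(estimated_total), round(se_total))
```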
4. Survey data
Another situation arises when the data can only be obtained by interviewing
individuals. This often arises because damages hinge on consumer preference.
For example, the issue may be how many bicycles of a certain brand would have
been sold absent a misappropriated braking system. Another example might be the
number of people who would have ordered their prescription contact lenses over
the Internet if the wholesalers had not conspired to restrict sales to the Internet
retailers.
A need for a survey can also arise when the plaintiffs comprise a class. In this
situation, it may be prohibitive to interview every class member, and the damages
expert will need to construct both a sampling plan and a survey instrument so that
the results can be reliably used to estimate damages.
The principles in constructing a sample to collect data for analysis from a
dataset apply to constructing a sample of individuals to be surveyed: The expert
needs to have carefully considered how the information will be used to ensure
that the data sample is large enough and contains sufficient information. In addition, complexities arise because some respondents selected to be surveyed will not
be reachable or will be unwilling to complete a survey. The expert must devise
a plan to deal with such contingencies and be confident that such problems do
not bias the results.
Care must also be taken in developing the survey instrument. Generally, it is
advantageous to work with experts in survey research to ensure that the responses can be
reliably interpreted and are not biased.82
B. Are the Parties Disputing the Validity of the Data?
Validation of any dataset is critical. The expert needs to have a firm basis for
relying on the chosen data. Opposing parties frequently try to impeach damages
estimates by challenging the reliability of the data or an expert’s validation of
the data.
82. For a discussion of the issues in designing sampling plans and survey instruments particularly
for use in litigation, see Shari Seidman Diamond, Reference Guide on Survey Research, in this manual.
1. Criteria for determining validity of data
The validity of data is ultimately a matter of judgment. Experts often need to use
data that are not mathematically precise, because the only relevant data may be
known to contain some errors. Experts generally have an obligation to use data
that are as accurate as possible, meaning that the expert has used every practical
means to eliminate erroneous information. Experts should also perform cross
checks with other data, to the extent possible, to demonstrate completeness and
reliability. When data are inherently inaccurate because of random influences,
validity requires absence of bias or adjustment for bias. Validation of data turns
in part on commonsense indicators of accuracy and bias. The following is a list,
in rough order of presumptive validity, of data sources often used in damages
measurement:
• Official government publications and databases, such as from the Census
Bureau, the BLS, and the Bureau of Economic Analysis;
• A company’s audited financial statements and filings with the Securities
and Exchange Commission;
• A company’s accounting records maintained in the normal course of
business;
• A company’s operating reports prepared for management in the normal
course of business;
• A survey designed by the damages expert with assistance from survey
professionals, conforming to established standards of survey design and
execution;
• A marketing research study conforming to established standards for these
studies;
• Industry reports and other materials prepared by unaffiliated organizations
and consultants;
• Newspaper articles;
• A company’s study of damages from the harmful event, prepared in the
normal course of business; and
• A company’s study of damages, prepared for litigation.
Other factors can alter this presumptive order of validity. When audited
financial statements are alleged to be fraudulent, they lose their presumption
of validity. Conversely, the most thoroughly researched articles in the best
newspapers carry a higher presumption of validity than newspaper articles
generally, and some private industry reports are highly reliable.
Rules of evidence may also affect when and how these various data sources can
be used in expert reports and at trial. However, when internal data are unavailable
through no fault of the plaintiff, then courts often will make allowances for the
lack availability if the expert has made every effort to demonstrate that the data
relied upon are reasonable.
2. Quantitative methods for validation
One important aspect of validation is to verify that the data are complete. If a
separate summary document is available that shows the number of records in the
database together with summary statistics such as total amounts paid, completeness is easy to establish. Other methods for establishing completeness include
examining serial numbers for records and finding other sources of information
about transactions that should be in the data and verifying the presence of all or
a sample of them.
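These completeness checks — comparing record counts and totals against a separate summary document, and scanning serial numbers for gaps — can be sketched in a few lines. All record values and control totals below are hypothetical.

```python
# Hypothetical transaction records extracted from the produced database.
records = [
    {"id": 1, "paid": 120.00},
    {"id": 2, "paid": 80.50},
    {"id": 4, "paid": 45.25},   # note: id 3 is absent
]

# Control totals from a separate summary document.
control = {"record_count": 4, "total_paid": 295.75}

count_ok = len(records) == control["record_count"]
total_ok = abs(sum(r["paid"] for r in records) - control["total_paid"]) < 0.01

# A gap in serial numbers can also flag missing records.
ids = sorted(r["id"] for r in records)
gaps = [i for i in range(ids[0], ids[-1] + 1) if i not in ids]

print(count_ok, total_ok, gaps)  # -> False False [3]
```

Here both checks fail and the serial-number scan points to the specific missing record, which is exactly the kind of signal that prompts a follow-up request to the producing party.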
Another validation method is to examine specific observations. For example,
if the dataset consists of purchasing records, then the expert may examine all of
the records for sampled customers, or the expert may examine the information
for selected transactions. This type of validation is particularly useful if damages
depend on certain types of transactions that are identified by additional data on
the purchasing records but are not summarized in other records.
Another approach is to test the internal integrity of the data. For example, a
company may keep separate records of sales of products at its stores and shipments
to the stores. The expert can compare sales to shipments to establish data integrity.
Not surprisingly, validation of data usually reveals some inconsistent or missing data. Some ways to handle these issues are discussed in the next section.
C. Are the Parties Disputing the Handling of Missing Data?
In dealing with missing data, it is critical to ascertain why the data are missing
and to attempt to isolate the extent of the missing data. If only a small fraction of
data is missing and the pattern appears to be random, then potentially the issue
of the missing data can be disregarded and inferences can be drawn using only
the available data.
However, missing data are seldom random. For example, suppose that only
1% of transactions are missing, but the transactions that are missing are large ones
accounting for a third of all volume. Validation by summarizing across different
characteristics will usually identify missing data that are not random. Such errors
might occur if all of the missing transactions were submitted in a different format
that the program for reading the data does not handle. For example, a manufacturer might record sales to Wal-Mart separately from all other customers.
Identifying and adding the missing data to the database is the best correction for missing data, but this is often not possible. In this case, the expert needs
to address the problem in another way. The simplest method is to “gross up”
damages to reflect the missing data. For example, if 10% of the transactions are
randomly missing, then the expert may correct for the missing transactions by
dividing calculated damages by 0.9. This method implicitly assumes that the percent missing is known and that the missing and nonmissing transactions reflect
the same damages.
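The gross-up described above is a one-line correction; a minimal sketch, using the text’s 10% figure:

```python
def gross_up(calculated_damages: float, fraction_missing: float) -> float:
    """Scale damages computed on available data to reflect randomly
    missing transactions: divide by (1 - fraction missing)."""
    if not 0 <= fraction_missing < 1:
        raise ValueError("fraction_missing must be in [0, 1)")
    return calculated_damages / (1 - fraction_missing)

# 10% of transactions randomly missing: divide by 0.9.
print(gross_up(900_000, 0.10))  # -> 1000000.0
```

As the text notes, this is valid only if the fraction missing is known and the missing transactions resemble the observed ones.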
A related approach is to rely on partial, detailed data to measure damages as
a fraction of another variable, such as sales. With this approach, other reliable and
complete company records such as audited financials may be used to identify the
company’s total revenues, and damages are then calculated as the fraction of total
sales as calculated above. The expert may choose to patch together incomplete
data from one source and infer complete, reliable data in other ways. Such ways
can include using a survey of customers or workers to measure damages per dollar
of sales or per dollar of earnings and then applying those ratios to reliable data on
total sales or total earnings.
Example: Credit Card Issuer sells cardholders fraudulent overcharges for insurance against theft for computer purchases. But the insurance does not
cover theft of additional purchases such as a printer. The overcharge is
found to be 1% of the price of the computer. The transactions reflected
in the only available data include computers bundled with printers.
Defendant uses the assumption that all transactions over $800 include
the purchase of a printer and deducts $150 as the average amount
spent for the printer. Credit Card Issuer’s damages estimate is 1% ×
(total purchases less $150 times the number of purchases for more than
$800). The expert for the class of insurance purchasers surveys a sample
of purchasers and finds that fewer printers were actually purchased in
the sample than implied by Issuer’s damages formula and thus calculates
a higher total overcharge for the class.
Comment: The parties have competing approximations to solve the same problem
in the data. The resolution depends on which one is more accurate. A
proper survey would probably be the better answer unless the expert
for the Issuer can offer additional evidence about the reliability of the
approach used.
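A minimal sketch of the competing calculations in the example, with hypothetical transactions, shows why counting actual printer purchases (from a survey) rather than assuming one in every transaction over $800 changes the estimate:

```python
# Hypothetical transaction data: (price paid, bought_printer) pairs.
transactions = [
    (600.0, False), (900.0, False), (1_200.0, True),
    (750.0, False), (1_500.0, True), (850.0, False),
]

OVERCHARGE_RATE = 0.01   # overcharge found to be 1% of the computer price
PRINTER_PRICE = 150.0    # defendant's assumed average printer price

total = sum(p for p, _ in transactions)

# Defendant's formula: every transaction over $800 includes a printer.
n_over_800 = sum(1 for p, _ in transactions if p > 800)
defendant_estimate = OVERCHARGE_RATE * (total - PRINTER_PRICE * n_over_800)

# Class expert: survey reveals which purchases actually included a printer.
n_printers = sum(1 for _, has_printer in transactions if has_printer)
class_estimate = OVERCHARGE_RATE * (total - PRINTER_PRICE * n_printers)

print(round(defendant_estimate, 2), round(class_estimate, 2))  # -> 52.0 55.0
```

Because fewer printers were actually purchased than the $800 threshold implies, the survey-based estimate exceeds the defendant’s, mirroring the example.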
X. Standards for Disclosing Data to
Opposing Parties
The usual procedure for disclosure of work performed by the damages expert in
federal cases is to provide electronic data at the same time or soon after the delivery of the expert’s Rule 26(a)(2) report.83 The data enable the opposing expert to
replicate and investigate the damages expert’s work in preparation for the expert’s
deposition and the opposing expert’s rebuttal analysis. Even the most complete
Rule 26(a)(2) report falls far short of enabling replication of damages calculations
83. For a discussion of disclosure of expert reports under Federal Rule of Civil Procedure 26(a)
(2), see Margaret A. Berger, The Admissibility of Expert Testimony, Section V.B.1, in this manual.
in all but the simplest damages case. The fundamental standard for data disclosure
pursuant to Rule 26(a)(2) should be disclosure of all the materials, starting from original data
sources through all intermediate calculations up to the final computer output
reflecting the results shown in the Rule 26(a)(2) report. This disclosure should
also include all scripts (including programs or any instructions to any program
used) as well as all data involved in any step of the computations in the format
used by the corresponding software. In particular, disclosures of Excel or other
such spreadsheet files should include the cell formulas in the worksheets. In addition to the backup
for the calculations described in the expert’s report, the disclosure should include
the materials relating to any other opinion the expert has reached. If confidential
data are involved, appropriate protective orders can be sought.
A. Use of Formats
Disclosure of data should be in standard formats. In general, the formats used by
damages experts include Access, Oracle, and other relational databases; Excel, SAS,
and Stata datasets; and flat files containing uniformly formatted data in character
form. It is critical that the data be provided as actual data files on computer media
such as DVDs or data disks, not paper or electronic printouts or reports formatted
for visual presentation. As noted earlier, materials formatted for visual presentation
are generally difficult to convert back to formats suitable for computer analysis.
B. Data Dictionaries
The disclosure should include data dictionaries when variable formats and descriptions are not obvious. Data dictionaries should state the format for each variable,
the range of appropriate values, and how to interpret the data. This information is
particularly important for historic data because specific data formatting may have
been used to convey information. For example, positive income values might be
used to indicate total household income but negative values might indicate the
income of individuals in the household. This specific formatting may have evolved
because the underlying data came from multiple sources or data storage was at a
premium. Problems of this nature occur less often in more current data because
the price of data storage has declined dramatically.
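The sign-encoded income field described above can be decoded with a small helper of the kind a data dictionary would specify. The function name and the encoding rule are illustrative only, not drawn from any real data dictionary.

```python
def decode_income(raw: float) -> tuple[str, float]:
    """Interpret a sign-encoded income field: positive values are total
    household income; negative values are the income of an individual
    in the household (a hypothetical encoding of the kind described)."""
    if raw >= 0:
        return ("household", raw)
    return ("individual", -raw)

print(decode_income(52_000.0))   # -> ('household', 52000.0)
print(decode_income(-18_500.0))  # -> ('individual', 18500.0)
```

Without the data dictionary, an analyst summing this field would silently subtract individual incomes from household totals — the kind of misinterpretation the text warns about.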
Often, the expert will receive multiple databases from an opposing party. In
this situation, the expert needs the requisite information for linking the datasets
together. Also, the expert needs to know how the data were compiled in order
to understand how to interpret inconsistent information. For example, if the database consists of events corresponding to subscriptions to a newspaper, the expert
may encounter overlapping dates for subscriptions to weekday-only service and
subscriptions to both weekday and weekend service. Such issues become acute
when the data have not been reviewed for errors or difficulties in interpretation.
In the example presented, the expert may be advised that the most reliable data
are the starting dates and that the end dates should be ignored unless the entire
subscription is terminated.
C. Resolution of Problems
Friction between parties over disclosure of electronic backup is unfortunately
common. Some common accusations are
• Failing to disclose the intermediate steps that were performed to generate
the data used in the final calculations from the source data;
• Disclosing data and other materials only on paper or as scanned images,
not in electronically readable form;
• Disclosing data as reports formatted as tables (although the expert may have
originally received the data in this format);
• Concealing the logic of an Excel spreadsheet by revealing only the cell
values and not the formulas used to generate the values;
• Failing to provide data dictionaries explaining the meaning of the underlying data; and
• Omitting calculations related to opinions other than the actual damages
calculation.
A judge, magistrate, or special master overseeing discovery should become
familiar with these issues to resolve the disputes fairly and to ensure full disclosure
of each expert’s numerical work to the opposition.
A tool that may be effective, but that is rarely used in the United States, is
to have the experts meet without attorneys and identify where they agree and
where they disagree.84 Such an arrangement would generally require the consent
of the parties.
84. New Zealand allows such expert conferences in certain circumstances. For example, High
Court Rule 9.44 provides:
(1) The court may, on its own initiative or on the application of a party to a proceeding, direct expert
witnesses to—(a) confer on specified matters; (b) confer in the absence of the legal advisers of the parties;
(c) try to reach agreement on matters in issue in the proceeding; (d) prepare and sign a joint witness
statement stating the matters on which the expert witnesses agree and the matters on which they do not
agree, including the reasons for their disagreement; (e) prepare the joint witness statement without the
assistance of the legal advisers of the parties. (2) The court must not give a direction under subclause
(1)(b) or (e) unless the parties agree.
Judicature Act 1908 No. 89 (as at 24 May 2010), Schedule 2 High Court Rules, Part 9 Evidence,
Subpart 5—Experts, Rule 9.44.
D. Special Masters and Neutral Experts
Court-appointed individuals with the appropriate backgrounds can be useful in
cases with complex damages calculations. If such an individual is assisting the
court, the parties are likely to cooperate promptly in disclosing their underlying
computer work, knowing that any failure to cooperate would be recognized
immediately.
XI. Damages in Class Actions
A. Class Certification
Damages play a large and growing role in the certification of a class. Courts
are exhibiting an increasing tendency to deny certification unless the class has a
well-developed method for measuring damages for individual class members. One
aspect of this tightening of standards is the use of a damages model to limit the
membership of the class to individuals who are known to have incurred losses
from the harmful conduct. Whereas earlier standards for damages were mainly
the assurance of a qualified expert that damages could be measured later in the
proceeding, some courts now require the expert to present a more fully developed
method for quantifying damages.85 Disputes about the practicality of damages
measurement are more and more likely in proposed class actions.86
A court operating under the rule that class certification requires a fully developed damages quantification will need to grant discovery prior to class certification
to support the class’s damages analysis and the defendant’s opposition.
B. Classwide Damages
The class’s damages expert normally measures and testifies to classwide damages using methods discussed elsewhere in this chapter. In many class actions,
damages ultimately will be paid to class members who file claims in a phase that
occurs after settlement or trial. In principle, the damages experts will need to
forecast the number of claimants as well as the average amount of damages per
claimant. The propensity of class members to file claims depends critically on the
amount of their likely recovery. Asbestos victims have high claim rates, whereas
individuals who overpaid their cell phone bills by a few dollars have low rates.
85. See David S. Evans, The New Consensus on Class Certification: What It Means for the Use of
Economic and Statistical Evidence in Meeting the Requirements of Rule 23 (Jan. 2009), available at http://
ssrn.com/abstract=1330594.
86. See, e.g., In Re New Motor Vehicles Canadian Export Antitrust Litig., 522 F.3d 6 (1st Cir.
2008).
C. Damages of Individual Class Members
Damages experts may also have a role in the process of disbursing funds from verdict or settlement to individual class members. An expert can develop software that
measures individual damages based on evidence supplied by individuals through
a claims-processing facility. For example, in Millsap v. McDonnell-Douglas,87 more
than 1000 class members, victims of the defendant’s challenged layoffs, completed sworn questionnaires describing their post-layoff experiences. McDonnell-Douglas supplied additional information from its employee records. The class’s
damages expert used standard methods for valuing claims for lost earnings to
calculate estimates for each class member. Because the settlement compromised a
number of disputes about the law and about the facts underlying the layoffs, the
total cash from the settlement was less than the sum of these estimates, and class
members received a fraction of the amount indicated by the damages model.
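The final step described above — paying each class member a fraction of his or her modeled damages when the settlement fund is smaller than the sum of the estimates — can be sketched as a pro-rata allocation. The member names and dollar amounts are hypothetical.

```python
def allocate(fund: float, estimates: dict[str, float]) -> dict[str, float]:
    """Pro-rata allocation: every class member receives the same fraction
    of his or her modeled damages when the fund cannot pay them in full."""
    total = sum(estimates.values())
    fraction = min(1.0, fund / total)
    return {member: round(est * fraction, 2) for member, est in estimates.items()}

# Hypothetical modeled damages for three claimants; a $120,000 fund.
estimates = {"A": 100_000.0, "B": 60_000.0, "C": 40_000.0}
payouts = allocate(120_000.0, estimates)
print(payouts)  # -> {'A': 60000.0, 'B': 36000.0, 'C': 24000.0}
```

Here the fund covers 60% of modeled damages, so each claimant receives 60 cents on the dollar — the same structure as a compromised settlement paying a fraction of the model’s estimates.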
D. Have the Defendant and the Class’s Counsel Proposed a
Fair Settlement?
The classwide damages measure has a key role in resolving class-action cases,
because courts refer to it in determining the fairness of proposed settlements. The
court’s careful review of the benefits proposed for the class is essential because
the interests of the class’s counsel are not aligned with those of the class members
with respect to settlement.
Example: Lender required excessive escrow deposits for property taxes from a
class of mortgage borrowers, although the excess was repaid at the end
of each year. Under the terms of the proposed settlement negotiated
between Lender’s lawyers and those for the class, the excess is to be
refunded to the class members immediately, with 30% of that amount
paid to class counsel as fees.
Comment: This settlement is unreasonable and leaves the class worse off than
they were under the excessive escrows. The loss to the class from
placing funds in an escrow is the foregone interest on the amount in
the escrow, which would likely be no more than 10% of the excess
amount of the escrow. By granting 30% of the refund as fees to class
counsel, the class members are at least 20% worse off than they would
be if the excess were repaid with a delay.
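The comment’s arithmetic can be made concrete with a short sketch. The $1,000 excess is a hypothetical figure; the 10% rate is the comment’s stated upper bound on foregone interest.

```python
excess = 1_000.0
interest_rate = 0.10   # foregone interest: at most 10% of the excess
fee_rate = 0.30        # 30% of the refund goes to class counsel as fees

loss_without_suit = excess * interest_rate   # the class's true loss: 100.0
fee = excess * fee_rate                      # cost of the settlement: 300.0

# Net effect: the class ends up at least 20% of the excess worse off.
print(fee - loss_without_suit)
```

The 30% fee exceeds the at-most-10% true loss, so the proposed settlement leaves the class worse off by at least 20% of the excess, as the comment concludes.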
87. No. 94-CV-633-H(M), 2003 WL 21277124 (N.D. Okla. May 28, 2003).
At one time, settlements granted coupons to class members rather than cash
compensation, but this practice is now discouraged. The valuation of the coupons
is controversial.88
Example: In a case brought by the Department of Justice, airlines were found
culpable for price fixing. The settlement of the derivative consumer
class action granted class members coupons for discounts on air travel.
Alaska Airlines, not a defendant in the government case or earlier in
the class action, petitioned the court to be added as a defendant so that
it too could gain the marketing advantage of the coupons.89
Comment: Alaska’s petition made it clear that the coupons were beneficial to the
airlines, not costly, and so the corresponding value to the class was
presumptively small.
XII. Illustrations of General Principles
In the sections below, we provide concrete examples of how damages may be
calculated in two common situations: (1) lost personal earnings and (2) lost profits
for a business. The discussions are intended to illustrate how to apply the general
ideas presented in the previous sections.
A. Claim for Lost Personal Income
Claims for lost personal earnings generally arise from wrongful termination, discrimination, injury, or death. The earnings usually come from employment in a
firm, but essentially the same issues arise if self-employment or partnership earnings are lost. Most damages studies for personal lost earnings closely fit the model
of Figure 1. The but-for world is usually based on the projected employment
trajectory absent the harmful act. Here we present an example of a moderately
realistic lost personal income damages quantification.
A construction worker sues for lost personal income after he is severely
injured when the defendant runs a red light and hits him. He asserts that
he is disabled and unable to work for the rest of his life. Moreover, his injuries
88. See Figueroa v. Sharper Image Corp., 519 F. Supp. 2d 1302–29 (S.D. Fla. 2007).
89. In Re Domestic Air Transp. Antitrust Litig., 148 F.R.D. 297 (N.D. Ga. 1993). According to
a spokesman for Alaska Air, “The airlines using those coupons are going to see substantial additional
ticket sales because of them. . . . We asked to be named in the case because, once we saw the settlement, we realized it was to our competitive disadvantage not to do so.” Anthony Faiola, In Settling
with Airlines, There’s No Free Ride; Coupons for Travelers, $16 Million for Lawyers, Washington Post,
Mar. 20, 1995, at A1. For a general discussion of coupon settlements, see Christopher R. Leslie, A
Market-Based Approach to Coupon Settlements in Antitrust and Consumer Class Action Litigation, 49 UCLA
L. Rev. 91 (2002).
are so severe that his lifespan has been shortened by 3.5 years. Although he was a
construction worker at the time of the accident, he had been going to school to
become a CPA. The plaintiff’s damages study presumes that he will not be able
to work at all in the future. The defendant argues that the plaintiff should have
continued his education after the accident and worked as a CPA. The defendant
also disputes the reliability of the reduced life expectancy calculation. The judge
has ruled that a jury should decide if there is a sufficient basis to conclude that
the calculation of lost personal earnings should reflect the plaintiff’s reduced life
expectancy.
1. Is there a dispute about projected earnings but for the harmful event?
A plaintiff who seeks compensation for lost earnings will normally estimate damages based on wages or salary; other cash compensation, such as commissions,
overtime, and bonuses; and the value of fringe benefits. Employees in similar
jobs whose earnings were not interrupted form a natural benchmark for earnings
growth between the harmful event and trial. The plaintiff may make the case that
a promotion or job change would have occurred during that period. Disputes
involving the more variable elements of cash compensation are likely to arise.
The plaintiff may measure bonuses and overtime during a period when these parts
of compensation were unusually high, while the defendant may choose a longer
period, during which the average is lower.
In our example, the construction worker claims that he would have made
$75,000 working for one more year in construction while completing his degree.
After that, he would have worked as a CPA earning $100,000 a year until retirement at age 70 based on the average salary for all CPAs. As a result of his injury,
he only receives $22,000 a year from disability payments. Table 4 shows these
projections.
The defendant’s damages study presumes that the plaintiff could have continued his education after the injury and begun working as a CPA a year later.
However, the defendant argues that the plaintiff would have earned only $75,000
as a CPA because of the plaintiff’s lackluster record as an undergraduate and his
career as a construction worker, which depreciated the skills that a
CPA needs. This salary is based on the median salary for all CPAs. Table 5 shows
the defendant’s projections.
2. Are the parties disputing the valuation of benefits?
Lost benefits are an important part of lost personal earnings damages. As discussed
in Section VIII.B, strict adherence to the format of Figure 1 can help resolve
these disputes.
In the example, the plaintiff’s actual income consists only of the disability
payments received because of the injury. Absent his injury, he would have
received benefits after retirement at age 70 higher than the disability payments. The defendant projects higher social
Table 4. Plaintiff’s Estimate of Lost Personal Income

[Table 4 tabulates, for each year of age from 56–57 through 100 and over: actual earnings (zero in every year), actual Social Security benefits of $22,008 per year, total actual income, the actual probabilities of surviving and of working, and expected actual income; but-for earnings ($75,000 in the first year, $87,083 in the transition year, and $100,000 per year as a CPA through retirement at age 70), but-for Social Security benefits of $31,152 per year beginning at age 70, total but-for income, the but-for probabilities of surviving and of working, and expected but-for income; and, for each year, the resulting lost income, the 1% discount rate and corresponding discount index, and discounted lost income. Total discounted lost personal income: $1,043,866.]
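The year-by-year computation behind a table of this kind can be sketched as follows. The rows are simplified and hypothetical (the probability-of-working adjustment is omitted): expected income is income times the probability of surviving, yearly lost income is but-for expected income minus actual expected income, and losses are discounted at 1% per year.

```python
# Simplified, hypothetical rows patterned on a lost-earnings table:
# (but-for income, but-for P(survive), actual income, actual P(survive)).
DISCOUNT_RATE = 0.01

years = [
    (75_000.0, 1.00, 22_008.0, 1.00),
    (87_083.0, 0.99, 22_008.0, 0.99),
    (100_000.0, 0.98, 22_008.0, 0.97),
]

total = 0.0
for t, (bf, p_bf, act, p_act) in enumerate(years):
    # Expected income = income times the probability of surviving;
    # lost income = but-for expected minus actual expected.
    lost = bf * p_bf - act * p_act
    total += lost / (1 + DISCOUNT_RATE) ** t  # discount back to trial
print(round(total, 2))
```

Extending the loop over every year of remaining life expectancy, with the appropriate probabilities in each row, yields a total discounted lost income of the kind reported at the bottom of such a table.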
Table 5. Defendant’s Estimate of Lost Personal Income

(The discount rate is 0.05 in every year, and the but-for probability of surviving equals the actual probability of surviving, so neither column is repeated below. Column key: Earn = earnings; SS = social security benefits; Total = Earn + SS; P(S) = probability of surviving; P(W) = probability of working; Exp = expected income; Lost = total lost income; Index = discount-rate index; Disc. Lost = discounted lost income. Probabilities and indexes are shown rounded to two decimals; the dollar figures reflect unrounded values.)

Age      ActEarn  ActSS   ActTotal  P(S)  ActP(W)  ActExp  BFEarn  BFSS    BFTotal  BFP(W)  BFExp   Lost    Index  Disc.Lost
56-57         0       0        0    1.00   0.00        0   75,000      0   75,000    0.80   60,000  60,000  1.00    60,000
57-58    75,000       0   75,000    0.99   1.00   74,361   75,000      0   75,000    1.00   74,361       0  0.95         0
58-59    75,000       0   75,000    0.98   1.00   73,678   75,000      0   75,000    1.00   73,678       0  0.91         0
59-60    75,000       0   75,000    0.97   1.00   72,944   75,000      0   75,000    1.00   72,944       0  0.86         0
60-61    75,000       0   75,000    0.96   1.00   72,146   75,000      0   75,000    1.00   72,146       0  0.82         0
61-62    75,000       0   75,000    0.95   1.00   71,280   75,000      0   75,000    1.00   71,280       0  0.78         0
62-63    75,000       0   75,000    0.94   1.00   70,341   75,000      0   75,000    1.00   70,341       0  0.75         0
63-64    75,000       0   75,000    0.92   1.00   69,336   75,000      0   75,000    1.00   69,336       0  0.71         0
64-65    75,000       0   75,000    0.91   1.00   68,268   75,000      0   75,000    1.00   68,268       0  0.68         0
65-66    75,000       0   75,000    0.90   1.00   67,135   75,000      0   75,000    1.00   67,135       0  0.64         0
66-67    75,000       0   75,000    0.88   1.00   65,937   75,000      0   75,000    1.00   65,937       0  0.61         0
67-68    75,000       0   75,000    0.86   1.00   64,665   75,000      0   75,000    1.00   64,665       0  0.58         0
68-69    75,000       0   75,000    0.84   1.00   63,310   75,000      0   75,000    1.00   63,310       0  0.56         0
69-70    75,000       0   75,000    0.82   1.00   61,863   75,000      0   75,000    1.00   61,863       0  0.53         0
70-71         0  30,864   30,864    0.80   0.00   24,821        0  31,152  31,152    0.00   25,053     232  0.51       117
71-72         0  30,864   30,864    0.78   0.00   24,140        0  31,152  31,152    0.00   24,365     225  0.48       108
72-73         0  30,864   30,864    0.76   0.00   23,408        0  31,152  31,152    0.00   23,626     218  0.46       100
73-74         0  30,864   30,864    0.73   0.00   22,621        0  31,152  31,152    0.00   22,832     211  0.44        92
74-75         0  30,864   30,864    0.71   0.00   21,779        0  31,152  31,152    0.00   21,982     203  0.42        84
75-76         0  30,864   30,864    0.68   0.00   20,879        0  31,152  31,152    0.00   21,074     195  0.40        77
76-77         0  30,864   30,864    0.65   0.00   19,927        0  31,152  31,152    0.00   20,113     186  0.38        70
77-78         0  30,864   30,864    0.61   0.00   18,922        0  31,152  31,152    0.00   19,098     177  0.36        63
78-79         0  30,864   30,864    0.58   0.00   17,868        0  31,152  31,152    0.00   18,035     167  0.34        57
79-80         0  30,864   30,864    0.54   0.00   16,771        0  31,152  31,152    0.00   16,927     156  0.33        51
80-81         0  30,864   30,864    0.51   0.00   15,634        0  31,152  31,152    0.00   15,780     146  0.31        45
81-82         0  30,864   30,864    0.47   0.00   14,467        0  31,152  31,152    0.00   14,602     135  0.30        40
82-83         0  30,864   30,864    0.43   0.00   13,277        0  31,152  31,152    0.00   13,401     124  0.28        35
83-84         0  30,864   30,864    0.39   0.00   12,076        0  31,152  31,152    0.00   12,188     113  0.27        30
84-85         0  30,864   30,864    0.35   0.00   10,874        0  31,152  31,152    0.00   10,976     101  0.26        26
85-86         0  30,864   30,864    0.31   0.00    9,686        0  31,152  31,152    0.00    9,777      90  0.24        22
86-87         0  30,864   30,864    0.28   0.00    8,525        0  31,152  31,152    0.00    8,605      80  0.23        18
87-88         0  30,864   30,864    0.24   0.00    7,405        0  31,152  31,152    0.00    7,474      69  0.22        15
88-89         0  30,864   30,864    0.21   0.00    6,341        0  31,152  31,152    0.00    6,400      59  0.21        12
89-90         0  30,864   30,864    0.17   0.00    5,344        0  31,152  31,152    0.00    5,394      50  0.20        10
90-91         0  30,864   30,864    0.14   0.00    4,428        0  31,152  31,152    0.00    4,469      41  0.19         8
91-92         0  30,864   30,864    0.12   0.00    3,600        0  31,152  31,152    0.00    3,634      34  0.18         6
92-93         0  30,864   30,864    0.09   0.00    2,868        0  31,152  31,152    0.00    2,895      27  0.17         5
93-94         0  30,864   30,864    0.07   0.00    2,235        0  31,152  31,152    0.00    2,256      21  0.16         3
94-95         0  30,864   30,864    0.06   0.00    1,700        0  31,152  31,152    0.00    1,716      16  0.16         2
95-96         0  30,864   30,864    0.04   0.00    1,260        0  31,152  31,152    0.00    1,272      12  0.15         2
96-97         0  30,864   30,864    0.03   0.00      908        0  31,152  31,152    0.00      916       8  0.14         1
97-98         0  30,864   30,864    0.02   0.00      634        0  31,152  31,152    0.00      640       6  0.14         1
98-99         0  30,864   30,864    0.01   0.00      429        0  31,152  31,152    0.00      433       4  0.13         1
99-100        0  30,864   30,864    0.01   0.00      280        0  31,152  31,152    0.00      283       3  0.12         0
100 +         0  30,864   30,864    0.00   0.00        0        0  31,152  31,152    0.00        0       0  0.12         0

Total discounted lost personal income: 61,104
Reference Guide on Estimation of Economic Damages
security benefits based on a longer period and higher level of contributions to
social security. The parties agree that the plaintiff would have retired at age 70
absent the accident.
3. Is there disagreement about how earnings should be discounted to
present value?
Because personal lost earnings damages may accrue over the remainder of a plaintiff’s
working life, the issues of predicting future inflation and discounting earnings to
present value are likely to generate quantitatively important disagreements. As we
noted in Section VI.D, projections of future compensation can be calculated in
constant dollars or escalated terms. In the first case, the interest rate used to discount
future constant-dollar losses should be a real interest rate—the difference between
the ordinary interest rate and the projected future rate of inflation. All else being the
same, the two approaches will give identical calculations of damages.
In our example, both the plaintiff and defendant use constant dollars and use
a real rate of interest for discounting. However, the plaintiff calculates the real
rate of interest as 1%, relying on the implied rate from inflation-adjusted Treasury
bonds. In contrast, the defendant uses a discount rate of 5% based on the historic
real rate of return to investments in general.
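The arithmetic behind the two positions can be sketched in a few lines. The 1% and 5% real rates come from the example above; the loss amount and horizon below are hypothetical, chosen only to show how strongly the rate choice affects present value.

```python
def present_value(loss, years_from_trial, real_rate):
    """Discount a constant-dollar future loss to present value at a real rate."""
    return loss / (1 + real_rate) ** years_from_trial

loss = 75_000  # hypothetical constant-dollar lost earnings in one future year

# Plaintiff: 1% real rate, implied by inflation-adjusted Treasury bonds.
pv_plaintiff = present_value(loss, 20, 0.01)
# Defendant: 5% real rate, based on historic returns to investments in general.
pv_defendant = present_value(loss, 20, 0.05)

# The lower rate preserves far more of the loss's present value.
assert pv_plaintiff > pv_defendant
```

For a loss 20 years out, the 1% rate leaves roughly four-fifths of the nominal loss intact, while the 5% rate cuts it to well under half, which is why this single assumption can move damages by a large factor.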
4. Is there disagreement about subsequent unexpected events?
Disagreements about subsequent unexpected events are likely in cases involving
personal earnings, as discussed in general in Section VIII.E. For example, the
plaintiff may have suffered a debilitating illness that would have caused him to
quit his job a year later even if the wrongful act had not occurred. Alternatively,
the plaintiff may have been laid off as a result of employer hardship a year later
notwithstanding the wrongful act. In these examples, the defendant may argue
that damages should be limited to one year. The plaintiff might respond that
subsequent events were unexpected at the time of the termination and therefore
should be excluded from consideration: damages should be calculated without
regard to these events.
In our example, the defendant points out that the unemployment rate for
construction workers was 50% beginning six months after the accident. The
plaintiff argues that the unemployment rate for construction workers at the time
of the accident was only 19% and therefore the revised unemployment rate after
the accident is irrelevant.
5. Is there disagreement about retirement and mortality?
Closely related to the issue of unexpected events is how future damages should
reflect the probability that the plaintiff will die or decide to retire. Sometimes an
expert will assume a work-life expectancy and terminate damages at the end of
that period. Tables of work-life expectancy incorporate the probability of both
retirement and death. Another approach is to multiply each year’s lost earnings
by the probability that the plaintiff will be alive and working in that year. That
probability declines gradually with age and can be inferred from data on labor
force participation and mortality by age.
In our example, the plaintiff projects that his life expectancy was reduced by
3.5 years and uses revised survival rates as a result. The defendant disagrees, arguing that the survival tables relied upon by the plaintiff are unreliable. However,
both agree that the plaintiff would have worked until age 70 absent the accident
because the unemployment rate for CPAs is essentially zero in the area where the
plaintiff lives.
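The probability-weighting approach described above can be sketched as follows. The function names and the sample figures are ours, for illustration only; they are not the values in Tables 4 and 5.

```python
def expected_lost_earnings(lost_earnings, p_survive, p_work):
    """Weight one year's lost earnings by the probability that the
    plaintiff would have been alive and working in that year."""
    return lost_earnings * p_survive * p_work

def damages(losses, p_survive, p_work, rate):
    """Sum of probability-weighted losses, discounted to present value.
    Lists hold one entry per future year, starting one year from trial."""
    return sum(
        expected_lost_earnings(l, ps, pw) / (1 + rate) ** t
        for t, (l, ps, pw) in enumerate(zip(losses, p_survive, p_work), start=1)
    )

# Hypothetical year: a $75,000 loss, an 82% chance of surviving to that
# year, and (but for the harm) certainty of still being in the labor force.
weighted = expected_lost_earnings(75_000, 0.82, 1.00)
```

The survival probabilities come from mortality tables and the work probabilities from labor force participation data, both declining gradually with age, as the text describes.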
6. Is there a dispute about mitigation?
Actual earnings before trial, although known, may be subject to dispute if the
defendant argues that the plaintiff took too long to find a job or the job taken
was not sufficiently remunerative. Even more problematic may be the situation
in which the plaintiff continues to be unemployed. Parties disputing the length of
job search frequently offer testimony from job placement experts. Testimony from
a psychologist also may be offered if the plaintiff has suffered emotional trauma as
a result of the defendant’s actions. Recovery from temporarily disabling injuries
may be the subject of testimony by experts in vocational rehabilitation.
In our example, the plaintiff argues that he is disabled and unable to work
for the remainder of his life. The defendant argues that the plaintiff could have
finished his education and could then have worked as a CPA. Both provide
testimony from experts in vocational rehabilitation to support their conclusions.
7. Is there disagreement about how the plaintiff’s career path should be
projected?
The issues that arise in projecting but-for and actual earnings after trial are similar
to the issues that arise in measuring damages before trial. In addition, the parties
are likely to disagree about the plaintiff’s future increases in compensation. A damages analysis should be internally consistent. For example, the compensation paths
for both but-for and actual earnings should be based on consistent assumptions
about general economic conditions, about conditions in the local labor market
for the plaintiff’s type of work, about the age-earnings profile for the career
path, and particularly about the plaintiff’s likely increased skills and earning
capacity. If the analysis projects slow earnings growth absent the harm, it
probably should also project a correspondingly less successful career in mitigation.
In our example, the plaintiff argues that he would have worked as a CPA but
for the accident but that he is too injured to complete his education and work
as a CPA. The defendant argues that working as a CPA is a viable option for the
plaintiff. Although there is a disagreement about how much the plaintiff would
have earned as a CPA, the plaintiff’s argument that he is too disabled to work
accounts for most of the damages. As shown in Tables 4 and 5, the plaintiff is
seeking just over $1 million while the defendant calculates that damages are only
$61,000. Differences of this magnitude between quantifications of lost personal
earnings by plaintiffs and defendants are common. Our example illustrates some
of the main reasons for the large differences.
B. Lost Profits for a Business
Claims for lost profits for a business generally arise from a lost stream of revenue.
However, lost profits can also arise from increased costs. As an example, a breach
of a supply contract may increase the victim firm’s costs. An expert will likely
be most involved in cases in which the plaintiff seeks recovery for expectation,
reliance, or restitution damages. Most damages studies follow the framework of
Figure 1, with lost profits serving as the earnings measure. For illustration, the following is an
example of a business lost profits case:
Plaintiff HSM makes cell phone handsets. Defendant TPC is a cell phone
carrier. By denying HSM technical information and by informing HSM’s potential
customers that HSM’s handsets are incompatible with TPC’s network, TPC has
imposed economic losses on HSM. TPC asserts that HSM has failed to mitigate its
losses and overstates its lost revenues. Trial is set for the end of 2010. The respective damages analyses are shown in Tables 6 and 7 and discussed below.
Table 6. HSM’s Damages Analysis (Dollars in Millions)

Year    (2) But-For  (3) But-For  (4) But-For  (5) Actual  (6) Lost   (7) Discount  (8)
        Revenue      Costs        Earnings     Earnings    Earnings   Factor        Damages
2008    $561         $374         $187         $34         $153       1.21          $185
2009     600          400          200          56          144       1.14           164
2010     639          426          213          45          168       1.07           180
2011     681          454          227          87          140       1.00           140
2012     726          484          242          96          147       0.96           141
2013     777          518          259         105          153       0.92           142
2014     828          552          276         116          160       0.89           142
2015     882          588          294         127          167       0.85           143
Total                                                                              $1,236
Table 7. TPC’s Damages Analysis (Dollars in Millions)

Year    (2) But-For  (3) But-For  (4) But-For  (5) Mitigated  (6) Lost   (7) Discount  (8)
        Revenue      Costs        Earnings     Earnings       Earnings   Factor        Damages
2008    $404         $303         $101         $79            $22        1.21          $27
2009     432          324          108          85             23        1.14           26
2010     460          345          115          81             34        1.07           36
2011     492          369          123          98             25        1.00           25
2012     524          393          131         108             23        0.87           20
2013     560          420          140         119             21        0.76           16
2014     596          447          149         130             19        0.66           12
2015     636          477          159         143             16        0.57            9
Total                                                                                 $171
1. Is there a dispute about projected revenues?
Projecting lost revenues can be straightforward if the disrupted revenue stream
occurs immediately following the bad act and the firm recovers relatively quickly.
More complex cases can arise if the effect is delayed or the recovery is slow,
intermittent, or nonexistent.
In the example above, the plaintiff’s expert would argue that revenues would
have been higher absent TPC’s conduct and thus projects revenues based on the
revenue growth prior to the bad act, which reflects increasing sales and increasing prices. The projected revenue for the plaintiff is shown in Table 6, column 2.
The defendant’s expert would argue that HSM’s projections use a growth factor
that improperly includes the period when HSM initially entered the market and,
therefore, projects HSM’s sales using the growth rate for the previous 2 years and
assumes that prices would have remained unchanged. TPC’s projection of HSM’s
revenue is shown in Table 7, column 2.
Some additional examples of complexities can be found in antitrust cases. For
example, assume a company is disadvantaged because a rival has constructed
barriers to entry by entering into contracts that require customers to use its add-on products, such as ink for a printer. In such cases, the plaintiff’s expert may
assert that the only suppliers in the but-for market for printer ink would consist
of the defendant and the plaintiff, and that the profit would reflect pricing for a
duopoly. The defendant may respond that there would be five firms in addition
to the plaintiff who would have entered the market as suppliers, and that therefore
the pricing would be close to that of a highly competitive market.
Other complexities may arise in intellectual property cases where the revenue stream is reduced because the intellectual property for a product has been
misappropriated. In these cases, the expert may need to identify how much of the
plaintiff’s revenue stream should be attributed to the misappropriated intellectual
property and how much should be attributed to other aspects of the product. For
example, our printer manufacturer may believe that its printers are popular because
of its proprietary method to increase the printing speed. However, the defendant
may argue that the increase in printing speed has little to do with the popularity of
the plaintiff’s printer but rather the sharpness of the printing. Or the defendant may
argue that at the time of the bad act the plaintiff’s product was the fastest printer,
but 2 years later, a noninfringing printer is faster and the plaintiff’s sales therefore
would have dropped to zero.
The projection of the revenue stream is likely to be the most controversial
part of any damages estimate in a business case because it requires so many assumptions on the part of both experts with respect to the other players in the market
and customer demand.
2. Are the parties disputing the calculation of marginal costs?
Another area of dispute that can arise is the measurement of marginal costs.
Generally, if the business is an ongoing concern, then the costs can be determined from existing data. Often this is done either by directly modeling the costs
needed for the additional revenues or using regression analysis that captures how
costs have varied with revenues. The relevant concept is the measure of costs that
would have been expended to generate the lost revenues.
In our example, plaintiff’s expert would project that the additional costs
would reflect the marginal cost ratio that was derived from a regression model of
costs against revenues. The defendant’s expert might use the average ratio of costs
to revenues, arguing that this would be more appropriate because additional workers and equipment would have been needed to generate the increased revenues.
The projected costs for both parties are shown in column 3 of Tables 6 and 7.
Costs are often expressed as a percentage of revenues, which simplifies the
projection of costs. However, this approach can be problematic if there is reason
to believe that the profit rate will change over time. The rate may change because
the change in revenues is so large that an increasing percentage of fixed costs
must be included, the mix of costs will change over time, or
the components of cost will grow at disparate rates. If computing costs as a percentage of revenues is not viable, then the projected costs should reflect the same
assumptions about growth and inflation that were used in the revenue projection.
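The regression approach described above can be sketched in a few lines: the marginal-cost ratio is recovered as the slope of an ordinary least-squares fit of costs against revenues. The revenue and cost history below is hypothetical, used only to show the mechanics.

```python
def marginal_cost_ratio(revenues, costs):
    """Slope from an ordinary least-squares fit of costs on revenues:
    the estimated additional cost per additional dollar of revenue."""
    n = len(revenues)
    mean_r = sum(revenues) / n
    mean_c = sum(costs) / n
    # Covariance of (revenue, cost) divided by the variance of revenue.
    cov = sum((r - mean_r) * (c - mean_c) for r, c in zip(revenues, costs))
    var = sum((r - mean_r) ** 2 for r in revenues)
    return cov / var

# Hypothetical history in which costs behave as: fixed 50 + 0.6 per dollar
# of revenue, so the fitted marginal-cost ratio should be 0.6.
revs = [400, 450, 500, 550, 600]
costs = [290, 320, 350, 380, 410]
ratio = marginal_cost_ratio(revs, costs)  # ≈ 0.6
```

The competing average-cost approach in the example would instead divide total costs by total revenues, which sweeps fixed costs into the ratio; the difference between the two is exactly what the experts dispute.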
3. Is there a dispute about mitigation?
Defendant’s expert may argue that the plaintiff’s actual profits are understated
because the plaintiff failed to mitigate its losses. For example, the plaintiff’s losses
may have been minimized by closure of its business. Or the plaintiff perhaps
should have invested in alternative facilities while its business was interrupted
because it could not use its existing facilities.
In our example, the defendant’s expert would argue that HSM could have
mitigated its losses by obtaining the technical information it needed from other
sources and could have counteracted TPC’s disparagement with vigorous marketing. HSM’s actual earnings are shown in column 5 of Table 6, and TPC’s calculation of HSM’s earnings with mitigation is shown in column 5 of Table 7.
4. Is there disagreement about how profits should be discounted to present
value?
Generally, interest for lost earnings prior to trial is computed at a statutory rate, often
not compounded. In our example, trial is at the end of year 2010 and the statutory
rate is assumed to be 7% simple (i.e., without compounding). If the prejudgment
rate is not set by law, economists favor the use of the cost of borrowing for the
defendant, because damages are a forced loan to the defendant by the plaintiff.90
The rate used to discount future losses back to the time of the trial is not
set by law and substantial disputes will arise about the discount rate. Generally,
economists believe that the discount rate should equal the after-tax cost of capital
for the plaintiff.
In our example, HSM argues that the proper discount rate should be based
on a 4%, after-tax interest rate, obtained by applying HSM’s corporate tax rate
to TPC’s medium-term borrowing rate. TPC, however, believes that the proper
discount rate should be HSM’s cost of capital, reflecting HSM’s cost of equity and
cost of debt. Column 7 of Tables 6 and 7 shows the respective discount factors after
trial. The resulting damages are shown in column 8 of Tables 6 and 7.
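The pattern of the discount factors just described can be sketched as follows. This is our illustrative reading of the example, not the experts’ own calculation: losses before the end-of-2010 trial carry 7% simple interest, and losses afterward are discounted at each side’s chosen rate (4% after tax for HSM; TPC’s printed factors are consistent with a rate of roughly 15%, which is our inference rather than a figure stated in the text).

```python
def discount_factor(year, base_year=2011, prejudgment_rate=0.07,
                    discount_rate=0.04):
    """Factor applied to a year's lost earnings: simple prejudgment
    interest for years before trial (trial at end of 2010, so 2011 is
    the base year), compound discounting for years after."""
    if year < base_year:
        return 1 + prejudgment_rate * (base_year - year)
    return 1 / (1 + discount_rate) ** (year - base_year)

# HSM's factors (Table 6, column 7), using a 4% post-trial rate:
hsm = [round(discount_factor(y), 2) for y in range(2008, 2016)]
# → [1.21, 1.14, 1.07, 1.0, 0.96, 0.92, 0.89, 0.85]
```

Swapping in `discount_rate=0.15` reproduces TPC’s steeper post-trial factors (0.87, 0.76, 0.66, 0.57), showing how the cost-of-capital dispute alone shrinks the later years of damages.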
5. Is there disagreement about subsequent unexpected events?
Disagreements about subsequent unexpected events are likely in cases involving
lost profits. For example, the market for the plaintiff’s goods may have suffered
a substantial contraction a year after the bad act, with plaintiff likely to be forced
into bankruptcy even if the wrongful act had not occurred. Or the costs of the
plaintiff may have increased dramatically a year later because of shortages that
would have necessitated that the plaintiff retool its business even if the wrongful
act had not occurred. The plaintiff might respond that subsequent events were
unexpected at the time of the bad act and so should be excluded from consideration: damages should be calculated without regard to these events. The defendant would
respond that damages should be limited to 1 year because the unexpected events
would have forced the closure of the plaintiff’s business. This topic is discussed
more fully in Section VIII.E.
90. See James M. Patell et al., Accumulating Damages in Litigation: The Roles of Uncertainty and
Interest Rates, 11 J. Legal Stud. 341–64 (1982).
Glossary of Terms
appraisal. A method of determining the value of the plaintiff’s claim on an earnings stream by reference to the market values of comparable earnings streams.
For example, if the plaintiff has been deprived of the use of a piece of property, the appraised value of the property might be used to determine damages.
avoided cost. Cost that the plaintiff did not incur as a result of the harmful act.
Usually it is the cost that a business would have incurred in order to make the
higher level of sales the business would have enjoyed but for the harmful act.
but-for analysis. Restatement of the plaintiff’s economic situation but for the
defendant’s harmful act. Damages are generally measured as but-for value less
actual value received by the plaintiff.
capitalization factor. Factor used to convert a stream of revenue or profit into
its capital or property value. A capitalization factor of 10 for profit means that
a firm with $1 million in annual profit is worth $10 million.
compound interest. Interest calculation giving effect to interest earned
on past interest. As a result of compound interest at rate r, it takes
(1 + r)(1 + r) = 1 + 2r + r² dollars to make up for a lost dollar of earnings
2 years earlier.
constant dollars. Dollars adjusted for inflation. When calculations are done in
constant 1999 dollars, it means that future dollar amounts are reduced in
proportion to increases in the cost of living expected to occur after 1999.
discount rate. Rate of interest used to discount future losses.
discounting. Calculation of today’s equivalent to a future dollar to reflect the
time value of money. If the interest rate is r, the discount applicable to 1 year
in the future is:
discount rate = 1/(1 + r).
The discount for 2 years is this amount squared; for 3 years, it is this amount
to the third power, and so on for longer periods. The result of the calculation
is to give effect to compound interest.
earnings. Economic value received by the plaintiff. Earnings could be salary and
benefits from a job, profit from a business, royalties from licensing intellectual property, or the proceeds from a one-time or recurring sale of property.
Earnings are measured net of costs. Thus, lost earnings are lost receipts less
costs avoided.
escalation. Consideration of future inflation in projecting earnings or other dollar
flows. The alternative is to make projections in constant dollars.
expectation damages. Damages measured on the principle that the plaintiff
is entitled to the benefit of the bargain originally made with the defendant.
fixed cost. Cost that does not change with a change in the amount of products
or services sold.
mitigation. Action taken by the plaintiff to minimize the economic effect of the
harmful act. Also often refers to the actual level of earnings achieved by
the plaintiff after the harmful act.
nominal interest rate. Interest rate quoted in ordinary dollars, without adjustment for inflation. Interest rates quoted in markets and reported in the financial press are always nominal interest rates.
prejudgment interest. Interest on losses occurring before trial.
present value. Value today of money due in the past (with interest) or in the
future (with discounting).
price erosion. Effect of the harmful act on the price charged by the plaintiff.
When the harmful act is wrongful competition, as in intellectual property
infringement, price erosion is one of the ways that the plaintiff’s earnings
have been harmed.
real interest rate. Interest rate adjusted for inflation. The real interest rate is the
nominal interest rate less the annual rate of inflation.
regression analysis. Statistical technique for inferring stable relationships among
quantities. For example, regression analysis may be used to determine how
costs typically vary when sales rise or fall.
reliance damages. Damages designed to reimburse a party for expenses incurred
from reliance upon the promises of the other party.
restitution damages. Damages measured on the principle of restoring the economic equivalent of lost property or value.
variable cost. Component of a business’s cost that would have been higher if the
business had enjoyed higher sales. See also avoided cost.
Reference Guide on
Exposure Science
JOSEPH V. RODRICKS
Joseph V. Rodricks, Ph.D., is Principal at Environ, Arlington, Virginia.
CONTENTS
I. Introduction, 505
II. Exposure Science, 506
A. What Do Exposure Scientists Do? 507
B. Who Qualifies as an Expert in Exposure Assessment? 508
C. Organization of the Reference Guide, 508
III. Contexts for the Application of Exposure Science, 509
A. Consumer Products, 509
B. Environmental and Product Contaminants, 510
C. Chemicals in Workplace Environments, 511
D. Claims of Disease Causation, 511
IV. Chemicals, 513
A. Organic and Inorganic Chemicals, 513
B. Industrial Chemistry, 514
V. Human Exposures to Chemicals, 516
A. Exposure Sources—An Overview, 516
B. The Goal of Exposure Assessment, 518
C. Pathways, 519
D. Exposure Routes, 522
E. Summary of the Descriptive Process, 524
VI. Quantification of Exposure, 525
A. Dose, 525
B. Doses from Indirect Exposure Pathways, 527
C. Direct Measurement: Analytical Science, 528
D. Environmental Models, 530
E. Integrated Exposure/Dose Assessment, 533
VII. Into the Body, 534
A. Body Burdens, 534
B. Monitoring the Body (Biomonitoring), 535
VIII. Evaluating the Scientific Quality of an Exposure Assessment, 537
IX. Qualifications of Exposure Scientists, 539
Appendix A: Presentation of Data—Concentration Units, 541
Appendix B: Hazardous Waste Site Exposure Assessment, 543
Glossary of Terms, 545
References on Exposure, 548
I. Introduction
The sciences of epidemiology1 and toxicology2 are devoted to understanding the
hazardous properties (the toxicity) of chemical substances. Moreover, epidemiological and toxicological studies provide information on how the seriousness and
rate of occurrence of the hazard in a population (its risk) change as exposure to
a particular chemical changes. To evaluate whether individuals or populations
exposed to a chemical are at risk of harm,3 or have actually been harmed, the
information that arises from epidemiological and toxicological studies is needed, as
is the information on the exposures incurred by those individuals or populations.
Epidemiologists and toxicologists can tell us, for example, how the magnitude
of risk of benzene-induced leukemia changes as exposure to benzene changes.
Thus, if there is a need to understand the magnitude of the leukemia risk in
populations residing near a petroleum refinery, it becomes necessary to understand
the magnitude of the exposure of those populations to benzene. Likewise, if an
individual with leukemia claims that benzene exposure was the cause, it becomes
necessary to evaluate the history of that individual’s exposure to benzene.4
Understanding exposure is essential to understanding whether the toxic properties of chemicals have been or will be expressed. Thus, claims of toxic tort or
product liability generally require expert testimony not only in medicine and in
the sciences of epidemiology and toxicology, but also testimony concerning the
nature and magnitude of the exposures incurred by those alleging harm. Similarly,
litigation involving the regulation of chemicals said to pose excessive risks to
health also requires litigants to present evidence regarding exposure. The need
to understand exposure is a central topic in the reference guides on epidemiology
and toxicology in this manual. This reference guide provides a view of how the
magnitude of exposure comes to be understood.5
1. See Michael D. Green et al., Reference Guide on Epidemiology, in this manual.
2. See Bernard D. Goldstein & Mary Sue Henifin, Reference Guide on Toxicology, in this
manual.
3. See, e.g., Rhodes v. E.I. du Pont de Nemours & Co., 253 F.R.D. 365 (S.D. W. Va. 2008)
(suit for medical monitoring costs because exposure to perfluoroctanoic acid in drinking water allegedly caused an increased risk of developing certain diseases in the future); In re Welding Fume Prods.
Liab. Litig., 245 F.R.D. 279 (N.D. Ohio 2007) (exposure to manganese fumes allegedly increased the
risk of later developing brain damage).
4. See, e.g., Lambert v. B.P. Products North America, Inc., 2006 WL 924988 (S.D. Ill. 2006),
2006 U.S. Dist. LEXIS 16756 (plaintiff diagnosed with chronic lymphocytic leukemia was exposed
to jet fuel allegedly containing excessive levels of benzene).
5. This chapter focuses on measuring exposure to toxic substances as a specific developing area
of scientific investigation. This topic is distinct from the legal concept of “exposure,” which is an element of a claim in toxic tort litigation. The legal concept of exposure relies on the evolving scientific
understanding of the manner and extent to which individuals come into contact with toxic substances.
However, the legal concept also reflects substantive legal principles and interpretations that vary across
jurisdictions. Compare Parker v. Mobil Oil Corp., 793 N.Y.S.2d 434 (2005) (requiring findings of
specific levels of exposure to benzene by plaintiff who claimed that his leukemia was the result of his
505
Copyright © National Academy of Sciences. All rights reserved.
Reference Manual on Scientific Evidence: Third Edition
Reference Manual on Scientific Evidence
Not all questions concerning human exposures to potentially harmful substances require expert testimony. In those circumstances in which the magnitude of exposure is not relevant, or is clearly evident (e.g., because a plaintiff
was observed to take the prescribed amount of a prescription medicine), expert
testimony is not indicated. But if the magnitude of exposure is an important component of the needed evidence, and if that magnitude is not a simple question of
fact, then expert testimony will be important.
II. Exposure Science
Exposure science is not yet a distinct academic discipline. Although some schools
of public health may offer courses in exposure assessment, there are no academic
degrees offered in exposure science. When regulatory and public health agencies
began in the 1970s to examine toxicological risks in a quantitative way, it became
apparent that quantitative exposure assessments would become necessary. Initially,
exposure assessment was typically practiced by toxicologists and epidemiologists.
As the breadth and complexity of the subject began to be recognized, it became
apparent that scientists and engineers with a better grasp of the properties of
chemicals (which affect how they behave and undergo change in different environments), and of the methods available to identify and measure chemicals in
products and in the environment, would be necessary to provide scientifically
defensible assessments. As the importance of exposure assessment grew and began
to present significant scientific challenges, its practice drew increasing numbers
of scientists and engineers, and some began to refer to their work as exposure
science. Not surprisingly, most of the early expositions of exposure assessment
came from government agencies that recognized the need to develop and refine
the practice to meet their risk assessment needs. Indeed, various documents and
reports used by the U.S. Environmental Protection Agency (EPA) remain essential sources for the practice of exposure assessment.6 Academics and practitioners
have written chapters on exposure science for major multiauthor reference works
17-year occupational exposure to gasoline containing benzene) with Westberry v. Gislaved Gummi AB,
178 F.3d 257 (4th Cir. 1999) (evidence of specific exposure level not required where evidence of talc
in the workplace indicated that the worker was covered in talc and left footprints on the floor) and
Allen v. Martin Surfacing, 263 F.R.D. 47 (D. Mass. 2009) (admissible expert testimony may be based
on symptom accounts by those exposed rather than direct measurements of solvent concentrations).
This chapter takes no position regarding exposure as a substantive legal concept.
6. U.S. Environmental Protection Agency, Exposure Assessment Tools and Models (2009), available at http://www.epa.gov/oppt/exposure/ (last visited June 6, 2011); National Exposure Research
Laboratory, U.S. Environmental Protection Agency, Scientific and Ethical Approaches for Observational Exposure Studies, Doc. No. EPA 600/R-08/062 (2008), available at http://www.epa.gov/
nerl/sots/index.html (last visited July 14, 2010); U.S. Environmental Protection Agency, Exposure
Factors Handbook (1997).
on toxicology,7 but most of the work in this area is still found in the primary
reference works.
Although exposure science is not yet a distinct academic discipline, in this
reference guide the phrase is retained and used to refer to the work of scientists
and engineers (“exposure scientists”) working in one or more aspects of exposure
assessment.
A. What Do Exposure Scientists Do?
Human beings are exposed to natural and industrial chemicals from conception to
death, and because almost all chemicals can become harmful if exposures exceed
certain levels, understanding the magnitude and duration of exposures to chemicals is critical to understanding their health impacts. Exposure science is the study
of how people can come into contact with (are exposed to)8 chemicals that may be
present in various environmental media (air, water, food, soil, consumer products
of all types) and of the amounts of those chemicals that enter the body as a result
of these contacts.9 Exposure scientists also study whether and how those amounts
change over time. The goal of exposure science is to quantify those amounts and
time periods. The quantitative expression of those amounts is referred to as dose.
Ultimately, the dose incurred by populations or individuals is the measure needed
by health experts to quantify risk of toxicity. Exposure science does not typically
deal with the health consequences of those exposures.
The dose entering the body (through inhalation or ingestion, through the
skin, and through other routes) is often referred to as the “exposure dose,” to
distinguish it from the dose that enters the bloodstream and reaches various organs
of the body. The latter is typically only a fraction of the exposure dose and is identified through studies that can trace the fate of a chemical after it enters the body.
The term “dose” as used in this reference guide is synonymous with “exposure
dose,” and doses reaching blood or various organs within the body are referred
to as “target site doses” or “systemic doses.”
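The dose concepts just described can be illustrated with a short sketch. This is illustrative only: the average daily dose (ADD) equation shown is the generic form used in regulatory exposure assessment, and every numeric input (concentration, intake rate, absorption fraction) is a hypothetical example rather than a value from this guide.

```python
# Illustrative sketch of the dose concepts discussed above. The average
# daily dose (ADD) equation is the generic form used in regulatory
# exposure assessment; all numeric inputs below are hypothetical.

def average_daily_dose(concentration, intake_rate, exposure_frequency,
                       exposure_duration, body_weight, averaging_time):
    """ADD (mg/kg-day) = (C * IR * EF * ED) / (BW * AT).

    concentration      -- chemical concentration in the medium (e.g., mg/L)
    intake_rate        -- intake of the medium (e.g., L/day of water)
    exposure_frequency -- days of exposure per year
    exposure_duration  -- years of exposure
    body_weight        -- kg
    averaging_time     -- days over which the dose is averaged
    """
    return (concentration * intake_rate * exposure_frequency *
            exposure_duration) / (body_weight * averaging_time)

def systemic_dose(exposure_dose, absorption_fraction):
    """Target-site (systemic) dose: the fraction of the exposure dose
    that actually reaches the bloodstream."""
    return exposure_dose * absorption_fraction

# Hypothetical scenario: 0.005 mg/L of a contaminant in drinking water,
# 2 L/day consumed 350 days/year for 10 years by a 70-kg adult, averaged
# over the exposure period (10 * 365 days).
add = average_daily_dose(0.005, 2, 350, 10, 70, 10 * 365)
sys_dose = systemic_dose(add, 0.5)  # assumed 50% absorption
print(f"exposure dose {add:.2e} mg/kg-day; systemic dose {sys_dose:.2e} mg/kg-day")
```

Note that the systemic dose is computed here as a simple fixed fraction of the exposure dose; in practice that fraction is identified through the studies, mentioned above, that trace the fate of a chemical after it enters the body.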
Exposure assessments can be directed at past, present, or even future exposures and can be narrowly focused (one chemical, one environmental medium, one population group) or very broad in scope (many chemicals, several environmental media, several different population groups).
7. P.J. Lioy, Exposure Analysis and Its Assessment, in Comprehensive Toxicology (I.G. Sipes et al. eds., 1997); D.J. Paustenbach & A. Madl, The Practice of Exposure Assessment, in Principles and Methods of Toxicology (A. Wallace Hayes ed., 5th ed. 2008).
8. See, e.g., Kitzmiller v. Jefferson, 2006 WL 2473399, 2006 U.S. Dist. LEXIS 61109 (N.D. W. Va. 2006) (defendants offered expert’s testimony that plaintiff’s use of liquid cleaning agents containing benzalkonium chloride failed to show that she was exposed to benzalkonium chloride in the air); Hawkins v. Nicholson, 2006 WL 954654, 2006 U.S. App. Vet. Claims LEXIS 197, 21 Vet. App. 64 (Vet. App. 2006) (noting that “a veteran who served on active duty in Vietnam between January 9, 1962, and May 7, 1975, is entitled to a rebuttable presumption of exposure to Agent Orange”).
9. The term “enter the body” also includes entering the external surface of the body.
This reference guide explores
the various contexts in which exposure assessments are conducted and how their
scope is determined.
B. Who Qualifies as an Expert in Exposure Assessment?
As noted, it is unlikely that any expert can present evidence of having an academic
degree in exposure science. An expert’s qualifications thus have to be tested by
examining the expert’s experience,10 including his or her knowledge of and reliance on authoritative reference works.11 Experts generally will have strong academic credentials in environmental science and engineering, chemistry, chemical
engineering, statistics and mathematical model building, industrial hygiene, or
other hard sciences related to the behavior of chemicals in the environment.
To the extent exposure assessments deal with the amounts and behaviors of
chemicals in the body, individuals can qualify as experts if they can offer academic
credentials or substantial experience in toxicology and in the measurement of
chemicals in blood or in biological tissues. Certainly, toxicology, epidemiology,
or medical credentials are needed if experts are to offer testimony on the health
consequences associated with particular exposures.
Not all exposure assessments are complex; indeed, some, as will be seen, are
relatively simple. Most toxicologists and epidemiologists have considerable training
and experience assessing dose from medicines and other consumer products—and
even from food. But if exposures result from chemicals moving from sources
through one or more environmental media, it is unlikely that toxicologists or epidemiologists will be able to offer appropriate qualifications, because modeling or other
forms of indirect measurement are needed to assess exposures. Further details on the
qualifications of experts are offered in the closing sections of the reference guide.
C. Organization of the Reference Guide
The reference guide begins with a discussion of the various contexts in which
exposure science is applied (Section III). Following that discussion is a section on
chemicals and their various sources. Three broad categories of chemicals are discussed: (1) those that are produced for specific uses; (2) those that are byproducts
of chemical production, use, and disposal and that enter the environment as
contaminants; and (3) those that are created and released by the combustion of
all types of organic substances (including tobacco) and of fuels used for energy
10. See, e.g., Best v. Lowe’s Home Ctrs, 2009 WL 3488367, 2009 U.S. Dist. LEXIS 97700
(E.D. Tenn. 2009) (a medical doctor with extensive industrial toxicology and product safety experience
opined that the plaintiff could not have been exposed to the chemical at issue as alleged).
11. Most of the EPA’s guidance documents on exposure assessment have been issued after
extensive peer review and thus are considered authoritative.
production. Each of these categories can be thought of as a source for chemical
exposure. Next, there is a discussion of the pathways chemicals follow from their
sources to the environmental media with which humans are or could be in contact.
Such contact is said to create an exposure. Chemicals can then move from these
media of human contact and enter the body by different routes of exposure—by
ingestion (in food or water, for example), by inhalation, or by direct skin contact
(the dermal route). The section on exposure routes includes a discussion of how
chemicals contact and enter the body and of how they behave within it. This last
topic comprises the interface between exposure science and the sciences of epidemiology and toxicology. Traditionally, exposure scientists have described their
work as ending with the description of dose to the body (exposure dose). As will
be seen, some practitioners are focusing on the amounts of chemicals present in
blood or various tissues of the body as a result of exposure. Unlike the toxicologist, the exposure scientist is not qualified to evaluate the health consequences of
these so-called biomarkers of exposure.
This reference guide first presents all of the above material in nonquantitative
terms—to describe and illustrate the various processes through which human
exposures to chemicals are created (Sections III–V). The guide then focuses on
the quantitative aspects (Sections VI and VII). Without some quantitative understanding of the magnitude of exposure, and of the duration of time over which
exposure occurs, it becomes difficult to reach meaningful conclusions about health
risks. Thus, the remaining sections are devoted to a critical quantitative concept
in exposure science—that of dose—and are intended to integrate all of the earlier
descriptive material. The reference guide ends with a review of the qualifications
of exposure science experts and how they can be assessed.
III. Contexts for the Application of
Exposure Science
There are perhaps four major contexts in which exposure science is applied:
(1) consumer products, (2) contaminants in the environment and in consumer
products, (3) chemicals in the workplace, and (4) disease causation.
A. Consumer Products
Many intentional uses of chemical substances lead to human exposures, and the
health risks that are associated with those exposures need to be understood.12
In some cases, laws and regulations require that health risks be understood in
12. See, e.g., In re Stand ’n Seal, 623 F. Supp. 2d 1355 (N.D. Ga. 2009) (consumer use of spray-on product allegedly resulted in inhalation exposure to toxic substances, causing respiratory injuries).
advance of the marketing of such chemicals or products containing them. Thus,
intentionally introduced food additives, pesticides, and certain industrial chemicals
must have regulatory approvals before they are marketed, and manufacturers of
such substances are required to demonstrate the absence of significant health risks
(i.e., their safety) based on toxicology studies and careful assessments of expected
exposures. Pharmaceuticals and other medical products must undergo similar premarket evaluations. The safety and efficacy of such products must be demonstrated
through clinical studies (which are undertaken after animal toxicology studies have
been done and have demonstrated the safety of such products for individuals who
are involved in clinical trials). Human exposure assessments are central to the
regulatory approval of these products.13
Many other consumer products require risk assessments, but premarket
approvals are not generally required under our current laws. The list of such
products is very long, and not all substances included in these products have been
subjected to exposure and risk assessments, but regulatory initiatives in the United
States and abroad are creating new requirements for more complete assessments
of consumer safety.
B. Environmental and Product Contaminants
Byproducts of many industrial processes, including those created by combustion,
have led to much environmental contamination (see Section IV for a discussion
of the sources of such contamination).14 Technically speaking, contamination
refers to the presence of chemical substances in environmental media (including
consumer products) in which such substances would not ordinarily be found.
The term also may be used to refer to their presence in greater amounts than is
usual.15 The assessment of health risks from such contaminants depends upon an
understanding of the magnitude and duration of exposure to them. Exposures
may occur through the presence of contaminants in air, drinking water, foods,
consumer products, or soils and dusts; in many cases, exposures may occur simultaneously through more than one of these media.
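The point that exposures to a contaminant may occur through several media at once can be sketched in a few lines. The media and dose values below are invented for illustration; the only substantive point is that route-specific doses expressed in the same units (mg/kg-day) are summed to give an aggregate dose.

```python
# Hypothetical sketch of an aggregate exposure estimate: when the same
# contaminant reaches a person through several media, the route-specific
# doses (all in mg/kg-day) are summed. All values are invented.

route_doses = {
    "drinking water": 1.4e-4,
    "food": 6.0e-5,
    "inhalation": 2.5e-5,
    "dermal (soil/dust)": 5.0e-6,
}

# Aggregate dose is the sum across routes; the dominant route is the
# one contributing the largest share.
aggregate = sum(route_doses.values())
dominant = max(route_doses, key=route_doses.get)

print(f"aggregate dose: {aggregate:.2e} mg/kg-day; dominant route: {dominant}")
```

Identifying the dominant route in this way is often the first step in deciding which exposure pathway most needs measurement or intervention.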
The results from exposure and risk assessments (which incorporate information regarding the toxic properties of the contaminants) are typically used by
regulators and public health officials to determine whether exposed populations
are at significant risk of harm. If regulators decide that the risks are excessive, they
13. B.D. Beck et al., The Use of Toxicology in the Regulatory Process, in Principles and Methods of
Toxicology (A. Wallace Hayes ed., 5th ed. 2008).
14. See, e.g., Orchard View Farms, Inc. v. Martin Marietta Aluminum, Inc., 500 F. Supp. 984,
1008 (D. Or. 1980) (failure to monitor fluoride emissions that harmed nearby orchards supported
award of punitive damages).
15. For example, lead is naturally present in soils. It could be said that a sample of soil is contaminated with lead only if it were clear that the amounts present exceeded natural levels. The issue
is complicated by the fact that natural levels are highly variable.
will take steps to reduce them, typically by using interventions that will reduce
exposures (because the inherent toxic properties of the chemicals involved cannot
be altered). Exposure scientists are called upon to assess the magnitude of exposure
reduction (and therefore risk reduction) achieved through a given intervention.16
C. Chemicals in Workplace Environments
Workers in almost all industrial sectors are exposed to chemicals.17 Exposures are
created in industries involved in the extraction of the many raw materials used to
manufacture chemical products (the mining, agricultural,18 and petroleum industries). Raw materials are refined and otherwise processed in thousands of different
ways and are eventually turned into manufactured chemical products that number
in the tens of thousands. These products enter many channels of distribution and
are incorporated into many other products (so-called downstream uses). Occupational exposures can occur at all of these various steps of manufacturing and use.
Exposure also can occur from disposal of wastes. Exposure assessments in all of
these various occupational settings are important to understand whether health
risks are excessive and therefore require reduction.19
D. Claims of Disease Causation
In the above three situations, the exposures of interest are those that are currently
occurring or that are likely to occur in the future. In those situations the exposure assessments are used to ascertain whether risks of harm are excessive (and
thus require reduction) or to document safety (when risks are negligible). There
are, however, many circumstances in which individuals claim they actually have
been harmed by chemicals. Specifically, they allege that some existing medical
condition has been caused by exposures occurring in the past, whether in the
workplace, the environment, or through the use of various consumer products.20
16. National Research Council, Air Quality Management in the United States (2004).
17. See, e.g., Kennecott Greens Creek Min. Co. v. Mine Safety & Health Admin., 476 F.3d
946 (D.C. Cir. 2007) (suit over regulations addressing miners’ exposure to diesel particulate matter).
18. The term “agriculture” is applied here very broadly and includes the production of a wide
variety of raw materials that have industrial and consumer product uses (including flavors, fragrances,
fibers of many types, and some medicinal products). See, e.g., Association of Irritated Residents v. Fred
Schakel Dairy, 634 F. Supp. 2d 1081, 1083 (E.D. Cal. 2008) (methanol emissions from dairy allegedly
resulted in exposure sufficient to create human health risks).
19. Office of Pesticide Programs, U.S. Environmental Protection Agency, General Principles for
Performing Aggregate Exposure and Risk Assessments, available at http://www.epa.gov/pesticides/
trac/science/aggregate.pdf (last visited July 14, 2010).
20. See Michael D. Green et al., supra note 1, in this manual, for a discussion on disease causation. Regulations and public health actions are usually driven by findings of excessive risk of harm (although sometimes by evidence of actual harm).
Exposure science comes into play in these cases because the likelihood that any
given disease or injury was induced because of exposure to one or more chemicals
depends in large part on the size of that exposure.21 Thus, with the advent of
large numbers of so-called toxic tort claims has come the need to assess past exposures. Exposure scientists have responded to this need by adapting the methods
of exposure assessment to reconstruct the past—that is, to produce a profile of
individuals’ past exposures.22
A plaintiff with a medical condition known from epidemiological studies
to be caused by a specific chemical may not be able to substantiate his or her
claim without evidence of exposure to that chemical of a sufficient magnitude.23
Exposure experts are needed to quantify the exposures incurred; causation experts
are then called upon to offer testimony on whether those exposures are of a
magnitude sufficient to cause the plaintiff’s condition. Chemicals known to cause
diseases under certain exposure conditions will not do so under all exposure
conditions.
Exposure reconstruction has a history of use by epidemiologists who are
studying disease rates in populations that may be associated with past exposures.24
Epidemiologists have paved the way for the use of exposure assessment methods to
reconstruct the past. Although the methods for evaluating current and past exposures are essentially identical, the data needed to quantify past exposures are often
more limited and yield less certain results than the data needed to evaluate current
exposures. Assessment of past exposures is especially difficult when considering
diseases with very long latency periods.25 By the time disease occurs, documentary
proof of exposure and magnitude may have disappeared. But courts regularly deal
with evidence reconstructing the past, and assessment of toxic exposure is another
application of this common practice.26
21. See supra notes 1 & 2. Causation may sometimes be established even if quantification of
the exposure is not possible. See, e.g., Best v. Lowe’s Home Ctrs, Inc., 563 F.3d 171 (6th Cir. 2009)
(doctor permitted to testify as to causation based on differential diagnosis).
22. Confounding factors must be carefully addressed. See, e.g., Allgood v. General Motors Corp.,
2006 WL 2669337, at *11 (S.D. Ind. 2006) (selection bias rendered expert testimony inadmissible);
American Farm Bureau Fed’n v. EPA, 559 F.3d 512 (2009) (in setting particulate matter standards
addressing visibility, the data relied on should avoid the confounding effects of humidity); Avila v.
Willits Envtl. Remediation Trust, 2009 WL 1813125, 2009 U.S. Dist. LEXIS 67981 (N.D. Cal.
2009) (failure to rule out confounding factors of other sources of exposure or other causes of disease
rendered expert’s opinion inadmissible); Adams v. Cooper Indus. Inc., 2007 WL 2219212, 2007 U.S.
Dist. LEXIS 55131 (E.D. Ky. 2007) (differential diagnosis includes ruling out confounding causes of
plaintiffs’ disease).
23. See Michael D. Green et al., Reference Guide on Epidemiology, in this manual.
24. Id.
25. W.T. Sanderson et al., Estimating Historical Exposures of Workers in a Beryllium Manufacturing
Plant, 39 Am. J. Indus. Med. 145–57 (2001).
26. Courts have accepted indirect evidence of exposure. For example, differential diagnosis
may support an expert’s opinion that the exposure caused the harm. Best v. Lowe’s Home Ctrs., Inc.,
563 F.3d 171 (6th Cir. 2009). On occasion, qualitative evidence of exposure is admitted as evidence that the magnitude was great enough to cause harm. See, e.g., Westberry v. Gislaved Gummi AB, 178 F.3d 257 (4th Cir. 1999) (no quantitative measurement required where evidence showed plaintiff was covered in talc and left footprints); Allen v. Martin Surfacing, 263 F.R.D. 47 (D. Mass. 2009) (symptom accounts at the time of exposure formed the basis for expert’s opinion that exposure was high enough to cause harm). And courts have accepted the government’s reconstruction of exposure to radiation. Hayward v. U.S. Dep’t of Labor, 536 F.3d 376 (5th Cir. 2008); Hannis v. Shinseki, 2009 WL 3157546 (Vet. App. 2009) (no direct measure of veteran’s exposure to radiation was possible but VA’s dose estimate was not clearly erroneous).
IV. Chemicals
Before embarking on a description of the elements of exposure science, it is useful to provide a brief primer on some of the characteristics of chemicals that influence their behavior and that therefore affect the ways in which humans can be exposed to them. The primer also introduces some technical terms that frequently arise in exposure science.
A. Organic and Inorganic Chemicals
For both historical and scientific reasons, chemists divide the universe of chemicals into organic and inorganic compounds. The original basis for classifying chemicals as organic was the hypothesis, known since the mid-nineteenth century to be false, that organic chemicals could be produced only by living organisms. Modern scientists classify chemicals as organic if they contain the element carbon.27 Carbon has the remarkable and nearly unique property that its atoms can combine with each other in many different ways, and, together with a few other elements—including hydrogen, oxygen, nitrogen, sulfur, chlorine, and bromine—can create a huge number of different molecular arrangements. Each such arrangement is a unique chemical. Several million distinct organic chemicals are already known to chemists, and there are many more that will no doubt be found to occur naturally or that will be created by laboratory synthesis. All of life—at least on Earth—depends on carbon compounds and probably could not have evolved if carbon did not have its unique and extraordinary bonding properties.
27. There are a few compounds of carbon that chemists still consider inorganic: These are typically simple molecules such as carbon monoxide (CO) and carbon dioxide (CO2) and the mineral limestone, which is calcium carbonate (CaCO3).
All other chemicals are called inorganic. There are 90 elements in addition to carbon in nature (and several more that have been created in laboratories), and because these elements do not have the special properties of carbon, the number of different possible combinations of them is smaller than can occur with carbon.
Living organisms contain or produce organic chemicals by the millions. One of the most abundant organic chemicals on Earth is cellulose—a giant molecule containing thousands of atoms of carbon, hydrogen, and oxygen. Cellulose is produced by all plants and is their essential structural component. Chemically, cellulose is a carbohydrate (one that is not digested by humans), a group that together
with proteins, fats, and nucleic acids are the primary components of life. But living
organisms also produce huge numbers of other types of organic molecules. The
colors of plants and animals and their odors and tastes are a result of the presence
of organic chemicals. The numbers and structural varieties of naturally occurring
chemicals are enormous.
Other important natural sources of organic chemicals are the so-called fossil
fuels—natural gas, petroleum, and coal—all deposited in the Earth from the decay
of plant and animal remains and containing thousands of degradation products.
Most of these are simple compounds containing only carbon and hydrogen (technically known as hydrocarbons). The organic chemical industry depends upon
these and just a few other natural products for everything it manufactures; the
fraction of fossil fuels not used directly for energy generation is used as feedstock
for the chemical industry. There are also inorganic chemicals—the minerals—
present in living organisms, many essential to life. But the principal natural source
of inorganic chemicals is the nonliving part of the Earth that humans have learned
how to mine.
B. Industrial Chemistry
The modern chemical industry had its origins in the late nineteenth century when
chemists, mostly European, discovered that it was possible to create in the laboratory chemicals that had previously been found only in nature. Most remarkably,
scientists also discovered they could synthesize compounds not found in nature—
substances never previously present on Earth. In other words, they found ways
to alter through chemical reactions the bonds present in one compound so that
a new compound was formed. The first compound synthesized in this way was
a dye called aniline purple, discovered by the British chemist William Henry Perkin.28 The work of chemical synthesis grew out of the development of
so-called structural theory in the nineteenth century and remains central to the
science today. This theory explains that the number and type of chemical elements present, and the ways in which those elements are bonded to each other,
are unique for each chemical compound and therefore distinguish one chemical
from another.
In the late nineteenth century and up to World War II, coal was the major
starting material for the organic chemical industry. When coal is heated in the
absence of oxygen, coke and volatile byproducts called coal tars are created.
All sorts of organic chemicals can be isolated from coal tar—benzene, toluene,
xylenes, ethylbenzene, naphthalene, creosotes, and many others. The organic
28. This compound and others related to it became the bases for the first chemical industry, that
devoted to dye production. Perkin’s dye was later called “mauve” and its wide use led to what came
to be called the Mauve Decade (1890s).
chemical industry also uses other natural products, such as animal fats, vegetable
oils, and wood byproducts.
The move to petroleum as a raw materials source for the organic chemical
industry began during the 1940s. Petrochemicals, as they are called, are now used
to create thousands of useful industrial chemicals. The rate of commercial introduction of new chemicals rose sharply after World War II.
Among the thousands of products produced by the organic chemical industry
and by related industries are medicines (most of which are organic chemicals of
considerable complexity), dyes, agricultural chemicals, including substances used
to eliminate pests (insecticides, fungicides, herbicides, rodenticides, and other
“cides”), soaps and detergents, synthetic fibers and rubbers, paper chemicals,
plastics and resins of great variety, adhesives, food additives, additives for drinking water, refrigerants, explosives, cleaning and polishing materials, cosmetics,
and textile chemicals. Because of past disposal practices, chemicals primarily used
as solvents (for many purposes) are among the most widespread environmental
contaminants.
The history of human efforts to tap the inorganic earth for useful materials
is complex and involves a blend of chemical, mining, and materials technologies.
Included here is everything from the various silicaceous materials derived from
stone (glasses, ceramics, clays, asbestos) to the vast number of metals derived
from ores that have been mined and processed (iron, copper, nickel, cadmium,
molybdenum, mercury, lead, silver, gold, platinum, tin, aluminum, uranium,
cobalt, chromium, germanium, iridium, cerium, palladium, manganese, zinc,
and many more). Other nonmetallic materials, such as chlorine and bromine, salt
(sodium chloride), limestone (calcium carbonate), sulfuric acid, and phosphates,
and various compounds of the metals, have hundreds of different uses, as strictly
industrial chemicals and as consumer products. These inorganic substances reach,
enter, and move about our environment, and we come into contact with them,
sometimes intentionally, sometimes inadvertently. The number of organic and
inorganic chemicals in commercial production exceeds 70,000, and the number
of uses and products created from them far exceeds this number.
There are important health questions related to what is generally referred to
as particulate matter (PM). Small particulates in the air usually arise from combustion of almost any organic material. The chemical composition of such particulates can vary depending upon source, but it is possible that their health effects
depend more upon their physical size than their chemical composition. This issue
is currently unresolved, but it is important to include PMs of all types as a class
of chemical contaminants.
Finally, it is important to note that, in addition to PM, many chemicals are
produced when fuels or other organic materials are burned. Organic chemicals
take on oxygen atoms during combustion and yield large numbers of substances
not present in the materials that are burned. Combustion also produces simple
inorganic oxides of carbon, nitrogen, and sulfur, which are major air pollutants.
Burning tobacco introduces 4,000 to 5,000 chemicals into the lungs. Combustion
products are another important source of environmental contamination.29
V. Human Exposures to Chemicals
As noted earlier, this section is entirely descriptive, rather than quantitative. It
describes all the various physical processes that lead to human exposures to chemicals and introduces the terms that exposure scientists apply to those processes.
Section VI illustrates how these various processes can be quantified and the types
of data that are required to do so.
A. Exposure Sources—An Overview
Figure 1 provides a broad overview of most of the major sources of exposure.
As shown, sources can be intended or unintended. Thus, many chemicals are
intentionally used in ways that will lead to human exposures. Substances added
to food and indeed food itself,30 cosmetics, personal care products, fibers and the
colorants added to them, and medical products of many types are included in this
broad category. Direct ingestion of, or other types of direct contact with (on the
skin or through inhalation), such products obviously creates exposures. Nicotine
and tobacco combustion products might also be classified as intended exposures.
Generally, these exposures are more readily quantifiable than those associated with
unintended exposures.
Although the term is somewhat ambiguous, unintended exposures may be
said to fall into two broad categories. There are deliberate uses of certain chemicals
that, although not intended to lead to human exposures, will inevitably do so. Pesticides applied to food crops, some components of food packaging materials that
may migrate into food, and many types of household products are not intended
for direct human ingestion or contact, but exposures will nonetheless occur indirectly. Occupational exposures, although unintended, are similarly unavoidable.
Also, many exposures to a very broad range of environmental contaminants are
unintended (see Figure 1).
In all of these cases, such exposures are not described as intentional, in the
sense that the term is applied to a pharmaceutical ingredient or a cosmetic, but
most are not completely avoidable. Unintended exposures are generally more
29. National Research Council, Human Exposure Assessment for Airborne Pollutants: Advances
and Opportunities (1991); J. Samet & S. Wang, Environmental Tobacco Smoke, in Environmental
Toxicants (M. Lippmann ed., 2d ed. 2000).
30. The natural constituents of food include not only substances that have nutritional value, but
also hundreds of thousands of other natural chemicals.
Reference Guide on Exposure Science
Figure 1. Opportunities for exposure: Sources of chemical releases.
[Figure: a diagram linking sources of chemical releases (manufacture, storage, transportation, use, and disposal, along with leaks, spills, and accidents that release chemicals to soil, water, air, and biota) to examples of potential exposures, including occupational exposures; emissions to air; wastewater effluents; application of pesticides; consumption of pesticide residues; consumption of food additives; dermal exposure to cosmetics; inhalation of gasoline vapors; hazardous waste incinerator emissions; evaporation to air; surface runoff; and leaching to ground water.]
difficult to identify and quantify than are intended exposures.31 In the case of
the intended exposures, the pathway from source to humans is direct; in the case
of unintended exposures, the pathway is indirect, sometimes highly so. Thus,
the most important distinction for purposes of exposure assessment concerns the
directness of the pathway from source to people.
31. There are significant differences in the laws regarding the regulation of substances that have
been grouped as creating intended or unintended exposures.
B. The Goal of Exposure Assessment
Exposure assessment is generally intended to answer the following questions:
• Who has been, or could become, exposed to a specific chemical(s) arising
from one or more specific sources? Is it the entire general population, or is
it a specific subpopulation (e.g., those residing near a certain manufacturing or hazardous waste facility, or infants and children), or is it workers?32
• What specific chemicals comprise the exposures?
• What are the pathways from the source of the chemical to the exposed
population? Pathways include direct product use, or those (so-called indirect pathways) in which the chemical moves through one or more environmental media to reach the media to which people are exposed (air,
water, foods, soils, and dusts). Understanding pathways is necessary to
understanding exposure routes (below) and quantifying exposures.
• By what routes are people exposed? Routes include ingestion, inhalation,
and dermal contact.33 Identifying exposure routes is important because
those routes affect the magnitude of ultimate exposures and because they
often affect health outcomes.
• What is the magnitude and duration of exposure incurred by the population of interest? Dose is the technical term used for magnitude, and it is
the amount of chemical entering the body or contacting the surface of
the body, usually over some specified period of time (often over 1 day34).
Duration refers to the number of days over which exposure occurs. Note
that exposures can be intermittent or continuous and can be highly variable, especially for some air contaminants.
The ultimate goal of exposure assessment is to identify dose and duration.
The concept of dose is further developed in Section VI. After a chemical enters or
contacts the body, it can be absorbed (into the bloodstream), distributed to many
organs of the body, metabolized (chemically altered by certain enzymes in cells
of the liver and other organs), and then excreted. Understanding these processes
is important to determining whether and how a chemical may cause adverse
health effects. These processes mark the interface between exposure science and
toxicology, epidemiology, and medicine. Understanding the dose is the necessary
first step in understanding these processes; for purposes of this reference guide,
the boundary of exposure science is set at understanding dose. However, some
32. See, e.g., Hackensack Riverkeeper, Inc. v. Del. Ostego, 450 F. Supp. 2d 467 (D.N.J. 2006)
(river and bay users alleged that hazardous waste runoff and emissions polluted the water).
33. Additional routes of exposure are relevant for some pharmaceuticals, diagnostics, and medical devices.
34. Shorter periods of time are used when the concern is very short-term exposures to chemicals
that have extremely high toxicity—so-called acutely poisonous materials.
discussion of how it is possible to gain more direct measures of exposure (target
site doses) by examining human blood and urine is included.
The completion of an exposure assessment provides the information (the dose and
duration of exposure) needed by epidemiologists and toxicologists, who will
have information on the adverse health effects of the chemicals involved and on
the relationships between those effects and the dose and duration of exposure.35
Recall that exposure assessments can be directed at exposures that occurred in
the past, those that are currently occurring, or those that will occur in the future
should certain actions be taken (e.g., the entry of a new product into the consumer
market or the installation of new air pollution controls).
The discussion of each of these elements of exposure assessment is expanded
in the following section, beginning with pathways.
C. Pathways
Assuming that the chemical of interest and its sources have been identified,
exposure assessment focuses on the pathway the chemical follows to reach the
population of interest.36
To ensure thoroughness in the assessment, all conceivable pathways should be
explicitly identified, with the understanding that ultimately some pathways will
be found to contribute negligibly to the overall exposure. Identifying pathways is
also important to understanding exposure routes.
As noted earlier, the simplest pathways are those described as direct. Thus, a
substance, such as a noncaloric sweetener or an emulsifier, once added to food,
follows a simple and direct pathway to the people who ingest the food. The same
can be said for pharmaceuticals, cosmetics, and other personal care products.
35. See reference guides on epidemiology and toxicology in this manual. See also, e.g., White v.
Dow Chem. Co., 321 Fed. App'x 266, 2009 WL 931703 (4th Cir. 2009) (plaintiff must show more
than possible exposure; must show concentration and duration); Anderson v. Dow Chem. Co., 255
Fed. App'x 1, 2007 WL 1879170 (5th Cir. 2007) (lawsuit dismissed because uncontested data showed
that magnitude and duration of exposure was insufficient to cause adverse health effects); Finestone v.
Florida Power & Light Co., 272 Fed. App'x 761, 2008 WL 931703 (4th Cir. 2009) (experts' testimony
was properly excluded where their conclusions relied on unsupported assumptions).
36. SPPI-Somersville, Inc. v. TRC Cos., 2009 WL 2612227, at *16 (N.D. Cal. 2009) (groundwater contamination claim was dismissed because there was no current pathway to exposure); United
States v. W.R. Grace Co., 504 F.3d 745 (9th Cir. 2007) (affirming exclusion of report, but not expert
testimony based on the report, identifying which pathways of asbestos exposure were most associated
with lung abnormalities); Grace Christian Fellowship v. KJG Investments Inc., 2009 WL 2460990, at
*12 (E.D. Wis. 2009) (preliminary injunction was denied because the plaintiff did not establish that
a complete pathway currently existed for toxins to enter the building); National Exposure Research
Laboratory, U.S. Environmental Protection Agency, Scientific and Ethical Approaches for Observational Exposure Studies, Doc. No. EPA 600/R-08/062 (2008), available at http://www.epa.gov/
nerl/sots/index.html (last visited July 14, 2010); U.S. Environmental Protection Agency, Exposure
Factors Handbook (1997).
Calculating doses for such substances, as shown in Section VI, is generally a straightforward process. Even in such cases, however, complexities can arise. Thus, in
the case of certain personal care products that are applied to the skin, there is a
possibility of inhalation exposures to any substance in those products that can readily volatilize at room temperatures. One physical characteristic of chemicals that
exposure scientists need to understand is their capacity to move from a liquid to
a gaseous state (to volatilize). Not all chemicals are readily volatile (and almost all
inorganic, metal-based substances are close to nonvolatile), but inhalation routes
can be significant for those that are volatile, regardless of their sources.37
Indirect pathways of exposure can range from the relatively simple to the
highly complex. Many packaging materials are polymeric chemicals—very large
molecules synthesized by causing very small molecules to chemically bind to each
other (or to other small molecules) to make very long chemical chains. These
polymers (polyethylene, polyvinyl chloride, polycarbonates, and others) tend to
be physically very stable and chemically quite inert (meaning they have very low
toxicity potential). But it is generally not possible to synthesize polymers without
very small amounts of the starting chemicals (those small molecules, usually called
monomers) remaining in the polymers. The small molecules can often migrate
from the polymer into materials with which the polymer comes into contact. If
those materials are foods or consumer products, people consuming those foods or
otherwise using those products will be exposed.
Some amount of the pesticides applied to food crops may remain behind in
treated foods and be consumed by people.38 This last pathway can become more
complicated when treated crops are used as feed for animals that humans consume
(meat and poultry and farm-raised fish) or from which humans obtain food (milk
and eggs). Exposure scientists who study these subjects thus need to understand
what paths pesticides follow when they are ingested by farm animals used as food.
The same complex indirect pathways arise for some veterinary drugs used in animals from which humans obtain food.39
In the realm of environmental contamination, pathways can multiply and
the problem of exposure assessment can become even more complex. Sources of
environmental contamination include air emissions from manufacturing facilities
and from numerous sources associated with the combustion of fuels and other
37. Inhalation exposures to nonvolatile chemicals can occur if they are caused to move into the
air as dusts. See National Research Council, Human Exposure Assessment for Airborne Pollutants:
Advances and Opportunities (1991).
38. Other pathways for pesticide exposure include spraying homes or fields. Kerner v. Terminix
Int’l Co., 2008 WL 341363 (S.D. Ohio 2008) (pesticides allegedly misapplied inside home); Brittingham
v. Collins, 2008 WL 678013 (D. Md. Feb. 26, 2008) (crop-dusting plane sprayed plaintiff’s decedent);
Haas v. Peake, 525 F.3d 1168 (Fed. Cir. 2008) (veteran claimed exposure to Agent Orange).
39. P. Frank & J.H. Schafer, Animal Health Products, in Regulatory Toxicology (S.C. Gad, ed.,
2d ed. 2001).
organic materials.40 Similar emissions to water supplies, including ground water
used for drinking or for raising plants and animals, can result in human exposures
through drinking water and food.41 Contaminants of drinking water that are volatile can enter the air when water is used for bathing, showering, and cooking. A
recent problem of much concern is the contamination of air in homes and other
buildings because of the presence of volatile chemical contaminants in the water
beneath those structures.42
Wastes from industrial processes and many kinds of consumer wastes can similarly result in releases to air and water.43 In some cases, emissions to air can lead to
the deposition of contaminants in soils and household dusts; this type of contamination is usually associated with nonvolatile substances. Some such substances may
remain in soils for very long periods; others may migrate from their sites of deposition and contaminate ground water; still others may degrade relatively quickly.
All of these issues regarding the movement of chemicals from their sources
and through the environment to reach human populations come under the heading of chemical fate and transport.44 Transport concerns the processes that cause
chemicals to follow certain pathways from their sources through the environment, and fate concerns their ultimate disposition—that is, the medium in which
they finally reside and the length of time that they might reside there. Fate-and-transport scientists have models available to estimate the amount of chemical that
will be present in that final environmental medium.45 Some discussion of the
nature of these models is offered in Section VI.
One final feature of pathways analysis that should be noted concerns the fact
that some chemicals degrade rapidly when they enter the environment, others
slowly, and some not at all or only exceedingly slowly. The study of environmental persistence of different chemicals is a significant feature of exposure science; its goal is to understand the chemical nature of the degradation products
and the duration of time the chemical and its degradation products persist in any
40. See, e.g., Natural Resources Defense Council, Inc. v. EPA, 489 F.3d 1250 (D.C. Cir. 2007)
(vacating EPA rule for solid waste incinerators); Kurth v. ArcelorMittal USA, Inc., 2009 WL 3346588
(N.D. Ind. 2009) (defendant manufacturers allegedly emitted toxic chemicals, endangering schoolchildren); American Industrial Hygiene Association, Guideline on Occupational Exposure Reconstruction (S.M. Viet et al. eds., 2008).
41. United States v. Sensient Colors, Inc., 580 F. Supp. 2d 369, 373 (D.N.J. 2008) (leaching
lead threatened to contaminate ground water used for drinking).
42. Interstate Technology & Regulatory Council (ITRC), Vapor Intrusion Pathway: A Practical
Guideline (Jan. 2007), available at http://www.itrcweb.org/Documents/VI-1.pdf.
43. American Farm Bureau Fed’n. v. EPA, 559 F.3d 512 (D.C. Cir. 2009) (EPA outdoor air
pollution standards).
44. The common phrase used by exposure scientists is “fate and transport.” In fact, transport
takes place and has to be understood before fate is known.
45. In the context of exposure science, the term “final” refers to the medium through which
people become exposed. A chemical may in fact continue to move to other media after that human
exposure has occurred.
given environmental medium. Most inorganic chemicals are highly persistent;
metals that become contaminants may change their chemical forms in small ways
(lead sulfide may convert to lead oxide), but the metal persists forever (although
it may migrate from one medium to another). Most organic chemicals degrade in
the environment as a result of their exposure to light, to microorganisms present
in soils and sediments, and to other environmental substances. But a few organic
substances (e.g., polychlorinated biphenyls (PCBs), the chlorinated dioxins, and
certain chlorinated pesticides such as DDT that were once widely used) are quite
resistant to degradation and may persist for unexpectedly long periods (although
even these ultimately degrade).46
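The range of persistence described above is often summarized by a degradation half-life. As a rough sketch only (simple first-order decay and the half-life values below are illustrative assumptions, not data from this guide), the fraction of a chemical remaining in a medium after a given time can be estimated as:

```python
def fraction_remaining(half_life_days, elapsed_days):
    """Fraction of a chemical remaining, assuming simple first-order decay."""
    return 0.5 ** (elapsed_days / half_life_days)

# Illustrative half-lives: a readily degraded organic vs. a persistent one
print(fraction_remaining(10.0, 365.0))    # essentially all degraded within a year
print(fraction_remaining(3650.0, 365.0))  # most of the chemical remains after a year
```

Real degradation is rarely this simple; it depends on the medium, light, microorganisms, and the identity of the degradation products, which is why persistence is measured rather than assumed.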
Exposure scientists also need to be aware of the possibility that the degradation
products of certain chemicals may be as toxic as, or more toxic than, the chemicals themselves. The once widely used solvents trichloroethylene and perchloroethylene
(tetrachloroethylene) are commonly found in ground water. Under certain conditions, these compounds degrade by processes that lead to the replacement of some
chlorine atoms by hydrogen atoms; one product of their degradation is the more
dangerous chemical called vinyl chloride (monochloroethylene). The presence of
such a degradation product in drinking water should not be ignored.
A description of pathways is the critical first step in exposure assessment and,
especially for environmental contaminants, must be done with thoroughness. Are
all conceivable pathways accounted for? Have some pathways been eliminated
from consideration, and if so, why? Are any environmental degradation products of concern? Only with adequate description can adequate quantification
(Section VI) be accomplished.
A graphical description of pathways is offered in Figure 2.
D. Exposure Routes
Pathways analysis leads to the identification of the environmental media in
which the chemical of interest comes to be present and with which human contact
can occur—the media of human exposure.
The inhalation of air containing the chemical of interest is one route of
exposure.47 The physical form of the chemical in air, which should be known
from the pathways analysis, will influence what happens to the chemical during
inhalation. Chemicals that are in the vapor phase will remain in that physical
46. K.W. Fried & K.K. Rozman, Persistent Polyhalogenated Aromatic Hydrocarbons, in Toxicology
and Risk Assessment: A Comprehensive Introduction (H. Greim & R. Snyder eds., 2009).
47. See, e.g., Byers v. Lincoln Elec. Co., 607 F. Supp. 2d 840 (N.D. Ohio 2009) (welder inhaled
toxic manganese fumes); O’Connor v. Boeing North American, Inc., 2005 WL 6035256 (C.D. Cal.
2005) (alleged failure to monitor ambient air emissions of radioactive particles); In re FEMA Trailer
Formaldehyde Prod. Liab. Litig., 2009 WL 2382773 (E.D. La. 2009) (trailer residents exposed to
formaldehyde).
Figure 2. Description of the many possible environmental pathways that chemicals may follow after releases from different sources.
[Figure: a landscape diagram showing three sources (a stack emission, a leaking underground storage tank, and a waste dump) releasing chemicals that travel by deposition under the prevailing wind direction, vapor intrusion, surface water runoff, leaching, and groundwater transport to surface water, reaching exposure points A through E, including a drinking water well.]
Exposures at Point A could include: drinking water ingestion, dermal contact with water, and inhalation
Exposures at Point B could include: incidental soil ingestion, inhalation, dermal contact, and consumption of local produce and meat
Exposures at Point C could include: incidental soil ingestion, dermal contact, and consumption of game
Exposures at Point D could include: dermal contact and consumption of fish
Exposures at Point E could include: incidental water ingestion and dermal contact
Source: Graphic created by Jason Miller.
state and will move to the lungs, where a certain fraction will pass through
the lungs and enter the bloodstream. The extent to which different chemical
substances pass through the lungs is dependent in large part upon their physical
properties, particularly solubilities in both fatlike materials and water. Passage
through cell membranes (of the cells lining the lungs) requires that substances
have a degree of both fat solubility and water solubility. Predicting the extent
of absorption through the lungs (or the gastrointestinal tract or skin, discussed
below) cannot be accomplished with accuracy; knowledge in this area can be
gathered only through measurement.
Certain fibrous materials (including but not limited to asbestos) and particulate matter and dusts may move through the airways and may reach the lungs, but
some of these kinds of materials may be trapped in the nose and excreted. Generally, only very fine particles reach the lower lung area. Some particles may be
deposited in the upper regions of the respiratory tract and then carried by certain
physical processes to the pharynx and then be coughed up or swallowed. Thus,
inhaled chemicals and particulates can enter the body through the gastrointestinal
(GI) tract or the respiratory tract.48 Understanding risk requires information about
these characteristics of the chemicals involved.
Ingestion is the second major route of exposure to substances in environmental media.49 Chemicals that comprise or come to be present in foods, in
drinking water, in soils and dusts,50 and many of those that serve as medicines
are all ingested. They are swallowed, enter the GI tract, and to greater or lesser
degrees are absorbed into the bloodstream at various locations along that tract.
This is often referred to as the oral route of exposure.
The largest organ of the body, the skin, is the third route of exposure for
chemicals in products and the environment.51 As with the GI tract and the lungs,
chemicals are absorbed through the skin to greater or lesser degrees, depending on
their physical and chemical characteristics. In some cases, toxic harm can occur
directly within the respiratory or GI tracts or on the skin before absorption occurs.52
The pathways analysis allows the identification of all the routes by which
chemicals from a given source may enter the body, because it identifies the media
of human contact into which the chemicals migrate from their sources. Once the
media of human contact are identified, the possible exposure routes are known.
E. Summary of the Descriptive Process
Once the exposure question to be examined has been defined, the exposure scientist sets out to identify all the relevant sources of exposure to the chemicals of
interest. All the pathways the chemicals can follow from those sources to reach
the population of interest are then described, with careful attention to the possibility that chemical degradation (to more or less toxic substances) can occur.
The pathways analysis concludes with a description of what chemicals will be
present in the various environmental media to which the exposed populations
were, are, or could become exposed (air, water, foods, soils and dusts, consumer
products). At this point, it becomes possible to identify the routes by which the
chemicals can enter the body.
48. J.V. Rodricks, From Exposure to Dose, in Calculated Risks: The Toxicity and Human Health
Risks of Chemicals in Our Environment (2d ed. 2007).
49. See, e.g., Foster v. Legal Sea Foods, Inc., 2008 WL 2945561 (D. Md. 2008) (hepatitis A
allegedly contracted from eating undercooked mussels); Winnicki v. Bennigan’s, 2006 WL 319298
(D.N.J. 2006) (alleged foodborne illness contracted from defendant’s restaurant led to renal failure and
death); Palmer v. Asarco Inc., 2007 WL 2298422 (N.D. Okla. 2007) (children allegedly ingested dust
and soil contaminated with lead).
50. Inadvertent exposures to these and other nonfood items are known to occur and can be
especially common in children.
51. See, e.g., United States v. Chamness, 435 F.3d 724 (7th Cir. 2006) (evidence that methamphetamine and the ingredients used in its manufacture are toxic to the eyes, mucous membranes,
and skin supported sentencing enhancement for danger to human life).
52. J.V. Rodricks, From Exposure to Dose, in Calculated Risks: The Toxicity and Human Health
Risks of Chemicals in Our Environment (2d ed. 2007).
Description by itself, however, often is inadequate. Attempts have to be
made to quantify exposure, to arrive at estimates of the dose received by the
exposed population, and to determine the duration of time over which that dose
is received.
VI. Quantification of Exposure
A. Dose
The simplest dose calculations relate to situations in which direct exposures
occur.53 Thus, for example, consider the case of a substance directly added to food
(and approved by the U.S. Food and Drug Administration (FDA) for such addition). Suppose the chemical is of well-established identity and is approved for use
in nonalcoholic beverages at a concentration of 10 milligrams of additive for each
liter of beverage (10 mg/L).54 To understand the amount (weight) of the additive ingested each day, it is necessary to know how much of the beverage people
consume each day. Data are available on rates of food consumption in the general
population. Typically, those data reflect average consumption rates and also rates
at the high end of consumption. To make sure that the additive is safe for use,
FDA seeks to ensure the absence of risk for individuals who may consume at the
high end, perhaps at the 95th percentile of consumption rates.55 Surveys of intake
levels for the beverage in our example reveal that the 95th percentile intake is
1.2 L per day for adults.
The weight of additive ingested by individuals at the 95th percentile of beverage consumption rate is thus obtained as follows:
10 mg/L × 1.2 L/day = 12 mg/day.
For a number of reasons, toxicologists express dose as weight of chemical
per unit of body weight. For adults having a body weight (bw) of, on average,
70 kilograms (kg), the dose of additive is
12 mg/day ÷ 70 kg bw = 0.17 mg/kg bw per day.56
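The two steps above can be sketched as a short calculation (a minimal illustration of this worked example; the function names are mine, not part of any standard exposure-assessment software):

```python
def daily_intake_mg(concentration_mg_per_l, consumption_l_per_day):
    """Weight of additive ingested per day (mg/day)."""
    return concentration_mg_per_l * consumption_l_per_day

def dose_mg_per_kg_day(intake_mg_per_day, body_weight_kg):
    """Dose expressed per unit of body weight, as toxicologists prefer."""
    return intake_mg_per_day / body_weight_kg

intake = daily_intake_mg(10.0, 1.2)      # 10 mg/L x 1.2 L/day = 12 mg/day
dose = dose_mg_per_kg_day(intake, 70.0)  # 12 mg/day / 70 kg ~ 0.17 mg/kg bw per day
print(round(intake, 1), round(dose, 2))
```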
53. See, e.g., McLaughlin v. Sec’y of Dep’t of Health & Human Servs., 2008 WL 4444142 (Fed.
Cl. 2008) (plaintiff exposed to known dose of thimerosol in vaccine; study using four times that dose
was not reliable evidence that exposure caused his autistic symptoms).
54. See Appendix A for a discussion of units used in exposure science.
55. J.V. Rodricks & V. Frankos, Food Additives and Nutrition Supplements, in Regulatory Toxicology
51–82 (C.P. Chengeliss et al. eds., 2d ed. 2001).
56. To gain approval for such an additive, FDA would require that no toxic effects be observable in long-term animal studies at doses of at least 17 mg/kg bw per day (100 times the high-end
human intake).
Doses from other ingested products containing specified amounts of chemicals
are calculated in much the same way. It generally would be assumed that the duration of exposure for a substance added to a food or beverage would be continuous
and would cover a large fraction of a lifetime. For other products, particularly
pharmaceuticals, exposure durations will vary widely; dose calculations would be
the same, regardless of duration, but the potential for harm requires consideration
of exposure duration.
It will be useful, before proceeding further, to illustrate dose calculations for
exposures occurring by the inhalation and dermal routes.57 Consider a hypothetical
workplace setting in which a solvent is present in the air. Measurement by an
industrial hygienist reveals its presence at a weight of 2 mg in each cubic meter
(m3) of air. Data on breathing rates reveal that a typical worker breathes in 10 m3
of air each 8-hour workday.58 Thus, the worker dose will be
2 mg/m3 × 10 m3/day = 20 mg/day
20 mg/day ÷ 70 kg = 0.28 mg/kg bw per day.
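The same arithmetic applies to the inhalation example; this sketch uses illustrative function names of my own, and it deliberately stops at the potential (inhaled) dose, since the absorbed fraction must be measured rather than predicted:

```python
def inhaled_dose_mg_per_kg_day(air_conc_mg_per_m3, breathing_m3_per_day,
                               body_weight_kg):
    """Potential inhaled dose per unit of body weight; absorption into
    the bloodstream is not modeled."""
    return air_conc_mg_per_m3 * breathing_m3_per_day / body_weight_kg

# 2 mg/m3 of solvent, 10 m3 breathed per 8-hour workday, 70 kg worker
worker_dose = inhaled_dose_mg_per_kg_day(2.0, 10.0, 70.0)  # 20 mg/day / 70 kg bw
```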
As noted earlier, it is likely that only a fraction of this dose will reach and pass
through the lungs and enter the bloodstream. As also noted earlier, if the chemical
is a fiber or other particle, its dynamics in the respiratory tract will be different
than that of a vapor, with a portion of the inhaled dose entering the GI tract.
Dose from skin exposure often is expressed as the weight of chemical per
some unit of skin surface area (e.g., per m2 of skin). The body surface area of an
average (70 kg) adult is 1.8 m2. Thus, consider a body lotion containing a chemical of interest. If the lotion is applied over the entire body, then it is necessary to
know the total amount of lotion applied and then the total amount of chemical
present in that amount of lotion. That last amount will then be divided by 1.8 to
yield the skin dose in units of milligrams per square meter. If the chemical causes
toxicity directly to the skin, that toxicity dose information also will be expressed in
milligrams per square meter. Then risk is evaluated by examining the quantitative
relationship between the toxic dose (milligrams per square meter) and the (presumably much lower) human dose expressed in the same units. If the chemical can
penetrate the skin and produce toxicity within the body, then the dose determination must include an examination of the amount absorbed into the human body.59
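The skin-dose arithmetic can be sketched the same way; the lotion quantity and chemical fraction below are hypothetical numbers invented for illustration, since the text leaves them unspecified:

```python
def skin_dose_mg_per_m2(chemical_applied_mg, skin_area_m2=1.8):
    """Skin dose as weight of chemical per unit of skin surface area,
    using the 1.8 m2 body surface area of an average 70-kg adult."""
    return chemical_applied_mg / skin_area_m2

# Hypothetical: 5 g of lotion applied over the whole body, 0.2% chemical by weight
chemical_mg = 5000.0 * 0.002                       # 10 mg of chemical applied
print(round(skin_dose_mg_per_m2(chemical_mg), 2))  # dose in mg per square meter
```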
57. See, e.g., Henricksen v. ConocoPhillips Co., 605 F. Supp. 2d 1142, 1164 (E.D. Wash. 2009)
(benzene exposure on skin and by inhalation); Bland v. Verizon Wireless (VAW) LLC, 2007 WL
5681791, at *9 (S.D. Iowa 2007) (inhalation exposure to Freon in “canned air” sprayed into water
bottle). For a discussion of the importance of assessment of dose as a measure of exposure, see Bernard
D. Goldstein & Mary Sue Henifin, Reference Guide on Toxicology, Section I.A.1.c, in this manual.
58. The 24-hour inhalation rate outside the workplace setting is ca. 20 m3. The lack of direct
proportion to time reflects the fact that breathing rates increase under exertion.
59. Rates of absorption of chemicals into the body, through the GI tract, the lungs, or the skin,
usually must be obtained by measurement; they are not readily predicted.
One final matter in dose estimation concerns the importance of
body size, in particular that of the infant and the growing child. In matters such
as food and water intake, and breathing rates, small children are known to take
in these media at higher rates per unit of their body weights than do adults.60
Thus, when a small child is exposed to a food contaminant, that child will often
receive a greater dose of the contaminant than will an adult consuming food
with the same level of contaminant. Children also tend to ingest greater amounts
of nonfood items, such as soils and dusts, than do adults. In some cases, nursing
mothers excrete chemicals in their milk. The exposure scientist generally conducts
separate assessments for children that take into account the possibility of periods
of increased exposure during the developmental period.61
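The effect of body size can be made concrete with a small comparison; the contaminant level and intake figures below are hypothetical, chosen only to illustrate the per-kilogram scaling the text describes:

```python
def water_dose_mg_per_kg_day(contaminant_mg_per_l, intake_l_per_day,
                             body_weight_kg):
    """Daily dose of a drinking-water contaminant per unit of body weight."""
    return contaminant_mg_per_l * intake_l_per_day / body_weight_kg

# Hypothetical contaminant at 0.05 mg/L in drinking water
adult_dose = water_dose_mg_per_kg_day(0.05, 2.0, 70.0)  # adult: 2 L/day, 70 kg
child_dose = water_dose_mg_per_kg_day(0.05, 1.0, 10.0)  # child: 1 L/day, 10 kg
print(round(child_dose / adult_dose, 1))  # child's per-kg dose is several-fold higher
```

Even though the child drinks less water in absolute terms, the dose per kilogram of body weight is substantially higher, which is why separate assessments are conducted for children.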
B. Doses from Indirect Exposure Pathways
Recall that the goal of exposure assessment is to identify the media through which
people will be exposed to chemicals of interest that are emitted from sources of
interest. As will be seen, the assessment, when completed, will reveal the amount
of the chemical of interest in a certain weight or volume of each of the media
with which people come into contact. Once this is known, dose calculations can
proceed in the manner described in the preceding section.
In the preceding section, firm and readily available knowledge was available
about the amount of chemical present in a given weight of food or consumer
product (the body lotion example) or in a given volume (cubic meters) of air.
These measures are called concentrations of the chemicals in the media of exposure (see Appendix A). When a chemical must move from one or more sources,
and then through one or more environmental media, before it comes to be present
in the media with which people have contact (the media of exposure), determining the concentrations of the chemical in the media of exposure becomes difficult.62 Such a situation is clearly different from that in which a specific amount
of an additive is directly added to a specific amount of food. The challenge faced
by exposure scientists when the chemical comes to be present in the medium of
human exposure not by direct and intentional addition, but by indirect means,
through movement from source through the environment, is to find a reliable
60. See, e.g., Northwest Coalition for Alternatives to Pesticides (NCAP) v. EPA, 544 F.3d 1043
(9th Cir. 2008) (dispute over how much lower allowable pesticide levels should be to account for
children’s greater susceptibility).
61. For some substances, susceptibility to toxicity is also enhanced during the same periods. See
Section VII.B.
62. See, e.g., Hannis v. Shinseki, 2009 WL 3157546 (Vet. App. 2009) (no direct measure of
veteran’s exposure to radiation was possible but VA’s dose estimate was not clearly erroneous); Fisher
v. Ciba Specialty Chem. Corp., 2007 WL 2302470 (S.D. Ala. 2007) (allowing expert’s qualitative
account of DDT and its metabolites spreading from defendant’s plant to plaintiffs’ property, because
quantification would necessarily rely on speculative data).
way to estimate concentrations in the medium of human exposure.63 Once concentrations are known, dose is readily calculated (as in Section VI.A), but reliably
estimating concentrations can be difficult.
Two methods typically are used to estimate those concentrations. One involves
direct measurement using the tools of analytical chemistry. The second involves the
use of models that are intended to quantify the concentrations resulting from
the movement of chemicals from the source to the media of human exposure.
C. Direct Measurement: Analytical Science
Once the media that could be subject to contamination have been identified
through pathways analysis (Section V.C), one available choice for determining
the concentrations of contaminants involves sampling those media and subjecting the samples taken to chemical analysis. The analysis will not only reveal
the concentrations of chemicals in the media of concern, but should also confirm
their identities. Environmental sampling and analysis is under way all over the
world, at and near contaminated waste sites, in the vicinity of facilities emitting
chemicals to air and water, and in many other circumstances.64
One purpose of such sampling and analysis is to determine whether products
and environmental media contain substances at concentrations that meet existing
regulatory requirements. In many circumstances, regulators have established limits
on the concentrations of certain chemicals in foods, other products, water, air,
and even soils. These limits generally are based on assessments of health risk and
calculations of concentrations that are associated with what the regulators believe
to be negligibly small risks. The calculations are made after first identifying the
total dose of a chemical that is safe (poses a negligible risk) and then determining
the concentration of that chemical in the medium of concern that should not be
exceeded if exposed individuals (typically those at the high end of media contact)
are not to incur a dose greater than the safe one. The most common concentration
limits are regulatory tolerances for pesticide residues in food, Maximum Con-
63. See, e.g., Knight v. Kirby Inland Marine Inc., 482 F.3d 347, 352–53 (5th Cir. 2007) (study
of people with much longer exposure to organic solvents could not support conclusion that plaintiff’s
injuries were caused by such solvents); Kennecott Greens Creek Mining Co. v. Mine Safety & Health
Admin., 476 F.3d 946, 950 (D.C. Cir. 2007) (because diesel particulate matter was difficult to monitor,
MSHA’s surrogate limits on total carbon and elemental carbon were reasonable).
64. See, e.g., Genereux v. American Beryllia Corp., 577 F.3d 350, 366–67 (1st Cir. 2009) (“all
beryllium operations should be periodically air-sampled, and a workspace may be dangerous to human
health even though no dust is visible”); Allen v. Martin Surfacing, 2009 WL 3461145 (D. Mass. 2009)
(where air sampling was not done, expert resorted to modeling plaintiff’s decedent’s exposure); Jowers
v. BOC Group, Inc., 608 F. Supp. 2d 724, 738 (S.D. Miss. 2009) (OSHA measurements showed that
30% of welders experienced manganese fumes at higher than allowable concentrations); In re FEMA
Trailer Formaldehyde Prod. Liab. Litig., 583 F. Supp. 2d at 776 (air sampling revealed formaldehyde
levels higher than allowable).
taminant Levels (MCLs) for drinking water contaminants, National Ambient Air
Quality Standards (NAAQS), and, for workplace exposure, Permissible Exposure
Limits (PELs) or Threshold Limit Values (TLVs).65 Much environmental sampling
and analysis is done, by both government agencies and private organizations, for
the purpose of ascertaining compliance with existing concentration limits (sometimes referred to as standards).
But sampling and analysis also are undertaken to investigate newly identified contamination or to ascertain exposures (and risks) in situations involving
noncompliance with existing standards. As described earlier, information on concentrations in the media through which people are exposed is the necessary first
step in estimating doses.
Although at first glance it might seem that direct measurements of concentrations would provide the most reliable data, there are limits to what can be gained
through this approach.
• How can we be sure that the samples taken are actually representative of
the media sampled?
Standard methods are available to design sampling plans that have
specified probabilities of being representative, but they can never provide
complete assurance. Generally, when contamination is likely to be highly
homogeneous, there is a greater chance of achieving a reasonably representative sample than is the case when it is highly heterogeneous. In the
latter circumstance, obtaining a representative sample, even when very
large numbers of samples are taken, may be unachievable.
• How can we be sure that the samples taken represent contamination over
long periods?
Sampling events may provide a good snapshot of current conditions,
but in circumstances in which concentrations could be changing over
time, and where the health concerns involve long-term exposures, snapshots could be highly misleading. This type of problem may be especially
severe when attempts are being made to reconstruct past exposures, based
on snapshots taken in the present.
• How can we be sure that the analytical work was done properly?
Most major laboratories that routinely engage in this type of analysis
have developed standard operating procedures and quality control procedures. Laboratory certification programs of many types also exist to document performance. When analytical work is performed in certified, highly experienced laboratories, there is a reasonably high likelihood that the analytical results are reliable. But it is very difficult to confirm reliability when analytical work is done in laboratories or by individuals who cannot provide evidence of certification or of longstanding quality control procedures.
65. PELs are official standards promulgated by the Occupational Safety and Health Administration. TLVs are guidance values offered by an organization called the American Conference of Governmental Industrial Hygienists. See, e.g., In re Howard, 570 F.3d 752, 754 (6th Cir. 2009) (challenging PELs for coal mine dust); Jowers v. BOC Group, Inc., 608 F. Supp. 2d 724, 735–36 (S.D. Miss. 2009) (PELs and TLVs for welders’ manganese fume exposure); International Brominated Solvents Ass’n v. American Conf. of Gov. Indus. Hygienists, Inc., 625 F. Supp. 2d 1310 (M.D. Ga. 2008) (challenging TLVs for several chemicals); Miami-Dade County v. EPA, 529 F.3d 1049 (11th Cir. 2008) (MCLs for public drinking water).
• How are data showing the absence of contamination to be interpreted?
In most circumstances involving possible contamination of environmental media, the analysis of some (and sometimes many) of the samples
will fail to find the contaminant. The analytical chemist will often report
“ND” (for nondetect) for such samples. But an ND should never be
considered evidence that the concentration of the contaminant is zero. In
fact, most chemists will (and should) report that the contaminant is “BDL”
(below detection limit). Every analytical method has a nonzero detection
limit; the method is not sensitive to and cannot measure concentrations
below that limit. Thus, for each sample reported as BDL, all that can be
known is that the concentration of contaminant is somewhere below that
limit. If there is clear evidence that the contaminant is present in some of
the samples (its concentration exceeds the method’s detection limit), then it is usually assumed that the samples of the same medium reported as BDL also
contain some level of the contaminant, often and for sound reasons
taken to be one-half the detection limit. Practices for dealing with BDL findings
vary, but treating a BDL result as a concentration of zero is not among the acceptable practices.
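As a sketch of the half-detection-limit convention described above (this is not code from the manual; the function name, units, and illustrative detection limit are assumed), an analyst might substitute one-half the detection limit for each nondetect before computing a summary statistic:

```python
# Sketch: summarizing sample results when some are reported "BDL"
# (below detection limit). A common convention is to substitute one-half
# the detection limit for each nondetect; treating nondetects as zero is
# not an acceptable practice.

def mean_concentration(samples, detection_limit):
    """Average concentration (e.g., mg/L), replacing nondetects (None)
    with one-half the detection limit. Assumes the contaminant was
    clearly detected in at least some samples of the same medium."""
    values = [detection_limit / 2 if s is None else s for s in samples]
    return sum(values) / len(values)

# Four samples: two measured, two nondetects, 0.10 mg/L detection limit.
print(round(mean_concentration([0.30, None, 0.50, None], 0.10), 3))  # 0.225
```

Substituting zero instead would understate the average; substituting the full detection limit would overstate it, which is why the midpoint is a common compromise.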
Sampling and measurement are no doubt useful, but are nonetheless limited
in important ways. The alternative involves modeling. In fact, a combination of
both approaches—one acting as a check on the other—is often the most useful
and reliable.
D. Environmental Models
A model is an attempt to provide a mathematical description of how some feature of the physical world operates. In the matters at hand, a model refers to a
mathematical description of the quantitative relationship between the amount of
a chemical emitted from some source, usually over a specified period of time, and
the concentrations of that chemical in the media of human exposure, again over
some specified time period.66
66. See, e.g., NCAP v. EPA, 544 F.3d 1043 (9th Cir. 2008) (EPA was permitted to rely on
modeling in developing allowable pesticide residual levels); O’Neill v. Sherwin-Williams Co., 2009
WL 2997026, at *5 (C.D. Cal. 2009) (exposure model was inappropriate because it was based on a
different type of paint than plaintiff was exposed to); Hayward v. U.S. Dep’t of Labor, 536 F.3d 376
Models are idealized mathematical expressions of the relationship between
two or more variables. They are usually derived from basic physical and chemical
principles that are well established under idealized circumstances, but may not be
validated under actual field conditions. Models thus cannot generate completely
accurate predictions of chemical concentrations in the environment. In some
cases, however, they are the only method available for estimating exposure—for
example, in assessing the impacts of a facility before it is built or after it has ceased
to operate. In such circumstances, they are necessary elements of exposure assessments and have been used extensively. Models are necessary if projections are to
be made backward or forward in time or to other locations where no measurements have been made.
Typically, a model is developed by first constructing a flow diagram to illustrate
the theoretical pathways of environmental contamination, as shown in Figure 2 and
for a hazardous waste site in Appendix B. These models can be used to estimate
concentrations in the relevant media based on several factors related to the nature of
the site and the chemicals of interest. Model variables include the following:
1. The total amount of chemical present in or emitted from the media that
are its sources;
2. The solubility of the chemical in water;
3. The chemical’s vapor pressure (a measure of volatility);
4. The degree to which a chemical accumulates in fish, livestock, or crops
(bioconcentration or bioaccumulation factor);
5. The nature of the soil present at the site; and
6. The volumes and movement of water around and beneath the site.
Some of this information derives from laboratory studies on the chemical
(the first four points) and some from an investigation at the site (the remaining
two points). The development of the data and modeling of the site often require
the combined skills of chemists, environmental engineers, and hydrogeologists.
In addition to the information listed above, time projection models also require
information on the stability of the chemical of interest. As noted earlier, some
chemicals degrade in the environment very quickly (in a matter of minutes),
whereas others are exceedingly resistant to degradation. Quantitative information
on rates of degradation is often available from laboratory and field studies.
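Degradation is commonly modeled as a first-order process, in which a fixed fraction of the chemical is lost per unit time. A minimal sketch (the half-life value is assumed for illustration, not taken from the manual):

```python
# Sketch of first-order degradation, a common simplifying assumption in
# environmental fate models: C(t) = C0 * exp(-k * t), where the rate
# constant k relates to the half-life by k = ln(2) / t_half.
import math

def concentration_after(c0, half_life_days, t_days):
    """Remaining concentration after t_days of first-order decay."""
    k = math.log(2) / half_life_days
    return c0 * math.exp(-k * t_days)

# A chemical with an assumed 10-day half-life: after 20 days (two
# half-lives), one-quarter of the starting concentration remains.
print(round(concentration_after(100.0, 10.0, 20.0), 6))
```

A chemical with a half-life of minutes effectively vanishes before it can reach a medium of exposure, whereas one with a half-life of years persists through every step of the pathway; this is why stability data are essential inputs to time-projection models.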
Models that assess the exposures associated with air emissions consider the
fact that the opportunity for people to be exposed to chemicals depends upon
their activities and locations.67 These models account for the activity patterns of
(5th Cir. 2008) (a model was used to reconstruct the dose of radiation that the employee was exposed
to); Rodricks & Frankos, supra note 55.
67. See, e.g., Palmer v. Asarco Inc., 2007 WL 2298422 (N.D. Okla. 2007) (children allegedly were exposed to lead by “hand-to-mouth activity ingestion of soil/house dust”); Henricksen
potentially exposed populations and provide estimates of the cumulative exposure
over specified periods.
Perhaps the most widely used models are those that track the fate and transport pathways followed by substances emitted into the air. Knowledge of the
amounts emitted per unit of time (usually obtainable by measurement) from a
given location (a stack of a certain height, for example) provides the basic model
input. Information on wind directions and velocities, the nature of the physical
terrain surrounding the source, and other factors needs to be incorporated into
the modeling. Some substances will remain in the vapor phase after emission, but
chemical degradation (e.g., because of the action of sunlight) could affect media
concentrations. Some models provide for estimating the distributions of soil concentrations for those substances (particulates of a certain size) that may fall during
dispersion. Much effort has been put into developing and validating air dispersion
models.68 Similar models are available to track the movement of contaminants in
both surface and ground waters.
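The textbook core of many air dispersion models is the Gaussian plume equation, which estimates the ground-level, plume-centerline concentration downwind of a stack from the emission rate, wind speed, stack height, and dispersion coefficients. In practice the coefficients come from atmospheric stability-class correlations; in this sketch they are simply assumed, as are the emission rate and wind speed:

```python
# Minimal sketch of a Gaussian plume calculation (ground-level,
# plume-centerline form). All input values are assumed for illustration.
import math

def plume_centerline(Q_g_per_s, u_m_per_s, sigma_y, sigma_z, H_m):
    """Ground-level centerline concentration (g/m^3) downwind of a stack
    of effective height H_m, for emission rate Q and wind speed u.
    sigma_y and sigma_z are the horizontal and vertical dispersion
    coefficients (m) at the receptor's downwind distance."""
    return (Q_g_per_s / (math.pi * u_m_per_s * sigma_y * sigma_z)
            * math.exp(-H_m ** 2 / (2 * sigma_z ** 2)))

# 10 g/s release from a 50-m stack in a 5 m/s wind, with assumed
# sigma_y = 100 m and sigma_z = 50 m at the receptor distance.
c = plume_centerline(10.0, 5.0, 100.0, 50.0, 50.0)
print(c)  # on the order of 1e-4 g/m^3 or less
```

Note how the terrain, wind, and stack-height factors discussed in the text enter: a taller stack (larger H) or a stronger wind (larger u) lowers the ground-level concentration at a given receptor.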
The fate and transport modeling issue becomes more complex when attempts
are made to follow a chemical’s movement from air, water, and soils into the food
chain and to estimate concentrations in the edible portions of plants and animals.69
Most of the effort in this area involves the use of empirical data (e.g., What does
the scientific literature tell us about the quantitative relationships between the
concentration of cadmium in soil and its concentration in the edible portions of
plants grown in that soil?). This type of empirical information, together with general data on chemical absorption into, distribution in, and excretion from living
systems, is the usual approach to ascertain concentrations in these food media.70
Many models for environmental fate and transport analysis are available. It
is not possible to specify easily which models have established validity and which
have not; rather, some are preferred for some purposes and others are preferred
for different purposes.
Perhaps the best that can be done to scrutinize the work of an expert in this
area is to
• Require that the expert describe in full the basis for model selection;
• Ask the expert to describe the standing of the model with authoritative
bodies such as EPA;
• Require the expert to state why other possible models are not suitable;
v. ConocoPhillips Co., 605 F. Supp. 2d 1142, 1164 (E.D. Wash. 2009) (expert calculated plaintiff’s
benzene exposure by adjusting study results to account for plaintiff’s activities); Junk v. Terminix Int’l
Co., 2008 WL 6808423 (S.D. Iowa 2008) (study measured chlorpyrifos exposure of inhabitants of
houses sprayed indoors); In re W.R. Grace & Co., 355 B.R. 462 (Bankr. D. Del. 2006) (asbestos in
attic insulation released by normal activity).
68. National Research Council, Models in Environmental Regulatory Decision Making (2007).
69. Ecologists also use modeling results to evaluate risks to wildlife, plants, and ecosystems.
70. National Research Council, supra note 68.
• Require that the expert describe the scientific basis and underlying assumptions of the model, and the ways in which the model has been verified;71
and
• Require the expert to describe the likely size of error associated with
model results.
Other issues pertaining to the sources and reliability of the data used in the
application of a model can be similarly pursued.
Results from modeling are concentrations in media of concern over time. If
sampling and analysis data are available for the same media, they can be compared
with the modeling result, and efforts can be made to reconcile the two and arrive
at the most likely values (or range of likely values).
E. Integrated Exposure/Dose Assessment
We have shown the various methods used to determine the concentrations of
chemicals in products and in various environmental media and also the methods
used to determine doses from each of the relevant media. Dose estimation as
described in Section VI.A applies to each of the relevant routes of exposure.
In many cases, the dose issue concerns one chemical in one product and only
one route of exposure. But numerous variations on this basic scenario are possible:
one chemical in several products or environmental media, many chemicals in one
product or environmental medium, or many chemicals in many environmental
media. Even though some exposure situations can be complex and involve multiple chemicals through both direct and indirect pathways, the exposure assessment methods and principles described here can be applied. Exposures occurring
by different routes can be added together, or they can be reported separately.
The decisions on the final dose estimates and their form of presentation can be
made only after discussions with the users of that information—typically the toxicologists and epidemiologists involved in the risk assessment.72 The dose metrics
emerging from the exposure assessment need to match the dose metrics that are
used to describe toxicity risks.
One additional point should be highlighted. The principle that exposure to
chemicals through foods and consumer products typically focuses on high-end
consumers of those foods or products also applies in environmental settings. Thus,
71. This point is to ensure that the expert truly understands the model and its limits and that he
or she is not simply using some “black box” computer software.
72. See, e.g., American Farm Bureau Fed’n v. EPA, 559 F.3d 512 (D.C. Cir. 2009) (challenging
EPA’s risk assessment for fine PM); Miami-Dade County v. EPA, 529 F.3d 1049 (11th Cir. 2008)
(assessment of risk of wastewater disposal methods to drinking water); Kennecott Greens Creek Min.
Co. v. Mine Safety & Health Admin., 476 F.3d 946 (D.C. Cir. 2007) (risk assessment of diesel particulate matter to miners); Rowe v. E.I. du Pont de Nemours & Co., 2008 WL 5412912, at *12 (D.N.J.
2008) (risk assessment for proposed class).
for example, it is possible to assert with relatively high confidence that almost no
one consumes more than 3.5 L of water a day and that almost everyone consumes
less. If the dose calculation assumes a water consumption rate of 3.5 L/day, then
the risk estimated for that dose is almost certainly an upper limit on the population risk, and regulatory actions based on that risk will almost certainly be highly
protective. For regulatory and public health decisionmaking, such a precautionary
approach has a great deal of precedent, although care must be taken to ensure
adherence to scientific data and principles.73
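The arithmetic behind such an upper-bound estimate is the same simple dose calculation used throughout; what makes it precautionary is the choice of a high-end intake. In this sketch, the 3.5 L/day figure is the high-end intake discussed above, while the contaminant concentration and body weight are assumed purely for illustration:

```python
# Illustration of a high-end ("upper bound") dose estimate for a drinking
# water contaminant. The concentration and body weight are assumed.

def water_dose(conc_mg_per_L, intake_L_per_day, body_weight_kg):
    """Dose in mg per kg body weight per day."""
    return conc_mg_per_L * intake_L_per_day / body_weight_kg

upper_bound = water_dose(0.010, 3.5, 70.0)   # almost no one drinks more
more_typical = water_dose(0.010, 1.5, 70.0)  # a more typical intake

# Risk estimated from the upper-bound dose will overstate the risk for
# nearly everyone in the population.
print(upper_bound, more_typical)
```

Because almost no one actually consumes 3.5 L/day, a risk estimate built on the upper-bound dose bounds the population risk from above, which is the sense in which regulatory actions based on it are "highly protective."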
This approach becomes problematic, however, if applied to assessments of
exposures that may have been incurred in the past by individuals claiming to have
been harmed by them. In such cases, it would seem that there is no basis for a
precautionary approach; an approach based on attempts to accurately describe the
individual’s exposure would seem to be necessary. Whatever the case, the exposure scientist must be careful to ensure accurate description of the exposure concentration (and resulting dose), so that the users of the information can understand
whether upper limits or more typical exposures and doses have been provided.
VII. Into the Body
A. Body Burdens
Section V described how chemicals in the environment contact the three major
portals of entry into the body—the respiratory tract, the GI tract, and the skin. For
some chemicals, the dose contacting one or more of those portals may be sufficient
to cause harm before those chemicals are absorbed into the body; that is, they may
cause one or more forms of toxicity to the respiratory system, to the GI tract, or
to the skin. Although these forms of contact toxicity can be important, it is also
important to consider the many forms of systemic toxicity. The latter refers to a
large number of toxic manifestations that can affect any of the organs or organ systems of the body after a chemical is absorbed into the bloodstream and distributed
within the body. Recall also that most chemicals are acted upon by certain large
protein molecules, called enzymes, contained in cells, particularly those of the liver,
the skin, and the lungs, and are converted to new compounds, called metabolites
(the process leading to these changes is called metabolism). Metabolite formation
73. National Research Council, Evolution and Use of Risk Assessment in the Environmental Protection Agency: Current Practice and Future Prospects, in Science and Decisions: Advancing Risk Assessment
(2008). Those who must comply with regulations that were developed based on a high degree of
caution often protest that more accurate assessments should be used as their basis. For several reasons,
truly accurate prediction of risk is difficult to achieve (see Bernard D. Goldstein & Mary Sue Henifin,
Reference Guide on Toxicology, in this manual), while predicting an upper bound on the risk is not.
At the same time, unless carefully done and described, upper-bound estimates may be so remote from
reality that decisions based on them should be avoided.
is one of the body’s mechanisms for creating compounds that are easily removed
from the body by one or more excretion processes. Unfortunately, metabolism
sometimes creates new compounds that are more toxic than the original (so-called
parent) molecule, and, if the internal dose of toxic metabolite exceeds a so-called threshold, toxic harm may occur. Of course, not all toxicity is produced by
metabolites; in some cases harm may be caused directly by the parent compound.74
As in the other areas of exposure science that have been discussed, it usually
becomes important to move from description to quantification. Exposure scientists
seek to understand the amount of chemical absorbed into the body after contact
(i.e., the fraction of the dose that is absorbed), the amount of chemical reaching
and distributed within the body (the blood concentration being the most easily
measurable), and the rate of loss of the chemical from the body. The science
devoted to understanding these important phenomena is called pharmacokinetics
(drug rates). That name came to be used because most of the developmental work
in this area related to the behavior of pharmaceuticals in the body, but the tools of
pharmacokinetics have been extended to study all types of chemicals.
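A minimal sketch of the kind of quantitative question pharmacokinetics answers is the one-compartment model with first-order elimination, a standard textbook simplification (the intake and half-life values here are assumed for illustration): with a constant daily intake, the amount of chemical in the body builds toward a steady state set by the intake rate and the elimination rate.

```python
# Sketch of a one-compartment pharmacokinetic model with first-order
# elimination. Each day the body eliminates a fixed fraction of its burden
# (set by the half-life) and then receives a new dose. Values assumed.
import math

def body_burden_series(intake_mg_per_day, half_life_days, n_days):
    """Daily body burden (mg) under constant intake and first-order loss."""
    k = math.log(2) / half_life_days
    burden, series = 0.0, []
    for _ in range(n_days):
        burden = burden * math.exp(-k) + intake_mg_per_day
        series.append(burden)
    return series

# At the same daily intake, a persistent chemical (assumed 60-day
# half-life) accumulates to a far larger body burden than a short-lived
# one (assumed 1-day half-life).
persistent = body_burden_series(1.0, 60.0, 365)
short_lived = body_burden_series(1.0, 1.0, 365)
print(persistent[-1], short_lived[-1])
```

This is why, as discussed in the biomonitoring section below, persistent chemicals dominate what is found in blood and urine surveys: the longer the half-life, the more the body integrates past exposures.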
Pharmacokinetics is important because it reveals where in the body a chemical
is most likely to cause harm (where the greatest concentrations, or target site doses,
are reached for the longest period of time) and also the concentration—duration
level necessary to cause harm. To understand these relationships, pharmacokinetic
studies typically are carried out in conjunction with toxicity studies in animals,
and their results are used to assess possible toxic risk in humans.75
Pharmacokineticists do not ordinarily characterize themselves as exposure
scientists; more often they are toxicologists or pharmacologists. But they are in fact
extending the usual work of exposure scientists into the body, and it is here that
we see the interface between exposure science and toxicology and epidemiology.
B. Monitoring the Body (Biomonitoring)
As long as we live in a world of chemicals, we will be exposed to them. If analytical chemists developed sufficiently sensitive measuring techniques, it would not
be far-fetched to say that we could find within the human body, at some level
and for some period, virtually any of the tens of thousands of chemicals, natural
and synthetic, with which it comes into contact. Some would be found only
occasionally, some continuously; some would be found to persist for days, weeks,
74. J.V. Rodricks, From Exposure to Dose, in Calculated Risks: The Toxicity and Human Health
Risks of Chemicals in Our Environment (2d ed. 2007).
75. See Bernard D. Goldstein & Mary Sue Henifin, Reference Guide on Toxicology, in this
manual. See also, e.g., In re Fosamax Prod. Liab. Litig., 645 F. Supp. 2d 164, 186 (S.D.N.Y. 2009)
(rat and dog studies showing a bisphosphonate caused jaw necrosis relevant to whether Fosamax,
another type of bisphosphonate, could cause jaw necrosis in humans); Rose v. Matrixx Initiatives,
Inc., 2009 WL 902311, at *14 (W.D. Tenn. 2009) (studies in animals of nasal spray effects could not
be extrapolated to humans because olfactory physiology was too different).
or even longer, whereas others would persist for only minutes or hours. The
concentrations in blood would likely vary over many orders of magnitude. Currently, we can measure only a few thousand chemicals in the body, a large share
of them pharmaceuticals, nutrients, and substances of abuse. Some standards for
occupational exposures are expressed as allowable blood or urine concentrations,
and their measurement is a useful supplement to air monitoring.76
The environmental chemical that has perhaps received the most attention in
this area of exposure science is lead (chemical symbol Pb). Indeed, lead may be the
most studied of all environmental substances. After it was learned in the 1950s that
the concentration of lead in blood could be easily measured, it became common
to sample and test the blood level of lead (BPb) in individuals who had suffered
one or more forms of this metal’s toxicity. Some epidemiological studies of lead
began to include BPb as the measure of exposure, and since the 1970s, hundreds
of such studies involving lead have reported results using this measure.77
A blood measure such as BPb is particularly useful for substances that, like lead, have (or did have)
a relatively large number of environmental sources.78 The simple measure of
BPb provides a single, integrated measure of exposures through multiple sources,
pathways, and routes (although this measure reflects relatively recent and not
long-term exposure).79 This is perhaps the best example of the use of target site
dose in risk assessment.
The Centers for Disease Control and Prevention (CDC) began, in the late
1970s, to take blood samples from a relatively large number of children as part
of its National Health and Nutrition Examination Survey (NHANES). Children
were selected because it was known that they take up more lead from their envi-
76. See, e.g., Haas v. Peake, 525 F.3d 1168, 1177 (Fed. Cir. 2008) (presumption of dioxin
exposure instituted because of the difficulty of measuring dioxin in the body); Young v. Burton, 567
F. Supp. 2d 121 (D.D.C. 2008) (hormone and enzyme levels allegedly altered by exposure to biotoxins in mold); Hazlehurst v. Sec’y of Dep’t of Health & Human Servs., 2009 WL 332306, at *62
(Fed. Cl. 2009) (study measuring porphyrin in urine as a marker for mercury in the body); United
States v. Bentham, 414 F. Supp. 2d 472 (S.D.N.Y. 2006) (cocaine use monitored by a “sweatpatch”
on the skin).
77. National Center for Environmental Health, Centers for Disease Control and Prevention,
Fourth National Report on Human Exposure to Environmental Chemicals (2009), available at http://
www.cdc.gov/exposurereport/pdf/FourthReport.pdf (last visited July 1, 2010).
78. See, e.g., Potter v. EnerSys, Inc., 2009 WL 3764031 (E.D. Ky. 2009) (alleged lead exposure
from working on battery manufacturing site); City of North Chicago v. Hanovnikian, 2006 WL
1519578 (N.D. Ill. 2006) (alleged lead contamination of soil); Perry ex rel. Perry v. Frederick Inv.
Corp., 509 F. Supp. 2d 11 (D.D.C. 2007) (residential lead paint exposure); Goodstein v. Continental
Cas. Co., 509 F.3d 1042 (9th Cir. 2007) (environmental contamination from lead waste site);
Evansville Greenway & Remediation Trust v. Southern Indiana Gas & Elec., 661 F. Supp. 2d 989
(S.D. Ind. 2009) (contamination of battery recycling site).
79. BPb usually is reported in units of micrograms (one-millionth of a gram) in each deciliter
(one-tenth of a liter) of blood (µg/dL). More recently, noninvasive methods to measure lead levels in
teeth and bones have become available; such measures reflect cumulative exposures over long periods,
but their relationships to health are less clear than those based on BPb.
ronments (air,80 water, food, paint, soils and dusts, emissions from lead and other
metal smelters, consumer products, and more) than do adults; they are also, especially during early periods of development, more vulnerable to the adverse effects
of lead than are adults. Nationwide, childhood BPb levels averaged 15–20 µg/dL
during the 1970s, with substantial numbers of children having BPb levels well
in excess of what was at the time thought to be the minimum BPb associated
with adverse health effects (40 µg/dL). The most recent NHANES surveys reveal
that average childhood levels are in the range of 2 µg/dL, although there remain
substantial numbers of children with levels greater than the current CDC health
guideline of 10 µg/dL.81
Lead is not the only chemical now being studied under the NHANES biomonitoring program. The most recent surveys involve nationwide sampling of
blood and urine from close to 8000 children and adults for more than 100 different chemicals.82 The program focuses on commonly used pesticides and consumer
products and certain ubiquitous environmental contaminants, particularly those
that persist in the body for long periods. Not surprisingly, most of these chemicals
have been detected in some individuals. The NHANES program will continue,
and similar programs are under way in government and research centers around
the world.
The presence of a chemical in the body is not evidence that it is causing
harm. And in some cases—those that involve chemicals, such as the metals and
some organic compounds that occur naturally—the NHANES findings may simply reflect natural background levels.83 In any case, data such as these provide far
more direct measures of dose (often referred to as body burden), and in those cases
(which are increasing in number) in which epidemiologists and toxicologists are
able to relate disease rates to body burdens (instead of to external dose, as is the
usual case), far more accurate measures of human risk should become available.
VIII. Evaluating the Scientific Quality of an Exposure Assessment
Exposure scientists may offer expert testimony regarding exposures to chemicals incurred by individuals or populations. Their assessments typically will include
80. At the time of the first NHANES lead survey, leaded gasoline, which emitted lead to air
and to soil, was in wide use. That use, at least in the United States, came to an end in the 1980s.
For a discussion of the routes of exposure to toxic substances, see Bernard D. Goldstein & Mary Sue
Henifin, Reference Guide on Toxicology, Section III.A, in this manual.
81. There is developing evidence of IQ deficits in children at levels below 10 µg/dL.
82. National Center for Environmental Health, supra note 77.
83. Natural background levels of certain metals may, in some geological regions, be quite high
and may even be associated with excess disease.
a description of how and when exposures have or could occur, the identities
of the chemicals involved, the routes of exposure, the doses incurred, and the
durations of exposure. In some cases, testimony will include a description and
quantification of body burdens. If the exposure scientist is also an epidemiologist
or toxicologist,84 he or she may offer additional testimony on the health risks
associated with those exposures or even regarding the question of whether such
exposures have actually caused disease.
For purposes of this reference guide, it is assumed that questions regarding
disease risk and causation are beyond the bounds of exposure science. Below is
offered a set of questions that exposure scientists should be able to answer, with
appropriate documentation and scientific reasoning, to support any given exposure
assessment:
• Is the purpose of the assessment clear? Is the exposed population specified?
• What is the source(s) of exposure?
• When did the exposures occur: past? present? If they are occurring now,
will they continue to occur?
• What is the assumed duration of exposure, and what is its basis?
• What are the pathways from the source to the exposed individuals? How
has it been established that those pathways exist (past? present? future?).
• What is the concentration of the chemical in the media with which the
exposed population comes into contact (past? present? future?). What is
the basis for this answer: direct measurement? modeling?
• If the concentration is based on direct measurement, what procedures
were followed in obtaining that measurement? Was media sampling sufficient to ensure that it was representative? If not, why is representativeness
not important? Were validated analytical methods used by an accredited
laboratory? If not, how can one be assured that the analytical results are
reliable?
• If models were used, what is their reliability (see Section VI.D)? What is
the variability over time in concentrations in the media of concern? How
has the variability been determined?
• What is the variability among members of the population in their exposure
to the chemical of concern? How is this known?
• What is known or assumed about the nature and extent of media contact
by members of the exposed population? How has this been ascertained?
• What dose, over what period of time, by which routes, has been incurred?
What calculations support this determination?
84. See Section IX, which deals with the question of the qualifications of exposure scientists.
In many cases, the work of exposure experts is turned over to the health experts to incorporate into
their evaluation of risk and disease causation. In some cases, usually the less complex ones, exposure
assessments may be undertaken by the health experts.
• What is the likely error in the exposure estimates?
• What uncertainties are associated with the dose/duration findings? Is it a
“most likely” estimate, or is it an “upper limit”? To what fraction of the
population is the “upper limit” likely to apply?
• What has been omitted from the exposure assessment, and why?
These questions are perhaps the minimum that an expert should be able to
address when offering testimony. Obviously, most such questions can be answered
fully only if the expert can support the answers with documentation.
As noted in Section III.D, the evaluation of whether a current medical condition is causally related to exposures occurring in the past (prior to the onset or
diagnosis of the medical condition) requires a retrospective examination of the
conditions that led to those exposures. Thus, for example, a plaintiff suffering
from leukemia and who alleges that benzene exposure in his or her workplace
caused the disease may easily demonstrate the fact of benzene exposure. But ordinarily an estimation of the quantitative magnitude and duration of the incurred
benzene exposure is necessary to evaluate the plausibility of the causation claim.85
The methodological tools necessary to “reconstruct” the plaintiff’s past exposure
are identical to those used to estimate current exposures, but the availability of
the data necessary to apply those methods may be limited or, in some cases,
nonexistent.
Reconstruction of occupational exposures has been a relatively successful
pursuit, because often historical industrial hygiene data are available involving the
measurement of workplace air levels of chemicals. If it is possible, through the
examination of employment records, to reconstruct an individual’s job history,
it may be possible to ascertain that individual’s exposure history.86 Guidelines
for occupational exposure reconstruction have been published by the American
Industrial Hygiene Association.87 Clearly, experts presenting testimony regarding
exposure reconstruction must be queried heavily on the sources of data used in
their applications of exposure methods.
IX. Qualifications of Exposure Scientists
Exposure science is not yet a true academic discipline. Rather, scientists and
engineers from diverse backgrounds have, over the past several decades, come
together to give shape and substance and scientific rigor to what is clearly a critical
85. See Michael D. Green et al., Reference Guide on Epidemiology, Section VII, in this manual.
86. T.W. Armstrong, Exposure Reconstruction, in Mathematical Models for Estimating Occupational Exposures to Chemicals (Charles B. Keil et al. eds., 2d ed. 2009).
87. American Industrial Hygiene Association, Guideline on Occupational Exposure Reconstruction (S.M. Viet et al. eds., 2008).
element in understanding toxicity risks and disease causation. Typically, those
who have contributed to this developing field have come from backgrounds in
industrial hygiene, environmental and analytical chemistry, chemical engineering,
hydrogeology, and even behavioral sciences (pertaining to those aspects of human
behavior that affect exposures).88 Most toxicologists and epidemiologists have considerable experience in exposure science, as do pharmacologists who study drug
kinetics and disposition. Many exposure assessments involve collaborative efforts
among members of these various disciplines.
There are currently no certification programs available for exposure scientists,
but increasingly exposure science research appears in publications such as Environmental Health Perspectives, Risk Analysis, and the Journal of Exposure Science and
Environmental Epidemiology.
Certification programs do exist in occupational exposure science. Qualified
industrial hygienists will almost always be certified (CIH). The American Industrial Hygiene Association Journal includes much scholarly work related to exposure
science.
88. See, e.g., Allen v. Martin Surfacing, 2009 WL 3461145, 2008 U.S. Dist. LEXIS 111658, 263
F.R.D. 47 (D. Mass. 2008) (industrial hygienist qualified to testify regarding concentration and duration of plaintiffs’ decedent’s exposure to toluene and other chemicals); Buzzerd v. Flagship Carwash of
Port St. Lucie, Inc., 669 F. Supp. 2d 514 (M.D. Pa. 2009) (industrial hygienist qualified to opine on
carbon monoxide exposure, but his conclusions were not based on reliable methodology).
Appendix A: Presentation of Data—Concentration Units
Choosing the proper units to express concentrations of chemicals in environmental media is crucial for precisely defining exposure. Chemical concentrations
in environmental media usually are reported in one of two forms: as numeric
ratios, such as parts per million or billion (ppm and ppb, respectively), or as unit
weight of the chemical per weight or volume of environmental media, such
as milligrams per kilogram (mg/kg) or milligrams per cubic meter (mg/m³).
Although concentrations expressed as parts per million or parts per billion are
easier for some people to conceptualize, their use assumes that media are always
sampled at standard temperature and pressure (25°C and 760 torr, respectively).
Consequently, scientists prefer to express chemical concentrations as weight of
chemical per unit weight or volume of media. This method also makes conversions to dose equivalents, usually expressed in terms of weight of chemical per
unit body weight (mg/kg bw), more convenient.
To permit the presentation of results without excessive zeroes before or after the
decimal point, appropriate units are needed. The choice of units depends on both
the medium in which the chemical resides and the amount of chemical measured.
For example, if 50 nanograms of chemical were found in 1 L of water, the appropriate units would be ng/L, rather than 0.00005 mg/L. If 50 grams were found instead,
the appropriate units would be 50,000 mg/L, because milligrams are generally the
largest units used to express the mass of a chemical in media (Table 1).
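As a purely illustrative sketch of this unit-selection rule (the function name, the picogram base unit, and the cutoff of 1 are our own conventions for the example, not anything prescribed in this guide):

```python
# Pick a mass unit for a chemical found in 1 L of water so the numeric value
# avoids excessive zeroes. Masses are taken in picograms (integers) so the
# arithmetic is exact; mg is the largest unit used for chemicals in media.
UNITS = [("mg", 10**9), ("µg", 10**6), ("ng", 10**3), ("pg", 1)]  # picograms per unit

def express_per_liter(picograms_in_one_liter):
    """Return (value, unit) using the largest unit that keeps the value >= 1."""
    for name, pg_per_unit in UNITS:
        if picograms_in_one_liter >= pg_per_unit:
            return picograms_in_one_liter / pg_per_unit, name + "/L"
    # For masses below 1 pg, fall back to picograms.
    return picograms_in_one_liter, "pg/L"

# The text's examples: 50 nanograms in 1 L -> 50 ng/L, not 0.00005 mg/L;
# 50 grams in 1 L -> 50,000 mg/L.
print(express_per_liter(50 * 10**3))    # 50 ng, in pg -> (50.0, 'ng/L')
print(express_per_liter(50 * 10**12))   # 50 g, in pg -> (50000.0, 'mg/L')
```

The two printed cases reproduce the worked examples in the paragraph above.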
Table 1. Weight of Chemical per Unit Weight of Medium

Preferred Unit    Alternative Unit
mg/kg             ppm (parts per million)
µg/kg             ppb (parts per billion)
ng/kg             ppt (parts per trillion)
pg/kg             ppq (parts per quadrillion)
In water or food, concentration expressed by the preferred unit equals concentration expressed by the alternative unit; thus, 2 mg/kg = 2 ppm. One mg
(10⁻³ g) per kg (10³ g) equals 1 part per million (10⁻³/10³ = 10⁻⁶). Similarly,
1 µg (10⁻⁶ g) per kilogram (10³ g) equals 1 part per billion (10⁻⁶/10³ = 10⁻⁹),
and so on (Table 2).
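These equivalences can be checked numerically; the short script below merely restates the arithmetic in the text, using exact rational arithmetic so no floating-point rounding intrudes:

```python
# Verify: in water or food, mass-per-mass concentrations map directly onto
# parts-per notation (mg/kg = ppm, µg/kg = ppb).
from fractions import Fraction

MG = Fraction(1, 10**3)   # one milligram, in grams (10^-3 g)
UG = Fraction(1, 10**6)   # one microgram, in grams (10^-6 g)
KG = Fraction(10**3)      # one kilogram, in grams (10^3 g)

# 2 mg of chemical per kg of medium, as a dimensionless mass fraction:
assert 2 * MG / KG == Fraction(2, 10**6)   # i.e., 2 parts per million

# 1 µg per kg, as a mass fraction:
assert UG / KG == Fraction(1, 10**9)       # i.e., 1 part per billion
```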
Note that in air, parts per million and parts per billion have different meanings than they do in water or food; to avoid confusion, it is always preferable
to express air concentrations in weight of chemical per unit volume (rather than
weight) of air (usually cubic meters, m³).
Table 2. Weight of Chemical per Unit Volume of Medium

Water           Air
mg/L = ppm      mg/m³ ≠ ppm
µg/L = ppb      µg/m³ ≠ ppb
ng/L = ppt      ng/m³ ≠ ppt
Appendix B: Hazardous Waste Site Exposure Assessment
Several principles of exposure assessment can be illustrated by examining the
steps taken to evaluate a hazardous waste disposal site. From 1964 to 1972, more
than 300,000 55-gallon drums of solid and liquid pesticide production wastes
were buried in shallow trenches at a hazardous waste disposal site in Hardeman
County, Tennessee. As early as 1965, county engineers had raised concerns that
these operations might have affected the aquifer supplying drinking water to the
City of Memphis, Tennessee. The State of Tennessee ordered the landfill to stop
accepting hazardous waste in 1972; all operations were reported to have ceased
by 1975. Testing in 1978 confirmed the presence of toxic chemicals in domestic
wells, and by January 1979 all uses of the contaminated well water had been
discontinued.
Among the chemicals of concern detected in the ground water were benzene,
carbon tetrachloride, chlordane, chlorobenzene, chloroform, and several other
pesticides or chemicals associated with pesticide production. As is often the case
for ground water polluted by landfills, the observed concentrations fluctuated over
a relatively wide range. For example, in a domestic well approximately 1500 feet
north of the landfill, carbon tetrachloride concentrations ranged from 10 ppm
to 20 ppm between November 1978 and November 1979; from May 1981 to
June 1982, carbon tetrachloride levels varied from 18 ppm to 164 ppm.
The chemicals of greatest concern detected during ground-water monitoring
near the Hardeman site included carbon tetrachloride, chloroform, and tetrachloroethylene. For each of these three chemicals, the concentrations detected in
well water were significantly elevated over levels typically found in potable water.
Health surveys conducted in 1978 and 1982 suggested that these chemicals might
be causing a variety of health problems in nearby residents.
To confirm the cause-and-effect relationship suggested by the health surveys, an exposure assessment was conducted so that the findings of the health
surveys could be compared to adverse health impacts predicted from exposure
estimates and toxicological data from laboratory experiments. The exposure assessment for the Hardeman site focused on carbon tetrachloride, because of the high
concentrations of this chemical found in the ground water and the severity of the
potential health effects associated with exposure to it.
To estimate the range of possible exposures, the Hardeman site assessment
considered exposures of both an adult and an infant. The exposure assessor then
needed to identify the pathways of exposure that might be important. For the
infant, the following exposure pathways were examined:
• Consumption of formula made using well water,
• Dermal absorption during bathing in contaminated water, and
• In-utero exposure of the fetus through exposure of the mother during
pregnancy.
Adult exposures were evaluated for two pathways:
• Consumption of contaminated drinking water and
• Inhalation of carbon tetrachloride emanating from water during showers.
Because measurements of concentrations of carbon tetrachloride in the ground
water were scant before 1978, estimates were modeled for these years; measured
concentrations were used for 1978, the last year residents utilized ground water
for drinking. Standard assumptions regarding the ingestion of water by adults (2 L/
day) were used; water consumption by a child was assumed to be 0.5 L/day for 3
months following birth. Dermal absorption by infants was estimated by assuming
that the child bathed in 30 L/day of well water, that 50% of this volume contacted
the skin, and that 10% of the contaminant was absorbed through the skin. Three
baths per week were assumed for the first 3 months after birth. In-utero exposure
was estimated assuming equal concentrations of carbon tetrachloride in fetal and
maternal blood. The concentration of carbon tetrachloride in air during showering was calculated assuming that it would quickly reach equilibrium with carbon
tetrachloride in the shower water.
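For illustration, the ingestion and bathing assumptions above can be restated as a short calculation. The water concentration (20 mg/L, i.e., 20 ppm) and the body weights used below are hypothetical placeholders, not values from the Hardeman assessment, which relied on modeled and measured concentrations:

```python
def ingestion_dose(conc_mg_per_l, intake_l_per_day, body_weight_kg):
    """Daily ingested dose rate, in mg per kg of body weight per day."""
    return conc_mg_per_l * intake_l_per_day / body_weight_kg

def dermal_dose_mg_per_day(conc_mg_per_l, bath_volume_l=30.0,
                           skin_contact_fraction=0.50,
                           absorption_fraction=0.10,
                           baths_per_week=3):
    """Average daily mass absorbed through the skin during bathing, per the
    stated assumptions: 30 L of bath water, 50% contacting the skin,
    10% of that absorbed, 3 baths per week."""
    mass_per_bath = (conc_mg_per_l * bath_volume_l *
                     skin_contact_fraction * absorption_fraction)
    return mass_per_bath * baths_per_week / 7.0

# Hypothetical inputs: 20 mg/L in well water; 70-kg adult; 4-kg infant.
adult = ingestion_dose(20.0, 2.0, 70.0)    # standard 2 L/day assumption
infant = ingestion_dose(20.0, 0.5, 4.0)    # 0.5 L/day for 3 months after birth
print(round(adult, 3), round(infant, 3))   # -> 0.571 2.5 with these inputs
```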
In Table 3, carbon tetrachloride exposure estimates for the infant and adult are
compared with the minimum daily exposure producing liver damage in guinea pigs
and the lifetime cumulative exposure producing liver cancer in mice. Daily exposure
rates were based on a predicted yearly average exposure during the highest year of
exposure. Monitoring data indicate that the concentration of carbon tetrachloride
in the ground water may have varied by a factor of 10 around the mean. The
maximum daily exposure rate may have been considerably higher than the estimates
presented in the table, whereas the long-term averages may have been lower.
Table 3. Carbon Tetrachloride Exposure Estimates for Infants and Adults
Compared with Minimum Daily Exposure Producing Liver
Damage in Guinea Pigs and Lifetime Cumulative Exposure
Producing Liver Cancer in Mice
                                  Daily Dose Rate (mg/kg/day)
Liver damage in guinea pigs       1.5
Estimated infant exposure         1.8
Estimated adult exposure          0.3

                                  Cumulative Dose (mg/kg)
40% Liver tumors in mice          1200
Estimated adult exposure          284
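The comparisons in Table 3 reduce to simple ratios; the lines below, an illustration only with variable names of our own choosing, restate that arithmetic (whether such comparisons support causal inferences is a toxicological question, not a computational one):

```python
# Values taken directly from Table 3.
guinea_pig_liver_damage_daily = 1.5   # minimum daily dose producing damage, mg/kg/day
infant_estimate_daily = 1.8           # mg/kg/day
adult_estimate_daily = 0.3            # mg/kg/day

mouse_tumor_cumulative = 1200.0       # lifetime cumulative dose, 40% liver tumors, mg/kg
adult_estimate_cumulative = 284.0     # mg/kg

# The estimated infant dose rate somewhat exceeds the guinea-pig damage threshold:
print(infant_estimate_daily / guinea_pig_liver_damage_daily)   # ~1.2
# The adult cumulative estimate is roughly a quarter of the mouse tumor dose:
print(adult_estimate_cumulative / mouse_tumor_cumulative)      # ~0.237
```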
Glossary of Terms
absorbed dose. The amount of a substance that actually enters the body following absorption.
absorption. The penetration of a substance through a barrier (e.g., the skin, the
gut, or the lungs).
acute exposure. An exposure of short duration and/or rapid onset. An acute
toxic effect is one that develops during or shortly after an acute exposure to
a toxic substance.
average daily dose (ADD). The average dose received on any given day during
a period of exposure, expressed in mg/kg body weight per day. Ordinarily
used in assessing noncancer risks.
bioavailability. The rate and extent to which a chemical or chemical breakdown
product enters the general circulation, thereby permitting access to the site
of toxic action.
body burden. The total amount of a chemical present or stored in the body. In
humans, body burden is an important measure of exposure to chemicals that
tend to accumulate in fat cells, such as DDT, PCBs, or dioxins.
chronic exposure. A persistent, recurring, or long-term exposure, as distinguished from an acute exposure. Chronic exposure may result in health
effects (such as cancer) that are delayed in onset, occurring long after exposure
has ceased.
direct exposure. Exposure of a subject who comes into contact with a chemical via the medium in which it was initially released to the environment.
Examples include exposures mediated by cosmetics, other consumer products,
some food and beverage additives, medical devices, over-the-counter drugs,
and single-medium environmental exposures.
dose. The amount of a substance entering a person, usually expressed for chemicals in the form of weight of the substance (generally in milligrams (mg) or
micrograms (µg)) per unit of body weight (generally in kilograms (kg)). It is
necessary to specify whether the dose referred to is applied or absorbed. The
time over which it is received must also be specified. The time of interest is
typically 1 day. If the duration of exposure is specified, dose is actually a dose
rate and is expressed as mg or µg/kg per day.
dose–response assessment. An analysis of the relationship between the dose
administered to a group and the frequency or magnitude of the biological
effect (response).
duration of exposure. Toxicologically, there are three categories describing
duration of exposure: acute (one time), subchronic (repeated, for a fraction
of a lifetime), and chronic (repeated, for nearly a lifetime).
environmental media. Air, water, soils, and food; consumer products may also
be considered media. Chemicals may be directly and intentionally introduced
into certain media. Others may move from their sources through one or more
media before they reach the media with which people have contact.
exposure. The opportunity to receive a dose through direct contact with a
chemical or medium containing a chemical. See also direct exposure; indirect
exposure.
exposure assessment. The process of describing, for a population at risk, the
amounts of chemicals to which individuals are exposed, or the distribution
of exposures within a population, or the average exposure over an entire
population.
frequency of exposure. The number of times an exposure occurs in a given
period; exposure may be continuous, discontinuous but regular (e.g., once
daily), or intermittent (e.g., less than daily, with no standard quantitative
definition).
indirect exposure. Often defined as an exposure involving multimedia transport
of chemicals from source to exposed individual. Examples include exposures
to chemicals deposited onto soils from the air, chemicals released into the
ground water beneath a hazardous waste site, or consumption of fruits or
vegetables with pesticide residues.
intake. The amount of contact with a medium containing a chemical; used for
estimating the dose received from a particular medium.
levels. An alternative term for expressing chemical concentration in environmental media. Usually expressed as mass per unit volume or unit weight in the
medium of interest.
lifetime average daily dose (LADD). Total dose received over a lifetime multiplied by the fraction of lifetime during which exposure occurs, expressed in
mg/kg body weight per day. Ordinarily used for assessing cancer risk.
models. Idealized mathematical expressions of the relationship between two or
more factors (variables).
pathway. The connected media that transport a chemical from source to
populations.
point-of-contact exposures. Exposure expressed as the product of the concentration of the chemical in the medium of exposure and the duration and
surface area of contact with the body surface, for example, mg/cm²-hours.
Some chemicals do not need to be absorbed into the body but rather produce
toxicity directly at the point of contact, for example, the skin, mouth, GI
tract, nose, bronchial tubes, or lungs. In such cases, the absorbed dose is not
the relevant measure of exposure; rather, it is the amount of toxic chemical
coming directly into contact with the body surface.
population at risk. A group of subjects with the opportunity to be exposed to
a chemical.
risk. The nature and probability of occurrence of an unwanted, adverse effect on
human life or health or on the environment.
risk assessment. Characterization of the potential adverse effects on human life
or health or on the environment. According to the National Research Council’s Committee on the Institutional Means for Assessment of Health Risk,
human health risk assessment includes the following: description of the potential adverse health effects based on an evaluation of results of epidemiologic,
clinical, toxicological, and environmental research (hazard identification);
extrapolation from those results to predict the type and estimate the extent of
health effects in humans under given conditions of exposure (dose–response
assessment); judgments regarding the number and characteristics of persons
exposed at various intensities and durations (exposure assessment); summary judgments on the existence and overall magnitude of the public-health
problem; and characterization of the uncertainties inherent in the process of
inferring risk (risk characterization).
route of exposure. The way a chemical enters the body after exposure, that is,
by ingestion, inhalation, or dermal absorption.
setting. The place or situation in which a person is exposed to the chemical.
Setting is often modified by the activity a person is undertaking, for example,
occupational or in-home exposures.
source. The activity or entity from which the chemical is released for potential
human exposure.
subchronic exposure. An exposure of intermediate duration between acute
and chronic.
subject. An exposed individual, whether a human or an exposed animal or
organism in the environment. An exposed individual is sometimes also called
a receptor.
systemic dose. A dose of a chemical within the body—that is, not localized at the
point of contact. Thus, skin irritation caused by contact with a chemical is not
a systemic effect, but liver damage due to absorption of the chemical through
the skin is. Often referred to as target site dose.
total dose. The doses received by more than one route of exposure are added
to yield the total dose.
References on Exposure
D.B. Barr, Expanding the Role of Exposure Science in Environmental Health, 16 J.
Exposure Sci. Envtl. Epidemiol. 473 (2006).
Exposure Assessment in Occupational and Environmental Epidemiology (M.J.
Nieuwenhuijsen ed., 2003).
S. Gad, Regulatory Toxicology (2d ed. 2001). Includes much discussion of pharmaceuticals, food ingredients, and other consumer products.
P. Lioy, Exposure Science: A View of the Past and Milestones for the Future, 118 Envtl.
Health Persp. 1081–90 (2010).
U.S. Environmental Protection Agency, Guidelines for Exposure Assessment,
Doc. No. EPA/600/Z-92/001 (1992), available at http://cfpub.epa.gov/
ncea/cfm/recordisplay.cfm?deid=15263.
Reference Guide on Epidemiology
MICHAEL D. GREEN, D. MICHAL FREEDMAN, AND
LEON GORDIS
Michael D. Green, J.D., is Bess & Walter Williams Chair in Law, Wake Forest University
School of Law, Winston-Salem, North Carolina.
D. Michal Freedman, J.D., Ph.D., M.P.H., is Epidemiologist, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland.
Leon Gordis, M.D., M.P.H., Dr.P.H., is Professor Emeritus of Epidemiology, Johns Hopkins
Bloomberg School of Public Health, and Professor Emeritus of Pediatrics, Johns Hopkins School
of Medicine, Baltimore, Maryland.
CONTENTS
I. Introduction, 551
II. What Different Kinds of Epidemiologic Studies Exist? 555
A. Experimental and Observational Studies of Suspected Toxic
Agents, 555
B. Types of Observational Study Design, 556
1. Cohort studies, 557
2. Case-control studies, 559
3. Cross-sectional studies, 560
4. Ecological studies, 561
C. Epidemiologic and Toxicologic Studies, 563
III. How Should Results of an Epidemiologic Study Be Interpreted? 566
A. Relative Risk, 566
B. Odds Ratio, 568
C. Attributable Risk, 570
D. Adjustment for Study Groups That Are Not Comparable, 571
IV. What Sources of Error Might Have Produced a False Result? 572
A. What Statistical Methods Exist to Evaluate the Possibility of
Sampling Error? 574
1. False positives and statistical significance, 575
2. False negatives, 581
3. Power, 582
B. What Biases May Have Contributed to an Erroneous Association? 583
1. Selection bias, 583
2. Information bias, 585
3. Other conceptual problems, 590
C. Could a Confounding Factor Be Responsible for the Study
Result? 591
1. What techniques can be used to prevent or limit
confounding? 595
2. What techniques can be used to identify confounding
factors? 595
3. What techniques can be used to control for confounding
factors? 596
V. General Causation: Is an Exposure a Cause of the Disease? 597
A. Is There a Temporal Relationship? 601
B. How Strong Is the Association Between the Exposure and
Disease? 602
C. Is There a Dose–Response Relationship? 603
D. Have the Results Been Replicated? 604
E. Is the Association Biologically Plausible (Consistent with Existing
Knowledge)? 604
F. Have Alternative Explanations Been Considered? 605
G. What Is the Effect of Ceasing Exposure? 605
H. Does the Association Exhibit Specificity? 605
I. Are the Findings Consistent with Other Relevant Knowledge? 606
VI. What Methods Exist for Combining the Results of Multiple Studies? 606
VII. What Role Does Epidemiology Play in Proving Specific Causation? 608
VIII. Acknowledgments, 618
Glossary of Terms, 619
References on Epidemiology, 630
References on Law and Epidemiology, 630
I. Introduction
Epidemiology is the field of public health and medicine that studies the incidence,
distribution, and etiology of disease in human populations. The purpose of epidemiology is to better understand disease causation and to prevent disease in groups
of individuals. Epidemiology assumes that disease is not distributed randomly in a
group of individuals and that identifiable subgroups, including those exposed to
certain agents, are at increased risk of contracting particular diseases.1
Judges and juries are regularly presented with epidemiologic evidence as
the basis of an expert’s opinion on causation.2 In the courtroom, epidemiologic
research findings are offered to establish or dispute whether exposure to an agent3
1. Although epidemiologists may conduct studies of beneficial agents that prevent or cure disease
or other medical conditions, this reference guide refers exclusively to outcomes as diseases, because
they are the relevant outcomes in most judicial proceedings in which epidemiology is involved.
2. Epidemiologic studies have been well received by courts deciding cases involving toxic
substances. See, e.g., Siharath v. Sandoz Pharms. Corp., 131 F. Supp. 2d 1347, 1356 (N.D. Ga. 2001)
(“The existence of relevant epidemiologic studies can be a significant factor in proving general causation in toxic tort cases. Indeed, epidemiologic studies provide ‘the primary generally accepted methodology for demonstrating a causal relation between a chemical compound and a set of symptoms or
disease.’” (quoting Conde v. Velsicol Chem. Corp., 804 F. Supp. 972, 1025–26 (S.D. Ohio 1992))),
aff’d, 295 F.3d 1194 (11th Cir. 2002); Berry v. CSX Transp., Inc., 709 So. 2d 552, 569 (Fla. Dist. Ct.
App. 1998). Well-conducted studies are uniformly admitted. 3 Modern Scientific Evidence: The Law
and Science of Expert Testimony § 23.1, at 187 (David L. Faigman et al. eds., 2007–08) [hereinafter
Modern Scientific Evidence]. Since Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 (1993), the
predominant use of epidemiologic studies is in connection with motions to exclude the testimony of
expert witnesses. Cases deciding such motions routinely address epidemiology and its implications for
the admissibility of expert testimony on causation. Often it is not the investigator who conducted the
study who is serving as an expert witness in a case in which the study bears on causation. See, e.g.,
Kennedy v. Collagen Corp., 161 F.3d 1226 (9th Cir. 1998) (physician is permitted to testify about
causation); DeLuca v. Merrell Dow Pharms., Inc., 911 F.2d 941, 953 (3d Cir. 1990) (a pediatric pharmacologist expert’s credentials are sufficient pursuant to Fed. R. Evid. 702 to interpret epidemiologic
studies and render an opinion based thereon); Medalen v. Tiger Drylac U.S.A., Inc., 269 F. Supp. 2d
1118, 1129 (D. Minn. 2003) (holding toxicologist could testify to general causation but not specific
causation); Burton v. R.J. Reynolds Tobacco Co., 181 F. Supp. 2d 1256, 1267 (D. Kan. 2002) (a
vascular surgeon was permitted to testify to general causation); Landrigan v. Celotex Corp., 605 A.2d
1079, 1088 (N.J. 1992) (an epidemiologist was permitted to testify to both general causation and specific causation); Trach v. Fellin, 817 A.2d 1102, 1117–18 (Pa. Super. Ct. 2003) (an expert who was a
toxicologist and pathologist was permitted to testify to general and specific causation).
3. We use the term “agent” to refer to any substance external to the human body that potentially
causes disease or other health effects. Thus, drugs, devices, chemicals, radiation, and minerals (e.g.,
asbestos) are all agents whose toxicity an epidemiologist might explore. A single agent or a number
of independent agents may cause disease, or the combined presence of two or more agents may be
necessary for the development of the disease. Epidemiologists also conduct studies of individual characteristics, such as blood pressure and diet, which might pose risks, but those studies are rarely of interest
in judicial proceedings. Epidemiologists also may conduct studies of drugs and other pharmaceutical
products to assess their efficacy and safety.
caused a harmful effect or disease.4 Epidemiologic evidence identifies agents that
are associated with an increased risk of disease in groups of individuals, quantifies
the amount of excess disease that is associated with an agent, and provides a profile
of the type of individual who is likely to contract a disease after being exposed
to an agent. Epidemiology focuses on the question of general causation (i.e., is
the agent capable of causing disease?) rather than that of specific causation (i.e.,
did it cause disease in a particular individual?).5 For example, in the 1950s, Doll
and Hill and others published articles about the increased risk of lung cancer in
cigarette smokers. Doll and Hill’s studies showed that smokers who smoked 10 to
20 cigarettes a day had a lung cancer mortality rate that was about 10 times higher
than that for nonsmokers.6 These studies identified an association between smoking cigarettes and death from lung cancer that contributed to the determination
that smoking causes lung cancer.
However, it should be emphasized that an association is not equivalent to causation.7 An association identified in an epidemiologic study may or may not be
4. E.g., Bonner v. ISP Techs., Inc., 259 F.3d 924 (8th Cir. 2001) (a worker exposed to organic
solvents allegedly suffered organic brain dysfunction); Burton v. R.J. Reynolds Tobacco Co., 181
F. Supp. 2d 1256 (D. Kan. 2002) (cigarette smoking was alleged to have caused peripheral vascular
disease); In re Bextra & Celebrex Mktg. Sales Practices & Prod. Liab. Litig., 524 F. Supp. 2d 1166
(N.D. Cal. 2007) (multidistrict litigation over drugs for arthritic pain that caused heart disease); Ruff
v. Ensign-Bickford Indus., Inc., 168 F. Supp. 2d 1271 (D. Utah 2001) (chemicals that escaped from an
explosives manufacturing site allegedly caused non-Hodgkin’s lymphoma in nearby residents); Castillo
v. E.I. du Pont De Nemours & Co., 854 So. 2d 1264 (Fla. 2003) (a child born with a birth defect
allegedly resulting from mother’s exposure to a fungicide).
5. This terminology and the distinction between general causation and specific causation are
widely recognized in court opinions. See, e.g., Norris v. Baxter Healthcare Corp., 397 F.3d 878 (10th
Cir. 2005); In re Hanford Nuclear Reservation Litig., 292 F.3d 1124, 1129 (9th Cir. 2002) (“‘Generic
causation’ has typically been understood to mean the capacity of a toxic agent . . . to cause the illnesses
complained of by plaintiffs. If such capacity is established, ‘individual causation’ answers whether that
toxic agent actually caused a particular plaintiff’s illness.”); In re Rezulin Prods. Liab. Litig., 369 F.
Supp. 2d 398, 402 (S.D.N.Y. 2005); Soldo v. Sandoz Pharms. Corp., 244 F. Supp. 2d 434, 524–25
(W.D. Pa. 2003); Burton v. R.J. Reynolds Tobacco Co., 181 F. Supp. 2d 1256, 1266–67 (D. Kan.
2002). For a discussion of specific causation, see infra Section VII.
6. Richard Doll & A. Bradford Hill, Lung Cancer and Other Causes of Death in Relation to Smoking:
A Second Report on the Mortality of British Doctors, 2 Brit. Med. J. 1071 (1956).
7. See Soldo v. Sandoz Pharms. Corp., 244 F. Supp. 2d 434, 461 (W.D. Pa. 2003) (Hill criteria
[see infra Section V] developed to assess whether an association is causal); Miller v. Pfizer, Inc., 196
F. Supp. 2d 1062, 1079–80 (D. Kan. 2002); Magistrini v. One Hour Martinizing Dry Cleaning, 180
F. Supp. 2d 584, 591 (D.N.J. 2002) (“[A]n association is not equivalent to causation.” (quoting the
second edition of this reference guide)); Zandi v. Wyeth a/k/a Wyeth, Inc., No. 27-CV-06-6744,
2007 WL 3224242, at *11 (D. Minn. Oct. 15, 2007).
Association is more fully discussed infra Section III. The term is used to describe the relationship
between two events (e.g., exposure to a chemical agent and development of disease) that occur more
frequently together than one would expect by chance. Association does not necessarily imply a causal
effect. Causation is used to describe the association between two events when one event is a necessary
link in a chain of events that results in the effect. Of course, alternative causal chains may exist that do
not include the agent but that result in the same effect. For general treatment of causation in tort law
causal.8 Assessing whether an association is causal requires an understanding of
the strengths and weaknesses of the study’s design and implementation, as well as
a judgment about how the study findings fit with other scientific knowledge. It
is important to emphasize that all studies have “flaws” in the sense of limitations
that add uncertainty about the proper interpretation of the results.9 Some flaws are
inevitable given the limits of technology, resources, the ability and willingness of
persons to participate in a study, and ethical constraints. In evaluating epidemiologic evidence, the key questions, then, are the extent to which a study’s limitations compromise its findings and permit inferences about causation.
A final caveat is that employing the results of group-based studies of risk to
make a causal determination for an individual plaintiff is beyond the limits of
epidemiology. Nevertheless, a substantial body of legal precedent has developed
that addresses the use of epidemiologic evidence to prove causation for an individual litigant through probabilistic means, and the law developed in these cases
is discussed later in this reference guide.10
The following sections of this reference guide address a number of critical
issues that arise in considering the admissibility of, and weight to be accorded
to, epidemiologic research findings. Over the past several decades, courts frequently have confronted the use of epidemiologic studies as evidence and have
recognized their utility in proving causation. As the Third Circuit observed in
DeLuca v. Merrell Dow Pharmaceuticals, Inc.: “The reliability of expert testimony
founded on reasoning from epidemiologic data is generally a fit subject for judicial notice; epidemiology is a well-established branch of science and medicine,
and epidemiologic evidence has been accepted in numerous cases.”11 Indeed,
and that for factual causation to exist an agent must be a necessary link in a causal chain sufficient for
the outcome, see Restatement (Third) of Torts: Liability for Physical Harm § 26 (2010). Epidemiologic
methods cannot deductively prove causation; indeed, all empirically based science cannot affirmatively
prove a causal relation. See, e.g., Stephan F. Lanes, The Logic of Causal Inference in Medicine, in Causal
Inference 59 (Kenneth J. Rothman ed., 1988). However, epidemiologic evidence can justify an inference that an agent causes a disease. See infra Section V.
8. See infra Section IV.
9. See In re Phenylpropanolamine (PPA) Prods. Liab. Litig., 289 F. Supp. 2d 1230, 1240 (W.D.
Wash. 2003) (quoting this reference guide and criticizing defendant’s “ex post facto dissection” of a
study); In re Orthopedic Bone Screw Prods. Liab. Litig., MDL No. 1014, 1997 U.S. Dist. LEXIS
6441, at *26–*27 (E.D. Pa. May 5, 1997) (holding that despite potential for several biases in a study
that “may . . . render its conclusions inaccurate,” the study was sufficiently reliable to be admissible);
Joseph L. Gastwirth, Reference Guide on Survey Research, 36 Jurimetrics J. 181, 185 (1996) (review essay)
(“One can always point to a potential flaw in a statistical analysis.”).
10. See infra Section VII.
11. 911 F.2d 941, 954 (3d Cir. 1990); see also Norris v. Baxter Healthcare Corp., 397 F.3d 878,
882 (10th Cir. 2005) (an extensive body of exonerative epidemiologic evidence must be confronted
and the plaintiff must provide scientifically reliable contrary evidence); In re Meridia Prods. Liab.
Litig., 328 F. Supp. 2d 791, 800 (N.D. Ohio 2004) (“Epidemiologic studies are the primary generally accepted methodology for demonstrating a causal relation between the chemical compound and
a set of symptoms or a disease. . . .” (quoting Conde v. Velsicol Chem. Corp., 804 F. Supp. 972,
much more difficult problems arise for courts when there is a paucity of epidemiologic evidence.12
Three basic issues arise when epidemiology is used in legal disputes, and the
methodological soundness of a study and its implications for resolution of the
question of causation must be assessed:
1. Do the results of an epidemiologic study or studies reveal an association
between an agent and disease?
2. Could this association have resulted from limitations of the study (bias,
confounding, or sampling error), and, if so, from which?
3. Based on the analysis of limitations in Item 2, above, and on other evidence, how plausible is a causal interpretation of the association?
Section II explains the different kinds of epidemiologic studies, and Section III
addresses the meaning of their outcomes. Section IV examines concerns about
the methodological validity of a study, including the problem of sampling error.13
Section V discusses general causation, considering whether an agent is capable of
causing disease. Section VI deals with methods for combining the results of multiple epidemiologic studies and the difficulties entailed in extracting a single global
measure of risk from multiple studies. Additional legal questions that arise in most
toxic substances cases are whether population-based epidemiologic evidence can
be used to infer specific causation, and, if so, how. Section VII addresses specific
causation—the matter of whether a specific agent caused the disease in a given
plaintiff.
1025–26 (S.D. Ohio 1992))); Brasher v. Sandoz Pharms. Corp., 160 F. Supp. 2d 1291, 1296 (N.D.
Ala. 2001) (“Unquestionably, epidemiologic studies provide the best proof of the general association
of a particular substance with particular effects, but it is not the only scientific basis on which those
effects can be predicted.”).
12. See infra note 181.
13. For a more in-depth discussion of the statistical basis of epidemiology, see David H. Kaye &
David A. Freedman, Reference Guide on Statistics, Section II.A, in this manual, and two case studies:
Joseph Sanders, The Bendectin Litigation: A Case Study in the Life Cycle of Mass Torts, 43 Hastings L.J.
301 (1992); Devra L. Davis et al., Assessing the Power and Quality of Epidemiologic Studies of Asbestos-Exposed Populations, 1 Toxicological & Indus. Health 93 (1985). See also References on Epidemiology
and References on Law and Epidemiology at the end of this reference guide.
II. What Different Kinds of Epidemiologic
Studies Exist?
A. Experimental and Observational Studies of Suspected Toxic
Agents
To determine whether an agent is related to the risk of developing a certain disease
or an adverse health outcome, we might ideally want to conduct an experimental
study in which the subjects would be randomly assigned to one of two groups:
one group exposed to the agent of interest and the other not exposed. After a
period of time, the study participants in both groups would be evaluated for the
development of the disease. This type of study, called a randomized trial, clinical trial, or true experiment, is considered the gold standard for determining the
relationship of an agent to a health outcome or adverse side effect. Such a study
design is often used to evaluate new drugs or medical treatments and is the best
way to ensure that any observed difference in outcome between the two groups
is likely to be the result of exposure to the drug or medical treatment.
Randomization minimizes the likelihood that there are differences in relevant characteristics between those exposed to the agent and those not exposed.
Researchers conducting clinical trials attempt to use study designs that are placebo
controlled, which means that the group not receiving the active agent or treatment is given an inactive ingredient that appears similar to the active agent under
study. They also use double blinding where possible, which means that neither the
participants nor those conducting the study know which group is receiving the
agent or treatment and which group is given the placebo. However, ethical and
practical constraints limit the use of such experimental methodologies to assessing the
value of agents that are thought to be beneficial to human beings.14
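The point that random assignment tends to equalize other characteristics across the two groups can be illustrated with a small simulation. The sketch below is hypothetical: the subjects, the 30% prevalence of an unrelated risk factor, and the group sizes are all invented for illustration.

```python
import random

random.seed(0)

# Hypothetical subjects; about 30% carry an unrelated risk factor (e.g., age over 60).
subjects = [{"older": random.random() < 0.3} for _ in range(10000)]

# Randomly assign each subject to the treatment or control group.
treatment, control = [], []
for s in subjects:
    (treatment if random.random() < 0.5 else control).append(s)

def fraction_older(group):
    return sum(s["older"] for s in group) / len(group)

# With thousands of subjects, the two groups end up with nearly
# identical fractions of the risk factor, so any outcome difference
# is unlikely to be explained by that factor.
print(round(fraction_older(treatment), 2), round(fraction_older(control), 2))
```

With small groups, chance imbalances are larger; this is one reason small randomized trials are read with more caution.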
When an agent’s effects are suspected to be harmful, researchers cannot
knowingly expose people to the agent.15 Instead epidemiologic studies typically
14. Although experimental human studies cannot intentionally expose subjects to toxins, they
can provide evidence that a new drug or other beneficial intervention also has adverse effects. See In
re Bextra & Celebrex Mktg. Sales Practices & Prod. Liab. Litig., 524 F. Supp. 2d 1166, 1181 (N.D.
Cal. 2007) (the court relied on a clinical study of Celebrex that revealed increased cardiovascular risk
to conclude that the plaintiff’s experts’ testimony on causation was admissible); McDarby v. Merck &
Co., 949 A.2d 223 (N.J. Super. Ct. App. Div. 2008) (explaining how clinical trials of Vioxx revealed
an association with heart disease).
15. Experimental studies in which human beings are exposed to agents known or thought to be
toxic are ethically proscribed. See Glastetter v. Novartis Pharms. Corp., 252 F.3d 986, 992 (8th Cir.
2001); Brasher v. Sandoz Pharms. Corp., 160 F. Supp. 2d 1291, 1297 (N.D. Ala. 2001). Experimental
studies can be used where the agent under investigation is believed to be beneficial, as is the case in
the development and testing of new pharmaceutical drugs. See, e.g., McDarby v. Merck & Co., 949
A.2d 223, 270 (N.J. Super. Ct. App. Div. 2008) (an expert witness relied on a clinical trial of a new
drug to find the adjusted risk for the plaintiff); see also Gordon H. Guyatt, Using Randomized Trials in
“observe”16 a group of individuals who have been exposed to an agent of interest,
such as cigarette smoke or an industrial chemical, and compare them with another
group of individuals who have not been exposed. Thus, the investigator identifies
a group of subjects who have been exposed17 and compares their rate of disease
or death with that of an unexposed group. In contrast to clinical studies in which
potential risk factors can be controlled, epidemiologic investigations generally
focus on individuals living in the community, for whom characteristics other than
the one of interest, such as diet, exercise, exposure to other environmental agents,
and genetic background, may distort a study’s results. Because these characteristics
cannot be controlled directly by the investigator, the investigator addresses their
possible role in the relationship being studied by considering them in the design
of the study and in the analysis and interpretation of the study results (see infra
Section IV).18 We emphasize that the Achilles’ heel of observational studies is the
possibility of differences in the two populations being studied with regard to risk
factors other than exposure to the agent.19 By contrast, experimental studies, in
which subjects are randomized, generally avoid this problem.
B. Types of Observational Study Design
Several different types of observational epidemiologic studies can be conducted.20
Study designs may be chosen because of suitability for investigating the question
of interest, timing constraints, resource limitations, or other considerations.
Most observational studies collect data about both exposure and health outcome in every individual in the study. The two main types of observational studies
are cohort studies and case-control studies. A third type of observational study is a
cross-sectional study, although cross-sectional studies are rarely useful in identifying toxic agents.21 A final type of observational study, one in which data about
Pharmacoepidemiology, in Drug Epidemiology and Post-Marketing Surveillance 59 (Brian L. Strom &
Giampaolo Velo eds., 1992). Experimental studies also may be conducted that entail the discontinuation of exposure to a harmful agent, such as studies in which smokers are randomly assigned to a
variety of smoking cessation programs or have no cessation.
16. Classifying these studies as observational in contrast to randomized trials can be misleading to those who are unfamiliar with the area, because subjects in a randomized trial are observed as
well. Nevertheless, the use of the term “observational studies” to distinguish them from experimental
studies is widely employed.
17. The subjects may have voluntarily exposed themselves to the agent of interest, as is the case, for
example, for those who smoke cigarettes, or subjects may have been exposed involuntarily or even without knowledge to an agent, such as in the case of employees who are exposed to chemical fumes at work.
18. See David A. Freedman, Oasis or Mirage? 21 Chance 59, 59–61 (Mar. 2008).
19. Both experimental and observational studies are subject to random error. See infra Section IV.A.
20. Other epidemiologic studies collect data about the group as a whole, rather than about each
individual in the group. These group studies are discussed infra Section II.B.4.
21. See infra Section II.B.3.
individuals are not gathered, but rather population data about exposure and disease
are used, is an ecological study.22
The difference between cohort studies and case-control studies is that
cohort studies measure and compare the incidence of disease in the exposed and
unexposed (“control”) groups, while case-control studies measure and compare
the frequency of exposure in the group with the disease (the “cases”) and the
group without the disease (the “controls”). In a case-control study, the rates of
exposure in the cases and the rates in the controls are compared, and the odds of
having the disease when exposed to a suspected agent can be compared with the
odds when not exposed. The critical difference between cohort studies and case-control studies is that cohort studies begin with exposed people and unexposed
people, while case-control studies begin with individuals who are selected based
on whether they have the disease or do not have the disease and their exposure
to the agent in question is measured. The goal of both types of studies is to determine if there is an association between exposure to an agent and a disease and the
strength (magnitude) of that association.
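The odds comparison described above can be sketched numerically. The counts below are invented for illustration and are not taken from any study cited in this guide: the exposure odds among the cases are divided by the exposure odds among the controls, which is algebraically the familiar cross-product of the 2×2 counts.

```python
# Hypothetical case-control counts, invented for illustration only.
exposed_cases = 40       # cases with past exposure
unexposed_cases = 60     # cases without past exposure
exposed_controls = 20    # controls with past exposure
unexposed_controls = 80  # controls without past exposure

# Odds of exposure among cases and among controls.
odds_cases = exposed_cases / unexposed_cases            # 40/60
odds_controls = exposed_controls / unexposed_controls   # 20/80

# The odds ratio; identical to the cross-product form (40*80)/(60*20).
odds_ratio = odds_cases / odds_controls
cross_product = (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)

print(round(odds_ratio, 2))     # 2.67
print(round(cross_product, 2))  # 2.67
```

An odds ratio above 1 indicates that exposure is more common among those with the disease; because a case-control study does not follow a population forward, the odds ratio, not an incidence rate, is its natural measure of association.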
1. Cohort studies
In cohort studies,23 researchers define a study population without regard to the
participants’ disease status. The cohort may be defined in the present and followed
forward into the future (prospectively) or it may be constructed retrospectively
as of sometime in the past and followed over historical time toward the present.
In either case, the researchers classify the study participants into groups based on
whether they were exposed to the agent of interest (see Figure 1).24 In a prospective study, the exposed and unexposed groups are followed for a specified length
of time, and the proportions of individuals in each group who develop the disease
of interest are compared. In a retrospective study, the researcher will determine
the proportion of individuals in the exposed group who developed the disease
from available records or evidence and compare that proportion with the proportion of another group that was not exposed.25 Thus, as illustrated in Table 1,
22. For thumbnail sketches on all types of epidemiologic study designs, see Brian L. Strom,
Study Designs Available for Pharmacoepidemiology Studies, in Pharmacoepidemiology 17, 21–26 (Brian L.
Strom ed., 4th ed. 2005).
23. Cohort studies also are referred to as prospective studies and followup studies.
24. In some studies, there may be several groups, each with a different magnitude of exposure to
the agent being studied. Thus, a study of cigarette smokers might include heavy smokers (>3 packs a day),
moderate smokers (1 to 2 packs a day), and light smokers (<1 pack a day). See, e.g., Robert A. Rinsky
et al., Benzene and Leukemia: An Epidemiologic Risk Assessment, 316 New Eng. J. Med. 1044 (1987).
25. Sometimes in retrospective cohort studies the researcher gathers historical data about exposure and disease outcome of a cohort. Harold A. Kahn, An Introduction to Epidemiologic Methods
39–41 (1983). Irving Selikoff, in his seminal study of asbestotic disease in insulation workers, included
several hundred workers who had died before he began the study. Selikoff was able to obtain information about exposure from union records and information about disease from hospital and autopsy records. Irving J. Selikoff et al., The Occurrence of Asbestosis Among Insulation Workers in the United States, 132 Ann. N.Y. Acad. Sci. 139, 143 (1965).
Figure 1. Design of a cohort study. [The diagram divides a defined population into an Exposed group and a Not Exposed group; each group is then followed to determine who develops disease and who does not.]
Table 1. Cross-Tabulation of Exposure by Disease Status

                 No Disease    Disease    Totals    Incidence Rates of Disease
Not exposed          a            c        a + c           c/(a + c)
Exposed              b            d        b + d           d/(b + d)
a researcher would compare the proportion of unexposed individuals with the
disease, c/(a + c), with the proportion of exposed individuals with the disease,
d/(b + d). If the exposure causes the disease, the researcher would expect a greater
proportion of the exposed individuals to develop the disease than the unexposed
individuals.26
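The comparison just described can be made concrete with a short calculation. The sketch below fills Table 1 with hypothetical counts (invented for illustration only) and computes each group's incidence proportion and their ratio, the measure of relative risk discussed later in this guide.

```python
# Hypothetical counts in the layout of Table 1 (invented for illustration).
a = 950   # not exposed, no disease
c = 50    # not exposed, disease
b = 880   # exposed, no disease
d = 120   # exposed, disease

incidence_unexposed = c / (a + c)   # c/(a + c) = 50/1000 = 0.05
incidence_exposed = d / (b + d)     # d/(b + d) = 120/1000 = 0.12

# Ratio of the two incidence proportions: how many times more
# frequent the disease is among the exposed.
relative_risk = incidence_exposed / incidence_unexposed

print(incidence_unexposed)        # 0.05
print(incidence_exposed)          # 0.12
print(round(relative_risk, 2))    # 2.4
```

With these invented counts the exposed group develops the disease at 2.4 times the rate of the unexposed group; whether such an association reflects causation is a separate question, taken up in Sections IV and V.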
One advantage of the cohort study design is that the temporal relationship
between exposure and disease can often be established more readily than in other
study designs, especially a case-control design, discussed below. By tracking people
who are initially not affected by the disease, the researcher can determine the time
of disease onset and its relation to exposure. This temporal relationship is critical to the question of causation, because exposure must precede disease onset if
exposure caused the disease.
As an example, in 1950 a cohort study was begun to determine whether
uranium miners exposed to radon were at increased risk for lung cancer as compared with nonminers.
26. Researchers often examine the rate of disease or death in the exposed and control groups. The rate of disease or death entails consideration of the number developing disease within a specified period. All smokers and nonsmokers will, if followed for 100 years, die. Smokers will die at a greater rate than nonsmokers in the earlier years.
The study group (also referred to as the exposed cohort)
consisted of 3400 white underground miners. The control group (which need not
be the same size as the exposed cohort) comprised white nonminers from the same
geographic area. Members of the exposed cohort were examined every 3 years,
and the degree of this cohort’s exposure to radon was measured from samples
taken in the mines. Ongoing testing for radioactivity and periodic medical monitoring of lungs permitted the researchers to examine whether disease was linked
to prior work exposure to radiation and allowed them to discern the relationship
between exposure to radiation and disease. Exposure to radiation was associated
with the development of lung cancer in uranium miners.27
The cohort design is used often in occupational studies such as the one just discussed. Because the design is not experimental, and the investigator has no control
over what other exposures a subject in the study may have had, an increased risk of
disease among the exposed group may be caused by agents other than the exposure
of interest. A cohort study of workers in a certain industry that pays below-average
wages might find a higher risk of cancer in those workers. This may be because
they work in that industry, or, among other reasons, because low-wage groups are
exposed to other harmful agents, such as environmental toxins present in higher
concentrations in their neighborhoods. In the study design, the researcher must
attempt to identify factors other than the exposure that may be responsible for the
increased risk of disease. If data are gathered on other possible etiologic factors,
the researcher generally uses statistical methods28 to assess whether a true association exists between working in the industry and cancer. Evaluating whether the
association is causal involves additional analysis, as discussed in Section V.
2. Case-control studies
In case-control studies,29 the researcher begins with a group of individuals who
have a disease (cases) and then selects a similar group of individuals who do not
have the disease (controls). (Ideally, controls should come from the same source
population as the cases.) The researcher then compares the groups in terms of past
exposures. If a certain exposure is associated with or caused the disease, a higher
proportion of past exposure among the cases than among the controls would be
expected (see Figure 2).
27. This example is based on a study description in Abraham M. Lilienfeld & David E. Lilienfeld, Foundations of Epidemiology 237–39 (2d ed. 1980). The original study is Joseph K. Wagoner et
al., Radiation as the Cause of Lung Cancer Among Uranium Miners, 273 New Eng. J. Med. 181 (1965).
28. See Daniel L. Rubinfeld, Reference Guide on Multiple Regression, Section II.B, in this
manual; David H. Kaye & David A. Freedman, Reference Guide on Statistics, Section V.D, in
this manual.
29. Case-control studies are also referred to as retrospective studies, because researchers gather
historical information about rates of exposure to an agent in the case and control groups.
Figure 2. Design of a case-control study. [The diagram begins with CASES (persons with the disease) and CONTROLS (persons without it); each group is then classified by past exposure as Exposed or Not Exposed.]
Thus, for example, in the late 1960s, doctors in Boston were confronted with
an unusual number of young female patients with vaginal adenocarcinoma. Those
patients became the “cases” in a case-control study (because they had the disease
in question) and were matched with “controls,” who did not have the disease.
Controls were selected based on their being born in the same hospitals and at the
same time as the cases. The cases and controls were compared for exposure to
agents that might be responsible, and researchers found maternal ingestion of DES
(diethylstilbestrol) in all but one of the cases but none of the controls.30
An advantage of the case-control study is that it usually can be completed in
less time and with less expense than a cohort study. Case-control studies are also
particularly useful in the study of rare diseases, because if a cohort study were conducted, an extremely large group would have to be studied in order to observe the
development of a sufficient number of cases for analysis.31 A number of potential
problems with case-control studies are discussed in Section IV.B.
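Footnote 31's point about rare diseases can be illustrated with a generic two-proportion sample-size formula. The sketch below is not the exact calculation from Kahn & Sempos, so its result differs somewhat from the figures quoted in the footnote; the significance level (two-sided 0.05) and power (80%) are assumptions chosen for illustration.

```python
from math import sqrt

# Standard normal critical values (assumed: two-sided alpha = 0.05, power = 0.80).
Z_ALPHA = 1.959964  # z for 0.975
Z_BETA = 0.841621   # z for 0.80

def n_per_group(p1, p2):
    """Per-group sample size for a two-proportion z-test (textbook formula)."""
    p_bar = (p1 + p2) / 2
    numerator = (Z_ALPHA * sqrt(2 * p_bar * (1 - p_bar))
                 + Z_BETA * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return (numerator ** 2) / (p1 - p2) ** 2

# Cohort study of a rare disease: detecting a doubling of incidence
# from 1% to 2% requires thousands of subjects in each group.
print(round(n_per_group(0.01, 0.02)))  # on the order of 2,300 per group
```

Because a case-control study samples on disease status directly, it needs only enough cases and controls to compare exposure frequencies, which is why footnote 31's case-control design requires orders of magnitude fewer subjects.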
3. Cross-sectional studies
A third type of observational study is a cross-sectional study. In this type of study,
individuals are interviewed or examined, and the presence of both the exposure
of interest and the disease of interest is determined in each individual at a single
point in time. Cross-sectional studies determine the presence (prevalence) of both
exposure and disease in the subjects and do not determine the development of
disease or risk of disease (incidence). Moreover, because both exposure and disease are determined in an individual at the same point in time, it is not possible
to establish the temporal relation between exposure and disease—that is, that the
30. See Arthur L. Herbst et al., Adenocarcinoma of the Vagina: Association of Maternal Stilbestrol
Therapy with Tumor Appearance, 284 New Eng. J. Med. 878 (1971).
31. Thus, for example, to detect a doubling of disease caused by exposure to an agent where
the incidence of disease is 1 in 100 in the unexposed population would require sample sizes of 3100
for the exposed and nonexposed groups for a cohort study, but only 177 for the case and control
groups in a case-control study. Harold A. Kahn & Christopher T. Sempos, Statistical Methods in
Epidemiology 66 (1989).
exposure preceded the disease, which would be necessary for drawing any causal
inference. Thus, a researcher may use a cross-sectional study to determine the
connection between a personal characteristic that does not change over time,
such as blood type, and existence of a disease, such as aplastic anemia, by examining individuals and determining their blood types and whether they suffer from
aplastic anemia. Cross-sectional studies are infrequently used when the exposure of
interest is an environmental toxic agent (current smoking status is a poor measure
of an individual’s history of smoking), but these studies can provide valuable leads
to further directions for research.32
4. Ecological studies
Up to now, we have discussed studies in which data on both exposure and health
outcome are obtained for each individual included in the study.33 In contrast,
studies that collect data only about the group as a whole are called ecological
studies.34 In ecological studies, information about individuals is generally not
gathered; instead, overall rates of disease or death for different groups are obtained
and compared. The objective is to identify some difference between the two
groups, such as diet, genetic makeup, or alcohol consumption, that might explain
differences in the risk of disease observed in the two groups.35 Such studies may
be useful for identifying associations, but they rarely provide definitive causal
answers.36 The difficulty is illustrated below with an ecological study of the relationship between dietary fat and cancer.
32. For more information (and references) about cross-sectional studies, see Leon Gordis, Epidemiology 195–98 (4th ed. 2009).
33. Some individual studies may be conducted in which all members of a group or community
are treated as exposed to an agent of interest (e.g., a contaminated water system) and disease status is
determined individually. These studies should be distinguished from ecological studies.
34. In Cook v. Rockwell International Corp., 580 F. Supp. 2d 1071, 1095–96 (D. Colo. 2006), the
plaintiffs’ expert conducted an ecological study in which he compared the incidence of two cancers
among those living in a specified area adjacent to the Rocky Flats Nuclear Weapons Plant with other
areas more distant. (The likely explanation for relying on this type of study is the time and expense of
a study that gathered information about each individual in the affected area.) The court recognized that
ecological studies are less probative than studies in which data are based on individuals, but nevertheless held that this limitation went to the weight of the study. The plaintiffs’ expert was permitted to testify to
causation, relying on the ecological study he performed.
In Renaud v. Martin Marietta Corp., 749 F. Supp. 1545, 1551 (D. Colo. 1990), aff’d, 972 F.2d
304 (10th Cir. 1992), the plaintiffs attempted to rely on an excess incidence of cancers in their neighborhood to prove causation. Unfortunately, the court confused the role of epidemiology in proving
causation with the issue of the plaintiffs’ exposure to the alleged carcinogen and never addressed the
evidentiary value of the plaintiffs’ evidence of a disease cluster (i.e., an unusually high incidence of a
particular disease in a neighborhood or community). Id. at 1554.
35. David E. Lilienfeld & Paul D. Stolley, Foundations of Epidemiology 12 (3d ed. 1994).
36. Thus, the emergence of a cluster of adverse events associated with use of heparin, a longtime
and widely prescribed anticoagulant, led to suspicions that some specific lot of heparin was responsible.
These concerns led the Centers for Disease Control to conduct a case-control study that concluded
If a researcher were interested in determining whether a high dietary fat intake
is associated with breast cancer, he or she could compare different countries in terms
of their average fat intakes and their average rates of breast cancer. If a country with
a high average fat intake also tends to have a high rate of breast cancer, the finding
would suggest an association between dietary fat and breast cancer. However, such
a finding would be far from conclusive, because it lacks particularized information
about an individual’s exposure and disease status (i.e., whether an individual with
high fat intake is more likely to have breast cancer).37 In addition to the lack of
information about an individual’s intake of fat, the researcher does not know about
the individual’s exposures to other agents (or other factors, such as a mother’s age at
first birth) that may also be responsible for the increased risk of breast cancer. This
lack of information about each individual’s exposure to an agent and disease status
detracts from the usefulness of the study and can lead to an erroneous inference
about the relationship between fat intake and breast cancer, a problem known as
an ecological fallacy. The fallacy is assuming that, on average, the individuals in the
study who have suffered from breast cancer consumed more dietary fat than those
who have not suffered from the disease. This assumption may not be true. Nevertheless, the study is useful in that it identifies an area for further research: the fat
intake of individuals who have breast cancer as compared with the fat intake of those
who do not. Researchers who identify a difference in disease or death in an ecological study may follow up with a study based on gathering data about individuals.
Another epidemiologic approach is to compare disease rates over time and
focus on disease rates before and after a point in time when some event of interest took place.38 For example, thalidomide’s teratogenicity (capacity to cause
birth defects) was discovered after Dr. Widukind Lenz found a dramatic increase
in the incidence of limb reduction birth defects in Germany beginning in 1960.
Yet, other than with such powerful agents as thalidomide, which increased the
incidence of limb reduction defects by several orders of magnitude, these secular-trend studies (also known as time-line studies) are less reliable and less able to
that contaminated heparin manufactured by Baxter was responsible for the outbreak of adverse events.
See David B. Blossom et al., Outbreak of Adverse Event Reactions Associated with Contaminated Heparin,
359 New Eng. J. Med. 2674 (2008); In re Heparin Prods. Liab. Litig. 2011 WL 2971918 (N.D. Ohio
July 21, 2011).
37. For a discussion of the data on this question and what they might mean, see David Freedman
et al., Statistics (4th ed. 2007).
38. In Wilson v. Merrell Dow Pharmaceuticals, Inc., 893 F.2d 1149, 1152–53 (10th Cir. 1990), the
defendant introduced evidence showing total sales of Bendectin and the incidence of birth defects
during the 1970–1984 period. In 1983, Bendectin was removed from the market, but the rate of birth
defects did not change. The Tenth Circuit affirmed the lower court’s ruling that the time-line data
were admissible and that the defendant’s expert witnesses could rely on them in rendering their opinions. Similar evidence was relied on in cases involving cell phones and the drug Parlodel, which was
alleged to cause postpartum strokes in women who took the drug to suppress lactation. See Newman
v. Motorola, Inc., 218 F. Supp. 2d 769, 778 (D. Md. 2002); Siharath v. Sandoz Pharms. Corp., 131
F. Supp. 2d 1347, 1358 (N.D. Ga. 2001).
detect modest causal effects than the observational studies described above. Other
factors that affect the measurement or existence of the disease, such as improved
diagnostic techniques and changes in lifestyle or age demographics, may change
over time. If those factors can be identified and measured, it may be possible to
control for them with statistical methods. Of course, unknown factors cannot be
controlled for in these or any other kind of epidemiologic studies.
C. Epidemiologic and Toxicologic Studies
In addition to observational epidemiology, toxicology models based on live animal
studies (in vivo) may be used to determine toxicity in humans.39 Animal studies
have a number of advantages. They can be conducted as true experiments, and
researchers control all aspects of the animals’ lives. Thus, they can avoid the problem
of confounding,40 which epidemiology often confronts. Exposure can be carefully
controlled and measured. Refusals to participate in a study are not an issue, and loss
to follow-up very often is minimal. Ethical limitations are diminished, and animals
can be sacrificed and their tissues examined, which may improve the accuracy of disease assessment. Animal studies often provide useful information about pathological
mechanisms and play a complementary role to epidemiology by assisting researchers
in framing hypotheses and in developing study designs for epidemiologic studies.
Animal studies have two significant disadvantages, however. First, animal study
results must be extrapolated to another species—human beings—and differences
in absorption, metabolism, and other factors may result in interspecies variation in
responses. For example, one powerful human teratogen, thalidomide, does not cause
birth defects in most rodent species.41 Similarly, some known teratogens in animals
are not believed to be human teratogens. In general, it is often difficult to confirm
that an agent known to be toxic in animals is safe for human beings.42 The second
difficulty with inferring human causation from animal studies is that the high doses
customarily used in animal studies require consideration of the dose–response relationship and whether a threshold no-effect dose exists.43 Those matters are almost
always fraught with considerable, and currently unresolvable, uncertainty.44
39. For an in-depth discussion of toxicology, see Bernard D. Goldstein & Mary Sue Henifin,
Reference Guide on Toxicology, in this manual.
40. See infra Section IV.C.
41. Phillip Knightley et al., Suffer the Children: The Story of Thalidomide 271–72 (1979).
42. See Ian C.T. Nesbit & Nathan J. Karch, Chemical Hazards to Human Reproduction 98–106
(1983); Int’l Agency for Research on Cancer (IARC), Interpretation of Negative Epidemiologic Evidence for Carcinogenicity (N.J. Wald & Richard Doll eds., 1985) [hereafter IARC].
43. See infra Section V.C & note 119.
44. See Soldo v. Sandoz Pharms. Corp., 244 F. Supp. 2d 434, 466 (W.D. Pa. 2003) (quoting
this reference guide in the first edition of the Reference Manual); see also General Elec. Co. v. Joiner,
522 U.S. 136, 143–45 (1997) (holding that the district court did not abuse its discretion in excluding
expert testimony on causation based on the expert’s failure to explain how animal studies supported
the expert’s opinion that the agent caused disease in humans).
Toxicologists also use in vitro methods, in which human or animal tissue or
cells are grown in laboratories and are exposed to certain substances. The problem
with this approach is also extrapolation—whether one can generalize the findings
from the artificial setting of tissues in laboratories to whole human beings.45
Often toxicologic studies are the only or best available evidence of toxicity.46
Epidemiologic studies are difficult, time-consuming, expensive, and sometimes,
because of limited exposure or the infrequency of disease, virtually impossible
to perform.47 Consequently, they do not exist for a large array of environmental
agents. Where both animal toxicologic and epidemiologic studies are available,
no universal rules exist for how to interpret or reconcile them.48 Careful assess-
45. For a further discussion of these issues, see Bernard D. Goldstein & Mary Sue Henifin,
Reference Guide on Toxicology, Section III.A, in this manual.
46. IARC, a well-regarded international public health agency, evaluates the human carcinogenicity of various agents. In doing so, IARC obtains all of the relevant evidence, including animal
studies as well as any human studies. On the basis of a synthesis and evaluation of that evidence,
IARC publishes a monograph containing that evidence and its analysis of the evidence and provides a categorical assessment of the likelihood the agent is carcinogenic. In a preamble to each
of its monographs, IARC explains what each of the categorical assessments means. Solely on the
basis of the strength of animal studies, IARC may classify a substance as “probably carcinogenic to
humans.” International Agency for Research on Cancer, Human Papillomaviruses, 90 Monographs on
the Evaluation of Carcinogenic Risks to Humans 9–10 (2007), available at http://monographs.iarc.fr/
ENG/Monographs/vol90/index.php; see also Magistrini v. One Hour Martinizing Dry Cleaning, 180
F. Supp. 2d 584, 600 n.18 (D.N.J. 2002). When IARC monographs are available, they are generally recognized as authoritative. Unfortunately, IARC has conducted evaluations of only a fraction
of potentially carcinogenic agents, and many suspected toxic agents cause effects other than cancer.
47. Thus, in a series of cases involving Parlodel, a lactation suppressant for mothers of newborns,
efforts to conduct an epidemiologic study of its effect on causing strokes were stymied by the infrequency of such strokes in women of child-bearing age. See, e.g., Brasher v. Sandoz Pharms. Corp.,
160 F. Supp. 2d 1291, 1297 (N.D. Ala. 2001). In other cases, a plaintiff’s exposure to an overdose
of a drug may be unique or nearly so. See Zuchowicz v. United States, 140 F.3d 381 (2d Cir. 1998).
48. See IARC, supra note 42 (identifying a number of substances and comparing animal toxicology evidence with epidemiologic evidence); Michele Carbone et al., Modern Criteria to Establish Human
Cancer Etiology, 64 Cancer Res. 5518, 5522 (2004) (National Cancer Institute symposium concluding
that “There should be no hierarchy [among different types of scientific methods to determine cancer
causation]. Epidemiology, animal, tissue culture and molecular pathology should be seen as integrating
evidences in the determination of human carcinogenicity.”)
A number of courts have grappled with the role of animal studies in proving causation in a toxic
substance case. One line of cases takes a very dim view of their probative value. For example, in Brock
v. Merrell Dow Pharmaceuticals, Inc., 874 F.2d 307, 313 (5th Cir. 1989), the court noted the “very limited
usefulness of animal studies when confronted with questions of toxicity.” A similar view is reflected
in Richardson v. Richardson-Merrell, Inc., 857 F.2d 823, 830 (D.C. Cir. 1988), Bell v. Swift Adhesives,
Inc., 804 F. Supp. 1577, 1579–80 (S.D. Ga. 1992), and Cadarian v. Merrell Dow Pharmaceuticals, Inc.,
745 F. Supp. 409, 412 (E.D. Mich. 1989).
Other courts have been more amenable to the use of animal toxicology in proving causation.
Thus, in Marder v. G.D. Searle & Co., 630 F. Supp. 1087, 1094 (D. Md. 1986), aff’d sub nom. Wheelahan
v. G.D. Searle & Co., 814 F.2d 655 (4th Cir. 1987), the court observed: “There is a range of scientific
ment of the methodological validity and power49 of the epidemiologic evidence
must be undertaken, and the quality of the toxicologic studies and the questions
of interspecies extrapolation and dose–response relationship must be considered.50
methods for investigating questions of causation—for example, toxicology and animal studies, clinical research, and epidemiology—which all have distinct advantages and disadvantages.” In Milward
v. Acuity Specialty Products Group, Inc., 639 F.3d 11, 17-19 (1st Cir. 2011), the court endorsed an
expert’s use of a “weight-of-the-evidence” methodology, holding that the district court abused its
discretion in ruling inadmissible an expert’s testimony about causation based on that methodology. As
a corollary to recognizing weight of the evidence as a valid scientific technique, the court also noted
the role of judgment in making an appropriate inference from the evidence. While recognizing the
legitimacy of the methodology, the court also acknowledged that, as with any scientific technique,
it can be improperly applied. See also Metabolife Int’l, Inc. v. Wornick, 264 F.3d 832, 842 (9th Cir.
2001) (holding that the lower court erred in per se dismissing animal studies, which must be examined to determine whether they are appropriate as a basis for causation determination); In re Heparin
Prods. Liab. Litig. 2011 WL 2971918 (N.D. Ohio July 21, 2011) (holding that animal toxicology in
conjunction with other non-epidemiologic evidence can be sufficient to prove causation); Ruff v.
Ensign-Bickford Indus., Inc., 168 F. Supp. 2d 1271, 1281 (D. Utah 2001) (affirming animal studies as
sufficient basis for opinion on general causation); cf. In re Paoli R.R. Yard PCB Litig., 916 F.2d 829,
853–54 (3d Cir. 1990) (questioning the exclusion of animal studies by the lower court). The Third
Circuit in a subsequent opinion in Paoli observed:
[I]n order for animal studies to be admissible to prove causation in humans, there must be good grounds
to extrapolate from animals to humans, just as the methodology of the studies must constitute good
grounds to reach conclusions about the animals themselves. Thus, the requirement of reliability, or
“good grounds,” extends to each step in an expert’s analysis all the way through the step that connects
the work of the expert to the particular case.
In re Paoli R.R. Yard PCB Litig., 35 F.3d 717, 743 (3d Cir. 1994); see also Cavallo v. Star Enter., 892
F. Supp. 756, 761–63 (E.D. Va. 1995) (courts must examine each of the steps that lead to an expert’s
opinion), aff’d in part and rev’d in part, 100 F.3d 1150 (4th Cir. 1996).
One explanation for these conflicting lines of cases may be that when there is a substantial body
of epidemiologic evidence that addresses the causal issue, animal toxicology has much less probative
value. That was the case, for example, in the Bendectin cases of Richardson, Brock, and Cadarian. Where
epidemiologic evidence is not available, animal toxicology may be thought to play a more prominent
role in resolving a causal dispute. See Michael D. Green, Expert Witnesses and Sufficiency of Evidence in
Toxic Substances Litigation: The Legacy of Agent Orange and Bendectin Litigation, 86 Nw. U. L. Rev. 643,
680–82 (1992) (arguing that plaintiffs should be required to prove causation by a preponderance of the
available evidence); Turpin v. Merrell Dow Pharms., Inc., 959 F.2d 1349, 1359 (6th Cir. 1992); In re
Paoli R.R. Yard PCB Litig., No. 86-2229, 1992 U.S. Dist. LEXIS 16287, at *16 (E.D. Pa. 1992). For
another explanation of these cases, see Gerald W. Boston, A Mass-Exposure Model of Toxic Causation:
The Control of Scientific Proof and the Regulatory Experience, 18 Colum. J. Envtl. L. 181 (1993) (arguing
that epidemiologic evidence should be required in mass-exposure cases but not in isolated-exposure
cases); see also IARC, supra note 42; Bernard D. Goldstein & Mary Sue Henifin, Reference Guide
on Toxicology, Section I.F, in this manual. The Supreme Court, in General Electric Co. v. Joiner, 522
U.S. 136, 144–45 (1997), suggested that there is no categorical rule for toxicologic studies, observing,
“[W]hether animal studies can ever be a proper foundation for an expert’s opinion [is] not the issue. . . .
The [animal] studies were so dissimilar to the facts presented in this litigation that it was not an abuse
of discretion for the District Court to have rejected the experts’ reliance on them.”
49. See infra Section IV.A.3.
50. See Ellen F. Heineman & Shelia Hoar Zahm, The Role of Epidemiology in Hazard Evaluation,
9 Toxic Substances J. 255, 258–62 (1989).
III. How Should Results of an
Epidemiologic Study Be Interpreted?
Epidemiologists are ultimately interested in whether a causal relationship exists
between an agent and a disease. However, the first question an epidemiologist
addresses is whether an association exists between exposure to the agent and disease. An association between exposure to an agent and disease exists when they
occur together more frequently than one would expect by chance.51 Although a
causal relationship is one possible explanation for an observed association between
an exposure and a disease, an association does not necessarily mean that there is
a cause–effect relationship. Interpreting the meaning of an observed association
is discussed below.
This section begins by describing the ways of expressing the existence and
strength of an association between exposure and disease. It reviews ways in which
an incorrect result can be produced because of the sampling methods used in all
observational epidemiologic studies and then examines statistical methods for
evaluating whether an association is real or the result of a sampling error.
The strength of an association between exposure and disease can be stated in
various ways,52 including as a relative risk, an odds ratio, or an attributable risk.53
Each of these measurements of association examines the degree to which the risk
of disease increases when individuals are exposed to an agent.
A. Relative Risk
A commonly used approach for expressing the association between an agent and
disease is relative risk (“RR”). It is defined as the ratio of the incidence rate (often
referred to as incidence) of disease in exposed individuals to the incidence rate in
unexposed individuals:
RR = (Incidence rate in the exposed) / (Incidence rate in the unexposed)
51. A negative association implies that the agent has a protective or curative effect. Because the
concern in toxic substances litigation is whether an agent caused disease, this reference guide focuses
on positive associations.
52. Another outcome measure is a risk difference. A risk difference is the difference between
the proportion of disease in those exposed to the agent and the proportion of disease in those who
were unexposed. Thus, in the example discussed in the text below on relative risk, the
proportion of disease in those exposed is 40/100 and the proportion of disease in the unexposed is
20/200 (or 10/100). The risk difference is 30/100, or 0.3.
53. Numerous courts have employed these measures of the strength of an association. See, e.g., In re
Bextra & Celebrex Mktg. Sales Practices & Prod. Liab. Litig., 524 F. Supp. 2d 1166, 1172–74 (N.D. Cal.
2007); Cook v. Rockwell Int’l Corp., 580 F. Supp. 2d 1071, 1095 (D. Colo. 2006) (citing the second
edition of this reference guide); In re W.R. Grace & Co., 355 B.R. 462, 482–83 (Bankr. D. Del. 2006).
The incidence rate of disease is defined as the number of cases of disease that
develop during a specified period of time divided by the number of persons in the
cohort under study.54 Thus, the incidence rate expresses the risk that a member of
the population will develop the disease within a specified period of time.
For example, a researcher studies 100 individuals who are exposed to an
agent and 200 who are not exposed. After 1 year, 40 of the exposed individuals
are diagnosed as having a disease, and 20 of the unexposed individuals also are
diagnosed as having the disease. The relative risk of contracting the disease is
calculated as follows:
• The incidence rate of disease in the exposed individuals is 40 cases per year
per 100 persons (40/100), or 0.4.
• The incidence rate of disease in the unexposed individuals is 20 cases per
year per 200 persons (20/200), or 0.1.
• The relative risk is calculated as the incidence rate in the exposed group
(0.4) divided by the incidence rate in the unexposed group (0.1), or 4.0.
A relative risk of 4.0 indicates that the risk of disease in the exposed group is four
times as high as the risk of disease in the unexposed group.55
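As a minimal computational sketch of the arithmetic above (the function name `relative_risk` is illustrative, and the figures are those of the hypothetical cohort in the text):

```python
def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    # Incidence rate: new cases during the study period divided by
    # the number of persons in that group.
    incidence_exposed = cases_exposed / n_exposed        # 40/100 = 0.4
    incidence_unexposed = cases_unexposed / n_unexposed  # 20/200 = 0.1
    # Relative risk: ratio of the two incidence rates.
    return incidence_exposed / incidence_unexposed

print(relative_risk(40, 100, 20, 200))  # -> 4.0
```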
In general, the relative risk can be interpreted as follows:
• If the relative risk equals 1.0, the risk in exposed individuals is the same
as the risk in unexposed individuals.56 There is no association between
exposure to the agent and disease.
• If the relative risk is greater than 1.0, the risk in exposed individuals is
greater than the risk in unexposed individuals. There is a positive association between exposure to the agent and the disease, which could be
causal.
• If the relative risk is less than 1.0, the risk in exposed individuals is less than
the risk in unexposed individuals. There is a negative association, which
could reflect a protective or curative effect of the agent on the risk of disease.
For example, immunizations lower the risk of disease: immunization is
associated with a decrease in disease and may have a protective effect.
Although relative risk is a straightforward concept, care must be taken in
interpreting it. Whenever an association is uncovered, further analysis should be
54. Epidemiologists also use the concept of prevalence, which measures the existence of disease in
a population at a given point in time, regardless of when the disease developed. Prevalence is expressed as
the proportion of the population with the disease at the chosen time. See Gordis, supra note 32, at 43–47.
55. See DeLuca v. Merrell Dow Pharms., Inc., 911 F.2d 941, 947 (3d Cir. 1990); Magistrini v.
One Hour Martinizing Dry Cleaning, 180 F. Supp. 2d 584, 591 (D.N.J. 2002).
56. See Magistrini, 180 F. Supp. 2d at 591.
conducted to assess whether the association is real or a result of sampling error,
confounding, or bias.57 These same sources of error may mask a true association,
resulting in a study that erroneously finds no association.
B. Odds Ratio
The odds ratio (“OR”) is similar to a relative risk in that it expresses in quantitative terms the association between exposure to an agent and a disease.58 In a
case-control study, it provides a convenient way to estimate the relative risk,
because the odds ratio approximates the relative risk when the disease under
investigation is rare.59,60
In a case-control study, the odds ratio is the ratio of the odds that a case (one
with the disease) was exposed to the odds that a control (one without the disease)
was exposed. In a cohort study, the odds ratio is the ratio of the odds of developing a disease when exposed to a suspected agent to the odds of developing the
disease when not exposed.
Consider a case-control study, with results as shown schematically in a 2 × 2
table (Table 2):
Table 2. Cross-tabulation of cases and controls by exposure status

                 Cases             Controls
                 (with disease)    (no disease)
Exposed          a                 b
Not exposed      c                 d
In a case-control study,

OR = (Odds that a case was exposed) / (Odds that a control was exposed).
57. See infra Sections IV.B–C.
58. A relative risk cannot be calculated for a case-control study, because a case-control study
begins by examining a group of persons who already have the disease. That aspect of the study design
prevents a researcher from determining the rate at which individuals develop the disease. Without a
rate or incidence of disease, a researcher cannot calculate a relative risk.
59. If the disease is not rare, the odds ratio is still valid to determine whether an association
exists, but interpretation of its magnitude is less intuitive.
60. See Marcello Pagano & Kimberlee Gauvreau, Principles of Biostatistics 354 (2d ed. 2000).
For further detail about the odds ratio and its calculation, see Kahn & Sempos, supra note 31, at 47–56.
Looking at Table 2, this ratio can be calculated as (a/c) / (b/d).
This works out to ad/bc. Because we are multiplying two diagonal cells in the
table and dividing by the product of the other two diagonal cells, the odds ratio
is also called the cross-products ratio.
Consider the following hypothetical study: A researcher identifies 100 individuals with a disease who serve as “cases” and 100 people without the disease
who serve as “controls” for her case-control study. Forty of the 100 cases were
exposed to the agent and 60 were not. Among the control group, 20 people
were exposed and 80 were not. The data can be presented in a 2 × 2 table
(Table 3):
Table 3. Case-Control Study Outcome

                 Cases             Controls
                 (with disease)    (no disease)
Exposed          40                20
Not exposed      60                80
The calculation of the odds ratio would be:

OR = (40/60) / (20/80) = 2.67.
If the disease is relatively rare in the general population (about 5% or less), the
odds ratio is a good approximation of the relative risk, which here means that there is
almost a tripling of the risk of disease in those exposed to the agent.61
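The cross-products computation can be sketched as follows (the function name is illustrative; the cell labels follow Table 2, and the numbers are those of the hypothetical study in Table 3):

```python
def odds_ratio(a, b, c, d):
    # a = exposed cases, b = exposed controls,
    # c = unexposed cases, d = unexposed controls (the 2x2 layout of Table 2).
    # The odds ratio is the cross-products ratio ad/bc,
    # which is algebraically the same as (a/c) / (b/d).
    return (a * d) / (b * c)

print(round(odds_ratio(40, 20, 60, 80), 2))  # -> 2.67
```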
61. The odds ratio is usually marginally greater than the relative risk. As the disease in question
becomes more common, the difference between the odds ratio and the relative risk grows.
The reason why the odds ratio approximates the relative risk when the incidence of disease is
small can be demonstrated by referring to Table 2. The odds ratio, as stated in the text, is ad/bc. The
relative risk for such a study would compare the incidence of disease in the exposed group, or a/(a + b),
with the incidence of disease in the unexposed group, or c/(c + d). The relative risk would be:

RR = [a / (a + b)] / [c / (c + d)] = a(c + d) / c(a + b)
When the incidence of disease is low, a and c will be small in relation to b and d, and the relative
risk will then approximate the odds ratio of ad/bc. See Leon Gordis, Epidemiology 208–09 (4th
ed. 2009).
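The rare-disease approximation described in note 61 can be checked numerically (an illustrative sketch; the function names and the cell counts are assumptions chosen for the demonstration):

```python
def cohort_rr(a, b, c, d):
    # Relative risk from a full 2x2 table: incidence among the exposed,
    # a/(a + b), divided by incidence among the unexposed, c/(c + d).
    return (a / (a + b)) / (c / (c + d))

def cross_product_or(a, b, c, d):
    # Odds ratio ad/bc.
    return (a * d) / (b * c)

# Rare disease: cases (a, c) are small relative to non-cases (b, d),
# so the odds ratio closely approximates the relative risk.
print(round(cohort_rr(10, 990, 5, 995), 2),
      round(cross_product_or(10, 990, 5, 995), 2))  # -> 2.0 2.01

# Common disease: the two measures diverge.
print(round(cohort_rr(40, 60, 20, 80), 2),
      round(cross_product_or(40, 60, 20, 80), 2))   # -> 2.0 2.67
```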
C. Attributable Risk
A frequently used measurement of risk is the attributable risk (“AR”). The attributable risk represents the amount of disease among exposed individuals that can be
attributed to the exposure. It also can be expressed as the proportion of the disease
among exposed individuals that is associated with the exposure (also called the
“attributable proportion of risk,” the “etiologic fraction,” or the “attributable risk
percent”). The attributable risk reflects the maximum proportion of the disease
that can be attributed to exposure to an agent and consequently the maximum
proportion of disease that could be potentially prevented by blocking the effect of
the exposure or by eliminating the exposure.62 In other words, if the association is
causal, the attributable risk is the proportion of disease in an exposed population
that might be caused by the agent and that might be prevented by eliminating
exposure to that agent (see Figure 3).63
Figure 3. Risks in exposed and unexposed groups. [The figure depicts the incidence in the exposed group as the sum of incidence due to exposure and incidence not due to exposure; the unexposed group shows only incidence not due to exposure.]
To determine the proportion of a disease that is attributable to an exposure, a
researcher would need to know the incidence of the disease in the exposed group
and the incidence of disease in the unexposed group. The attributable risk is
AR = [(incidence in the exposed) − (incidence in the unexposed)] / (incidence in the exposed)
62. Kenneth J. Rothman et al., Modern Epidemiology 297 (3d ed. 2008); see also Landrigan v.
Celotex Corp., 605 A.2d 1079, 1086 (N.J. 1992) (illustrating that a relative risk of 1.55 corresponds to
an attributable risk of 35%, that is, (1.55 − 1.0)/1.55 = .35, or 35%).
63. Risk is not zero for the control group (those not exposed) when there are other causal chains
that cause the disease that do not require exposure to the agent. For example, some birth defects are
the result of genetic sources, which do not require the presence of any environmental agent. Also,
some degree of risk in the control group may be the result of background exposure to the agent being
studied. For example, nonsmokers in a control group may have been exposed to passive cigarette
smoke, which is responsible for some cases of lung cancer and other diseases. See also Ethyl Corp. v.
EPA, 541 F.2d 1, 25 (D.C. Cir. 1976). There are some diseases that do not occur without exposure
to an agent; these are known as signature diseases. See infra note 177.
Reference Guide on Epidemiology
The attributable risk can be calculated using the example described in
Section III.A. Suppose a researcher studies 100 individuals who are exposed to
a substance and 200 who are not exposed. After 1 year, 40 of the exposed individuals are diagnosed as having a disease, and 20 of the unexposed individuals
are also diagnosed as having the disease.
• The incidence of disease in the exposed group is 40 persons out of 100 who contract the disease in a year.
• The incidence of disease in the unexposed group is 20 persons out of 200 (or 10 out of 100) who contract the disease in a year.
• The proportion of disease that is attributable to the exposure is 30 persons out of 40, or 75%.
This means that 75% of the disease in the exposed group is attributable to the
exposure. We should emphasize here that “attributable” does not necessarily mean
“caused by.” Up to this point, we have only addressed associations. Inferring causation from an association is addressed in Section V.
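The arithmetic of the example above can be sketched in a few lines of Python. This is an illustrative sketch we add here; the code and variable names are ours and do not come from the manual or the studies it discusses:

```python
# Illustrative sketch of the attributable risk calculation for the
# hypothetical study above (100 exposed and 200 unexposed subjects).
incidence_exposed = 40 / 100      # 40 cases among 100 exposed persons
incidence_unexposed = 20 / 200    # 20 cases among 200 unexposed persons

# AR = (incidence in exposed - incidence in unexposed) / incidence in exposed
attributable_risk = (incidence_exposed - incidence_unexposed) / incidence_exposed
print(attributable_risk)  # ≈ 0.75, i.e., 75% of disease in the exposed group
```

As the text emphasizes, this 75% figure describes an association; whether it reflects causation is a separate question taken up in Section V.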
D. Adjustment for Study Groups That Are Not Comparable
Populations often differ in characteristics that relate to disease risk, such as age,
sex, and race. Those who live in Florida have a much higher death rate than those
who live in Alaska.64 Is sunshine dangerous? Perhaps, but the Florida population
is much older than the Alaska population, and some adjustment must be made for
the differences in age distribution in the two states in order to compare disease
or death rates between populations. The technique used to accomplish this is
called adjustment, and two types of adjustment are used—direct and indirect. In
direct adjustment (e.g., when based on age), overall disease/death rates are calculated for each population as though each had the age distribution of a single standard, or reference, population, using the age-specific disease/death rates for each
study population. We can then compare these overall rates, called age-adjusted
rates, knowing that any difference between these rates cannot be attributed to
differences in age, since both age-adjusted rates were generated using the same
standard population.
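Direct adjustment can be sketched with invented numbers (none of which come from the manual): here two populations share identical age-specific death rates but have different age structures, so their crude rates differ while their age-adjusted rates coincide.

```python
# Hypothetical age-specific death rates (per 1,000 persons), shared by
# both populations in this sketch; only the age structures differ.
rates = {"young": 1.0, "middle": 4.0, "old": 20.0}
pop_a    = {"young": 200, "middle": 300, "old": 500}   # older population
pop_b    = {"young": 500, "middle": 300, "old": 200}   # younger population
standard = {"young": 400, "middle": 300, "old": 300}   # reference population

def crude_rate(rates, pop):
    # overall rate weighted by the population's own age distribution
    return sum(rates[age] * pop[age] for age in pop) / sum(pop.values())

def age_adjusted_rate(rates, standard):
    # the same age-specific rates applied to the standard age distribution
    return sum(rates[age] * standard[age] for age in standard) / sum(standard.values())

print(crude_rate(rates, pop_a))   # ≈ 11.4 per 1,000: looks far riskier
print(crude_rate(rates, pop_b))   # ≈ 5.7 per 1,000
# Adjusted to the same standard, the gap vanishes (≈ 7.6 for both).
print(age_adjusted_rate(rates, standard))
```

Because both adjusted rates are computed against one shared standard, any difference that remained between them could not be attributed to age, which is the point made in the text.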
Indirect adjustment is used when the age-specific rates for a study population are not known. In that case, the overall disease/death rate for the standard/
reference population is recalculated based on the age distribution of the population
of interest using the age-specific rates of the standard population. Then, the actual
number of disease cases/deaths in the population of interest can be compared with
64. See Lilienfeld & Stolley, supra note 35, at 68–70 (the mortality rate in Florida is approximately three times what it is in Alaska).
the number in the reference population that would be expected if the reference
population had the age distribution of the population of interest.
This ratio is called the standardized mortality ratio (SMR). When the outcome of interest is disease rather than death, it is called the standardized morbidity
ratio.65 If the ratio equals 1.0, the observed number of deaths equals the expected
number of deaths, and the mortality rate of the population of interest is no different from that of the reference population. If the SMR is greater than 1.0,
the population of interest has a higher mortality risk than that of the reference
population, and if the SMR is less than 1.0, the population of interest has a lower
mortality rate than that of the reference population.
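Indirect adjustment can be sketched the same way (again with invented numbers that are not drawn from the manual): the standard population's age-specific rates are applied to the study population's age distribution to obtain the expected number of deaths, and the SMR is the ratio of observed to expected deaths.

```python
# Standard population's age-specific death rates (per person-year) and a
# hypothetical worker cohort whose own age-specific rates are unknown.
standard_rates = {"young": 0.001, "middle": 0.004, "old": 0.020}
cohort         = {"young": 2000, "middle": 3000, "old": 5000}
observed_deaths = 150

# expected deaths if the cohort experienced the standard rates at each age
expected_deaths = sum(standard_rates[age] * cohort[age] for age in cohort)

smr = observed_deaths / expected_deaths  # ≈ 1.32 in this sketch
# An SMR above 1.0 means the cohort's observed mortality exceeds what its
# own age structure would predict under the reference population's rates.
print(expected_deaths, smr)
```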
Thus, age adjustment provides a way to compare populations while in effect
holding age constant. Adjustment is used not only for comparing mortality rates
in different populations but also for comparing rates in different groups of subjects
selected for study in epidemiologic investigations. Although this discussion has
focused on adjusting for age, it is also possible to adjust for any number of other
variables, such as gender, race, occupation, and socioeconomic status. It is also
possible to adjust for several factors simultaneously.66
IV. What Sources of Error Might Have
Produced a False Result?
Incorrect study results occur in a variety of ways. A study may find a positive
association (relative risk greater than 1.0) when there is no true association. Or a
study may erroneously find that there is no association when in
reality there is. A study may also find an association when one truly exists, but the
association found may be greater or less than the real association.
Three general categories of phenomena can cause an association found in a study to be erroneous: chance, bias, and confounding. Before any inferences
about causation are drawn from a study, the possibility of these phenomena must
be examined.67
65. See Taylor v. Airco, Inc., 494 F. Supp. 2d 21, 25 n.4 (D. Mass. 2007) (explaining SMR and
its relationship with relative risk). For an example of adjustment used to calculate an SMR for workers
exposed to benzene, see Robert A. Rinsky et al., Benzene and Leukemia: An Epidemiologic Risk Assessment, 316 New Eng. J. Med. 1044 (1987).
66. For further elaboration on adjustment, see Gordis, supra note 32, at 73–78; Philip Cole,
Causality in Epidemiology, Health Policy, and Law, 27 Envtl. L. Rep. 10,279, 10,281 (1997).
67. See Cole, supra note 65, at 10,285. In DeLuca v. Merrell Dow Pharmaceuticals, Inc., 911 F.2d
941, 955 (3d Cir. 1990), the court recognized and discussed random sampling error. It then went on
to refer to other errors (e.g., systematic bias) that create as much or more error in the outcome of a
study. For a similar description of error in study procedure and random sampling, see David H. Kaye
& David A. Freedman, Reference Guide on Statistics, Section IV, in this manual.
The findings of a study may be the result of chance (or random error). In
designing a study, the size of the sample can be increased to reduce (but not eliminate) the likelihood of random error. Once a study has been completed, statistical
methods (discussed in Section IV.A) permit an assessment of the extent to which
the results of a study may be due to random error.
The two main techniques for assessing random error are statistical significance
and confidence intervals. A study that is statistically significant has results that are
unlikely to be the result of random error, although any criterion for “significance”
is somewhat arbitrary. A confidence interval provides both the relative risk (or
other risk measure) found in the study and a range (interval) within which the risk
likely would fall if the study were repeated numerous times. These two techniques
(which are closely related) are explained in Section IV.A.
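As a concrete illustration of a confidence interval, one common textbook approximation for a relative risk works through the standard error of the log relative risk. The sketch below is ours, not the manual's, and it reuses the hypothetical cohort of Section III.A (40 of 100 exposed and 20 of 200 unexposed persons developing disease):

```python
import math

a, n1 = 40, 100   # cases and total in the exposed group
c, n0 = 20, 200   # cases and total in the unexposed group

rr = (a / n1) / (c / n0)                      # relative risk = 4.0
se = math.sqrt(1/a - 1/n1 + 1/c - 1/n0)       # std. error of ln(RR)
lower = math.exp(math.log(rr) - 1.96 * se)    # ≈ 2.5
upper = math.exp(math.log(rr) + 1.96 * se)    # ≈ 6.5

# The approximate 95% interval (≈ 2.5 to 6.5) excludes 1.0, so this
# hypothetical association would also be statistically significant at .05.
print(rr, lower, upper)
```

This log-based approximation is only one standard method; the manual does not endorse any particular formula, and the exact interval depends on the method chosen.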
We should emphasize a matter that those unfamiliar with statistical methodology frequently find confusing: That a study’s results are statistically significant
says nothing about the importance of the magnitude of any association (i.e., the
relative risk or odds ratio) found in a study or about the biological or clinical
importance of the finding.68 “Significant,” as used with the adjective “statistically,”
does not mean important. A study may find a statistically significant relationship
that is quite modest—perhaps it increases the risk only by 5%, which is equivalent
to a relative risk of 1.05.69 An association may be quite large—the exposed cohort
might be 10 times more likely to develop disease than the control group—but
the association is not statistically significant because of the potential for random
error given a small sample size. In short, statistical significance is not about the size of
the risk found in a study.
Bias (or systematic error) also can produce error in the outcome of a study.
Epidemiologists attempt to minimize bias through their study design, including
data collection protocols. Study designs are developed before researchers begin gathering
data. However, even the best designed and conducted studies have biases, which
may be subtle. Consequently, after data collection is completed, analytical tools
are often used to evaluate potential sources of bias. Sometimes, after bias is identified, the epidemiologist can determine whether the bias would tend to inflate
or dilute any association that may exist. Identification of the bias may permit the
68. See Modern Scientific Evidence, supra note 2, § 6.36 at 358 (“Statisticians distinguish
between ‘statistical’ and ‘practical’ significance. . . .”); Cole, supra note 65, at 10,282. Understandably,
some courts have been confused about the relationship between statistical significance and the magnitude of the association. See Hyman & Armstrong, P.S.C. v. Gunderson, 279 S.W.3d 93, 102 (Ky.
2008) (describing a small increased risk as being considered statistically insignificant and a somewhat
larger risk as being considered statistically significant); In re Pfizer Inc. Sec. Litig., 584 F. Supp. 2d
621, 634–35 (S.D.N.Y. 2008) (confusing the magnitude of the effect with whether the effect was
statistically significant); In re Joint E. & S. Dist. Asbestos Litig., 827 F. Supp. 1014, 1041 (S.D.N.Y.
1993) (concluding that any relative risk less than 1.50 is statistically insignificant), rev’d on other grounds,
52 F.3d 1124 (2d Cir. 1995).
69. In general, detecting small effects with statistical significance requires larger sample sizes. When effects are larger, fewer subjects generally are required to produce statistically significant findings.
epidemiologist to make an assessment of whether the study’s conclusions are valid.
Epidemiologists may reanalyze a study’s data to correct for a bias identified in a
completed study or to validate the analytical methods used.70 Common biases and
how they may produce invalid results are described in Section IV.B.
Finally, a study may reach incorrect conclusions about causation because,
although the agent and disease are associated, the agent is not a true causal factor.
Rather, the agent may be associated with another agent that is the true causal factor, and this latter factor confounds the relationship being examined in the study.
Confounding is explained in Section IV.C.
A. What Statistical Methods Exist to Evaluate the Possibility
of Sampling Error?71
Before detailing the statistical methods used to assess random error (which we use
as synonymous with sampling error), two concepts are explained that are central
to epidemiology and statistical analysis. Understanding these concepts should
facilitate comprehension of the statistical methods.
Epidemiologists often refer to the true association (also called “real association”), which is the association that really exists between an agent and a disease
and that might be found by a perfect (but nonexistent) study. The true association
is a concept that is used in evaluating the results of a given study even though
its value is unknown. By contrast, a study’s outcome will produce an observed
association, which is known.
Formal procedures for statistical testing begin with the null hypothesis, which
posits that there is no true association (i.e., a relative risk of 1.0) between the
agent and disease under study. Data are gathered and analyzed to see whether they
disprove72 the null hypothesis. The data are subjected to statistical testing to assess
the plausibility that any association found is a result of random error or whether
it supports rejection of the null hypothesis. The use of the null hypothesis for this
testing should not be understood as the a priori belief of the investigator. When
epidemiologists investigate an agent, it is usually because they hypothesize that
the agent is a cause of some outcome. Nevertheless, epidemiologists prepare their
70. E.g., Richard A. Kronmal et al., The Intrauterine Device and Pelvic Inflammatory Disease: The
Women’s Health Study Reanalyzed, 44 J. Clin. Epidemiol. 109 (1991) (a reanalysis of a study that found
an association between the use of IUDs and pelvic inflammatory disease concluded that IUDs do not
increase the risk of pelvic inflammatory disease).
71. For a bibliography on the role of statistical significance in legal proceedings, see Sanders,
supra note 13, at 329 n.138.
72. See, e.g., Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579, 593 (1993) (scientific methodology involves generating and testing hypotheses).
study designs and test the plausibility that any association found in a study was the
result of random error by using the null hypothesis.73
1. False positives and statistical significance
When a study results in a positive association (i.e., a relative risk greater than 1.0),
epidemiologists try to determine whether that outcome represents a true association or is the result of random error.74 Random error is illustrated by a fair coin
(i.e., not modified to produce more heads than tails [or vice versa]). On average,
for example, we would expect that coin tosses would yield half heads and half tails.
But sometimes, a set of coin tosses might yield an unusual result, for example, six
heads out of six tosses,75 an outcome that would occur, purely by chance, in fewer
than 2% of all series of six tosses. In the world of epidemiology, sometimes the study
findings, merely by chance, do not reflect the true relationships between an agent
and outcome. Any single study—even a clinical trial—is in some ways analogous
to a set of coin tosses, being subject to the play of chance. Thus, for example,
even though the true relative risk (in the total population) is 1.0, an epidemiologic
study of a particular study population may find a relative risk greater than (or less
73. See DeLuca v. Merrell Dow Pharms., Inc., 911 F.2d 941, 945 (3d Cir. 1990); United States
v. Philip Morris USA, Inc., 449 F. Supp. 2d 1, 706 n.29 (D.D.C. 2006); Stephen E. Fienberg et al.,
Understanding and Evaluating Statistical Evidence in Litigation, 36 Jurimetrics J. 1, 21–24 (1995).
74. Hypothesis testing is one of the most counterintuitive techniques in statistics. Given a set
of epidemiologic data, one wants to ask the straightforward, obvious question: What is the probability that the difference between two samples reflects a real difference between the populations
from which they were taken? Unfortunately, there is no way to answer this question directly or to
calculate the probability. Instead, statisticians—and epidemiologists—address a related but very different question: If there really is no difference between the populations, how probable is it that one
would find a difference at least as large as the observed difference between the samples? See Modern
Scientific Evidence, supra note 2, § 6:36, at 359 (“it is easy to mistake the p-value for the probability
that there is no difference”); Expert Evidence: A Practitioner’s Guide to Law, Science, and the FJC
Manual 91 (Bert Black & Patrick W. Lee eds., 1997). Thus, the p-value for a given study does not
provide a rate of error or even a probability of error for an epidemiologic study. In Daubert v. Merrell
Dow Pharmaceuticals, Inc., 509 U.S. 579, 593 (1993), the Court stated that “the known or potential
rate of error” should ordinarily be considered in assessing scientific reliability. Epidemiology, however,
unlike some other methodologies—fingerprint identification, for example—does not permit an assessment of its accuracy by testing with a known reference standard. A p-value provides information only
about the plausibility of random error given the study result, but the true relationship between agent
and outcome remains unknown. Moreover, a p-value provides no information about whether other
sources of error—bias and confounding—exist and, if so, their magnitude. In short, for epidemiology,
there is no way to determine a rate of error. See Kumho Tire Co. v. Carmichael, 526 U.S. 137, 151
(1999) (recognizing that for different scientific and technical inquiries, different considerations will
be appropriate for assessing reliability); Cook v. Rockwell Int’l Corp., 580 F. Supp. 2d 1071, 1100
(D. Colo. 2006) (“Defendants have not argued or presented evidence that . . . a method by which an
overall ‘rate of error’ can be calculated for an epidemiologic study.”)
75. DeLuca, 911 F.2d at 946–47.
than) 1.0 because of random error or chance.76 An erroneous conclusion that the
null hypothesis is false (i.e., a conclusion that there is a difference in risk when
no difference actually exists) owing to random error is called a false-positive error
(also Type I error or alpha error).
Common sense leads one to believe that a large enough sample of individuals
must be studied if the study is to identify a relationship between exposure to an
agent and disease that truly exists. Common sense also suggests that by enlarging
the sample size (the size of the study group), researchers can form a more accurate
conclusion and reduce the chance of random error in their results. Both statements
are correct and can be illustrated by a test to determine if a coin is fair. A test in
which a fair coin is tossed 1000 times is more likely to produce close to 50% heads
than a test in which the coin is tossed only 10 times. It is far more likely that a
test of a fair coin with 10 tosses will come up, for example, with 80% heads than
will a test with 1000 tosses. With large numbers, the outcome of the test is less
likely to be influenced by random error, and the researcher would have greater
confidence in the inferences drawn from the data.77
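The coin-toss illustration can be checked directly against the binomial distribution. The short sketch below is added for illustration; the probabilities follow from first principles rather than from anything in the manual:

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    # probability of at least k heads in n tosses of a coin with P(heads) = p
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(prob_at_least(6, 6))       # 0.015625: six heads in six fair tosses, under 2%
print(prob_at_least(8, 10))      # 0.0546875: 80% or more heads is plausible in 10 tosses
print(prob_at_least(800, 1000))  # astronomically small: 80% heads in 1,000 tosses
```

With 1,000 tosses, large deviations from 50% heads become vanishingly unlikely, which is why larger samples give researchers greater confidence in the inferences drawn from their data.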
One means for evaluating the possibility that an observed association could
have occurred as a result of random error is by calculating a p-value.78 A p-value
represents the probability that an observed positive association could result from
random error even if no association were in fact present. Thus, a p-value of .1
means that there is a 10% chance that values at least as large as the observed relative
risk could have occurred by random error, with no association actually present
in the population.79
To minimize false positives, epidemiologists use a convention that the p-value
must fall below some selected level known as alpha or significance level for the
results of the study to be statistically significant.80 Thus, an outcome is statistically
significant when the observed p-value for the study falls below the preselected
76. See Magistrini v. One Hour Martinizing Dry Cleaning, 180 F. Supp. 2d 584, 592 (D.N.J.
2002) (citing the second edition of this reference guide).
77. This explanation of numerical stability was drawn from Brief for Professor Alvan R.
Feinstein as Amicus Curiae Supporting Respondents at 12–13, Daubert v. Merrell Dow Pharms., Inc.,
509 U.S. 579 (1993) (No. 92-102). See also Allen v. United States, 588 F. Supp. 247, 417–18 (D. Utah
1984), rev’d on other grounds, 816 F.2d 1417 (10th Cir. 1987). The Allen court observed that although
“[s]mall communities or groups of people are deemed ‘statistically unstable’” and “data from small
populations must be handled with care [, it] does not mean that [the data] cannot provide substantial
evidence in aid of our effort to describe and understand events.”
78. See also David H. Kaye & David A. Freedman, Reference Guide on Statistics, Section IV.B,
in this manual (the p-value reflects the implausibility of the null hypothesis).
79. Technically, a p-value of .1 means that if in fact there is no association, 10% of all similar
studies would be expected to yield an association the same as, or greater than, the one found in the
study due to random error.
80. Cook v. Rockwell Int’l Corp., 580 F. Supp. 2d 1071, 1100–01 (D. Colo. 2006) (discussing p-values and their relationship with statistical significance); Allen, 588 F. Supp. at 416–17 (discussing statistical significance and selection of a level of alpha); see also Sanders, supra note 13, at 343–44
(explaining alpha, beta, and their relationship to sample size); Developments in the Law—Confronting
significance level. The most common significance level, or alpha, used in science
is .05.81 A .05 value means that the probability is 5% of observing an association
at least as large as that found in the study when in truth there is no association.82
Although .05 is often the significance level selected, other levels can and have
been used.83 Thus, in its study of the effects of second-hand smoke, the U.S.
the New Challenges of Scientific Evidence, 108 Harv. L. Rev. 1481, 1535–36, 1540–46 (1995) [hereafter
Developments in the Law].
81. A common error made by lawyers, judges, and academics is to equate the level of alpha
with the legal burden of proof. Thus, one will often see a statement that using an alpha of .05 for
statistical significance imposes a burden of proof on the plaintiff far higher than the civil burden of
a preponderance of the evidence (i.e., greater than 50%). See, e.g., In re Ephedra Prods. Liab. Litig.,
393 F. Supp. 2d 181, 193 (S.D.N.Y. 2005); Marmo v. IBP, Inc., 360 F. Supp. 2d 1019, 1021 n.2 (D.
Neb. 2005) (an expert toxicologist who stated that science requires proof with 95% certainty while
expressing his understanding that the legal standard merely required more probable than not). But see
Giles v. Wyeth, Inc., 500 F. Supp. 2d 1048, 1056–57 (S.D. Ill. 2007) (quoting the second edition of
this reference guide).
Comparing a selected p-value with the legal burden of proof is mistaken, although the reasons are
a bit complex and a full explanation would require more space and detail than is feasible here. Nevertheless, we sketch out a brief explanation: First, alpha does not address the likelihood that a plaintiff’s
disease was caused by exposure to the agent; the magnitude of the association bears on that question.
See infra Section VII. Second, significance testing only bears on whether the observed magnitude of
association arose as a result of random chance, not on whether the null hypothesis is true. Third, using
stringent significance testing to avoid false-positive error comes at a complementary cost of inducing
false-negative error. Fourth, using an alpha of .5 would not be equivalent to saying that the probability the association found is real is 50%, and the probability that it is a result of random error is 50%.
Statistical methodology does not permit assessments of those probabilities. See Green, supra note 47, at
686; Michael D. Green, Science Is to Law as the Burden of Proof Is to Significance Testing, 37 Jurimetrics
J. 205 (1997) (book review); see also David H. Kaye, Apples and Oranges: Confidence Coefficients and
the Burden of Persuasion, 73 Cornell L. Rev. 54, 66 (1987); David H. Kaye & David A. Freedman,
Reference Guide on Statistics, Section IV.B.2, in this manual; Turpin v. Merrell Dow Pharms., Inc.,
959 F.2d 1349, 1357 n.2 (6th Cir. 1992), cert. denied, 506 U.S. 826 (1992); cf. DeLuca, 911 F.2d at 959
n.24 (“The relationship between confidence levels and the more likely than not standard of proof is
a very complex one . . . and in the absence of more education than can be found in this record, we
decline to comment further on it.”).
82. This means that if one conducted an examination of a large number of associations in which
the true RR equals 1, on average 1 in 20 associations would be found statistically significant at a .05 level, and each of those findings would be spurious. When researchers examine many possible associations that might exist in their
data—known as data dredging—we should expect that even if there are no true causal relationships,
those researchers will find statistically significant associations in 1 of every 20 associations examined.
See Rachel Nowak, Problems in Clinical Trials Go Far Beyond Misconduct, 264 Sci. 1538, 1539 (1994).
83. A significance test can be either one-tailed or two-tailed, depending on the null hypothesis
selected by the researcher. Because most investigators of toxic substances are only interested in
whether the agent increases the incidence of disease (as distinguished from providing protection
from the disease), a one-tailed test is often viewed as appropriate. In re Phenylpropanolamine (PPA)
Prods. Liab. Litig., 289 F. Supp. 2d 1230, 1241 (W.D. Wash. 2003) (accepting the propriety of a
one-tailed test for statistical significance in a toxic substance case); United States v. Philip Morris
USA, Inc., 449 F. Supp. 2d 1, 701 (D.D.C. 2006) (explaining the basis for EPA’s decision to use
one-tailed test in assessing whether second-hand smoke was a carcinogen). But see Good v. Fluor
Daniel Corp., 222 F. Supp. 2d 1236, 1243 (E.D. Wash. 2002). For an explanation of the difference
Environmental Protection Agency (EPA) used a .10 standard for significance
testing.84
There is some controversy among epidemiologists and biostatisticians about
the appropriate role of significance testing.85 To the strictest significance testers,
between one-tailed and two-tailed tests, see David H. Kaye & David A. Freedman, Reference Guide
on Statistics, Section IV.C.2, in this manual.
84. U.S. Environmental Protection Agency, Respiratory Health Effects of Passive Smoking:
Lung Cancer and Other Disorders (1992); see also Turpin, 959 F.2d at 1353–54 n.1 (confidence level
frequently set at 95%, although 90% (which corresponds to an alpha of .10) is also used; selection of
the value is “somewhat arbitrary”).
85. Similar controversy exists among the courts that have confronted the issue of whether statistically significant studies are required to satisfy the burden of production. The leading case advocating
statistically significant studies is Brock v. Merrell Dow Pharmaceuticals, Inc., 874 F.2d 307, 312 (5th Cir.
1989), amended, 884 F.2d 167 (5th Cir.), cert. denied, 494 U.S. 1046 (1990). Overturning a jury verdict
for the plaintiff in a Bendectin case, the court observed that no statistically significant study had been
published that found an increased relative risk for birth defects in children whose mothers had taken
Bendectin. The court concluded: “[W]e do not wish this case to stand as a bar to future Bendectin
cases in the event that new and statistically significant studies emerge which would give a jury a firmer
basis on which to determine the issue of causation.” Brock, 884 F.2d at 167.
A number of courts have followed the Brock decision or have indicated strong support for significance testing as a screening device. See Good v. Fluor Daniel Corp., 222 F. Supp. 2d 1236, 1243
(E.D. Wash. 2002) (“In the absence of a statistically significant difference upon which to opine, Dr.
Au’s opinion must be excluded under Daubert.”); Miller v. Pfizer, Inc., 196 F. Supp. 2d 1062, 1080
(D. Kan. 2002) (the expert must have statistically significant studies to serve as basis of opinion on
causation); Kelley v. Am. Heyer-Schulte Corp., 957 F. Supp. 873, 878 (W.D. Tex. 1997) (the lower
end of the confidence interval must be above 1.0—equivalent to requiring that a study be statistically
significant—before a study may be relied upon by an expert), appeal dismissed, 139 F.3d 899 (5th Cir.
1998); Renaud v. Martin Marietta Corp., 749 F. Supp. 1545, 1555 (D. Colo. 1990) (quoting Brock
approvingly), aff’d, 972 F.2d 304 (10th Cir. 1992).
By contrast, a number of courts are more cautious about or reject using significance testing as a
necessary condition, instead recognizing that assessing the likelihood of random error is important in
determining the probative value of a study. In Allen v. United States, 588 F. Supp. 247, 417 (D. Utah
1984), the court stated, “The cold statement that a given relationship is not ‘statistically significant’
cannot be read to mean there is no probability of a relationship.” The Third Circuit described confidence intervals (i.e., the range of values that would be found in similar studies due to chance, with a
specified level of confidence) and their use as an alternative to statistical significance in DeLuca v. Merrell
Dow Pharmaceuticals, Inc., 911 F.2d 941, 948–49 (3d Cir. 1990). See also Milward v. Acuity Specialty
Products Group, Inc., 639 F.3d 11, 24-25 (1st Cir. 2011) (recognizing the difficulty of obtaining
statistically significant results when the disease under investigation occurs rarely and concluding that
district court erred in imposing a statistical significance threshold); Turpin v. Merrell Dow Pharms.,
Inc., 959 F.2d 1349, 1357 (6th Cir. 1992) (“The defendant’s claim overstates the persuasive power of
these statistical studies. An analysis of this evidence demonstrates that it is possible that Bendectin causes
birth defects even though these studies do not detect a significant association.”); In re Viagra Prods.
Liab. Litig., 572 F. Supp. 2d 1071, 1090 (D. Minn. 2008) (holding that, for purposes of supporting
an opinion on general causation, a study does not have to find results with statistical significance);
United States v. Philip Morris USA, Inc., 449 F. Supp. 2d 1, 706 n.29 (D.D.C. 2006) (rejecting the
position of an expert who denied that the causal connection between smoking and lung cancer had
been established, in part, on the ground that any study that found an association that was not statistically significant must be excluded from consideration); Cook v. Rockwell Int’l Corp., 580 F. Supp.
any study whose p-value is not less than the level chosen for statistical significance
should be rejected as inadequate to disprove the null hypothesis. Others are critical of using strict significance testing, which rejects all studies with an observed
p-value above that specified level. Epidemiologists have become increasingly
sophisticated in addressing the issue of random error and examining the data
from a study to ascertain what information they may provide about the relationship between an agent and a disease, without the necessity of rejecting all studies
that are not statistically significant.86 Meta-analysis, a method for pooling the results of multiple studies, sometimes can also ameliorate concerns about random error.87
Calculation of a confidence interval permits a more refined assessment of
appropriate inferences about the association found in an epidemiologic study.88
2d 1071, 1103 (D. Colo. 2006) (“The statistical significance or insignificance of Dr. Clapp’s results
may affect the weight given to his testimony, but does not determine its admissibility under Rule
702.”); In re Ephedra Prods. Liab. Litig., 393 F. Supp. 2d 181, 186 (S.D.N.Y. 2005) (“[T]he absence
of epidemiologic studies establishing an increased risk from ephedra of sufficient statistical significance
to meet scientific standards of causality does not mean that the causality opinions of the PCC’s experts
must be excluded entirely.”).
Although the trial court had relied in part on the absence of statistically significant epidemiologic
studies, the Supreme Court in Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993), did
not explicitly address the matter. The Court did, however, refer to “the known or potential rate of
error” in identifying factors relevant to the scientific validity of an expert’s methodology. Id. at 594.
The Court did not address any specific rate of error, although two cases that it cited affirmed the
admissibility of voice spectrograph results that the courts reported were subject to a 2%–6% chance of
error owing to either false matches or false eliminations. One commentator has concluded, “Daubert
did not set a threshold level of statistical significance either for admissibility or for sufficiency of scientific evidence.” Developments in the Law, supra note 79, at 1535–36, 1540–46. The Supreme Court in
General Electric Co. v. Joiner, 522 U.S. 136, 145–47 (1997), adverted to the lack of statistical significance
in one study relied on by an expert as a ground for ruling that the district court had not abused its
discretion in excluding the expert’s testimony.
In Matrixx Initiatives, Inc. v. Siracusano, 131 S. Ct. 1309 (2011), the Supreme Court was confronted with a question somewhat different from the relationship between statistically significant study
results and causation. Matrixx was a securities fraud case in which the defendant argued that unless
adverse event reports from use of a drug are statistically significant, the information about them is
not material, as a matter of law (materiality is required as an element of a fraud claim). Defendant’s
claim was premised on the idea that only statistically significant results can be a basis for an inference
of causation. The Court, unanimously, rejected that claim, citing cases in which courts had permitted expert witnesses to testify to toxic causation in the absence of any statistically significant studies.
For a hypercritical assessment of statistical significance testing that nevertheless identifies much inappropriate overreliance on it, see Stephen T. Ziliak & Deirdre N. McCloskey, The Cult of Statistical Significance (2008).
86. See Sanders, supra note 13, at 342 (describing the improved handling and reporting of statistical analysis in studies of Bendectin after 1980).
87. See infra Section VI.
88. Kenneth Rothman, Professor of Public Health at Boston University and Adjunct Professor of Epidemiology at the Harvard School of Public Health, is one of the leaders in advocating the use of confidence intervals and rejecting strict significance testing. In DeLuca, 911 F.2d at 947, the Third Circuit discussed Rothman’s views on the appropriate level of alpha and the use of confidence intervals. In Turpin, 959 F.2d at 1353–54 n.1, the court discussed the relationship among confidence intervals, alpha, and power. See also Cook v. Rockwell Int’l Corp., 580 F. Supp. 2d 1071, 1100–01 (D. Colo. 2006) (discussing confidence intervals, alpha, and significance testing). The use of confidence intervals in evaluating sampling error more generally than in the epidemiologic context is discussed in David H. Kaye & David A. Freedman, Reference Guide on Statistics, Section IV.A, in this manual.
A confidence interval is a range of possible values calculated from the results of a study. If a 95% confidence interval is specified, the range encompasses the results we would expect 95% of the time if samples for new studies were repeatedly drawn from the same population. Thus, the width of the interval reflects random error. The narrower the confidence interval, the more statistically stable the results of the study. The advantage of a confidence interval is that it displays more information than significance testing. “Statistically significant” does not convey the magnitude of the association found in the study or indicate how statistically stable that association is. A confidence interval shows the boundaries of the relative risk based on selected levels of alpha or statistical significance. Just as the p-value does not provide the probability that the risk estimate found in a study is correct, the confidence interval does not provide the range within which the true risk must lie. Rather, the confidence interval reveals the likely range of risk estimates consistent with random error. An example of two confidence intervals that might be calculated for a given relative risk is displayed in Figure 4.
Figure 4. Confidence intervals. [The figure shows two intervals around an observed relative risk of 1.5: boundaries of 0.8 to 3.4 at p < .05 and 1.1 to 2.2 at p < .10.]
The confidence intervals shown in Figure 4 are for a study that found a relative risk of 1.5, with boundaries of 0.8 to 3.4 when the alpha is set to .05 (equivalently, a confidence level of .95), and with boundaries of 1.1 to 2.2 when alpha is set to .10 (equivalently, a confidence level of .90). The confidence interval for alpha equal to .10 is narrower because it encompasses only 90% of the expected test results. By contrast, the confidence interval for alpha equal to .05 includes the expected outcomes for 95% of the tests. To generalize this point, the lower the alpha chosen (and therefore the more stringent the exclusion of possible random error), the wider the confidence interval. At a given alpha, the width of the confidence interval is
determined by sample size. All other things being equal, the larger the sample size,
the narrower the confidence boundaries (indicating greater numerical stability).
For a given risk estimate, a narrower confidence interval reflects a decreased likelihood that the association found in the study would occur by chance if the true
association is 1.0.89
For the example in Figure 4, the boundaries of the confidence interval with
alpha set at .05 encompass a relative risk of 1.0, and the result would be said to be
not statistically significant at the .05 level. Alternatively, if the confidence boundaries are defined with an alpha equal to .10, then the confidence interval no longer
includes a relative risk of 1.0, and the result would be characterized as statistically
significant at the .10 level.
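The relationship described above among alpha, interval width, and statistical significance can be sketched numerically. The following Python fragment is a minimal illustration using the standard log-relative-risk normal approximation; the cohort counts are hypothetical and do not come from any study discussed in this guide.

```python
import math
from statistics import NormalDist

def rr_confidence_interval(exposed_cases, n_exposed,
                           unexposed_cases, n_unexposed, alpha=0.05):
    """Confidence interval for a relative risk using the log-RR
    normal approximation (a textbook sketch, not the only method)."""
    rr = (exposed_cases / n_exposed) / (unexposed_cases / n_unexposed)
    se = math.sqrt(1 / exposed_cases - 1 / n_exposed
                   + 1 / unexposed_cases - 1 / n_unexposed)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # e.g., 1.96 when alpha = .05
    lo = math.exp(math.log(rr) - z * se)
    hi = math.exp(math.log(rr) + z * se)
    return rr, lo, hi

# Hypothetical cohort: 57 cases among 2000 exposed, 40 among 2000 unexposed.
rr, lo95, hi95 = rr_confidence_interval(57, 2000, 40, 2000, alpha=0.05)
_, lo90, hi90 = rr_confidence_interval(57, 2000, 40, 2000, alpha=0.10)
# The alpha = .10 (90%) interval is narrower than the alpha = .05 (95%)
# interval, and with these counts it excludes 1.0 while the 95% interval
# includes 1.0 -- the same result is "significant" at .10 but not at .05.
```

With these illustrative counts the same data are statistically significant at the .10 level but not at the .05 level, mirroring the point that significance is a function of the alpha chosen.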
2. False negatives
As Figure 4 illustrates, false positives can be reduced by adopting more stringent
values for alpha. Using an alpha of .05 will result in fewer false positives than
using an alpha of .10, and an alpha of .01 or .001 would produce even fewer
false positives.90 The tradeoff for reducing false positives is an increase in false-negative errors (also called beta errors or Type II errors). This concept reflects the possibility that a study will be interpreted as “negative” (not disproving the null hypothesis), when in fact there is a true association of a specified magnitude.91 The beta for any study can be calculated only on the basis of a specific alternative hypothesis about a given positive relative risk and a specific level of alpha.92
89. Where multiple epidemiologic studies are available, a technique known as meta-analysis (see infra Section VI) may be used to combine the results of the studies to reduce the numerical instability of all the studies. See generally Diana B. Petitti, Meta-analysis, Decision Analysis, and Cost-Effectiveness Analysis: Methods for Quantitative Synthesis in Medicine (2d ed. 2000). Meta-analysis is better suited to combining results from randomized controlled experimental studies, but if carefully performed it may also be helpful for observational studies, such as those in the epidemiologic field. See Zachary B. Gerbarg & Ralph I. Horwitz, Resolving Conflicting Clinical Trials: Guidelines for Meta-Analysis, 41 J. Clin. Epidemiol. 503 (1988). In In re Bextra & Celebrex Marketing Sales Practices & Products Liability Litigation, 524 F. Supp. 2d 1166 (N.D. Cal. 2007), the court relied on several meta-analyses of Celebrex at a 200-mg dose to conclude that the plaintiffs’ experts who proposed to testify to toxicity at that dosage failed to meet the requirements of Daubert. The court criticized those experts for the wholesale rejection of meta-analyses of observational studies.
In In re Paoli Railroad Yard PCB Litigation, 916 F.2d 829, 856–57 (3d Cir. 1990), the court discussed the use and admissibility of meta-analysis as a scientific technique. Overturning the district court’s exclusion of a report using meta-analysis, the Third Circuit observed that meta-analysis is a regularly used scientific technique. The court recognized that the technique might be poorly performed, and it required the district court to reconsider the validity of the expert’s work in performing the meta-analysis. See also E.R. Squibb & Sons, Inc. v. Stuart Pharms., No. 90-1178, 1990 U.S. Dist. LEXIS 15788, at *41 (D.N.J. Oct. 16, 1990) (acknowledging the utility of meta-analysis but rejecting its use in that case because one of the two studies included was poorly performed); Tobin v. Astra Pharm. Prods., Inc., 993 F.2d 528, 538–39 (6th Cir. 1992) (identifying an error in the performance of a meta-analysis, in which the Food and Drug Administration pooled data from control groups in different studies in which some gave the controls a placebo and others gave the controls an alternative treatment).
90. It is not uncommon in genome-wide association studies to set the alpha at .00001 or even lower because of the large number of associations tested in such studies. Reducing alpha is designed to limit the number of false-positive findings.
3. Power
When a study fails to find a statistically significant association, an important question is whether the result tends to exonerate the agent or is essentially
inconclusive with regard to toxicity.93 The concept of power can be helpful in
evaluating whether a study’s outcome is exonerative or inconclusive.94
The power of a study is the probability of finding a statistically significant
association of a given magnitude (if it exists) in light of the sample sizes used in
the study. The power of a study depends on several factors: the sample size; the
level of alpha (or statistical significance) specified; the background incidence of
disease; and the specified relative risk that the researcher would like to detect.95
Power curves can be constructed that show the likelihood of finding any given
relative risk in light of these factors. Often, power curves are used in the design
of a study to determine what size the study populations should be.96
The power of a study is the complement of beta (1 – β). Thus, a study with
a likelihood of .25 of failing to detect a true relative risk of 2.097 or greater has a
power of .75. This means the study has a 75% chance of detecting a true relative
risk of 2.0. If the power of a negative study to find a relative risk of 2.0 or greater is low, it has substantially less probative value than a study with similar results but a higher power.98
91. See also DeLuca v. Merrell Dow Pharms., Inc., 911 F.2d 941, 947 (3d Cir. 1990).
92. See Green, supra note 47, at 684–89.
93. Even when a study or body of studies tends to exonerate an agent, that does not establish that the agent is absolutely safe. See Cooley v. Lincoln Elec. Co., 693 F. Supp. 2d 767 (N.D. Ohio 2010). Epidemiology is not able to provide such evidence.
94. See Fienberg et al., supra note 72, at 22–23. Thus, in Smith v. Wyeth-Ayerst Labs. Co., 278 F. Supp. 2d 684, 693 (W.D.N.C. 2003), and Cooley v. Lincoln Electric Co., 693 F. Supp. 2d 767, 773 (N.D. Ohio 2010), the courts recognized that the power of a study was critical to assessing whether the failure of the study to find a statistically significant association was exonerative of the agent or inconclusive. See also Procter & Gamble Pharms., Inc. v. Hoffmann-LaRoche Inc., No. 06 Civ. 0034(PAC), 2006 WL 2588002, at *32 n.16 (S.D.N.Y. Sept. 6, 2006) (discussing power curves and quoting the second edition of this reference guide); In re Phenylpropanolamine (PPA) Prods. Liab. Litig., 289 F. Supp. 2d 1230, 1243–44 (W.D. Wash. 2003) (explaining expert’s testimony that “statistical reassurance as to lack of an effect would require an upper bound of a reasonable confidence interval close to the null value”); Ruff v. Ensign-Bickford Indus., Inc., 168 F. Supp. 2d 1271, 1281 (D. Utah 2001) (explaining why a study should be treated as inconclusive rather than exonerative based on the small number of subjects in the study).
95. See Malcolm Gladwell, How Safe Are Your Breasts? New Republic, Oct. 24, 1994, at 22, 26.
96. For examples of power curves, see Kenneth J. Rothman, Modern Epidemiology 80 (1986); Pagano & Gauvreau, supra note 59, at 245.
97. We use a relative risk of 2.0 for illustrative purposes because of the legal significance courts have attributed to this magnitude of association. See infra Section VII.
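The quantities discussed in this section (power, beta, alpha, sample size, background incidence, and the relative risk to be detected) can be tied together in a rough calculation. The sketch below uses the log-relative-risk normal approximation with hypothetical group sizes and a hypothetical baseline risk; actual study planning uses more refined methods and power curves.

```python
import math
from statistics import NormalDist

def approx_power(rr_true, baseline_risk, n_exposed, n_unexposed, alpha=0.05):
    """Rough power of a cohort study to detect a true relative risk
    rr_true, given the background risk in the unexposed group.
    Simplified sketch: expected case counts and a log-RR normal
    approximation stand in for a full power calculation."""
    risk_exposed = rr_true * baseline_risk      # risk under the alternative
    a = risk_exposed * n_exposed                # expected exposed cases
    b = baseline_risk * n_unexposed             # expected unexposed cases
    se = math.sqrt(1 / a - 1 / n_exposed + 1 / b - 1 / n_unexposed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Probability the observed log-RR exceeds the significance threshold.
    return NormalDist().cdf(math.log(rr_true) / se - z_crit)
```

Because power is the complement of beta, `1 - approx_power(...)` estimates the false-negative rate for that alternative. The calculation shows the qualitative relationships in the text: enlarging the study groups, or relaxing alpha from .05 to .10, raises power.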
B. What Biases May Have Contributed to an Erroneous
Association?
The second major reason for an invalid outcome in epidemiologic studies is systematic error or bias. Bias may arise in the design or conduct of a study, data collection, or data analysis. The meaning of scientific bias differs from conventional
(and legal) usage, in which bias refers to a partisan point of view.99 When scientists
use the term bias, they refer to anything that results in a systematic (nonrandom)
error in a study result and thereby compromises its validity. Two important
categories of bias are selection bias (inappropriate methodology for selection of
study subjects) and information bias (a flaw in measuring exposure or disease in
the study groups).
Most epidemiologic studies have some degree of bias that may affect the
outcome. If major bias is present, it may invalidate the study results. Finding the
bias, however, can be difficult, if not impossible. In reviewing the validity of an
epidemiologic study, the epidemiologist must identify potential biases and analyze
the amount or kind of error that might have been induced by the bias. Often, the
direction of error can be determined; depending on the specific type of bias, it
may exaggerate the real association, dilute it, or even completely mask it.
1. Selection bias
Selection bias refers to the error in an observed association that results from the
method of selection of cases and controls (in a case-control study) or exposed
and unexposed individuals (in a cohort study).100 The selection of an appropriate
98. See also David H. Kaye & David A. Freedman, Reference Guide on Statistics, Section
IV.C.1, in this manual.
99. A Dictionary of Epidemiology 15 (John M. Last ed., 3d ed. 1995); Edmond A. Murphy,
The Logic of Medicine 239–62 (1976).
100. Selection bias is defined as “[e]rror due to systematic differences in characteristics between
those who are selected for study and those who are not.” A Dictionary of Epidemiology, supra note 99, at 153. In In re “Agent Orange” Product Liability Litigation, 597 F. Supp. 740, 783 (E.D.N.Y. 1985), aff’d,
818 F.2d 145 (2d Cir. 1987), the court expressed concern about selection bias. The exposed cohort
consisted of young, healthy men who served in Vietnam. Comparing the mortality rate of the exposed
cohort and that of a control group made up of civilians might have resulted in error that was a result
of selection bias. Failing to account for health status as an independent variable tends to understate
any association between exposure and disease in studies in which the exposed cohort is healthier. See
also In re Baycol Prods. Litig., 532 F. Supp. 2d 1029, 1043 (D. Minn. 2007) (upholding admissibility
of testimony by expert witness who criticized study based on selection bias).
control group has been described as the Achilles’ heel of a case-control study.101
Ideally, controls should be drawn from the same population that produced the
cases. Selecting control participants becomes problematic if the control participants
are selected for reasons that are related to their having the exposure being studied.
For example, a study of the effect of smoking on heart disease will suffer selection
bias if subjects of the study are volunteers and the decision to volunteer is affected
by both being a smoker and having a family history of heart disease. The association will be biased upward because of the additional disease among the exposed
smokers caused by genetics.
Hospital-based studies, which are relatively common among researchers
located in medical centers, illustrate the problem. Suppose an association is found
between coffee drinking and coronary heart disease in a study using hospital
patients as controls. The problem is that the hospitalized control group may include
individuals who had been advised against drinking coffee for medical reasons, such
as to prevent the aggravation of a peptic ulcer. In other words, the controls may
become eligible for the study because of their medical condition, which is in turn
related to their exposure status—their likelihood of avoiding coffee. If this is true,
the amount of coffee drinking in the control group would understate the extent
of coffee drinking expected in people who do not have the disease, and thus bias
upwardly (i.e., exaggerate) any odds ratio observed.102 Bias in hospital studies may
also understate the true odds ratio when the exposures at issue led to the cases’
hospitalizations and also contributed to the controls’ chances of hospitalization.
Just as cases and controls in case-control studies should be selected independently of their exposure status, so the exposed and unexposed participants
in cohort studies should be selected independently of their disease risk.103 For
example, if women with hysterectomies are overrepresented among exposed women in a cohort study of cervical cancer, this could understate the association between the exposure and the disease, because women who have had hysterectomies are no longer at risk of cervical cancer.
A further source of selection bias occurs when those selected to participate
decline to participate or drop out before the study is completed. Many studies have
shown that individuals who participate in studies differ significantly from those who
do not. If a significant portion of either study group declines to participate, the
researcher should investigate whether those who declined are different from those
who agreed. The researcher can compare relevant characteristics of those who
101. William B. Kannel & Thomas R. Dawber, Coffee and Coronary Disease, 289 New Eng. J.
Med. 100 (1973) (editorial).
102. Hershel Jick et al., Coffee and Myocardial Infarction, 289 New Eng. J. Med. 63 (1973).
103. When unexposed controls may differ from the exposed cohort because exposure is associated with other risk (or protective factors), investigators can attempt to measure and adjust for those
differences, as explained in Section IV.C.3, infra. See also Martha J. Radford & JoAnne M. Foody,
How Do Observational Studies Expand the Evidence Base for Therapy? 286 JAMA 1228 (2001) (discussing
the use of propensity analysis to adjust for potential confounding and selection biases that may occur
from nonrandomization).
participate with those who do not to show the extent to which the two groups are
comparable. Similarly, if a significant number of subjects drop out of a study before
completion, the remaining subjects may not be representative of the original study
populations. The researcher should examine whether that is the case.
The fact that a study may suffer from selection bias does not necessarily
invalidate its results. A number of factors may suggest that a bias, if present, had
only limited effect. If the association is particularly strong, for example, bias is less
likely to account for all of it. In addition, a consistent association across different
control groups suggests that possible biases applicable to a particular control group
are not invalidating. Similarly, a dose–response relationship (see Section V.C,
infra) found among multiple groups exposed to different doses of the agent would
provide additional evidence that biases applicable to the exposed group are not a
major problem.
2. Information bias
Information bias is a result of inaccurate information about either the disease or
the exposure status of the study participants or a result of confounding. In a case-control study, potential information bias is an important consideration because
the researcher depends on information from the past to determine exposure and
disease and their temporal relationship.104 In some situations, the researcher is
required to interview the subjects about past exposures, thus relying on the subjects’ memories. Research has shown that individuals with disease (cases) tend to
recall past exposures more readily than individuals with no disease (controls);105
this creates a potential for bias called recall bias.
For example, consider a case-control study conducted to examine the cause of
congenital malformations. The epidemiologist is interested in whether the malformations were caused by an infection during the mother’s pregnancy.106 A group
of mothers of malformed infants (cases) and a group of mothers of infants with no
104. Information bias can be a problem in cohort studies as well. When exposure is determined
retrospectively, there can be a variety of impediments to obtaining accurate information. Similarly,
when disease status is determined retrospectively, bias is a concern. The determination that asbestos is a
cause of mesothelioma was hampered by inaccurate death certificates that identified lung cancer rather
than mesothelioma, a rare form of cancer, as the cause of death. See I.J. Selikoff et al., Mortality Experience of Insulation Workers in the United States and Canada, 220 Ann. N.Y. Acad. Sci. 91, 110–11 (1979).
105. Steven S. Coughlin, Recall Bias in Epidemiological Studies, 43 J. Clinical Epidemiology 87
(1990).
106. See Brock v. Merrell Dow Pharms., Inc., 874 F.2d 307, 311–12 (5th Cir. 1989) (discussion of recall bias among women who bear children with birth defects). We note that the court was
mistaken in its assertion that a confidence interval could correct for recall bias, or for any bias for
that matter. Confidence intervals are a statistical device for analyzing error that may result from random sampling. Systematic errors (bias) in the design or data collection are not addressed by statistical
methods, such as confidence intervals or statistical significance. See Green, supra note 47, at 667–68;
Vincent M. Brannigan et al., Risk, Statistical Inference, and the Law of Evidence: The Use of Epidemiological
Data in Toxic Tort Cases, 12 Risk Analysis 343, 344–45 (1992).
malformation (controls) are interviewed regarding infections during pregnancy.
Mothers of children with malformations may recall an inconsequential fever or
runny nose during pregnancy that readily would be forgotten by a mother who
had a normal infant. Even if in reality the infection rate in mothers of malformed
children is no different from the rate in mothers of normal children, the result in
this study would be an apparently higher rate of infection in the mothers of the
children with the malformations solely on the basis of recall differences between
the two groups.107 The issue of recall bias can sometimes be evaluated by finding an alternative source of data to validate the subject’s response (e.g., blood
test results from prenatal visits or medical records that document symptoms of
infection).108 Alternatively, the mothers’ responses to questions about other exposures may shed light on the presence of a bias affecting the recall of the relevant
exposures. Thus, if mothers of cases do not recall greater exposure than controls’
mothers to pesticides, children with German measles, and so forth, then one can
have greater confidence in their recall of illnesses.
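A small numeric sketch shows how differential recall alone can manufacture an association. The recall rates below are hypothetical, not estimates from any actual study.

```python
# True exposure (maternal infection) prevalence is 30% in both groups, so
# the true odds ratio is 1.0. Suppose cases recall 90% of true exposures
# while controls recall only 70% (hypothetical differential recall).
true_prevalence = 0.30
recalled_cases = true_prevalence * 0.90      # 27% of cases report exposure
recalled_controls = true_prevalence * 0.70   # 21% of controls report exposure

observed_or = ((recalled_cases / (1 - recalled_cases))
               / (recalled_controls / (1 - recalled_controls)))
# Recall bias alone pushes the observed odds ratio to roughly 1.4,
# despite a true odds ratio of exactly 1.0.
```

As the text notes, validating responses against records (e.g., prenatal blood tests) is the way to detect such bias; no statistical adjustment of the interview data alone can reveal it.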
Bias may also result from reliance on interviews with surrogates who are individuals other than the study subjects. This is often necessary when, for example,
a subject (in a case-control study) has died of the disease under investigation or
may be too ill to be interviewed.
There are many sources of information bias that affect the measure of exposure, including its intensity and duration. Exposure to the agent can be measured
directly or indirectly.109 Sometimes researchers use a biological marker as a direct
measure of exposure to an agent—an alteration in tissue or body fluids that occurs
as a result of an exposure and that can be detected in the laboratory. Biological
markers, however, are only available for a small number of toxins and usually only
reveal whether a person was exposed.110 Biological markers rarely help determine
the intensity or duration of exposure.111
107. Thus, in Newman v. Motorola, Inc., 218 F. Supp. 2d 769, 778 (D. Md. 2002), the court
considered a study of the effect of cell phone use on brain cancer and concluded that there was good
reason to suspect that recall bias affected the results of the study, which found an association between
cell phone use and cancers on the side of the head where the cell phone was used but no association
between cell phone use and overall brain tumors.
108. Two researchers who used a case-control study to examine the association between congenital heart disease and the mother’s use of drugs during pregnancy corroborated interview data with
the mother’s medical records. See Sally Zierler & Kenneth J. Rothman, Congenital Heart Disease in
Relation to Maternal Use of Bendectin and Other Drugs in Early Pregnancy, 313 New Eng. J. Med. 347,
347–48 (1985).
109. See In re Paoli R.R. Yard PCB Litig., No. 86-2229, 1992 U.S. Dist LEXIS 18430, at
*9–*11 (E.D. Pa. Oct. 21, 1992) (discussing valid methods of determining exposure to chemicals).
110. See Gary E. Marchant, Genetic Susceptibility and Biomarkers in Toxic Injury Litigation, 41 Jurimetrics
J. 67, 68, 73–74, 95–97 (2000) (explaining concept of biomarkers, how they might be used to provide
evidence of exposure or dose, discussing cases in which biomarkers were invoked in an effort to prove
exposure, and concluding, “biomarkers are likely to be increasingly relied on to demonstrate exposure”).
111. There are different definitions of dose, but dose often refers to the intensity or magnitude of exposure multiplied by the time exposed. See Sparks v. Owens-Illinois, Inc., 38 Cal. Rptr. 2d 739, 742 (Ct. App. 1995). Other definitions of dose may be more appropriate in light of the biological mechanism of the disease.
For a discussion of the difficulties of determining dose from atomic fallout, see Allen v. United States, 588 F. Supp. 247, 425–26 (D. Utah 1984), rev’d on other grounds, 816 F.2d 1417 (10th Cir. 1987). The timing of exposure may also be critical, especially if the disease of interest is a birth defect. In Smith v. Ortho Pharmaceutical Corp., 770 F. Supp. 1561, 1577 (N.D. Ga. 1991), the court criticized a study for its inadequate measure of exposure to spermicides. The researchers had defined exposure as receipt of a prescription for spermicide within 600 days of delivery, but this definition of exposure is too broad because environmental agents are likely to cause birth defects only during a narrow band of time.
A different, but related, problem often arises in court. Determining the plaintiff’s exposure to the alleged toxic substance always involves a retrospective determination and may involve difficulties similar to those faced by an epidemiologist planning a study. Thus, in John’s Heating Service v. Lamb, 46 P.3d 1024 (Alaska 2002), plaintiffs were exposed to carbon monoxide because of defendants’ negligence with respect to a home furnace. The court observed: “[W]hile precise information concerning the exposure necessary to cause specific harm to humans and exact details pertaining to the plaintiff’s exposure are beneficial, such evidence is not always available, or necessary, to demonstrate that a substance is toxic to humans given substantial exposure and need not invariably provide the basis for an expert’s opinion on causation.” Id. at 1035 (quoting Westberry v. Gislaved Gummi AB, 178 F.3d 257, 264 (4th Cir. 1999)); see also Alder v. Bayer Corp., AGFA Div., 61 P.3d 1068, 1086–88 (Utah 2002) (summarizing other decisions on the precision with which plaintiffs must establish the dosage to which they were exposed). See generally Restatement (Third) of Torts: Liability for Physical and Emotional Harm § 28 cmt. c(2) & rptrs. note (2010).
In asbestos litigation, a number of courts have adopted a requirement that the plaintiff demonstrate (1) regular use by an employer of the defendant’s asbestos-containing product, (2) the plaintiff’s proximity to that product, and (3) exposure over an extended period of time. See, e.g., Lohrmann v. Pittsburgh Corning Corp., 782 F.2d 1156, 1162–64 (4th Cir. 1986); Gregg v. V-J Auto Parts, Inc., 943 A.2d 216, 226 (Pa. 2007).
112. Frequently, occupational epidemiologists employ study designs that consider all agents to which those who work in a particular occupation are exposed because they are trying to determine the hazards associated with that occupation. Isolating one of the agents for examination would be difficult if not impossible. These studies, then, present difficulties when employed in court in support of a claim by a plaintiff who was exposed to only one, or fewer than all, of the agents present at the worksite that was the subject of the study. See, e.g., Knight v. Kirby Inland Marine Inc., 482 F.3d 347, 352–53 (5th Cir. 2007) (concluding that case-control studies of cancer that entailed exposure to a variety of organic solvents at job sites did not support claims of plaintiffs who claimed exposure to benzene caused their cancers).
Monitoring devices also can be used to measure exposure directly but often are not available for exposures that occurred in the past. For past exposures, epidemiologists often use indirect measures of exposure, such as interviewing workers and reviewing employment records. Thus, all those employed to install asbestos insulation may be treated as having been exposed to asbestos during the period that they were employed. However, there may be wide variation in exposure within any job, and these measures may have limited applicability to a given individual.112 If the agent of interest is a drug, medical or hospital records can be used to determine past exposure. Thus, retrospective studies, which are often used for occupational or environmental investigations, entail measurements of exposure that are usually less accurate than those in prospective or followup studies, including ones in which a drug or medical intervention is the independent variable being measured.
The route (e.g., inhalation or absorption), duration, and intensity of exposure are important factors in assessing disease causation. Even with environmental
monitoring, the dose measured in the environment generally is not the same as
the dose that reaches internal target organs. If the researcher has calculated the
internal dose of exposure, the scientific basis for this calculation should be examined for soundness.113
In assessing whether the data may reflect inaccurate information, one must
assess whether the data were collected from objective and reliable sources. Medical records, government documents, employment records, death certificates, and
interviews are examples of data sources that are used by epidemiologists to measure both exposure and disease status.114 The accuracy of a particular source may
affect the validity of a research finding. If different data sources are used to collect
information about a study group, differences in the accuracy of those sources may
affect the validity of the findings. For example, using employment records to
gather information about exposure to narcotics probably would lead to inaccurate
results, because employees tend to keep such information private. If the researcher
uses an unreliable source of data, the study may not be useful.
The kinds of quality control procedures used may affect the accuracy of the
data. For data collected by interview, quality control procedures should probe
the reliability of the individual and whether the information is verified by other
sources. For data collected and analyzed in the laboratory, quality control procedures should probe the validity and reliability of the laboratory test.
Information bias may also result from inaccurate measurement of disease
status. The quality and sophistication of the diagnostic methods used to detect a
disease should be assessed.115 The proportion of subjects who were examined also
should be questioned. If, for example, many of the subjects refused to be tested,
the fact that the test used was of high quality would be of relatively little value.
113. See also Bernard D. Goldstein & Mary Sue Henifin, Reference Guide on Toxicology,
Section I.D, in this manual.
114. Even these sources may produce unanticipated error. Identifying the causal connection
between asbestos and mesothelioma, a rare form of cancer, was complicated and delayed because
doctors who were unfamiliar with mesothelioma erroneously identified other causes of death in death
certificates. See David E. Lilienfeld & Paul D. Gunderson, The “Missing Cases” of Pleural Malignant
Mesothelioma in Minnesota, 1979–81: Preliminary Report, 101 Pub. Health Rep. 395, 397–98 (1986).
115. The hazards of adversarial review of epidemiologic studies to determine bias are highlighted
by O’Neill v. Novartis Consumer Health, Inc., 55 Cal. Rptr. 3d 551, 558–60 (Ct. App. 2007). Defendant’s experts criticized a case-control study relied on by plaintiff on the ground that there was misclassification of exposure status among the cases. Plaintiff objected to this criticism because defendant’s
experts had only examined the cases for exposure misclassification, which would tend to exaggerate
any association by providing an inaccurately inflated measure of exposure in the cases. The experts
failed to examine whether there was misclassification in the controls, which, if it existed, would tend
to incorrectly diminish any association.
The scientific validity of the research findings is influenced by the reliability of the diagnosis of disease or health status under study.116 The disease must
be one that is recognized and defined to enable accurate diagnoses.117 Subjects’
health status may be essential to the hypothesis under investigation. For example,
a researcher interested in studying spontaneous abortion in the first trimester must
determine that study subjects are pregnant. Diagnostic criteria that are accepted by
the medical community should be used to make the diagnosis. If a diagnosis had
been made at a time when home pregnancy kits were known to have a high rate
of false-positive results (indicating pregnancy when the woman is not pregnant),
the study will overestimate the number of spontaneous abortions.
Misclassification bias is a consequence of information bias in which, because
of problems with the information available, individuals in the study may be misclassified with regard to exposure status or disease status. Bias due to exposure
misclassification can be differential or nondifferential. In nondifferential misclassification, the inaccuracies in determining exposure are independent of disease
status, or the inaccuracies in diagnoses are independent of exposure status—in
other words, the data are crude, with a great deal of random error. This is a common problem. Generally, nondifferential misclassification bias leads to a shift in the
odds ratio toward one, or, in other words, toward a finding of no effect. Thus,
if the errors are nondifferential, it is generally misguided to criticize an apparent
association between an exposure and disease on the ground that data were inaccurately classified. Instead, nondifferential misclassification generally underestimates
the true size of the association.
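The pull toward the null can be seen with a small numerical sketch. The counts and error rates below (80% sensitivity, 90% specificity in measuring exposure) are hypothetical and chosen only for illustration: applying the same measurement errors to cases and controls moves the odds ratio from its true value toward 1.0.

```python
# Hypothetical 2x2 case-control data showing that nondifferential exposure
# misclassification biases the odds ratio toward 1.0 (the null).

def odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp):
    """Odds ratio from a 2x2 case-control table."""
    return (case_exp / case_unexp) / (ctrl_exp / ctrl_unexp)

# True (error-free) counts: 100/100 exposed/unexposed cases, 50/150 controls.
true_or = odds_ratio(100, 100, 50, 150)  # = 3.0

def misclassify(exposed, unexposed, sensitivity, specificity):
    """Expected counts after imperfect exposure measurement."""
    seen_exposed = exposed * sensitivity + unexposed * (1 - specificity)
    seen_unexposed = exposed * (1 - sensitivity) + unexposed * specificity
    return seen_exposed, seen_unexposed

# Identical error rates for cases and controls, i.e., nondifferential.
case_exp, case_unexp = misclassify(100, 100, sensitivity=0.8, specificity=0.9)
ctrl_exp, ctrl_unexp = misclassify(50, 150, sensitivity=0.8, specificity=0.9)
observed_or = odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp)

print(round(true_or, 2), round(observed_or, 2))  # observed falls between 1.0 and the true 3.0
```

With the same error rates in both groups, a true odds ratio of 3.0 is observed as roughly 2.2, understating the association, just as the text describes.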
Differential misclassification is systematic error in determining exposure in
cases as compared with controls, or disease status in unexposed cohorts relative to
exposed cohorts. In a case-control study this would occur, for example, if, in the
116. In In re Swine Flu Immunization Products Liability Litigation, 508 F. Supp. 897, 903 (D. Colo.
1981), aff’d sub nom. Lima v. United States, 708 F.2d 502 (10th Cir. 1983), the court critically evaluated
a study relied on by an expert whose testimony was stricken. In that study, determination of whether
a patient had Guillain-Barré syndrome was made by medical clerks, not physicians who were familiar
with diagnostic criteria.
117. The difficulty of ill-defined diseases arose in some of the silicone gel breast implant cases.
Thus, in Grant v. Bristol-Myers Squibb, 97 F. Supp. 2d 986 (D. Ariz. 2000), in the face of a substantial
body of exonerative epidemiologic evidence, the female plaintiff alleged she suffered from an atypical
systemic joint disease. The court concluded:
As a whole, the Court finds that the evidence regarding systemic disease as proposed by Plaintiffs’ experts
is not scientifically valid and therefore will not assist the trier of fact. As for the atypical syndrome that
is suggested, where experts propose that breast implants cause a disease but cannot specify the criteria
for diagnosing the disease, it is incapable of epidemiologic testing. This renders the experts’ methods
insufficiently reliable to help the jury.
Id. at 992; see also Burton v. Wyeth-Ayerst Labs., 513 F. Supp. 2d 719, 722–24 (N.D. Tex. 2007)
(parties disputed whether a cardiology problem involved two separate diseases or only one; court concluded that all experts in the case reflected a view that there was but a single disease); In re Breast
Implant Cases, 942 F. Supp. 958, 961 (E.D.N.Y. & S.D.N.Y. 1996).
process of anguishing over the possible causes of the disease, parents of ill children
recalled more exposures to a particular agent than actually occurred, or if parents
of the controls, for whom the issue was less emotionally charged, recalled fewer.
This can also occur in a cohort study in which, for example, birth control users (the
exposed cohort) are monitored more closely for potential side effects, leading to a
higher rate of disease identification in that cohort than in the unexposed cohort.
Depending on how the misclassification occurs, a differential bias can produce an
error in either direction—the exaggeration or understatement of a true association.
3. Other conceptual problems
There are dozens of other potential biases that can occur in observational studies, which is an important reason why clinical studies (when ethical) are often
preferable. Sometimes studies are limited by flawed definitions or premises. For
example, if the researcher defines the disease of interest as all birth defects, rather
than a specific birth defect, there should be a scientific basis to hypothesize that
the effects of the agent being investigated could be so broad. If the effect is in
fact more limited, the result of this conceptualization error could be to dilute or
mask any real effect that the agent might have on a specific type of birth defect.118
Some biases go beyond errors in individual studies and affect the overall
body of available evidence in a way that skews what appears to be the universe
of evidence. Publication bias is the tendency for medical journals to prefer studies
that find an effect.119 If negative studies are never published, the published literature will be biased. Financial conflicts of interest by researchers and the source of
funding of studies have been shown to have an effect on the outcomes of such
studies.120
118. In Brock v. Merrell Dow Pharmaceuticals, Inc., 874 F.2d 307, 312 (5th Cir. 1989), the court
discussed a reanalysis of a study in which the effect was narrowed from all congenital malformations
to limb reduction defects. The magnitude of the association changed by 50% when the effect was
defined in this narrower fashion. See Rothman et al. supra note 61, at 144 (“Unwarranted assurances
of a lack of any effect can easily emerge from studies in which a wide range of etiologically unrelated
outcomes are grouped.”).
119. Investigators may contribute to this effect by neglecting to submit negative studies for
publication.
120. See Jerome P. Kassirer, On the Take: How Medicine’s Complicity with Big Business Can
Endanger Your Health 79–84 (2005); J.E. Bekelman et al., Scope and Impact of Financial Conflicts of
Interest in Biomedical Research: A Systematic Review, 289 JAMA 454 (2003). Richard Smith, the editor
in chief of the British Medical Journal, wrote on this subject:
The major determinant of whether reviews of passive smoking concluded it was harmful was whether
the authors had financial ties with tobacco manufacturers. In the disputed topic of whether third-generation contraceptive pills cause an increase in thromboembolic disease, studies funded by the
pharmaceutical industry find that they don’t and studies funded by public money find that they do.
Richard Smith, Making Progress with Competing Interests, 325 Brit. Med. J. 1375, 1376 (2002).
Examining a study for potential sources of bias is an important task that helps
determine the accuracy of a study’s conclusions. In addition, when a source of
bias is identified, it may be possible to determine whether the error tended to
exaggerate or understate the true association. Thus, bias may exist in a study that
nevertheless has probative value.
Even if one concludes that the findings of a study are statistically stable and
that biases have not created significant error, additional considerations remain. As
repeatedly noted, an association does not necessarily mean a causal relationship
exists. To make a judgment about causation, a knowledgeable expert121 must consider the possibility of confounding factors. The expert must also evaluate several
criteria to determine whether an inference of causation is appropriate.122 These
matters are discussed below.
C. Could a Confounding Factor Be Responsible for the Study
Result?123
The third major reason for error in epidemiologic studies is confounding. Confounding occurs when another causal factor (the confounder) confuses the relationship between the agent of interest and outcome of interest.124 (Confounding
and selection bias (Section IV.B.1, supra) can, depending on terminology, overlap.)
Thus, one instance of confounding is when a confounder is both a risk factor for
the disease and a factor associated with the exposure of interest. For example,
researchers may conduct a study that finds individuals with gray hair have a higher
rate of death than those with hair of another color. Instead of hair color having
an impact on death, the results might be explained by the confounding factor
of age. If old age is associated differentially with the gray-haired group (those
with gray hair tend to be older), old age may be responsible for the association
found between hair color and death.125 Researchers must separate the relationship
between gray hair and risk of death from that of old age and risk of death. When
researchers find an association between an agent and a disease, it is critical to
determine whether the association is causal or the result of confounding.126 Some
121. In a lawsuit, this would be done by an expert. In science, the effort is usually conducted
by a panel of experts.
122. For an excellent example of the authors of a study analyzing whether an inference of causation is appropriate in a case-control study examining whether bromocriptine (Parlodel)—a lactation
suppressant—causes seizures in postpartum women, see Kenneth J. Rothman et al., Bromocriptine and
Puerperal Seizures, 1 Epidemiology 232, 236–38 (1990).
123. See Grassis v. Johns-Manville Corp., 591 A.2d 671, 675 (N.J. Super. Ct. App. Div. 1991)
(discussing the possibility that confounders may lead to an erroneous inference of a causal relationship).
124. See Rothman et al., supra note 61, at 129.
125. This example is drawn from Kahn & Sempos, supra note 31, at 63.
126. Confounding can bias a study result by either exaggerating or diluting any true association. One example of a confounding factor that may result in a study’s outcome understating an
epidemiologists classify confounding as a form of bias. However, confounding is
a reality—that is, the observed association of a factor and a disease is actually the
result of an association with a third, confounding factor.127
Confounding can be illustrated by a hypothetical prospective cohort study of
the role of alcohol consumption and emphysema. The study is designed to investigate whether drinking alcohol is associated with emphysema. Participants are followed for a period of 20 years and the incidence of emphysema in the “exposed”
(participants who consume more than 15 drinks per week) and the unexposed is
compared. At the conclusion of the study, the relative risk of emphysema in the
drinking group is found to be 2.0 (an association that suggests a possible effect).
But does this association reflect a true causal relationship or might it be the product of confounding?
One possibility for a confounding factor is smoking, a known causal risk factor for emphysema. If those who drink alcohol are more likely to be smokers than
those who do not drink, then smoking may be responsible for some or all of the
higher level of emphysema among those who drink.
A serious problem in observational studies such as this hypothetical study is
that the individuals are not assigned randomly to the groups being compared.128
As discussed above, randomization maximizes the possibility that exposures other
than the one under study are evenly distributed between the exposed and the
control cohorts.129 In observational studies, by contrast, other forces, including
self-selection, determine who is exposed to other (possibly causal) factors. The
lack of randomization leads to the potential problem of confounding. Thus, for
example, the exposed cohort might consist of those who are exposed at work to
an agent suspected of being an industrial toxin. The members of this cohort may,
however, differ from unexposed controls by residence, socioeconomic or health
status, age, or other extraneous factors.130 These other factors may be causing (or
association is vaccination. Thus, if a group exposed to an agent has a higher rate of vaccination for
the disease under study than the unexposed group, the vaccination may reduce the rate of disease
in the exposed group, thereby producing an association that is less than the true association without
the confounding of vaccination.
127. Schwab v. Philip Morris USA, Inc., 449 F. Supp. 2d 992, 1199–1200 (E.D.N.Y. 2006),
rev’d on other grounds, 522 F.3d 215 (2d Cir. 2008), describes confounding that led to premature conclusions that low-tar cigarettes were safer than regular cigarettes. Smokers who chose to switch to low-tar
cigarettes were different from other smokers in that they were more health conscious in other aspects
of their lifestyles. Failure to account for that confounding—and measuring a healthy lifestyle is difficult
even if it is identified as a potential confounder—biased the results of those studies.
128. Randomization attempts to ensure that the presence of a characteristic, such as coffee
drinking, is governed by chance, as opposed to being determined by the presence of an underlying
medical condition.
129. See Rothman et al., supra note 61, at 129; see also supra Section II.A.
130. See, e.g., In re “Agent Orange” Prod. Liab. Litig., 597 F. Supp. 740, 783 (E.D.N.Y. 1984)
(discussing the problem of confounding that might result in a study of the effect of exposure to Agent
Orange on Vietnam servicemen), aff’d, 818 F.2d 145 (2d Cir. 1987).
protecting against) the disease, but because of potential confounding, an apparent (yet false) association of the disease with exposure to the agent may appear.
Confounders, like smoking in the alcohol drinking study, do not reflect an error
made by the investigators; rather, they reflect the inherently “uncontrolled” nature
of exposure designations in observational studies. When they can be identified,
confounders should be taken into account. Unanticipated confounding factors
that are suspected after data collection can sometimes be controlled during data
analysis, if data have been gathered about them.
To evaluate whether smoking is a confounding factor, the researcher would
stratify each of the exposed and control groups into smoking and nonsmoking
subgroups to examine whether subjects’ smoking status affects the study results.
If the relationship between alcohol drinking and emphysema in the smoking subgroups is the same as that in the all-subjects group, smoking is not a confounding
factor. If the subjects’ smoking status affects the relationship between drinking
and emphysema, then smoking is a confounder, for which adjustment is required.
If the association between drinking and emphysema completely disappears when
the subjects’ smoking status is considered, then smoking is a confounder that fully
accounts for the association with drinking observed. Table 4 reveals our hypothetical study’s results, with smoking being a confounding factor, which, when
accounted for, eliminates the association. Thus, in the full cohort, drinkers have
twice the risk of emphysema compared with nondrinkers. When the relationship between drinking and emphysema is examined separately in smokers and in
nonsmokers, the risk of emphysema in drinkers compared with nondrinkers is not
elevated in smokers or in nonsmokers. This is because smokers are disproportionately drinkers and have a higher rate of emphysema than nonsmokers. Thus, the
relationship between drinking and emphysema in the full cohort is distorted by
failing to take into account the relationship between being a drinker and a smoker.
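The arithmetic behind Table 4 can be checked directly. In this sketch the drinker case count is taken as 48 + 3 = 51, so that the two smoking strata sum to the full cohort and match the printed incidence of 0.069.

```python
# Recomputing the relative risks in Table 4: a crude (full-cohort) estimate
# of 2.0 that vanishes within each smoking stratum.

def relative_risk(exp_cases, exp_total, unexp_cases, unexp_total):
    """Incidence among the exposed divided by incidence among the unexposed."""
    return (exp_cases / exp_total) / (unexp_cases / unexp_total)

crude_rr = relative_risk(51, 739, 16, 471)      # full cohort: drinkers vs. nondrinkers
rr_smokers = relative_risk(48, 592, 9, 111)     # within smokers only
rr_nonsmokers = relative_risk(3, 147, 7, 360)   # within nonsmokers only

print(round(crude_rr, 1), round(rr_smokers, 1), round(rr_nonsmokers, 1))  # → 2.0 1.0 1.0
```

The crude comparison doubles the risk, yet drinkers fare no worse than nondrinkers once smokers and nonsmokers are examined separately: the signature of confounding by smoking.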
Even after accounting for the effect of smoking, there is always a risk that
an undiscovered or unrecognized confounding factor may contribute to a study’s
findings, by either magnifying or reducing the observed association.131 It is,
however, necessary to keep that risk in perspective. Often the mere possibility of
uncontrolled confounding is used to call into question the results of a study. This
was certainly the strategy of some seeking, or unwittingly helping, to undermine
the implications of the studies persuasively linking cigarette smoking to lung
cancer. The critical question is whether it is plausible that the findings of a given
study could indeed be due to unrecognized confounders.
In designing a study, researchers sometimes make assumptions that cannot be
validated or evaluated empirically. Thus, researchers may assume that a missing
potential confounder is not needed for the analysis or that a variable used was
adequately classified. Researchers employ a sensitivity analysis to assess the effect
of those assumptions should they be incorrect. Conducting a sensitivity analysis
131. Rothman et al., supra note 61, at 129; see also supra Section II.A.
Table 4. Hypothetical Emphysema Study Data (a)

                        Total Cohort                  Smokers                     Nonsmokers
Drinking Status    Total  Cases  Incidence  RR     Total  Cases  Incidence  RR     Total  Cases  Incidence  RR
Nondrinkers          471     16      0.034  1.0(b)   111      9      0.081  1.0(b)   360      7      0.019  1.0(b)
Drinkers             739     51      0.069  2.0      592     48      0.081  1.0      147      3      0.020  1.0

(a) The incidence of disease is not normally presented in an epidemiologic study, but we include it here to aid in comprehension of the ideas discussed in the text.
(b) RR = relative risk. The relative risk for each of the cohorts is determined based on reference to the risk among nondrinkers; that is, the incidence of disease among drinkers is compared with nondrinkers for each of the three cohorts separately.
entails repeating the analysis using different assumptions (e.g., alternative corrections for missing data or for classifying data) to see if the results are sensitive to the
varying assumptions. Such analyses can show that the assumptions are not likely to
affect the findings or that alternative explanations cannot be ruled out.132
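As an illustration, consider a hypothetical study in which 40 diseased subjects have unknown exposure status. A simple sensitivity analysis recomputes the relative risk under several assumptions about how those subjects were exposed; all numbers here are invented for illustration.

```python
# Sensitivity analysis sketch: vary the assumption about 40 cases with
# missing exposure data and observe how the relative risk responds.

def relative_risk(exp_cases, exp_total, unexp_cases, unexp_total):
    """Incidence among the exposed divided by incidence among the unexposed."""
    return (exp_cases / exp_total) / (unexp_cases / unexp_total)

# Observed (hypothetical) data: 30 cases among 200 exposed subjects and
# 15 among 200 unexposed; a further 40 cases have unknown exposure status.
missing_cases = 40
results = {}
for assumed_exposed_fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    extra_exposed = missing_cases * assumed_exposed_fraction
    extra_unexposed = missing_cases - extra_exposed
    results[assumed_exposed_fraction] = relative_risk(
        30 + extra_exposed, 200 + extra_exposed,
        15 + extra_unexposed, 200 + extra_unexposed)

for fraction, rr in results.items():
    print(f"assumed exposed fraction {fraction:.2f}: RR = {rr:.2f}")
```

In this sketch the relative risk ranges from about 0.65 to 3.9 depending on the assumption, so the finding is sensitive to how the missing data are handled; had the range stayed narrow, the assumption could safely be set aside.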
1. What techniques can be used to prevent or limit confounding?
Choices in the design of a research project (e.g., methods for selecting the subjects) can prevent or limit confounding. In designing a study, the researcher must
determine other risk factors for the disease under study. When a factor or factors,
such as age, sex, or even smoking status, are risk factors and potential confounders
in a study, investigators can limit the differential distribution of these factors in the
study groups by selecting controls to “match” cases (or the exposed group) in terms
of these variables. If the two groups are matched, for example, by age, then any
association observed in the study cannot be due to age, the matched variable.133
Restricting the persons who are permitted as subjects in a study is another
method to control for confounders. If age or sex is suspected as a confounder,
then the subjects enrolled in a study can be limited to those of one sex and those
who are within a specified age range. When there is no variance among subjects
in a study with regard to a potential confounder, confounding as a result of that
variable is eliminated.
2. What techniques can be used to identify confounding factors?
Once the study data are ready to be analyzed, the researcher must assess a range of
factors that could influence risk. In the hypothetical study, the researcher would
evaluate whether smoking is a confounding factor by comparing the incidence of
emphysema in smoking alcohol drinkers with the incidence in nonsmoking alcohol
drinkers. If the incidence is substantially the same, smoking is not a confounding
factor (e.g., smoking does not distort the relationship between alcohol drinking and
the development of emphysema). If the incidence is substantially different, but still
exists in the nonsmoking group, then smoking is a confounder, but does not wholly
account for the association with alcohol drinking. If the association disappears, then
smoking is a confounder that fully accounts for the association observed.
132. Kenneth Rothman & Sander Greenland, Modern Epidemiology (2d ed. 1998).
133. Selecting a control population based on matched variables necessarily affects the representativeness of the selected controls and may affect how generalizable the study results are to the population
at large. However, for a study to have merit, it must first be internally valid; that is, it must not be
subject to unreasonable sources of bias or confounding. Only after a study has been shown to meet this
standard does its universal applicability or generalizability to the population at large become an issue.
When a study population is not representative of the general or target population, existing scientific
knowledge may permit reasonable inferences about the study’s broader applicability, or additional
confirmatory studies of other populations may be necessary.
3. What techniques can be used to control for confounding factors?
A good study design will consider potential confounders and obtain data about
them if possible. If researchers have good data on potential confounders, they
can control for those confounders in the data analysis. There are several analytic
approaches to account for the distorting effects of a confounder, including stratification or multivariate analysis. Stratification permits an investigator to evaluate the
effect of a suspected confounder by subdividing the study groups based on a confounding factor. Thus, in Table 4, drinkers have been stratified based on whether
they smoke (the suspected confounder). To take another example that entails
a continuous rather than dichotomous potential confounder, let us say we are
interested in the relationship between smoking and lung cancer but suspect that
air pollution or urbanization may confound the relationship. Thus, an observed
relationship between smoking and lung cancer could theoretically be due in part
to pollution, if smoking were more common in polluted areas. We could address
this issue by stratifying our data by degree of urbanization and looking at the relationship between smoking and lung cancer in each urbanization stratum. Figure 5
shows actual age-adjusted lung cancer mortality rates per 100,000 person-years by
urban or rural classification and smoking category.134
Figure 5: Age-adjusted lung cancer mortality rates per 100,000 person-years by
urban or rural classification and smoking category.
Source: Adapted from E. Cuyler Hammond & Daniel Horn, Smoking and Death Rates—Report on Forty-Four Months of Follow-Up of 187,783 Men: II, Death Rates by Cause, 166 JAMA 1294 (1958).
134. This example and Figure 5 are from Leon Gordis, Epidemiology 254 (4th ed. 2009).
For each degree of urbanization, lung cancer mortality rates in smokers are
shown by the dark gray bars, and nonsmoker mortality rates are indicated by light
gray bars. From these data we see that in every level (or stratum) of urbanization,
lung cancer mortality is higher in smokers than in nonsmokers. Therefore, the
observed association of smoking and lung cancer cannot be attributed to level of
urbanization. By examining each stratum separately, we, in effect, hold urbanization constant, and still find much higher lung cancer mortality in smokers than
in nonsmokers.
Multivariate analysis controls for the confounding factor through mathematical modeling. Models are developed to describe the simultaneous effect of exposure and confounding factors on the increase in risk.135
Both of these methods allow for adjustment of the effect of confounders. They
both modify an observed association to take into account the effect of risk factors
that are not the subject of the study and that may distort the association between the
exposure being studied and the disease outcomes. If the association between exposure and disease remains after the researcher completes the assessment and adjustment for confounding factors, the researcher must then assess whether an inference
of causation is justified. This entails consideration of the Hill factors explained in
Section V, infra.
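One standard way to carry out such an adjustment, not named in the text but widely used, is the Mantel-Haenszel pooled estimate, which combines stratum-specific results into a single confounder-adjusted relative risk. The sketch below applies it to the two smoking strata of Table 4, taking the drinker cases as 48 smokers and 3 nonsmokers.

```python
# A sketch of the Mantel-Haenszel pooled relative risk, one common way to
# combine stratum-specific cohort results into a single adjusted estimate.

def mantel_haenszel_rr(strata):
    """strata: (exposed cases, exposed total, unexposed cases, unexposed total)."""
    numerator = denominator = 0.0
    for a, n1, b, n0 in strata:
        n = n1 + n0                  # total subjects in this stratum
        numerator += a * n0 / n      # exposed cases, weighted by unexposed share
        denominator += b * n1 / n    # unexposed cases, weighted by exposed share
    return numerator / denominator

# The two smoking strata from Table 4 (drinkers vs. nondrinkers).
strata = [
    (48, 592, 9, 111),   # smokers
    (3, 147, 7, 360),    # nonsmokers
]
adjusted_rr = mantel_haenszel_rr(strata)
print(round(adjusted_rr, 2))  # → 1.01
```

The crude relative risk of 2.0 thus collapses to about 1.0 after adjustment for smoking, matching the stratum-by-stratum reading of Table 4.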
V. General Causation: Is an Exposure a
Cause of the Disease?
Once an association has been found between exposure to an agent and development of a disease, researchers consider whether the association reflects a true
cause–effect relationship. When epidemiologists evaluate whether a cause–effect
relationship exists between an agent and disease, they are using the term causation
in a way similar to, but not identical to, the way that the familiar “but for,” or
sine qua non, test is used in law for cause in fact. “Conduct is a factual cause of
135. For a more complete discussion of multivariate analysis, see Daniel L. Rubinfeld, Reference
Guide on Multiple Regression, in this manual.
[harm] when the harm would not have occurred absent the conduct.”136 This is
equivalent to describing the conduct as a necessary link in a chain of events that
results in the particular event.137 Epidemiologists use causation to mean that an
increase in the incidence of disease among the exposed subjects would not have
occurred had they not been exposed to the agent.138 Thus, exposure is a necessary
condition for the increase in the incidence of disease among those exposed.139
The relationship between the epidemiologic concept of cause and the legal question of whether exposure to an agent caused an individual’s disease is addressed
in Section VII.
As mentioned in Section I, epidemiology cannot prove causation; rather, causation is a judgment for epidemiologists and others interpreting the epidemiologic
data.140 Moreover, scientific determinations of causation are inherently tentative.
The scientific enterprise must always remain open to reassessing the validity of
past judgments as new evidence develops.
In assessing causation, researchers first look for alternative explanations for the
association, such as bias or confounding factors, which are discussed in Section
IV, supra. Once this process is completed, researchers consider how guidelines
for inferring causation from an association apply to the available evidence. We
emphasize that these guidelines are employed only after a study finds an association
136. Restatement (Third) of Torts: Liability for Physical and Emotional Harm § 26 (2010);
see also Dan B. Dobbs, The Law of Torts § 168, at 409–11 (2000). When multiple causes are each
operating and capable of causing an event, the but-for, or necessary-condition, concept for causation
is problematic. This is the familiar “two-fires” scenario in which two independent fires simultaneously
burn down a house and is sometimes referred to as overdetermined outcomes. Neither fire is a but-for,
or necessary condition, for the destruction of the house, because either fire would have destroyed the
house. See Restatement (Third) of Torts: Liability for Physical and Emotional Harm § 28 (2010). This
two-fires situation is analogous to an individual being exposed to two agents, each of which is capable
of causing the disease contracted by the individual. See Basko v. Sterling Drug, Inc., 416 F.2d 417
(2d Cir. 1969). A difference between the disease scenario and the fire scenario is that, in the former,
one will have no more than a probabilistic assessment of whether each of the exposures would have
caused the disease in the individual.
137. See supra note 7; see also Restatement (Third) of Torts: Liability for Physical and Emotional
Harm § 26 cmt. c (2010) (employing a “causal set” model to explain multiple elements, each of which
is required for an outcome).
138. “The imputed causal association is at the group level, and does not indicate the cause of
disease in individual subjects.” Bruce G. Charlton, Attribution of Causation in Epidemiology: Chain or
Mosaic? 49 J. Clin. Epidemiology 105, 105 (1999).
139. See Rothman et al., supra note 61, at 8 (“We can define a cause of a specific disease event as
an antecedent event, condition, or characteristic that was necessary for the occurrence of the disease at
the moment it occurred, given that other conditions are fixed.”); Allen v. United States, 588 F. Supp.
247, 405 (D. Utah 1984) (quoting a physician on the meaning of the statement that radiation causes
cancer), rev’d on other grounds, 816 F.2d 1417 (10th Cir. 1987).
140. Restatement (Third) of Torts: Liability for Physical and Emotional Harm § 28 cmt. c (2010)
(“[A]n evaluation of data and scientific evidence to determine whether an inference of causation is
appropriate requires judgment and interpretation.”).
to determine whether that association reflects a true causal relationship.141 These
guidelines consist of several key inquiries that assist researchers in making a judgment about causation.142 Generally, researchers are conservative when it comes to
assessing causal relationships, often calling for stronger evidence and more research
before a conclusion of causation is drawn.143
The factors that guide epidemiologists in making judgments about causation
(and there is no threshold number that must exist) are144
141. In a number of cases, experts attempted to use these guidelines to support the existence of
causation in the absence of any epidemiologic studies finding an association. See, e.g., Rains v. PPG
Indus., Inc., 361 F. Supp. 2d 829, 836–37 (S.D. Ill. 2004) (explaining Hill criteria and proceeding to
apply them even though there was no epidemiologic study that found an association); Soldo v. Sandoz
Pharms. Corp., 244 F. Supp. 2d 434, 460–61 (W.D. Pa. 2003). There may be some logic to that effort,
but it does not reflect accepted epidemiologic methodology. See In re Fosamax Prods. Liab. Litig.,
645 F. Supp. 2d 164, 187–88 (S.D.N.Y. 2009); Dunn v. Sandoz Pharms. Corp., 275 F. Supp. 2d 672,
678–79 (M.D.N.C. 2003) (“The greater weight of authority supports Sandoz’ assertion that [use of]
the Bradford Hill criteria is a method for determining whether the results of an epidemiologic study
can be said to demonstrate causation and not a method for testing an unproven hypothesis.”); Soldo,
244 F. Supp. 2d at 514 (the Hill criteria “were developed as a mean[s] of interpreting an established
association based on a body of epidemiologic research for the purpose of trying to judge whether the
observed association reflects a causal relation between an exposure and disease.” (quoting report of
court-appointed expert)).
142. See Mervyn Susser, Causal Thinking in the Health Sciences: Concepts and Strategies in
Epidemiology (1973); Gannon v. United States, 571 F. Supp. 2d 615, 624 (E.D. Pa. 2007) (quoting
expert who testified that the Hill criteria are “‘well-recognized’ and widely used in the science community to assess general causation”); Chapin v. A & L Parts, Inc., 732 N.W.2d 578, 584 (Mich. Ct.
App. 2007) (expert testified that Hill criteria are the most well-utilized method for determining if an
association is causal).
143. Berry v. CSX Transp., Inc., 709 So. 2d 552, 568 n.12 (Fla. Dist. Ct. App. 1998) (“Almost
all genres of research articles in the medical and behavioral sciences conclude their discussion with
qualifying statements such as ‘there is still much to be learned.’ This is not, as might be assumed,
an expression of ignorance, but rather an expression that all scientific fields are open-ended and can
progress from their present state. . . .”); Hall v. Baxter Healthcare Corp., 947 F. Supp. 1387 app.
B. at 1446–51 (D. Or. 1996) (report of Merwyn R. Greenlick, court-appointed epidemiologist). In
Cadarian v. Merrell Dow Pharmaceuticals, Inc., 745 F. Supp. 409 (E.D. Mich. 1989), the court refused
to permit an expert to rely on a study that the authors had concluded should not be used to support an inference of causation in the absence of independent confirmatory studies. The court did
not address the question whether the degree of certainty used by epidemiologists before making a
conclusion of cause was consistent with the legal standard. See DeLuca v. Merrell Dow Pharms.,
Inc., 911 F.2d 941, 957 (3d Cir. 1990) (standard of proof for scientific community is not necessarily
appropriate standard for expert opinion in civil litigation); Wells v. Ortho Pharm. Corp., 788 F.2d
741, 745 (11th Cir. 1986).
144. See Cook v. Rockwell Int’l Corp., 580 F. Supp. 2d 1071, 1098 (D. Colo. 2006) (“Defendants cite no authority, scientific or legal, that compliance with all, or even one, of these factors
is required. . . . The scientific consensus is, in fact, to the contrary. It identifies Defendants’ list of
factors as some of the nine factors or lenses that guide epidemiologists in making judgments about
causation. . . . These factors are not tests for determining the reliability of any study or the causal
inferences drawn from it.”).
1. Temporal relationship,
2. Strength of the association,
3. Dose–response relationship,
4. Replication of the findings,
5. Biological plausibility (coherence with existing knowledge),
6. Consideration of alternative explanations,
7. Cessation of exposure,
8. Specificity of the association, and
9. Consistency with other knowledge.
There is no formula or algorithm that can be used to assess whether a causal
inference is appropriate based on these guidelines.145 One or more factors may
be absent even when a true causal relationship exists.146 Similarly, the existence
of some factors does not ensure that a causal relationship exists. Drawing causal
inferences after finding an association and considering these factors requires judgment and searching analysis, based on biology, of why a factor or factors may be
absent despite a causal relationship, and vice versa. Although the drawing of causal
inferences is informed by scientific expertise, it is not a determination that is made
by using an objective or algorithmic methodology.
These guidelines reflect criteria proposed by the U.S. Surgeon General
in 1964147 in assessing the relationship between smoking and lung cancer and
expanded upon by Sir Austin Bradford Hill in 1965148 and are often referred to
as the Hill criteria or Hill factors.
145. See Douglas L. Weed, Epidemiologic Evidence and Causal Inference, 14 Hematology/Oncology
Clinics N. Am. 797 (2000).
146. See Cook v. Rockwell Int’l Corp., 580 F. Supp. 2d 1071, 1098 (D. Colo. 2006) (rejecting
argument that plaintiff failed to provide sufficient evidence of causation based on failing to meet four
of the Hill factors).
147. Public Health Serv., U.S. Dep’t of Health, Educ., & Welfare, Smoking and Health: Report
of the Advisory Committee to the Surgeon General (1964); see also Centers for Disease Control and
Prevention, U.S. Dep’t of Health & Human Servs., The Health Consequences of Smoking: A Report
of the Surgeon General (2004).
148. See Austin Bradford Hill, The Environment and Disease: Association or Causation? 58 Proc.
Royal Soc’y Med. 295 (1965) (Hill acknowledged that his factors could only serve to assist in the inferential process: “None of my nine viewpoints can bring indisputable evidence for or against the cause-and-effect hypothesis and none can be required as a sine qua non.”). For discussion of these criteria and their respective strengths in informing a causal inference, see Gordis, supra note 32, at 236–39; David E. Lilienfeld & Paul D. Stolley, Foundations of Epidemiology 263–66 (3d ed. 1994); Weed, supra note 145.
A. Is There a Temporal Relationship?
A temporal, or chronological, relationship must exist for causation to exist. If an
exposure causes disease, the exposure must occur before the disease develops.149 If
the exposure occurs after the disease develops, it cannot have caused the disease.
Although temporal relationship is often listed as one of many factors in assessing
whether an inference of causation is justified, this aspect of a temporal relationship is a necessary factor: Without exposure before the disease, causation cannot
exist.150
With regard to specific causation, a subject dealt with in detail in Section VII,
infra, there may be circumstances in which a temporal relationship supports the
existence of a causal relationship. If the latency period between exposure and
outcome is known,151 then exposure consistent with that information may lend
credence to a causal relationship. This is particularly true when the latency period
is short and competing causes are known and can be ruled out. Thus, if an individual suffers an acute respiratory response shortly after exposure to a suspected
agent and other causes of that respiratory problem are known and can be ruled
out, the temporal relationship involved supports the conclusion that a causal relationship exists.152 Similarly, exposure outside a known latency period constitutes
evidence, perhaps conclusive evidence, against the existence of causation.153 On
the other hand, when latency periods are lengthy, variable, or not known and a
149. See Carroll v. Litton Sys., Inc., No. B-C-88-253, 1990 U.S. Dist. LEXIS 16833, at *29
(W.D.N.C. 1990) (“[I]t is essential for . . . [the plaintiffs’ medical experts opining on causation] to
know that exposure preceded plaintiffs’ alleged symptoms in order for the exposure to be considered
as a possible cause of those symptoms. . . .”).
150. Exposure during the disease initiation process may cause the disease to be more severe than
it otherwise would have been without the additional dose.
151. When the latency period is known—or is known to be limited to a specific range of time—
as is the case with the adverse effects of some vaccines, the time frame from exposure to manifestation
of disease can be critical to determining causation.
152. For courts that have relied on temporal relationships of the sort described, see Bonner v.
ISP Technologies, Inc., 259 F.3d 924, 930–31 (8th Cir. 2001) (giving more credence to the expert’s
opinion on causation for acute response based on temporal relationship than for chronic disease that
plaintiff also developed); Heller v. Shaw Industries, Inc. 167 F.3d 146 (3d Cir. 1999); Westberry v.
Gislaved Gummi AB, 178 F.3d 257 (4th Cir. 1999); Zuchowicz v. United States, 140 F.3d 381 (2d
Cir. 1998); Creanga v. Jardal, 886 A.2d 633, 641 (N.J. 2005); Alder v. Bayer Corp., AGFA Div., 61
P.3d 1068, 1090 (Utah 2002) (“If a bicyclist falls and breaks his arm, causation is assumed without
argument because of the temporal relationship between the accident and the injury [and, the court
might have added, the absence of any plausible competing causes that might instead be responsible
for the broken arm].”).
153. See In re Phenylpropanolamine (PPA) Prods. Liab. Litig., 289 F. Supp. 2d 1230, 1238 (W.D.
Wash. 2003) (determining expert testimony on causation for plaintiffs whose exposure was beyond
known latency period was inadmissible).
substantial proportion of the disease is due to unknown causes, temporal relationship provides little beyond satisfying the requirement that cause precede effect.154
B. How Strong Is the Association Between the Exposure and
Disease?155
The relative risk is one of the cornerstones for causal inferences.156 Relative risk
measures the strength of the association. The higher the relative risk, the greater
the likelihood that the relationship is causal.157 For cigarette smoking, for example,
the estimated relative risk for lung cancer is very high, about 10.158 That is, the
risk of lung cancer in smokers is approximately 10 times the risk in nonsmokers.
A relative risk of 10, as seen with smoking and lung cancer, is so high that
it is extremely difficult to imagine any bias or confounding factor that might
account for it. The higher the relative risk, the stronger the association and the
lower the chance that the effect is spurious. Although lower relative risks can
reflect causality, the epidemiologist will scrutinize such associations more closely
because there is a greater chance that they are the result of uncontrolled confounding or biases.
154. These distinctions provide a framework for distinguishing between cases that are largely
dismissive of temporal relationships as supporting causation and others that find it of significant persuasiveness. Compare cases cited in note 152, supra, with Moore v. Ashland Chem. Inc., 151 F.3d 269,
278 (5th Cir. 1998) (giving little weight to temporal relationship in a case in which there were several
plausible competing causes that may have been responsible for the plaintiff’s disease), and Glastetter v.
Novartis Pharms. Corp., 252 F.3d 986, 990 (8th Cir. 2001) (giving little weight to temporal relationship in case studies involving drug and stroke).
155. Assuming that an association is determined to be causal, the strength of the association plays
an important role legally in determining the specific causation question—whether the agent caused an
individual plaintiff’s injury. See infra Section VII.
156. See supra Section III.A.
157. See Miller v. Pfizer, Inc., 196 F. Supp. 2d 1062, 1079 (D. Kan. 2002) (citing this reference guide); Landrigan v. Celotex Corp., 605 A.2d 1079, 1085 (N.J. 1992). The use of the strength
of the association as a factor does not reflect a belief that weaker effects occur less frequently than
stronger effects. See Green, supra note 47, at 652–53 n.39. Indeed, the apparent strength of a given
agent is dependent on the prevalence of the other necessary elements that must occur with the agent
to produce the disease, rather than on some inherent characteristic of the agent itself. See Rothman
et al., supra note 61, at 9–11.
158. See Doll & Hill, supra note 6. The relative risk of lung cancer from smoking is a function of
intensity and duration of dose (and perhaps other factors). See Karen Leffondré et al., Modeling Smoking
History: A Comparison of Different Approaches, 156 Am. J. Epidemiology 813 (2002). The relative risk
provided in the text is based on a specified magnitude of cigarette exposure.
C. Is There a Dose–Response Relationship?
A dose–response relationship means that the greater the exposure, the greater
the risk of disease. Generally, higher exposures should increase the incidence
(or severity) of disease.159 However, some causal agents do not exhibit a dose–
response relationship when, for example, there is a threshold phenomenon (i.e.,
an exposure may not cause disease until the exposure exceeds a certain dose).160
Thus, a dose–response relationship is strong, but not essential, evidence that the
relationship between an agent and disease is causal.161
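The difference between a graded dose–response pattern and a threshold pattern can be illustrated with a small sketch. The per-stratum risk figures below are invented for illustration; each dose stratum's risk is compared with the lowest-dose stratum.

```python
def relative_risks(stratum_risks):
    """Risk in each dose stratum relative to the lowest-dose stratum."""
    baseline = stratum_risks[0]
    return [risk / baseline for risk in stratum_risks]

# Hypothetical risks for four increasing dose strata (invented numbers).
graded = [0.001, 0.002, 0.004, 0.008]     # risk rises steadily with dose
threshold = [0.001, 0.001, 0.001, 0.008]  # flat until a threshold dose is exceeded

print(relative_risks(graded))     # [1.0, 2.0, 4.0, 8.0]
print(relative_risks(threshold))  # [1.0, 1.0, 1.0, 8.0]
```

A study confined to sub-threshold doses would observe only the flat portion of the second pattern, which is one reason the absence of a dose–response gradient does not by itself rule out causation.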
159. See Newman v. Motorola, Inc., 218 F. Supp. 2d 769, 778 (D. Md. 2002) (recognizing
importance of dose–response relationship in assessing causation).
160. The question whether there is a no-effect threshold dose is a controversial one in a variety
of toxic substances areas. See, e.g., Irving J. Selikoff, Disability Compensation for Asbestos-Associated
Disease in the United States: Report to the U.S. Department of Labor 181–220 (1981); Paul Kotin,
Dose–Response Relationships and Threshold Concepts, 271 Ann. N.Y. Acad. Sci. 22 (1976); K. Robock,
Based on Available Data, Can We Project an Acceptable Standard for Industrial Use of Asbestos? Absolutely,
330 Ann. N.Y. Acad. Sci. 205 (1979); Ferebee v. Chevron Chem. Co., 736 F.2d 1529, 1536 (D.C.
Cir. 1984) (dose–response relationship for low doses is “one of the most sharply contested questions
currently being debated in the medical community”); In re TMI Litig. Consol. Proc., 927 F. Supp.
834, 844–45 (M.D. Pa. 1996) (discussing low-dose extrapolation and no-dose effects for radiation
exposure).
Moreover, good evidence to support or refute the threshold-dose hypothesis is exceedingly
unlikely because of the inability of epidemiology or animal toxicology to ascertain very small effects.
Cf. Arnold L. Brown, The Meaning of Risk Assessment, 37 Oncology 302, 303 (1980). Even the shape
of the dose–response curve—whether linear or curvilinear, and if the latter, the shape of the curve—is
a matter of hypothesis and speculation. See Allen v. United States, 588 F. Supp. 247, 419–24 (D. Utah
1984), rev’d on other grounds, 816 F.2d 1417 (10th Cir. 1987); In re Bextra & Celebrex Mktg. Sales
Practices & Prod. Liab. Litig., 524 F. Supp. 2d 1166, 1180 (N.D. Cal. 2007) (criticizing expert for
“primitive” extrapolation of risk based on assumption of linear relationship of risk to dose); Troyen
A. Brennan & Robert F. Carter, Legal and Scientific Probability of Causation for Cancer and Other Environmental Disease in Individuals, 10 J. Health Pol’y & L. 33, 43–44 (1985).
The idea that the “dose makes the poison” is a central tenet of toxicology and is attributed to Paracelsus in the sixteenth century. See Bernard D. Goldstein & Mary Sue Henifin, Reference Guide
on Toxicology, Section I.A, in this manual. It does not mean that any agent is capable of causing any
disease if an individual is exposed to a sufficient dose. Agents tend to have specific effects, see infra
Section V.H., and this dictum reflects only the idea that there is a safe dose below which an agent
does not cause any toxic effect. See Michael A. Gallo, History and Scope of Toxicology, in Casarett and
Doull’s Toxicology: The Basic Science of Poisons 1, 4–5 (Curtis D. Klaassen ed., 7th ed. 2008). For
a case in which a party made such a mistaken interpretation of Paracelsus, see Alder v. Bayer Corp.,
AGFA Div., 61 P.3d 1068, 1088 (Utah 2002). Paracelsus was also responsible for the initial articulation
of the specificity tenet. See infra Section V.H.
161. Evidence of a dose–response relationship as bearing on whether an inference of general
causation is justified is analytically distinct from determining whether evidence of the dose to which
a plaintiff was exposed is required in order to establish specific causation. On the latter matter, see
infra Section VII; Restatement (Third) of Torts: Liability for Physical and Emotional Harm § 28 cmt.
c(2) & rptrs. note (2010).
D. Have the Results Been Replicated?
Rarely, if ever, does a single study persuasively demonstrate a cause–effect relationship.162 It is important that a study be replicated in different populations and
by different investigators before a causal relationship is accepted by epidemiologists
and other scientists.163
The need to replicate research findings permeates most fields of science. In
epidemiology, research findings often are replicated in different populations.164
Consistency in these findings is an important factor in making a judgment about
causation. Different studies that examine the same exposure–disease relationship
generally should yield similar results. Although inconsistent results do not necessarily rule out a causal nexus, any inconsistencies signal a need to explore whether
different results can be reconciled with causality.
E. Is the Association Biologically Plausible (Consistent with
Existing Knowledge)?165
Biological plausibility is not an easy criterion to use and depends upon existing
knowledge about the mechanisms by which the disease develops. When biological plausibility exists, it lends credence to an inference of causality. For example,
the conclusion that high cholesterol is a cause of coronary heart disease is plausible because cholesterol is found in atherosclerotic plaques. However, observations
have been made in epidemiologic studies that were not biologically plausible at
the time but subsequently were shown to be correct.166 When an observation is
inconsistent with current biological knowledge, it should not be discarded, but
162. In Kehm v. Procter & Gamble Co., 580 F. Supp. 890, 901 (N.D. Iowa 1982), aff’d, 724 F.2d
613 (8th Cir. 1983), the court remarked on the persuasive power of multiple independent studies, each
of which reached the same finding of an association between toxic shock syndrome and tampon use.
163. This may not be the legal standard, however. Cf. Smith v. Wyeth-Ayerst Labs. Co., 278
F. Supp. 2d 684, 710 n.55 (W.D.N.C. 2003) (observing that replication is difficult to establish when
there is only one study that has been performed at the time of trial).
164. See Cadarian v. Merrell Dow Pharms., Inc., 745 F. Supp. 409, 412 (E.D. Mich. 1989)
(holding a study on Bendectin insufficient to support an expert’s opinion, because “the study’s authors
themselves concluded that the results could not be interpreted without independent confirmatory
evidence”).
165. A number of courts have adverted to this criterion in the course of their discussions of
causation in toxic substances cases. E.g., In re Phenylpropanolamine (PPA) Prods. Liab. Litig., 289 F.
Supp. 2d 1230, 1247–48 (W.D. Wash. 2003); Cook v. United States, 545 F. Supp. 306, 314–15 (N.D.
Cal. 1982) (discussing biological implausibility of a two-peak increase of disease when plotted against
time); Landrigan v. Celotex Corp., 605 A.2d 1079, 1085–86 (N.J. 1992) (discussing the existence vel
non of biological plausibility); see also Bernard D. Goldstein & Mary Sue Henifin, Reference Guide
on Toxicology, Section III.E, in this manual.
166. See In re Rezulin Prods. Liab. Litig., 369 F. Supp. 2d 398, 405 (S.D.N.Y. 2005); In re
Phenylpropanolamine (PPA) Prods. Liab. Litig., 289 F. Supp. 2d 1230, 1247 (W.D. Wash. 2003).
the observation should be confirmed before significance is attached to it. The
saliency of this factor varies depending on the extent of scientific knowledge
about the cellular and subcellular mechanisms through which the disease process
works. The mechanisms of some diseases are understood quite well based on the
available evidence, including from toxicologic research, whereas other mechanism explanations are merely hypothesized—although hypotheses are sometimes
accepted under this factor.167
F. Have Alternative Explanations Been Considered?
The importance of considering the possibility of bias and confounding and ruling
out the possibilities is discussed above.168
G. What Is the Effect of Ceasing Exposure?
If an agent is a cause of a disease, then one would expect that cessation of
exposure to that agent ordinarily would reduce the risk of the disease. This has
been the case, for example, with cigarette smoking and lung cancer. In many
situations, however, relevant data are simply not available regarding the possible
effects of ending the exposure. But when such data are available and eliminating
exposure reduces the incidence of disease, this factor strongly supports a causal
relationship.
H. Does the Association Exhibit Specificity?
An association exhibits specificity if the exposure is associated only with a single
disease or type of disease.169 The vast majority of agents do not cause a wide
169. This criterion reflects the fact that although an agent causes one disease, it does not necessarily cause other diseases. See, e.g., Nelson v. Am. Sterilizer Co., 566 N.W.2d 671, 676–77 (Mich. Ct. App. 1997) (affirming dismissal of plaintiff’s claims that chemical exposure caused her liver disorder, but recognizing that evidence supported claims for neuropathy and other illnesses); Sanderson v. Int’l Flavors & Fragrances, Inc., 950 F. Supp. 981, 996–98 (C.D. Cal. 1996); see also Taylor v. Airco, Inc., 494 F. Supp. 2d 21, 27 (D. Mass. 2007) (holding that plaintiff’s expert could testify to causal relationship between vinyl chloride and one type of liver cancer for which there was only modest support given strong causal evidence for vinyl chloride and another type of liver cancer).
When a party claims that evidence of a causal relationship between an agent and one disease is relevant to whether the agent caused another disease, courts have required the party to show that
variety of effects. For example, asbestos causes mesothelioma and lung cancer and may
cause one or two other cancers, but there is no evidence that it causes any other
types of cancers. Thus, a study that finds that an agent is associated with many different diseases should be examined skeptically. Nevertheless, there may be causal
relationships in which this guideline is not satisfied. Cigarette manufacturers have
long claimed that because cigarettes have been linked to lung cancer, emphysema,
bladder cancer, heart disease, pancreatic cancer, and other conditions, there is no
specificity and the relationships are not causal. There is, however, at least one good
reason why inferences about the health consequences of tobacco do not require
specificity: Because tobacco and cigarette smoke are not in fact single agents but
consist of numerous harmful agents, smoking represents exposure to multiple
agents, with multiple possible effects. Thus, whereas evidence of specificity may
strengthen the case for causation, lack of specificity does not necessarily undermine
it where there is a good biological explanation for its absence.
I. Are the Findings Consistent with Other Relevant Knowledge?
In addressing the causal relationship of lung cancer to cigarette smoking, researchers examined trends over time for lung cancer and for cigarette sales in the United
States. A marked increase in lung cancer death rates in men was observed, which
appeared to follow the increase in sales of cigarettes. Had the increase in lung
cancer deaths followed a decrease in cigarette sales, it might have given researchers
pause. It would not have precluded a causal inference, but the inconsistency of the
trends in cigarette sales and lung cancer mortality would have had to be explained.
VI. What Methods Exist for Combining the
Results of Multiple Studies?
Not infrequently, the scientific record may include a number of epidemiologic studies whose findings differ. Some of the studies may show an association while others do not, or the studies may report associations, but of different
the mechanisms involved in development of the disease are similar. Thus, in Austin v. Kerr-McGee
Refining Corp., 25 S.W.3d 280 (Tex. App. 2000), the plaintiff suffered from a specific form of chronic
leukemia. Studies demonstrated a causal relationship between benzene and all leukemias, but there was
a paucity of evidence on the relationship between benzene and the specific form of leukemia from
which plaintiff suffered. The court required that plaintiff’s expert demonstrate the similarity of the
biological mechanism among leukemias as a condition for the admissibility of his causation testimony,
a requirement the court concluded had not been satisfied. Accord In re Bextra & Celebrex Mktg. Sales
Practices & Prod. Liab. Litig., 524 F. Supp. 2d 1166, 1183 (N.D. Cal. 2007); Magistrini v. One Hour
Martinizing Dry Cleaning, 180 F. Supp. 2d 584, 603 (D.N.J. 2002).
magnitude.170 In view of the fact that studies may disagree and that often many
of the studies are small and lack the statistical power needed for definitive conclusions, the technique of meta-analysis was developed, initially for clinical trials.171
Meta-analysis is a method of pooling study results to arrive at a single figure to
represent the totality of the studies reviewed.172 It is a way of systematizing the
time-honored approach of reviewing the literature, which is characteristic of science, and placing it in a standardized framework with quantitative methods for
estimating risk. In a meta-analysis, studies are given different weights in proportion
to the sizes of their study populations and other characteristics.173
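One common weighting scheme can be sketched briefly. The example below performs fixed-effect, inverse-variance pooling of relative risks on the log scale, backing out each study's standard error from its 95% confidence interval. The study figures are invented for illustration, and real meta-analyses involve further choices (random-effects models, heterogeneity statistics) not shown here.

```python
import math

# Hypothetical study results as (relative risk, 95% CI lower, 95% CI upper);
# all numbers are invented for illustration.
studies = [
    (1.8, 1.1, 3.0),  # small study: wide interval, little weight
    (1.4, 1.0, 2.0),
    (1.6, 1.3, 2.0),  # larger study: narrow interval, most weight
]

def pooled_relative_risk(studies):
    """Fixed-effect (inverse-variance) pooled estimate on the log scale."""
    weighted_sum = total_weight = 0.0
    for rr, lower, upper in studies:
        log_rr = math.log(rr)
        # Standard error recovered from the 95% CI: its width is 2 * 1.96 * SE.
        se = (math.log(upper) - math.log(lower)) / (2 * 1.96)
        weight = 1.0 / se ** 2          # more precise studies get larger weights
        weighted_sum += weight * log_rr
        total_weight += weight
    return math.exp(weighted_sum / total_weight)

print(round(pooled_relative_risk(studies), 2))  # pooled estimate, about 1.57
```

Because the weights depend on each study's precision, the pooled figure sits closest to the narrow-interval study; with heterogeneous observational studies, this sensitivity to weighting is one reason a single pooled estimate is treated cautiously.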
Meta-analysis is most appropriate when used in pooling randomized experimental trials, because the studies included in the meta-analysis share the most significant methodological characteristics, in particular, use of randomized assignment
of subjects to different exposure groups. However, often one is confronted with
nonrandomized observational studies of the effects of possible toxic substances
or agents. A method for summarizing such studies is greatly needed, but when
meta-analysis is applied to observational studies—either case-control or cohort—it
becomes more controversial.174 The reason for this is that often methodological
differences among studies are much more pronounced than they are in randomized trials. Hence, the justification for pooling the results and deriving a single
estimate of risk, for example, is problematic.175
170. See, e.g., Zandi v. Wyeth a/k/a Wyeth, Inc., No. 27-CV-06-6744, 2007 WL 3224242
(Minn. Dist. Ct. Oct. 15, 2007) (plaintiff’s expert cited 40 studies in support of a causal relationship
between hormone therapy and breast cancer; many studies found different magnitudes of increased risk).
171. See In re Paoli R.R. Yard PCB Litig., 916 F.2d 829, 856 (3d Cir. 1990), cert. denied, 499
U.S. 961 (1991); Hines v. Consol. Rail Corp., 926 F.2d 262, 273 (3d Cir. 1991); Allen v. Int’l Bus.
Mach. Corp., No. 94-264-LON, 1997 U.S. Dist. LEXIS 8016, at *71–*74 (meta-analysis of observational studies is a controversial subject among epidemiologists). Thus, contrary to the suggestion
by at least one court, multiple studies with small numbers of subjects may be pooled to reduce the
possibility of sampling error. See In re Joint E. & S. Dist. Asbestos Litig., 827 F. Supp. 1014, 1042
(S.D.N.Y. 1993) (“[N]o matter how many studies yield a positive but statistically insignificant SMR
for colorectal cancer, the results remain statistically insignificant. Just as adding a series of zeros together
yields yet another zero as the product, adding a series of positive but statistically insignificant SMRs
together does not produce a statistically significant pattern.”), rev’d, 52 F.3d 1124 (2d Cir. 1995); see
also supra note 76.
172. For a nontechnical explanation of meta-analysis, along with case studies of a variety of
scientific areas in which it has been employed, see Morton Hunt, How Science Takes Stock: The
Story of Meta-Analysis (1997).
173. Petitti, supra note 88.
174. See Donna F. Stroup et al., Meta-analysis of Observational Studies in Epidemiology: A Proposal
for Reporting, 283 JAMA 2008, 2009 (2000); Jesse A. Berlin & Carin J. Kim, The Use of Meta-Analysis
in Pharmacoepidemiology, in Pharmacoepidemiology 681, 683–84 (Brian L. Strom ed., 4th ed. 2005).
175. On rare occasions, meta-analyses of both clinical and observational studies are available.
See, e.g., In re Bextra & Celebrex Mktg. Sales Practices & Prod. Liab. Litig., 524 F. Supp. 2d 1166,
1175 (N.D. Cal. 2007) (referring to clinical and observational meta-analyses of low dose of a drug;
both analyses failed to find any effect).
A number of problems and issues arise in meta-analysis. Should only published
papers be included in the meta-analysis, or should any available studies be used,
even if they have not been peer reviewed? Can the results of the meta-analysis
itself be reproduced by other analysts? When there are several meta-analyses of a
given relationship, why do the results of different meta-analyses often disagree?
The appeal of a meta-analysis is that it generates a single estimate of risk (along
with an associated confidence interval), but this strength can also be a weakness,
and may lead to a false sense of security regarding the certainty of the estimate. A
key issue is the matter of heterogeneity of results among the studies being summarized. If there is more variance among study results than one would expect
by chance, this creates further uncertainty about the summary measure from the
meta-analysis. Such differences can arise from variations in study quality, study populations, or study designs. Differences in results make it harder to trust a single estimate of effect; the reasons for those differences need at least to be acknowledged and, if possible, explained.176 People often have an inordinate belief in the validity of the findings when a single number is attached to them, and many of the difficulties that may arise in conducting a meta-analysis, especially of observational studies such as epidemiologic ones, may consequently be overlooked.177
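Heterogeneity of the kind described here is conventionally screened with Cochran's Q statistic and the I-squared measure. A minimal sketch, using invented study results:

```python
import math

# Illustrative log relative risks and standard errors for four hypothetical
# studies of the same association (all numbers invented for illustration).
log_rr = [math.log(x) for x in (1.2, 1.5, 2.4, 0.9)]
se = [0.20, 0.25, 0.30, 0.22]

w = [1 / s**2 for s in se]  # inverse-variance weights
pooled = sum(wi * y for wi, y in zip(w, log_rr)) / sum(w)

# Cochran's Q: weighted squared deviations of each study from the pooled value.
# Under the hypothesis of no heterogeneity, Q is roughly chi-squared with
# (number of studies - 1) degrees of freedom.
q = sum(wi * (y - pooled) ** 2 for wi, y in zip(w, log_rr))
df = len(log_rr) - 1

# I^2: the share of total variation attributable to between-study
# heterogeneity rather than chance (0% = consistent results).
i_squared = max(0.0, (q - df) / q) * 100
print(f"Q = {q:.1f} on {df} df, I^2 = {i_squared:.0f}%")
```

When Q substantially exceeds its degrees of freedom, as it does here, the studies disagree more than chance would predict, and a single pooled estimate should be treated with the caution the text describes.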
VII. What Role Does Epidemiology Play in
Proving Specific Causation?
Epidemiology is concerned with the incidence of disease in populations, and
epidemiologic studies do not address the question of the cause of an individual’s
disease.178 This question, often referred to as specific causation, is beyond the
176. See Stroup et al., supra note 174 (recommending methodology for meta-analysis of observational studies).
177. Much has been written about meta-analysis recently, and some experts consider the problems
of meta-analysis to outweigh the benefits at the present time. For example, John Bailar has observed:
[P]roblems have been so frequent and so deep, and overstatements of the strength of conclusions so
extreme, that one might well conclude there is something seriously and fundamentally wrong with the
method. For the present . . . I still prefer the thoughtful, old-fashioned review of the literature by a
knowledgeable expert who explains and defends the judgments that are presented. We have not yet
reached a stage where these judgments can be passed on, even in part, to a formalized process such as
meta-analysis.
John C. Bailar III, Assessing Assessments, 277 Science 528, 529 (1997) (reviewing Morton Hunt, How
Science Takes Stock (1997)); see also Point/Counterpoint: Meta-analysis of Observational Studies, 140 Am.
J. Epidemiology 770 (1994).
178. See DeLuca v. Merrell Dow Pharms., Inc., 911 F.2d 941, 945 & n.6 (3d Cir. 1990) (“Epidemiological studies do not provide direct evidence that a particular plaintiff was injured by exposure
to a substance.”); In re Viagra Prods. Liab. Litig., 572 F. Supp. 2d 1071, 1078 (D. Minn. 2008) (“Epi-
domain of the science of epidemiology. Epidemiology has its limits at the point
where an inference is made that the relationship between an agent and a disease is
causal (general causation) and where the magnitude of excess risk attributed to the
agent has been determined; that is, epidemiologists investigate whether an agent
can cause a disease, not whether an agent did cause a specific plaintiff’s disease.179
Nevertheless, the specific causation issue is a necessary legal element in a
toxic substance case. The plaintiff must establish not only that the defendant’s
agent is capable of causing disease, but also that it did cause the plaintiff’s disease.
Thus, numerous cases have confronted the legal question of what is acceptable
proof of specific causation and the role that epidemiologic evidence plays in
answering that question.180 This question is not addressed by epidemiology.181 Rather, it is a legal question with which numerous courts
demiology focuses on the question of general causation (i.e., is the agent capable of causing disease?)
rather than that of specific causation (i.e., did it cause a disease in a particular individual?)” (quoting
the second edition of this reference guide)); In re Asbestos Litig., 900 A.2d 120, 133 (Del. Super. Ct.
2006); Michael Dore, A Commentary on the Use of Epidemiological Evidence in Demonstrating Cause-in-Fact,
7 Harv. Envtl. L. Rev. 429, 436 (1983).
There are some diseases that do not occur without exposure to a given toxic agent. This is the
same as saying that the toxic agent is a necessary cause for the disease, and the disease is sometimes
referred to as a signature disease (also, the agent is pathognomonic), because the existence of the disease
necessarily implies the causal role of the agent. See Kenneth S. Abraham & Richard A. Merrill, Scientific
Uncertainty in the Courts, Issues Sci. & Tech. 93, 101 (1986). Asbestosis is a signature disease for asbestos,
and vaginal adenocarcinoma (in young adult women) is a signature disease for in utero DES exposure.
179. Cf. In re “Agent Orange” Prod. Liab. Litig., 597 F. Supp. 740, 780 (E.D.N.Y. 1984) (Agent
Orange allegedly caused a wide variety of diseases in Vietnam veterans and their offspring), aff’d, 818
F.2d 145 (2d Cir. 1987).
180. In many instances, causation can be established without epidemiologic evidence. When
the mechanism of causation is well understood, the causal relationship is well established, or the timing between cause and effect is close, scientific evidence of causation may not be required. This is
frequently the situation when the plaintiff suffers traumatic injury rather than disease. This section
addresses only those situations in which causation is not evident, and scientific evidence is required.
181. Nevertheless, an epidemiologist may be helpful to the factfinder in answering this question.
Some courts have permitted epidemiologists (or those who use epidemiologic methods) to testify about
specific causation. See Ambrosini v. Labarraque, 101 F.3d 129, 137–41 (D.C. Cir. 1996); Zuchowicz v.
United States, 870 F. Supp. 15 (D. Conn. 1994); Landrigan v. Celotex Corp., 605 A.2d 1079, 1088–89
(N.J. 1992). In general, courts seem more concerned with the basis of an expert’s opinion than with
whether the expert is an epidemiologist or clinical physician. See Porter v. Whitehall Labs., Inc., 9 F.3d 607, 614
(7th Cir. 1993) (“curb side” opinion from clinician not admissible); Burton v. R.J. Reynolds Tobacco
Co., 181 F. Supp. 2d 1256, 1266–67 (D. Kan. 2002) (vascular surgeon permitted to testify to general
causation over objection based on fact he was not an epidemiologist); Wade-Greaux v. Whitehall Labs.,
874 F. Supp. 1441, 1469–72 (D.V.I.) (clinician’s multiple bases for opinion inadequate to support
causation opinion), aff’d, 46 F.3d 1120 (3d Cir. 1994); Landrigan, 605 A.2d at 1083–89 (permitting
both clinicians and epidemiologists to testify to specific causation provided the methodology used is
sound); Trach v. Fellin, 817 A.2d 1102, 1118–19 (Pa. Super. Ct. 2003) (toxicologist and pathologist
permitted to testify to specific causation).
have grappled.182 The remainder of this section is predominantly an explanation of judicial opinions. Its discussion of the reasoning behind applying the risk estimates of an epidemiologic body of evidence to an individual is, in addition, informed by epidemiologic principles and methodological research.
Before proceeding, one more caveat is in order. This section assumes that
epidemiologic evidence has been used as proof of causation for a given plaintiff.
The discussion does not address whether a plaintiff must use epidemiologic evidence to prove causation.183
Two legal issues arise with regard to the role of epidemiology in proving
individual causation: admissibility and sufficiency of evidence to meet the burden
of production. The first issue tends to receive less attention by the courts but
nevertheless deserves mention. An epidemiologic study that is sufficiently rigorous to justify a conclusion that it is scientifically valid should be admissible,184 as
it tends to make an issue in dispute more or less likely.185
182. See Restatement (Third) of Torts: Liability for Physical and Emotional Harm § 28 cmt.
c(3) (2010) (“Scientists who conduct group studies do not examine specific causation in their
research. No scientific methodology exists for assessing specific causation for an individual based on
group studies. Nevertheless, courts have reasoned from the preponderance-of-the-evidence standard
to determine the sufficiency of scientific evidence on specific causation when group-based studies
are involved”).
183. See id. § 28 cmt. c(3) & rptrs. note (“most courts have appropriately declined to impose
a threshold requirement that a plaintiff always must prove causation with epidemiologic evidence”);
see also Westberry v. Gislaved Gummi AB, 178 F.3d 257 (4th Cir. 1999) (acute response, differential
diagnosis ruled out other known causes of disease, dechallenge, rechallenge tests by expert that were
consistent with exposure to defendant’s agent causing disease, and absence of epidemiologic or toxicologic studies; holding that expert’s testimony on causation was properly admitted); Zuchowicz v.
United States, 140 F.3d 381 (2d Cir. 1998); In re Heparin Prods. Liab. Litig., 2011 WL 2971918, at
*7–*10 (N.D. Ohio July 21, 2011).
184. See DeLuca v. Merrell Dow Pharms., Inc., 911 F.2d 941, 958 (3d Cir. 1990); cf. Kehm v.
Procter & Gamble Co., 580 F. Supp. 890, 902 (N.D. Iowa 1982) (“These [epidemiologic] studies were
highly probative on the issue of causation—they all concluded that an association between tampon use
and menstrually related TSS [toxic shock syndrome] cases exists.”), aff’d, 724 F.2d 613 (8th Cir. 1984).
Hearsay concerns may limit the independent admissibility of the study, but the study could be
relied on by an expert in forming an opinion and may be admissible pursuant to Fed. R. Evid. 703 as
part of the underlying facts or data relied on by the expert.
In Ellis v. International Playtex, Inc., 745 F.2d 292, 303 (4th Cir. 1984), the court concluded that
certain epidemiologic studies were admissible despite criticism of the methodology used in the studies.
The court held that the claims of bias went to the studies’ weight rather than their admissibility. Cf.
Christophersen v. Allied-Signal Corp., 939 F.2d 1106, 1109 (5th Cir. 1991) (“As a general rule,
questions relating to the bases and sources of an expert’s opinion affect the weight to be assigned that
opinion rather than its admissibility. . . .”).
185. Even if evidence is relevant, it may be excluded if its probative value is substantially
outweighed by prejudice, confusion, or inefficiency. Fed. R. Evid. 403. However, exclusion of an
otherwise relevant epidemiologic study on Rule 403 grounds is unlikely.
In Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 591 (1993), the Court invoked the
concept of “fit,” which addresses the relationship of an expert’s scientific opinion to the facts of
the case and the issues in dispute. In a toxic substance case in which cause in fact is disputed, an epi-
Far more courts have confronted the role that epidemiology plays with
regard to the sufficiency of the evidence and the burden of production.186 The
civil burden of proof is described most often as requiring belief by the factfinder
“that what is sought to be proved is more likely true than not true.”187 The relative risk from epidemiologic studies can be adapted to this 50%-plus standard to
yield a probability or likelihood that an agent caused an individual’s disease.188 An
important caveat is necessary, however. The discussion below speaks in terms of
the magnitude of the relative risk or association found in a study. However, before
an association or relative risk is used to make a statement about the probability
of individual causation, the inferential judgment, described in Section V, that the
association is truly causal rather than spurious, is required: “[A]n agent cannot
be considered to cause the illness of a specific person unless it is recognized as a
cause of that disease in general.”189 The following discussion should be read with
this caveat in mind.190
demiologic study of the same agent to which the plaintiff was exposed that examined the association
with the same disease from which the plaintiff suffers would undoubtedly have sufficient “fit” to be
a part of the basis of an expert’s opinion. The Court’s concept of “fit,” borrowed from United States
v. Downing, 753 F.2d 1224, 1242 (3d Cir. 1985), appears equivalent to the more familiar evidentiary
concept of probative value, albeit one requiring assessment of the scientific reasoning the expert used
in drawing inferences from methodology or data to opinion.
186. We reiterate a point made at the outset of this section: This discussion of the use of a
threshold relative risk for specific causation is not epidemiology or an inquiry an epidemiologist would
undertake. This is an effort by courts and commentators to adapt the legal standard of proof to the
available scientific evidence. See supra text accompanying notes 175–179. While strength of association
is a guideline for drawing an inference of causation from an association, see supra Section V, there is
no specified threshold required.
187. Kevin F. O’Malley et al., Federal Jury Practice and Instructions § 104.01 (5th ed. 2000); see
also United States v. Fatico, 458 F. Supp. 388, 403 (E.D.N.Y. 1978) (“Quantified, the preponderance
standard would be 50%+ probable.”), aff’d, 603 F.2d 1053 (2d Cir. 1979).
188. An adherent of the frequentist school of statistics would resist this adaptation, which may
explain why many epidemiologists and toxicologists also resist it. To take the step identified in the text
of using an epidemiologic study outcome to determine the probability of specific causation requires a
shift from a frequentist approach, which involves sampling or frequency data from an empirical test,
to a subjective probability about a discrete event. Thus, a frequentist might assert, after conducting
a sampling test, that 60% of the balls in an opaque container are blue. The same frequentist would
resist the statement, “The probability that a single ball removed from the box and hidden behind a
screen is blue is 60%.” The ball is either blue or not, and no frequentist data would permit the latter
statement. “[T]here is no logically rigorous definition of what a statement of probability means with
reference to an individual instance. . . .” Lee Loevinger, On Logic and Sociology, 32 Jurimetrics J. 527,
530 (1992); see also Steve Gold, Causation in Toxic Torts: Burdens of Proof, Standards of Persuasion and
Statistical Evidence, 96 Yale L.J. 376, 382–92 (1986). Subjective probabilities about unique events are
employed by those using Bayesian methodology. See Kaye, supra note 80, at 54–62; David H. Kaye &
David A. Freedman, Reference Guide on Statistics, Section IV.D, in this manual.
189. Cole, supra note 65, at 10,284.
190. We emphasize this caveat, both because it is not intuitive and because some courts have failed
to appreciate the difference between an association and a causal relationship. See, e.g., Forsyth v. Eli Lilly
& Co., Civ. No. 95-00185 ACK, 1998 U.S. Dist. LEXIS 541, at *26–*31 (D. Haw. Jan. 5, 1998). But see
Some courts have reasoned that when epidemiologic studies find that exposure to the agent causes an incidence in the exposed group that is more than
twice the incidence in the unexposed group (i.e., a relative risk greater than 2.0),
the probability that exposure to the agent caused a similarly situated individual’s
disease is greater than 50%.191 These courts, accordingly, hold that when there is
group-based evidence finding that exposure to an agent causes an incidence of disease in the exposed group that is more than twice the incidence in the unexposed
group, the evidence is sufficient to satisfy the plaintiff’s burden of production and
permit submission of specific causation to a jury. In such a case, the factfinder may
find that it is more likely than not that the substance caused the particular plaintiff’s disease. Courts, thus, have permitted expert witnesses to testify to specific
causation based on the logic of the effect of a doubling of the risk.192
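The arithmetic behind the doubling logic can be made explicit. Assuming a valid causal relative risk (RR), the probability of causation is conventionally computed as the attributable fraction, (RR − 1)/RR, which exceeds 50% only when RR exceeds 2.0:

```python
# Sketch of the "doubling of the risk" arithmetic described in the text.
# This assumes the relative risk reflects a genuine causal relationship
# and that the other assumptions enumerated below hold.
def probability_of_causation(relative_risk: float) -> float:
    """Attributable fraction among the exposed: (RR - 1) / RR."""
    if relative_risk <= 1.0:
        return 0.0  # no excess risk to attribute to the exposure
    return (relative_risk - 1.0) / relative_risk

for rr in (1.5, 2.0, 3.0, 10.0):
    print(f"RR = {rr:4.1f}  ->  P(causation) = {probability_of_causation(rr):.0%}")
# An RR of exactly 2.0 yields 50%; only above 2.0 does the probability
# exceed the more-likely-than-not threshold.
```

A relative risk of 1.5 yields a probability of causation of only one-third, which is why courts treating 2.0 as a threshold regard such evidence as insufficient standing alone.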
While this reasoning has a certain logic as far as it goes, there are a number of
significant assumptions and important caveats that require explication:
1. A valid study and risk estimate. The propriety of this “doubling” reasoning
depends on group studies identifying a genuine causal relationship and a
reasonably reliable measure of the increased risk.193 This requires attention
Berry v. CSX Transp., Inc., 709 So. 2d 552, 568 (Fla. Dist. Ct. App. 1998) (“From epidemiologic studies
demonstrating an association, an epidemiologist may or may not infer that a causal relationship exists.”).
191. An alternative, yet similar, means to address probabilities in individual cases is use of the
attributable fraction parameter, also known as the attributable risk. See supra Section III.C. The attributable fraction is the proportion of risk among the exposed that can be attributed to the agent, above and beyond the background risk that is due to other causes. Thus, when the relative risk is greater than 2.0, the
attributable fraction exceeds 50%.
192. For a comprehensive list of cases that support proof of causation based on group studies,
see Restatement (Third) of Torts: Liability for Physical and Emotional Harm § 28 cmt. c(4) rptrs.
note (2010). The Restatement catalogues those courts that require a relative risk in excess of 2.0 as a
threshold for sufficient proof of specific causation and those courts that recognize that a lower relative
risk than 2.0 can support specific causation, as explained below. Despite considerable disagreement on
whether a relative risk of 2.0 is required or merely a taking-off point for determining the sufficiency
of the evidence on specific causation, two commentators who surveyed the cases observed that “[t]here were no clear differences in outcomes as between federal and state courts.” Russellyn S. Carruth
& Bernard D. Goldstein, Relative Risk Greater than Two in Proof of Causation in Toxic Tort Litigation, 41
Jurimetrics J. 195, 199 (2001).
193. Indeed, one commentator contends that, because epidemiology is too imprecise to measure small increases in risk accurately, studies that find a relative risk less than 2.0 generally should not be sufficient to prove causation. The concern is not with specific causation but with general
causation and the likelihood that an association less than 2.0 is noise rather than reflecting a true causal
relationship. See Michael D. Green, The Future of Proportional Liability, in Exploring Tort Law (Stuart
Madden ed., 2005); see also Samuel M. Lesko & Allen A. Mitchell, The Use of Randomized Controlled
Trials for Pharmacoepidemiology Studies, in Pharmacoepidemiology 599, 601 (Brian L. Strom ed., 4th
ed. 2005) (“it is advisable to use extreme caution in making causal inferences from small relative risks
derived from observational studies”); Gary Taubes, Epidemiology Faces Its Limits, 269 Science 164 (1995)
(explaining views of several epidemiologists about a threshold relative risk of 3.0 to seriously consider
a causal relationship); N.E. Breslow & N.E. Day, Statistical Methods in Cancer Research, in The Analysis
to the possibility of random error, bias, or confounding being the source
of the association rather than a true causal relationship as explained in Sections IV and V, supra.194
2. Similarity among study subjects and plaintiff. Only if the study subjects and
the plaintiff are similar with respect to other risk factors will a risk estimate from a study or studies be valid when applied to an individual.195
Thus, if those exposed in a study of the risk of lung cancer from smoking
smoked half a pack of cigarettes a day for 20 years, the degree of increased
incidence of lung cancer among them cannot be extrapolated to someone
who smoked two packs of cigarettes for 30 years without strong (and questionable) assumptions about the dose–response relationship.196 This is also
applicable to risk factors for competing causes. Thus, if all of the subjects
in a study are participating because they were identified as having a family
history of heart disease, the magnitude of risk found in a study of smoking
of Case-Control Studies 36 (IARC Pub. No. 32, 1980) (“[r]elative risks of less than 2.0 may readily
reflect some unperceived bias or confounding factor”); David A. Freedman & Philip B. Stark, The
Swine Flu Vaccine and Guillain-Barré Syndrome: A Case Study in Relative Risk and Specific Causation, 64
Law & Contemp. Probs. 49, 61 (2001) (“If the relative risk is near 2.0, problems of bias and confounding in the underlying epidemiologic studies may be serious, perhaps intractable.”).
194. An excellent explanation for why differential diagnoses generally are inadequate without
further proof of general causation was provided in Cavallo v. Star Enterprises, 892 F. Supp. 756 (E.D.
Va. 1995), aff’d in relevant part, 100 F.3d 1150 (4th Cir. 1996):
The process of differential diagnosis is undoubtedly important to the question of “specific causation”.
If other possible causes of an injury cannot be ruled out, or at least the probability of their contribution
to causation minimized, then the “more likely than not” threshold for proving causation may not be
met. But, it is also important to recognize that a fundamental assumption underlying this method is
that the final, suspected “cause” remaining after this process of elimination must actually be capable of
causing the injury. That is, the expert must “rule in” the suspected cause as well as “rule out” other
possible causes. And, of course, expert opinion on this issue of “general causation” must be derived
from a scientifically valid methodology.
Id. at 771 (footnote omitted); see also Ruggiero v. Warner-Lambert Co., 424 F.3d 249, 254 (2d Cir.
2005); Norris v. Baxter Healthcare Corp., 397 F.3d 878, 885 (10th Cir. 2005); Meister v. Med. Eng’g
Corp., 267 F.3d 1123, 1128–29 (D.C. Cir. 2001); Bickel v. Pfizer, Inc., 431 F. Supp. 2d 918, 923–24
(N.D. Ind. 2006); In re Rezulin Prods. Liab. Litig., 369 F. Supp. 2d 398, 436 (S.D.N.Y. 2005); Coastal
Tankships, U.S.A., Inc. v. Anderson, 87 S.W.3d 591, 608–09 (Tex. Ct. App. 2002); see generally Joseph
Sanders & Julie Machal-Fulks, The Admissibility of Differential Diagnosis Testimony to Prove Causation
in Toxic Tort Cases: The Interplay of Adjective and Substantive Law, 64 Law & Contemp. Probs. 107,
122–25 (2001) (discussing cases rejecting differential diagnoses in the absence of other proof of general
causation and contrary cases).
195. “The basic premise of probability of causation is that individual risk can be determined from
epidemiologic data for a representative population; however the premise only holds if the individual
is truly representative of the reference population.” Council on Scientific Affairs, American Medical
Association, Radioepidemiological Tables, 257 JAMA 806 (1987).
196. Conversely, a risk estimate from a study that involved a greater exposure is not applicable to
an individual exposed to a lower dose. See, e.g., In re Bextra & Celebrex Mktg. Sales Practices & Prod.
Liab. Litig., 524 F. Supp. 2d 1166, 1175–76 (N.D. Cal. 2007) (relative risk found in studies of those
who took twice the dose of others could not support expert’s opinion of causation for latter group).
on the risk of heart disease cannot validly be applied to an individual
without such a family history. Finally, if an individual has been differentially exposed to other risk factors from those in a study, the results of the
study will not provide an accurate basis for the probability of causation
for the individual.197 Consider once again a study of the effect of smoking
on lung cancer among subjects who have no asbestos exposure. The relative risk of smoking in that study would not be applicable to an asbestos
insulation worker. More generally, if the study subjects are heterogeneous
with regard to risk factors related to the outcome of interest, the relative
risk found in a study represents an average risk for the group rather than a
uniform increased risk applicable to each individual.198
3. Nonacceleration of disease. Another assumption embedded in using the risk
findings of a group study to determine the probability of causation in an
individual is that the disease is one that never would have been contracted
absent exposure. Put another way, the assumption is that the agent did not
merely accelerate occurrence of the disease without affecting the lifetime
risk of contracting the disease. Birth defects are an example of an outcome
that is not accelerated. However, for most of the chronic diseases of adulthood, it is not possible for epidemiologic studies to distinguish between
acceleration of disease and causation of new disease. If, in fact, acceleration
197. See David H. Kaye & David A. Freedman, Reference Guide on Statistics, in this manual
(explaining the problems of employing a study outcome to determine the probability of an individual’s
having contracted the disease from exposure to the agent because of variations in individuals that bear
on the risk of a given individual contracting the disease); David A. Freedman & Philip Stark, The Swine
Flu Vaccine and Guillain-Barré Syndrome: A Case Study in Relative Risk and Specific Causation, 23 Evaluation Rev. 619 (1999) (analyzing the role that individual variation plays in determining the probability
of specific causation based on the relative risk found in a study and providing a mathematical model
for calculating the effect of individual variation); Mark Parascandola, What Is Wrong with the Probability
of Causation? 39 Jurimetrics J. 29 (1998).
198. The comment of two prominent epidemiologists on this subject is illuminating:
We cannot measure the individual risk, and assigning the average value to everyone in the category
reflects nothing more than our ignorance about the determinants of lung cancer that interact with
cigarette smoke. It is apparent from epidemiological data that some people can engage in chain smoking for many decades without developing lung cancer. Others are or will become primed by unknown
circumstances and need only to add cigarette smoke to the nearly sufficient constellation of causes to
initiate lung cancer. In our ignorance of these hidden causal components, the best we can do in assessing
risk is to classify people according to measured causal risk indicators and then assign the average observed
within a class to persons within the class.
Rothman & Greenland, supra note 131, at 9; see also Ofer Shpilberg et al., The Next Stage: Molecular
Epidemiology, 50 J. Clinical Epidemiology 633, 637 (1997) (“A 1.5-fold relative risk may be composed
of a 5-fold risk in 10% of the population, and a 1.1-fold risk in the remaining 90%, or a 2-fold risk in
25% and a 1.1-fold for 75%, or a 1.5-fold risk for the entire population.”).
is involved, the relative risk from a study will understate the probability
that exposure accelerated the occurrence of the disease.199
4. Agent operates independently. Employing a risk estimate to determine the
probability of causation is not valid if the agent interacts with another
cause in a way that results in an increase in disease beyond merely the sum
of the increased incidence due to each agent separately. For example, the
relative risk of lung cancer due to smoking is around 10, while the relative
risk for asbestos exposure is approximately 5. The relative risk for someone
exposed to both is not the arithmetic sum of the two relative risks, that
is, 15, but closer to the product (50- to 60-fold), reflecting an interaction
between the two.200 Neither agent’s individual relative risk can be employed to estimate the probability of causation in someone exposed to
both asbestos and cigarette smoke.201
5. Other assumptions. Additional assumptions include (a) the agent of interest
is not responsible for fatal diseases other than the disease of interest202 and
(b) the agent does not provide a protective effect against the outcome of
interest in a subpopulation of those being studied.203
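The smoking and asbestos figures given in item 4 can make the additive-versus-multiplicative contrast concrete; the relative risks below are the approximate values stated in the text, and the arithmetic simply compares the two combination rules:

```python
# Worked illustration of the interaction described in item 4 above.
# Approximate relative risks from the text: ~10 for smoking alone,
# ~5 for asbestos exposure alone, ~50- to 60-fold for both combined.
rr_smoking = 10.0
rr_asbestos = 5.0

# What independent (additive) operation of the two agents would predict:
additive = rr_smoking + rr_asbestos        # 15-fold

# What is actually observed is close to the product of the two risks:
multiplicative = rr_smoking * rr_asbestos  # 50-fold

print(f"additive prediction: {additive:.0f}-fold")
print(f"multiplicative prediction: {multiplicative:.0f}-fold")
# Because the observed joint relative risk (~50-60) far exceeds the additive
# prediction, neither single-agent relative risk can be plugged into the
# (RR - 1)/RR formula for a person exposed to both agents.
```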
Evidence in a given case may challenge one or more of these assumptions.
Bias in a study may suggest that the study findings are inaccurate and should be estimated to be higher or lower, or even that the findings are spurious, that is, that they do not reflect a true causal relationship. A plaintiff may have been exposed to a
199. See Sander Greenland & James M. Robins, Epidemiology, Justice, and the Probability of Causation, 40 Jurimetrics J. 321 (2000); Sander Greenland, Relation of Probability of Causation to Relative
Risk and Doubling Dose: A Methodologic Error That Has Become a Social Problem, 89 Am. J. Pub. Health
1166 (1999). If acceleration occurs, then the appropriate characterization of the harm for purposes of
determining damages would have to be addressed. A defendant who only accelerates the occurrence
of harm, say, chronic back pain, that would have occurred independently in the plaintiff at a later
time is not liable for the same amount of damages as a defendant who causes a lifetime of chronic
back pain. See David A. Fischer, Successive Causes and the Enigma of Duplicated Harm, 66 Tenn. L. Rev.
1127, 1127 (1999); Michael D. Green, The Intersection of Factual Causation and Damages, 55 DePaul
L. Rev. 671 (2006).
200. We use interaction to mean that the combined effect is other than the additive sum of each
effect, which is what we would expect if the two agents operate independently. Statisticians employ
the term interaction in a different manner to mean the outcome deviates from what was expected in
the model specified in advance. See Jay S. Kaufman, Interaction Reaction, 20 Epidemiology 159 (2009);
Sander Greenland & Kenneth J. Rothman, Concepts of Interaction, in Rothman & Greenland, supra
note 131, at 329.
201. See Restatement (Third) of Torts: Liability for Physical and Emotional Harm § 28 cmt.
c(5) (2010); Jan Beyea & Sander Greenland, The Importance of Specifying the Underlying Biologic Model in
Estimating the Probability of Causation, 76 Health Physics 269 (1999).
202. This is because, in the epidemiologic studies relied on, deaths caused by the alternative disease process will mask the true magnitude of the increased incidence of the studied disease when study subjects die before developing the disease of interest.
203. See Greenland & Robins, supra note 198, at 332–33.
dose of the agent in question that is greater or lower than that to which those in
the study were exposed.204 A plaintiff may have individual factors, such as higher
age than those in the study, that make it less likely that exposure to the agent
caused the plaintiff’s disease. Similarly, an individual plaintiff may be able to rule
out other known (background) causes of the disease, such as genetics, that increase
the likelihood that the agent was responsible for that plaintiff’s disease. Evidence
of a pathological mechanism may be available for the plaintiff that is relevant to
the cause of the plaintiff’s disease.205 Before any causal relative risk from an epidemiologic study can be used to estimate the probability that the agent in question
caused an individual plaintiff’s disease, consideration of these (and related) factors
is required.206
Having additional evidence that bears on individual causation has led a few
courts to conclude that a plaintiff may satisfy his or her burden of production
even if a relative risk less than 2.0 emerges from the epidemiologic evidence.207
For example, genetics might be known to be responsible for 50% of the incidence
of a disease independent of exposure to the agent.208 If genetics can be ruled out
204. See supra Section V.C; see also Ferebee v. Chevron Chem. Co., 736 F.2d 1529, 1536 (D.C.
Cir. 1984) (“The dose–response relationship at low levels of exposure for admittedly toxic chemicals
like paraquat is one of the most sharply contested questions currently being debated in the medical community.”); In re Joint E. & S. Dist. Asbestos Litig., 774 F. Supp. 113, 115 (S.D.N.Y. 1991)
(discussing different relative risks associated with different doses), rev’d on other grounds, 964 F.2d 92
(2d Cir. 1992).
205. See Tobin v. Astra Pharm. Prods., Inc., 993 F.2d 528 (6th Cir. 1993) (plaintiff’s expert relied
predominantly on pathogenic evidence).
206. See Merrell Dow Pharms., Inc. v. Havner, 953 S.W.2d 706, 720 (Tex. 1997); Smith v.
Wyeth-Ayerst Labs. Co., 278 F. Supp. 2d 684, 708–09 (W.D.N.C. 2003) (describing expert’s effort
to refine relative risk applicable to plaintiff based on specific risk characteristics applicable to her, albeit
in an ill-explained manner); McDarby v. Merck & Co., 949 A.2d 223 (N.J. Super. Ct. App. Div.
2008); Mary Carter Andrues, Proof of Cancer Causation in Toxic Waste Litigation, 61 S. Cal. L. Rev.
2075, 2100–04 (1988). An example of a judge sitting as factfinder and considering individual factors
for a number of plaintiffs in deciding cause in fact is contained in Allen v. United States, 588 F. Supp.
247, 429–43 (D. Utah 1984), rev’d on other grounds, 816 F.2d 1417 (10th Cir. 1987), cert. denied, 484
U.S. 1004 (1988); see also Manko v. United States, 636 F. Supp. 1419, 1437 (W.D. Mo. 1986), aff’d,
830 F.2d 831 (8th Cir. 1987).
207. In re Hanford Nuclear Reservation Litig., 292 F.3d 1124, 1137 (9th Cir. 2002) (applying
Washington law) (recognizing the role of individual factors that may modify the probability of causation based on the relative risk); Magistrini v. One Hour Martinizing Dry Cleaning, 180 F. Supp. 2d
584, 606 (D.N.J. 2002) (“[A] relative risk of 2.0 is not so much a password to a finding of causation
as one piece of evidence, among others for the court to consider in determining whether an expert
has employed a sound methodology in reaching his or her conclusion.”); Miller v. Pfizer, Inc., 196 F.
Supp. 2d 1062, 1079 (D. Kan. 2002) (rejecting a threshold of 2.0 for the relative risk and recognizing that even a relative risk greater than 2.0 may be insufficient); Pafford v. Sec’y, Dept. of Health &
Human Servs., 64 Fed. Cl. 19 (2005) (acknowledging that epidemiologic studies finding a relative risk
of less than 2.0 can provide supporting evidence of causation), aff’d, 451 F.3d 1352 (Fed. Cir. 2006).
208. See generally Steve C. Gold, The More We Know, the Less Intelligent We Are? How Genomic
Information Should, and Should Not, Change Toxic Tort Causation Doctrine, 34 Harv. Envtl. L. Rev. 369
(2010); Jamie A. Grodsky, Genomics and Toxic Torts: Dismantling the Risk-Injury Divide, 59 Stan. L. Rev.
in an individual’s case, then a relative risk greater than 1.5 might be sufficient to
support an inference that the agent was more likely than not responsible for the
plaintiff’s disease.209
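The arithmetic behind this genetics example can be sketched as follows. This is an illustration under the assumptions discussed in this section: the probability of causation is approximated as (RR − 1)/RR, and the ruled-out competing cause is assumed to operate independently of the exposure.

```python
def prob_causation(rr):
    """Probability of causation under the standard (rr - 1) / rr
    approximation: excess incidence divided by total incidence
    among the exposed (background incidence normalized to 1)."""
    return (rr - 1) / rr

def prob_causation_ruled_out(rr, ruled_out_fraction):
    """Same calculation after a competing cause responsible for
    ruled_out_fraction of the background incidence is eliminated
    for this individual: the excess incidence attributable to the
    agent stays fixed while the relevant background shrinks."""
    excess = rr - 1
    background = 1 - ruled_out_fraction
    return excess / (excess + background)

# Without individual information, RR must exceed 2.0 to push the
# probability of causation above 50%:
print(prob_causation(2.0))                  # 0.5

# If genetics accounts for 50% of background incidence and can be
# ruled out, a relative risk just above 1.5 is enough:
print(prob_causation_ruled_out(1.5, 0.5))   # 0.5
```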
Indeed, this idea of eliminating a known and competing cause is central to
the methodology popularly known in legal terminology as differential diagnosis210
but is more accurately referred to as differential etiology.211 Nevertheless, the
logic is sound if the label is not: Eliminating other known and competing causes
increases the probability that a given individual’s disease was caused by exposure
to the agent. In a differential etiology, an expert first determines other known
causes of the disease in question and then attempts to ascertain whether those
competing causes can be “ruled out” as a cause of plaintiff’s disease212 as in the
1671 (2007); Gary E. Marchant, Genetic Data in Toxic Tort Litigation, 14 J.L. & Pol’y 7 (2006); Gary
E. Marchant, Genetics and Toxic Torts, 31 Seton Hall L. Rev. 949 (2001).
209. The use of probabilities in excess of .50 to support a verdict results in an all-or-nothing
approach to damages that some commentators have criticized. The criticism reflects the fact that defendants responsible for toxic agents with a relative risk just above 2.0 may be required to pay damages
not only for the disease that their agents caused, but also for all instances of the disease. Similarly, those
defendants whose agents increase the risk of disease by less than a doubling may not be required to
pay damages for any of the disease that their agents caused. See, e.g., 2 American Law Inst., Reporter’s
Study on Enterprise Responsibility for Personal Injury: Approaches to Legal and Institutional Change
369–75 (1991). Judge Posner has been in the vanguard of those advocating that damages be awarded
on a proportional basis that reflects the probability of causation or liability. See, e.g., Doll v. Brown,
75 F.3d 1200, 1206–07 (7th Cir. 1996). To date, courts have not adopted a rule that would apportion
damages based on the probability of cause in fact in toxic substances cases. See Green, supra note 192.
210. Physicians regularly employ differential diagnoses in treating their patients to identify the
disease from which the patient is suffering. See Jennifer R. Jamison, Differential Diagnosis for Primary
Practice (1999).
211. It is important to emphasize that the term “differential diagnosis” in a clinical context refers
to identifying a set of diseases or illnesses responsible for the patient’s symptoms, while “differential
etiology” refers to identifying the causal factors involved in an individual’s disease or illness. For many
health conditions, the cause of the disease or illness has no relevance to its treatment, and physicians,
therefore, do not employ this term or pursue that question. See Zandi v. Wyeth a/k/a Wyeth, Inc., No.
27-CV-06-6744, 2007 WL 3224242 (Minn. Dist. Ct. Oct. 15, 2007) (commenting that physicians do
not attempt to determine the cause of breast cancer). Thus, the standard differential diagnosis performed
by a physician is not to determine the cause of a patient’s disease. See John B. Wong et al., Reference
Guide on Medical Testimony, in this manual; Edward J. Imwinkelried, The Admissibility and Legal Sufficiency of Testimony About Differential Diagnosis (Etiology): Of Under- and Over-Estimations, 56 Baylor
L. Rev. 391, 402–03 (2004); see also Turner v. Iowa Fire Equip. Co., 229 F.3d 1202, 1208 (8th Cir.
2000) (distinguishing between differential diagnosis conducted for the purpose of identifying the disease
from which the patient suffers and one attempting to determine the cause of the disease); Creanga v.
Jardal, 886 A.2d 633, 639 (N.J. 2005) (“Whereas most physicians use the term to describe the process
of determining which of several diseases is causing a patient’s symptoms, courts have used the term in a
more general sense to describe the process by which causes of the patient’s condition are identified.”).
212. Courts regularly affirm the legitimacy of employing differential diagnostic methodology.
See, e.g., In re Ephedra Prods. Liab. Litig., 393 F. Supp. 2d 181, 187 (S.D.N.Y. 2005); Easum v. Miller,
92 P.3d 794, 802 (Wyo. 2004) (“Most circuits have held that a reliable differential diagnosis satisfies
Daubert and provides a valid foundation for admitting an expert opinion. The circuits reason that a
differential diagnosis is a tested methodology, has been subjected to peer review/publication, does not
genetics example in the preceding paragraph. Similarly, an expert attempting to
determine whether an individual’s emphysema was caused by occupational chemical exposure would inquire whether the individual was a smoker. By ruling out
(or ruling in) the possibility of other causes, the probability that a given agent was
the cause of an individual’s disease can be refined. Differential etiologies are most
critical when the agent at issue is relatively weak and is not responsible for a large
proportion of the disease in question.
Although differential etiologies are a sound methodology in principle, this
approach is only valid if general causation exists and a substantial proportion of
competing causes are known.213 Thus, for diseases for which the causes are largely
unknown, such as most birth defects, a differential etiology is of little benefit.214
And, like any scientific methodology, it can be performed in an unreliable
manner.215
VIII. Acknowledgments
The authors are grateful for the able research assistance provided by Murphy
Horne, Wake Forest Law School class of 2012, and Cory Randolph, Wake Forest
Law School class of 2010.
frequently lead to incorrect results, and is generally accepted in the medical community.” (quoting
Turner v. Iowa Fire Equip. Co., 229 F.3d 1202, 1208 (8th Cir. 2000))); Alder v. Bayer Corp., AGFA
Div., 61 P.3d 1068, 1084–85 (Utah 2002).
213. Courts have long recognized that to prove causation plaintiff need not eliminate all potential competing causes. See Stubbs v. City of Rochester, 134 N.E. 137, 140 (N.Y. 1919) (rejecting
defendant’s argument that plaintiff was required to eliminate all potential competing causes of typhoid);
see also Easum v. Miller, 92 P.3d 794, 804 (Wyo. 2004). At the same time, before a competing cause
should be considered relevant to a differential diagnosis, there must be adequate evidence that it is a
cause of the disease. See Cooper v. Smith & Nephew, Inc., 259 F.3d 194, 202 (4th Cir. 2001); Ranes
v. Adams Labs., Inc., 778 N.W.2d 677, 690 (Iowa 2010).
214. See Perry v. Novartis Pharms. Corp., 564 F. Supp. 2d 452, 469 (E.D. Pa. 2008) (finding experts’ testimony inadmissible because of failure to account for idiopathic (unknown) causes in
conducting differential diagnosis); Soldo v. Sandoz Pharms. Corp., 244 F. Supp. 2d 434, 480, 519
(W.D. Pa. 2003) (criticizing expert for failing to account for idiopathic causes); Magistrini v. One
Hour Martinizing Dry Cleaning, 180 F. Supp. 2d 584, 609 (D.N.J. 2002) (observing that 90–95% of
leukemias are of unknown causes, but proceeding incorrectly to assert that plaintiff was obliged to
prove that her exposure to defendant’s benzene was the cause of her leukemia rather than simply a
cause of the disease that combined with other exposures to benzene). But see Ruff v. Ensign-Bickford
Indus., Inc., 168 F. Supp. 2d 1271, 1286 (D. Utah 2001) (responding to defendant’s evidence that
most instances of disease are of unknown origin by stating that such matter went to the weight to be
attributed to plaintiff’s expert’s testimony not its admissibility).
215. Numerous courts have concluded that, based on the manner in which a differential diagnosis was conducted, it was unreliable and the expert’s testimony based on it is inadmissible. See, e.g.,
Glastetter v. Novartis Pharms. Corp., 252 F.3d 986, 989 (8th Cir. 2001).
Glossary of Terms
The following terms and definitions were adapted from a variety of sources,
including A Dictionary of Epidemiology (Miquel M. Porta et al. eds., 5th ed.
2008); 1 Joseph L. Gastwirth, Statistical Reasoning in Law and Public Policy
(1988); James K. Brewer, Everything You Always Wanted to Know about Statistics, but Didn’t Know How to Ask (1978); and R.A. Fisher, Statistical Methods
for Research Workers (1973).
adjustment. Methods of modifying an observed association to take into account
the effect of risk factors that are not the focus of the study and that distort
the observed association between the exposure being studied and the disease
outcome. See also direct age adjustment, indirect age adjustment.
agent. Also, risk factor. A factor, such as a drug, microorganism, chemical substance, or form of radiation, whose presence or absence can result in the
occurrence of a disease. A disease may be caused by a single agent or a number of independent alternative agents, or the combined presence of a complex
of two or more factors may be necessary for the development of the disease.
alpha. The level of statistical significance chosen by a researcher to determine if
any association found in a study is sufficiently unlikely to have occurred by
chance (as a result of random sampling error) if the null hypothesis (no association) is true. Researchers commonly adopt an alpha of .05, but the choice
is arbitrary, and other values can be justified.
alpha error. Also called Type I error and false-positive error, alpha error occurs
when a researcher rejects a null hypothesis when it is actually true (i.e.,
when there is no association). This can occur when an apparent difference
is observed between the control group and the exposed group, but the difference is not real (i.e., it occurred by chance). A common error made by
lawyers, judges, and academics is to equate the level of alpha with the legal
burden of proof.
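The meaning of alpha can be illustrated by simulation. The sketch below uses hypothetical numbers throughout: it repeatedly compares two groups drawn from the same population, so the null hypothesis is true by construction, and a two-proportion z-test at alpha = .05 should reject (falsely) in roughly 5% of trials.

```python
import math
import random

def two_prop_z(x1, n1, x2, n2):
    """Two-proportion z statistic with a pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

random.seed(0)
n, true_rate, trials = 500, 0.3, 2000
z_crit = 1.96            # two-sided critical value for alpha = .05
false_positives = 0
for _ in range(trials):
    # Both "exposed" and "control" groups share the same disease rate,
    # so any significant difference found is a Type I error.
    x1 = sum(random.random() < true_rate for _ in range(n))
    x2 = sum(random.random() < true_rate for _ in range(n))
    if abs(two_prop_z(x1, n, x2, n)) > z_crit:
        false_positives += 1

print(false_positives / trials)   # close to alpha, i.e., about 0.05
```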
association. The degree of statistical relationship between two or more events
or variables. Events are said to be associated when they occur more or less
frequently together than one would expect by chance. Association does not
necessarily imply a causal relationship. Events are said not to have an association when the agent (or independent variable) has no apparent effect on the
incidence of a disease (the dependent variable). This corresponds to a relative
risk of 1.0. A negative association means that the events occur less frequently
together than one would expect by chance, thereby implying a preventive or
protective role for the agent (e.g., a vaccine).
attributable fraction. Also, attributable risk. The proportion of disease in
exposed individuals that can be attributed to exposure to an agent, as distinguished from the proportion of disease attributed to all other causes.
attributable proportion of risk (PAR). This term has been used to denote the
fraction of risk that is attributable to exposure to a substance (e.g., X percent
of lung cancer is attributable to cigarettes). Synonymous terms include attributable fraction, attributable risk, etiologic fraction, population attributable
risk, and risk difference. See attributable risk.
background risk of disease. Also, background rate of disease. Rate of disease
in a population that has no known exposures to an alleged risk factor for the
disease. For example, the background risk for all birth defects is 3–5% of live
births.
beta error. Also called Type II error and false-negative error. Occurs when a
researcher fails to reject a null hypothesis when it is incorrect (i.e., when
there is an association). This can occur when no statistically significant difference is detected between the control group and the exposed group, but a
difference does exist.
bias. Any effect at any stage of investigation or inference tending to produce
results that depart systematically from the true values. In epidemiology, the
term bias does not necessarily carry an imputation of prejudice or other
subjective factor, such as the experimenter’s desire for a particular outcome.
This differs from conventional usage, in which bias refers to a partisan point
of view.
biological marker. A physiological change in tissue or body fluids that occurs
as a result of an exposure to an agent and that can be detected in the laboratory. Biological markers are only available for a small number of chemicals.
biological plausibility. Consideration of existing knowledge about human biology and disease pathology to provide a judgment about the plausibility that
an agent causes a disease.
case-comparison study. See case-control study.
case-control study. Also, case-comparison study, case history study, case referent
study, retrospective study. A study that starts with the identification of persons
with a disease (or other outcome variable) and a suitable control (comparison,
reference) group of persons without the disease. Such a study is often referred
to as retrospective because it starts after the onset of disease and looks back to
the postulated causal factors.
case group. A group of individuals who have been exposed to the disease,
intervention, procedure, or other variable whose influence is being studied.
causation. As used here, an event, condition, characteristic, or agent being a
necessary element of a set of other events that can produce an outcome, such
as a disease. Other sets of events may also cause the disease. For example,
smoking is a necessary element of a set of events that result in lung cancer, yet
there are other sets of events (without smoking) that cause lung cancer. Thus,
a cause may be thought of as a necessary link in at least one causal chain that
results in an outcome of interest. Epidemiologists generally speak of causation
in a group context; hence, they will inquire whether an increased incidence
of a disease in a cohort was “caused” by exposure to an agent.
clinical trial. An experimental study that is performed to assess the efficacy and
safety of a drug or other beneficial treatment. Unlike observational studies,
clinical trials can be conducted as experiments and use randomization, because
the agent being studied is thought to be beneficial.
cohort. Any designated group of persons followed or traced over a period of time
to examine health or mortality experience.
cohort study. The method of epidemiologic study in which groups of individuals
can be identified who are, have been, or in the future may be differentially
exposed to an agent or agents hypothesized to influence the incidence of
occurrence of a disease or other outcome. The groups are observed to find
out if the exposed group is more likely to develop disease. The alternative
terms for a cohort study (concurrent study, followup study, incidence study,
longitudinal study, prospective study) describe an essential feature of the
method, which is observation of the population for a sufficient number of
person-years to generate reliable incidence or mortality rates in the population subsets. This generally implies study of a large population, study for a
prolonged period (years), or both.
confidence interval. A range of values calculated from the results of a study
within which the true value is likely to fall; the width of the interval reflects
random error. Thus, if a confidence level of .95 is selected for a study, 95%
of similar studies would result in the true relative risk falling within the confidence interval. The width of the confidence interval provides an indication
of the precision of the point estimate or relative risk found in the study; the
narrower the confidence interval, the greater the confidence in the relative
risk estimate found in the study. Where the confidence interval contains a
relative risk of 1.0, the results of the study are not statistically significant.
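A common way to compute such an interval from cohort-study counts is on the log scale. The sketch below uses hypothetical counts and the Wald approximation, which is standard but not the only method:

```python
import math

def rr_confidence_interval(cases_exp, n_exp, cases_unexp, n_unexp, z=1.96):
    """Approximate confidence interval for a relative risk, computed
    on the log scale (Wald method); z = 1.96 gives ~95% coverage."""
    risk_exp = cases_exp / n_exp
    risk_unexp = cases_unexp / n_unexp
    rr = risk_exp / risk_unexp
    se_log_rr = math.sqrt((1 - risk_exp) / cases_exp +
                          (1 - risk_unexp) / cases_unexp)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper

# Hypothetical cohort: 30 cases among 1,000 exposed persons,
# 15 cases among 1,000 unexposed persons.
rr, lo, hi = rr_confidence_interval(30, 1000, 15, 1000)
print(round(rr, 2), round(lo, 2), round(hi, 2))  # 2.0 1.08 3.69
# Because this interval excludes 1.0, the hypothetical result is
# statistically significant at the corresponding alpha of .05.
```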
confounding factor. Also, confounder. A factor that is both a risk factor for
the disease and a factor associated with the exposure of interest. Confounding refers to a situation in which an association between an exposure and
outcome is all or partly the result of a factor that affects the outcome but is
unaffected by the exposure.
control group. A comparison group comprising individuals who have not been
exposed to the disease, intervention, procedure, or other variable whose
influence is being studied.
cross-sectional study. A study that examines the relationship between disease
and variables of interest as they exist in a population at a given time. A
cross-sectional study measures the presence or absence of disease and other
variables in each member of the study population. The data are analyzed to
determine if there is a relationship between the existence of the variables and
disease. Because cross-sectional studies examine only a particular moment in
time, they reflect the prevalence (existence) rather than the incidence (rate)
of disease and can offer only a limited view of the causal association between
the variables and disease. Because exposures to toxic agents often change over
time, cross-sectional studies are rarely used to assess the toxicity of exogenous
agents.
data dredging. Jargon that refers to results identified by researchers who, after
completing a study, pore through their data seeking to find any associations
that may exist. In general, good research practice is to identify the hypotheses
to be investigated in advance of the study; hence, data dredging is generally
frowned on. In some cases, however, researchers conduct exploratory studies
designed to generate hypotheses for further study.
demographic study. See ecological study.
dependent variable. The outcome that is being assessed in a study based on the
effect of another characteristic—the independent variable. Epidemiologic
studies attempt to determine whether there is an association between the
independent variable (exposure) and the dependent variable (incidence of
disease).
differential misclassification. A form of bias that is due to the misclassification
of individuals or a variable of interest when the misclassification varies among
study groups. This type of bias occurs when, for example, it is incorrectly
determined that individuals in a study are unexposed to the agent being
studied when in fact they are exposed. See nondifferential misclassification.
direct adjustment. A technique used to eliminate any difference between two
study populations based on age, sex, or some other parameter that might
result in confounding. Direct adjustment entails comparison of the study
group with a large reference population to determine the expected rates based
on the characteristic, such as age, for which adjustment is being performed.
dose. Generally refers to the intensity or magnitude of exposure to an agent
multiplied by the duration of exposure. Dose may be used to refer only to
the intensity of exposure.
dose–response relationship. A relationship in which a change in amount,
intensity, or duration of exposure to an agent is associated with a change—
either an increase or a decrease—in risk of disease.
double blinding. A method used in experimental studies in which neither the
individuals being studied nor the researchers know during the study whether
any individual has been assigned to the exposed or control group. Double
blinding is designed to prevent knowledge of the group to which the individual was assigned from biasing the outcome of the study.
ecological fallacy. Also, aggregation bias, ecological bias. An error that occurs
from inferring that a relationship that exists for groups is also true for individuals. For example, if a country with a higher proportion of fishermen also
has a higher rate of suicides, then inferring that fishermen must be more likely
to commit suicide is an ecological fallacy.
ecological study. Also, demographic study. A study of the occurrence of disease
based on data from populations, rather than from individuals. An ecological
study searches for associations between the incidence of disease and suspected
disease-causing agents in the studied populations. Researchers often conduct
ecological studies by examining easily available health statistics, making these
studies relatively inexpensive in comparison with studies that measure disease
and exposure to agents on an individual basis.
epidemiology. The study of the distribution and determinants of disease or other
health-related states and events in populations and the application of this study
to control of health problems.
error. Random error (sampling error) is the error that is due to chance when the
result obtained for a sample differs from the result that would be obtained if
the entire population (universe) were studied.
etiologic factor. An agent that plays a role in causing a disease.
etiology. The cause of disease or other outcome of interest.
experimental study. A study in which the researcher directly controls the conditions. Experimental epidemiology studies (also clinical studies) entail random
assignment of participants to the exposed and control groups (or some other
method of assignment designed to minimize differences between the groups).
exposed, exposure. In epidemiology, the exposed group (or the exposed) is used
to describe a group whose members have been exposed to an agent that may
be a cause of a disease or health effect of interest, or possess a characteristic
that is a determinant of a health outcome.
false-negative error. See beta error.
false-positive error. See alpha error.
followup study. See cohort study.
general causation. Issue of whether an agent increases the incidence of disease in
a group and not whether the agent caused any given individual’s disease.
Because of individual variation, a toxic agent generally will not cause disease in
every exposed individual.
generalizable. When the results of a study are applicable to populations other
than the study population, such as the general population.
in vitro. Within an artificial environment, such as a test tube (e.g., the cultivation
of tissue in vitro).
in vivo. Within a living organism (e.g., the cultivation of tissue in vivo).
incidence rate. The number of people in a specified population falling ill from a
particular disease during a given period. More generally, the number of new
events (e.g., new cases of a disease in a defined population) within a specified
period of time.
incidence study. See cohort study.
independent variable. A characteristic that is measured in a study and that is
suspected to have an effect on the outcome of interest (the dependent variable). Thus, exposure to an agent is measured in a cohort study to determine
whether that independent variable has an effect on the incidence of disease,
which is the dependent variable.
indirect adjustment. A technique employed to minimize error that might
result when comparing two populations because of differences in age, sex,
or another parameter that may independently affect the rate of disease in the
populations. The incidence of disease in a large reference population, such as
all residents of a country, is calculated for each subpopulation (based on the
relevant parameter, such as age). Those incidence rates are then applied to
the study population with its distribution of persons to determine the overall
incidence rate for the study population, which provides a standardized mortality or morbidity ratio (often referred to as SMR).
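The indirect-adjustment computation can be sketched with a small example. The age bands, rates, and counts below are invented for illustration only:

```python
# Age-specific disease rates in a large reference population
# (cases per person-year); hypothetical values.
reference_rates = {"<40": 0.001, "40-59": 0.004, "60+": 0.010}

# Person-years of observation in the study population, by age band.
study_person_years = {"<40": 5000, "40-59": 3000, "60+": 2000}

observed_cases = 52

# Expected cases if the study population had experienced the
# reference population's age-specific rates:
expected_cases = sum(reference_rates[age] * study_person_years[age]
                     for age in reference_rates)   # 5 + 12 + 20 = 37

# Standardized mortality (or morbidity) ratio: observed over expected.
smr = observed_cases / expected_cases
print(round(smr, 2))   # 1.41: about 41% more cases than expected
```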
inference. The intellectual process of making generalizations from observations.
In statistics, the development of generalizations from sample data, usually with
calculated degrees of uncertainty.
information bias. Also, observational bias. Systematic error in measuring data
that results in differential accuracy of information (such as exposure status)
for comparison groups.
interaction. When the magnitude or direction (positive or negative) of the effect
of one risk factor differs depending on the presence or level of the other. In
interaction, the effect of two risk factors together is different (greater or less)
than the sum of their individual effects.
meta-analysis. A technique used to combine the results of several studies to
enhance the precision of the estimate of the effect size and reduce the
plausibility that the association found is due to random sampling error.
Meta-analysis is best suited to pooling results from randomly controlled
experimental studies, but if carefully performed, it also may be useful for
observational studies.
misclassification bias. The erroneous classification of an individual in a study as
exposed to the agent when the individual was not, or incorrectly classifying
a study individual with regard to disease. Misclassification bias may exist in
all study groups (nondifferential misclassification) or may vary among groups
(differential misclassification).
model. A representation or simulation of an actual situation. This may be either
(1) a mathematical representation of characteristics of a situation that can be
manipulated to examine consequences of various actions; (2) a representation
of a country’s situation through an “average region” with characteristics
resembling those of the whole country; or (3) the use of animals as a substitute
for humans in an experimental system to ascertain an outcome of interest.
morbidity. State of illness or disease. Morbidity rate may refer to either the
incidence rate or prevalence rate of disease.
mortality rate. Proportion of a population that dies of a disease or of all causes.
The numerator is the number of individuals dying; the denominator is the
total population in which the deaths occurred. The unit of time is usually a
calendar year.
multivariate analysis. A set of techniques used when the variation in several
variables has to be studied simultaneously. In statistics, any analytical method
that allows the simultaneous study of two or more independent factors or
variables.
nondifferential misclassification. Error due to misclassification of individuals
or a variable of interest into the wrong category when the misclassification
does not vary among study groups. The error may result from limitations in data
collection, may result in bias, and will often produce an underestimate of the
true association. See differential misclassification.
null hypothesis. A hypothesis that states that there is no true association between
a variable and an outcome. At the outset of any observational or experimental
study, the researcher must state a proposition that will be tested in the study.
In epidemiology, this proposition typically addresses the existence of an
association between an agent and a disease. Most often, the null hypothesis
is a statement that exposure to Agent A does not increase the occurrence
of Disease D. The results of the study may justify a conclusion that the null
hypothesis (no association) has been disproved (e.g., a study that finds a strong
association between smoking and lung cancer). A study may fail to disprove
the null hypothesis, but that alone does not justify a conclusion that the null
hypothesis has been proved.
observational study. An epidemiologic study in situations in which nature is
allowed to take its course, without intervention from the investigator. For
example, in an observational study the subjects of the study are permitted to
determine their level of exposure to an agent.
odds ratio (OR). Also, cross-product ratio, relative odds. The ratio of the odds
that a case (one with the disease) was exposed to the odds that a control (one
without the disease) was exposed. For most purposes the odds ratio from a
case-control study is quite similar to a risk ratio from a cohort study.
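The computation can be illustrated with a hypothetical case-control table; the counts below are invented for illustration only.

```python
# Odds ratio from a hypothetical case-control 2x2 table:
#                  exposed   unexposed
#   cases:            40         60
#   controls:         20         80
a, b = 40, 60   # cases: exposed, unexposed
c, d = 20, 80   # controls: exposed, unexposed

odds_cases = a / b        # odds of exposure among cases
odds_controls = c / d     # odds of exposure among controls
odds_ratio = odds_cases / odds_controls   # equivalently (a*d)/(b*c)
print(f"OR = {odds_ratio:.2f}")
```

The same ratio falls out of the "cross-product" (a×d)/(b×c), which is why the odds ratio is also called the cross-product ratio.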
p (probability), p-value. The p-value is the probability of getting a value of
the test outcome equal to or more extreme than the result observed, given
that the null hypothesis is true. The letter p, followed by the abbreviation
“n.s.” (not significant) means that p > .05 and that the association was not
statistically significant at the .05 level of significance. The statement “p < .05”
means that p is less than 5%, and, by convention, the result is deemed statistically significant. Other significance levels can be adopted, such as .01 or .1.
The lower the p-value, the less likely that random error would have produced
the observed relative risk if the true relative risk is 1.
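The definition can be made concrete with a simple binomial example (a coin-flip setting chosen purely for illustration): the one-sided p-value for observing 60 or more heads in 100 tosses of a fair coin is the probability, under the null hypothesis, of a result equal to or more extreme than the one observed.

```python
from math import comb

# One-sided p-value for a binomial example: the probability, assuming the
# null hypothesis of a fair coin, of seeing k or more heads in n tosses.
def binomial_p_value(n, k):
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

p = binomial_p_value(100, 60)
print(f"p = {p:.4f}")
```

Here p comes out below .05, so by convention the excess of heads would be deemed statistically significant at the .05 level.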
pathognomonic. When an agent must be present for a disease to occur. Thus,
asbestos is a pathognomonic agent for asbestosis. See signature disease.
placebo controlled. In an experimental study, providing an inert substance to
the control group, so as to keep the control and exposed groups ignorant of
their status.
power. The probability that a difference of a specified amount will be detected
by the statistical hypothesis test, given that a difference exists. In less formal
terms, power is like the strength of a magnifying lens in its capability to identify an association that truly exists. Power is equivalent to one minus Type II
error. This is sometimes stated as Power = 1 – β.
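The relation Power = 1 – β can be computed directly for a one-sided z-test; the effect size below is hypothetical and chosen only to illustrate the calculation.

```python
from statistics import NormalDist

# Power = 1 - beta for a one-sided z-test: the probability the test
# detects a true effect of the stated size. Numbers are hypothetical.
std_normal = NormalDist()

def power_one_sided_z(effect_in_se_units, alpha=0.05):
    z_crit = std_normal.inv_cdf(1 - alpha)              # about 1.645 for alpha = .05
    beta = std_normal.cdf(z_crit - effect_in_se_units)  # Type II error rate
    return 1 - beta                                     # power = 1 - beta

pw = power_one_sided_z(2.5)   # true effect equal to 2.5 standard errors
print(f"Power = {pw:.2f}")
```

A larger true effect, a larger sample (which shrinks the standard error), or a less stringent significance level each increases power.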
prevalence. The percentage of persons with a disease in a population at a specific
point in time.
prospective study. A study in which two groups of individuals are identified:
(1) individuals who have been exposed to a risk factor and (2) individuals who
have not been exposed. Both groups are followed for a specified length of time,
and the proportion that develops disease in the first group is compared with
the proportion that develops disease in the second group. See cohort study.
random. The term implies that an event is governed by chance. See randomization.
randomization. Assignment of individuals to groups (e.g., for experimental and
control regimens) by chance. Within the limits of chance variation, randomization should make the control group and experimental group similar at the
start of an investigation and ensure that personal judgment and prejudices of
the investigator do not influence assignment. Randomization should not be
confused with haphazard assignment. Random assignment follows a predetermined plan that usually is devised with the aid of a table of random numbers.
Randomization cannot ethically be used where the exposure is known to
cause harm (e.g., cigarette smoking).
randomized trial. See clinical trial.
recall bias. Systematic error resulting from differences between two groups in a
study in accuracy of memory. For example, subjects who have a disease may
recall exposure to an agent more frequently than subjects who do not have
the disease.
relative risk (RR). The ratio of the risk of disease or death among people
exposed to an agent to the risk among the unexposed. For instance, if 10%
of all people exposed to a chemical develop a disease, compared with 5% of
people who are not exposed, the disease occurs twice as frequently among the
exposed people. The relative risk is 10%/5% = 2. A relative risk of 1 indicates
no association between exposure and disease.
research design. The procedures and methods, predetermined by an investigator,
to be adhered to in conducting a research project.
risk. A probability that an event will occur (e.g., that an individual will become
ill or die within a stated period of time or by a certain age).
risk difference (RD). The difference between the proportion of disease in the
exposed population and the proportion of disease in the unexposed population. –1.0 ≤ RD ≤ 1.0.
sample. A selected subset of a population. A sample may be random or nonrandom.
sample size. The number of subjects who participate in a study.
secular-trend study. Also, time-line study. A study that examines changes over
a period of time, generally years or decades. Examples include the decline of
tuberculosis mortality and the rise, followed by a decline, in coronary heart
disease mortality in the United States in the past 50 years.
selection bias. Systematic error that results from individuals being selected for
the different groups in an observational study who have differences other than
the ones that are being examined in the study.
sensitivity. Measure of the accuracy of a diagnostic or screening test or device in
identifying disease (or some other outcome) when it truly exists. For example,
assume that we know that 20 women in a group of 1000 women have cervical cancer. If the entire group of 1000 women is tested for cervical cancer and
the screening test only identifies 15 (of the known 20) cases of cervical cancer,
the screening test has a sensitivity of 15/20, or 75%. Also see specificity.
signature disease. A disease that is associated uniquely with exposure to an agent
(e.g., asbestosis and exposure to asbestos). See also pathognomonic.
significance level. A somewhat arbitrary level selected to minimize the risk that
an erroneous positive study outcome that is due to random error will be
accepted as a true association. The lower the significance level selected, the
less likely that false-positive error will occur.
specific causation. Whether exposure to an agent was responsible for a given
individual’s disease.
specificity. Measure of the accuracy of a diagnostic or screening test in identifying those who are disease-free. Once again, assume that 980 women out of
a group of 1000 women do not have cervical cancer. If the entire group of
1000 women is screened for cervical cancer and the screening test only identifies 900 women without cervical cancer, the screening test has a specificity
of 900/980, or 92%.
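The sensitivity and specificity calculations in the two preceding entries can be restated in a few lines of Python, using the glossary's own worked numbers.

```python
# Sensitivity and specificity using the glossary's worked example:
# 1000 women, 20 with cervical cancer; the screening test flags 15 of the
# 20 true cases and correctly clears 900 of the 980 cancer-free women.
true_positives, actual_cases = 15, 20
true_negatives, actual_noncases = 900, 980

sensitivity = true_positives / actual_cases       # 15/20 = 0.75
specificity = true_negatives / actual_noncases    # 900/980, about 0.92
print(f"Sensitivity: {sensitivity:.0%}, specificity: {specificity:.0%}")
```

Sensitivity measures how well the test catches true cases; specificity measures how well it clears those who are disease-free.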
standardized morbidity ratio (SMR). The ratio of the incidence of disease
observed in the study population to the incidence of disease that would be
expected if the study population had the same incidence of disease as some
selected reference population.
standardized mortality ratio (SMR). The ratio of the incidence of death
observed in the study population to the incidence of death that would be
expected if the study population had the same incidence of death as some
selected standard or known population.
statistical significance. A term used to describe a study result or difference
that exceeds the Type I error rate (or p-value) that was selected by the
researcher at the outset of the study. In formal significance testing, a statistically significant result is unlikely to be the result of random sampling error
and justifies rejection of the null hypothesis. Some epidemiologists believe
that formal significance testing is inferior to using a confidence interval to
express the results of a study. Statistical significance, which addresses the role
of random sampling error in producing the results found in the study, should
not be confused with the importance (for public health or public policy) of
a research finding.
stratification. Separating a group into subgroups based on specified criteria, such
as age, gender, or socioeconomic status. Stratification is used both to control
for the possibility of confounding (by separating the studied populations based
on the suspected confounding factor) and when there are other known factors that affect the disease under study. Thus, the incidence of death increases
with age, and a study of mortality might use stratification of the cohort and
control groups based on age.
study design. See research design.
systematic error. See bias.
teratogen. An agent that produces abnormalities in the embryo or fetus by disturbing maternal health or by acting directly on the fetus in utero.
teratogenicity. The capacity for an agent to produce abnormalities in the embryo
or fetus.
threshold phenomenon. A certain level of exposure to an agent below which
disease does not occur and above which disease does occur.
time-line study. See secular-trend study.
toxicology. The science of the nature and effects of poisons. Toxicologists study
adverse health effects of agents on biological organisms, such as live animals
and cells. Studies of humans are performed by epidemiologists.
toxic substance. A substance that is poisonous.
true association. Also, real association. The association that really exists between
exposure to an agent and a disease and that might be found by a perfect (but
nonetheless nonexistent) study.
Type I error. Rejecting the null hypothesis when it is true. See alpha error.
Type II error. Failing to reject the null hypothesis when it is false. See beta error.
validity. The degree to which a measurement measures what it purports to measure; the accuracy of a measurement.
variable. Any attribute, condition, or other characteristic of subjects in a study
that can have different numerical characteristics. In a study of the causes of
heart disease, blood pressure and dietary fat intake are variables that might
be measured.
References on Epidemiology
Causal Inferences (Kenneth J. Rothman ed., 1988).
William G. Cochran, Sampling Techniques (1977).
A Dictionary of Epidemiology (John M. Last et al. eds., 5th ed. 2008).
Anders Ahlbom & Steffan Norell, Introduction to Modern Epidemiology (2d
ed. 1990).
Robert C. Elston & William D. Johnson, Basic Biostatistics for Geneticists and
Epidemiologists (2008).
Encyclopedia of Epidemiology (Sarah E. Boslaugh ed., 2008).
Joseph L. Fleiss et al., Statistical Methods for Rates and Proportions (3d ed. 2003).
Leon Gordis, Epidemiology (4th ed. 2009).
Morton Hunt, How Science Takes Stock: The Story of Meta-Analysis (1997).
International Agency for Research on Cancer (IARC), Interpretation of Negative Epidemiologic Evidence for Carcinogenicity (N.J. Wald & R. Doll eds.,
1985).
Harold A. Kahn & Christopher T. Sempos, Statistical Methods in Epidemiology
(1989).
David E. Lilienfeld, Overview of Epidemiology, 3 Shepard’s Expert & Sci. Evid. Q.
25 (1995).
David E. Lilienfeld & Paul D. Stolley, Foundations of Epidemiology (3d ed.
1994).
Marcello Pagano & Kimberlee Gauvreau, Principles of Biostatistics (2d ed. 2000).
Pharmacoepidemiology (Brian L. Strom ed., 4th ed. 2005).
Richard K. Riegelman & Robert A. Hirsch, Studying a Study and Testing a Test:
How to Read the Health Science Literature (5th ed. 2005).
Bernard Rosner, Fundamentals of Biostatistics (6th ed. 2006).
Kenneth J. Rothman et al., Modern Epidemiology (3d ed. 2008).
David A. Savitz, Interpreting Epidemiologic Evidence: Strategies for Study Design
and Analysis (2003).
James J. Schlesselman, Case-Control Studies: Design, Conduct, Analysis (1982).
Lisa M. Sullivan, Essentials of Biostatistics (2008).
Mervyn Susser, Epidemiology, Health and Society: Selected Papers (1987).
References on Law and Epidemiology
American Law Institute, Reporters’ Study on Enterprise Responsibility for Personal Injury (1991).
Bert Black & David H. Hollander, Jr., Unraveling Causation: Back to the Basics, 3
U. Balt. J. Envtl. L. 1 (1993).
Bert Black & David Lilienfeld, Epidemiologic Proof in Toxic Tort Litigation, 52 Fordham L. Rev. 732 (1984).
Gerald Boston, A Mass-Exposure Model of Toxic Causation: The Content of Scientific
Proof and the Regulatory Experience, 18 Colum. J. Envtl. L. 181 (1993).
Vincent M. Brannigan et al., Risk, Statistical Inference, and the Law of Evidence: The
Use of Epidemiological Data in Toxic Tort Cases, 12 Risk Analysis 343 (1992).
Troyen Brennan, Causal Chains and Statistical Links: The Role of Scientific Uncertainty
in Hazardous-Substance Litigation, 73 Cornell L. Rev. 469 (1988).
Troyen Brennan, Helping Courts with Toxic Torts: Some Proposals Regarding Alternative Methods for Presenting and Assessing Scientific Evidence in Common Law
Courts, 51 U. Pitt. L. Rev. 1 (1989).
Philip Cole, Causality in Epidemiology, Health Policy, and Law, 27 Envtl. L. Rep.
10,279 (June 1997).
Comment, Epidemiologic Proof of Probability: Implementing the Proportional Recovery
Approach in Toxic Exposure Torts, 89 Dick. L. Rev. 233 (1984).
George W. Conk, Against the Odds: Proving Causation of Disease with Epidemiological
Evidence, 3 Shepard’s Expert & Sci. Evid. Q. 85 (1995).
Carl F. Cranor, Toxic Torts: Science, Law, and the Possibility of Justice (2006).
Carl F. Cranor et al., Judicial Boundary Drawing and the Need for Context-Sensitive
Science in Toxic Torts After Daubert v. Merrell Dow Pharmaceuticals, Inc., 16
Va. Envtl. L.J. 1 (1996).
Richard Delgado, Beyond Sindell: Relaxation of Cause-in-Fact Rules for Indeterminate
Plaintiffs, 70 Cal. L. Rev. 881 (1982).
Michael Dore, A Commentary on the Use of Epidemiological Evidence in Demonstrating
Cause-in-Fact, 7 Harv. Envtl. L. Rev. 429 (1983).
Jean Macchiaroli Eggen, Toxic Torts, Causation, and Scientific Evidence After Daubert,
55 U. Pitt. L. Rev. 889 (1994).
Daniel A. Farber, Toxic Causation, 71 Minn. L. Rev. 1219 (1987).
Heidi Li Feldman, Science and Uncertainty in Mass Exposure Litigation, 74 Tex.
L. Rev. 1 (1995).
Stephen E. Fienberg et al., Understanding and Evaluating Statistical Evidence in Litigation, 36 Jurimetrics J. 1 (1995).
Joseph L. Gastwirth, Statistical Reasoning in Law and Public Policy (1988).
Herman J. Gibb, Epidemiology and Cancer Risk Assessment, in Fundamentals of Risk
Analysis and Risk Management 23 (Vlasta Molak ed., 1997).
Steve Gold, Note, Causation in Toxic Torts: Burdens of Proof, Standards of Persuasion
and Statistical Evidence, 96 Yale L.J. 376 (1986).
Leon Gordis, Epidemiologic Approaches for Studying Human Disease in Relation to
Hazardous Waste Disposal Sites, 25 Hous. L. Rev. 837 (1988).
Michael D. Green, Expert Witnesses and Sufficiency of Evidence in Toxic Substances
Litigation: The Legacy of Agent Orange and Bendectin Litigation, 86 Nw. U. L.
Rev. 643 (1992).
Michael D. Green, The Future of Proportional Liability, in Exploring Tort Law
(Stuart Madden ed., 2005).
Sander Greenland, The Need for Critical Appraisal of Expert Witnesses in Epidemiology
and Statistics, 39 Wake Forest L. Rev. 291 (2004).
Khristine L. Hall & Ellen Silbergeld, Reappraising Epidemiology: A Response to Mr.
Dore, 7 Harv. Envtl. L. Rev. 441 (1983).
Jay P. Kesan, Drug Development: Who Knows Where the Time Goes?: A Critical
Examination of the Post-Daubert Scientific Evidence Landscape, 52 Food Drug
Cosm. L.J. 225 (1997).
Jay P. Kesan, An Autopsy of Scientific Evidence in a Post-Daubert World, 84 Geo. L.
Rev. 1985 (1996).
Constantine Kokkoris, Comment, DeLuca v. Merrell Dow Pharmaceuticals, Inc.:
Statistical Significance and the Novel Scientific Technique, 58 Brook. L. Rev. 219
(1992).
James P. Leape, Quantitative Risk Assessment in Regulation of Environmental Carcinogens, 4 Harv. Envtl. L. Rev. 86 (1980).
David E. Lilienfeld, Overview of Epidemiology, 3 Shepard’s Expert & Sci. Evid. Q.
23 (1995).
Junius McElveen, Jr., & Pamela Eddy, Cancer and Toxic Substances: The Problem of
Causation and the Use of Epidemiology, 33 Clev. St. L. Rev. 29 (1984).
Modern Scientific Evidence: The Law and Science of Expert Testimony (David
L. Faigman et al. eds., 2009–2010).
Note, Developments in the Law—Confronting the New Challenges of Scientific Evidence,
108 Harv. L. Rev. 1481 (1995).
Susan R. Poulter, Science and Toxic Torts: Is There a Rational Solution to the Problem
of Causation? 7 High Tech. L.J. 189 (1992).
Jon Todd Powell, Comment, How to Tell the Truth with Statistics: A New Statistical
Approach to Analyzing the Data in the Aftermath of Daubert v. Merrell Dow
Pharmaceuticals, 31 Hous. L. Rev. 1241 (1994).
Restatement (Third) of Torts: Liability for Physical and Emotional Harm § 28,
cmt. c & rptrs. note (2010).
David Rosenberg, The Causal Connection in Mass Exposure Cases: A Public Law
Vision of the Tort System, 97 Harv. L. Rev. 849 (1984).
Joseph Sanders, The Bendectin Litigation: A Case Study in the Life-Cycle of Mass Torts,
43 Hastings L.J. 301 (1992).
Joseph Sanders, Scientific Validity, Admissibility, and Mass Torts After Daubert, 78
Minn. L. Rev. 1387 (1994).
Joseph Sanders & Julie Machal-Fulks, The Admissibility of Differential Diagnosis to
Prove Causation in Toxic Tort Cases: The Interplay of Adjective and Substantive
Law, 64 L. & Contemp. Probs. 107 (2001).
Palma J. Strand, The Inapplicability of Traditional Tort Analysis to Environmental
Risks: The Example of Toxic Waste Pollution Victim Compensation, 35 Stan. L.
Rev. 575 (1983).
Richard W. Wright, Causation in Tort Law, 73 Cal. L. Rev. 1735 (1985).
Reference Guide on Toxicology
BERNARD D. GOLDSTEIN AND MARY SUE HENIFIN
Bernard D. Goldstein, M.D., is Professor of Environmental and Occupational Health and
Former Dean, Graduate School of Public Health, University of Pittsburgh.
Mary Sue Henifin, J.D., M.P.H., is a Partner with Buchanan Ingersoll, P.C., Princeton,
New Jersey.
CONTENTS
I. Introduction, 635
A. Toxicology and the Law, 637
B. Purpose of the Reference Guide on Toxicology, 639
C. Toxicological Study Design, 639
1. In vivo research, 640
2. In vitro research, 645
D. Extrapolation from Animal and Cell Research to Humans, 646
E. Safety and Risk Assessment, 646
1. The use of toxicological information in risk assessment, 650
F. Toxicological Processes and Target Organ Toxicity, 651
G. Toxicology and Exposure Assessment, 656
H. Toxicology and Epidemiology, 657
II. Demonstrating an Association Between Exposure and Risk of Disease, 660
A. On What Species of Animals Was the Compound Tested? What Is
Known About the Biological Similarities and Differences Between
the Test Animals and Humans? How Do These Similarities and
Differences Affect the Extrapolation from Animal Data in Assessing
the Risk to Humans? 661
B. Does Research Show That the Compound Affects a Specific Target
Organ? Will Humans Be Affected Similarly? 662
C. What Is Known About the Chemical Structure of the Compound
and Its Relationship to Toxicity? 663
D. Has the Compound Been the Subject of In Vitro Research, and if
So, Can the Findings Be Related to What Occurs In Vivo? 664
E. Is the Association Between Exposure and Disease Biologically
Plausible? 664
III. Specific Causal Association Between an Individual’s Exposure and the
Onset of Disease, 665
A. Was the Plaintiff Exposed to the Substance, and if So, Did the
Exposure Occur in a Manner That Can Result in Absorption into
the Body? 666
B. Were Other Factors Present That Can Affect the Distribution of the
Compound Within the Body? 667
C. What Is Known About How Metabolism in the Human Body Alters
the Toxic Effects of the Compound? 668
D. What Excretory Route Does the Compound Take, and How Does
This Affect Its Toxicity? 668
E. Does the Temporal Relationship Between Exposure and the Onset
of Disease Support or Contradict Causation? 668
F. If Exposure to the Substance Is Associated with the Disease, Is There
a No Observable Effect, or Threshold, Level, and if So, Was the
Individual Exposed Above the No Observable Effect Level? 669
IV. Medical History, 670
A. Is the Medical History of the Individual Consistent with the
Toxicologist’s Expert Opinion Concerning the Injury? 670
B. Are the Complaints Specific or Nonspecific? 671
C. Do Laboratory Tests Indicate Exposure to the Compound? 672
D. What Other Causes Could Lead to the Given Complaint? 672
E. Is There Evidence of Interaction with Other Chemicals? 673
F. Do Humans Differ in the Extent of Susceptibility to the Particular
Compound in Question? Are These Differences Relevant in This
Case? 674
G. Has the Expert Considered Data That Contradict His or Her
Opinion? 674
V. Expert Qualifications, 675
A. Does the Proposed Expert Have an Advanced Degree in Toxicology,
Pharmacology, or a Related Field? If the Expert Is a Physician,
Is He or She Board Certified in a Field Such as Occupational
Medicine? 675
B. Has the Proposed Expert Been Certified by the American Board
of Toxicology, Inc., or Does He or She Belong to a Professional
Organization, Such as the Academy of Toxicological Sciences or the
Society of Toxicology? 677
C. What Other Criteria Does the Proposed Expert Meet? 678
VI. Acknowledgments, 679
Glossary of Terms, 680
References on Toxicology, 685
I. Introduction
The discipline of toxicology is primarily concerned with identifying and understanding the adverse effects of external chemical and physical agents on biological
systems. The interface of the evidence from toxicological science with toxic torts
can be complex, in part reflecting the inherent challenges of bringing science into
a courtroom, but also because of issues particularly pertinent to toxicology. For
the most part, toxicological study begins with a chemical or physical agent and
asks what impact it will have, while toxic tort cases begin with an individual or
a group that has suffered an adverse impact and makes claims about its cause. A
particular challenge is that only rarely is the adverse impact highly specific to the
toxic agent; for example, the relatively rare lung cancer known as mesothelioma
is almost always caused by asbestos. The more common form of lung cancer,
bronchial carcinoma, also can be caused by asbestos, but asbestos is a relatively
uncommon cause compared with smoking, radon, and other known causes of lung
cancer.1 Lung cancer itself is unusual in that for the vast majority of cases, we can
point to a known cause—smoking. However, for many diseases, there are few if
any known causes, for example, pancreatic cancer. Even when there are known
causes of a disease, most individual cases are often not ascribable to any of the
known causes, such as with leukemia.
In general, there are only a limited number of ways that biological tissues
can respond, and there are many causes for each response. Accordingly, the role
of toxicology in toxic tort cases often is to provide information that helps evaluate the causal probability that an adverse event with potentially many causes is
caused by a specific agent. Similarly, toxicology is commonly used as a basis for
regulating chemicals, depending upon their potential for effect. Assertions related
to the toxicological predictability of an adverse consequence in relation to the
stringency of the regulatory law are not uncommon bases for legal actions against
regulatory agencies.
Identifying cause-and-effect relationships in toxicology can be relatively
straightforward; for example, when placed on the skin, concentrated sulfuric acid
will cause massive tissue destruction, and carbon monoxide poisoning is identifiable by the extent to which carbon monoxide is attached to the oxygen-carrying
portion of blood hemoglobin, thereby decreasing oxygen availability to the body.
But even these two seemingly straightforward examples serve to illustrate the
complexity of toxicology and particularly its emphasis on understanding dose–
response relationships. The tissue damage caused by sulfuric acid is not specific
to this chemical, and at lower doses, no effect will be seen. Carbon monoxide is
not only an external poison but is a product of normal internal metabolism such
1. Contrast this issue with the relatively straightforward situation in infectious disease in which
the disease name identifies the cause; for example, cholera is caused by Vibrio cholerae, tuberculosis by
Mycobacterium tuberculosis, and HIV/AIDS by the human immunodeficiency virus.
that about 1 out of 200 hemoglobin molecules will normally have carbon monoxide attached, and this can increase depending upon concomitant disease states.
Furthermore, the complex temporal relation governing the uptake and release of
carbon monoxide from hemoglobin also must be considered in assessing the extent
to which an adverse impact may be ascribable to carbon monoxide exposure. Thus
the diagnosis of carbon monoxide poisoning requires far more information than
the simple presence of detectable carbon monoxide in the blood.
Complexity in toxicology is derived primarily from three factors. The first
is that chemicals often change within the body as they go through various routes
to eventual elimination.2 Thus absorption, distribution, metabolism, and excretion are central to understanding the toxicology of an agent. The second is that
human sensitivity to chemical and physical agents can vary greatly among individuals, often as a result of differences in absorption, distribution, metabolism,
or excretion, as well as target organ sensitivity—all of which can be genetically
determined. The third major source of complexity is the need for extrapolation,
either across species, because much toxicological data are obtained from studies
in laboratory animals, or across doses, because human toxicological and epidemiological data often are limited to specific dose ranges that differ from the dose
suffered by a plaintiff alleging a toxic tort impact. All three of these factors are
responsible for much of the complexity in utilizing toxicology for tort or regulatory judicial decisions and are described in more detail below.
Classically, toxicology is known as the science of poisons. It is the study of
the adverse effects of chemical and physical agents on living organisms.3 Although
it is an age-old science, toxicology has only recently become a discipline distinct
from pharmacology, biochemistry, cell biology, and related fields.
There are three central tenets of toxicology. First, “the dose makes the
poison”; this implies that all chemical agents are intrinsically hazardous—whether
they cause harm is only a question of dose.4 Even water, if consumed in large
quantities, can be toxic. Second, each chemical or physical agent tends to produce a specific pattern of biological effects that can be used to establish disease
2. Direct-acting toxic agents are those whose toxicity is due to the parent chemical entering the
body. A change in chemical structure through metabolism usually results in detoxification. Indirect-acting chemicals are those that must first be metabolized to a harmful intermediate for toxicity to occur.
For an overview of metabolism in toxicology, see R.A. Kemper et al., Metabolism: A Determinant of
Toxicity, in Principles and Methods of Toxicology 103–178 (A. Wallace Hayes ed., 5th ed. 2008).
3. Casarett and Doull’s Toxicology: The Basic Science of Poisons 13 (Curtis D. Klaassen ed.,
7th ed. 2007).
4. A discussion of more modern formulations of this principle, which was articulated by
Paracelsus in the sixteenth century, can be found in David L. Eaton, Scientific Judgment and Toxic Torts—
A Primer in Toxicology for Judges and Lawyers, 12 J.L. & Pol’y 5, 15 (2003); Ellen K. Silbergeld, The Role
of Toxicology in Causation: A Scientific Perspective, 1 Cts. Health Sci. & L. 374, 378 (1991). A short review
of the field of toxicology can be found in Curtis D. Klaassen, Principles of Toxicology and Treatment of
Poisoning, in Goodman and Gilman’s The Pharmacological Basis of Therapeutics 1739 (11th ed. 2008).
causation.5 Third, the toxic responses in laboratory animals are useful predictors of
toxic responses in humans. Each of these tenets, and their exceptions, is discussed
in greater detail in this reference guide.
The science of toxicology attempts to determine at what doses foreign agents
produce their effects. The foreign agents classically of interest to toxicologists are
all chemicals (including foods and drugs) and physical agents in the form of radiation, but not living organisms that cause infectious diseases.6
The discipline of toxicology provides scientific information relevant to the
following questions:
1. What hazards does a chemical or physical agent present to human populations or the environment?
2. What degree of risk is associated with chemical exposure at any given
dose?7
Toxicological studies, by themselves, rarely offer direct evidence that a disease
in any one individual was caused by a chemical exposure.8 However, toxicology
can provide scientific information regarding the increased risk of contracting a
disease at any given dose and help rule out other risk factors for the disease. Toxicological evidence also contributes to the weight of evidence supporting causal
inferences by describing the metabolic, cellular, and other physiological effects through which a chemical causes a specific disease.
A. Toxicology and the Law
The growing concern about chemical causation of disease is reflected in the public
attention devoted to lawsuits alleging toxic torts, as well as in litigation concerning
the many federal and state regulations related to the release of potentially toxic
compounds into the environment.
Toxicological evidence frequently is offered in two types of litigation: tort
and regulatory. In tort litigation, toxicologists offer evidence that either supports
5. Some substances, such as central nervous system toxicants, can produce complex and nonspecific symptoms, such as headaches, nausea, and fatigue.
6. Forensic toxicology, a subset of toxicology generally concerned with criminal matters, is not
addressed in this reference guide, because it is a highly specialized field with its own literature and
methodologies that do not relate directly to toxic tort or regulatory issues.
7. In standard risk assessment terminology, hazard is an intrinsic property of a chemical or physical agent, while risk is dependent both upon hazard and on the extent of exposure. Note that this first
“law” of toxicology is particularly pertinent to questions of specific causation, while the second “law”
of toxicology, the specificity of effect, is pertinent to questions of general causation.
8. There are exceptions, for example, when the potentially offending agent is measured in the blood or other body constituents at levels high enough to be consistent with reasonably specific health impacts, such as in carbon monoxide poisoning.
or refutes plaintiffs’ claims that their diseases or injuries were caused by chemical exposures.9 In regulatory litigation, toxicological evidence is used to either
support or challenge government regulations concerning a chemical or a class of
chemicals. In regulatory litigation, toxicological evidence addresses the issue of
how exposure affects populations10 rather than addressing specific causation, and
agency determinations are usually subject to the court’s deference.11
Dose is a central concept in the field of toxicology, and an expert toxicologist will consider the extent of a plaintiff's dose in forming an opinion.12 But
dose has not been a central issue in many of the most important judicial decisions concerning the use of toxicological evidence in toxic tort litigation.
These decisions have mostly turned on general causation issues: for example, is a silicone breast
implant capable of causing rheumatoid arthritis, or is Bendectin capable of causing
birth defects?13 However, in most specific causation disputes involving exposure
to a chemical known to be capable of causing the observed effect, the primary issue
will be whether the exposure was to a dose sufficient to be a likely cause
of that effect.
9. See, e.g., Gen. Elec. Co. v. Joiner, 522 U.S. 136 (1997); Daubert v. Merrell Dow Pharms.,
Inc., 509 U.S. 579 (1993). Courts have held that toxicologists can testify as to disease causation related
to chemical exposures. See, e.g., Bonner v. ISP Techs, Inc., 259 F.3d 924, 928–31 (8th Cir. 2001);
Paoli R.R. v. Monsanto Co., 915 F.2d 829 (3d Cir. 1990); Loudermill v. Dow Chem. Co., 863 F.2d
566, 569–70 (8th Cir. 1988).
10. Again, there are exceptions. For example, certain regulatory approaches, such as the control
of hazardous air pollutants, are based on the potential impact to a putative maximally exposed individual rather than to the general population.
11. See, e.g., Int’l Union, United Mine Workers of Am. v. U.S. Dep’t of Labor, 358 F.3d 40,
43–44 (D.C. Cir. 2004) (determinations by Secretary of Labor are given deference by the court, but
must be supported by some evidence, and cannot be capricious or arbitrary); N.M. Mining Ass’n v.
N.M. Water Quality Control Comm., 150 P.3d 991, 995–96 (N.M. Ct. App. 2006) (action by a government agency is presumptively valid and will be given deference by the court. The court will only
overturn a regulatory decision if it is capricious and arbitrary, or not supported by substantial evidence).
12. Dose is a function of both concentration and duration. Haber's rule is a century-old simplified expression of dose effects holding that the product of exposure concentration and duration producing a given effect is a
constant (e.g., exposure to an agent at 10 parts per million for 1 hour has the same impact as exposure
to 1 part per million for 10 hours). Exposure levels, which are concentrations, are often confused with
dose. This can be particularly problematic when attempting to understand the implications of exposure
to a level that exceeds a regulatory standard that is set for a different time frame. For example, assume
a drinking water contaminant is a known cause of cancer. To avoid a 1 in 100,000 lifetime risk caused
by this contaminant in drinking water, and assuming that the average person will drink approximately
2000 mL of water daily for a lifetime, the regulatory authority sets the allowable contaminant standard
in drinking water at 10 µg/L. Drinking one glass of water containing 20 µg/L of this contaminant,
although exceeding the standard, does not come close to achieving a “reasonably medically probable”
cause of an individual case of cancer.
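The arithmetic in note 12 can be sketched in a short calculation. The sketch below is illustrative only: the 70-year lifetime, the 250-mL glass, and the linear scaling of risk with cumulative intake are assumptions added for the example, not figures from the text.

```python
# Illustrative sketch of the dose arithmetic in note 12. The linear
# risk-scaling and the 70-year lifetime are simplifying assumptions.

def haber_product(concentration_ppm, hours):
    """Haber's rule: concentration x duration is treated as a dose surrogate."""
    return concentration_ppm * hours

# 10 ppm for 1 hour vs. 1 ppm for 10 hours -> same product
assert haber_product(10, 1) == haber_product(1, 10)

# Drinking-water example: a 10 ug/L standard corresponds to a 1-in-100,000
# lifetime risk for ~2,000 mL/day over a lifetime (assumed here: 70 years).
standard_ug_per_L = 10.0
lifetime_risk_at_standard = 1e-5
lifetime_intake_L = 2.0 * 365 * 70          # total liters over a lifetime

# One 250-mL glass at 20 ug/L, scaled linearly against lifetime intake
# at the standard:
glass_intake_ug = 0.25 * 20.0
lifetime_intake_ug = lifetime_intake_L * standard_ug_per_L
one_glass_risk = lifetime_risk_at_standard * glass_intake_ug / lifetime_intake_ug
print(f"approximate risk from one glass: {one_glass_risk:.2e}")
```

Under these assumptions the single-glass risk is on the order of 10^-10, far below any "reasonably medically probable" threshold, which is the footnote's point.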
13. See, e.g., In re Silicone Gel Breast Implants Prods. Liab. Litig., 318 F. Supp. 2d 879, 891
(C.D. Cal. 2004); Joseph Sanders, From Science to Evidence: The Testimony on Causation in the Bendectin
Cases, 46 Stan. L. Rev. 1, 19 (1993).
B. Purpose of the Reference Guide on Toxicology
This reference guide focuses on the scientific issues that arise most frequently in
toxic tort cases. Where it is appropriate, the guide explores the use of regulatory data and how the courts treat such data. It also provides an overview of the
basic principles and methodologies of toxicology and offers a scientific context
for proffered expert opinion based on toxicological data.14 The reference guide
describes research methods in toxicology and the relationship between toxicology
and epidemiology, and it provides model questions for evaluating the admissibility
and strength of an expert’s opinion. Following each question is an explanation
of the type of toxicological data or information that is offered in response to the
question, as well as a discussion of its significance.
C. Toxicological Study Design
Toxicological studies usually involve exposing laboratory animals (in vivo research)
or cells or tissues (in vitro research) to chemical or physical agents, monitoring the
outcomes (such as cellular abnormalities, tissue damage, organ toxicity, or tumor
formation), and comparing the outcomes with those for unexposed control groups.
As explained below,15 the extent to which animal and cell experiments accurately
predict human responses to chemical exposures is subject to debate.16 However,
because it is often unethical to experiment on humans by exposing them to known
doses of chemical agents, animal toxicological evidence often provides the best
scientific information about the risk of disease from a chemical exposure.17
In contrast to drug exposures, humans only rarely are exposed to environmental chemicals in a manner that permits a quantitative determination of
adverse outcomes.18 This area of toxicological study may consist of individual or
multiple case reports, or even experimental studies in which individuals or groups
of individuals have been exposed to a chemical under circumstances that permit
analysis of dose–response relationships, mechanisms of action, or other aspects of
14. The use of toxicological evidence in regulatory decisionmaking is discussed in Casarett and
Doull’s Toxicology: The Basic Science of Poisons, supra note 3, at 13–14; Barbara D. Beck et al., The
Use of Toxicology in the Regulatory Process, in Principles and Methods of Toxicology, supra note 2, at
45–102. For a more general discussion of issues that arise in considering expert testimony, see Margaret
A. Berger, The Admissibility of Expert Testimony, Section IV, in this manual.
15. See infra Section I.D.
16. The controversy over the use of toxicological evidence in tort cases is described in Bernard
D. Goldstein, Toxic Torts: The Devil Is in the Dose, 16 J.L. & Pol’y 551 (2008); Joseph V. Rodricks,
Evaluating Disease Causation in Humans Exposed to Toxic Substances, 14 J.L. & Pol’y 39 (2006); Silbergeld,
supra note 4, at 378.
17. See, e.g., Office of Tech. Assessment, U.S. Congress, Reproductive Health Hazards in the
Workplace 8 (1985).
18. However, it is from drug studies in which multiple animal species are compared directly
with humans that many of the principles of toxicology have been developed.
toxicology. For example, individuals occupationally or environmentally exposed
to polychlorinated biphenyls (PCBs) prior to prohibitions on their use have been
studied to determine the routes of absorption, distribution, metabolism, and excretion for this chemical. Human exposure occurs most frequently in occupational
settings where workers are exposed to industrial chemicals such as lead or asbestos;
however, even under these circumstances, it is usually difficult, if not impossible,
to quantify the amount of exposure. Moreover, human populations are exposed to
many other chemicals and risk factors, making it difficult to isolate the increased
risk of a disease that is the result of exposure to any one chemical.19
Toxicologists use a wide range of experimental techniques, depending in
part on their area of specialization. Toxicological research may focus on classes
of chemical compounds, such as solvents and metals; body system effects, such as
neurotoxicology, reproductive toxicology, and immunotoxicology; and effects on
physiological processes, including inhalation toxicology, dermatotoxicology, and
molecular toxicology (the study of how chemicals interact with cell molecules).
Each of these areas of research includes both in vivo and in vitro research.20
1. In vivo research
Animal research in toxicology generally falls under two headings: safety assessment
and classic laboratory science, with a continuum between them. As explained in
Section I.E, safety assessment is a relatively formal approach in which a chemical’s
potential for toxicity is tested in vivo or in vitro using standardized techniques
often prescribed by regulatory agencies, such as the Environmental Protection
Agency (EPA) and the Food and Drug Administration (FDA).21
The roots of toxicology in the science of pharmacology are reflected in an
emphasis on understanding the absorption, distribution, metabolism, and excretion
of chemicals. Basic toxicological laboratory research also focuses on the mechanisms of action of external chemical and physical agents. Such research is based
on the standard elements of scientific studies, including appropriate experimental
design using control groups and statistical evaluation. In general, toxicological
research attempts to hold all variables constant except for that of the chemical
exposure.22 Any change in the experimental group not found in the control group
is assumed to be a perturbation caused by the chemical.
19. See, e.g., Office of Tech. Assessment, U.S. Congress, supra note 17, at 8.
20. See infra Sections I.C.1, I.C.2.
21. W.J. White et al., The Use of Laboratory Animals in Toxicology Research, in Principles and
Methods of Toxicology 1055–1102 (A. Wallace Hayes ed., 5th ed. 2008); M.A. Dorato et al., The
Toxicologic Assessment of Pharmaceutical and Biotechnology Products, in Principles and Methods of Toxicology 325–68 (A. Wallace Hayes ed., 5th ed. 2008).
22. See generally Alan Poole & George B. Leslie, A Practical Approach to Toxicological Investigations (1989); Principles and Methods of Toxicology (A. Wallace Hayes ed., 2d ed. 1989); see also
discussion on acute, short-term, and long-term toxicity studies and acquisition of data in Frank C. Lu,
Basic Toxicology: Fundamentals, Target Organs, and Risk Assessment 77–92 (2d ed. 1991).
a. Dose–response relationships
An important component of toxicological research is the characterization of dose–response relationships; most toxicological studies therefore test a range of doses of the chemical.
Animal experiments are conducted to determine the dose–response relationships
of a compound by measuring how response varies with dose, including diligently
searching for a dose that has no measurable physiological effect. This information
is useful in understanding the mechanisms of toxicity and extrapolating data from
animals to humans.23
b. Acute Toxicity Testing—Lethal Dose 50
To determine the dose–response relationship for a compound, a short-term lethal
dose 50% (LD50) may be derived experimentally. The LD50 is the dose at which
a compound kills 50% of laboratory animals within a period of days to weeks.
This easily measured end point for acute toxicity has to a large extent
been replaced, in part because recent advances in toxicology have provided other
pertinent end points, and in part because of pressure from animal rights activists
to reduce or replace the use of animals in laboratory research.24
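As a rough illustration of how an LD50 might be estimated from acute-toxicity data, the sketch below interpolates on a log-dose scale between the two tested doses that bracket 50% mortality. The data and the interpolation are hypothetical simplifications; actual LD50 estimation uses probit or similar regression methods.

```python
import math

# Hypothetical acute-toxicity data: dose (mg/kg) -> fraction of animals killed.
data = [(10, 0.0), (50, 0.2), (100, 0.4), (200, 0.7), (400, 1.0)]

def ld50(points):
    """Interpolate the dose killing 50% of animals, on a log-dose scale."""
    for (d_lo, f_lo), (d_hi, f_hi) in zip(points, points[1:]):
        if f_lo <= 0.5 <= f_hi:
            t = (0.5 - f_lo) / (f_hi - f_lo)
            return math.exp(math.log(d_lo) + t * (math.log(d_hi) - math.log(d_lo)))
    raise ValueError("50% mortality not bracketed by the tested doses")

print(f"estimated LD50: {ld50(data):.0f} mg/kg")
```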
c. No observable effect level
A dose–response study also permits the determination of another important characteristic of the biological action of a chemical—the no observable effect level
(NOEL).25 The NOEL sometimes is called a threshold, because it is the level above
which observable effects in test animals are believed to occur and below which no
toxicity is observed.26 Of course, because the NOEL is dependent on the ability to
23. See infra Sections I.D, II.A.
24. Committee on Toxicity Testing and Assessment of Environmental Agents, National Research
Council, Toxicity Testing in the 21st Century: A Vision and a Strategy (2007).
25. For example, undiluted acid on the skin can cause a horrible burn. As the acid is diluted to
lower and lower concentrations, less and less of an effect occurs until there is a concentration sufficiently low (e.g., one drop in a bathtub of water, or a sample with less than the acidity of vinegar) that
no effect occurs. This no observable effect concentration differs from person to person. For example,
a baby’s skin is more sensitive than that of an adult, and skin that is irritated or broken responds to the
effects of an acid at a lower concentration. However, the key point is that there is some concentration
that is completely harmless to the skin.
26. The significance of the NOEL was relied on by the court in Graham v. Canadian National
Railway Co., 749 F. Supp. 1300 (D. Vt. 1990), in granting judgment for the defendants. The court
found the defendants’ expert, a medical toxicologist, persuasive. The expert testified that the plaintiffs’
injuries could not have been caused by herbicides, because their exposure was well below the reference
dose, which he calculated by taking the NOEL and decreasing it by a safety factor to ensure no human
effect. Id. at 1311–12 & n.11. But see Louderback v. Orkin Exterminating Co., 26 F. Supp. 2d 1298
(D. Kan. 1998) (failure to consider threshold levels of exposure does not necessarily render expert’s
opinion unreliable where temporal relationship, scientific literature establishing an association between
exposure and various symptoms, plaintiffs’ medical records and history of disease, and exposure to or
observe an effect, the level is sometimes lowered once more sophisticated methods
of detection are developed.
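The determination of a NOEL from a dose–response study can be sketched as follows. The data are hypothetical, and the yes/no "effect observed" simplification stands in for the statistical comparisons against controls used in real studies.

```python
# Hypothetical dose-response data: dose -> whether a toxic effect was observed.
observations = {0.1: False, 1.0: False, 10.0: False, 100.0: True, 1000.0: True}

def noel(obs):
    """Highest tested dose showing no effect below the lowest effective dose."""
    lowest_effect_dose = min(d for d, effect in obs.items() if effect)
    return max(d for d, effect in obs.items()
               if d < lowest_effect_dose and not effect)

print(f"NOEL: {noel(observations)}")
# If a more sensitive assay later detected an effect at 10.0, the NOEL
# would drop to 1.0 -- the dependence on detection noted in the text.
```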
d. Benchmark dose
For regulatory toxicology, the NOEL is being replaced by a more statistically
robust approach known as the benchmark dose (BD). The BD is determined
based on dose–response modeling and is defined as the exposure associated with
a specified low incidence of risk, generally in the range of 1% to 10%, of a health
effect, or the dose associated with a specified measure or change of a biological
effect. To model the BD, sufficient data must exist, such as at least a statistically
or biologically significant dose-related trend in the selected end point.27
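The benchmark-dose calculation can be illustrated with a simple assumed model. The exponential extra-risk form, the slope, and the 10% benchmark response below are illustrative choices, not values from the text; regulatory practice fits such models to the full dose–response dataset.

```python
import math

# Sketch of a benchmark dose (BD) calculation under an assumed fitted model:
# extra risk = 1 - exp(-b * dose). Parameter values are hypothetical.
b = 0.004                      # assumed fitted slope (per mg/kg-day)
benchmark_response = 0.10      # 10% extra risk

def extra_risk(dose):
    return 1.0 - math.exp(-b * dose)

# Invert the model: the dose at which extra risk equals the benchmark response.
bd10 = -math.log(1.0 - benchmark_response) / b
print(f"BD10 ~ {bd10:.1f} mg/kg-day")
assert abs(extra_risk(bd10) - benchmark_response) < 1e-12
```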
e. No-threshold model and determination of cancer risk
Certain genetic mutations, such as those leading to cancer and some inherited
disorders, are believed to occur without any threshold. In theory, the cancer-causing mutation to the genetic material of the cell can be produced by any one
molecule of certain chemicals. The no-threshold model led to the development
of the one-hit theory of cancer risk, in which each molecule of a cancer-causing
chemical has some finite possibility of producing the mutation that leads to cancer.
(See Figure 1 for an idealized comparison of a no-threshold and threshold dose–
response.) This risk is very small, because it is unlikely that any one molecule of
a potentially cancer-causing agent will reach that one particular spot in a specific
cell and result in the change that then eludes the body’s defenses and leads to a
clinical case of cancer. However, the risk is not zero. The same model also can be
used to predict the risk of inheritable mutational events.28
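The contrast between the no-threshold (one-hit) model and a threshold model, idealized in Figure 1, can be sketched numerically. The potency and threshold values below are hypothetical, chosen only to show that the one-hit risk is small but nonzero at any dose while the threshold risk is exactly zero below the threshold.

```python
import math

k = 1e-6           # hypothetical per-unit-dose potency
threshold = 50.0   # hypothetical threshold dose

def one_hit_risk(dose):
    """One-hit model: any dose carries some risk; approximately linear at low dose."""
    return 1.0 - math.exp(-k * dose)

def threshold_risk(dose):
    """Threshold model: zero risk below the threshold dose."""
    return 0.0 if dose < threshold else 1.0 - math.exp(-k * (dose - threshold))

low_dose = 1.0
print(one_hit_risk(low_dose))    # small but nonzero
print(threshold_risk(low_dose))  # exactly zero below the threshold
```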
the presence of other disease-causing factors were all considered). See also DiPirro v. Bondo Corp., 62
Cal. Rptr. 3d 722, 750 (Cal. Ct. App. 2007) (judgment for the maker of auto touchup paint based on
finding that there was substantial evidence in the record to show that the level of a particular toxin
[toluene] present in the paint fell 1000 times below the NOEL of that toxin and therefore no warning
label needed on paint can).
27. See S. Sand et al., The Current State of Knowledge on the Use of the Benchmark Dose Concept in
Risk Assessment, 28 J. Appl. Toxicol. 405–21 (2008); W. Slob et al., A Statistical Evaluation of Toxicity
Study Designs for the Estimation of the Benchmark Dose in Continuous Endpoints, 84 Toxicol. Sci. 167–85
(2005). Courts also recognize the benchmark dose. See, e.g., Am. Forest & Paper Ass’n Inc. v. EPA,
294 F.3d 113, 121 (D.C. Cir. 2002) (EPA’s use of benchmark dose takes into account comprehensive
dose–response information unlike NOEL and thus its use was not arbitrary in determining that methanol should remain on the list of hazardous air pollutants); California v. Tri-Union Seafoods, LLC, 2006
WL 1544384 (Cal. Super. Ct. May 11, 2006) (benchmark dose should not be equated with LOEL
(lowest observable effect level) and thus toxicologist’s testimony regarding methylmercury in tuna was
unreliable for purposes of California’s Proposition 65).
28. For further discussion of the no-threshold model of carcinogenesis, see James E. Klaunig &
Lisa M. Kamendulis, Chemical Carcinogens, in Casarett and Doull’s Toxicology: The Basic Science of
Poisons, supra note 3, at 329. But see V.P. Bond et al., Current Misinterpretations of the Linear No-Threshold
Figure 1. Idealized comparison of a no-threshold and threshold dose–response
relationship.
Hypothesis, 70 Health Physics 877 (1996); Marvin Goldman, Cancer Risk of Low-Level Exposure, 271
Science 1821 (1996).
Although the one-hit model explains the response to most carcinogens, there is accumulating
evidence that for certain cancers there is in fact a multistage process and that some cancer-causing
agents, so-called epigenetic or nongenotoxic agents, act through nonmutational processes. Committee on Risk Assessment Methodology, National Research Council, Issues in Risk Assessment 34–35,
187, 198–201 (1993). For example, the multistage cancer process may explain the carcinogenicity of
benzo[a]pyrene (produced by the combustion of hydrocarbons such as oil) and chlordane (a termite
pesticide), whereas asbestos, dioxin, and estradiol produce their carcinogenic effects through nonmutational responses. The appropriate mathematical model to use to depict the dose–response relationship
for such carcinogens is still a matter of debate. Id. at 197–201. Proposals have been made to merge
cancer and noncancer risk assessment models. Committee on Improving Risk Analysis Approaches
Used by the U.S. EPA, National Research Council, Toward a Unified Approach to Dose–Response
Assessment 127–87 (2009).
Courts continue to grapple with the no-threshold model. See, e.g., In re W.R. Grace & Co. 355
B.R. 462, 476 (Bankr. D. Del. 2006) (the “no threshold model . . . flies in the face of the toxicological law of dose-response . . . doesn’t satisfy Daubert, and doesn’t stand up to scientific scrutiny”);
Cano v. Everest Minerals Corp., 362 F. Supp. 2d 814, 853–54 (W.D. Tex. 2005) (even accepting
the linear, no-threshold model for uranium mining and cancer, it is not enough to show exposure,
you must show causation as well). Where administrative rulemaking is the issue, the no-threshold
model has been accepted by some courts. See, e.g., Coalition for Reasonable Regulation of Naturally
f. Maximum tolerated dose and chronic toxicity tests
Another type of study uses different doses of a chemical agent to establish over a
90-day period what is known as the maximum tolerated dose (MTD) (the highest
dose that does not cause significant overt toxicity). The MTD is important because
it enables researchers to calculate the dose of a chemical to which an animal can be
exposed without reducing its lifespan, thus permitting the evaluation of the chronic
effects of exposure.29 These studies are designed to last the lifetime of the species.
Chronic toxicity tests evaluate carcinogenicity or other types of toxic effects.
Federal regulatory agencies frequently require carcinogenicity studies on both
sexes of two species, usually rats and mice. A pathological evaluation is done on
the tissues of animals that died during the study and those that are sacrificed at
the conclusion of the study.
The rationale for using the MTD in chronic toxicity tests, such as carcinogenicity bioassays, often is misunderstood. Ideally, realistic doses
of carcinogens would be used in all animal studies. However, doing so leads to a loss of statistical
power, thereby limiting the ability of the test to detect carcinogens or other toxic
compounds. Consider the situation in which a realistic dose of a chemical causes
a tumor in 1 in 100 laboratory animals. If the lifetime background incidence of
tumors in animals without exposure to the chemical is 6 in 100, a toxicological test involving 100 control animals and 100 exposed animals that were fed
the realistic dose would be expected to reveal 6 control animals and 7 exposed
animals with the cancer. This difference is too small to be recognized as statistically significant. However, if the study started with 10 times the realistic dose,
the researcher would expect to get 10 additional cases for a total of 16 cases in
the exposed group and 6 cases in the control group, a significant difference that
is unlikely to be overlooked.
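The statistical point in this example can be checked with a standard two-proportion z-test. The text does not specify a test; the normal-approximation test below is one conventional choice, applied to the counts given in the example.

```python
import math

def z_two_proportions(x1, n1, x2, n2):
    """Two-proportion z-test statistic (pooled, normal approximation)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

# Realistic dose: 6/100 controls vs. 7/100 exposed -- far below the
# conventional 1.96 threshold for significance at the 0.05 level.
print(round(z_two_proportions(6, 100, 7, 100), 2))

# Ten times the dose: 6/100 controls vs. 16/100 exposed -- exceeds 1.96,
# so the difference is unlikely to be overlooked.
print(round(z_two_proportions(6, 100, 16, 100), 2))
```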
Unfortunately, even this example understates the difficulties of determining risk. Regulators are responding to public concern about cancer by
regulating risks often as low as 1 in 1,000,000—not 1 in 100, as in the example
given above. To test risks of 1 in 1,000,000, a researcher would have to either
increase the lifetime dose from 10 times to 100,000 times the realistic dose or
Occurring Substances v. Cal. Air Res. Bd., 19 Cal. Rptr. 3d 635, 641 (Cal. Ct. App. 2004) (use of
the no-threshold model to establish no safe level of asbestos exposure by regulatory agency upheld).
29. Even the determination of the MTD can be fraught with controversy. See, e.g., Simpson v.
Young, 854 F.2d 1429, 1431 (D.C. Cir. 1988) (petitioners unsuccessfully argued that FDA improperly
certified color additive Blue No. 2 dye as safe because researchers failed to administer the MTD to
research animals, as required by FDA protocols); Valentine v. PPG Indus., Inc., 821 N.E.2d 580, 607–08
(Ohio Ct. App. 2004) (summary judgment for defendant upheld based in part on expert’s observation
that “there is no reliable or reproducible epidemiological evidence that shows that chemicals capable
of causing brain tumors in animals at maximum tolerated doses over a lifetime can cause brain tumors
in humans. The biological plausibility of those chemicals causing brain tumors in humans is lacking.”).
See L.R. Rhomberg et al., Issues in the Design and Interpretation of Chronic Toxicity and Carcinogenicity
Studies in Rodents: Approaches to Dose Selection, 37 Crit. Rev. Toxicol. 729–837 (2007).
expand the numbers of animals under study into the millions. However, increases
of this magnitude are beyond the world’s animal testing capabilities and are also
prohibitively expensive. Inevitably, then, animal studies must trade statistical
power for extrapolation from higher doses to lower doses.
Accordingly, proffered toxicological expert opinion on potentially
cancer-causing chemicals almost always is based on a review of research studies
that extrapolate from animal experiments involving doses significantly higher than
those to which humans are exposed.30 Such extrapolation is accepted in the regulatory arena. However, in toxic tort cases, experts often use additional background
information31 to offer opinions about disease causation and risk.32
2. In vitro research
In vitro research concerns the effects of a chemical on human or animal cells, bacteria, yeast, isolated tissues, or embryos. Thousands of in vitro toxicological tests
have been described in the scientific literature. Many tests are for mutagenesis in
bacterial or mammalian systems. There are short-term in vitro tests for just about
every physiological response and every organ system, such as perfusion tests and
DNA studies. Relatively few of these tests have been validated by replication in
many different laboratories or by comparison with outcomes in animal studies to
determine if they are predictive of whole animal or human toxicity.33 However,
these tests, and their validation, are becoming increasingly important.
30. See, e.g., International Agency for Research on Cancer, World Health Organization, Preamble, in 63 IARC Monographs on the Evaluation of Carcinogenic Risks to Humans 9, 17 (1995);
James Huff, Chemicals and Cancer in Humans: First Evidence in Experimental Animals, 100 Envtl. Health
Persp. 201, 204 (1993); Joseph V. Rodricks, Evaluating Disease Causation in Humans Exposed to Toxic
Substances, 14 J.L. & Pol’y 39 (2006).
31. Central to offering an expert opinion on specific causation is a comparison of the estimated
risk with the likelihood of the adverse event if the individual had not suffered the alleged exposure.
This will differ depending on factors specific to that individual, including age, gender, medical history,
and competing exposures.
Researchers have developed numerous biomathematical formulas to provide statistical bases for
extrapolation from animal data to human exposure. See generally S.C. Gad, Statistics and Experimental
Design for Toxicologists (4th ed. 2005). See also infra Sections III, IV.
32. Policy arguments concerning extrapolation from high doses to low doses are explored in
Troyen A. Brennan & Robert F. Carter, Legal and Scientific Probability of Causation of Cancer and Other
Environmental Disease in Individuals, 10 J. Health Pol., Pol’y & L. 33 (1985). For a general discussion
of dose issues in toxic torts, see also Bernard D. Goldstein, Toxic Torts: The Devil Is in the Dose, 16
J.L. & Pol’y 551–85 (2008).
33. See R. Julian Preston & George R. Hoffman, Genetic Toxicology, in Casarett and Doull’s
Toxicology: The Basic Science of Poisons, supra note 3, at 381, 391–404. Use of in vitro data for
evaluating human mutagenicity and teratogenicity is described in John M. Rogers & Robert J.
Kavlock, Developmental Toxicology, in Casarett and Doull’s Toxicology: The Basic Science of Poisons,
supra note 3, at 415, 436–40. For a critique of expert testimony using in vitro data, see Wade-Greaux
v. Whitehall Laboratories, Inc., 874 F. Supp. 1441, 1480 (D.V.I. 1994), aff’d, 46 F.3d 1120 (3d Cir.
1994); In re Welding Fume Prods. Liab. Litig., 2006 WL 4507859, at *13 (N.D. Ohio Aug. 8, 2005)
The criteria of reliability for an in vitro test include the following: (1) whether
the test has been validated through a published protocol in which many laboratories
applied the same in vitro method to a series of unknown compounds prepared
by a reputable organization (such as the National Institutes of Health (NIH) or
the International Agency for Research on Cancer (IARC)) to determine if the
test consistently and accurately measures toxicity, (2) whether the test has been
adopted by a U.S. or international regulatory body, and (3) whether the test is
predictive of in vivo outcomes related to the same cell or target organ system.
D. Extrapolation from Animal and Cell Research to Humans
Two types of extrapolation must be considered: from animal data to humans and
from higher doses to lower doses.34 In qualitative extrapolation, one can usually
rely on the fact that a compound causing an effect in one mammalian species will
cause it in another species. This is a basic principle of toxicology and pharmacology.
If a heavy metal, such as mercury, causes kidney toxicity in laboratory animals,
it is highly likely to do so at some dose in humans. However, the dose at which
mercury causes this effect in laboratory animals is modified by many internal factors, and the exact dose–response curve may be different from that for humans.
Through the study of factors that modify the toxic effects of chemicals, including
absorption, distribution, metabolism, and excretion, researchers can improve the
ability to extrapolate from laboratory animals to humans and from higher to lower
doses.35 The mathematical depiction of the process by which an external dose
moves through various compartments in the body until it reaches the target organ
is often called physiologically based pharmacokinetics or toxicokinetics.36
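The compartmental calculations behind such models can be sketched in miniature with a one-compartment, first-order-elimination model; real physiologically based pharmacokinetic models link many compartments, and all parameter values below are hypothetical.

```python
import math

# One-compartment toxicokinetic sketch: concentration after an instantaneous
# dose declines exponentially with first-order elimination. Values assumed.
dose_mg = 100.0
volume_L = 42.0        # apparent volume of distribution
half_life_h = 4.0
k_elim = math.log(2) / half_life_h

def concentration(t_hours):
    """Concentration (mg/L) t hours after the dose."""
    return (dose_mg / volume_L) * math.exp(-k_elim * t_hours)

print(round(concentration(0), 2))   # peak concentration
print(round(concentration(4), 2))   # one half-life later: half the peak
```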
Extrapolation from studies in nonmammalian species to humans is much more
difficult but can be done if there is sufficient information on similarities in
absorption, distribution, metabolism, and excretion. Advances in computational
toxicology have increased the ability of toxicologists to make such extrapolations.37
Quantitative determinations of human toxicity based on in vitro studies usually are
not considered appropriate. As discussed in Section I.F, in vitro or animal data
elucidating the mechanisms of toxicity are more persuasive when positive human
epidemiological data or toxicological information also exist.38
(Toxicologist qualified to testify on the relationship between welding fumes and Parkinson’s disease, including epidemiology and animal and in vitro toxicology studies.)
34. See J.V. Rodricks et al., Quantitative Extrapolations in Toxicology, in Principles and Methods
of Toxicology 365 (A. Wallace Hayes ed., 5th ed. 2008).
35. For example, benzene undergoes a complex metabolic sequence that results in toxicity
to the bone marrow in all species, including humans. Robert Snyder, Xenobiotic Metabolism and the
Mechanism(s) of Benzene Toxicity, 36 Drug Metab. Rev. 531, 547 (2004). The exact metabolites
responsible for this bone marrow toxicity are the subject of much interest but remain unknown.
Mice are more susceptible to benzene than are rats. If researchers could determine the differences
between mice and rats in their metabolism of benzene, they would have a useful clue about which
portion of the metabolic scheme is responsible for benzene toxicity to the bone marrow. See, e.g.,
Lois D. Lehman-McKeeman, Absorption, Distribution, and Excretion of Toxicants, in Casarett and
Doull’s Toxicology: The Basic Science of Poisons, supra note 3, at 131; Andrew Parkinson & Brian W.
Ogilvie, Biotransformation of Xenobiotics, in Casarett and Doull’s Toxicology: The Basic Science of
Poisons, supra note 3, at 161.
36. For an analysis of methods used to extrapolate from animal toxicity data to human health
effects, see references cited in notes 21 and 22, supra.
E. Safety and Risk Assessment
Toxicological expert opinion also relies on formal safety and risk assessments.
Safety assessment is the area of toxicology relating to the testing of chemicals and
drugs for toxicity. It is a relatively formal approach in which the potential for
toxicity of a chemical is tested in vivo or in vitro using standardized techniques.
The protocols for such studies usually are developed through scientific consensus
and are subject to oversight by governmental regulators or other watchdog groups.
After a number of bad experiences, including outright fraud, government
agencies have imposed codes on laboratories involved in safety assessment, including industrial, contract, and in-house laboratories.39 Known as good laboratory
practices (GLPs), these codes govern many aspects of laboratory standards, including such details as the number of animals per cage, dose and chemical verification,
and the handling of tissue specimens. GLPs are remarkably similar across agencies,
but the tests called for differ depending on the mission. For example, there are
major differences between FDA’s and EPA’s required procedures for testing drugs
37. See R.J. Kavlock et al., Computational Toxicology: A State of the Science Mini Review, 103
Toxicological Sci. 14–27 (2008). See also D. Malacarne et al., Relationship Between Molecular Connectivity and Carcinogenic Activity: A Confirmation with a New Software Program Based on Graph Theory,
101 Envtl. Health Persp. 331–42 (1993), for validation of the use of a computational structure-based
approach to carcinogenicity originally proposed by H.S. Rosenkranz & G. Klopman, Structural Basis
of Carcinogenicity in Rodents of Genotoxicants and Non-genotoxicants, 228 Mutat. Res. 105–24 (1990).
Structure–activity relationships have also been used to extend the threshold concept in toxicology to
look at low-dose exposures to agents present in foods or cosmetics. See R. Kroes et al., Structure-Based
Thresholds of Toxicological Concern (TTC): Guidance for Application to Substances Present at Low Levels in
the Diet, 42 Food Chem. Toxicol. 65–83 (2004).
38. An example of toxicological information in humans that is pertinent to extrapolation is the
finding in human urine of a carcinogenic metabolite found in studies of the same compound in laboratory animals. See, e.g., Goewey v. United States, 886 F. Supp. 1268, 1280–81 (D.S.C. 1995) (extrapolation of neurotoxic effects from chickens to humans unwarranted without human confirmation).
39. A dramatic case of fraud involving a toxicology laboratory that performed tests to assess the
safety of consumer products is described in United States v. Keplinger, 776 F.2d 678 (7th Cir. 1985).
Keplinger and the other defendants in this case were toxicologists who were convicted of falsifying
data on product safety by underreporting animal morbidity and mortality and omitting negative data
and conclusions from their reports. For further discussion of reviewing animal studies in light of the
FDA’s Good Laboratory Practice guidelines, see Eli Lilly & Co. v. Zenith Goldline Pharm., Inc., 364
F. Supp. 2d 820, 860 (S.D. Ind. 2005).
and environmental chemicals.40 FDA requires and specifies both efficacy and safety
testing of drugs in humans and animals. Carefully controlled clinical trials using
doses within the expected therapeutic range are required for premarket testing of
drugs because exposures to prescription drugs are carefully controlled and should
not exceed specified ranges or uses. However, for environmental chemicals and
agents, no premarket testing in humans is required by EPA. The European Union’s
new Regulation on Registration, Evaluation, Authorisation and Restriction of
Chemicals (REACH) requires extensive testing of new chemicals and of chemicals
already in commerce.41 Moreover, because exposures are less predictable, doses usually are
given in a wider range in animal tests for nonpharmaceutical agents.42
Because exposures to environmental chemicals may continue over a lifetime
and affect both young and old, test designs called lifetime bioassays have been
developed in which relatively high doses are given to experimental animals. The
interpretation of results requires extrapolation from animals to humans, from high
to low doses, and from short exposures to multiyear estimates. It must be emphasized that less than 1% of the 60,000 to 75,000 chemicals in commerce have been
subjected to a full safety assessment, and there are significant toxicological data on
40. See, e.g., 40 C.F.R. Parts 160, 792 (1993); Lu, supra note 22, at 89. There is a major difference between the information needed to establish a regulatory standard or tolerance, and that needed
to establish causation for clinical or tort purposes.
41. For comparison of Toxic Substances Control Act (TSCA), 15 U.S.C. §§ 2601 et seq. (1978)
and REACH, see E. Donald Elliott, Trying to Fix TSCA § 6: Lessons from REACH, Proposition 65,
and the Clean Air Act, available at http://www.ucis.pitt.edu/euce/events/policyconf/07/PDFs/Elliott.
pdf. For issues related to the intentional testing of environmental chemicals in humans, see Committee on the Use of Third Party Toxicity Research with Human Research Participation, National
Research Council, Intentional Human Dosing Studies for EPA Regulatory Purposes: Scientific and
Ethical Issues (2004).
42. It must be appreciated that the development of a new drug inherently requires searching
for an agent that at useful doses has a biological effect (e.g., decreasing blood pressure), whereas those
developing a new chemical for consumer use (e.g., a house paint) hope that at usual doses no biological
effects will occur. There are other compounds, such as pesticides and antibacterial agents, for which
a biological effect is desired, but it is intended that at usual doses humans will not be affected. These
different expectations are part of the rationale for the differences in testing information available for
assessing toxicological effects. Under FDA rules, approval of a new drug usually will require extensive
animal and human testing, including a randomized double-blind clinical trial for efficacy and toxicity. In contrast, under TSCA, the only requirement before a new chemical can be marketed is that a
premanufacturing notice be filed with EPA, including any toxicity data in the company’s possession.
EPA reviews this information, along with structure–activity relationship modeling, in order to determine whether any restrictions on release should be imposed. For existing chemicals, EPA may require
companies to undertake animal and in vitro tests if the chemical may present an unreasonable risk to
health. The lack of toxicity data for most chemicals in commerce has led EPA to propose methods of
evaluation using in vitro toxicity pathway testing, followed by whole-animal testing where warranted.
See Committee on Toxicity Testing and Assessment of Environmental Agents, National Research
Council, Toxicity Testing in the 21st Century: A Vision and a Strategy (2007); U.S. Environmental
Protection Agency, Strategic Plan for Evaluating the Toxicity of Chemicals (March 2009), available at
http://www.epa.gov/spc/toxicitytesting.
only 10% to 20% of them. Under the current U.S. and international approaches
to testing chemicals with high production volume, and with the advent of the
REACH legislation, the extent of toxicological information is expanding rapidly.43
Risk assessment is an approach increasingly used by regulatory agencies to
estimate and compare the risks of hazardous chemicals and to assign priority for
avoiding their adverse effects.44 The National Academy of Sciences defines four
components of risk assessment: hazard identification, dose–response estimation,
exposure assessment, and risk characterization.45
Risk assessment is not an exact science. It should be viewed as a useful framework to organize and synthesize information and to provide estimates on which
policymaking can be based. In recent years, codification of the methodology used
to assess risk has increased confidence that the process can be reasonably free of
bias; however, significant controversy remains, particularly when actual data are
limited and generally conservative default assumptions are used.46
Although risk assessment information about a chemical can be somewhat useful in a toxic tort case, at least in terms of setting reasonable boundaries regarding
the likelihood of causation, the impetus for the development of risk assessment
has been the regulatory process, which has different goals.47 Because of their
43. See John S. Applegate, The Perils of Unreasonable Risk: Information, Regulatory Policy, and Toxic
Substances Control, 91 Colum. L. Rev. 261, 264–66 (1991) for a discussion of REACH and its potential
impact on the availability of toxicological and risk information. See Sven Ove Hansson & Christina
Rudén, Priority Setting in the REACH System, 90 Toxicological Sci. 304–08 (2005), for a discussion of
the toxicological needs for REACH and its reliance on exposure.
44. The use of risk assessment by regulatory agencies was spurred by the Supreme Court’s
decision in Industrial Union Dep’t, AFL-CIO v. American Petroleum Institute, 448 U.S. 607 (1980). A
plurality of the Court overturned the Occupational Safety and Health Administration’s (OSHA) attempt
to regulate benzene based on the intrinsic hazard of benzene being a human carcinogen. Instead, by
requiring a risk assessment, the inclusion of exposure assessment and dose–response evaluation became
a customary part of regulatory assessment. See John S. Applegate, supra note 43.
45. See generally National Research Council, Risk Assessment in the Federal Government:
Managing the Process (1983); Bernard D. Goldstein, Risk Assessment and the Interface Between Science
and Law, 14 Colum. J. Envtl. L. 343 (1989). Recently, a National Academy of Sciences panel has
discussed potential approaches to updating the risk paradigm. See Committee on Improving Risk
Analysis Approaches Used by the U.S. EPA, supra note 28.
46. An example of conservative default assumptions can be found in Superfund risk assessment. EPA has determined that Superfund sites should be cleaned up to bring the excess cancer risk within a range of 1 in
10,000 to 1 in 1,000,000. A number of assumptions go into this calculation, including conservative assumptions about intake, exposure frequency and duration, and cancer-potency factors for the
chemicals at the site. See, e.g., Robert H. Harris & David E. Burmaster, Restoring Science to Superfund
Risk Assessment, 6 Toxics L. Rep. 1318 (1992).
47. See Committee on Improving Risk Analysis Approaches Used by the U.S. EPA, supra note
28. See also Rhodes v. E.I. du Pont de Nemours & Co., 253 F.R.D. 365, 377–78 (S.D. W. Va. 2008)
(putative class-action plaintiffs alleging that contamination of their drinking water with industrial
perfluorooctanoic acid entitled them to medical monitoring could not rely upon regulatory risk assessment that does not provide the requisite reasonable certainty required to show a medical monitoring
injury). Risk assessment also has come under heavy criticism from those who prefer the precautionary
use of appropriately prudent assumptions in areas of uncertainty and their use of
default assumptions when there are limited data, risk assessments often intentionally encompass the upper range of possible risks.48 An additional issue, particularly
related to cancer risk, is that standards based on risk assessment often are set to
avoid the risk caused by lifetime exposure at this level. Exposure to levels exceeding this standard for a small fraction of a lifetime does not mean that the overall
lifetime risk of regulatory concern has been exceeded.49
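The point can be made concrete with a time-weighted average over the 70-year, 24-hours-a-day convention described in note 49; the concentration values below are invented for illustration:

```python
LIFETIME_HOURS = 70 * 365 * 24  # 613,200 hours: the "more than 600,000 hours" of note 49

def lifetime_average(baseline_conc, peak_conc, peak_hours):
    """Time-weighted average concentration over a lifetime spent at
    baseline_conc except for peak_hours spent at peak_conc."""
    total = baseline_conc * (LIFETIME_HOURS - peak_hours) + peak_conc * peak_hours
    return total / LIFETIME_HOURS

# Doubling exposure for two days barely moves the lifetime average:
bump = lifetime_average(1.0, 2.0, 48) - 1.0  # roughly 0.008% of baseline
```

For a threshold-based effect, by contrast, those 48 hours are precisely what matters, which is why short-term standards are evaluated differently.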
1. The use of toxicological information in risk assessment
Risk assessment as practiced by government agencies involved in regulating
exposure to environmental chemicals is highly dependent upon the science of
toxicology and on the information derived from toxicological studies. EPA, FDA,
OSHA, the Consumer Product Safety Commission, and other international (e.g.,
the World Trade Organization), national, and state agencies use risk assessment
as a means to protect workers or the public from adverse effects.50 Acceptable
risk levels, for example, 1 in 1,000 to 1 in 1,000,000, are usually well below what
principle as an alternative. For advocacy of the precautionary principle, see Joel A. Tickner, Precautionary Principle Encourages Policies That Protect Human Health and the Environment in the Face of Uncertain
Risks, 117 Pub. Health Rep. 493–97 (2002). Although variously defined, the precautionary principle
in many ways is a hazard-based approach.
48. It is also claimed that standard risk assessment will underestimate true risks, particularly for
sensitive populations exposed to multiple stressors, an issue of particular pertinence to discussions of
environmental justice. Committee on Environmental Justice, Institute of Medicine, Toward Environmental Justice: Research, Education, and Health Policy Needs (1999). The EPA has been developing
formal guidance for cumulative risk assessment, which has been defined as “the combined threats
from exposure via all relevant routes to multiple stressors including biological, chemical, physical, and
psychosocial entities.” Michael A. Callahan & Ken Sexton, If Cumulative Risk Assessment Is the Answer,
What Is the Question? Envtl. Health Persp. 799–806 (2007). See also International Life Sciences Institute,
A Framework for Cumulative Risk Assessment Workshop Report (1999). A related issue is aggregate
risk assessment, which focuses on exposure to a single agent through multiple routes. For example,
swimming in water containing a volatile organic contaminant is likely to lead to exposure through the
skin, through inhalation of the contaminant off-gassing just above the water surface, and through
swallowing water. For a discussion of aggregate risk assessment, see International Life Science Institute,
Aggregate Exposure Assessment Workshop Report (1998). For a study of a child’s indoor exposure
through different routes to a pesticide, see V.G. Zartarian et al., A Modeling Framework for Estimating
Children’s Residential Exposure and Dose to Chlorpyrifos Via Dermal Residue Contact and Nondietary Ingestion, 108 Envtl. Health Persp. 505–14 (2000).
49. A public health standard to protect against the lifetime risk of inhaling a known carcinogen
will usually be based on lifetime exposure calculations of 24 hours a day, every day, for 70 years. This is
more than 25,000 days and 600,000 hours. Exceeding this standard for a few hours would presumably
have little impact on cancer risk. In contrast, for a short-term standard set to avoid a threshold-based
risk, exceeding the standard for this short time may make a major difference, for example, an asthma
attack caused by being outdoors on a day that the ozone standard is exceeded.
50. Pharmaceuticals intended for human use are an exception in that a tradeoff between desired
and adverse effects may be acceptable, and human data are available prior to, and as a result of, the
marketing of the agent.
can be measured through epidemiological study. Inevitably, this means that risk
assessment is based solely on toxicological data—or, if epidemiological findings
of an adverse effect are observed, then toxicological reasoning must be used to
extrapolate to the appropriate lower dose standard aimed at protecting the public.
The four-part risk paradigm is heavily based on toxicological precepts. Hazard
identification reflects the toxicological “law” of specificity of effects, and dose–
response assessment is based upon “the dose makes the poison.” The hazard
identification process often uses “weight of evidence” approaches in which the
toxicological, mechanistic, and epidemiological data are rigorously assessed to form a
judgment regarding the likelihood that the agent produces a specific effect.51 Establishing the appropriate dose–response curve, threshold, or “one-hit” model is an exercise
in toxicological reasoning. Even for those chemicals known to be carcinogens, a
threshold model is appropriate if the toxicological mechanism of action can be demonstrated to depend upon a threshold. Exposure assessment requires knowledge of
specific toxicological dynamics; for example, the impact on the lung of an air pollutant varies by factors such as inhalation rate per unit body mass, which is affected by
exercise and by age; by the size of a particle or the solubility of a gas, both of which
will affect the depth of penetration into the more sensitive parts of the airways; by
the competence of the usual airway defense mechanisms, such as mucus flow and
macrophage function; and by the ability of the lung to metabolize the agent.52
F. Toxicological Processes and Target Organ Toxicity
The biological, chemical, and physical phenomena that are the basis of life are
astounding in their complexity. As a result, human subcellular, cellular, and organ
function is both delicately balanced and highly robust. Small changes caused by
external chemical and physical agents can have major effects; yet, through the
millennia, evolutionary pressures have led to the emergence of safety mechanisms
that defend against adverse environmental stresses.
The specialization that is a hallmark of organ development in vertebrates
inherently leads to diversity in the underlying processes that are the basis of organ
function. Certain chemicals poison virtually all cells by affecting a basic biological
process essential to life. For example, cyanide interferes with the conversion of
oxygen to energy in subcellular components known as mitochondria.53 Other
51. See Section I.F for further discussion of weight-of-evidence approaches to potential human
carcinogens.
52. Some toxic agents pass through the lung without producing any direct effects on this organ.
For example, inhaled carbon monoxide produces its toxicity in essence by being treated by the body
as if it were oxygen. Carbon monoxide readily combines with the oxygen-combining site of hemoglobin,
the molecule in red blood cells that is responsible for transporting oxygen from the lung to the tissues.
In doing so, carbon monoxide blocks the effective transport and tissue utilization of oxygen.
53. Note that the diffuse toxicity of cyanide also reflects its ability to spread widely in the body.
Certain mitochondrial poisons primarily affect the brain and active muscles, including the heart, which
chemical agents interfere selectively with an organ-specific process. For example,
organophosphate pesticides, which are chemically related to the nerve gases,
specifically interfere with the transmission of impulses between nerve cells, a
process pertinent primarily to the nervous system. Table 1 provides arbitrarily
selected examples of toxicological end points and agents of concern, which are
not meant to be inclusive or exhaustive.
Despite this specialization, there are pathological processes common to diseases affecting many different organs. For example, chronic inflammation of the
skin leads to fiber formation that is recognized as scarring. Similarly, cirrhosis of
the liver can result from fibrogenic processes caused by repetitive inflammation
of the liver, such as from the overuse of ethanol, and fibrosis of the lung is an
important pathological process resulting from asbestos, silica, and other agents.54
The potential for endocrine disruption by chemicals, particularly those that persist
within the body, has become an increasing concern. Many of these persistent agents
belong to families of chemically similar compounds, such as dioxins or PCBs,
that may differ in their effect. Particularly challenging to standard toxicological
approaches are agents that react with different receptors present on the surface or
internal components of the cell. These receptors often belong to complex families of related cellular components that are continually interacting with the broad
range of hormones produced by our bodies.55 The intricate dynamic processes of
normal endocrine activity include feedback loops that allow cyclic variation, such
as in the menstrual cycle or in the variation of hormone and receptor levels that
are linked to normal functions such as sleeping and sexual activity. These complex
normal “up and down” variations produce conceptual difficulties when attempting
to extrapolate the results from model systems to the functioning human.56
are particularly oxygen dependent. Others, unable to penetrate the blood-brain barrier, will primarily
affect peripheral muscle including the heart.
54. Lung fibrosis is a key pathological finding in a group of diseases known as pneumoconiosis
that includes coal miners’ black lung disease, silicosis, asbestosis, and other conditions usually caused
by occupational exposures.
55. As a simplification, agent–receptor interactions often are described as a key in a lock, with
the key needing to be able both to fit into the lock and to turn the mechanism. An example from the
nervous system is the treatment of a heroin overdose with another opiate that has a much higher affinity for the receptor site but produces little effect once bound. When given to a normal person, this
second opiate would have a mild depressant effect, but it can reverse a near fatal overdose of heroin by
displacing the heroin from the receptor site. Thus the directionality of opiate effect depends upon the
interaction of the components of the mixture. This interaction is even more complex when dealing
with estrogenic agents that are naturally occurring as well as made within the body at different levels
in response to different external and internal stimuli and at different time intervals.
56. The complexity of the interaction of a mixture of dioxins with receptors governing the
endocrine system can be contrasted with that of the reaction of carbon monoxide with the hemoglobin oxygen receptor discussed in note 52. The latter is unidirectional in that any additional carbon
monoxide will interfere with oxygen delivery, of which there cannot be too much under normal
physiological conditions.
Table 1. Sample of Selected Toxicological End Points and Examples of Agents of Concern in Humans (a)

Organ System | Examples of End Points | Examples of Agents of Concern
Skin | allergic contact dermatitis | nickel, poison ivy, cutting oils
 | chloracne | dioxins
 | cancer | polycyclic aromatic hydrocarbons
Respiratory tract | nonspecific irritation (reactive airway disease) | formaldehyde, acrolein, ozone
 | asthma | toluene diisocyanate
 | chronic obstructive pulmonary disease | cigarette smoke
 | fibrosis, pneumoconiosis | silica, mineral dusts, cotton dust
 | cancer | cigarette smoke, arsenic, asbestos, nickel
Blood and the immune system | anemia | arsine, lead, methyldopa
 | secondary polycythemia | cobalt
 | methemoglobinemia | nitrites, aniline dyes, dapsone
 | pancytopenia | benzene, radiation, chemotherapeutic agents
 | secondary lupus erythematosus | hydralazine
 | leukemia | benzene, radiation, chemotherapeutic agents
Liver and gastrointestinal tract | hepatic damage (hepatitis) | acetaminophen, ethanol, carbon tetrachloride, vitamin A
 | cancer | aflatoxin, vinyl chloride
Urinary tract | kidney toxicity | ethylene and diethylene glycols, lead, melamine, aminoglycoside antibiotics
 | bladder cancer | aromatic amines
Nervous system | nervous system toxicity | cholinesterase inhibitors, mercury, lead, n-hexane, bacterial toxins (botulinum, tetanus)
 | Parkinson’s disease | manganese
Reproductive and developmental toxicity | fetal malformations | thalidomide, ethanol
Endocrine system | thyroid toxicity | radioactive iodine, perchlorate
Cardiovascular system | heart toxicity | anthracyclines, cobalt
 | high blood pressure | lead
 | arrhythmias | plant glycosides (e.g., digitalis)

(a) This table presents only examples of toxicological end points and examples of agents of concern in humans and is provided to help illustrate the variety of toxic agents and end points. It is not an exhaustive or inclusive list of organs, end points, or agents. Absence from this list does not indicate a relative lack of evidence for a causal relation as to any agent of concern.
The processes that result in the causation of cancer are also of particular
interest to the public, to litigators, and to regulators. A common denominator for
the various diseases that fall under the heading of cancer is uncontrolled cellular
growth, usually reflecting the failure of the normal progression of precursor cells
to maturation and cell death. Central to the mechanism of cancer causation is the
production of a genetic change that leads a precursor cell to no longer conform to
usual processes that control cell growth. In virtually all cancers, the overgrowth of
cells can be traced to a single mutation, such that cancer cells are a clone of the one
mutated precursor cell.57 The understanding of the relationship between mutation and cancer led to some of the first toxicological tests to determine whether
an external agent could cause cancer. Such tests have grown in sophistication
because of the advances in molecular biology and computational toxicology that
have occurred concomitantly with an increased understanding of the variety of
potential pathways that lead to mutagenesis.58
Toxicological testing for chemical carcinogens ranges from relatively simple
studies to determine whether the substance is capable of producing bacterial mutations to observation of cancer incidence as a result of long-term administration of
the substance to laboratory animals. Between these two extremes are a multiplicity
of tests that build upon the understanding of the mechanism of cancer causation.
In vitro or in vivo tests may focus on the evidence of effects in DNA, such as
the presence of adducts of the chemical or its metabolites bound to the DNA
molecule or the cross-linking of the DNA molecule to protein. Researchers may
look for changes in the nucleus of the cell suggestive of DNA damage that could
57. There may, in fact, be multiple mutations as the initial clone of cells undergoes further
transformation before or after the cancer becomes clinically manifest.
58. Committee on Toxicity Testing and Assessment of Environmental Agents, National Research
Council, Toxicity Testing in the 21st Century: A Vision and a Strategy (2007).
result in mutagenesis and carcinogenesis, for example, the micronucleus test or
the comet assay. Certain mutagens cause an increase in the normal exchange of
nuclear material among DNA components during normal cell division, which
gives rise to a test known as the “sister chromatid exchange.”59 The direct observation of chromosomes to look for specific abnormalities, known as cytogenetic
analysis, is providing more information about the pathways of carcinogenesis.
For cancers such as acute myelogenous leukemia, it has long been recognized
that those individuals who present with recognizable chromosomal abnormalities
are more likely to have been exposed to a known human chemical leukemogen
such as benzene.60 But at this time there is no chromosomal abnormality that is
unequivocally linked to a specific chemical or physical carcinogen.61 These and
other tests provide information that can be used in evaluating whether a chemical
is a potential human carcinogen.
The many tests that are pertinent to estimating whether a chemical or physical agent produces human cancer require careful evaluation. The World Health
Organization’s (WHO’s) IARC and the U.S. National Toxicology Program
(NTP) have formal processes to evaluate the weight of evidence that a chemical
causes cancer.62 Each classifies chemicals on the basis of epidemiological evidence,
toxicological findings in laboratory animals, and mechanistic considerations, and
then assigns a specific category of carcinogenic potential to the individual chemical
or exposure situation (e.g., employment as a painter).63 Only a small percentage of
59. All of these tests require validation regarding their relevance to predicting human carcinogenesis, as well as to their technical reproducibility. See Raffaella Corvi et al., ECVAM Retrospective
Validation of In Vitro Micronucleus Test, 23 Mutagenesis 271–83 (2008), for an example of an approach
to validating a short-term assay for carcinogenesis.
60. F. Mitelman et al., Chromosome Pattern, Occupation, and Clinical Features in Patients with Acute
Nonlymphocytic Leukemia, 4 Cancer Genet. & Cytogenet. 197, 214 (1981).
61. See Luoping Zhang et al., The Nature of Chromosomal Aberrations Detected in Humans Exposed
to Benzene, 32 Crit. Rev. Toxicol. 1–42 (2002).
62. The U.S. National Toxicology Program issues a congressionally mandated Report on Carcinogens. The 12th report is available at http://ntp.niehs.nih.gov/ntp/roc/twelfth/roc12.pdf. IARC
produces its reports through a monograph series that provides detailed description of the agents or
processes under consideration as well as the findings of the IARC expert working group. See the
IARC Web site for a list of these monographs (http://monographs.iarc.fr/).
63. IARC uses the following classifications:
Group 1, The agent (mixture) is carcinogenic to humans;
Group 2A, The agent (mixture) is probably carcinogenic to humans;
Group 2B, The agent (mixture) is possibly carcinogenic to humans;
Group 3, The agent (mixture) is not classifiable as to its carcinogenicity to humans; and
Group 4, The agent (mixture) is probably not carcinogenic to humans.
Because the strength of the evidence forms a continuum, sorting chemicals into discrete categories inevitably leaves some chemicals very close to the dividing line between categories. For such chemicals, small differences in the interpretation of the evidence will lead to disagreement regarding categorization.
the total chemicals in commerce are considered to be known human carcinogens.
In the past, assignment to the highest category was dependent almost totally on
epidemiological evidence, although animal data and mechanistic information were
also considered. In recent years, with improved understanding of the mechanism
of action of chemical carcinogens, there has been increased use of mechanistic
data.64 For example, higher credence is given to the likelihood that a chemical is
a human carcinogen if the metabolite found to be responsible for carcinogenesis
in a laboratory animal is also found in the blood or urine of humans exposed to
this chemical, or if there is evidence of the same type of DNA damage in humans
as there is in laboratory animals in which the agent does cause cancer.65
G. Toxicology and Exposure Assessment
In recent decades, exposure assessment has developed into a scientific field with
the usual trappings of journals, learned societies, and research funding processes.
64. See Vincent James Cogliano et al., Use of Mechanistic Data in IARC Evaluations, 49 Envtl. &
Molecular Mutagenesis 100–09 (2008) for a discussion and for specific examples of the use of mechanistic data in evaluating carcinogens. The evolution in the approach to determining cancer causality is
evident from reviewing the guidelines used to assemble the weight of evidence for causality by IARC
and NTP, two of the organizations that have the lengthiest track record of responsibility for the hazard
identification of carcinogens. Both have increased the weight given to mechanistic evidence in characterizing the overall strength of the total evidence used to classify the potential for a chemical or an
exposure to be causal. IARC now permits classification in Group 1 when there is less than sufficient
evidence in humans but sufficient evidence in animals and “strong evidence in exposed humans that
the agent acts through a relevant mechanism of carcinogenicity.” Id. at 103. The criterion used by NTP
for listing a chemical as a known human carcinogen in its biennial Report on Carcinogens is “There
is sufficient evidence of carcinogenicity from studies in humans,* which indicates a causal relationship
between exposure to the agent, substance, or mixture, and human cancer.” The asterisk is particularly
notable in that it specifies that the evidence need not be solely epidemiological: “*This evidence can
include traditional cancer epidemiology studies, data from clinical studies, and/or data derived from
the study of tissues or cells from humans exposed to the substance in question that can be useful for
evaluating whether a relevant cancer mechanism is operating in people.” See National Toxicology
Program, U.S. Dep’t of Health and Human Servs., Report of Carcinogens (12th ed. 2011), at 4,
available at http://ntp.niehs.nih.gov/ntp/roc/twelfth/roc12.pdf.
EPA also considers mechanism of action in its regulatory approaches and distinguishes further
between mechanism of action and mode of action. See Katherine Z. Guyton et al., Improving Prediction
of Chemical Carcinogenicity by Considering Multiple Mechanisms and Applying Toxicogenomic Approaches,
681 Mutation Res. 230, 240 (2009); Katherine Z. Guyton et al., Mode of Action Frameworks: A Critical
Analysis, 11 J. Toxicol. & Envtl. Health Part B 16, 31 (2008).
65. A recent example is the IARC evaluation of formaldehyde that upgraded the categorization
from 2A to 1 based upon epidemiological data that were strongly supported by the finding of nasal
cancer in laboratory animals and by the presence of DNA-protein cross-links in the nasal tissue of the
laboratory animals and of humans inhaling formaldehyde. However, epidemiological evidence associating formaldehyde with human acute myelogenous leukemia was questioned on the basis of the lack
of mechanistic evidence, including questions about how such a highly reactive agent could reach the
bone marrow following inhalation. See Formaldehyde, 2-Butoxyethanol and 1-tert-Butoxypropan-2-ol, in
88 IARC Monographs on the Evaluation of Carcinogenic Risks to Humans (2006).
Exposure assessment methodologies include mathematical models predicting exposure resulting from an emission source, which might be a long distance upwind;
chemical or physical measurements of media such as air, food, and water; and
biological monitoring within humans, including measurements of blood and urine
specimens. An exposure assessment should also look for competing exposures. In
this continuum of exposure metrics, the closer to the human body, the greater the
overlap with toxicology.66
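To make the notion of an exposure metric concrete, the following sketch (in Python, with entirely hypothetical values) estimates an average daily dose from a measured air concentration using the standard intake form of concentration times intake rate times duration, divided by body weight times averaging time; none of the numbers below come from the text.

```python
def average_daily_dose(concentration_mg_m3, inhalation_m3_day,
                       exposure_days, body_weight_kg, averaging_days):
    """Rough average daily dose (mg/kg-day) from inhaled air.

    A simplified form of the standard intake equation used in
    exposure assessment; all inputs here are illustrative.
    """
    total_intake_mg = concentration_mg_m3 * inhalation_m3_day * exposure_days
    return total_intake_mg / (body_weight_kg * averaging_days)

# Hypothetical worker: 0.5 mg/m3 in workplace air, 10 m3 inhaled per
# workday, 250 workdays in a year, 70-kg adult, averaged over 365 days.
dose = average_daily_dose(0.5, 10, 250, 70, 365)
print(round(dose, 4))  # mg per kg body weight per day
```

The same arithmetic explains the point in footnote 66 that children receive a higher dose per body mass: holding intake constant, a smaller denominator (body weight) yields a larger dose.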
Exposure assessment is central to epidemiology as well. Many of the causal
associations between chemicals and human disease have been developed from
epidemiological studies relating a workplace chemical to an increased risk of the
specific disease in cohorts of workers, often with only a qualitative assessment of
exposure. An improved quantitative understanding of such exposures enhances
the likelihood of observing causal relations.67 It also can provide the information
needed by the expert toxicologist to opine on the likelihood that a specific exposure was responsible for an adverse outcome.
H. Toxicology and Epidemiology
Epidemiology is the study of the incidence and distribution of disease in human
populations. Clearly, both epidemiology and toxicology have much to offer in
elucidating the causal relationship between chemical exposure and disease.68 These
66. Toxicologists also have indirect means of approaching exposure through symptoms. For
many agents, there is a known threshold for smell and a reasonable range of levels that might cause
symptoms. For example, the use of toxicological expertise is appropriate in a situation in which chronic
exposure to a volatile hydrocarbon is alleged to have occurred at levels at which acute exposure would
be expected to render the individual unconscious. Toxicologists may also contribute knowledge of
the extent of individual exposure based upon appropriate assumptions concerning inhalation rate or
water use; for example, children inhale more per body mass than do adults, and outdoor workers in
hot climates will drink more fluids.
67. In terms of general causation, accurate exposure assessment is important because a true effect can be missed when cohorts include many workers with little exposure to the putative offending agent, thereby diluting the actual effect. See Peter F. Infante, Benzene
Exposure and Multiple Myeloma: A Detailed Meta-analysis of Benzene Cohort Studies, 1076 Ann. N.Y. Acad.
Sci. 90–109 (2006), for a discussion of this issue in relation to a meta-analysis of the potential causative
role of benzene in multiple myeloma. On the other hand, an association between exposure and effect
occurring solely by chance is more likely if the effect does not meet the expected standard of being more
pronounced in those receiving the highest dose. See Bernard D. Goldstein, Toxic Torts: The Devil Is in
the Dose, 16 J.L. & Pol’y 551–85 (2008). Setting regulatory standards based upon the observed effect in
a cohort often requires a risk assessment, which in turn is dependent on understanding the extent of the
exposure. This has led to extensive retrospective reconstruction of exposure in key cohorts.
68. See Michael D. Green et al., Reference Guide on Epidemiology, Section V, in this manual.
For example, in Norris v. Baxter Healthcare, 397 F.3d 878, 882 (10th Cir. 2005), testimony was excluded
as unreliable in which the expert ignored epidemiological studies that conflicted with the expert’s
opinion. However, epidemiological studies are not always necessary. Glastetter v. Novartis Pharms.
Corp., 252 F.3d 986, 999 (8th Cir. 2001).
sciences often go hand in hand with assessments of the risks of chemical exposure,
without artificial distinctions being drawn between them. However, although
courts generally rule epidemiological expert opinion admissible, the admissibility
of toxicological expert opinion has been more controversial because of uncertainties regarding extrapolation from animal and in vitro data to humans. This
particularly has been true in cases in which relevant epidemiological research data
exist. However, the methodological weaknesses of some epidemiological studies,
including their inability to accurately measure exposure and their small numbers
of subjects, render these studies difficult to interpret.69 In contrast, because animal
and cell studies permit researchers to isolate the effects of exposure to a single
chemical or to known mixtures, toxicological findings offer unique information concerning dose–response relationships, mechanisms of action, specificity of
response, and other information relevant to the assessment of causation.70
The gold standard in clinical epidemiology and in the testing of pharmaceutical agents is the randomized double-blind cohort study in which the control
and intervention groups are perfectly matched. Although appropriate and very informative for the testing of pharmaceutical agents, deliberately exposing humans in such a study is generally unethical for chemicals used for other purposes. The randomized control design in essence is
what is used in a classic toxicological study in laboratory animals, although matching is more readily achieved because the animals are genetically similar and have
identical environmental histories.
Dose issues are at the interface between toxicology and epidemiology. Many
epidemiological studies of the potential risk of chemicals do not have direct
information about dose, although qualitative differences among subgroups or
in comparison with other studies can be inferred. The epidemiology database
includes many studies that are probing for the potential for an association between
a cause and an effect. Thus a study asking all those suffering from a specific disease
a multiplicity of questions related to potential exposures is bound to find some
statistical association between the disease and one or more exposure conditions.
Such studies generate hypotheses that can then be evaluated more thoroughly by
subsequent studies that more narrowly focus on the potential cause-and-effect
69. Id. See also Michael D. Green et al., Reference Guide on Epidemiology, in this manual.
70. Both commonalities and differences between animal responses and human responses to
chemical exposures were recognized by the court in International Union, United Automobile, Aerospace
and Agricultural Implement Workers of America, UAW v. Pendergrass, 878 F.2d 389 (D.C. Cir. 1989). In
reviewing the results of both epidemiological and animal studies on formaldehyde, the court stated:
“Humans are not rats, and it is far from clear how readily one may generalize from one mammalian
species to another. But in light of the epidemiological evidence [of carcinogenicity] that was not the
main problem. Rather it was the absence of data at low levels.” Id. at 394. The court remanded
the matter to OSHA to reconsider its findings that formaldehyde presented no specific carcinogenic
risk to workers at exposure levels of 1 part per million or less. See also Hopkins v. Dow Corning Corp.,
33 F.3d 1116 (9th Cir. 1994); In re Accutane Prod. Liab., 511 F. Supp. 2d 1288, 1292 (M.D. Fla.
2007); United States v. Philip Morris USA, Inc., 449 F. Supp. 2d 1, 182 (D.D.C. 2006); Ambrosini
v. Labarraque, 101 F.3d 129, 141 (D.C. Cir. 1996).
relation. One way to evaluate the strength of the association is to assess whether
those epidemiological studies evaluating cohorts with relatively high exposure
observe the association.71
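The tendency of many-question studies to turn up chance associations can be quantified. If each of several independent exposure questions is tested at a conventional 5% significance level, the chance of at least one spurious "significant" finding grows quickly with the number of questions; the figures below are a minimal illustration, not drawn from any study in the text.

```python
def prob_false_positive(n_tests, alpha=0.05):
    """Chance of at least one spurious 'significant' association when
    n_tests independent hypotheses are each tested at level alpha."""
    return 1 - (1 - alpha) ** n_tests

# Probability of at least one chance association as questions multiply.
for n_questions in (1, 10, 20, 50):
    print(n_questions, round(prob_false_positive(n_questions), 3))
```

With 10 independent questions the chance of at least one false positive is already about 40%, which is why such hypothesis-generating studies call for confirmation by more narrowly focused follow-up studies.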
The requirement in certain jurisdictions for epidemiological evidence of a
relative risk greater than two (RR > 2) for general causation also has limited
the utilization of toxicological evidence.72 A firm requirement for such evidence means that if the epidemiological database showed statistically significant
evidence that cohorts exposed to 10 parts per million of an agent for 20 years
produced an 80% increase in risk, the court could not hear the case of a plaintiff alleging that exposure to 50 parts per million for 20 years of the same agent
caused the adverse outcome. Yet to a toxicologist there would be little question
that exposure to the fivefold higher dose would lead to more than a doubling of
the risk, all other facets of the case being similar.
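The arithmetic behind the RR > 2 threshold can be made explicit. Under the conventional (and contested) translation of "more likely than not," the probability that exposure caused a given case is taken to be (RR − 1)/RR, which exceeds 0.5 only when RR > 2. A minimal sketch using the 80% figure from the example above:

```python
def attributable_fraction(relative_risk):
    """Probability of causation under the conventional (RR - 1)/RR
    translation; exceeds 0.5 ('more likely than not') only when RR > 2."""
    return (relative_risk - 1) / relative_risk

# An 80% increase in risk is RR = 1.8, below the doubling threshold.
print(round(attributable_fraction(1.8), 3))  # 0.444
# A doubling of risk (RR = 2) sits exactly at the 50% line.
print(attributable_fraction(2.0))  # 0.5
```

The toxicologist's objection in the text is that this calculation is tied to the studied dose: a plaintiff exposed at a fivefold higher dose may well face an RR above 2 even though the studied cohort's RR was 1.8.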
Even though there is little toxicological data on many of the 75,000 compounds in general commerce, there is far more information from toxicological
studies than from epidemiological studies.73 It is much easier, and more economical, to expose an animal to a chemical or to perform in vitro studies than it is to
perform epidemiological studies. This difference in data availability is evident even
for cancer causation, for which toxicological study is particularly expensive and
time-consuming. Of the perhaps two dozen chemicals that reputable international
authorities agree are known human carcinogens based on positive epidemiological
studies, arsenic is the only one not known to be an animal carcinogen. Yet there
are more than 100 known animal carcinogens for which there is no valid epidemiological database, and others for which the epidemiological database has been
71. For common chemicals, it is not unusual that a literature search reveals an association
with virtually any disease. As an example of considering dose issues across epidemiological studies,
see Luoping Zhang et al., Formaldehyde Exposure and Leukemia: A New Meta-Analysis and Potential
Mechanisms, 681 Mutat. Res. 150–68 (2008). The subject of the strength of an epidemiological association and its relation to causality is considered in Michael D. Green et al., Reference Guide on
Epidemiology, in this manual.
72. The basis for the use of RR > 2 is the translation of the preponderance of evidence, or
“more likely than not,” as a basis for tort law into at least a doubling of risk. An example is the
Havner rule in Texas, which for general causation requires that there be at least two epidemiological
studies with a statistically significant RR > 2 associating a putative cause with an effect (Merrell Dow
Pharms. v. Havner, 953 S.W.2d 706, 716 (Tex. 1997)). For a discussion of the use by jurisdictions of
relative risk > 2 for general and specific causation, see Russellyn S. Carruth & Bernard D. Goldstein,
Relative Risk Greater Than Two in Proof of Causation in Toxic Tort Litigation, 41 Jurimetrics 195 (2001);
for the toxicological issues, see Bernard D. Goldstein, Toxic Torts: The Devil Is in the Dose, 16 J.L. &
Pol’y 551–85 (2008).
73. See generally Committee on Toxicity Testing and Assessment of Environmental Agents, supra
note 24. See also National Research Council, Toxicity Testing: Strategies to Determine Needs and
Priorities (1984); Myra Karstadt & Renee Bobal, Availability of Epidemiologic Data on Humans Exposed
to Animal Carcinogens, 2 Teratogenesis, Carcinogenesis & Mutagenesis 151 (1982); Lorenzo Tomatis et
al., Evaluation of the Carcinogenicity of Chemicals: A Review of the Monograph Program of the International
Agency for Research on Cancer, 38 Cancer Res. 877, 881 (1978).
equivocal.74 To clarify any findings, regulators can require a repeat of an equivocal 2-year animal toxicological study or the performance of additional laboratory
studies in which animals deliberately are exposed to the chemical. Such deliberate
exposure is not possible in humans. As a general rule, unequivocally positive epidemiological studies reflect prior workplace practices that led to relatively high levels of chemical exposure for a limited number of individuals and that, fortunately, in most cases no longer occur. Thus an additional prospective epidemiological
study often is not possible, and even the ability to do retrospective studies is constrained by the passage of time.
In essence, epidemiological findings of an adverse effect in humans represent
a failure of toxicology as a preventive science or of regulatory authorities or other
responsible parties in controlling exposure to a hazardous chemical or physical agent. A corollary of the tenet that, depending upon dose, all chemical and
physical agents are harmful, is that society depends upon toxicological science to
discover these harmful effects and on regulators and responsible parties to prevent
human exposure to a harmful level or to ensure that the agent is not produced.
Epidemiology is a valuable backup approach that functions to detect failures of
primary prevention. The two disciplines complement each other, particularly
when the approaches are iterative.
II. Demonstrating an Association Between
Exposure and Risk of Disease75
Once the expert has been qualified, he or she is expected to offer an opinion on
whether the plaintiff’s disease was caused by exposure to a chemical. To do so,
the expert relies on the principles of toxicology to provide a scientifically valid
74. The absence of epidemiological data is due, in part, to the difficulties in conducting cancer
epidemiology studies, including the lack of suitably large groups of individuals exposed for a sufficient period of time, long latency periods between exposure and manifestation of disease, the high
variability in the background incidence of many cancers in the general population, and the inability
to measure actual exposure levels. These same concerns have led some researchers to conclude that
“many negative epidemiological studies must be considered inconclusive” for exposures to low doses
or weak carcinogens. Henry C. Pitot III & Yvonne P. Dragan, Chemical Carcinogenesis, in Casarett and
Doull’s Toxicology: The Basic Science of Poisons 201, 240–41 (Curtis D. Klaassen ed., 5th ed. 1996).
75. Determinations about cause-and-effect relations by regulatory agencies often depend upon
expert judgment exercised by assessing the weight of evidence. For a discussion of this process as used
by the International Agency for Research on Cancer of the World Health Organization and the role
of information about mechanisms of toxicity, see Vincent J. Cogliano et al., Use of Mechanistic Data
in IARC Evaluations, 49 Envtl. & Molecular Mutagenesis 100 (2008). For the use of expert judgment
in EPA’s response to submission of information for premanufacture notification required under the
Toxic Substances Control Act, 15 U.S.C. §§ 2604, 2605(e), 40 C.F.R. §§ 720 et seq., see Chemical
Manufacturers Ass’n v. EPA, 859 F.2d 977 (D.C. Cir. 1988).
methodology for establishing causation and then applies the methodology to the
facts of the case.
An opinion on causation should be premised on three preliminary assessments.
First, the expert should analyze whether the disease can be related to chemical
exposure by a biologically plausible theory. Second, the expert should examine
whether the plaintiff was exposed to the chemical in a manner that can lead to
absorption into the body. Third, the expert should offer an opinion about whether
the dose to which the plaintiff was exposed is sufficient to cause the disease.
The following questions help evaluate the strengths and weaknesses of toxicological evidence.
A. On What Species of Animals Was the Compound Tested?
What Is Known About the Biological Similarities and
Differences Between the Test Animals and Humans?
How Do These Similarities and Differences Affect
the Extrapolation from Animal Data in Assessing
the Risk to Humans?
All living organisms share a common biology that leads to marked similarities in the responsiveness of subcellular structures to toxic agents. Among mammals, organ structure and function are sufficiently similar to permit extrapolation from one species to another in most instances. Comparative information concerning
factors that modify the toxic effects of chemicals, including absorption, distribution,
metabolism, and excretion, in the laboratory test animals and humans enhances the
expert’s ability to extrapolate from laboratory animals to humans.76
The expert should review similarities and differences between the animal
species in which the compound has been tested and humans. This analysis should
form the basis of the expert’s opinion regarding whether extrapolation from animals to humans is warranted.77
76. See generally supra notes 35–36 and accompanying text.
77. The failure to review similarities and differences in metabolism in performing cross-species
extrapolation has led to the exclusion of opinions based on animal data. See In re Silicone Gel Breast
Implants Prods. Liab. Litig., 318 F. Supp. 2d 879, 891 (C.D. Cal. 2004); Fabrizi v. Rexall Sundown,
Inc., 2004 WL 1202984, at *8 (W.D. Pa. June 4, 2004). Hall v. Baxter Healthcare Corp., 947 F. Supp.
1387, 1410 (D. Or. 1996); Nelson v. Am. Sterilizer Co., 566 N.W.2d 671 (Mich. Ct. App. 1997).
But see In re Paoli R.R. Yard PCB Litig., 35 F.3d 717, 779–80 (3d Cir. 1994) (noting that humans
and monkeys are likely to show similar sensitivity to PCBs), cert. denied sub nom. Gen. Elec. Co. v.
Ingram, 513 U.S. 1190 (1995). As the Supreme Court noted in General Electric Co. v. Joiner, 522 U.S.
136, 144 (1997), the issue regarding admissibility is not whether animal studies are ever admissible to
establish causation, but whether the particular studies relied upon by plaintiff’s experts were sufficiently
supported. See Carl F. Cranor et al., Judicial Boundary Drawing and the Need for Context-Sensitive Science
in Toxic Torts After Daubert v. Merrell Dow Pharmaceuticals, Inc., 16 Va. Envtl. L.J. 1, 38 (1996).
In general, an overwhelming similarity is apparent in the biology of all living
things, and there is a particularly strong similarity among mammals. Of course,
laboratory animals differ from humans in many ways. For example, rats do not
have gallbladders. Thus, rat data would not be pertinent to the possibility that
a compound produces human gallbladder toxicity.78 Note that many subjective
symptoms are poorly modeled in animal studies. Thus, complaints that a chemical has caused nonspecific symptoms, such as nausea, headache, and weakness,
for which there are no objective manifestations in humans, are difficult to test in
laboratory animals.
B. Does Research Show That the Compound Affects a Specific
Target Organ? Will Humans Be Affected Similarly?
Some toxic agents affect only specific organs and not others. This organ specificity may be due to particular patterns of absorption, distribution, metabolism, and
excretion; the presence of specific receptors; or organ function. For example,
organ specificity may reflect the presence in the organ of relatively high levels of
an enzyme capable of metabolizing or changing a compound to a toxic form of the
compound,79 or it may reflect the relatively low level of an enzyme capable of
detoxifying a compound. An example of the former is liver toxicity caused by
inhaled carbon tetrachloride, which affects the liver but not the lungs because of
extensive metabolism to a toxic metabolite within the liver but relatively little
such metabolism in the lung.80
Some chemicals, however, may cause nonspecific effects or even multiple
effects. Lead is an example of a toxic agent that affects many organ systems,
including the blood, the central and peripheral nervous systems, the reproductive
system, and the kidneys.
The basis of specificity often reflects the function of individual organs. For
example, the thyroid is particularly susceptible to radioactive iodine in atomic fallout because thyroid hormone is unique within the body in that it requires iodine.
Through evolution, a very efficient and specific mechanism has developed that
78. See, e.g., Edward J. Calabrese, Multiple Chemical Interactions 583–89 tbl.14-1 (1991).
Species differences that produce a qualitative difference in response to xenobiotics are well known.
Sometimes understanding the mechanism underlying the species difference can allow one to predict
whether the effect will occur in humans. Thus, carbaryl, an insecticide commonly used for gypsy moth
control, among other things, produces fetal abnormalities in dogs but not in hamsters, mice, rats, and
monkeys. Dogs lack the specific enzyme involved in metabolizing carbaryl; the other species tested all
have this enzyme, as do humans. Therefore, it has been assumed that humans are not at risk for fetal
malformations produced by carbaryl.
79. Certain chemicals act directly to produce toxicity, whereas others require the formation of
a toxic metabolite.
80. Brian Jay Day et al., Potentiation of Carbon Tetrachloride-Induced Hepatotoxicity and Pneumotoxicity
by Pyridine, 8 J. Biochem. Toxicol. 11 (1993).
concentrates any absorbed iodine preferentially within the thyroid, rendering the
thyroid particularly at risk from radioactive iodine. In a test tube, the radiation
from radioactive iodine can affect the genetic material obtained from any cell in
the body, but in the intact laboratory animal or human, only the thyroid is at risk.
The unfolding of the human genome already is beginning to provide information pertinent to understanding the wide variation in human risk from environmental chemicals. The impact of this understanding on toxic tort causation
issues remains to be explored.81
C. What Is Known About the Chemical Structure of the
Compound and Its Relationship to Toxicity?
Understanding the structural aspects of chemical toxicology has led to the use of
structure–activity relationships (SAR) as a formal method of predicting the potential toxicity of new chemicals. This technique compares the chemical structure
of compounds with known toxicity and the chemical structure of compounds
with unknown toxicity. Toxicity then is estimated based on the molecular similarities between the two compounds. Although SAR is used extensively by EPA
in evaluating many new chemicals required to be tested under the registration
requirements of TSCA, its reliability has a number of limitations.82
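SAR methods vary widely in sophistication, but many reduce to scoring how much structural overlap a new compound shares with compounds of known toxicity. As a toy illustration only (the feature labels below are invented, not real chemical fingerprints or any EPA method), a Tanimoto-style similarity between feature sets might be sketched as:

```python
def tanimoto(features_a, features_b):
    """Tanimoto similarity between two sets of structural features:
    shared features divided by total distinct features (0 to 1)."""
    a, b = set(features_a), set(features_b)
    if not a | b:
        return 0.0
    return len(a & b) / len(a | b)

# Invented feature labels, purely for illustration.
benzene_like = {"aromatic_ring", "six_carbons", "no_substituent"}
toluene_like = {"aromatic_ring", "six_carbons", "methyl_group"}
print(round(tanimoto(benzene_like, toluene_like), 2))  # 0.5
```

As the benzene/alkyl benzene discussion in the accompanying footnote illustrates, a high similarity score does not guarantee similar toxicity: benzene, but not its structurally similar alkyl relatives, damages bone marrow.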
81. Committee on Applications of Toxicogenomic Technologies to Predictive Toxicology
and Risk Assessment, National Research Council, Applications of Toxicogenomic Technologies to
Predictive Toxicology and Risk Assessment (2007); Gary E. Marchant, Toxicogenomics and Toxic Torts,
20 Trends Biotech. 329 (2002). Genomics can also be misinterpreted. A recent example is the use of
white blood cell gene expression to determine whether benzene was a cause of acute myelogenous
leukemia (AML) in individual workers. M.T. Smith, Misuse of Genomics in Assigning Causation in
Relation to Benzene Exposure, 14 Int’l J. Occup. Envtl. Health 144–46 (2008) describes why the failure to
match a pattern of DNA expression in workers with AML who were previously exposed to benzene
is not scientifically defensible as a means to establish the lack of causation, as said to have been done
in workers’ compensation cases in California. The wide range in the rate of metabolism of chemicals is at least partly under genetic control. A study of Chinese workers exposed to benzene found
approximately a doubling of risk in people with high levels of either an enzyme that increased the
rate of formation of a toxic metabolite or an enzyme that decreased the rate of detoxification of this
metabolite. There was a sevenfold increase in risk for those who had both genetically determined
variants. N. Rothman et al., Benzene Poisoning, A Risk Factor for Hematological Malignancy, Is Associated
with the NQO1 609C→T Mutation and Rapid Fractional Excretion of Chlorzoxazone, 57 Cancer Res.
239–42 (1997). See also Frederica P. Perera, Molecular Epidemiology: Insights into Cancer Susceptibility,
Risk Assessment, and Prevention, 88 J. Nat’l Cancer Inst. 496 (1996).
82. For example, benzene and the alkyl benzenes (which include toluene, xylene, and ethyl
benzene) share a similar chemical structure. SAR works exceptionally well in predicting the acute
central nervous system anesthetic-like effects of both benzene and the alkyl benzenes. Although there
are slight differences in dose–response relationships, they are readily explained by the interrelated
factors of chemical structure, vapor pressure, and lipid solubility (the brain is highly lipid). National
Research Council, The Alkyl Benzenes (1981). However, only benzene produces damage to the
bone marrow and leukemia; the alkyl benzenes do not have this effect. This difference is the result
D. Has the Compound Been the Subject of In Vitro Research,
and if So, Can the Findings Be Related to What Occurs
In Vivo?
Cellular and tissue culture research can be particularly helpful in identifying
mechanisms of toxic action and potential target-organ toxicity. The major barrier to the use of in vitro results is the frequent inability to relate doses that cause
cellular toxicity to doses that cause whole-animal toxicity. In many critical areas,
knowledge that permits such quantitative extrapolation is lacking.83 Nevertheless,
the ability to quickly test new products through in vitro tests, using human cells,
provides invaluable “early warning systems” for toxicity.84
E. Is the Association Between Exposure and Disease Biologically
Plausible?
No matter how strong the temporal relationship between exposure and the
development of disease, or the supporting epidemiological evidence, it is difficult to accept an association between a compound and a health effect when no
of specific toxic metabolic products of benzene in comparison with the alkyl benzenes. Thus SAR
is predictive of neurotoxic effects but not bone marrow effects. See Preston & Hoffman, supra note
33, at 277. Advances in computational approaches show promise in improving SAR. See Committee
on Toxicity Testing and Assessment of Environmental Agents, National Research Council, Toxicity
Testing in the 21st Century: A Vision and a Strategy, ch. 4 (2007).
In Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579 (1993), the Court rejected a per
se exclusion of SAR, animal data, and reanalysis of previously published epidemiological data where
there were negative epidemiological data. However, as the court recognized in Sorensen v. Shaklee
Corp., 31 F.3d 638, 646 n.12 (8th Cir. 1994), the problem with SAR is that “‘[m]olecules with minor
structural differences can produce very different biological effects.’” (quoting Joseph Sanders, From
Science to Evidence: The Testimony on Causation in the Bendectin Cases, 46 Stan. L. Rev. 1, 19 (1993)).
See also Glastetter v. Novartis Pharms. Corp., 252 F.3d 986, 990 (8th Cir. 2001); Polski v. Quigley
Corp., 2007 WL 2580550, at *6 (D. Minn. Sept. 5, 2007).
83. In Vitro Toxicity Testing: Applications to Safety Evaluation 8 (John M. Frazier ed., 1992).
Despite its limitations, in vitro research can strengthen inferences drawn from whole-animal bioassays
and can support opinions regarding whether the association between exposure and disease is biologically plausible. See Preston & Hoffman, supra note 33, at 278–93; Rogers & Kavlock, supra note 33,
at 319–23.
84. Graham v. Playtex Prods., Inc., 993 F. Supp. 127, 131–32 (N.D.N.Y. 1998) (opinion based
on in vitro experiments showing that rayon tampons were associated with higher risk of toxic shock
syndrome was admissible in the absence of epidemiological evidence). See also Allgood v. General
Motors Corp., 2006 WL 2669337, at *7 (S.D. Ind. Sept. 18, 2006); In re Ephedra Prods. Liab. Litig.,
393 F. Supp. 2d 181, 194 (S.D.N.Y. 2005) (in vitro studies may be the subject of proper inferences
“although the gaps between such data and definitive evidence of causality are real and subject to challenge before the jury, they are not so great as to require the opinion to be excluded from evidence.
Inconclusive science is not the same as junk science”).
Reference Guide on Toxicology
mechanism can be identified by which the chemical or physical exposure leads
to the putative effect.85
III. Specific Causal Association Between an
Individual’s Exposure and the Onset of
Disease
An expert who opines that exposure to a compound caused a person’s disease
engages in deductive clinical reasoning.86 In most instances, cancers and other
diseases do not wear labels documenting their causation. The opinion is based on
an assessment of the individual’s exposure, including the amount, the temporal
relationship between the exposure and disease, and other disease-causing factors. This information is then compared with scientific data on the relationship
between exposure and disease. The certainty of the expert’s opinion depends on
the strength of the research data demonstrating a relationship between exposure
and the disease at the dose in question and the presence or absence of other
disease-causing factors (also known as confounding factors).87
Particularly problematic are generalizations made in personal injury litigation
from regulatory positions. Regulatory standards are set for purposes far different
from determining the preponderance of the evidence in a toxic tort case. For example,
if regulatory standards are discussed in toxic tort cases to provide a reference point
for assessing exposure levels, it must be recognized that there is a great deal of
variability in the extent of evidence required to support different regulations.88
The extent of evidence required to support regulations depends on
85. However, theories of bioplausibility, without additional data, have been found to be insufficient to support a finding of causation. See, e.g., Golod v. Hoffman La Roche, 964 F. Supp. 841,
860–61 (S.D.N.Y. 1997); Hall v. Baxter Healthcare Corp., 947 F. Supp. 1387, 1414 (D. Or. 1996).
But see Best v. Lowe’s Home Centers, Inc., 2008 WL 2359986, at *8 (E.D. Tenn. June 5, 2008)
(expert relied on temporal proximity in concluding that plaintiff lost his sense of smell due to chemical exposure).
86. For an example of deductive clinical reasoning based on known facts about the toxic effects
of a chemical and the individual’s pattern of exposure, see Bernard D. Goldstein, Is Exposure to Benzene
a Cause of Human Multiple Myeloma? 609 Annals N.Y. Acad. Sci. 225 (1990).
87. Causation issues are discussed in Michael D. Green et al., Reference Guide on Epidemiology, Section V, and Wong et al., Reference Guide on Medical Testimony, Section IV, in this manual.
See also David L. Bazelon, Science and Uncertainty: A Jurist’s View, 5 Harv. Envtl. L. Rev. 209 (1981);
Troyen A. Brennan, Causal Chains and Statistical Links: The Role of Scientific Uncertainty in Hazardous-Substance Litigation, 73 Cornell L. Rev. 469 (1988); Joseph Sanders, Scientific Validity, Admissibility and
Mass Torts After Daubert, 78 Minn. L. Rev. 1387 (1994); Orrin E. Tilevitz, Judicial Attitudes Towards
Legal and Scientific Proof of Cancer Causation, 3 Colum. J. Envtl. L. 344, 381 (1977).
88. See, e.g., In re Paoli R.R. Yard PCB Litig., 35 F.3d 717, 781 (3d Cir. 1994) (district court
abused its discretion in excluding animal studies relied upon by EPA), cert. denied sub nom. General
1. The law (e.g., the Clean Air Act's National Ambient Air Quality Standards
provisions focus regulatory activity for primary pollutants on adverse
health consequences to sensitive populations, with an adequate margin of
safety and no consideration of economic consequences, whereas regulatory
activity under TSCA calls for some balance between the societal benefits
and risks of new chemicals89);
2. The specific end point of concern (e.g., consider the concern caused by
cancer and adverse reproductive outcomes versus almost anything else);
and
3. The societal impact (e.g., the public’s support for control of an industry
that causes air pollution versus the public’s relative lack of desire to alter
personal automobile use patterns).
These three concerns, as well as others, including costs, politics, and the virtual
certainty of litigation challenging the regulation, have an impact on the level of
scientific proof required by the regulatory decisionmaker.90
In addition, regulatory standards traditionally include protective factors to
reasonably ensure that susceptible individuals are not put at risk. Furthermore,
standards often are based on the risk that results from lifetime exposure. Accordingly, the mere fact that an individual has been exposed to a level above a standard
does not necessarily mean that an adverse effect has occurred.
A. Was the Plaintiff Exposed to the Substance, and if So,
Did the Exposure Occur in a Manner That Can Result in
Absorption into the Body?
Evidence of exposure is essential in determining the effects of harmful substances.
Basically, potential human exposure is measured in one of three ways. First, when
direct measurements cannot be made, exposure can be measured by mathematical
modeling, in which one uses a variety of physical factors to estimate the transport of the pollutant from the source to the receptor. For example, mathematical
models take into account such factors as wind variations to allow calculation of
Elec. Co. v. Ingram, 513 U.S. 1190 (1995); Molden v. Georgia Gulf Corp., 465 F. Supp. 2d 606, 613
(M.D. La. 2006) (plaintiff failed to establish a prima facie case due to failure to establish exposure at a
level considered dangerous by a regulatory agency); In re W.R. Grace & Co., 355 B.R. 462, 490 (Bankr.
D. Del. 2006) (OSHA standards of exposure relevant to causation but not determinative for exposure
occurring due to home attic insulation). See also John Endicott, Interaction Between Regulatory Law and
Tort Law in Controlling Toxic Chemical Exposure, 47 SMU L. Rev. 501 (1994).
89. See, e.g., Clean Air Act Amendments of 1990, 42 U.S.C. § 7412(f) (1994); Toxic Substances
Control Act, 15 U.S.C. § 2605 (1994).
90. These concerns are discussed in Stephen Breyer, Breaking the Vicious Circle: Toward Effective Risk Regulation (1993).
the transport of radioactive iodine from a federal atomic research facility to nearby
residential areas. Second, exposure can be directly measured in the medium in
question—air, water, food, or soil. When the medium of exposure is water, soil, or
air, hydrologists or meteorologists may be called upon to contribute their expertise
to measuring exposure. The third approach directly measures human receptors
through some form of biological monitoring, such as blood tests to determine
blood lead levels or urinalyses to check for a urinary metabolite indicative of pollutant exposure. Ideally, both environmental testing and biological monitoring
are performed; however, this is not always possible, particularly in instances of
past exposure.91
The toxicologist must go beyond understanding exposure to determine if
the individual was exposed to the compound in a manner that can result in
absorption into the body. The absorption of the compound is a function of its
physiochemical properties, its concentration, and the presence of other agents
or conditions that assist or interfere with its uptake. For example, inhaled lead is
absorbed almost totally, whereas ingested lead is taken up only partially into the
body. Iron deficiency and low nutritional calcium intake, both common conditions of inner-city children, increase the amount of ingested lead that is absorbed
in the gastrointestinal tract and passes into the bloodstream.92
B. Were Other Factors Present That Can Affect the
Distribution of the Compound Within the Body?
Once a compound is absorbed into the body through the skin, lungs, or gastrointestinal tract, it is distributed throughout the body through the bloodstream.
Thus the rate of distribution depends on the rate of blood flow to various organs
91. See, e.g., Mitchell v. Gencorp Inc., 165 F.3d 778, 781 (10th Cir. 1999) (“[g]uesses, even if
educated, are insufficient to prove the level of exposure in a toxic tort case”); Wright v. Willamette
Indus., Inc., 91 F.3d 1105, 1107 (8th Cir. 1996); Ingram v. Solkatronic Chemical, Inc., 2005 WL 3544244,
at *11–*18 (N.D. Okla. 2005) (no information on dose so causation cannot be evaluated); In re Three
Mile Island Litig. Consol. Proceedings, 927 F. Supp. 834, 870 (M.D. Pa. 1996) (plaintiffs failed to
present direct or indirect evidence of exposure to cancer-inducing levels of radiation); Valentine v.
Pioneer Chlor Alkali Co., 921 F. Supp. 666, 678 (D. Nev. 1996). But see CSX Transp., Inc. v. Moody,
2007 WL 2011626, at *7 (Ky. Ct. App. July 13, 2007) (specific dose of solvent exposure not necessary
as long as evidence of exposure that could cause plaintiff’s toxic encephalopathy is presented including
how often solvents were used, duration of exposure, and documentation of physical symptoms while
plaintiff worked with solvents).
92. The term “bioavailability” is used to describe the extent to which a compound, such as lead,
is taken up into the body. In essence, bioavailability is at the interface between exposure and absorption into the organism. For an example of the impact of bioavailability on a governmental decision,
see Thomas H. Umbreit et al., Bioavailability of Dioxin in Soil from a 2,4,5-T Manufacturing Site, 232
Science 497–99 (1986), who found that the bioavailability of dioxins in the soil of Newark, New
Jersey, was negligible compared with that of Times Beach, Missouri—the latter community having
previously been evacuated because of dioxin soil contamination.
and tissues. Distribution and resulting toxicity also are influenced by other factors,
including the dose, the route of entry, tissue solubility, lymphatic supplies to the
organ, metabolism, and the presence of specific receptors or uptake mechanisms
within body tissues.
C. What Is Known About How Metabolism in the Human
Body Alters the Toxic Effects of the Compound?
Metabolism is the alteration of a chemical by bodily processes. It does not necessarily result in less toxic compounds being formed. In fact, many of the organic
chemicals that are known human cancer-causing agents require metabolic transformation before they can cause cancer. A distinction often is made between
direct-acting agents, which cause toxicity without any metabolic conversion, and
indirect-acting agents, which require metabolic activation before they can produce
adverse effects. Metabolism is complex, because a variety of pathways compete
for the same agent; some produce harmless metabolites, and others produce toxic
agents.93
D. What Excretory Route Does the Compound Take, and
How Does This Affect Its Toxicity?
Excretory routes include urine, feces, sweat, saliva, expired air, and breast milk (lactation). Many
inhaled volatile agents are eliminated primarily by exhalation. Small water-soluble
compounds are usually excreted through urine. Higher-molecular-weight compounds are often excreted through the biliary tract into the feces. Certain fat-soluble, poorly metabolized compounds, such as PCBs, may persist in the body
for decades, although they can be excreted in the milk fat of lactating women.
E. Does the Temporal Relationship Between Exposure and the
Onset of Disease Support or Contradict Causation?
In acute toxicity, there is usually a short time period between cause and effect.
However, in some situations, the length of basic biological processes necessitates a
longer period of time between initial exposure and the onset of observable disease.
For example, in acute myelogenous leukemia, the adult form of acute leukemia,
at least 1 to 2 years must elapse from initial exposure to radiation, benzene, or
93. Courts have explored the relationship between metabolic transformation and carcinogenesis.
See, e.g., In re Methyl Tertiary Butyl Ether (MTBE) Prods. Liab. Litig., 2008 WL 2607852, at *2
(S.D.N.Y. July 1, 2008); Stites v. Sundstrand Heat Transfer, Inc., 660 F. Supp. 1516, 1519 (W.D.
Mich. 1987).
cancer chemotherapy before the manifestation of a clinically recognizable case of
leukemia, and the period of significantly higher risk from the last exposure usually persists for no more than about 15 years. A toxic tort claim alleging a shorter
or longer time period between cause and effect is scientifically highly debatable.
Much longer latency periods are necessary for the manifestation of solid tumors
caused by agents such as asbestos and arsenic.94
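For readers who find a computational restatement useful, the latency reasoning above for adult acute myelogenous leukemia can be sketched as follows. This is a minimal sketch assuming the text's approximate figures (at least 1 to 2 years from first exposure, a significantly elevated risk window of roughly 15 years from last exposure); the function name and defaults are illustrative, not a legal or medical standard.

```python
# Illustrative sketch only: restates the approximate latency window for adult
# acute myelogenous leukemia described in the text. The bounds are the text's
# rough figures, not bright-line rules.

def latency_consistent(years_since_first_exposure: float,
                       years_since_last_exposure: float,
                       minimum_latency: float = 1.0,
                       maximum_window: float = 15.0) -> bool:
    """Return True if the claimed timing falls inside the plausible window:
    at least ~1-2 years after first exposure and no more than ~15 years
    after the last exposure."""
    return (years_since_first_exposure >= minimum_latency
            and years_since_last_exposure <= maximum_window)

print(latency_consistent(5.0, 3.0))   # -> True (within the plausible window)
print(latency_consistent(0.5, 0.5))   # -> False (too soon after exposure)
```

A claim falling outside this window is not automatically wrong, but, as the text notes, it is scientifically harder to defend.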
F. If Exposure to the Substance Is Associated with the Disease,
Is There a No Observable Effect, or Threshold, Level,
and if So, Was the Individual Exposed Above the No
Observable Effect Level?
For agents that produce effects other than through mutations, it is assumed that
there is some level that is incapable of causing harm. If the level of exposure
was below this no observable effect, or threshold, level, a relationship between
the exposure and disease cannot be established.95 When only laboratory animal
94. The temporal relationship between exposure and causation is discussed in Rolen v. Hansen
Beverage Co., 193 F. App’x 468, 473 (6th Cir. 2006) (“Expert opinions based upon nothing more
than the logical fallacy of post hoc ergo propter hoc typically do not pass muster under Daubert.”).
See also Young v. Burton, 2008 WL 2810237, at *17 (D.D.C. July 22, 2008); Dellinger v. Pfizer,
Inc., 2006 WL 2057654, at *10 (W.D.N.C. July 16, 2006) (temporal relationship between exposure
and illness alone not sufficient for causation when exposure was over an 18-month period); Cavallo
v. Star Enterprise, 892 F. Supp. 756, 769–74 (E.D. Va. 1995) (expert testimony based primarily on
temporal connection between exposure to jet fuel and onset of symptoms, without other evidence of
causation, ruled inadmissible). But see In re Stand ‘N Seal, Prods. Liab. Litig., 623 F. Supp. 2d 1355,
1371–72 (N.D. Ga. 2009) (toxicologist’s causation opinion that exposure to grout sealer caused chemical pneumonitis not subject to Daubert challenge based on a strong temporal relationship between
exposure and acute onset of respiratory symptoms despite lack of dose response data); In re Ephedra
Prods. Liab. Litig., 2007 WL 2947451, at *2 (S.D.N.Y. Oct. 9, 2007) (when exposure is known to
produce quick biological effects, a temporal relationship between exposure and effect can be used
to infer causation); Nat’l. Bank of Commerce v. Dow Chem. Co., 965 F. Supp. 1490, 1525 (E.D.
Ark. 1996) (“[T]here may be instances where the temporal connection between exposure to a given
chemical and subsequent injury is so compelling as to dispense with the need for reliance on standard
methods of toxicology.”). The issue of latency periods and the statute of limitations is considered in
Carl F. Cranor, Toxic Torts: Science, Law and the Possibility of Justice 173 (2006).
95. See, e.g., Allen v. Pennsylvania Eng’g Corp., 102 F.3d 194, 199 (5th Cir. 1996) (“Scientific
knowledge of the harmful level of exposure to a chemical, plus knowledge that the plaintiff was
exposed to such quantities, are minimal facts necessary to sustain the plaintiff’s burden in a toxic tort
case.”); Redland Soccer Club, Inc. v. Dep’t of the Army, 55 F.3d 827, 847 (3d Cir. 1995) (summary
judgment for defendant precluded where exposure above cancer threshold level could be calculated
from soil samples); Molden v. Georgia Gulf Corp., 465 F. Supp. 2d 606, 613 (M.D. La. 2006) (levels
of phenol released into the air were not considered harmful by regulatory agencies); Adams v. Cooper
Indus., Inc., 2007 WL 2219212, at *8 (E.D. Ky. July 30, 2007) (because plaintiffs’ experts have not
attempted to quantify or measure the amount or dosage of a substance to which a plaintiff was exposed,
their opinions are unreliable as to specific causation). But see Byers v. Lincoln Elec. Co., 607 F. Supp.
data are available, the expert extrapolates the NOEL from animals to humans
by calculating the animal NOEL based on experimental data and decreasing this
level by one or more safety factors to ensure no human effect.96 The NOEL can
also be calculated from human toxicity data if they exist. This analysis, however,
is not applied to substances that exert toxicity by causing mutations leading
to cancer. Theoretically, any exposure at all to mutagens may increase the risk of
cancer, although the risk may be very slight and not achieve medical probability.97
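The animal-to-human extrapolation just described is simple arithmetic, and can be sketched as follows. This is an illustrative sketch only: the tenfold defaults for interspecies differences and human (intraspecies) variability are common regulatory conventions assumed here for the example, and the function name is hypothetical.

```python
# Illustrative sketch of the safety-factor extrapolation described above.
# The 10x defaults (interspecies differences, human variability) are assumed
# conventions for this example, not values prescribed by this manual.

def human_reference_level(animal_noel: float,
                          interspecies_factor: float = 10.0,
                          intraspecies_factor: float = 10.0) -> float:
    """Decrease an animal NOEL (e.g., in mg/kg/day) by combined safety factors."""
    return animal_noel / (interspecies_factor * intraspecies_factor)

# An animal NOEL of 50 mg/kg/day yields a human reference level of 0.5 mg/kg/day.
print(human_reference_level(50.0))  # -> 0.5
```

As note 96 observes, regulatory practice is replacing the NOEL with the more statistically robust benchmark dose, which uses the full dose-response curve rather than a single no-effect point.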
IV. Medical History
A. Is the Medical History of the Individual Consistent with the
Toxicologist’s Expert Opinion Concerning the Injury?
One of the basic and most useful tools in diagnosis and treatment of disease is
the patient’s medical history.98 A thorough, standardized patient information
2d 863 n.101 (N.D. Ohio 2009) (no welder could ever provide evidence of actual exposure levels after the fact “which is why the law does not require mathematical precision to show toxic exposure” to support claims that inhaled manganese in welding fumes caused neurological injury); Tamraz v. BOC Group, Inc., 2008 U.S. Dist. LEXIS 54932, at *9–*10 (N.D. Ohio July 18, 2008) (plaintiffs were able to provide substantial evidence to support estimates of actual workplace conditions and exposure for a welder exposed to manganese).
96. See, e.g., supra note 26 & accompanying text; Robert G. Tardiff & Joseph V. Rodricks, Toxic Substances and Human Risk: Principles of Data Interpretation 391 (1988); Joseph V. Rodricks, Calculated Risks 230–39 (2006); Lu, supra note 22, at 84. For regulatory toxicology, NOEL is being replaced by a more statistically robust approach known as the benchmark dose. See supra note 27 & accompanying text. For example, EPA’s use of the benchmark dose takes into account comprehensive dose–response information, unlike NOEL.
97. See sources cited supra note 28. See also Henricksen v. ConocoPhillips Co., 605 F. Supp. 2d 1142, 1164–65 (E.D. Wash. 2009) (toxicologists’ opinion that exposure to gasoline containing benzene caused truck driver’s acute myelogenous leukemia found unreliable where dose calculation was unreliable, and “no-threshold model” lacked scientific support). U.S. regulatory approaches aimed at protecting the general population tend to avoid setting a standard for a known human carcinogen, because any allowable level below the standard is at least theoretically capable of causing cancer. However, exposure to many chemical carcinogens, including benzene and arsenic, cannot be eliminated. Thus, agencies and Congress have developed a number of ingenious means to regulate carcinogens while not seeming to acquiesce in exposure of the general population to a carcinogen. These include FDA’s approach to de minimis risk and EPA’s setting of a zero maximum contaminant level goal for carcinogens in drinking water while setting a maximum contaminant level above zero that is “set as closely as possible to the MCLG, taking technology and cost data into account,” http://safewater.custhelp.com/cgi-bin/safewater.cfg/php/enduser/std_adp.php?p_faqid=1319. In contrast, occupational standards, which also take into account feasibility, permit exposure to known human carcinogens. A generally outmoded approach for environmental or indoor air guidelines has been to divide the permissible OSHA standard by a factor accounting for the presumed lifetime exposure to the environmental chemical compared with 45 years at a 40-hour workweek.
98. For a thorough discussion of the methods of clinical diagnosis, see John B. Wong et al., Reference Guide on Medical Testimony, in this manual. See also Jerome P. Kassirer & Richard I.
questionnaire would be particularly useful for identifying the etiology, or causation,
of illnesses related to toxic exposures; however, there is currently no validated or
widely used questionnaire that gathers all pertinent information.99 Nevertheless,
it is widely recognized that a thorough medical history involves the questioning
and examination of the patient as well as appropriate medical testing. The patient’s
written medical records also should be examined.
The following information is relevant to a patient’s medical history: past and
present occupational and environmental history and exposure to toxic agents; lifestyle characteristics (e.g., use of nicotine and alcohol); family medical history (i.e.,
medical conditions and diseases of relatives); and personal medical history (i.e., present symptoms and results of medical tests as well as past injuries, medical conditions,
diseases, surgical procedures, and medical test results).
In some instances, the reporting of symptoms can be in itself diagnostic
of exposure to a specific substance, particularly in evaluating acute effects.100
For example, individuals acutely exposed to organophosphate pesticides report
headaches, nausea, and dizziness accompanied by anxiety and restlessness. Other
reported symptoms are muscle twitching, weakness, and hypersecretion with
sweating, salivation, and tearing.101
B. Are the Complaints Specific or Nonspecific?
Acute exposure to many toxic agents produces a constellation of nonspecific
symptoms, such as headaches, nausea, lightheadedness, and fatigue. These types of
symptoms are part of human experience and can be triggered by a host of medical
and psychological conditions. They are almost impossible to quantify or document
beyond the patient’s report. Thus, these symptoms can be attributed mistakenly
to an exposure to a toxic agent or discounted as unimportant when in fact they
reflect a significant exposure.102
Kopelman, Learning Clinical Reasoning (1991). A number of cases have considered the admissibility
of the treating physician’s opinion based, in part, on medical history, symptomatology, and laboratory
and pathology studies.
99. Office of Tech. Assessment, U.S. Congress, supra note 17, at 365–89.
100. But see Moore v. Ashland Chem., Inc., 126 F.3d 679, 693 (5th Cir. 1997) (discussion of
relevance of symptoms within 45 minutes of exposure); Armstrong v. Durango Georgia Paper Co.,
2005 WL 2373443, at *5 (S.D. Ga. Sept. 27, 2005) (plaintiffs exhibited temporary symptoms widely
recognized by the medical community as those associated with exposure to chlorine gas).
101. Environmental Protection Agency, Recognition and Management of Pesticide Poisonings
(4th ed. 1989).
102. The issue of whether the development of nonspecific symptoms may be related to pesticide
exposure was considered in Kannankeril v. Terminix Int’l, Inc., 128 F.3d 802 (3d Cir. 1997). The court
ruled that the trial court abused its discretion in excluding expert opinion that considered, and rejected,
a negative laboratory test. Id. at 808–09. See also Kerner v. Terminix Int’l, Co., 2008 WL 341363, at
*7 (S.D. Ohio Feb. 6, 2008) (expert testimony about causation admissible based on plaintiff’s nonspecific symptoms because scientific literature has linked exposure to pyrethrins and pyrethroids to
In taking a careful medical history, the expert focuses on the time pattern
of symptoms and disease manifestations in relation to any exposure and on the
constellation of symptoms to determine causation. It is easier to establish causation when a symptom is unusual and rarely is caused by anything other than the
suspect chemical (e.g., such rare cancers as hemangiosarcoma, associated with
vinyl chloride exposure, and mesothelioma, associated with asbestos exposure).
However, many cancers and other conditions are associated with several causative
factors, complicating proof of causation.103
C. Do Laboratory Tests Indicate Exposure to the Compound?
Two types of laboratory tests can be considered: tests that are routinely used in
medicine to detect changes in normal body status and specialized tests that are
used to detect the presence of the chemical or physical agent.104 Tests used to
demonstrate the presence of a toxic agent are frequently unavailable
from clinical laboratories. Even when available from a hospital or a clinical laboratory, a test such as that for carbon monoxide bound to hemoglobin (carboxyhemoglobin) is done so
rarely that it may raise concerns regarding its accuracy. Other tests, such as the test
for blood lead levels, are required for routine surveillance of potentially exposed
workers. However, if a laboratory is certified for the testing of blood lead in
workers, for which the OSHA action level is 40 micrograms per deciliter (µg/dl),
it does not necessarily mean that it will give reliable data on blood lead levels at the
much lower Centers for Disease Control and Prevention action level of 10 µg/dl.
D. What Other Causes Could Lead to the Given Complaint?
With few exceptions, acute and chronic diseases, including cancer, can be caused
by either a single toxic agent or a combination of agents or conditions. In taking
a careful medical history, the expert examines the possibility of competing causes,
or confounding factors, for any disease, which leads to a differential diagnosis.
In addition, ascribing causality to a specific source of a chemical requires that a
history be taken concerning other sources of the same chemical. The failure of
a physician to elicit such a history or of a toxicologist to pay attention to such a
numbness, tingling, burning sensations, and paresthesia); Wicker v. Consol. Rail Corp., 371 F. Supp.
2d 702, 732 (W.D. Pa. 2005).
103. Failure to rule out other potential causes of symptoms may lead to a ruling that the expert’s
report is inadmissible. See, e.g., Perry v. Novartis Pharms. Corp., 564 F. Supp. 2d 452, 469 (E.D. Pa.
2008); Farris v. Intel Corp., 493 F. Supp. 2d 1174, 1185 (D.N.M. 2007); Hall v. Baxter Healthcare
Corp., 947 F. Supp. 1387, 1413 (D. Or. 1996); Rutigliano v. Valley Bus. Forms, 929 F. Supp. 779,
786 (D.N.J. 1996).
104. See, e.g., Kannankeril v. Terminix Int’l, Inc., 128 F.3d 802, 807 (3d Cir. 1997).
history raises questions about competence and leaves open the possibility of competing causes of the disease.105
E. Is There Evidence of Interaction with Other Chemicals?
An individual’s simultaneous exposure to more than one chemical may result in
a response that differs from that which would be expected from exposure to only
one of the chemicals.106 When the effect of multiple agents is that which would
be predicted by the sum of the effects of individual agents, it is called an additive effect; when it is greater than this sum, it is known as a synergistic effect;
when one agent causes a decrease in the effect produced by another, the result is
termed antagonism; and when an agent that by itself produces no effect leads to an
enhancement of the effect of another agent, the response is termed potentiation.107
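The four terms just defined can be restated computationally. The sketch below is illustrative only: the function, its inputs, and the treatment of effects as single comparable numbers are assumptions for the example, not a method described in this manual.

```python
# Illustrative sketch of the terminology above: compare the observed effect of
# two agents given together with the sum of their individual effects. Treating
# "effect" as one comparable number is a simplifying assumption.

def classify_interaction(effect_a: float, effect_b: float, combined: float) -> str:
    """Label a combined effect using the four terms defined in the text."""
    expected = effect_a + effect_b
    # Potentiation: an agent with no effect of its own enhances the other.
    if (effect_a == 0) != (effect_b == 0) and combined > max(effect_a, effect_b):
        return "potentiation"
    if combined == expected:
        return "additive"
    if combined > expected:
        return "synergistic"
    return "antagonism"  # combined effect smaller than the predicted sum

print(classify_interaction(2, 3, 5))  # -> additive
print(classify_interaction(0, 3, 8))  # -> potentiation
```

In practice, of course, real dose-response data rarely reduce to a single number per agent; the sketch only fixes the vocabulary.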
Three types of toxicological approaches are pertinent to understanding the
effects of mixtures of agents. One is based on the standard toxicological evaluation of common commercial mixtures, such as gasoline. The second approach
is from studies in which the known toxicological effect of one agent is used to
explore the mechanism of action of another agent, such as using a known specific
inhibitor of a metabolic pathway to determine whether the toxicity of a second
agent depends on this pathway. The third approach is based on an understanding
of the basic mechanism of action of the individual components of the mixture,
thereby allowing prediction of the combined effect, which can then be tested in
an animal model.108
105. See, e.g., Perry v. Novartis Pharms. Corp., 564 F. Supp. 2d 452, 471 (E.D. Pa. 2008) (plaintiff’s experts failed to adequately account for the possibility that plaintiff’s T-LBL was idiopathic, and
thus their conclusion that exposure to Elidel was a substantial cause of plaintiff’s cancer is unreliable
and inadmissible); Bell v. Swift Adhesives, Inc., 804 F. Supp. 1577, 1580 (S.D. Ga. 1992) (expert’s
opinion that workplace exposure to methylene chloride caused plaintiff’s liver cancer, without ruling
out plaintiff’s infection with hepatitis B virus, a known liver carcinogen, was insufficient to withstand
motion for summary judgment for defendant).
106. See generally Edward J. Calabrese, Multiple Chemical Interactions 97–115, 220–221 (1991).
107. Courts have been called on to consider the issue of synergy. In International Union, United
Automobile, Aerospace & Agricultural Implement Workers of America v. Pendergrass, 878 F.2d 389, 391 (D.C.
Cir. 1989), the court found that OSHA failed to sufficiently explain its findings that formaldehyde
presented no significant carcinogenic risk to workers at exposure levels of 1 part per million or less.
The court particularly criticized OSHA’s use of a linear low-dose risk curve rather than a risk-averse
model after the agency had described evidence of synergy between formaldehyde and other substances
that workers would be exposed to, especially wood dust. Id. at 395.
108. See generally Calabrese, supra note 106. EPA has been addressing the issue of multiple
exposures to different agents within a community under the heading of cumulative risk assessment.
This approach is of particular importance in dealing with environmental justice concerns. See, e.g.,
Institute of Medicine, Toward Environmental Justice: Research, Education, and Health Policy Needs
(1999); Michael A. Callahan & Ken Sexton, If Cumulative Risk Assessment Is the Answer, What Is the
Question? 115 Envtl. Health Persp. 799–806 (2006).
673
Copyright © National Academy of Sciences. All rights reserved.
Reference Manual on Scientific Evidence: Third Edition
Reference Manual on Scientific Evidence
F. Do Humans Differ in the Extent of Susceptibility to the
Particular Compound in Question? Are These Differences
Relevant in This Case?
Individuals who exercise inhale more air than sedentary individuals and therefore are
exposed to higher doses of airborne environmental toxins. Similarly, differences in
metabolism, which are inherited or caused by external factors, such as the levels
of carbohydrates in a person’s diet, may result in differences in the delivery of a
toxic product to the target organ.109
Moreover, for any given level of a toxic agent that reaches a target organ,
damage may be greater because of a greater response of that organ. In addition,
for any given level of target-organ damage, there may be a greater impact on particular individuals. For example, an elderly individual or someone with preexisting
lung disease is less likely to tolerate a small decline in lung function caused by an
air pollutant than is a healthy individual with normal lung function.
A person’s level of physical activity, age, sex, and genetic makeup, as well as
exposure to therapeutic agents (such as prescription or over-the-counter drugs),
affect the metabolism of the compound and hence its toxicity.110 Advances in
human genetics research are providing information about susceptibility to environmental agents that may be relevant to determining the likelihood that a given
exposure has a specific effect on an individual.
G. Has the Expert Considered Data That Contradict His or
Her Opinion?
Multiple avenues of deductive reasoning based on scientific data lead to acceptance
of causation in any field, particularly in toxicology. However, the basis for this
deductive reasoning is also one of the most difficult aspects of causation to describe
quantitatively. If animal studies, pharmacological research on mechanisms of toxicity, in vitro tissue studies, and epidemiological research all document toxic effects
of exposure to a compound, an expert’s opinion about causation in a particular
case is much more likely to be true.111
109. See generally Calabrese, supra note 106.
110. The problem of differences in chemical sensitivity was addressed by the court in Gulf South
Insulation v. United States Consumer Product Safety Commission, 701 F.2d 1137 (5th Cir. 1983). The court
overturned the commission’s ban on urea-formaldehyde foam insulation because the commission failed
to document in sufficient detail the level at which segments of the population were affected and whether
their responses were slight or severe: “Predicting how likely an injury is to occur, at least in general terms,
is essential to a determination of whether the risk of that injury is unreasonable.” Id. at 1148.
111. Consistency of research results was considered by the court in Marsee v. United States Tobacco
Co., 639 F. Supp. 466, 469–70 (W.D. Okla. 1986). The defendant, the manufacturer of snuff alleged
to cause oral cancer, moved to exclude epidemiological studies conducted in Asia that demonstrate
a link between smokeless tobacco and oral cancer. The defendant also moved to exclude evidence
demonstrating that the nitrosamines and polonium-210 contained in the snuff are cancer-causing
agents in some 40 different species of laboratory animals. The court denied both motions, finding:
There was no dispute that both nitrosamines and polonium-210 are present in defendant’s snuff products. Further, defendant conceded that animal studies have accurately and consistently demonstrated that
these substances cause cancer in test animals. Finally, the Court found evidence based on experiments
with animals particularly valuable and important in this litigation since such experiments with humans
are impossible. Under all these circumstances, the Court found this evidence probative on the issue
of causation.
Id. See also sources cited supra note 14.
The more difficult problem is how to evaluate conflicting research results.
When different research studies reach different conclusions regarding toxicity, the
expert must be asked to explain how those results have been taken into account
in the formulation of the expert’s opinion.
V. Expert Qualifications
The basis of the toxicologist’s expert opinion in a specific case is a thorough
review of the research literature and treatises concerning effects of exposure to
the chemical at issue. To arrive at an opinion, the expert assesses the strengths and
weaknesses of the research studies. The expert also bases an opinion on fundamental concepts of toxicology relevant to understanding the actions of chemicals
in biological systems.
As the following series of questions indicates, no single academic degree,
research specialty, or career path qualifies an individual as an expert in toxicology.
Toxicology is a heterogeneous field. A number of indicia of expertise can be
explored, however, that are relevant to both the admissibility and weight of the
proffered expert opinion.
A. Does the Proposed Expert Have an Advanced Degree in
Toxicology, Pharmacology, or a Related Field? If the Expert
Is a Physician, Is He or She Board Certified in a Field
Such as Occupational Medicine?
A graduate degree in toxicology demonstrates that the proposed expert has a substantial background in the basic issues and tenets of toxicology. Many universities
have established graduate programs in toxicology. These programs are administered by the faculties of medicine, pharmacology, pharmacy, or public health.
Although most recent toxicology Ph.D. graduates have no other credentials, many highly qualified toxicologists are physicians or hold doctoral degrees
in related disciplines (e.g., veterinary medicine, pharmacology, biochemistry,
environmental health, or industrial hygiene). For a person with this type of background, a single course in toxicology is unlikely to provide sufficient background
for developing expertise in the field.
A proposed expert should be able to demonstrate an understanding of the discipline of toxicology, including statistics, toxicological research methods, and disease
processes. A physician without particular training or experience in toxicology is
unlikely to have sufficient background to evaluate the strengths and weaknesses of
toxicological research. Most practicing physicians have little knowledge of environmental and occupational medicine.112 Generally, physicians are quite knowledgeable
about the identification of effects and their treatment. The cause of these effects,
particularly if they are unrelated to the treatment of the disease, is generally of little
concern to the practicing physician. Subspecialty physicians may have particular
knowledge of a cause-and-effect relationship (e.g., pulmonary physicians have
knowledge of the relationship between asbestos exposure and asbestosis),113 but most
physicians have little training in chemical toxicology and lack an understanding of
exposure assessment and dose–response relationships. An exception is a physician
who is certified in medical toxicology as a subspecialty under the American Board
of Medical Specialties’ requirements, based on substantial training in toxicology and
successful completion of rigorous examinations, including recertification exams.114
112. For recent documentation of how rarely an occupational history is obtained, see B.J. Politi
et al., Occupational Medical History Taking: How Are Today’s Physicians Doing? A Cross-Sectional Investigation of the Frequency of Occupational History Taking by Physicians in a Major US Teaching Center. 46 J.
Occup. Envtl. Med. 550–55 (2004).
113. See, e.g., Moore v. Ashland Chem., Inc., 126 F.3d 679, 701 (5th Cir. 1997) (treating physician’s opinion admissible regarding causation of reactive airway disease); McCullock v. H.B. Fuller Co.,
61 F.3d 1038, 1044 (2d Cir. 1995) (treating physician’s opinion admissible regarding the effect of fumes
from hot-melt glue on the throat, where physician was board certified in otolaryngology and based his
opinion on medical history and treatment, pathological studies, differential etiology, and scientific literature); Benedi v. McNeil-P.P.C., Inc., 66 F.3d 1378, 1384 (4th Cir. 1995) (treating physician’s opinion
admissible regarding the causation of liver failure by mixture of alcohol and acetaminophen, based on
medical history, physical examination, laboratory and pathology data, and scientific literature—the same
methodologies used daily in the diagnosis of patients); In re Ephedra Prods. Liab. Litig., 478 F. Supp. 2d
624, 633 (S.D.N.Y. 2007) (opinion of treating physician will assist the trier of fact because a reasonable
juror would want to know what inferences a treating physician would make); Morin v. United States,
534 F. Supp. 2d 1179, 1185 (D. Nev. 2005) (treating physician does not have sufficient expertise to
offer opinion about whether exposure to jet fuel caused cancer in his patient).
Treating physicians also become involved in considering cause-and-effect relationships when
they are asked whether a patient can return to a situation in which an exposure has occurred. The
answer is obvious if the cause-and-effect relationship is clearly known. However, this relationship
is often uncertain, and the physician must consider the appropriate advice. In such situations, the
physician will tend to give advice as though causality were established, both out of appropriate
caution and because of fears concerning medicolegal issues.
114. Before 1990, the American Board of Medical Toxicology certified physicians, but beginning in 1990, medical toxicology became a subspecialty board under the American Board of Emergency Medicine, the American Board of Pediatrics, and the American Board of Preventive Medicine,
as recognized by the American Board of Medical Specialties.
Some physicians who are occupational health specialists also have training
in toxicology. Knowledge of toxicology is particularly strong among those who
work in the chemical, petrochemical, and pharmaceutical industries, in which
the surveillance of workers exposed to chemicals is a major responsibility. Of the
occupational physicians practicing today, only about 1000 have successfully completed the board examination in occupational medicine, which contains some
questions about chemical toxicology.115
B. Has the Proposed Expert Been Certified by the American
Board of Toxicology, Inc., or Does He or She Belong
to a Professional Organization, Such as the Academy of
Toxicological Sciences or the Society of Toxicology?
As of January 2008, more than 2000 individuals had received board certification
from the American Board of Toxicology. To sit for the examination, the candidate must be involved full time in the practice of toxicology, including designing
and managing toxicological experiments or interpreting results and translating
them to identify and solve human and animal health problems. Diplomates must
be recertified every 5 years. The Academy of Toxicological Sciences (ATS) was
formed to provide credentials in toxicology through peer review only. It does
not administer examinations for certification. Approximately 200 individuals are
certified as Fellows of ATS.
115. Clinical ecologists, another group of physicians, have offered opinions regarding multiple
chemical hypersensitivity and immune system responses to chemical exposures. These physicians
generally have a background in the field of allergy, not toxicology, and their theoretical approach is
derived in part from classic concepts of allergic responses and immunology. This theoretical approach
has often led clinical ecologists to find cause-and-effect relationships or low-dose effects that are not
generally accepted by toxicologists. Clinical ecologists often belong to the American Academy of
Environmental Medicine.
In 1991, the Council on Scientific Affairs of the American Medical Association concluded that
until “accurate, reproducible, and well-controlled studies are available . . . multiple chemical sensitivity
should not be considered a recognized clinical syndrome.” Council on Scientific Affairs, American
Med. Ass’n, Council Report on Clinical Ecology 6 (1991). In Bradley v. Brown, 42 F.3d 434, 438
(7th Cir. 1994), the court considered the admissibility of an expert opinion based on clinical ecology
theories. The court ruled the opinion inadmissible, finding that it was “hypothetical” and based on
anecdotal evidence as opposed to scientific research. See also Kropp v. Maine School Adm. Union No.
44, 471 F. Supp. 2d 175, 181–82 (D. Me. 2007) (expert physician does not rely upon scientifically
valid methodologies or data in reaching the conclusion that plaintiff is hypersensitive to phenol vapors
in indoor air); Coffin v. Orkin Exterminating Co., 20 F. Supp. 2d 107, 110 (D. Me. 1998); Frank v.
New York, 972 F. Supp. 130, 132 n.2 (N.D.N.Y. 1997). But see Elam v. Alcolac, Inc., 765 S.W.2d
42, 86 (Mo. Ct. App. 1988) (expert opinion based on clinical ecology theories admissible).
The Society of Toxicology (SOT), the major professional organization for
the field of toxicology, was founded in 1961 and has grown dramatically in recent
years. It now has 6300 members.116 Membership criteria are based either on
peer-reviewed publications or on the active practice of toxicology. Physician toxicologists can join the American College of Medical Toxicology and the American
Academy of Clinical Toxicologists. There are also societies of forensic toxicology,
such as the International Academy of Forensic Toxicology. Other organizations
in the field are the American College of Toxicology, for which experience in the
active practice of toxicology is the major membership criterion; the International
Society of Regulatory Toxicology and Pharmacology; and the Society of Occupational and Environmental Health. For membership, the last two organizations
require only the payment of dues.
C. What Other Criteria Does the Proposed Expert Meet?
The success of academic scientists in toxicology, as in other biomedical sciences,
usually is measured by the following types of criteria: the quality and number of
peer-reviewed publications, the ability to compete for research grants, service on
scientific advisory panels, and university appointments.
Publication of articles in peer-reviewed journals indicates an expertise in
toxicology. The number of articles, their topics, and whether the individual is
the principal or senior author are important factors in determining the expertise
of a toxicologist.117
Most research grants from government agencies and private foundations are
highly competitive. Successful competition for funding and publication of the
research findings indicate competence in an area.
Selection for local, national, and international regulatory advisory panels
usually implies recognition in the field. Examples of such panels are the NIH
Toxicology Study Section and panels convened by EPA, FDA, WHO, and IARC.
Recognized industrial organizations, including the American Petroleum Institute
and the Electric Power Research Institute, and public interest groups, such as
the Environmental Defense Fund and the Natural Resources Defense Council,
employ toxicologists directly and as consultants and enlist academic toxicologists
to serve on advisory panels. Because of a growing interest in environmental issues,
the demand for scientific advice has outgrown the supply of available toxicologists.
It is thus common for reputable toxicologists to serve on advisory panels.
116. There are currently 21 specialty sections of SOT that represent the different specialty areas
involved in understanding the wide range of toxic effects associated with exposure to chemical and
physical agents. These sections include mechanisms, molecular biology, inhalation toxicology, metals,
neurotoxicology, carcinogenesis, risk assessment, and immunotoxicology.
117. Examples of reputable, peer-reviewed journals are the Journal of Toxicology and Environmental
Health; Toxicological Sciences; Toxicology and Applied Pharmacology; Science; British Journal of Industrial
Medicine; Clinical Toxicology; Archives of Environmental Health; Journal of Occupational and Environmental
Medicine; Annual Review of Pharmacology and Toxicology; Teratogenesis, Carcinogenesis and Mutagenesis;
Fundamental and Applied Toxicology; Inhalation Toxicology; Biochemical Pharmacology; Toxicology Letters;
Environmental Research; Environmental Health Perspectives; International Journal of Toxicology; Human and
Experimental Toxicology; and American Journal of Industrial Medicine.
Finally, a university appointment in toxicology, risk assessment, or a related
field signifies an expertise in that area, particularly if the university has a graduate
education program in that area.
VI. Acknowledgments
The authors greatly appreciate the excellent research assistance provided by Eric
Topor and Cody S. Lonning.
Glossary of Terms
The following terms and definitions were adapted from a variety of sources,
including Office of Technology Assessment, U.S. Congress, Reproductive Health
Hazards in the Workplace (1985); Casarett and Doull’s Toxicology: The Basic
Science of Poisons (Curtis D. Klaassen ed., 7th ed. 2007); National Research
Council, Biologic Markers in Reproductive Toxicology (1989); Committee on
Risk Assessment Methodology, National Research Council, Issues in Risk Assessment (1993); M. Alice Ottoboni, The Dose Makes the Poison: A Plain-Language
Guide to Toxicology (2d ed. 1991); and Environmental and Occupational Health
Sciences Institute, Glossary of Environmental Health Terms (1989).
absorption. The taking up of a chemical into the body orally, through inhalation,
or through skin exposure.
acute toxicity. An immediate toxic response following a single or short-term
exposure to an agent or dosing.
additive effect. When exposure to more than one toxic agent results in the
same effect as would be predicted by the sum of the effects of exposure to
the individual agents.
antagonism. When exposure to one toxic agent causes a decrease in the effect
produced by another toxic agent.
benchmark dose. The benchmark dose is determined on the basis of dose–
response modeling and is defined as the exposure associated with a specified
low incidence of risk, generally in the range of 1% to 10%, of a health effect,
or the dose associated with a specified measure or change of a biological
effect.
bioassay. A test for measuring the toxicity of an agent by exposing laboratory
animals to the agent and observing the effects.
biological monitoring. Measurement of toxic agents or the results of their
metabolism in biological materials, such as blood, urine, expired air, or
biopsied tissue, to test for exposure to the toxic agents, or the detection of
physiological changes that are due to exposure to toxic agents.
biologically plausible theory. A biological explanation for the relationship
between exposure to an agent and adverse health outcomes.
carcinogen. A chemical substance or other agent that causes cancer.
carcinogenicity bioassay. Limited or long-term tests using laboratory animals
to evaluate the potential carcinogenicity of an agent.
chronic toxicity. A toxic response to long-term exposure or dosing with an
agent.
clinical ecologists. Physicians who believe that exposure to certain chemical agents can result in damage to the immune system, causing multiple-chemical hypersensitivity and a variety of other disorders. Clinical ecologists
often have a background in the field of allergy, not toxicology, and their
theoretical approach is derived in part from classic concepts of allergic
responses and immunology. There has been much resistance in the medical
community to accepting their claims.
clinical toxicology. The study and treatment of humans exposed to chemicals
and the quantification of resulting adverse health effects. Clinical toxicology
includes the application of pharmacological principles to the treatment of
chemically exposed individuals and research on measures to enhance elimination of toxic agents.
compound. In chemistry, the combination of two or more different elements in
definite proportions, which when combined acquire properties different from
those of the original elements.
confounding factors. Variables that are related to both exposure to a toxic
agent and the outcome of the exposure. A confounding factor can obscure
the relationship between the toxic agent and the adverse health outcome
associated with that agent.
differential diagnosis. A physician’s consideration of alternative diagnoses that
may explain a patient’s condition.
direct-acting agents. Agents that cause toxic effects without metabolic activation or conversion.
distribution. Movement of a toxic agent throughout the organ systems of the
body (e.g., the liver, kidney, bone, fat, and central nervous system). The rate
of distribution is usually determined by the blood flow through the organ
and the ability of the chemical to pass through the cell membranes of the
various tissues.
dose, dosage. A product of both the concentration of a chemical or physical
agent and the duration or frequency of exposure.
dose–response curve. A graphic representation of the relationship between the
dose of a chemical administered and the effect produced.
dose–response relationships. The extent to which a living organism responds
to specific doses of a toxic substance. The more time spent in contact with a
toxic substance, or the higher the dose, the greater the organism’s response.
For example, a small dose of carbon monoxide will cause drowsiness; a large
dose can be fatal.
epidemiology. The study of the occurrence and distribution of disease among
people. Epidemiologists study groups of people to discover the cause of a
disease, or where, when, and why disease occurs.
epigenetic. Pertaining to nongenetic mechanisms by which certain agents cause
diseases, such as cancer.
etiology. A branch of medical science concerned with the causation of diseases.
excretion. The process by which toxicants are eliminated from the body, including through the kidneys and urinary tract, the liver and biliary system, the
feces, the lungs, sweat, saliva, and lactation.
exposure. The intake into the body of a hazardous material. The main routes of
exposure to substances are through the skin, mouth, and lungs.
extrapolation. The process of estimating unknown values from known values.
good laboratory practice (GLP). Codes developed by the federal government
in consultation with the laboratory testing industry that govern many aspects
of laboratory standards.
hazard identification. In risk assessment, the qualitative analysis of all available
experimental animal and human data to determine whether and at what dose
an agent is likely to cause toxic effects.
hydrogeologists, hydrologists. Scientists who specialize in the movement of
ground and surface waters and the distribution and movement of contaminants in those waters.
immunotoxicology. A branch of toxicology concerned with the effects of toxic
agents on the immune system.
indirect-acting agents. Agents that require metabolic activation or conversion
before they produce toxic effects in living organisms.
inhalation toxicology. The study of the effect of toxic agents that are absorbed
into the body through inhalation, including their effects on the respiratory
system.
in vitro. A research or testing methodology that uses living cells in an artificial or
test tube system, or that is otherwise performed outside of a living organism.
in vivo. A research or testing methodology that uses living organisms.
lethal dose 50 (LD50). The dose at which 50% of laboratory animals die within
days to weeks.
lifetime bioassay. A bioassay in which doses of an agent are given to experimental animals throughout their lifetime. See bioassay.
maximum tolerated dose (MTD). The highest dose of an agent to which an
organism can be exposed without causing death or significant overt toxicity.
metabolism. The sum total of the biochemical reactions that a chemical produces
in an organism.
molecular toxicology. The study of how toxic agents interact with cellular
molecules, including DNA.
multiple-chemical hypersensitivity. A physical condition whereby individuals
react to many different chemicals at extremely low exposure levels.
multistage events. A model for understanding certain diseases, including some
cancers, based on the postulate that more than one event is necessary for the
onset of disease.
mutagen. A substance that causes physical changes in chromosomes or biochemical changes in genes.
mutagenesis. The process by which agents cause changes in chromosomes and
genes.
neurotoxicology. A branch of toxicology concerned with the effects of exposure
to toxic agents on the central nervous system.
no observable effect level (NOEL). The highest level of exposure to an agent
at which no effect is observed. It is the experimental equivalent of a threshold.
no-threshold model. A model for understanding disease causation that postulates
that any exposure to a harmful chemical (such as a mutagen) may increase
the risk of disease.
one-hit theory. A theory of cancer risk in which each molecule of a chemical
mutagen has a possibility, no matter how tiny, of mutating a gene in a manner
that may lead to tumor formation or cancer.
pharmacokinetics. A mathematical model that describes the movement of a
toxic agent through the organ systems of the body, including its delivery to
the target organ and its ultimate fate.
potentiation. The process by which the addition of one agent, which by itself
has no toxic effect, increases the toxicity of another agent when exposure to
both agents occurs simultaneously.
reproductive toxicology. The study of the effect of toxic agents on male and
female reproductive systems, including sperm, ova, and offspring.
risk assessment. The use of scientific evidence to estimate the likelihood of
adverse effects on the health of individuals or populations from exposure to
hazardous materials and conditions.
risk characterization. The final step of risk assessment, which summarizes information about an agent and evaluates it in order to estimate the risks it poses.
safety assessment. Toxicological research that tests the toxic potential of a chemical in vivo or in vitro using standardized techniques required by governmental
regulatory agencies or other organizations.
structure–activity relationships (SAR). A method used by toxicologists to
predict the toxicity of new chemicals by comparing their chemical structures
with those of compounds with known toxic effects.
synergistic effect. When two toxic agents acting together have an effect greater
than that predicted by adding together their individual effects.
target organ. The organ system that is affected by a particular toxic agent.
target-organ dose. The dose to the organ that is affected by a particular toxic
agent.
teratogen. An agent that changes eggs, sperm, or embryos, thereby increasing
the risk of birth defects.
teratogenic. The ability to produce birth defects. (Teratogenic effects do not pass
to future generations.) See teratogen.
threshold. The level above which effects will occur and below which no effects
occur. See no observable effect level.
toxic. Of, relating to, or caused by a poison—or a poison itself.
toxic agent or toxicant. An agent or substance that causes disease or injury.
toxicology. The science of the nature and effects of poisons, their detection, and
the treatment of their effects.
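Several of the quantitative terms defined above (dose–response curve, dose–response relationships, lethal dose 50) can be illustrated with a small computation. The sketch below, with entirely hypothetical dose–mortality numbers and an illustrative function name, estimates an LD50 by interpolating mortality linearly against log dose between the two tested doses that bracket 50% mortality. This is only a rough sketch of the idea; actual toxicological practice fits formal probit or logistic dose–response models rather than interpolating between two points.

```python
import math

# Hypothetical dose-mortality data for illustration only:
# (dose in mg/kg, fraction of test animals dying at that dose).
data = [(10, 0.05), (32, 0.20), (100, 0.55), (320, 0.90)]

def ld50_log_interp(points):
    """Estimate the LD50 by interpolating mortality linearly
    against log10(dose) between the two doses that bracket 50%."""
    pts = sorted(points)
    for (d_lo, p_lo), (d_hi, p_hi) in zip(pts, pts[1:]):
        if p_lo <= 0.5 <= p_hi:
            frac = (0.5 - p_lo) / (p_hi - p_lo)
            log_d = math.log10(d_lo) + frac * (math.log10(d_hi) - math.log10(d_lo))
            return 10 ** log_d
    raise ValueError("50% mortality is not bracketed by the observed doses")

print(round(ld50_log_interp(data), 1))  # roughly 85 mg/kg for these made-up numbers
```

Dose is interpolated on a logarithmic scale because dose–response curves for many agents are approximately linear in log dose over the middle of their range, which is also why toxicologists conventionally plot them that way.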
References on Toxicology
A Textbook of Modern Toxicology (Ernest Hodgson ed., 4th ed. 2010).
Casarett and Doull’s Toxicology: The Basic Science of Poisons (Curtis D. Klaassen ed., 7th ed. 2007).
Committee on Toxicity Testing and Assessment of Environmental Agents,
National Research Council, Toxicity Testing in the 21st Century: A Vision
and a Strategy (2007).
Environmental Toxicants (Morton Lippmann ed., 3d ed. 2009).
Patricia Frank & M. Alice Ottoboni, The Dose Makes the Poison: A Plain-Language Guide to Toxicology (3d ed. 2011).
Genetic Toxicology of Complex Mixtures (Michael D. Waters et al. eds., 1990).
Human Risk Assessment: The Role of Animal Selection and Extrapolation (M.
Val Roloff ed., 1987).
In Vitro Toxicity Testing: Applications to Safety Evaluation (John M. Frazier ed.,
1992).
Michael A. Kamrin, Toxicology: A Primer on Toxicology Principles and Applications (1988).
Frank C. Lu, Basic Toxicology: Fundamentals, Target Organs, and Risk Assessment (4th ed. 2002).
National Research Council, Biologic Markers in Reproductive Toxicology
(1989).
Alan Poole & George B. Leslie, A Practical Approach to Toxicological Investigations (1989).
Principles and Methods of Toxicology (A. Wallace Hayes ed., 5th ed. 2008).
Joseph V. Rodricks, Calculated Risks (2d ed. 2006).
Short-Term Toxicity Tests for Nongenotoxic Effects (Philippe Bourdeau et al.
eds., 1990).
Toxic Interactions (Robin S. Goldstein et al. eds., 1990).
Toxic Substances and Human Risk: Principles of Data Interpretation (Robert G.
Tardiff & Joseph V. Rodricks eds., 1987).
Toxicology (Hans Marquardt et al. eds., 1999).
Toxicology and Risk Assessment: Principles, Methods, and Applications (Anna
M. Fan & Louis W. Chang eds., 1996).
Reference Guide on
Medical Testimony
JOHN B. WONG, LAWRENCE O. GOSTIN, AND OSCAR A. CABRERA
John B. Wong, M.D., is Chief of the Division of Clinical Decision Making, Informatics, and
Telemedicine at the Institute for Clinical Research and Health Policy Studies, Tufts Medical
Center, and Professor of Medicine at Tufts University School of Medicine.
Lawrence O. Gostin, J.D., is Linda D. and Timothy J. O’Neill Professor of Global Health
Law and Faculty Director of O’Neill Institute for National and Global Health Law, Georgetown
University Law Center.
Oscar A. Cabrera, Abogado, LL.M., is Deputy Director of the O’Neill Institute for National
and Global Health Law and Adjunct Professor of Law, Georgetown University Law Center.
CONTENTS
I. Introduction, 689
II. Medical Testimony Introduction, 689
A. Medical Versus Legal Terminology, 689
B. Applicability of Daubert v. Merrell Dow Pharmaceuticals, Inc., 692
C. Relationship of Medical Reasoning to Legal Reasoning, 693
III. Medical Care, 695
A. Medical Education and Training, 695
1. Medical school, 695
2. Postgraduate training, 697
3. Licensure and credentialing, 698
4. Continuing medical education, 700
B. Organization of Medical Care, 700
C. Patient Care, 702
1. Goals, 702
2. Patient-physician encounters, 703
IV. Medical Decisionmaking, 704
A. Diagnostic Reasoning, 704
1. Clinical reasoning process, 705
2. Probabilistic reasoning and Bayes’ rule, 707
3. Causal reasoning, 714
B. Testing, 717
1. Screening, 717
2. Diagnostic testing, 719
3. Prognostic testing, 721
C. Judgment and Uncertainty in Medicine, 721
1. Variation in medical care, 721
2. Evidence-based medicine, 722
3. Hierarchy of medical evidence, 723
4. Guidelines, 726
5. Vicissitudes of therapeutic decisionmaking, 728
D. Informed Consent, 734
1. Principles and standards, 734
2. Risk communication, 737
3. Shared decisionmaking, 739
V. Summary and Future Directions, 740
Glossary of Terms, 742
References on Medical Testimony, 745
I. Introduction
Physicians are a common sight in today’s courtroom. A survey of federal judges
published in 2002 indicated that medical and mental health experts constituted
more than 40% of the total number of testifying experts.1 Medical evidence is
a common element in product liability suits,2 workers’ compensation disputes,3
medical malpractice suits,4 and personal injury cases.5 Medical testimony may also
be critical in certain kinds of criminal cases.6 The goal of this reference guide is to
introduce the basic concepts of diagnostic reasoning and clinical decisionmaking,
as well as the types of evidence that physicians use to make judgments as treating physicians or as experts retained by one of the parties in a case. Following
this introduction (Section I), Section II identifies a few overarching theoretical
issues that courts face in translating the methods and techniques customary in the
medical profession in a manner that will serve the court’s inquiry. Sections III
and IV describe medical education and training, the organization of medical care,
the elements of patient care, and the processes of diagnostic reasoning and medical judgment. When relevant, each subsection includes examples from case law
illustrating how the topic relates to legal issues.
II. Medical Testimony Introduction
A. Medical Versus Legal Terminology
Because medical testimony is common in the courtroom generally and indispensable to certain kinds of cases, courts have employed some medical terms in ways
1. Joe S. Cecil, Ten Years of Judicial Gatekeeping Under Daubert, 95 Am. J. Pub. Health S74–S80
(2005).
2. See, e.g., In re Bextra & Celebrex Mktg. Sales Practices and Prod. Liab., 524 F. Supp. 2d 1166
(N.D. Cal. 2007) (thoroughly reviewing the proffered testimony of plaintiff’s expert cardiologist and
neurologist in a products liability suit alleging that defendant’s arthritis pain medication caused serious
cardiovascular injury).
3. See, e.g., AT&T Alascom v. Orchitt, 161 P.3d 1232 (Alaska 2007) (affirming the decision
of the state workers’ compensation board and rejecting appellant’s challenges to worker’s experts).
4. Schneider ex rel. Estate of Schneider v. Fried, 320 F.3d 396 (3d Cir. 2003) (allowing a
physician to testify in a malpractice case regarding whether administering a particular drug during
angioplasty was within the standard of care).
5. See, e.g., Epp v. Lauby, 715 N.W.2d 501 (Neb. 2006) (detailing the opinions of two physicians
regarding whether plaintiff’s fibromyalgia resulted from an automobile accident with two defendants).
6. Medical evidence will be at issue in numerous kinds of criminal cases. See State v. Price, 171
P.3d 293 (Mont. 2007) (an assault case in which a physician testified regarding the potential for a
stun gun to cause serious bodily harm); People v. Unger, 749 N.W.2d 272 (Mich. Ct. App. 2008) (a
second-degree murder case involving testimony of a forensic pathologist and neuropathologist); State
v. Greene, 951 So. 2d 1226 (La. Ct. App. 2007) (a child sexual battery and child rape case involving
the testimony of a board-certified pediatrician).
that differ from their use by the medical profession. Differential diagnosis, for
example, is an accepted method that a medical expert may employ to offer expert
testimony that satisfies Daubert.7 In the legal context, differential diagnosis refers
to a technique “in which physician first rules in all scientifically plausible causes
of plaintiff’s injury, then rules out least plausible causes of injury until the most
likely cause remains, thereby reaching conclusion as to whether defendant’s product caused injury. . . .”8 In the medical context, by contrast, differential diagnosis
7. See, e.g., Feliciano-Hill v. Principi, 439 F.3d 18, 25 (1st Cir. 2006) (“[W]hen an examining
physician calls upon training and experience to offer a differential diagnosis . . . most courts have
found no Daubert problem.”); Clausen v. M/V New Carissa, 339 F.3d 1049, 1058–59 (9th Cir. 2003)
(recognizing differential diagnosis as a valid methodology); Mattis v. Carlon Elec. Prods., 295 F.3d
856, 861 (8th Cir. 2002) (“A medical opinion based upon a proper differential diagnosis is sufficiently
reliable to satisfy [Daubert.]”); Westberry v. Gislaved Gummi AB, 178 F.3d 257, 262 (4th Cir. 1999)
(recognizing differential diagnosis as a reliable technique).
8. Wilson v. Taser Int’l, Inc., 2008 WL 5215991, at *5 (11th Cir. Dec. 16, 2008) (“[N]onetheless,
Dr. Meier did not perform a differential diagnosis or any tests on Wilson to rule out osteoporosis
and these corresponding alternative mechanisms of injury. Although a medical expert need not rule
out every possible alternative in order to form an opinion on causation, expert opinion testimony is
properly excluded as unreliable if the doctor ‘engaged in very few standard diagnostic techniques by
which doctors normally rule out alternative causes and the doctor offered no good explanation as to
why his or her conclusion remained reliable’ or if ‘the defendants pointed to some likely cause of the
plaintiff’s illness other than the defendants’ action and [the doctor] offered no reasonable explanation
as to why he or she still believed that the defendants’ actions were a substantial factor in bringing
about that illness.’”); Williams v. Allen, 542 F.3d 1326, 1333 (11th Cir. 2008) (“Williams also offered
testimony from Dr. Eliot Gelwan, a psychiatrist specializing in psychopathology and differential
diagnosis. Dr. Gelwan conducted a thorough investigation into Williams’ background, relying on a
wide range of data sources. He conducted extensive interviews with Williams and with fourteen other
individuals who knew Williams at various points in his life.”) (involving a capital murder defendant
petitioning for habeas corpus and offering a supporting expert witness); Bland v. Verizon Wireless, L.L.C.,
538 F.3d 893, 897 (8th Cir. 2008) (“Bland asserts Dr. Sprince conducted a differential diagnosis which
supports Dr. Sprince’s causation opinion. We have held, ‘a medical opinion about causation, based
upon a proper differential diagnosis is sufficiently reliable to satisfy Daubert.’ A ‘differential diagnosis
[is] a technique that identifies the cause of a medical condition by eliminating the likely causes until
the most probable cause is isolated.’”) (stating expert’s incomplete execution of differential diagnosis
procedure rendered expert testimony unsatisfactory for Daubert standard) (citations omitted); Lash v.
Hollis, 525 F.3d 636, 640 (8th Cir. 2008) (“Further, even if the treating physician had specifically opined
that the Taser discharges caused rhabdomyolysis in Lash Sr., the physician offered no explanation of
a differential diagnosis or other scientific methodology tending to show that the Taser shocks were
a more likely cause than the myriad other possible causes suggested by the evidence.”) (finding lack
of expert testimony with differential diagnosis enough to render evidence insufficient for jury to find
causation in personal injury suit); Feit v. Great West Life & Annuity Ins. Co., 271 Fed. App’x. 246,
254 (3d Cir. 2008) (“However, although this Court generally recognizes differential diagnosis as a
reliable methodology the differential diagnosis must be properly performed in order to be reliable. To
properly perform a differential diagnosis, an expert must perform two steps: (1) ‘Rule in’ all possible
causes of Dr. Feit’s death and (2) ‘Rule out’ causes through a process of elimination whereby the last
remaining potential cause is deemed the most likely cause of death.”) (ruling that district court not
in error for excluding expert medical testimony that relied on an improperly performed differential
diagnosis) (citations omitted); Glastetter v. Novartis Pharms. Corp., 252 F.3d 986 (8th Cir. 2001).
refers to a set of diseases that physicians consider as possible causes for symptoms
the patient is suffering or signs that the patient exhibits.9 By identifying the likely
potential causes of the patient’s disease or condition and weighing the risks and
benefits of additional testing or treatment, physicians then try to determine the
most appropriate approach—testing, medication, or surgery, for example.10
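The legal formulation of differential diagnosis quoted above (rule in all scientifically plausible causes, then rule out the least plausible until the most likely cause remains) can be pictured as a simple elimination procedure. The sketch below is purely illustrative: the candidate causes and plausibility scores are hypothetical, and real diagnostic reasoning weighs evolving clinical evidence rather than fixed numbers.

```python
def differential_elimination(candidates):
    """Toy model of the two-step process: (1) "rule in" all plausible
    candidate causes, then (2) "rule out" the least plausible cause,
    one at a time, until a single most likely cause remains."""
    remaining = dict(candidates)  # step 1: every plausible cause is ruled in
    while len(remaining) > 1:
        weakest = min(remaining, key=remaining.get)
        del remaining[weakest]  # step 2: rule out the least plausible cause
    return next(iter(remaining))

# Hypothetical plausibility scores, for illustration only.
causes = {"cause A": 0.60, "cause B": 0.25, "cause C": 0.15}
print(differential_elimination(causes))  # -> cause A
```

The sketch mirrors the two-step description courts have given (rule in, then rule out by elimination); it is not a clinical method, and in practice each "rule out" step rests on examinations and test results rather than a precomputed score.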
Less commonly, courts have used the term “differential etiology”
interchangeably with differential diagnosis.11 In medicine, etiology refers to the
study of causation in disease,12 but differential etiology is a legal invention not
used by physicians. In general, both differential etiology and differential diagnosis
are concerned with establishing or refuting causation between an external cause
and a plaintiff’s condition. Depending on the type of case and the legal standard,
a medical expert may testify in regard to specific causation, general causation, or
both. General causation refers to whether the plaintiff’s injury could have been
caused by the defendant, or a product produced by the defendant, while specific
causation is established only when the defendant’s action or product actually
caused the harm.13 An opinion by a testifying physician may be offered in support
of both kinds of causation.14
Courts also refer to medical certainty or probability in ways that differ from
their use in medicine. The standards “reasonable medical certainty” and “reasonable medical probability” are also terms of art in the law that have no analog for a
practicing physician.15 As is detailed in Section IV, diagnostic reasoning and medical evidence are aimed at recommending the best therapeutic option for a patient. Although most courts have interpreted “reasonable medical certainty” to mean a preponderance of the evidence,16 physicians often work with multiple hypotheses while diagnosing and treating a patient without any “standard of proof” to satisfy.
9. Stedman’s Medical Dictionary 531 (28th ed. 2006) (defining differential diagnosis as “the determination of which of two or more diseases with similar symptoms is the one from which the patient is suffering, by a systematic comparison and contrasting of the clinical findings.”).
10. The Concise Dictionary of Medical-Legal Terms 36 (1998) (definition of differential diagnosis).
11. See Proctor v. Fluor Enters., Inc., 494 F.3d 1337 (11th Cir. 2007) (testifying medical expert employed differential etiology to reach a conclusion regarding the cause of plaintiff’s stroke). But see McClain v. Metabolife Int’l, Inc., 401 F.3d 1233, 1252 (11th Cir. 2005) (distinguishing differential diagnosis from differential etiology, with the former closer to the medical definition and the latter employed as a technique to determine external causation).
12. Stedman’s Medical Dictionary 675 (28th ed. 2006) (defining etiology as “the science and study of the causes of disease and their mode of operation. . . .”). For a discussion of the term “etiology” in epidemiology studies, see Michael D. Green et al., Reference Guide on Epidemiology, Section I, in this manual.
13. See Amorgianos v. Nat’l R.R. Passenger Corp., 303 F.3d 256, 268 (2d Cir. 2002).
14. See, e.g., Ruggiero v. Warner-Lambert Co., 424 F.3d 249 (2d Cir. 2005) (excluding testifying expert’s differential diagnosis in support of a theory of general causation because it was not supported by sufficient evidence).
15. See, e.g., Dallas v. Burlington N., Inc., 689 P.2d 273, 277 (Mont. 1984) (“‘[R]easonable medical certainty’ standard; the term is not well understood by the medical profession. Little, if anything, is ‘certain’ in science. The term was adopted in law to assure that testimony received by the fact finder was not merely conjectural but rather was sufficiently probative to be reliable”). This reference guide will not probe substantive legal standards in any detail, but there are substantive differences in admissibility standards for medical evidence between federal and state courts. See Robin Dundis Craig, When Daubert Gets Erie: Medical Certainty and Medical Expert Testimony in Federal Court, 77 Denv. U. L. Rev. 69 (1999).
Statutes and administrative regulations may also contain terms that are borrowed, often imperfectly, from the medical profession. In these cases, the court
may need to examine the intent of the legislature and the term’s usage in the
medical profession.17 If no intent is apparent, the court may need to determine
whether the medical definition is the most appropriate one to apply to the statutory language. Whether the language is a term of art or a question of law will
often dictate the admissibility and weight of evidence.18
B. Applicability of Daubert v. Merrell Dow Pharmaceuticals,
Inc.
The Supreme Court’s decision in Daubert v. Merrell Dow Pharmaceuticals, Inc.,19
changed the way that judges screen expert testimony. A 2002 study by the RAND
Corporation indicated that after Daubert, judges began scrutinizing expert testimony much more closely and began more aggressively excluding evidence that
does not meet its standards.20 Despite the Court’s subsequent decisions in General
Electric Co. v. Joiner21 and Kumho Tire Co. v. Carmichael22 further defining the
16. See, e.g., Sharpe v. United States, 230 F.R.D. 452, 460 (E.D. Va. 2005) (“It is not enough
for the plaintiff’s expert to testify that the defendant’s negligence might or may have caused the injury
on which the plaintiff bases her claim. The expert must establish that the defendant’s negligence was
‘more likely’ or ‘more probably’ the cause of the plaintiff’s injury . . . ”).
17. See, e.g., Feltner v. Lamar Adver., Inc., 83 F. App’x 101 (6th Cir. 2003) (holding that the
statutory definition of “permanent total disability” under the Tennessee Workers Compensation Act
was not the same as the medical definition); Endorf v. Bohlender, 995 P.2d 896 (Kan. Ct. App. 2000)
(a medical malpractice case reversing a lower court’s interpretation of the statutory phrase “clinical
practice” because it did not comport with the legislature’s intent that the statutory meaning reflect
the medical definition).
18. See, e.g., Coleman v. Workers’ Comp. Appeal Bd. (Ind. Hosp.), 842 A.2d 349 (Pa. 2004)
(holding that since the legislature did not define the medical term “physical examination,” the
common usage of the term is more appropriate than the strict medical definition).
19. 509 U.S. 579 (1993).
20. Lloyd Dixon & Brian Gill, Changes in the Standards for Admitting Expert Evidence in
Federal Civil Cases Since the Daubert Decision (2002).
21. 522 U.S. 136 (1997) (holding that the trial court had properly excluded expert testimony
extrapolated from animal studies and epidemiological studies).
22. 526 U.S. 137 (1999). In Kumho, the Court made clear that Daubert applies to all expert
testimony and not just “scientific” testimony. Although the case involved a defect in tires, courts
before Kumho were divided on whether expert medical opinion based on experience or clinical
medical testimony were subject to Daubert. See also Joe S. Cecil, Ten Years of Judicial Gatekeeping Under
Daubert, 95 Am. J. Pub. Health S74–S80 (2005). See also Lawrence O. Gostin, Public Health Law:
Power, Duty, Restraint (2d ed. 2008).
Daubert standard, federal and state courts have sometimes employed conflicting
interpretations of what Daubert requires from testifying physicians.
The standard of review is an important factor in understanding how Daubert
has engendered seemingly inconsistent results. The Supreme Court adopted an
abuse of discretion standard in Joiner23 and affirmed it in Kumho.24 Although in
most product liability cases the courts reached the same conclusion, inconsistent
determinations regarding the admissibility of similar evidence may not constitute
an abuse of discretion under the federal standard of review or in states with a
similar standard.25
C. Relationship of Medical Reasoning to Legal Reasoning
As Section II.A suggested, the goal that guides the physician—recommending
the best therapeutic options for the patient—means that diagnostic reasoning and
the process of ongoing patient care and treatment involve probabilistic judgments
concerning several working hypotheses, often simultaneously. When a court
requires a testifying physician to offer evidence “to a reasonable medical certainty”
or “reasonable medical probability,” it is supplying the expert with a legal rule to
which his or her testimony must conform.26 In other words, a lawyer often will
23. 522 U.S. at 143.
24. 526 U.S. at 142.
25. Hollander v. Sandoz Pharm. Corp., 289 F.3d 1193, 1207 (10th Cir. 2002); see also Brasher
v. Sandoz Pharm. Corp., 160 F. Supp. 2d 1291, 1298 n.17 (N.D. Ala. 2001); Reichert v. Phipps, 84
P.3d 353, 358 (Wyo. 2004).
26. Courts have occasionally noted the tension between medical reasoning and legal
reasoning when applying the reasonable medical certainty or reasonable medical probability standards.
See Clark v. Arizona, 548 U.S. 735, 777 (2006) (“When . . . ‘ultimate issue’ questions are formulated
by the law and put to the expert witness who must then say ‘yea’ or ‘nay,’ then the expert witness is
required to make a leap in logic. He no longer addresses himself to medical concepts but instead must
infer or intuit what is in fact unspeakable, namely, the probable relationship between medical concepts
and legal or moral constructs such as free will. These impermissible leaps in logic made by expert
witnesses confuse the jury. . . .”); Rios v. City of San Jose, 2008 U.S. Dist. LEXIS 84923, at *4
(N.D. Cal. Oct. 9, 2008) (“In their fifth motion, plaintiffs seek to exclude the testimony of Dr. Brian
Peterson who defendants designated to testify, among other subjects, about the ‘proximate cause’ of
Rios’ death. As the use of terms that also carry legal significance could confuse the jury, the motion is
granted in part, and defendants are instructed to distinguish between medical and legal terms such as
proximate cause to the extent possible. Where such terms must be used by the witness consistent with
the language employed in his field of expertise, the parties shall craft a limiting instruction to advise
the jury of the distinction between those terms and the issues they will be called upon to determine.”);
Norland v. Wash. Gen. Hosp., 461 F.2d 694, 697 (8th Cir. 1972) (“The use of the terms ‘probable’
and ‘possible’ as a basis for test of qualification or lack of qualification in respect to a medical opinion
has frequently converted this aspect of a trial into a mere semantic ritual or hassle. The courts have
come to recognize that the competency of a physician’s testimony cannot soundly be permitted to
turn on a mechanical rule of law as to which of the two terms he has employed. Regardless of which
term he may have used, if his testimony is such in nature and basis of hypothesis as to judicially impress
that the opinion expressed represents his professional judgment as to the most likely one among the
possible causes of the physical condition involved, the court is entitled to admit the opinion and leave its weight to the jury.”).
need to explain the legal standard to the physician, who will then shape the form and content of his or her testimony in a manner that serves the legal inquiry.27
Legal standards will shape how physicians testify in a number of other ways. Although treating physicians generally are concerned less with discovering the actual causes of a disease than with treating the patient, the testifying medical expert will need to tailor his or her opinions in a way that conforms to the legal standard of causation. As Section IV will demonstrate, when analyzing the patient’s symptoms and making a judgment based on the available medical evidence, a physician will not expressly identify a “proximate cause” or “substantial factor.” For example, in order to recommend treatment, a physician does not necessarily need to determine whether a patient’s lung ailment was more likely the result of a long history of tobacco use or prolonged exposure to asbestos if the optimal treatment is the same. In contrast, when testifying as an expert in a case in which an employee with a long history of tobacco use is suing his employer for possible injuries as a result of asbestos exposure in the workplace, physicians may need to make judgments regarding the likelihood that either tobacco or asbestos—or both—could have contributed to the injury.28
Physicians often will be asked to testify about patients from whom they have never taken a medical history or whom they have never examined, and to make estimates about proximate cause, increased risk of injury, or likely future injuries.29 The doctor may even need to make medical judgments about a deceased litigant.30 Testifying in all such cases requires making judgments that physicians do not ordinarily make in their profession, making these judgments outside of physicians’ customary patient encounters, and adapting the opinion in a way that fits the legal standard. The purpose of this guide is not to describe or recommend competing legal standards, whether it be the standard of proof, causation, admissibility, or the applicable standard of care in medical malpractice cases. Instead, it aims to introduce the practice of medicine to federal and state judges, emphasizing the tools and methods that
27. There are several cases that demonstrate the difficulty that physicians sometimes have in
adapting their testimony to the legal standard. See Schrantz v. Luancing, 527 A.2d 967 (N.J. Super.
Ct. Law Div. 1986) (malpractice case in which the medical expert’s opinion was inadequate because
of her understanding of “reasonable medical certainty”).
28. Physicians will testify as experts in cases in which the plaintiff’s condition may be the result
of multiple causes. In these cases, the divergence between medical reasoning and legal reasoning is
very apparent. See, e.g., Tompkin v. Philip Morris USA, Inc., 362 F.3d 882 (6th Cir. 2004) (affirming
district court’s conclusion that testimony offered by the defendant’s expert regarding the decedent’s
work-related asbestos exposure was not prejudicial in a suit against a tobacco company on behalf
of plaintiff’s deceased husband); Mobil Oil Corp. v. Bailey, 187 S.W.3d 265 (Tex. Ct. App. 2006)
(involving claims from a worker who had a long history of tobacco use that exposure to asbestos
increased his risk of cancer).
29. See, e.g., Tompkin, 362 F.3d 882.
30. See, e.g., id.
doctors use to make decisions and highlighting the challenges in adapting them
when testifying as medical experts.
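The divergence described in this section can be made concrete with a toy numerical sketch (all names and probabilities below are hypothetical): a treating physician may carry several working hypotheses simultaneously with no single threshold to clear, whereas the legal preponderance standard reduces to a more-likely-than-not test applied to one asserted cause, such as the asbestos exposure in the example above.

```python
# Hypothetical likelihoods a physician might entertain for competing
# explanations of a lung ailment (illustrative numbers only).
hypotheses = {"asbestos exposure": 0.55, "tobacco use": 0.35, "other": 0.10}

# Clinically, every non-trivial hypothesis may remain in the workup,
# with no single "standard of proof" to satisfy.
workup = [cause for cause, p in hypotheses.items() if p >= 0.10]

def meets_preponderance(p):
    """Legal "more likely than not" test: probability must exceed one half."""
    return p > 0.5

print(workup)                                                # all three remain
print(meets_preponderance(hypotheses["asbestos exposure"]))  # True
print(meets_preponderance(hypotheses["tobacco use"]))        # False
```

The point of the sketch is the asymmetry: the clinician’s list need never collapse to one answer, while the testifying expert is asked whether a single candidate cause crosses the legal threshold.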
Sections III and IV of this guide explain in great detail the practice of
medicine, including medical education, the structure of health care, and, most
importantly, the methods that physicians use to diagnose and treat their patients.
Special attention is given to the physician–patient relationship and to the types of
evidence that physicians use to make medical judgments. In an effort to make each
issue more salient, examples from case law are offered when they are illustrative.
III. Medical Care
A. Medical Education and Training
1. Medical school
The Association of American Medical Colleges (AAMC) consists of 133 accredited U.S. medical schools and 17 Canadian medical schools.31 The Liaison Committee on Medical Education performs the accreditation for the AAMC and assesses
the quality of postsecondary education by determining whether each institution or
program meets established standards for function, structure, and performance. The
goal of medical school is to prepare students in the art and science of medicine for
graduate medical education.32 Of the 4 years of medical school, the first 2 years are
typically spent studying preclinical basic sciences involving the study of the normal
structure and function of human systems (e.g., through anatomy, biochemistry,
physiology, behavioral science, and neuroscience), followed by the study of
abnormalities and therapeutic principles (e.g., through microbiology, immunology, pharmacology, and pathology). The final 2 years involve clinical experience,
including rotations in patient care settings such as clinics or hospitals with required
“core” clerkships in internal medicine, pediatrics, psychiatry, surgery, obstetrics/
gynecology, and family medicine. All physicians who wish to be licensed must pass
the United States Medical Licensing Examination Steps 1, 2, and 3.33
31. Association of American Medical Colleges, Membership, available at https://www.aamc.org/
about/membership/ (last visited Feb. 12, 2011).
32. See Davis v. Houston Cnty., Ala. Bd. of Educ., 2008 WL 410619 (M.D. Ala. Feb. 13,
2008) (finding that an individual with no medical training was not qualified to give expert testimony).
33. Planned Parenthood Cincinnati Region v. Taft, 444 F.3d 502, 515 (6th Cir. 2006) (“The
State has not appealed the district court’s order refusing to recognize Dr. Crockett as an expert in
the critical review of medical literature. Although that order has not been placed before us, the only
reason the district court gave for her ruling was that Dr. Crockett did not have any specific training
in the critical review of medical literature beyond the training incorporated in her general medical
school and residency training. This ruling ignored Dr. Crockett’s testimony that her residency program
at Georgetown University put particular emphasis on training residents in the critical review of
medical literature, that she had taught classes on the subject, that she had done extensive reading and
self-education on the subject, and that she had critically reviewed medical literature for the FDA. If these qualifications are not sufficient to demonstrate expertise, this court is hard-pressed to imagine what qualifications would suffice.”); Davis v. Houston Cnty., Ala. Bd. of Educ., 2008 WL 410619, at *4 (M.D. Ala. Feb. 13, 2008) (“The Board has moved to exclude all evidence of Freet’s opinions and conclusions related to the cause of Joshua Davis’s behavior at the football game contained in his deposition as well as Freet’s letter to Malcolm Newman. The Board argues that Freet is not qualified to give expert testimony, and that Plaintiff failed to comply with Fed. R. Civ. P. 26(a)(2)(B) by not providing a report of Freet’s testimony that includes all of the information required by Rule 26(a)(2)(B). . . . In order to consider Freet’s expert opinions, this Court must find that Freet meets the requirements of Fed. R. Evid. 702. Rule 702 requires an expert to be qualified by ‘knowledge, skill, experience, training, or education.’ Freet is not a medical doctor and never attended medical school. The only evidence of Freet’s qualifications are: approximately five years working for the Department of Veterans Affairs in the vocational rehabilitation program, followed by approximately seven years working in private practice as a ‘licensed professional counselor.’ There is no evidence in the record of Freet’s educational background, or any details of the exact nature of Freet’s work experience.”); Therrien v. Town of Jay, 489 F. Supp. 2d 116, 117 (D. Me. 2007) (“Citing Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 113 S. Ct. 2786, 125 L. Ed. 2d 469 (1993) and Rule 702 of the Federal Rules of Evidence, Officer Gould’s first objection is that Dr. Harding does not possess sufficient expertise to express expert opinions about ‘the mechanism and timing of Plaintiff’s injuries.’ This objection is not well taken. Dr. Harding was graduated from Dartmouth College and Georgetown Medical School; he completed a residency in internal medicine, is board certified in internal medicine, and has been licensed to practice medicine in the state of Maine since 1978.”). United States Medical Licensing Examination, Examinations, available at http://www.usmle.org/Examinations/index.html (last visited Aug. 9, 2011).
In the United States, in addition to the more than 941,000 physicians, there are more than 61,000 doctors of osteopathy. The Commission on Osteopathic College Accreditation accredits 25 colleges of osteopathic medicine. Training is similar to that for medical physicians but with additional “special attention on the musculoskeletal system which reflects and influences the condition of all other body systems.”34 About 25% of current U.S. physicians are foreign medical graduates, including both U.S. citizens and foreign nationals.35 Because educational standards and curricula outside the United States and Canada vary, the Educational Commission for Foreign Medical Graduates has developed a certification exam to assess whether these graduates may enter Accreditation Council for Graduate Medical Education (ACGME) accredited residency and fellowship programs.36
34. American Osteopathic Association, What is a DO? available at http://www.
osteopathic.org/osteopathic-health/about-dos/what-is-a-do/Pages/default.aspx (last visited Feb. 12,
2011); American Osteopathic Association, About Osteopathic Medicine, available at http://
www.osteopathic.org/osteopathic-health/about-dos/about-osteopathic-medicine/Pages/default.aspx
(last visited Feb. 12, 2011).
35. American Medical Association, Physician Characteristics and Distribution in the U.S. (2009).
36. Educational Commission for Foreign Medical Graduates, About ECFMG, available at http://www.ecfmg.
org/about.html (last visited Feb. 12, 2011).
Reference Guide on Medical Testimony
2. Postgraduate training
After graduating from medical school, most physicians undergo additional training
in a residency program in a chosen specialty.37 Residencies typically range from 3 to
7 years at teaching hospitals and academic medical centers where residents care for
patients while being supervised by physician faculty and participating in educational
and research activities.38 After graduating from an accredited residency program,
physicians become eligible to take their board certification examinations.39 Physician
licensure in many states requires the completion of a residency program accredited
by the ACGME, the organization responsible for accrediting the more
than 8700 residency programs in 26 specialties and 130 subspecialties.40 Following residency, some physicians opt for additional subspecialty fellowship training.
ACGME divides fellowship training41 into (1) Dependent Subspecialty Programs
in which the program functions in conjunction with an accredited specialty/core
program and (2) Independent Subspecialty Programs in which the program does
not depend on the accreditation status of a specialty program.42 For osteopathic
physicians, the American Osteopathic Association approves osteopathic postdoctoral
37. See Brown v. Hamot Med. Ctr., 2008 WL 55999 (W.D. Pa. Jan. 3, 2008). American
Medical Association, Requirements for Becoming a Physician, available at http://www.ama-assn.org/
ama/pub/education-careers/becoming-physician.page? (last visited Aug. 9, 2011).
38. See Planned Parenthood Cincinnati Region v. Taft, 444 F.3d 502, 515 (6th Cir. 2006).
American Medical Association, Requirements for Becoming a Physician, available at http://www.ama-assn.org/ama/pub/education-careers/becoming-physician.page? (last visited Aug. 9, 2011).
39. See Therrien v. Town of Jay, 489 F. Supp. 2d 116, 117 (D. Me. 2007) (finding that a physician
who completed a residency in internal medicine was qualified to give his opinion on trauma related to a
§ 1983 claim against a police department). American Medical Association, Requirements for Becoming
a Physician, available at http://www.ama-assn.org/ama/pub/education-careers/becoming-physician.page?
(last visited Aug. 9, 2011).
40. Accreditation Council for Graduate Medical Education, The ACGME at a Glance, available
at http://www.acgme.org/acWebsite/newsRoom/newsRm_acGlance.asp (last visited Feb. 12, 2011).
41. Accreditation Council for Graduate Medical Education, Specialty Programs with Dependent
and Independent Subspecialties, available at http://www.acgme.org/acWebsite/RRC_sharedDocs/
sh_progs_depIndSubs.asp (last visited Feb. 12, 2011).
42. John Doe 21 v. Sec’y of Health and Human Servs., 84 Fed. Cl. 19, 35–36 (Fed. Cl. 2008)
(“The Government’s expert, Dr. Wiznitzer, is a board-certified neurologist by the American Board of
Psychiatry and Neurology, with a special qualification in Child Neurology. In addition, Dr. Wiznitzer
is certified by the American Board of Pediatrics. Since 1986, Dr. Wiznitzer has been an Associate
Pediatrician and an Associate Neurologist at University Hospital of Cleveland, Ohio. And, since 1992,
Dr. Wiznitzer has been Director of the Autism Center at Rainbow Babies and Children’s Hospital
in Cleveland, Ohio. During the past 24 years, Dr. Wiznitzer also has been an Associate Professor of
Pediatrics and Associate Professor of Neurology at Case Western Reserve University. Dr. Wiznitzer
completed his residency in Pediatrics from Children’s Hospital Medical Center in Cincinnati and
served as a Fellow in Developmental Disorders, Pediatric Neurology, and Higher Cortical Functions.
Dr. Wiznitzer also has received numerous awards and honors in the neurology field and his work has
been widely published.”) (citations omitted); Brown v. Hamot Med. Ctr., 2008 WL 55999, at *8–9
training programs.43 The American Osteopathic Association established the Osteopathic Postdoctoral Training Institutions (OPTIs); each OPTI is a community-based training consortium that partners one or more colleges of osteopathic
medicine with one or more hospitals and, possibly, ambulatory care facilities.44
3. Licensure and credentialing
Each of the 50 states, the District of Columbia, and the U.S. territories has a
Medical Practice Act defining the practice of medicine and delegating enforcement
to a state medical board. Besides awarding medical licenses, state medical boards also investigate complaints, discipline physicians who violate the law,
and evaluate and rehabilitate physicians. The Federation of State Medical Boards
represents the 70 medical boards of the United States and its territories, and its
mission is “promoting excellence in medical practice, licensure, and regulation as
the national resource and voice on behalf of state medical boards in their protection of the public.”45
Credentialing typically involves verifying medical education, postgraduate
training, board certification, professional experience, state licensure, prior credentialing outcomes, medical board actions, malpractice, and adverse clinical events.
Credentialing or recredentialing by hospitals involves an assessment of a physician’s
professional or technical competence and performance by evaluating and monitoring the quality of patient care. This credentialing process defines physicians’ scope
of practice and hospital privileges, that is, the clinical services they may provide.
The American Board of Medical Specialties (ABMS) offers certification
in 24 medical specialties (e.g., emergency medicine, internal medicine, obstetrics
and gynecology, family medicine, pediatrics, surgery, and others) to provide46
“assurance of a physician’s expertise in a particular specialty and/or subspecialty
(W.D. Pa. Jan. 3, 2008) (“As the United States Court of Appeals for the Fifth Circuit has explained
in another context, a medical residency is primarily an academic enterprise:
[a] residency program is distinct from other types of employment in that the resident’s “work” is what
is academically supervised and evaluated. [T]he primary purpose of a residency program is not employment or a stipend, but the academic training and the academic certification for successful completion
of the program. The certificate . . . tells the world that the resident has successfully completed a course
of training and is qualified to pursue further specialized training or to practice in specified areas. . . .
Successful completion of the residency program depends upon subjective evaluations by trained faculty
members into areas of expertise that courts are poorly equipped to undertake in the first instance or
to review. . . .”).
43. American Osteopathic Association, Postdoctoral Training, available at http://www.osteopathic.
org/inside-aoa/Education/postdoctoral-training/Pages/default.aspx (last visited Feb. 12, 2011).
44. Id.
45. Federation of State Medical Boards, FSMB Mission and Goals, available at http://www.fsmb.
org/mission.html (last visited Feb. 12, 2011).
46. American Board of Medical Specialties, Who We Are and What We Do, available at http://
www.abms.org/About_ABMS/who_we_are.aspx (last visited Feb. 12, 2011).
of medical practice.”47 Although the criteria vary depending on the field, board
eligibility requires the completion of an appropriate residency, a valid license
to practice medicine, and evaluation with written and, in some cases, oral
examinations. Many boards also require an evaluation of practice performance for initial certification. Board certification documents the fulfillment of
all criteria including passing the examinations. Originally, board certificates had no
expiration, but a program of periodic recertification (every 6 to 10 years) was subsequently initiated to ensure that physicians remained current in their specialty. In
2006, the ABMS recertification process became the Maintenance of Certification
to emphasize continuous professional development through a four-part process:
1. Licensure and professional standing;
2. Lifelong learning;
3. Cognitive expertise; and
4. Practice performance assessment in six core competencies:
   a. patient care,
   b. medical knowledge,
   c. practice-based learning,
   d. interpersonal and communications skills,
   e. professionalism, and
   f. systems-based practice.48
In some cases, specialty organizations have opted to develop their own certification
process outside of the ABMS (e.g., the American Board of Bariatric Medicine).49
The American Osteopathic Association (AOA) certifies osteopathic physicians
in 18 osteopathic specialty boards (e.g., emergency medicine, internal medicine,
obstetrics and gynecology, family medicine, pediatrics, surgery, and others).50 The
osteopathic continuous certification process involves (1) unrestricted licensure,
(2) lifelong learning/continuing medical education, (3) cognitive assessment,
(4) practice performance assessment and improvement, and (5) continuous AOA
membership.51
47. Although specialization is a hallmark of modern medical practice, courts have not always
required that medical testimony come from a specialist. See Gaydar v. Sociedad Instituto Gineco-Quirurgico y Planificacion Familiar, 245 F.3d 15, 24–25 (1st Cir. 2003) (“The proffered expert
physician need not be a specialist in a particular medical discipline to render expert testimony relating
to that discipline.”).
48. American Board of Medical Specialties, ABMS Maintenance of Certification, available at
http://www.abms.org/Maintenance_of_Certification/ABMS_MOC.aspx (last visited Feb. 12, 2011).
49. American Board of Bariatric Medicine, Certification, available at http://www.abbmcertification.
org/ (last visited Feb. 12, 2011).
50. American Osteopathic Association, AOA Specialty Certifying Boards, available at http://
www.osteopathic.org/inside-aoa/development/aoa-board-certification/Pages/aoa-specialty-boards.aspx
(last visited Feb. 12, 2011).
51. Id.
4. Continuing medical education
For relicensure, state medical boards require continuing medical education so that
physicians can acquire new knowledge and maintain clinical competence. The
Accreditation Council for Continuing Medical Education (ACCME) identifies,
develops, and promotes quality standards for continuing medical education for
physicians. ACCME requires certain elements of structure, method, and organization in the development of continuing medical education materials to ensure
uniformity across states and to help assure physicians, state medical boards, medical
societies, state legislatures, continuing medical education providers, and the public
that the education meets certain quality standards. For osteopathic physicians, the
AOA Board of Trustees also oversees accreditation for osteopathic CME sponsors
through the Council on Continuing Medical Education (CCME).52 The AOA’s
Healthcare Facilities Accreditation Program (HFAP) reviews services delivered by
medical facilities.53
B. Organization of Medical Care
The delivery of health care in the United States is highly decentralized and
fragmented,54 and is provided through clinics, hospitals, managed care organizations, medical groups, multispecialty clinics, integrated delivery systems, specialty
standalone hospitals, imaging facilities, skilled nursing facilities, rehabilitation
hospitals, emergency departments, and pharmacy-based and other walk-in clinics.
When surveyed in 1996, patients viewed the health care system as a “nightmare
to navigate.”55 Transitioning care from outpatient to inpatient hospitalization to
recovery often involves multiple handoffs among different physicians and care
providers with the need for accurate, timely, and complete transfer of information about the patient’s acute and chronic medical conditions, medications, and
treatments. Although hospitals increasingly belong to a network or system, most
community physicians belong to practices involving 10 or fewer physicians.56
Concerns about the safety of the organization of medical care first arose from
the Harvard Medical Practice Study, which found that adverse events occurred in
52. American Osteopathic Association, Continuing Medical Education, available at http://
www.osteopathic.org/inside-aoa/development/continuing-medical-education/Pages/default.aspx (last
visited Feb. 12, 2011).
53. Healthcare Facilities Accreditation Program, About HFAP, available at http://www.hfap.org/
about/overview.aspx (last visited Feb. 12, 2011).
54. Committee on Quality of Health Care in America, Institute of Medicine, Crossing the
Quality Chasm: A New Health System for the 21st Century (2001) (hereinafter “2001 CQHCA
Report”).
55. Id. at 28.
56. Id. at 28.
3.7% of hospitalizations.57 Following some highly publicized errors (fatal medication overdoses and amputation of the limb on the wrong side), the Institute of
Medicine estimated that errors resulted in as many as 98,000 deaths in patients
hospitalized during 1997.58 The report notes that “[t]he decentralized and fragmented nature of the health care delivery system (some would say ‘nonsystem’)
also contributes to unsafe conditions for patients, and serves as an impediment to
efforts to improve safety.” While recognizing that “not all errors result in harm,”
the report defines safety as “freedom from accidental injury” and specifies two
types of error: “the failure of a planned action to be completed as intended or the
use of a wrong plan to achieve an aim.”59
Subsequently, the Institute of Medicine recommended development of a
learning health care delivery system: “a system that both prevents errors and learns
from them when they occur. The development of such a system requires, first,
a commitment by all stakeholders to a culture of safety and, second, improved
information systems.”60 Government and nongovernment institutions such as the
Agency for Healthcare Research and Quality (designated as the federal lead for
patient safety by the Healthcare Research and Quality Act of 1999 to “(1) identify the causes of preventable health care errors and patient injury in health care
delivery; (2) develop, demonstrate, and evaluate strategies for reducing errors and
improving patient safety; and (3) disseminate such effective strategies throughout
the health care industry.”),61 the National Quality Forum (a nonprofit organization
with multiple stakeholders developing and measuring performance standards), the
Joint Commission (independent not-for-profit organization accrediting and certifying care quality and safety), the Institute for Healthcare Improvement (independent not-for-profit organization fostering innovation that improves care), and the Leapfrog
Group (a coalition of large employers rewarding performance) all have adopted
as parts of their mission the assessment and promotion of safety at the healthcare
system level. To deliver safe, effective, and efficient care, medical delivery systems
have increasingly incorporated allied health professionals, including nurses, nurse
practitioners, physicians’ assistants, pharmacists, and therapists, into care delivery.
57. Troyen A. Brennan et al., Incidence of Adverse Events and Negligence in Hospitalized Patients:
Results of the Harvard Medical Practice Study I, 324 New Eng. J. Med. 370–76 (1991); Lucian L. Leape
et al., The Nature of Adverse Events in Hospitalized Patients: Results of the Harvard Medical Practice Study
II, 324 New Eng. J. Med. 377–84 (1991).
58. Committee on Quality of Health Care in America, Institute of Medicine, To Err Is Human:
Building a Safer Health System 26 (2000) (hereinafter “2000 CQHCA Report”).
59. Id. at 4, 54, 58.
60. Committee on Data Standards for Patient Safety, Institute of Medicine, Patient Safety:
Achieving a New Standard for Care 1 (2005).
61. Agency for Healthcare Research and Quality, Advancing Patient Safety: A Decade of
Evidence, Design and Implementation at 1, available at http://www.ahrq.gov/qual/advptsafety.htm
(last visited Feb. 12, 2011).
C. Patient Care
1. Goals
The Institute of Medicine (IOM) describes quality health care delivery as “[t]he
degree to which health services for individuals and populations increase the likelihood of desired health outcomes and are consistent with current professional
knowledge.” The six specific aims for improving health care include
1. “Safe: avoiding injuries to patients from the care that is intended to help
them;”
2. “Effective: providing services based on scientific knowledge to all who
could benefit, and refraining from providing services to those not likely
to benefit;”
3. “Patient-centered: providing care that is respectful of and responsive to
individual patient preferences, needs, and values, and ensuring that patient
values guide all clinical decisions;”
4. “Timely: reducing waits and sometimes harmful delays for both those who
receive and those who give care;”
5. “Efficient: avoiding waste, including waste of equipment, supplies, ideas,
and energy;” and
6. “Equitable: providing care that does not vary in quality because of personal
characteristics such as gender, ethnicity, geographic location, and socioeconomic status.”62
Health outcome goals include (1) improving longevity or life expectancy,
(2) relieving symptoms (improving quality of life or reducing morbidity), and
(3) preventing disease. These goals, however, may conflict with one another. For
example, some patients may be willing to accept the chance of a reduced length
of life to try to obtain a higher quality of life (e.g., if normal volunteers had a
vocal cord cancer, about 20% of them would prefer radiation therapy instead of
surgery to preserve their voice despite a reduction in survival63), whereas others
may accept reduced quality of life to try to extend life (e.g., cancer chemotherapy). Some may accept a risk of dying from a procedure to prolong life or
relieve symptoms (e.g., coronary revascularization), whereas others may prefer to
avoid the near-term risk of the procedure or surgery despite future benefit (risk
aversion). In Crossing the Quality Chasm, the IOM emphasized care delivery that
should accommodate individual patient choices and preferences and be customized
on the basis of patients’ needs and values.64
62. 2001 CQHCA Report, supra note 54, at 44, 5-6.
63. Barbara J. McNeil et al., Speech and Survival: Tradeoffs Between Quality and Quantity of Life in
Laryngeal Cancer, 305 New Eng. J. Med. 982–87 (1981) (hereinafter “McNeil”).
64. 2001 CQHCA Report, supra note 54, at 49.
The Charter on Medical Professionalism avers three fundamental principles:
(1) patient welfare or serving the interest of the patient, (2) patient autonomy or
empowering patients to make informed decisions, and (3) social justice or fair distribution of health care resources.65 At times, the primacy of patient welfare places
the physician in conflict with social justice—for example, a patient with an acute
heart attack is in the emergency room with no coronary care unit (CCU) beds
available, and the most stable patient in the CCU has a 2-day-old heart attack.
Transferring the patient out of the CCU places him or her at a small risk for a
complication, but the CCU bed is a limited societal resource that other patients
should be able to access.66 Similarly, patients may insist on an unneeded and
costly test or treatment, and the first two principles would encourage physicians
to acquiesce, yet these unnecessary tests or treatments expose patients to harm and
expense and also diminish resources that would otherwise be available to others.67
2. Patient-physician encounters
A patient-physician encounter typically consists of four components: (1) patient
history, (2) physical examination, (3) medical decisionmaking, and (4) counseling.68 In many cases, patients seek medical attention because of a change in health
that led to symptoms. During the patient history, physicians identify the chief
complaint as the particular symptom that led the patient to seek medical evaluation. The history of the present illness includes the onset and progression of
symptoms over time and may include eliciting pertinent symptoms that the patient
does not exhibit. These “pertinent negatives” reduce the likelihood of certain
competing diagnoses. A comprehensive encounter includes past medical history
of prior illnesses, hospitalizations, surgeries, current medications, drug allergies,
and lifestyle habits including smoking, alcohol use, illicit drug use, dietary habits,
and exercise habits. Family history considers illnesses that have been diagnosed in
related family members to identify potential genetic predispositions for disease.
Social history usually includes education, employment, and social relationships
and provides a socioeconomic context for developing or coping with illness and
an employment context for exposure to environmental or toxin risks. Finally, the
review of systems is a comprehensive checklist of symptoms that might or might
not arise from the various organ systems and is an ancillary means to capture symp-
65. Medical Professionalism Project: ABIM Foundation, Medical Professionalism in the New
Millennium: A Physician Charter, 136 Annals Internal Med. 243, 244 (2002).
66. Harold C. Sox et al., Medical Decision Making (2007).
67. Harold C. Sox, Medical Professionalism and the Parable of the Craft Guilds, 147 Annals Internal
Med. 809–10 (2007).
68. See generally Davoll v. Webb, 194 F.3d 1116, 1138 (10th Cir. 1999) (“A treating physician is
not considered an expert witness if he or she testifies about observations based on personal knowledge,
including treatment of the party.”).
toms that the patient may have unintentionally neglected to mention, but which
may lead physicians to consider additional diagnostic possibilities.
Patients, particularly the elderly, also may seek care to monitor multiple
chronic conditions. This places an emphasis on collaborative and continuous
care that involves patients (and their families) and providers, long-term care goals
and plans, and self-management training and support.69 The organizational needs
for condition management, however, differ substantially from those necessary to
deliver health services for acute episodic complaints. Taking a patient history in
this case involves determining the status of the multiple conditions, whether
symptoms from those conditions have progressed, improved, or stabilized, and
the ability of patients to manage their conditions.
The physical examination may be directed or complete. Physical findings
are referred to as signs (distinct from symptoms noted by the patient). Directed
physical examination refers to the examination of the relevant organ systems that
may cause the symptoms or that may have positive or negative findings related
to suspected diseases. When the disease is a chronic condition, the examination
may be used to monitor disease progression or resolution. The complete physical
examination of all organ systems may be performed as part of any annual examination, for difficult diagnoses, or for diseases that affect multiple organ systems.
The medical decisionmaking step of the encounter involves formulating an
assessment and plan. After the history and physical examination—based on the
diagnostic possibilities, their likelihood, and the risks and benefits of treatment for
each—the physician decides whether to recommend diagnostic testing, empiric
treatment, referral to specialty or subspecialty care for further diagnostic evaluation, or a therapeutic intervention. Particularly challenging diagnoses are those
that present with atypical symptoms, occur rarely, mimic other diseases, or involve
multiple organ systems. For example, symptoms may arise from different organ
systems: Wheezing, which is consistent with asthma, could be caused by acid
going up from the stomach into the esophagus and then into the lungs (gastroesophageal reflux), congestive heart failure, or vocal cord dysfunction, among
other diagnostic possibilities. The final step in the encounter is counseling the
patient regarding diagnoses, tests, and treatments including dietary and lifestyle
changes, medications, medical devices, and procedural interventions.
IV. Medical Decisionmaking
A. Diagnostic Reasoning
Uncertainty in defining a disease makes diagnosis difficult: (1) the difference
between normal and abnormal is not always well demarcated; (2) many diseases
69. 2001 CQHCA Report, supra note 54, at 27.
do not progress with certainty (e.g., progression of ductal carcinoma in situ of
the breast to invasive breast cancer occurs less than 50% of the time) but rather
increase the risk of a poor outcome (e.g., hypertension raises the risk of developing
heart disease or stroke); and (3) symptoms, signs, and findings for one disease overlap with others.70 Variation also exists in the ability of physicians to elicit particular
symptoms (e.g., in a group of patients interviewed by many physicians, 23% to
40% of the physicians reported cough as being present), observe signs (e.g., only
53% of physicians detected cyanosis—a blue or purple discoloration of the skin
resulting from lack of oxygen—when present), or interpret tests (e.g., only 51% of
pathologists agreed with each other when examining PAP smear slides with cells
taken from a woman’s cervix to look for signs of cervical cancer).71 Moreover,
prognosis (response to disease or treatment) with alternative therapies is in many
cases uncertain. In a report by the Royal College of Physicians:
The practice of medicine is distinguished by the need for judgement in the
face of uncertainty. Doctors take responsibility for these judgements and their
consequences. A doctor’s up-to-date knowledge and skill provide the explicit
scientific and often tacit experiential basis for such judgements. But because so
much of medicine’s unpredictability calls for wisdom as well as technical ability,
doctors are vulnerable to the charge that their decisions are neither transparent
nor accountable.72
1. Clinical reasoning process
Studies of clinical problem solving suggest that physicians employ combinations
of two diagnostic approaches ranging from hypothetico-deductive (deliberative
and analytical) to pattern recognition (quick and intuitive).73 In the hypothetico-deductive approach, based on partial information, such as patient age, gender, and
chief complaint, physicians74 begin to generate a limited list of potential diagnostic
hypotheses (hypothesis generation). Over the past 50 years, cognitive scientists
70. David M. Eddy, Variations in Physician Practice: The Role of Uncertainty, 3 Health Affairs 74,
75–76 (1984).
71. Id. at 77–78.
72. Royal College of Physicians, Doctors in Society: Medical Professionalism in a Changing
World, Technical Supplement, at 11, available at http://bookshop.rcplondon.ac.uk/
contents/pub75-411c044b-3eee-462d-936d-1dad7313e4a0.pdf (last visited Feb. 12, 2011).
73. Jerome P. Kassirer et al., Learning Clinical Reasoning (2d ed. 2009) (hereinafter “Kassirer
et al.”); Arthur S. Elstein & Alan Schwartz, Clinical Problem Solving and Diagnostic Decision Making:
Selective Review of the Cognitive Literature, 324 BMJ 729–32 (2002) (hereinafter “Elstein”); Jerome P.
Kassirer & G. Anthony Gorry, Clinical Problem Solving: A Behavioral Analysis, 89 Annals Internal Med.
245 (1978); Geoffrey Norman, Research in Clinical Reasoning: Past History and Current Trends, 39 Med.
Educ. 418–27 (2005).
74. Steven N. Goodman, Toward Evidence-Based Medical Statistics, 1: The p Value Fallacy, 130
Annals Internal Med. 995–1004 (1999) (hereinafter “Goodman”).
have demonstrated that human short-term memory capacity is limited,75 and so
this initial list of possible diagnoses is a cognitive necessity and provides an initial
context that physicians use to evaluate subsequent data. Based on their knowledge
of the diagnoses on that list, physicians have expectations about what symptoms,
risk factors, disease course, signs, or test results would be consistent with each
diagnosis (deductive inference).
As physicians gather additional information, they evaluate those data for
their consistency with the possibilities on their initial list and whether those data
would increase or decrease the likelihood of each possibility (hypothesis refinement). If the data are inconsistent, additional diagnostic possibilities are considered
(hypothesis modification). The information gathering continues as an iterative
process at the same visit or over time during multiple visits with the same or
other physicians. The final cognitive step (diagnostic verification) involves testing the validity of the diagnosis for its coherency (consistency with predisposing
risk factors, physiological mechanisms, and resulting manifestations), its adequacy
(the ability to account for all normal and abnormal findings and the disease time
course), and its parsimony (the simplest single explanation as opposed to requiring
the simultaneous occurrence of two or more diseases to explain the findings).76
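The generate–refine–verify cycle described above can be sketched, in a deliberately simplified form, as repeated Bayesian reweighting of a short differential diagnosis list. The diagnoses, findings, and probabilities below are invented for illustration only and are not drawn from the manual:

```python
# Illustrative sketch: hypothesis refinement as iterative Bayesian updating
# over a small differential diagnosis. All numbers are hypothetical.

def refine(priors, likelihoods):
    """One round of hypothesis refinement: reweight each diagnosis by how
    well it predicts a newly gathered finding, then renormalize."""
    unnorm = {d: priors[d] * likelihoods[d] for d in priors}
    total = sum(unnorm.values())
    return {d: p / total for d, p in unnorm.items()}

# Initial short list generated from the chief complaint (hypothesis generation).
beliefs = {"asthma": 0.5, "reflux": 0.3, "heart failure": 0.2}

# Each new finding carries P(finding | diagnosis) for every hypothesis;
# the second entry models a "pertinent negative" on examination.
findings = [
    {"asthma": 0.6, "reflux": 0.4, "heart failure": 0.2},
    {"asthma": 0.7, "reflux": 0.2, "heart failure": 0.3},
]

for likelihoods in findings:
    beliefs = refine(beliefs, likelihoods)  # hypothesis refinement

print({d: round(p, 2) for d, p in beliefs.items()})
```

Each finding that fits one hypothesis better than its competitors shifts probability toward that hypothesis; a finding inconsistent with every entry on the list would prompt adding new hypotheses (hypothesis modification), which this sketch omits.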
At the other end of clinical reasoning are heuristics, quick automatic “rules
of thumb” or cognitive shortcuts. In such cases, pattern recognition leads to rapid
recognition and a quick diagnosis, improving cognitive efficiency.77 For example,
a black woman with large shadows of lymph nodes in her chest x ray would trigger a diagnosis of a disease known as sarcoidosis for many physicians. The simplifying assumptions involved in heuristics, however, are subject to cognitive biases.
For example, episodic headache, sweating, and a rapid heartbeat form the classic
triad seen in patients with a rare adrenal tumor known as a pheochromocytoma
that also can cause hypertension. Physicians finding those three symptoms in a
patient with hypertension may overestimate the patient’s likelihood of having
pheochromocytoma based on representativeness bias, overestimating the likelihood of a less common disease just because case findings resemble those found
in that disease.78 Other cognitive errors include availability (overestimating the
75. Elstein, supra note 73; George A. Miller, The Magical Number Seven Plus or Minus Two: Some
Limits on Our Capacity for Processing Information, 63 Psychol. Rev. 81–97 (1956).
76. Kassirer et al., supra note 73, at 5-6.
77. Stephen G. Pauker & John B. Wong, How (Should) Physicians Think? A Journey from Behavioral
Economics to the Bedside, 304 JAMA 1233–35 (2010).
78. For additional discussion and definition of terms, see Section IV.A.2. Applying Bayes’ rule,
about 100 in 100,000 patients with hypertension have pheochromocytoma; this symptom triad occurs
in 91% of patients with pheochromocytoma (sensitivity) and does not occur in 94% of those without
pheochromocytoma (specificity), and so 6% of those without pheochromocytoma would have this
symptom triad. On the basis of Bayes’ rule, 91 of the 100 individuals with pheochromocytoma (91% times
100) would have this triad, and 5994 without a pheochromocytoma (6% times 99,900) will have the triad.
Thus, among the 100,000 hypertensive patients, 6085 will have the classic triad, suggesting the possibility
of pheochromocytoma, but only 91 of the 6085, or 1.5%, will indeed have pheochromocytoma.
Reference Guide on Medical Testimony
likelihood of memorable diseases because of severity or media attention and
underestimating common or routine diseases) and anchoring (insufficient adjustment of the initial likelihood of disease).79
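The natural-frequency arithmetic in note 78 can be restated as a short calculation (a sketch using only the figures given in the footnote: prevalence of 100 per 100,000 hypertensive patients, triad sensitivity of 91%, and specificity of 94%):

```python
# Natural-frequency check of the pheochromocytoma example in note 78.
population = 100_000
with_pheo = 100                         # hypertensive patients with the tumor
without_pheo = population - with_pheo   # 99,900 without the tumor

triad_true_pos = with_pheo * 0.91       # 91 with the tumor show the triad
triad_false_pos = without_pheo * 0.06   # 5,994 without the tumor show it

total_with_triad = triad_true_pos + triad_false_pos   # 6,085
prob_pheo_given_triad = triad_true_pos / total_with_triad

print(round(prob_pheo_given_triad * 100, 1))  # prints 1.5 (percent)
```

Only about 1.5% of hypertensive patients with the classic triad actually have the tumor, which is why relying on the resemblance of findings alone overestimates the likelihood of this rare disease.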
Clinical intuition refers to rapid, unconscious processes that select the pertinent findings out of the multitude of available data.80 Such expertise results from
practice, is context sensitive, and cannot always be reduced to cause and effect.81
Cognitive research into the development of expertise suggests two competing
hypotheses. In instance- or exemplar-based memory, physicians store scripts or
“stories” of prior recalled case examples, for example, visual information such
as that in pathology, dermatology, or radiology, and match new cases to those
stories. The alternative prototype memory hypothesis is based on a mental model
of disease wherein experts store structured “facts” about the disease to create
abstractions. These “prototypes” enable experts to link findings to one another,
to connect findings to the possible diagnoses, and to predict additional findings
necessary to confirm the diagnosis, even in the absence of prior experience with
exactly such a case.82
Physicians typically apply hypothetico-deductive approaches when seeing
patients with problems outside of their expertise or difficult problems with atypical issues within their expertise and apply intuitive pattern recognition for cases
within their expertise or less challenging cases. However, diagnostic accuracy
appears to depend more on mastery of domain knowledge than on the particular
problem-solving method.83
2. Probabilistic reasoning and Bayes’ rule
There is no correlation between physicians’ ability to collect data thoroughly and
their ability to interpret the data accurately.84 Making quantitative predictions or
interpretation of test results constitutes probabilistic reasoning and avoids the use
of ambiguous qualitative terms such as “low” or “always” that may contribute to
different management decisions.85
Over 200 years ago, the Reverend Thomas Bayes wrote a paper, published posthumously, that now forms a critical concept in modern medicine. Ignored for
79. Kassirer et al., supra note 73; Elstein, supra note 73.
80. Trisha Greenhalgh, Intuition and Evidence—Uneasy Bedfellows? 52 Brit. J. Gen. Practice
395–400 (2002).
81. Id. at 396.
82. Kassirer et al., supra note 73; Elstein, supra note 73.
83. Elstein, supra note 73.
84. Arthur S. Elstein & Alan Schwartz, Clinical Reasoning in Medicine, in Clinical Reasoning in
the Health Professions 223–34 (Joy Higgs et al. eds., 3d ed. 2008).
85. When physicians were asked to quantify “low probability,” the estimates had a mean of
~37% with a range from 0% to ~80% and when asked to quantify “always,” physicians had a mean
of ~88% with a range from 70% to 100%. Geoffrey D. Bryant & Geoffrey R. Norman, Expressions of
Probability: Words and Numbers, 302 New Eng. J. Med. 411 (1980).
nearly two centuries, his paper showed how to estimate the likelihood of disease
following a test result using the likelihood of disease prior to testing and the specific test result obtained. Thus, Bayesian analysis refers to a method of combining
existing evidence or a prior belief with additional evidence, for example, from test
results. The additional evidence may be the presence or absence of a symptom,
sign, test, or research study results.
The pretest suspicion of disease or, equivalently, the likelihood or prior probability of disease may be objective, that is, related to incidence (new cases over
a specified period of time) or prevalence (existing cases at a particular point in
time); based on clinical prediction rules (e.g., mathematical predictive models to
estimate the likelihood of developing heart disease over the next 10 years using
data from the Framingham Study); or subjective, that is, based on a clinician’s
estimated likelihood of disease prior to any testing.86 Bayes’ rule then combines
that pretest suspicion with the observed test result. Those who have disease and a
positive test are said to have true-positive test results. Those without disease who
have a negative test are said to have true-negative test results. Tests, however,
are almost never perfectly accurate. That is, not everyone with disease has a
positive test; these are called false-negative test results. Similarly, some individuals
who are healthy may mistakenly have positive tests; these are called false-positive
test results.
For example, consider screening mammography which is positive in 90% of
women with breast cancer, and so the true-positive rate (or “sensitivity”) of 90%
is the likelihood of a positive test among those with disease. Mammography is
negative in 93% of women without breast cancer, and so the true-negative rate
(or “specificity”) of 93% is the likelihood of a negative test among those who do
not have disease (see Table 1).87 Note that if the test is not negative, it must be
positive, or vice versa, so that the sum of the columns in Table 1 must equal 100%.
Because a positive mammogram can occur among individuals with or without
breast cancer, the interpretation of the likelihood of breast cancer with a positive mammogram can be problematic. Given that the prevalence of breast cancer
among asymptomatic 40- to 50-year-old women is 8 in 1000, or 0.8%, Bayes’
rule calculates the likelihood of breast cancer following a test result, for example,
a positive mammogram (see Figures 1 and 2, Table 2).88 This analysis helps explain
in part why mammogram screening is controversial in women under age 50.
86. See Gonzalez v. Metro. Transp. Auth., 174 F.3d 1016, 1023 (9th Cir. 1999) (describing the
implications of Bayes’ rule for drug testing and noting that a test with the same false-positive rate will
generate a higher proportion of false positives to true positives in a population with fewer drug users);
see generally Michael O. Finkelstein & William B. Fairley, A Bayesian Approach to Identification Evidence,
83 Harv. L. Rev. 489 (1970). For a discussion of Bayesian statistics, see David H. Kaye & David A.
Freedman, Reference Guide on Statistics, Section IV.D, in this manual.
87. Gerd Gigerenzer, Calculated Risks: How to Know When Numbers Deceive You (2002)
at 41 (hereinafter “Gigerenzer”).
88. Id. at 45-48.
Table 1. 2 × 2 Test Characteristics of Screening Mammogram for Use in Bayes’ Rule

                        Breast Cancer          No Breast Cancer
Positive mammogram      90 (true positives)     7 (false positives)
Negative mammogram      10 (false negatives)   93 (true negatives)
Figure 1. Screening 1000 women for breast cancer.
[Tree diagram: 1000 women (prevalence = 0.8%) divide into 8 with breast cancer and 992 without breast cancer. Of the 8 with breast cancer (sensitivity = 90%), 7 have a positive test and 1 has a negative test. Of the 992 without breast cancer (specificity = 93%), 69 have a positive test and 923 have a negative test. Probability of breast cancer with a positive mammogram (predictive value positive) = 7 ÷ (7 + 69) = 9%.]
Figure 2. Likelihood of breast cancer after a positive or a negative mammogram.
[Graph: posttest probability of breast cancer (vertical axis, 0% to 100%) plotted against pretest probability of breast cancer (horizontal axis, 0% to 100%), with one curve for a positive test and one for a negative test.]
Table 2. Tabular and Formula Forms of Bayes’ Rule

Tabular Form of Bayes’ Rule

Condition          Pretest or Prior   Conditional Probability of   Product of the Pretest and       Posttest or Posterior
                   Probability (%)    Positive Test for the        the Conditional                  Probability (%)
                                      Condition (%)                Probabilities (%)
Breast cancer      0.8                90 (sensitivity)             0.72                             9 = 0.72 ÷ 7.6
No breast cancer   99.2               7 (1 − specificity)          6.9
                                                                   Sum = 7.6

Formula Form of Bayes’ Rule

                  pD+ × pT+|D+
  ───────────────────────────────────────────────
  (pD+ × pT+|D+) + ((1 − pD+) × (1 − pT−|D−))

  pD+ = prior probability of disease = 0.8%
  pT+|D+ = sensitivity = true-positive rate = 90%
  pT−|D− = specificity = true-negative rate = 93%

                  0.008 × 0.90
  ───────────────────────────────────────────────  = 9%
  (0.008 × 0.90) + ((1 − 0.008) × (1 − 0.93))
Despite a test that has a 90% or higher rate on both sensitivity and specificity, a
calculation using Bayes’ theorem shows that having a low probability of breast
cancer before testing means that even with a positive result on a screening mammogram, the likelihood that an average woman under age 50 has breast cancer
is less than 10%.
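The formula form of Bayes’ rule in Table 2 can be sketched as a short function (an illustration using the mammography figures from the text: prevalence 0.8%, sensitivity 90%, specificity 93%):

```python
def posttest_probability(prior, sensitivity, specificity):
    """Bayes' rule: probability of disease given a positive test result."""
    true_pos = prior * sensitivity
    false_pos = (1 - prior) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# Screening mammography in asymptomatic 40- to 50-year-old women.
ppv = posttest_probability(prior=0.008, sensitivity=0.90, specificity=0.93)
print(f"{ppv:.0%}")  # prints "9%"
```

Even with a test this accurate, the low prior probability keeps the posttest probability under 10%.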
The probability of breast cancer among those with a positive mammogram
is termed the “predictive value positive.” Similarly, if the test were negative,
the likelihood of breast cancer in those with a negative mammogram (“false
reassurance rate”) would be 1 divided by 924 (1 woman with breast cancer and
a negative test and 923 women without breast cancer who have negative tests
in Figure 1), or about 0.1%. Interpreting a medical test result then depends on
the pretest likelihood of disease and the test’s sensitivity and specificity. Figure 2
illustrates the likelihood of breast cancer for differing pretest or prior probabilities
of breast cancer.
The discriminating ability of a test can be succinctly summarized as a likelihood ratio. The likelihood ratio positive expresses how much more likely disease
is to be present following a positive test result. It is the ratio of the true-positive
rate to the false-positive rate (sensitivity divided by 1 minus the specificity), e.g.,
approximately 12.9 (0.90 divided by 1 − 0.93) in the case of mammography. The likelihood ratio
negative expresses how much less likely disease is to be present following a negative test result. It is the ratio of the false-negative rate to the true-negative
rate (1 minus the sensitivity, divided by the specificity), or 0.11 (1 − 0.90 divided
by 0.93) in the case of mammography. Likelihood ratios exceeding 10 or falling
below 0.1 are believed to be strong discriminators causing “large” changes in the
likelihood of disease; those between 5 and 10 or 0.1 and 0.2 cause “moderate”
changes; and those between 2 and 5 or 0.2 and 0.5 cause “small” changes.89 Note
that even for a strongly discriminating test such as mammography, a positive or a
negative test result does not change the likelihood of disease substantially for very
low or very high probabilities of disease (see Figure 2), thereby highlighting the
importance of the pretest likelihood of disease in interpreting test results.
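The likelihood-ratio arithmetic, and its use with pretest odds (the standard odds form of Bayes’ rule), can be sketched as follows, again using the mammography sensitivity of 90% and specificity of 93%:

```python
sensitivity, specificity = 0.90, 0.93

lr_positive = sensitivity / (1 - specificity)   # true-pos rate / false-pos rate
lr_negative = (1 - sensitivity) / specificity   # false-neg rate / true-neg rate

# Likelihood ratios convert pretest odds into posttest odds.
pretest_prob = 0.008                            # 0.8% prevalence
pretest_odds = pretest_prob / (1 - pretest_prob)
posttest_odds = pretest_odds * lr_positive
posttest_prob = posttest_odds / (1 + posttest_odds)

print(round(lr_positive, 1), round(lr_negative, 2), f"{posttest_prob:.0%}")
# prints "12.9 0.11 9%"
```

The odds form reproduces the 9% predictive value positive obtained from the probability form of Bayes’ rule, illustrating that a strongly discriminating test still yields a low posttest probability when the pretest probability is very low.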
Terms such as “sensitivity,” “specificity,” and “predictive value negative or
positive” are called conditional probabilities because they express the likelihood
of a particular result based on a particular condition (e.g., a positive test result
among those with disease) or the likelihood of a particular condition among
those with a particular result (e.g., disease among those with a positive test).90
These kinds of expression, however, remove the base case probability (the pretest
probability of disease, sometimes referred to as the prior probability of disease) as
part of “normalization,” so that Bayes’ rule is required to interpret a test result.
Moreover, confusion between sensitivity and predictive value positive may lead to
errors in the interpretation of test results; for example, a 90% likelihood of having a positive mammogram in patients with breast cancer—the sensitivity—may
be misinterpreted as the predictive value positive, implying that a woman with a
positive mammogram has a 90% chance of having cancer. This misinterpretation
ignores the role for pretest suspicion or likelihood of disease (or assumes that all
89. David A Grimes & Kenneth F Schulz, Refining Clinical Diagnosis with Likelihood Ratios, 365
Lancet 1500–05 (2005).
90. This terminology may be confusing. The predictive value negative (negative predictive
value) is defined as the probability of no disease among those with a negative test. It also equals 1
minus the false reassurance rate. The false-alarm rate is defined as the probability of no disease among
those with a positive test. It is also 1 minus the predictive value positive. The false reassurance rate may
be confused with the false negative rate (among those with disease, the likelihood of a negative test)
because both involve those with negative tests and those with disease but in one case the denominator
is individuals with negative tests (false reassurance rate) and in the other case individuals with disease
(false negative rate). Similarly, the false alarm rate may be confused with the false positive rate (among
those with no disease, the likelihood of a positive test).
women undergoing the test have the disease). This confusion can be avoided by
translating Bayes’ rule into natural frequency expressions.91 The natural frequency
expression incorporates both the pretest likelihood and the conditional probabilities of the test results to yield the following statements (see Figure 1): Of 1000
women between 40 and 50 years old, 8 have breast cancer, and 7 of these will
test positive. Of the remaining 992 who do not have breast cancer, about 69 will
also test positive. When presented as a natural frequency (including the likelihood of disease), the likelihood of breast cancer becomes more transparent; thus
76 women will test positive, and 7 of the 76 will have breast cancer. When 48
physicians with an average of 14 years of professional experience were presented
with the natural frequency version or the conditional probability version, 16 of
24 estimated the likelihood of breast cancer to exceed 50% with the conditional
probability (sensitivity, specificity) version but only 5 of 24 did so with the natural
frequency information.92
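The natural frequency statements above reduce to simple counting (figures from the text: 1000 women, prevalence 0.8%, sensitivity 90%, specificity 93%):

```python
# Natural-frequency restatement of the mammography example.
women = 1000
with_cancer = round(women * 0.008)               # 8 women with breast cancer
without_cancer = women - with_cancer             # 992 without

test_pos_cancer = round(with_cancer * 0.90)      # 7 of the 8 test positive
test_pos_healthy = round(without_cancer * 0.07)  # about 69 false positives

total_positive = test_pos_cancer + test_pos_healthy  # 76 positive tests
print(f"{test_pos_cancer} of {total_positive} positives have cancer")
# prints "7 of 76 positives have cancer"
```

Counting whole people rather than manipulating conditional probabilities makes the 9% figure immediately visible, which is the advantage the text attributes to natural frequencies.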
Just as mammography test results may be misinterpreted if Bayes’ rule is not
applied, the prosecutor’s fallacy involves the misinterpretation of probabilistic
information. For example, in People v. Collins, the prosecutor argued that 1 in 3
girls have blonde hair, 1 in 10 girls have a pony tail, 1 in 10 automobiles are partly
yellow, 1 in 4 men have a mustache, 1 in 10 black men have a beard, and 1 in
1000 cars have an interracial couple in the car.93 Multiplying these six probabilities
together yields a 1 in 12 million joint probability of having all conditions present.
Aside from being simply estimates and from assuming that the probabilities were
independent of one another, the prosecutor made the statement that “The probability of the defendant matching on these six characteristics is 1 in 12 million,”
thereby implying that the probability that someone other than the defendant was guilty was the same
1 in 12 million. However, if translated into natural frequency terms, 1 out of
every 12 million couples would have these six characteristics, and so assuming that
there are 24 million couples, there would be a 1 in 2 chance that the Collinses
are innocent. The error results from confusing the probability of a positive test
(having all six characteristics) among those with the disease (being guilty) and the
probability of the disease (being guilty) among those with a positive test (having
all six characteristics), that is, confusing the conditional probabilities—sensitivity
and positive predictive value.
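The Collins arithmetic can be made explicit (a sketch; the six component probabilities and the 24 million couples are the text’s own assumptions):

```python
# Prosecutor's fallacy in People v. Collins: a 1-in-12-million match
# probability is not the probability of innocence.
match_probability = (1/3) * (1/10) * (1/10) * (1/4) * (1/10) * (1/1000)
couples = 24_000_000

expected_matches = couples * match_probability        # about 2 matching couples
prob_defendants_are_the_culprits = 1 / expected_matches

print(round(1 / match_probability))                   # 12,000,000
print(round(expected_matches))                        # about 2
print(round(prob_defendants_are_the_culprits, 2))     # about 0.5
```

With roughly two couples expected to match, the chance that the defendants are the guilty couple is only about one in two, not 11,999,999 in 12 million.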
Bayes’ rule becomes even more relevant in the genomic medicine era.94 Suppose a genetic test has a sensitivity and specificity of 99.9%, and suppose the probability of disease is 1 in 1000 if a positive family history is present and 1 in 100,000
if no family history is present. Screening 1000 individuals with a positive family
91. Gigerenzer, supra note 87, at 42.
92. Id. at 43.
93. Id. at 152.
94. Isaac S. Kohane et al., The Incidentalome: A Threat to Genomic Medicine, 296 JAMA 212–15 (2006).
history for the gene results in 2 positive tests: 1 individual truly has disease, and
in the other the test is a false positive. Screening 10 million individuals without a
family history results in 10,100 positive tests in which 100 individuals have disease
and 10,000 do not. Even with a specificity of 99.99%, if a test screens for 10,000
genes simultaneously, then 63% of individuals will have at least one false-positive
test result. Based simply on the genetic test results alone, neither individuals nor
physicians would be able to distinguish those with true-positive results from those
with false-positive results, thereby potentially leading to inappropriate monitoring
or treatment for all with positive test results.
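The genetic-screening figures can be checked directly (a sketch using the sensitivities, specificities, and prior probabilities stated in the text):

```python
# Genetic-test examples from the text: sensitivity = specificity = 99.9%.
sens = spec = 0.999

# With a positive family history (prior 1 in 1000), screen 1000 people:
true_pos = 1 * sens                        # ~1 true positive
false_pos = 999 * (1 - spec)               # ~1 false positive
print(round(true_pos + false_pos))         # about 2 positive tests

# Without a family history (prior 1 in 100,000), screen 10 million:
diseased = 10_000_000 / 100_000            # 100 with disease
healthy = 10_000_000 - diseased
print(round(diseased * sens + healthy * (1 - spec)))  # about 10,100 positives

# Testing 10,000 genes at once, even at 99.99% specificity per gene:
p_at_least_one_false_pos = 1 - 0.9999 ** 10_000
print(f"{p_at_least_one_false_pos:.0%}")   # about 63%
```

The same test characteristics thus yield very different mixes of true and false positives depending on the prior probability, and multiplying independent tests makes at least one false positive nearly as likely as not.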
Although a test is commonly thought of as a sample from a bodily fluid, tissue,
or image, a test also could be the presence or absence of a symptom or physical
sign. For example, both inhalation anthrax and influenza can cause symptoms
of muscle aches, fever, and malaise. However, a critical symptom that helps distinguish one from the other is runny nose, which occurs in 14% of those with
inhalation anthrax but in 78% to 89% of those with influenza or influenza-like
illness. Thus, when faced with distinguishing between these diagnoses, patients
with a runny nose given this symptom alone are about six times more likely to
have influenza or a flu-like illness than to have anthrax.95
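The runny-nose comparison amounts to a likelihood ratio (figures from the text):

```python
# Runny nose as a symptom discriminating influenza from inhalation anthrax.
p_runny_given_flu = (0.78, 0.89)   # 78%-89% in influenza-like illness
p_runny_given_anthrax = 0.14       # 14% in inhalation anthrax

likelihood_ratios = [p / p_runny_given_anthrax for p in p_runny_given_flu]
print([round(lr, 1) for lr in likelihood_ratios])  # prints [5.6, 6.4]
```

Dividing the two conditional probabilities yields ratios between roughly 5.6 and 6.4, the basis for the statement that a patient with a runny nose is about six times more likely to have influenza than anthrax.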
Sensitivity and specificity rely on setting a positivity criterion, the threshold separating normal from abnormal: results above it are called positive and results below it negative. If the criterion is made stricter (e.g., a higher test result is required before it is considered abnormal), then sensitivity falls and specificity increases, and if
the criterion is made laxer, then sensitivity rises and specificity falls. Depending on
the context of the testing, it may be more appropriate to choose a laxer criterion
(e.g., screening donated blood for HIV infection where the benefit is reducing
transfusion-associated HIV transmission, and the risk is discarding some uninfected
units of donated blood) or a stricter one (e.g., screening a low-prevalence population for HIV infection where the benefit is reducing false-positive diagnoses and
the risk is missing some truly HIV-infected individuals).96 Thus the benefits of finding and treating a person with disease versus the risk of treating a person without
disease should help establish what is considered normal or abnormal.
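The tradeoff from moving the positivity criterion can be illustrated with a hypothetical test whose results are normally distributed in healthy and diseased populations (the means, standard deviations, and thresholds below are invented for illustration, not clinical values):

```python
from math import erf, sqrt

def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

# Hypothetical test: results average 100 (SD 10) in healthy people and
# 120 (SD 10) in diseased people. A result above the threshold is "positive".
def characteristics(threshold, healthy=(100, 10), diseased=(120, 10)):
    sensitivity = 1 - normal_cdf(threshold, *diseased)  # positives among diseased
    specificity = normal_cdf(threshold, *healthy)       # negatives among healthy
    return sensitivity, specificity

lax_sens, lax_spec = characteristics(threshold=105)
strict_sens, strict_spec = characteristics(threshold=115)
# Stricter criterion: sensitivity falls, specificity rises.
print(strict_sens < lax_sens, strict_spec > lax_spec)  # prints "True True"
```

Raising the threshold trades sensitivity for specificity, which is why the appropriate criterion depends on the relative costs of missing disease and of false alarms.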
The terms “sensitivity” and “specificity” apply to the simple situation in
which disease is present or absent and a test can be positive or negative, but terminology and interpretation become more complicated when multiple diseases
are under consideration and when multiple test results may occur.97 For example,
consider blood in the urine (hematuria), which could be caused by a urinary tract
infection, a kidney stone, or a bladder cancer, among many other diseases. The
95. Nathaniel Hupert et al., Accuracy of Screening for Inhalational Anthrax After a Bioterrorist Attack,
139 Annals Internal Med. 337–45 (2003).
96. Klemens M. Meyer & Stephen G. Pauker, Screening for HIV: Can We Afford the False Positive
Rate? 317 New Eng. J. Med. 238–41 (1987).
97. Kassirer et al., supra note 73, at 21–22.
terms “sensitivity” and “specificity” are no longer appropriate because disease is
not simply present or absent. Instead, they are replaced by conditional probabilities; that is, sensitivity is replaced by the likelihood of blood in the urine
with a urinary tract infection, or with a kidney stone, or with a bladder cancer.
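With several candidate diagnoses, Bayes’ rule normalizes the product of each prior and conditional probability across all of the hypotheses; the sketch below uses invented priors and conditional probabilities purely for illustration, not clinical data:

```python
# Multi-hypothesis Bayes' rule for blood in the urine (hematuria).
# All numbers below are hypothetical, chosen only to show the mechanics.
priors = {"urinary tract infection": 0.60,
          "kidney stone": 0.30,
          "bladder cancer": 0.10}
p_hematuria_given = {"urinary tract infection": 0.40,
                     "kidney stone": 0.80,
                     "bladder cancer": 0.90}

joint = {d: priors[d] * p_hematuria_given[d] for d in priors}
total = sum(joint.values())                      # normalizing constant
posterior = {d: joint[d] / total for d in priors}

for disease, p in posterior.items():
    print(f"{disease}: {p:.0%}")
```

The posterior probabilities sum to one across the candidate diagnoses, which is the multi-disease analogue of the two-row normalization in Table 2.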
Similarly, a very positive test has a different interpretation than a weakly positive test, and Bayes’ rule can quantify the difference. Results from multiple tests
can be combined with Bayes’ rule by applying Bayes’ rule to the first test result
and then reapplying Bayes’ rule to subsequent test results. This approach assumes
that the result of the first test does not affect the test characteristics (sensitivity
or specificity) of the second test (i.e., that there is conditional independence of
each test). When two tests are available, screening will usually occur first with the high-sensitivity test, which detects a high proportion of those with disease (true positives), so that a negative result helps “rule out” disease. Those with a positive first test will then undergo a high-specificity test, which reduces the number of individuals who do not have disease but had a positive first test (false positives), so that a positive result helps “rule in” disease.
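Sequential application of Bayes’ rule under conditional independence can be sketched as follows (the prevalence and test characteristics below are hypothetical, chosen only to illustrate a high-sensitivity screen followed by a high-specificity confirmation):

```python
def update(prior, sensitivity, specificity, positive=True):
    """One application of Bayes' rule for a positive or negative result."""
    if positive:
        tp, fp = prior * sensitivity, (1 - prior) * (1 - specificity)
    else:
        tp, fp = prior * (1 - sensitivity), (1 - prior) * specificity
    return tp / (tp + fp)

# Assuming conditional independence, the posterior after the first test
# becomes the prior for the second.
p = 0.01                                               # hypothetical prevalence
p = update(p, sensitivity=0.99, specificity=0.90)      # high-sensitivity screen
p = update(p, sensitivity=0.85, specificity=0.99)      # high-specificity confirmation
print(f"{p:.0%}")  # prints "89%"
```

Two positive results raise a 1% prior to nearly 90%, but only because the second test’s characteristics are assumed not to depend on the first test’s result.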
3. Causal reasoning
To select the most appropriate therapy, physicians seek to identify the cause of a
patient’s complaints and findings. While considering the presence or absence of risk
factors (e.g., the presence of male gender, advanced age, high cholesterol, high blood
pressure, diabetes mellitus, and smoking for the medical condition coronary heart
disease), physicians will often use any type of evidence98 that might support causation, for example, biological plausibility,99 physiological drug effects, case reports, or
temporal proximity100 to an exposure.101 Although physicians use epidemiological
studies in their decisionmaking, “they are accustomed to using any reliable data to
assess causality, no matter what their source” because they must make care decisions
even in the face of uncertainty.102 The courts, in contrast, require a higher standard than clinicians or regulators: causation cannot merely be “possible”; rather, “a ‘preponderance of evidence’ establishes that an injury was caused by an alleged exposure.”103 For physicians, causal reasoning typically involves
98. Jerome P. Kassirer & Joe S. Cecil, Inconsistency in Evidentiary Standards for Medical Testimony:
Disorder in the Courts, 288 JAMA 1382–87 (2002) (hereinafter “Kassirer & Cecil”); see also Section
IV.C.2, for levels of evidence.
99. See Kennan v. Sec’y of Health & Human Servs., 2007 WL 1231592 (Ct. Fed. Cl. Apr. 5,
2007).
100. But see Wilson v. Taser Int’l, Inc., 303 F. App’x 708, 714 (11th Cir. 2008) (“[A]lthough
a doctor usually may primarily base his opinion as to the cause of a plaintiff’s injuries on this history
where the patient ‘has sustained a common injury in a way that it commonly occurs,’ . . . Dr. Meier
could not rely upon the temporal connection between the two events to support his causation opinion
in this case.”).
101. Kassirer & Cecil, supra note 98, at 1384.
102. Id. at 1394.
103. Id. at 1384.
understanding how abnormalities in physiology, anatomy, genetics, or biochemistry
lead to the clinical manifestations of disease. Through such reasoning, physicians
develop a “causal cascade” or “chain or web of causation” linking a sequence
of plausible cause-and-effect mechanisms to arrive at the pathogenesis or pathophysiology of a disease. For example, kidney failure leads to poor drug excretion,
resulting in symptoms or signs of drug toxicity.104 Although probabilistic reasoning
typically dominates initial hypothesis generation by physicians based on prevalence
or incidence, pattern recognition of concomitant symptoms and signs could trigger a
diagnosis. For example, cough, lung lesions, and enlarged breasts (gynecomastia) in a
37-year-old man could trigger the diagnosis of metastatic germ cell cancer.105 More
typically, physicians use causal reasoning in diagnostic refinement and verification
to examine a diagnosis for its coherency, namely, asking whether its physiological
mechanism would be expected to lead to the observed manifestations and whether
it is adequate to account for all normal and abnormal findings and the disease time
course. Once treatment has been implemented, physicians must make causal judgments in determining whether an alteration in patient status is the result of progression of disease or an adverse consequence of treatment, or whether the absence of
improvement results from therapeutic ineffectiveness that should prompt a change
in therapy or even reconsideration of the diagnosis.
Pathophysiological reasoning, however, also can lead to incorrect conclusions. In patients with heart failure with a weakened heart, a class of medications
called beta blockers had been thought to be contraindicated because beta blockers
would decrease the strength of the heart muscle contraction. Subsequent studies
found that beta blockers in patients with heart failure usually had no ill effect and
actually increased survival. Similarly, physicians once thought that atherosclerotic
blockages in heart arteries slowly progressed to cause a heart attack, so that
revascularizing those plaques through heart bypass surgery would prevent heart
attacks.106 Over the past 15 years, however, scientific evidence has emerged that
small vulnerable atherosclerotic plaques (not amenable to revascularization because
of their small size) can suddenly rupture and cause heart attacks. Not surprisingly,
revascularization trials involving either bypass surgery or percutaneous interventions such as stenting or angioplasty do not diminish the risk of having a heart
attack or improve survival for most patients.107
Although treating physicians108 may testify with regard to both general and
specific causation, as with use of evidence for causation, their standards for evidence vary.109
104. Kassirer et al., supra note 73, at 63–66.
105. Id. at 29.
106. David S. Jones, Visions of a Cure: Visualization, Clinical Trials, and Controversies in Cardiac Therapeutics, 1968–1998, 91 Isis 504–41 (2000).
107. Thomas A. Trikalinos et al., Percutaneous Coronary Interventions for Non-acute Coronary Artery Disease: A Quantitative 20-Year Synopsis and a Network Meta-analysis, 373 Lancet 911–18 (2009).
108. See generally Bland v. Verizon Wireless, LLC, 538 F.3d 893 (8th Cir. 2008) (upholding the district court’s decision to reject a treating physician’s evidence of causation under Daubert).
For example, some physicians may stop using a drug after the first
reports of adverse effects, and others may continue to use a drug despite evidence
of harm from randomized controlled trials. Determining whether an effect is a
class effect or drug specific can be difficult. When considering beta blockers for
patients with a weakened heart (heart failure), many studies have consistently
demonstrated the benefit of beta blockers in reducing mortality in those with
heart attacks often resulting in weakened heart function. However, in a randomized trial limited to patients with documented weakened heart, one particular beta
blocker was found to not confer a survival benefit, and as a result the heart failure
guidelines limited their beta blocker recommendation to just those three drugs
with documented mortality benefit in trials.110
Although treating physicians may be aware of patient-specific risk factors
such as smoking or family history, they may not routinely review specialized
aspects of such data, for example, toxicology, industrial hygiene, environment,
and some aspects of epidemiology. Additional experts may assist in distinguishing
general from specific causation by using their specialized knowledge to weigh the
relative contribution of each putative causative factor to determine “reasonable
medical certainty” or “reasonable medical probability.” The determination of
general causation involves medical and scientific literature review and the evaluation of epidemiological data, toxicological data, and dose–response relationships.
Consider for example, hormone replacement therapy for postmenopausal women.
Multiple observational studies using methods such as case-control, cross-sectional,
and cohort designs111 suggested an association between hormone therapy and
reduction in heart attack, but such designs are subject to confounding and bias
and are particularly weak for causation because in case-control and cross-sectional
studies, the sequence of the exposure and outcome is unknown. To resolve the
question, the Women’s Health Initiative (WHI) study randomized women to hormone replacement therapy or placebo and found a statistically significant increase
in clot-related disorders—heart attack, stroke, and heart-related mortality over
5 years but most notable in the first year after initiation of hormone therapy.112
Heart attacks are caused by blood clots and plaque rupture, and so the results
were consistent with the known biological mechanism of estrogens in the clotting cascade. However, patients in the WHI were, on average, 63 years old and
therefore not perimenopausal as analyzed in the observational studies. In a novel
109. Kassirer & Cecil, supra note 98, at 1384.
110. Mariell Jessup et al., 2009 Focused Update: ACCF/AHA Guidelines for the Diagnosis and
Management of Heart Failure in Adults: A Report of the American College of Cardiology Foundation/American
Heart Association Task Force on Practice Guidelines, 119 Circulation 1977–2016 (2009).
111. See Michael D. Green et al., Reference Guide on Epidemiology, in this manual.
112. Jacques E. Rossouw et al., Risks and Benefits of Estrogen Plus Progestin in Healthy Postmenopausal
Women: Principal Results from the Women’s Health Initiative Randomized Controlled Trial, 288 JAMA
321–33 (2002); JoAnn E. Manson et al., Estrogen Plus Progestin and the Risk of Coronary Heart Disease,
349 New Eng. J. Med. 523–34 (2003).
approach, the observational Nurses’ Health Study attempted to emulate the design
and intention-to-treat (ITT) analysis aspect of the WHI randomized trial, and
found that the hormone replacement treatment effects were similar to those from
the randomized trial, suggesting that “the discrepancies between the WHI and
the Nurses’ Health Study ITT estimates could be largely explained by differences
in the distribution of time since menopause and length of followup.”113
B. Testing
1. Screening
Screening on a population basis requires that (1) the condition be present in the population and affect quality and length of life; (2) the incidence or prevalence be sufficiently high to justify any risks associated with the test; (3) preventive or early treatment be available; (4) an asymptomatic period exist to permit early detection; (5) the screening test be accurate, acceptable, and affordable; and (6) screening benefits exceed harms. Screening for disease in asymptomatic, otherwise healthy patients has become widely accepted and promulgated.114
Screening differs from diagnostic testing used to elucidate the cause of symptoms
or loss of function because screening involves apparently healthy individuals.115
Although screening may prevent the development of disease-related morbidity
and mortality, positive test results (both false positive and true positive) may lead
to interventions that could be unnecessary or even risky because of overdiagnosis
and overtreatment.116
Normal ranges for biochemical tests are often defined as the central 95% of values in a normal, healthy population; that is, even though everyone in the reference population is healthy, by convention, values below the 2.5th percentile or above the 97.5th percentile are considered abnormal. Consequently, ordering six blood tests in a normal, healthy individual yields only a 74% chance that all six tests will be normal; that is, there is a 26% chance that one or more will be abnormal. Similarly, when ordering 12 tests in a normal person, there is a 54% chance that all 12 will be normal and a 46% chance that 1 or more will be abnormal. So simply ordering tests in healthy individuals, or in the absence of clinical suspicion of a disease, may result in many
113. Miguel A. Hernán et al., Observational Studies Analyzed Like Randomized Experiments: An
Application to Postmenopausal Hormone Therapy and Coronary Heart Disease, 19 Epidemiology 766–79
(2008).
114. Lisa M. Schwartz et al., Enthusiasm for Cancer Screening in the United States, 291 JAMA
71–78 (2004).
115. David A. Grimes & Kenneth F. Schulz, Uses and Abuses of Screening Tests, 359 Lancet 881–84
(2002) (hereinafter Grimes and Schulz); William C. Black, Overdiagnosis: An Under Recognized Cause of
Confusion and Harm in Cancer Screening, 92 J. Nat’l Cancer Inst. 1280–82 (2000) (hereinafter “Black”).
116. Grimes & Schulz, supra note 115, at 884; Black, supra note 115, at 1280.
false-positive test results that can lead to false alarms, anxiety, additional testing,
and possible morbidity or mortality from subsequent testing or interventions.117
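The arithmetic behind these figures can be sketched in a few lines, under the simplifying assumption that the tests are statistically independent:

```python
# Probability that all of n independent tests fall within a reference
# range that, by construction, covers 95% of healthy values.
def prob_all_normal(n_tests: int, per_test_normal: float = 0.95) -> float:
    return per_test_normal ** n_tests

for n in (6, 12):
    p_all = prob_all_normal(n)
    print(f"{n} tests: P(all normal) = {p_all:.2f}, "
          f"P(at least one abnormal) = {1 - p_all:.2f}")
```

This reproduces the 74%/26% and 54%/46% figures quoted above; real test panels are correlated, so the independence assumption is an idealization.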
Even a valueless screening test may appear to be beneficial because of “lead-time bias.” If screened and unscreened patients have the same prognosis from the time of onset of symptoms to death, then screened patients only appear to live longer because the time elapsed from diagnosis by screening to death exceeds the time from diagnosis made at symptom onset to death. A second bias, “length bias,” also leads to overestimation of the benefit from screening.118 Suppose that a
randomized trial of screening or no screening is conducted over a limited length of
time from study initiation to termination. The screening test detects patients with
both aggressive and indolent forms of the disease. Among the unscreened patients,
however, disease only becomes evident through the development of symptoms,
which would be more likely in patients who have the aggressive form of the disease and a poorer prognosis. Thus screened patients with disease appear to have
a better prognosis than unscreened patients with disease because a higher proportion of the screened patients have more indolent disease. Extending the concept
of length bias further, screening can result in “pseudodisease” or “overdiagnosis,”
such as the identification of slow-growing cancers that even if untreated would
never cause symptoms or reduce survival.119 Although lung cancer is commonly
thought to be one of the more aggressive cancers, an autopsy study found that
one-third of lung cancers were unsuspected prior to autopsy, and nearly all of the patients with these unsuspected cancers died from other causes.120
Lung cancer screening in these individuals would have resulted in pseudodisease
or overdiagnosis because screening would have diagnosed their cancer but they
would have died of something else (or from a severe adverse effect of the cancer
treatment) before the cancer became evident.
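A small numerical sketch makes lead-time bias concrete; the ages below are invented for illustration:

```python
# Hypothetical patient whose disease course is identical with or without
# screening; only the date of diagnosis moves earlier.
age_at_death = 70
age_at_symptoms = 65        # diagnosis without screening
age_at_screen_detect = 62   # earlier diagnosis via screening (assumed)

survival_unscreened = age_at_death - age_at_symptoms       # 5 years
survival_screened = age_at_death - age_at_screen_detect    # 8 years

# Survival measured from diagnosis looks 3 years better with screening,
# yet the patient dies at exactly the same age: the gain is pure lead time.
lead_time = survival_screened - survival_unscreened
print(lead_time)  # 3
```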
To further illustrate bias in screening studies, the Mayo Lung Project was a
randomized trial comparing screening for lung cancer with periodic chest X rays
and sputum samples versus usual care. It found that screening did improve the likelihood of survival 5 years after diagnosis in those with lung cancer but, surprisingly, did not reduce lung cancer deaths. Further analysis of the randomized trial
found that the survival advantage of screening was attributable to the 46 extra
117. A radiologist described his own experience to illustrate the clinical aphorism that “the
only ‘normal’ patient is one who has not yet undergone a complete work-up.” He had a negative
CT scan of the colon examination, but the CT scan also provided images outside the liver with
radiologists identifying lesions in the kidneys, liver, and lungs. This resulted in additional CT scans, a
liver biopsy, PET scan, video-aided thoracoscopy (a flexible scope inserted into the chest), and three
wedge resections of the lung leading to multiple tubes, medications, and “excruciating pain” that
required 5 weeks for recovery. William J. Casarella, A Patient’s Viewpoint on a Current Controversy, 224
Radiology 927 (2002).
118. Grimes & Schulz, supra note 115, at 884.
119. Black, supra note 115, at 1280.
120. Charles K. Chan et al., More Lung Cancer but Better Survival: Implications of Secular Trends in
“Necropsy Surprise” Rates, 96 Chest 291–96 (1989).
lung cancer cases detected by screening. These 46 cases had indolent (or, at worst,
very slowly progressive) lung cancer; that is, these patients would have had a normal life expectancy, and so including their prognosis among those with screen-detected
lung cancer inflates the apparent 5-year survival with screening because of length
bias and overdiagnosis.121 More recently, CT scan screening found lung cancer
to be present in the same proportion of nonsmokers as smokers,122 suggesting
that many of the cancers detected in the nonsmokers were ones that would have
never progressed. This overdiagnosis can lead to morbidity and mortality: CT scan
screening for lung cancer results in a threefold increase in diagnosis and a threefold increase in surgery, with an average surgical mortality of 5% and a serious complication rate exceeding 20%,123 as well as potential risk from radiation exposure. A
similar phenomenon occurs with breast cancer, where screening increases surgeries by about one-third because of overdiagnosis, and with prostate cancer, where the lifetime risk of dying from prostate cancer is about 3%, yet 60% of men in their sixties have prostate cancer; screening and detecting all men with prostate cancer in their sixties would therefore lead to treatment of many men who would never have died from prostate cancer.124 In patients found to have cancer by screening, it is not
possible to distinguish those whose cancers would have progressed from those in
whom the cancer-appearing cells would not have progressed or spread.
2. Diagnostic testing
Based on the history and physical examination, physicians will establish diagnostic
possibilities. They may then request additional tests to reduce uncertainty and
to confirm the diagnosis, as part of diagnostic verification. Although, theoretically, all tests could be ordered, tests should be chosen on the basis of a clinical
suspicion because of possible morbidity or even mortality from inappropriate
testing. Normative (prescriptive) decision models for reasoning in the presence of uncertainty suggest that whether and which tests are ordered should depend not only on the sensitivity and specificity of the test, as discussed in Section IV.A.2, supra, but also on the risk of mortality or morbidity from the test and the benefit and risk of
treatment.125 In general, for sufficiently low probabilities of disease, no tests should
be ordered and no treatment given. For sufficiently high probabilities of disease,
121. Black, supra note 115.
122. William C. Black & John A. Baron, CT Screening for Lung Cancer: Spiraling into Confusion?
297 JAMA 995–97 (2007).
123. Id. at 996.
124. Karsten J. Jørgensen & Peter C. Gøtzsche, Overdiagnosis in Publicly Organised Mammography
Screening Programmes: Systematic Review of Incidence Trends, 339 BMJ b2587 (2009); Michael J. Barry,
Prostate-Specific–Antigen Testing for Early Diagnosis of Prostate Cancer, 344 New Eng. J. Med. 1373–77
(2001).
125. Stephen G. Pauker & Jerome P. Kassirer, The Threshold Approach to Clinical Decision Making,
302 New Eng. J. Med. 1109–17 (1980).
testing is unnecessary and treatment should be administered. For intermediate
probabilities of disease, testing should be performed. When testing carries risks,
the range of probabilities of disease for which testing should be done becomes narrower, and so physicians should be more likely to treat empirically or to neither test nor treat.
As sensitivity and specificity increase, the range of probabilities in which testing
should be done expands.
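The threshold logic can be sketched as follows. The decision rule follows the general form of Pauker and Kassirer’s threshold approach, but the benefit, harm, prior probability, and test characteristics below are illustrative assumptions, not values from the manual:

```python
# Threshold approach sketch: treat when the probability of disease exceeds
# harm / (harm + benefit); a test is useful only when its result could move
# the probability across that threshold.

def treatment_threshold(benefit: float, harm: float) -> float:
    # benefit: net gain from treating a diseased patient
    # harm: net loss from treating a non-diseased patient
    return harm / (harm + benefit)

def post_test_probability(prior: float, sens: float, spec: float,
                          positive: bool) -> float:
    # Bayes' rule using the test's sensitivity and specificity.
    if positive:
        num = sens * prior
        den = num + (1 - spec) * (1 - prior)
    else:
        num = (1 - sens) * prior
        den = num + spec * (1 - prior)
    return num / den

# Illustrative numbers: treating disease is 4x as beneficial as mistreating
# a healthy patient is harmful, so the treatment threshold is 0.20.
threshold = treatment_threshold(benefit=4.0, harm=1.0)   # 0.2
# A positive test moves a 10% prior above the threshold (treat); a negative
# test would drop it to about 1% (neither test result is wasted).
p_pos = post_test_probability(0.10, sens=0.90, spec=0.95, positive=True)
print(threshold, round(p_pos, 2))  # 0.2 0.67
```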
Although an abnormal test result may be found, that abnormality may not
be causing symptoms. For example, herniated lumbar discs are found in approximately 25% of healthy individuals without back pain; thus finding a herniated
disc in patients with back pain may be an incidental finding. If signs such as a
foot drop develop, additional muscle and nerve conduction studies might confirm
evidence of nerve compromise from the herniated disc, but such tests are painful.
Over time, sequential images show partial or complete resolution of the herniated disc after 6 months without surgery. Therefore, a herniated disc may be
seen with CT or MRI scanning in patients with or without symptoms, and so
just having symptoms and evidence of a herniated disc would be an insufficient
indication for back surgery.126 In the absence of severe or progressive neurological deficits, elective disc surgery could be considered for patients with probable
herniated discs who have persistent symptoms and findings consistent with sciatica
(not just low back pain) for 4 to 6 weeks, but such “patients should be involved
in decision making” (see Section IV.D.3, infra).127
Just as some therapies may eventually be found to be harmful or not beneficial, tests initially felt to be useful may be found to be less valuable.128 Among
other potential biases,129 this may occur because of the choice of study population
used to determine the test’s sensitivity and specificity. For example, an FDA-approved rapid test for HIV infection has a reported specificity of 100%, implying that any positive test must indicate a truly infected individual, yet one of the populations in which testing is recommended is women who have had prior children and are in labor but have not yet had an HIV test during the pregnancy.130 Among 15 such multiparous women, this rapid HIV test produced one false-positive result, yielding a specificity of 93%,131 and so not all pregnant women with positive tests can be assumed to be truly infected.
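The consequence of that specificity difference can be sketched with Bayes’ rule. The prevalence and sensitivity figures below are assumptions chosen for illustration, not values from the FDA summary:

```python
# Positive predictive value: the probability that a positive result is a
# true positive, given prevalence, sensitivity, and specificity.
def positive_predictive_value(prevalence: float, sensitivity: float,
                              specificity: float) -> float:
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# With perfect specificity, every positive result is a true positive.
print(positive_predictive_value(0.005, 0.99, 1.00))         # 1.0
# At 93% specificity and 0.5% prevalence, most positives are false.
print(round(positive_predictive_value(0.005, 0.99, 0.93), 2))  # 0.07
```

The sketch shows why a seemingly small drop in specificity matters so much in low-prevalence screening populations.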
126. Richard A. Deyo & James N. Weinstein, Low Back Pain, 344 New Eng. J. Med. 363–70
(2001); Richard A. Deyo et al., Trends, Major Medical Complications, and Charges Associated with Surgery
for Lumbar Spinal Stenosis in Older Adults, 303 JAMA 1259–65 (2010).
127. Deyo & Weinstein, supra note 126, at 368.
128. David F. Ransohoff & Alvan R. Feinstein, Problems of Spectrum and Bias in Evaluating the
Efficacy of Diagnostic Tests, 299 New Eng. J. Med. 926–30 (1978).
129. Penny Whiting et al., Sources of Variation and Bias in Studies of Diagnostic Accuracy: A
Systematic Review, 140 Annals Internal Med. 189–202 (2004).
130. Food and Drug Administration, OraQuick® Rapid HIV-1 Antibody Test, available at
http://www.fda.gov/downloads/BiologicsBloodVaccines/BloodBloodProducts/ApprovedProducts/
PremarketApprovalsPMAs/ucm092001.pdf (last visited Mar. 2, 2011).
131. Id.
3. Prognostic testing
Once a diagnosis has been established, additional prognostic testing may be performed to establish the extent of disease (e.g., staging of a cancer) or to monitor
response to therapy. Molecular profiling of disease may characterize not only prognosis but also treatment response. In women with breast cancer, for example, finding a genetic marker, the human epidermal growth factor receptor type 2 (HER2, also called HER2/neu) gene, identified patients who responded poorly to any of the standard chemotherapeutic agents and hence had a poor prognosis.
Illustrative of the emerging era of pharmacogenomics, adjuvant chemotherapy
combined with a monoclonal antibody in HER2-positive breast cancer patients
has been found to delay progression and prolong survival.132
C. Judgment and Uncertainty in Medicine
1. Variation in medical care
Studies over the past several decades show substantial geographic variation in the
utilization rates for medical care within small areas or local regions (e.g., a three- to fourfold variation in the use of surgical procedures such as tonsillectomy when
comparing children living in adjacent areas of similar demographics)133 and between
large areas or widespread regions (e.g., a 10-fold variation in the performance of
other discretionary surgical procedures such as lower extremity revascularization,
carotid endarterectomy, back surgery, and radical prostatectomy).134 Even when
limiting the analysis to 77 U.S. hospitals with reputations for high-quality care in
managing chronic illness, the care that patients received in their last 6 months of
life varied extensively: hospital stays ranged from 9 to 27 days (threefold variation), intensive care unit stays from 2 to 10 days (fivefold variation), and physician visits from 18 to 76 (fourfold variation), depending on the hospital at which patients
received their care.135
Four categories of variation are recognized: (1) underuse of effective care,
(2) issues of patient safety, (3) concern for preference-sensitive care, and (4) notions
of supply-sensitive services.136 Effective care refers to treatments that are known
to be beneficial and that nearly all patients should receive with little influence
132. Dennis J. Slamon et al., Use of Chemotherapy Plus a Monoclonal Antibody Against HER2 for
Metastatic Breast Cancer That Overexpresses HER2, 344 New Eng. J. Med. 783–92 (2001).
133. John Wennberg & Alan Gittelsohn, Small Area Variations in Health Care Delivery, 182
Science 1102–08 (1973) (hereinafter “Wennberg & Gittelsohn”).
134. John D. Birkmeyer et al., Variation Profiles of Common Surgical Procedures, 124 Surgery
917–23 (1998).
135. John E. Wennberg et al., Use of Hospitals, Physician Visits, and Hospice Care During Last Six
Months of Life Among Cohorts Loyal to Highly Respected Hospitals in the United States, 328 BMJ 607 (2004).
136. John E. Wennberg, Unwarranted Variations in Healthcare Delivery: Implications for Academic
Medical Centres, 325 BMJ 961–64 (2002) (hereinafter “Wennberg”).
of patient preferences, for example, use of beta blockers following myocardial
infarction. The underuse of effective care was illustrated by one prominent study that
identified 439 high-quality process measures for 30 conditions and preventive
care. In assessing the use of measures that were clearly recommended (i.e., clearly beneficial), the investigators found that only about 50% of patients received these highly recommended care processes.137 Issues of patient safety refer to the execution of care
and the occurrence of iatrogenic complications (i.e., complications resulting from
health care interventions). The IOM estimates that hospitalized patients risk one
medication error for every day they are hospitalized, resulting in an estimated
7000 deaths annually (more than from workplace injuries) at an annual cost of
$3.5 billion in 2006 dollars.138 Concern for preference-sensitive care refers to treatment
choices that should depend on patient health goals or preferences. Prostate surgery
helps relieve symptoms of an enlarged prostate (such as frequent urination, waking
up at night to urinate) but carries a risk of losing sexual function. Separate from
the probability of losing sexual function, in preference-sensitive care, the decision
to have prostate surgery depends on how much the enlarged prostate symptoms
bother the patient and on how important sexual function is to him, that is, on his preferences and values.139 Finally, supply-sensitive services refer to care that depends
not on evidence of effectiveness or patient preferences, but rather on the availability
of services. Specifically, patients living in areas with more doctors or more hospitals
experience more office visits, tests, and hospitalizations.140
2. Evidence-based medicine
The exceptional variation in the delivery of medical care was a major factor that
led to a careful reexamination of physician diagnostic strategies, therapeutic decision making, and the use of medical evidence, but it was not the only one. Other
circumstances that set the stage for an intense focus on medical evidence included
(1) the development of medical research methods, including randomized controlled trials and observational study designs; (2) the growth of diagnostic and therapeutic
interventions;141 (3) interest in understanding medical decisionmaking and how
physicians reason;142 and (4) the acceptance of meta-analysis as a method to com-
137. Elizabeth A. McGlynn et al., The Quality of Health Care Delivered to Adults in the United
States, 348 New Eng. J. Med. 2635–45 (2003).
138. Committee on Identifying and Preventing Medication Errors, Institute of Medicine,
Preventing Medication Errors (2006); 2000 CQHCA Report, supra note 58.
139. Michael J. Barry et al., Patient Reactions to a Program Designed to Facilitate Patient Participation
in Treatment Decisions for Benign Prostatic Hyperplasia, 1995 Med. Care 771–82 (1995).
140. Wennberg, supra note 136, at 142.
141. Cynthia D. Mulrow & K.N. Lohr, Proof and Policy from Medical Research Evidence, 26 J.
Health Pol., Pol’y & L. 249–66 (2001) (hereinafter “Mulrow & Lohr”).
142. Robert S. Ledley & Lee B. Lusted, Reasoning Foundations of Medical Diagnosis; Symbolic Logic,
Probability, and Value Theory Aid Our Understanding of How Physicians Reason, 130 Science 9–21 (1959).
bine data from multiple randomized trials.143 In response to the above conditions,
“evidence-based medicine” gained prominence in 1992.144 It is aptly defined as
“the conscientious, explicit and judicious use of current best evidence in making
decisions about the care of the individual patient. It means integrating individual
clinical expertise with the best available external clinical evidence from systematic
research.”145
Evidence-based medicine contrasts with the traditional informal method of
practicing based on anecdotes, applying the most recently read articles, doing
what a group of eminent experts recommend, or minimizing costs.146 Rather,
it is “the use of mathematical estimates of the risks of benefit and harm, derived
from high-quality research on population samples, to inform clinical decision
making in the diagnosis, investigation or management of individual patients.”147
In a paper from a joint workshop held by IOM and the Agency for Healthcare
Research and Quality148 that addressed what physicians consider to be sufficient
evidence to justify their clinical practice and treatment decisions, Mulrow and
Lohr wrote “evidence-based medicine stresses a structured critical examination of
medical research literature: relatively speaking, it deemphasizes average practice
as an adequate standard and personal heuristics.”149
3. Hierarchy of medical evidence
With the explosion of available medical evidence, increased emphasis has been
placed on assembling, evaluating, and interpreting medical research evidence.
A fundamental principle of evidence-based medicine (see also Section IV.C.5,
infra) is that the strength of medical evidence supporting a therapy or strategy
is hierarchical. When ordered from strongest to weakest, systematic review of
randomized trials (meta-analysis) is at the top, followed by single randomized
trials, systematic reviews of observational studies, single observational studies,
143. See Michael D. Green et al., Reference Guide on Epidemiology, Section VI, in this manual;
Video Software Dealers Ass’n v. Schwarzenegger, 556 F.3d 950, 963 (9th Cir. 2009) (analyzing a metaanalysis of studies on video games and adolescent behavior); Kennecott Greens Creek Min. Co. v.
Mine Safety & Health Admin., 476 F.3d 946, 953 (D.C. Cir. 2007) (reviewing the Mine Safety and
Health Administration’s reliance on epidemiological studies and two meta-analyses).
144. Evidence-Based Medicine Working Group, Evidence-Based Medicine. A New Approach to
Teaching the Practice of Medicine, 268 JAMA 2420–25 (1992).
145. David L. Sackett et al., Evidence Based Medicine: What It Is and What It Isn’t, 312 BMJ
71–72, 71 (1996).
146. Trisha Greenhalgh, How to Read a Paper: The Basics of Evidence-Based Medicine (3d
ed. 2006).
147. Id. at 1.
148. Clark C. Havighurst et al., Evidence: Its Meanings in Health Care and in Law, 26 J. Health
Pol., Pol’y & L. 195–215 (2001).
149. Mulrow & Lohr, supra note 141, at 253.
physiological studies, and unsystematic clinical observations.150 An analysis of the
frequency with which various study designs are cited by others provides empirical evidence supporting the influence of meta-analysis followed by randomized
controlled trials in the medical evidence hierarchy.151 Although they are at the
bottom of the evidence hierarchy, unsystematic clinical observations or case
reports may be the first signals of adverse events or associations that are later
confirmed with larger or controlled epidemiological studies (e.g., aplastic anemia
caused by chloramphenicol,152 or lung cancer caused by asbestos153). Nonetheless,
subsequent studies may not confirm initial reports (e.g., the putative association
between coffee consumption and pancreatic cancer).154
Just as in laboratory experiments, evidence about the benefits and risks of
medical interventions arises through repetitive observations. A single randomized controlled trial relies on hypothesis testing, specifically assuming the null
hypothesis that a new drug is equivalent to the comparator (e.g., placebo). As
conceived nearly 100 years ago, interpreting the trial involved calculating the
likelihood of the alpha error (p-value) wherein the study suggests that the drug or
device is beneficial but the “truth” is that it is not, that is, a false-positive study
result. Similarly, a beta error (1 minus power) is the likelihood of a study finding
that the drug or device is not beneficial when the “truth” is that it is, that is, a
false-negative study result (Table 3).
Table 3. Analogy Between Interpreting a Diagnostic Test and a Drug Study

                                         Truth
                  Drug +                             Drug −
Study +    Power (true positive)              α Type I error (false positive)
Study −    β Type II error (false negative)   True negative
The choice of which specific error rates to use (e.g., false positive or p-value
or alpha of 0.05) was supposed to depend on a judgment of the relative consequences of the two errors: missing an effective drug (Type II beta error) or
150. Gordon H. Guyatt et al., Users’ Guides to the Medical Literature: A Manual for EvidenceBased Clinical Practice (2d ed. 2008) (hereinafter “Guyatt”); see also Michael D. Green et al.,
Reference Guide on Epidemiology, in this manual.
151. Nikolaos A. Patsopoulos et al., Relative Citation Impact of Various Study Designs in the Health
Sciences, 293 JAMA 2362–66 (2005).
152. W.T.W. Clarke, Fatal Aplastic Anemia and Chloramphenicol, 97 Can. Med. Ass’n J. 815 (1967)
(hereinafter “Clarke”).
153. Michael Gochfeld, Asbestos Exposure in Buildings, Envtl. Med. 438, 440 (1995).
154. Brian MacMahon et al., Coffee and Cancer of the Pancreas, 304 New Eng. J. Med. 630–33
(1981) (hereinafter “MacMahon”).
considering an ineffective drug to be effective (Type I alpha error).155 The null
hypothesis, however, assumes equivalence, and so it does not provide any measure
of evidence outside of the particular study (e.g., prior studies or biological mechanism or plausibility). Thus, the null hypothesis assumption necessitates abandoning
the ability to measure evidence or determine “truth” from a single experiment,
so that hypothesis testing is “equivalent to a system of justice that is not
concerned with which individual defendant is found guilty or innocent (that is,
‘whether each separate hypothesis is true or false’) but tries instead to control the
overall number of incorrect verdicts.”156 From a Bayesian perspective, the interpretation of a new study depends on whether prior studies showed benefit or harm
and on the existence of a biological mechanism or plausibility (e.g., the association
between coffee consumption and pancreatic cancer was a “false-positive” result
because in further testing the initial finding was not validated and there was no
known plausible biological mechanism).157
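This Bayesian point can be sketched numerically: the probability that a statistically significant result reflects a real effect depends on the prior plausibility of the hypothesis, not only on alpha. The alpha, power, and prior values below are illustrative assumptions:

```python
# Probability that a "positive" (statistically significant) study reflects
# a true effect, given the prior probability that the effect exists.
def prob_effect_given_positive(prior: float, alpha: float = 0.05,
                               power: float = 0.80) -> float:
    true_pos = power * prior          # real effect, study detects it
    false_pos = alpha * (1 - prior)   # no effect, study "detects" one anyway
    return true_pos / (true_pos + false_pos)

# A biologically plausible hypothesis (prior 50%) vs. a long shot (prior 5%):
print(round(prob_effect_given_positive(0.50), 2))  # 0.94
print(round(prob_effect_given_positive(0.05), 2))  # 0.46
```

Under these assumed numbers, a significant result for an implausible hypothesis (like the coffee-pancreatic cancer association) is nearly as likely to be false as true.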
Cumulative meta-analysis accumulates randomized trial evidence over time to examine trends in efficacy or risk, overcoming the problem of underpowered trials that enroll too few patients to reliably
detect a benefit. For example, between 1959 and 1988, 33 randomized trials with
streptokinase for acute myocardial infarction involving over 35,000 patients had
been published. By combining the results of each trial as they occurred, a cumulative meta-analysis found “a consistent, statistically significant reduction in total
mortality” with streptokinase use by 1973.158 In contrast, for many years, physicians used a drug called lidocaine to prevent life-threatening heart rhythm disturbances, yet none of the randomized trials of lidocaine demonstrated any benefit,
and finally cumulative meta-analysis found a trend toward harm. When the results
of meta-analysis were compared with comments in textbooks and review articles,
discrepancies were detected between the meta-analytic patterns of effectiveness
in the randomized trials and the recommendations of reviewers [the review
article author]. Review articles often failed to mention important advances or
exhibited delays in recommending effective preventive measures. In some cases,
treatments that have no effect on mortality or are potentially harmful continued
to be recommended by several clinical experts.159
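The mechanics of a cumulative meta-analysis can be sketched with fixed-effect, inverse-variance pooling: after each new trial, its result is pooled with all earlier trials. The log-odds-ratio estimates and standard errors below are invented for illustration; they are not the streptokinase data.

```python
import math

# Re-pool the evidence after each successive trial (fixed-effect,
# inverse-variance weighting).
def cumulative_pool(effects, ses):
    pooled = []
    for k in range(1, len(effects) + 1):
        weights = [1 / se**2 for se in ses[:k]]
        est = sum(w * e for w, e in zip(weights, effects[:k])) / sum(weights)
        se = math.sqrt(1 / sum(weights))
        pooled.append((est, se))
    return pooled

log_ors = [-0.30, -0.10, -0.25, -0.20]   # hypothetical trial estimates
std_errs = [0.40, 0.25, 0.30, 0.15]      # hypothetical standard errors
for i, (est, se) in enumerate(cumulative_pool(log_ors, std_errs), 1):
    lo, hi = est - 1.96 * se, est + 1.96 * se
    print(f"after trial {i}: pooled log OR {est:+.2f} "
          f"(95% CI {lo:+.2f} to {hi:+.2f})")
```

As trials accumulate, the confidence interval narrows; in the streptokinase example, the pooled interval excluded no effect long before textbooks recommended the drug.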
155. Goodman, supra note 74, at 998.
156. Id. at 998.
157. MacMahon, supra note 154, at 630.
158. Joseph Lau et al., Cumulative Meta-Analysis of Therapeutic Trials for Myocardial Infarction, 327
New Eng. J. Med. 248–54 (1992).
159. Elliott M. Antman et al., A Comparison of Results of Meta-Analyses of Randomized Control
Trials and Recommendations of Clinical Experts: Treatments for Myocardial Infarction, 268 JAMA 240, 240
(1992).
4. Guidelines
Clinical practice guidelines are “systematically developed statements to assist
practitioner and patient decisions about appropriate health care for specific clinical circumstances.”160 Such guidelines have been widely developed and issued
by medical specialty associations, professional societies, government agencies, or
health care organizations.161 To avoid biases inherent in review articles (particularly single-authored ones) and to encourage transparency and acceptance, a standard method to develop clinical practice guidelines has emerged. It involves systematically searching for and reviewing the evidence (summarizing the evidence),
grading the quality of evidence for each outcome (the certainty of the recommendation), and assessing the balance of benefits versus risks (the size of the treatment effect or the strength of the recommendation).162 Additional considerations
include values and preferences (patient health goals) and costs (resource allocation)
where increasing variability or uncertainty in preferences or the presence of higher
costs reduces the likelihood of making a strong recommendation.163 The number,
length, and diversity of guidelines developed by various professional organizations
challenge practicing physicians. An attempt to quantify guideline development
found exponential growth, with 8 guidelines published in 1990, 138 in 1996, and
855 by mid-1997, including 160 that were more than 10 pages long.164
With this proliferation, different professional organizations may issue guidelines on the same topic, but with competing recommendations. The composition of the panel and the processes for developing guideline recommendations
may differ. For example, the U.S. Preventive Services Task Force (USPSTF) is
“an independent panel of non-Federal experts in prevention and evidence-based
medicine and is composed of primary care providers (such as internists, pediatricians, family physicians, gynecologists/obstetricians, nurses, and health behavior
specialists).”165 In their evaluation of mammography, the USPSTF “recommends
against routine screening mammography in women aged 40 to 49 years” (see
160. Committee to Advise the Public Health Service on Clinical Practice Guidelines, Institute of
Medicine, Clinical Practice Guidelines: Directions for a New Program 8 (Marilyn J. Field & Kathleen
N. Lohr, eds. 1994).
161. See generally Sofamor Danek Group v. Gaus, 61 F.3d 929 (D.C. Cir. 1995) (reviewing
guidelines issued by the Agency for Health Care Policy and Research in light of the Federal Advisory
Committee Act); Levine v. Rosen, 616 A.2d 623 (Pa. 1992) (finding that differing guidance from
two groups was evidence that reasonable physicians could follow either school of thought); Michelle
M. Mello, Of Swords and Shields: The Role of Clinical Practice Guidelines in Medical Malpractice Litigation,
149 U. Pa. L. Rev. 645 (2001).
162. David Atkins et al., Grading Quality of Evidence and Strength of Recommendations, 328 BMJ
1490 (2004).
163. Gordon H. Guyatt et al., Going from Evidence to Recommendations, 336 BMJ 1049–51 (2008).
164. Arthur Hibble et al., Guidelines in General Practice: The New Tower of Babel? 317 BMJ
862–63 (1998).
165. U.S. Preventive Services Task Force (USPSTF), Agency for Healthcare Research and
Quality, available at http://www.ahrq.gov/clinic/uspstfix.htm (last visited Mar. 2, 2011).
726
Copyright © National Academy of Sciences. All rights reserved.
Reference Manual on Scientific Evidence: Third Edition
Reference Guide on Medical Testimony
also Section IV.D.2).166 In contrast, based on a writing group composed of its
members who are “directly responsible for performing these screening tests,” the
Society of Breast Imaging and the American College of Radiology recommend
“annual screening from age 40” with mammography for “women at average risk
for breast cancer.”167 Similarly, for prostate cancer screening, the USPSTF update
“concludes that the current evidence is insufficient to assess the balance of benefits
and harms of prostate cancer screening in men younger than age 75 years.”168
In the American Urological Association update, a statement panel composed of
urologists, oncologists, and other physicians made two recommendations: “The
decision to use PSA for the early detection of prostate cancer should be individualized. Patients should be informed of the known risks and the potential benefits”
and “Early detection and risk assessment of prostate cancer should be offered to
asymptomatic men 40 years of age or older who wish to be screened with an
estimated life expectancy of more than 10 years.”169
Practice guidelines provide recommendations on how to evaluate and treat
patients, but because they apply to the general case, their recommendations may
not apply to a particular individual patient, or some extrapolation may be required,
particularly when multiple diseases exist, as they frequently do in the elderly,170 or
when treatment entails competing risks. For example, anticoagulation is generally
recommended for patients with atrial fibrillation (an abnormal heart rhythm) to prevent blood clots that could cause a stroke, yet anticoagulation can
also lead to life-threatening bleeding; therefore, for individual patients, physicians
must weigh the risk of developing clots versus the risk of bleeding. Consequently,
guidelines typically include statements such as “clinical or policy decisions involve
more considerations than this body of evidence alone. Clinicians and policymakers
should understand the evidence but individualize decision making to the specific
patient or situation.”171 Some physicians who rely on personal style, review
articles, and colleagues to influence their clinical practice have been concerned
with how guidelines affect clinical autonomy and health care costs.172
166. U.S. Preventive Services Task Force, Screening for Breast Cancer: U.S. Preventive Services Task
Force Recommendation Statement, 151 Annals Internal Med. 716–26 (2009).
167. Carol H. Lee et al., Breast Cancer Screening with Imaging: Recommendations from the Society
of Breast Imaging and the ACR on the Use of Mammography, Breast MRI, Breast Ultrasound, and Other
Technologies for the Detection of Clinically Occult Breast Cancer, 7 J. Am. C. Radiology 18–27 (2010).
168. U.S. Preventive Services Task Force, Screening for Prostate Cancer: U.S. Preventive Services
Task Force Recommendation Statement, 149 Annals Internal Med. 185–91 (2008).
169. American Urological Association, Prostate-Specific Antigen Best Practice Statement (rev.
2009), available at http://www.auanet.org/content/media/psa09.pdf (last visited Mar. 2, 2011).
170. Cynthia M. Boyd et al., Clinical Practice Guidelines and Quality of Care for Older Patients with
Multiple Comorbid Diseases: Implications for Pay for Performance, 294 JAMA 716–24 (2005).
171. U.S. Preventive Services Task Force, Screening for Carotid Artery Stenosis: U.S. Preventive
Services Task Force Recommendation Statement, 147 Annals Internal Med. 854–59 (2007).
172. Sean R. Tunis et al., Internists’ Attitudes About Clinical Practice Guidelines, 120 Annals Internal
Med. 956–63 (1994).
However, just as clinicians have been reluctant to apply guidelines in practice, courts have generally been slow to apply them in deciding cases.173 There
are political and legal issues that can arise with the development of guidelines.174
Political sensitivities, conflicts of interest, and potential lawsuits often silence
otherwise innovative and potentially useful guidelines. In 2006, the Connecticut
Attorney General launched an antitrust suit against the Infectious Disease Society
of America (IDSA) after IDSA promulgated guidelines recommending against the
use of long-term antibiotics for the treatment of “chronic Lyme disease (CLD).”175
Although the Centers for Disease Control and Prevention and the Food and Drug
Administration (FDA) findings seemed to concur with IDSA’s guidelines, a strong
lobby representing patients afflicted with CLD and the physicians who treated
them colored the Attorney General’s decision to file suit.176 Organizations can
violate antitrust laws if their guideline-setting process is an unreasonable attempt
to advance their members’ economic interests by suppressing competition. IDSA
settled without admitting guilt, but it is clear that organizations must be careful to
maintain transparency in the guideline development process.177
Besides clinical practice guidelines, IOM defines other types of statements:
(1) medical review criteria are systematically developed statements that can be used to
assess the appropriateness of specific health care decisions, services, and outcomes;
(2) standards of quality are authoritative statements of minimum levels of acceptable
performance or results, excellent levels of performance or results, or the range
of acceptable performance or results; and (3) performance measures are methods or
instruments to estimate or monitor the extent to which the actions of a health care
practitioner or provider conform to practice guidelines, medical review criteria,
or standards of quality.
5. Vicissitudes of therapeutic decisionmaking
Medical decisionmaking often involves complexity, uncertainty, and tradeoffs178
because of unique genetic factors, lifestyle habits, known conditions, medication
histories, and ambiguity about possible diagnoses, test results, treatment benefits,
173. Arnold J. Rosoff, Evidence-Based Medicine and the Law: The Courts Confront Clinical Practice
Guidelines, 26 J. Health Pol., Pol’y & L. 327–68 (2001).
174. One element in the near demise of the Agency for Health Care Policy and Research was
a political audience receptive to complaints from an association of back surgeons who disagreed with
the AHCPR practice guideline conclusions regarding low back pain. B.H. Gray et al., AHCPR and
the Changing Politics of Health Services Research, Health Affairs, Suppl. Web Exclusives W3-283-307
(June 2003).
175. John D. Kraemer & Lawrence O. Gostin, Science, Politics, and Values: The Politicization of
Professional Practice Guidelines, 301 JAMA 665–67 (2009).
176. Id. at 666.
177. Id. at 666.
178. John P.A. Ioannidis & Joseph Lau, Systematic Review of Medical Evidence, 12 J.L. & Pol’y
509–35 (2004).
and therapeutic harms. Physicians thus must often make treatment decisions in the face of inherent diagnostic and therapeutic uncertainty.
Donald Schön argued that regardless of the professional field, “An artful
practice of the unique case appears anomalous when professional competence is
modeled in terms of application of established techniques to recurrent events”
and that specialization “fosters selective inattention to practical competence and
professional artistry.”179 In the case of a patient with peanut allergies and heart
disease, allergy guidelines recommend avoiding beta blockers, but heart disease
guidelines recommend beta blockers because they have been shown to prolong
life in patients with heart disease. An allergist would recommend against taking a
beta blocker, yet a cardiologist would recommend taking it.180
Well-performed randomized trials provide the least biased estimates of treatment benefit and harm by creating groups with equivalent prognoses. Sticking
strictly to the scientific evidence, some physicians may limit their use of medications to the specific drug at the specific doses found to be beneficial in such trials.
Others may assume class effects until proven otherwise. Still others may consider
additional factors such as out-of-pocket costs for patients or patient preferences.
When physicians evaluate patients who might benefit from a treatment but who
would have been excluded from the study in which the benefit was demonstrated,
they must weigh the risks and benefits in the absence of definitive evidence of
benefit or of harm. Indeed, because few medical recommendations are based on randomized trials (the least biased level of evidence), physicians frequently and necessarily face uncertainty and tradeoffs in making testing and treatment decisions: Very few treatments come without some risk, and in many disciplines, clear evidence of the efficacy and risks of treatment is lacking. In cardiology (one of the better studied areas of medical care), nearly one-half of guideline recommendations are based on expert opinion, case studies, or standards of care.181
Applying well-designed studies to populations of patients represents
another problem. The Randomized Aldactone Evaluation Study demonstrated
that spironolactone reduced mortality and hospitalizations for heart failure and
improved quality of life with minimal risk of seriously high levels of potassium
(hyperkalemia).182 After the study was published in a prominent medical journal, prescriptions for spironolactone rose quickly because of physicians' familiarity with the medication and the poor prognosis of patients with heart failure. In contrast to the study population, however, individuals treated in the community were older, more frequently women, often
179. Donald A. Schön, The Reflective Practitioner: How Professionals Think in Action, at
vii (1983).
180. John A. TenBrook et al., Should Beta-Blockers Be Given to Patients with Heart Disease and
Peanut-Induced Anaphylaxis? A Decision Analysis, 113 J. Allergy & Clin. Immunol. 977–82 (2004).
181. Pierluigi Tricoci et al., Scientific Evidence Underlying the ACC/AHA Clinical Practice
Guidelines, 301 JAMA 831–41 (2009).
182. Bertram Pitt et al., The Effect of Spironolactone on Morbidity and Mortality in Patients with Severe
Heart Failure. Randomized Aldactone Evaluation Study Investigators, 341 New Eng. J. Med. 709–17 (1999).
had absolute or relative contraindications to treatment, and had not had tests
of their heart function to establish the indication to treat or of their potassium
level and kidney function to determine their risk for high potassium levels from
treatment.183 These factors increased the risk that spironolactone therapy in these
patients might lead to high potassium levels that could be life-threatening. Indeed,
hospitalizations per 1000 patients for high potassium rose from 2.4 in 1994 to
11.0 in 2001, resulting in an estimated 560 additional hospitalizations for high
potassium and 73 additional hospital deaths in older patients with heart failure
in Ontario.184 Criteria for entry into randomized trials of drugs typically exclude
individuals with concomitant medication use, medical comorbidities, and female
gender, and they may limit participation by socioeconomic status or race and ethnicity, thereby limiting the ability to generalize the results of a trial to the clinical
population being treated.185 Physicians refer to randomized controlled studies as assessments of drug “efficacy” in restricted patient populations, whereas studies of treatment in general clinical populations are often referred to as “effectiveness” studies.
To be sufficiently powered to demonstrate statistical significance,186 randomized controlled trials usually require high event rates, prolonged followup, or
large numbers of patients. Because of impracticality, expense, and the time period
needed to obtain long-term outcomes, these trials may often choose a surrogate
marker that is associated with a clinically important event or with survival. For
example, statins were approved on the basis of their safety and efficacy in lowering cholesterol but were only demonstrated to improve survival in patients with
known coronary heart disease years later.187 Fast-track approval of new drugs for
HIV infection was based on safety and efficacy in reducing viral levels (as a surrogate or substitute outcome measure felt to be related to survival) as opposed to
demonstration of improved survival.
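The sample-size pressure described above can be illustrated with a standard power calculation for comparing two event rates. The sketch below uses the usual normal-approximation formula; the specific event rates are hypothetical and are not drawn from any trial discussed in this guide.

```python
from statistics import NormalDist

def sample_size_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate patients needed per arm to detect a difference
    between two event rates (two-sided test, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

# Halving a rare event rate (4% -> 2%) demands far more patients
# than halving a common one (40% -> 20%).
print(round(sample_size_per_arm(0.04, 0.02)))  # about 1138 per arm
print(round(sample_size_per_arm(0.40, 0.20)))  # about 78 per arm
```

The contrast suggests why trials of uncommon or slow-to-accrue outcomes tend to rely on large enrollments, prolonged followup, or surrogate markers.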
On the other hand, in the late 1970s, patients with frequent extra heartbeats
(ventricular premature contractions) following a heart attack had an increased
risk for sudden death. On that basis, those in the then-emerging field of cardiac
electrophysiology believed that reducing ventricular premature beats (as a surrogate outcome measure) would decrease subsequent sudden cardiac death. In
early randomized controlled trials, oral antiarrhythmic drugs such as encainide
and flecainide were approved by FDA on the basis of their ability to suppress
these extra heartbeats in patients who had had a myocardial infarction. Years after
183. Dennis T. Ko et al., Appropriateness of Spironolactone Prescribing in Heart Failure Patients: A
Population-Based Study, 12 J. Cardiac Failure 205–10 (2006).
184. David N. Juurlink et al., Rates of Hyperkalemia After Publication of the Randomized Aldactone
Evaluation Study, 351 New Eng. J. Med. 543–51 (2004).
185. Harriette G.C. Van Spall et al., Eligibility Criteria of Randomized Controlled Trials Published
in High-Impact General Medical Journals: A Systematic Sampling Review, 297 JAMA 1233–40 (2007).
186. See Michael D. Green et al., Reference Guide on Epidemiology, in this manual.
187. Randomised Trial of Cholesterol Lowering in 4444 Patients with Coronary Heart Disease: The
Scandinavian Simvastatin Survival Study (4S), 344 Lancet 1383–89 (1994).
approval of these drugs, however, a randomized controlled trial designed to demonstrate a survival benefit was discontinued after only 10 months because of a statistically significant increase in mortality among patients receiving the
drugs. Although these drugs effectively suppressed the extra heartbeats, the study
found that they also increased the likelihood of fatal heart rhythm disturbances.188
Prior to approval by FDA, drugs and devices must undergo Phase 1, 2, and 3
clinical trials to demonstrate safety and efficacy. Following preliminary chemical
discovery, toxicology, and animal studies, Phase 1 studies examine the safety of
new drugs in healthy individuals. Phase 2 studies involve varying drug doses in
individuals with the disease to explore efficacy and responses and adverse effects.
Based on the dose or doses identified in Phase 2, a Phase 3 study examines drug
response in a larger number of patients to again determine safety and efficacy
in the hope of getting a new drug approved for sale by regulatory authorities.
However, because fewer than 10,000 individuals have usually received the drug
during all of these trials, uncommon adverse outcomes may not become apparent until usage is broadened and extended. For example, depending on dosage, between 1 in 24,200 and 1 in 40,500 patients who received the antibiotic
chloramphenicol189 developed fatal aplastic anemia (in which the bone marrow
no longer produces any blood cells). This adverse effect was discovered only in
the 1960s after chloramphenicol was initially considered safe and had been widely
used during the 1950s.190
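The arithmetic behind this limitation of premarket trials can be sketched directly. The 1-in-30,000 risk below is a hypothetical figure in the range quoted for chloramphenicol, and the calculation assumes independent exposures:

```python
def chance_of_observing(risk_per_patient, n_patients):
    """Probability that at least one case of a rare adverse event
    occurs among n independently exposed patients."""
    return 1 - (1 - risk_per_patient) ** n_patients

# A hypothetical 1-in-30,000 risk spread over 10,000 trial patients:
p = chance_of_observing(1 / 30_000, 10_000)
print(f"{p:.0%}")  # about a 28% chance of seeing even a single case
```

In other words, trials of premarket size would more often than not observe no case at all, which is why such risks tend to surface only after widespread use.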
For all approved drug and therapeutic biological products, FDA has managed
postmarketing safety surveillance since 1969 through the Adverse Event Reporting System. Health care professionals, including physicians, pharmacists, nurses,
and others, and consumers, including patients, family members, lawyers, and
others, are expected to report adverse events and medication errors. It is a voluntary system with the following limitations: (1) uncertainty that the drug caused
the reported event, (2) no requirement for proof of a causal relationship between
product and event, (3) insufficient detail to evaluate events, (4) incomplete reporting of all adverse events, and (5) inability to determine the incidence of an adverse event because the actual number of patients receiving a product and the duration
of use of those products are unknown.
In 1999, rofecoxib (Vioxx), a Cox-2 selective nonsteroidal anti-inflammatory
drug, was approved for pain relief in part on the basis of studies that suggested that
it induced less gastrointestinal bleeding than other nonsteroidal anti-inflammatory
drugs. In 2004, the manufacturer announced a voluntary worldwide withdrawal
188. Preliminary Report: Effect of Encainide and Flecainide on Mortality in a Randomized Trial of
Arrhythmia Suppression After Myocardial Infarction: The Cardiac Arrhythmia Suppression Trial (CAST)
Investigators, 321 New Eng. J. Med. 406–12 (1989).
189. Two pre-Daubert cases from the Fifth Circuit dealt with product liability suits against the
manufacturer: Christophersen v. Allied-Signal Corp., 939 F.2d 1106 (5th Cir. 1991); Osburn v. Anchor
Labs., 825 F.2d 908 (5th Cir. 1987). Clarke, supra note 152, at 515.
190. Clarke, supra note 152, at 815.
of rofecoxib when a prospective study confirmed that the drug increased the risk
of myocardial infarctions (heart attacks) and stroke with chronic use.191
This section demonstrates some of the issues that physicians grapple with in
treatment decisions. Some generally avoid using new drugs until sufficient experience with the medication provides an opportunity for unknown adverse effects
to emerge following drug approval. Others may be quick to adopt new drugs,
especially drugs perceived to have improved safety or efficacy such as through a
novel mechanism of action. By withholding use of new drugs, more conservative
physicians may avoid the occurrence of unforeseen adverse consequences, but they
may also delay the use of new drugs that may benefit their patients. The converse
may occur, of course, with physicians who are early adopters of new drugs, tests,
or technologies.
Even in a randomized trial in which a drug is found to be beneficial, some
patients who received the drug may have been harmed, emphasizing the need
to individualize the balancing of risks and benefits and explaining in part why
some physicians may not adhere to guideline recommendations. The fundamental
dilemma articulated by Bernard in 1865 still haunts the clinician: The response of
the “average” patient to therapy is not necessarily the response of the patient being
treated.192 Indeed, the average results of clinical trials do not apply to all patients
in the trial. Even with well-defined inclusion and exclusion criteria, variation in
outcome risk and, therefore, treatment benefit exists so that even “typical” patients
included in the trial may not be likely to get the average benefits.
The Global Utilization of Streptokinase and tPA for Occluded Coronary Arteries Trial is a case in point. The trial suggested that accelerated tissue plasminogen activator (tPA) reduced mortality from acute myocardial infarction,
with the tradeoff being an increased risk of bleeding from tPA.193 In a reanalysis of
this study, most (85%) of the survival benefit of tPA accrued to half of the patients
(those at highest risk of dying from their heart attack). Some patients with very
low risk of dying from their heart attack who received tPA likely were harmed
because their risk of intracranial hemorrhage exceeded the benefit.194 In practice
then, even in a randomized controlled trial demonstrating a survival benefit on average, those benefits may not accrue to every patient in the trial who received treatment. Therefore, to optimize care, physicians attempt to individualize treatment decisions based on their assessment of the patient’s risk versus
benefit. Even then, physicians may be reluctant to administer a medication such
191. See generally In re Vioxx Prods. Liab. Litig., 360 F. Supp. 2d 1352 (J.P.M.L. 2005).
192. Salim Yusuf et al., Analysis and Interpretation of Treatment Effects in Subgroups of Patients in
Randomized Clinical Trials, 266 JAMA 93–98 (1991) (hereinafter “Yusuf”).
193. An International Randomized Trial Comparing Four Thrombolytic Strategies for Acute Myocardial
Infarction. The GUSTO Investigators, 329 New Eng. J. Med. 673–82 (1993).
194. David M Kent et al., An Independently Derived and Validated Predictive Model for Selecting
Patients with Myocardial Infarction Who Are Likely to Benefit from Tissue Plasminogen Activator Compared
with Streptokinase, 113 Am. J. Med. 104–11 (2002).
as tPA that can cause severe harm such as an intracranial hemorrhage. A single
clinical experience with a patient who bled when given tPA might well color their
judgment about the benefits of the treatment.
A fundamental principle of evidence-based medicine is that “Evidence alone
is never sufficient to make a clinical decision.”195 Nearly all medical decisions
involve some tradeoff between a benefit and a risk. Besides the options and the
likelihood of the outcomes, patient preferences about the resulting outcomes
should affect care choices, especially when there are tradeoffs such as a risk of
complications or dying from a procedure or treatment versus some benefit such as
living longer (provided the patient survives the short-term risk of the procedure)
or improving their quality of life (relieving symptoms). Besides individualizing
risk and benefit assessments, physicians may also deviate from guideline recommendations (“warranted variation”) because of a particular patient’s higher risk
of adverse events or lower likelihood of benefit or because of patient preferences
for the alternative outcomes, such as when risks occur at different times. For
example, given a hypothetical choice between living 25 years for certain or a
50:50 chance of living 50 years or dying immediately, most individuals choose
the 25 years for certain. Although both options yield, on average, 25 years, most
individuals are risk averse and prefer to avoid the near-term risk of dying. When
interviewed, some patients with “operable” lung cancer were quite averse to
possible immediate death from surgery, and so, based on their preferences, these
patients probably would opt for radiation therapy despite its poorer long-term
survival.196
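The certainty-versus-gamble choice above is arithmetically a tie; a minimal sketch makes the point that the two options differ only in risk, not in expected years:

```python
def expected_years(outcomes):
    """Expected value of a list of (probability, years) pairs."""
    return sum(p * years for p, years in outcomes)

certain = [(1.0, 25)]            # 25 years for certain
gamble = [(0.5, 50), (0.5, 0)]   # 50:50 chance of 50 years or none

print(expected_years(certain))  # 25.0
print(expected_years(gamble))   # 25.0 -- identical on average; most
# patients nevertheless prefer the certain option (risk aversion)
```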
Besides risk aversion, some treatments may improve quality of life but place
patients at risk for shortened life expectancy, and some patients may be willing to
trade off quality of life for length of life. When presented with laryngeal cancer
scenarios, some volunteer research subjects chose radiation therapy over surgery to
preserve their voices despite a reduced likelihood of future survival. “These results
suggest that treatment choices should be made on the basis of patients’ attitudes
toward the quality as well as the quantity of survival.”197
To illustrate this principle, a National Institutes of Health Consensus Conference recommended breast-conserving surgery when possible for women with
Stage I and II breast cancer198 because well-designed studies with long-term
followup on thousands of women demonstrated equivalence of lumpectomy
and radiation therapy or mastectomy for survival and disease-free survival (being
alive without breast cancer recurrence). In one study, lumpectomy and radiation
appeared to have a lower risk of breast cancer recurrence with 5 women reported
to have had breast cancer recurrences following lumpectomy and radiation versus
195. Guyatt, supra note 150, at 8; see also supra Section IV.C.3.
196. McNeil, supra note 63, at 986.
197. Id. at 982.
198. NIH Consensus Conference: Treatment of Early-Stage Breast Cancer, 265 JAMA 391–95 (1991).
10 women after mastectomy.199 However, breast cancer that recurred in the breast
that had been operated on was censored (i.e., deliberately not considered in the
statistical analysis).200 When including these censored cancer recurrences, 20 breast
cancer recurrences occurred after lumpectomy versus 10 after mastectomy, and
so lumpectomy actually had a higher overall risk of recurrence.201 As expressed
by one woman, “The decision about treatment for breast cancer remains an
intensely personal one. The mastectomy I chose . . . felt a lot less invasive than
the prospect of six weeks of daily radiation, not to mention the 14% risk of local
recurrence.”202 In such a case, patient preferences203 regarding tradeoffs involving breast preservation and increased risk of breast cancer recurrence or the need
for radiation therapy associated with lumpectomy may play an important role in
determining the optimal decision for any particular patient.204
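The effect of censoring in the study above reduces to simple counting; the split of the 20 lumpectomy recurrences below follows the figures given in the text.

```python
# Counts as reported in the text: 5 recurrences were included in the
# lumpectomy-plus-radiation analysis, 15 in-breast recurrences were
# censored (excluded), and the mastectomy arm had 10 recurrences.
lumpectomy_counted = 5
lumpectomy_censored = 15
mastectomy = 10

print("As analyzed:", lumpectomy_counted, "vs", mastectomy)  # 5 vs 10
print("All recurrences:", lumpectomy_counted + lumpectomy_censored,
      "vs", mastectomy)                                      # 20 vs 10
```

The same raw data thus support opposite-sounding comparisons depending on which recurrences the analysis counts.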
D. Informed Consent
1. Principles and standards
Medical informed consent is an ethical, moral, and legal responsibility of physicians.205 It is guided by four ethical principles: autonomy, beneficence, nonmaleficence, and justice.206 Autonomy refers to informed, rational decisionmaking
after unbiased and thoughtful deliberation. Beneficence represents the moral
obligation of physicians to act for the benefit of patients.207 These two principles
place physicians in conflict because they wish to provide the care they believe
is best for the patient, but because that care usually involves some risk or cost,
physicians also recognize that patient preferences may affect their recommendation. In a study examining the incidence of erectile dysfunction with use of a beta
blocker medication known to be beneficial, heart disease patients were (1) blinded
199. Joan A. Jacobson et al., Ten-Year Results of a Comparison of Conservation with Mastectomy in the
Treatment of Stage I and II Breast Cancer, 332 New Eng. J. Med. 907–11 (1995) (hereinafter “Jacobson”).
200. Bernard Fisher et al., Eight-Year Results of a Randomized Clinical Trial Comparing Total
Mastectomy and Lumpectomy With or Without Irradiation in the Treatment of Breast Cancer, 320 New Eng.
J. Med. 822–28 (1989); Jacobson, supra note 199, at 998.
201. Jacobson, supra note 199, at 999.
202. Karen Sepucha et al., Policy Support for Patient-Centered Care: The Need For Measurable
Improvements In Decision Quality, Health Affairs Supp. Web Exclusives VAR 54, VAR 62 (2004).
203. Proctor & Gamble Pharm., Inc. v. Hoffman-LaRoche, Inc., 2006 WL 2588002, at *10
(S.D.N.Y. 2006) (detailing the testimony of a physician stating that, in addition to efficacy, he considers
patient preferences when determining treatment for osteoporosis).
204. Jerome P. Kassirer, Adding Insult to Injury. Usurping Patients’ Prerogatives, 308 New Eng. J.
Med. 898–901 (1983) (hereinafter “1983 Kassirer”).
205. Timothy J. Paterick et al., Medical Informed Consent: General Considerations for Physicians, 83
Mayo Clinic Proc. 313–19 (2008) (hereinafter “Paterick”).
206. Jaime S. King & Benjamin W. Moulton, Rethinking Informed Consent: The Case for Shared
Decision Making, 32 Am. J.L. & Med. 429–501 (2006) (hereinafter “King & Moulton”).
207. Id. at 435.
to the drug, (2) informed of the drug name only, or (3) informed about its erectile dysfunction adverse effect. Among those blinded, 3.1% developed erectile
dysfunction compared with 15.6% of those given the drug name and 31.2% of
those informed about adverse effects, showing that being informed increased the
risk for adverse effects and might deprive patients of benefit from a drug because
they stop taking it.208 Physicians must balance the desire to provide beneficial care
with the obligation to promote autonomous decisions by informing patients of
potential adverse effects or tradeoffs.
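The pattern in the beta-blocker study can be expressed as risk ratios relative to the blinded group, using the rates quoted above:

```python
# Rates of reported erectile dysfunction from the study quoted above.
blinded, name_only, fully_informed = 0.031, 0.156, 0.312

# Risk relative to the blinded group:
print(round(name_only / blinded, 1))       # about a 5-fold increase
print(round(fully_informed / blinded, 1))  # about a 10-fold increase
```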
State jurisdictions differ in their standards for disclosure, with half adopting
the physician or professional standard (the information that other local physicians
with similar skill levels would provide) and the other half adopting the patient
or materiality standard (the information that a reasonable patient would deem
important in decisionmaking).209 The informed consent process involves the disclosure of alternative treatment options including no treatment and the risks and
benefits associated with each alternative. Discussion should include severe risks
and frequent risks, but the courts have not provided explicit guidance about what
constitutes sufficient severity or frequency. Patients should be considered by the
court to be competent and should have the capacity to make decisions (understanding choices, risks, and benefits). The decision should be voluntary—of free
mind and free will, without coercion or manipulation. The language used should
be understandable to the patient, and treatment should not proceed unless the
physician believes the patient understands the options and their risks and benefits.
Patients may withdraw consent or refuse treatment. Such an action should
engender additional discussion, and documentation may include the completion of
a withdrawal-of-consent form. In certain situations, exceptions to medical consent
may arise in emergencies, when the treatment is recognized by prudent physicians
to involve no material risk to patients and when the procedure is unanticipated
and not known to be necessary at the time of consent.210
The Merenstein case described an unpublished trial in which, during his
residency, Dr. Merenstein examined a highly educated man. The examination
included a discussion of the relevant risks and benefits regarding prostate cancer
screening using the prostate-specific antigen (PSA) test based on recommendations from the U.S. Preventive Services Task Force, the American College of
Physicians–American Society of Internal Medicine, the American Medical Association, the American Urological Association, the American Cancer Society,
and the American Academy of Family Physicians. Dr. Merenstein testified that
the patient declined the test because of the high false-positive rate, the risk of
treatment-related adverse effects, and the low risk of dying from prostate cancer.
208. Antonello Silvestri et al., Report of Erectile Dysfunction After Therapy with Beta-Blockers Is
Related to Patient Knowledge of Side Effects and Is Reversed by Placebo, 24 Eur. Heart J. 1928, 1928 (2003).
209. King & Moulton, supra note 206, at 430.
210. Paterick, supra note 205, at 315.
Another physician seeing the same patient subsequently ordered a PSA without
any patient discussion. The PSA was high and the patient was diagnosed with
incurable advanced prostate cancer. The plaintiff’s attorney argued that despite
the guidelines above, the standard of care in Virginia was to order the blood test
without discussion, based on the testimony of four physician witnesses. The jury found in favor of
the plaintiff.211
To illustrate the importance of patient preferences, a woman with breast
cancer described her experience: “But as the surgeon diagramed incision points on
my chest with a felt-tip pen, my husband asked a question: Is it really necessary to
transfer this back muscle? The doctor’s answer shocked us. No, he said, he could
simply operate on my chest. That would cut surgery and recovery time in half.
He had planned the more complicated procedure because he thought it would
have the best cosmetic result. ‘I assumed that’s what you wanted.’”212 Instead, the
woman preferred the less invasive approach that shortened her recovery time.
In the research setting, a randomized trial with and without informed consent demonstrated that the process of obtaining informed consent altered the effect of a
placebo when given to patients with insomnia. The first patient of each pair was
randomized to no informed consent and the second to informed consent. Out
of 56 patients randomized to informed consent, 26 declined to participate in the
study (the patients without informed consent had no choice and were unaware of
their participation in a study). The informed consent process created a “biased” group because the age and gender of those who declined participation differed significantly from those of patients who agreed to be included in the study. The hypnotic
activity of placebo was significantly higher without informed consent, and adverse
events were found more commonly in the group receiving informed consent. The
study suggests that the process of obtaining informed consent introduced biases in
the patient population and affected the efficacy and adverse effects observed in this
clinical trial, thereby potentially affecting the general applicability of any findings
involving informed consent.213
Besides physicians, patients may get health information from the Internet,
family, friends, and the media (newspapers, magazines, television). Among Internet
users, 80% had searched for information on at least 1 of 15 major health topics,
but use varied from 62% to 89% by age, gender, education, or race/ethnicity.214
Conducted between November 2006 and May 2007, a cross-sectional national
survey of U.S. adults who had made a medical decision found that Internet use
211. King & Moulton, supra note 206, at 432–34; Daniel Merenstein, A Piece of My Mind:
Winners and Losers, 291 JAMA 15–16 (2004).
212. Julie Halpert, Health: What Do Patients Want? Newsweek, Apr. 28, 2003, at 63–64.
213. R. Dahan et al., Does Informed Consent Influence Therapeutic Outcome? A Clinical Trial of the
Hypnotic Activity of Placebo in Patients Admitted to Hospital, 293 Brit. Med. J. Clin. Res. Ed. 363–64
(1986).
214. Pew Internet, Health Topics, http://pewinternet.org/Reports/2011/HealthTopics.aspx
(last visited Feb. 12, 2011).
averaged 28% but varied from 17% for breast cancer screening to 48% for hip/
knee replacement among those 40 years of age and older.215 However, even
among Internet users, health care providers were viewed as the most influential
source of information for medical decisions, followed by the Internet, family and
friends, and then media.
2. Risk communication
Multiple health outcomes may result from alternative treatment choices, and
how patients feel about the relative importance of those outcomes varies.216
When patients with recently diagnosed curable prostate cancer were presented
with 93 possible questions that might be important to patients like themselves, 91
of the questions were cited as relevant to at least one patient.217 Communication
skills should include patient problem assessment (appropriate questioning techniques, seeking patient’s beliefs, checking patient’s understanding of the problem);
patient education and counseling (eliciting patient’s perspective, providing clear
instructions and explanations, assessing understanding); negotiation and shared
decisionmaking (surveying problems and delineating options, arriving at mutually
acceptable solutions); relationship development and maintenance (encouraging
patient expression, communicating a supportive attitude, explaining any jargon,
and using nonverbal behavior to enhance communication).218
Certain forms of risk communication, however, may be confusing and should
be avoided: “single event probabilities, conditional probabilities (such as sensitivity
and specificity), and relative risks.”219 An example of a single-event probability
would be the statement that a particular medication results in a 30% to 50% chance
of developing erectile dysfunction.220 Although physicians are referring to the proportion of patients affected, a patient may misinterpret the statement as referring to his or her own sexual encounters, that is, as predicting erectile dysfunction in 30% to 50% of those encounters.
The preferred natural frequency statement would be “out of 100 people like
you taking this medication, 30 to 50 of them experience erectile dysfunction.”
The natural frequency statement specifies a reference class, thereby reducing
misunderstanding.221
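The reframing described above can be sketched in a few lines of code. This is a hypothetical helper, not part of the manual; only the 30% to 50% figures and the wording of the natural frequency statement come from the example in the text:

```python
def natural_frequency(p_low, p_high, outcome, reference_class, n=100):
    """Restate a single-event probability range as a natural frequency
    with an explicit reference class."""
    low, high = round(p_low * n), round(p_high * n)
    return (f"Out of {n} {reference_class}, {low} to {high} of them "
            f"experience {outcome}.")

statement = natural_frequency(
    0.30, 0.50, "erectile dysfunction",
    "people like you taking this medication")
print(statement)
# Out of 100 people like you taking this medication, 30 to 50 of them
# experience erectile dysfunction.
```

The explicit reference class ("100 people like you taking this medication") is what keeps the listener from mistaking the denominator for his or her own sexual encounters.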
215. Mick P. Couper et al., Use of the Internet and Ratings of Information Sources for Medical
Decisions: Results from the DECISIONS Survey, 30 Med. Decision Making 106S–14S (2010).
216. Kassirer (1983), supra note 203, at 889.
217. Deb Feldman-Stewart et al., What Questions Do Patients with Curable Prostate Cancer Want
Answered? 20 Med. Decision Making 7–19 (2000).
218. Michael J. Yedidia et al., Effect of Communications Training on Medical Student Performance,
290 JAMA 1157–65 (2003).
219. Gerd Gigerenzer & Adrian Edwards, Simple Tools for Understanding Risks: From Innumeracy
to Insight, 327 BMJ 741–44 (2003).
220. Gigerenzer, supra note 87, at 4.
221. Id. at 4; see also Section IV.A.2.
Regarding relative risk, consider a statement that taking a cholesterol-lowering
medication reduces the risk of dying by 22%.222 This may be misinterpreted as saying that out of 1000 patients with high cholesterol, 220 of them can avoid dying
by taking cholesterol-lowering medications. The actual data show that 32 deaths
occur among 1000 patients taking the medication, and 41 deaths occur among
1000 patients taking the placebo. The relative risk reduction equals 9 divided by
41. A preferred way to express the benefit would be the absolute risk reduction
(the difference between 41 and 32 deaths in 1000 patients), or to say that in 1000
people like you with high cholesterol, taking a cholesterol medication for 5 years
helps 9 of them avoid dying.223 Expressed as an odds ratio, the cholesterol-lowering medication reduces the odds of dying by 23%; notice that neither the relative risk
nor the odds ratio characterizes the number of events without treatment and that
the odds ratio always magnifies the risk or benefit when compared with the relative risk. To illustrate further, a relative risk reduction of 20% has very different
absolute risk reductions depending on the number of events without treatment. If
20 of 100 patients without treatment would die, then the absolute risk reduction
is 4 of 100, or 4% (20% of 20), but if 20 of 100,000 patients without treatment
would die, then the absolute reduction is 4 of 100,000 or 0.004%. The number
needed to treat is an additional form of risk communication popularized as part of
evidence-based medicine to account for the risk without treatment. It is the reciprocal of the absolute risk difference or 1 divided by the quantity 9 lives saved per
1000 (1 ÷ (9/1000)) treated with cholesterol medications in the above example.
Therefore 111 patients need to be treated with a cholesterol medication for 5 years
to save one of them, or in the illustrative example, with a relative risk reduction of
20%, either 25 or 25,000 would need to be treated to save 1 patient.
In the analysis of mammography for the U.S. Preventive Services Task Force,
the number needed to be invited (NNI) for screening to avoid one breast cancer
death was 1904 for 39- to 49-year-olds, 1339 for 50- to 59-year-olds, and 377
for 60- to 69-year-olds.224 To account for possible harm, there is a corresponding determination of the number needed to harm (NNH) that is calculated in
the same manner. Considering breast biopsy as a morbidity, 5 women need to
undergo breast biopsy for every one woman diagnosed with breast cancer for
39- to 49-year-olds, and the corresponding numbers are 3 for women ages 50 to
59 and 2 for women ages 60 to 69 years old.225 Estimates of overdiagnosis ranged
mostly from 1% to 10%, and so, out of 100 women diagnosed with breast cancer
from screening, 1 to 10 of them undergo treatment for a cancer that would never
have caused any mortality.226 Clearly no one can tell if any particular woman has
222. Gigerenzer, supra note 87, at 34.
223. Id. at 34–35.
224. Heidi D. Nelson et al., Screening for Breast Cancer: An Update for the U.S. Preventive Services
Task Force, 151 Annals Internal Med. 727–37 (2009).
225. Id. at 732.
226. Id. at 731–32.
been overdiagnosed because this is unobservable.227 Estimating the extent of overdiagnosis requires comparing mortality reductions in a screened population with those in an unscreened population over a long period; the difference between the two groups provides an estimate of the extent of overdiagnosis.
To summarize the evidence, “Mammography does save lives, more effectively
among older women, but does cause some harm. Do the benefits justify the risks?
The misplaced propaganda battle seems to now rest on the ratio of the risks of saving a life compared with the risk of overdiagnosis, two very low percentages that
are imprecisely estimated and depend on age and length of followup.”228 In the
USPSTF recommendations for mammography in 40- to 49-year-olds, the focus
has been on the first part of its statement: “The USPSTF recommends against
routine screening mammography in women aged 40 to 49 years.” Although
screening has demonstrated benefits, in their view, the benefits of screening do not
sufficiently and clearly outweigh the potential harms to make a recommendation
that all women 40 to 49 years old have routine screening mammography from
a public health or population perspective. Often neglected, however, is the immediately subsequent sentence, in which the USPSTF recognizes that individual preferences should
affect the care that patients receive: “The decision to start regular, biennial screening mammography before the age of 50 years should be an individual one and
take patient context into account, including the patient’s values regarding specific
benefits and harms.”229 The recommendation recognizes that depending on their
experiences, values, and preferences, some women may seek the benefit in reducing breast cancer deaths and others may prefer to avoid possible morbidity (breast
biopsy and worry) and potential overdiagnosis and overtreatment.
3. Shared Decisionmaking
The “professional values of competence, expertise, empathy, honesty, and commitment are all relevant to communicating risk: Getting the facts right and conveying
them in an understandable way are not enough.”230 Shared and informed decisionmaking has emerged as one part of patient care. It distinguishes “problem solving”
that identifies one “right” course that leaves little room for patient involvement
from “decisionmaking” in which several courses of action may be reasonable and
in which patient involvement should determine the optimal choice. In such cases,
health care choices depend not only on the likelihood of alternative outcomes
resulting from each strategy but also on the patient preferences for possible outcomes and their attitudes about risk taking to improve future survival or quality
227. Klim McPherson, Screening for Breast Cancer—Balancing the Debate, 341 BMJ 234–35 (2010).
228. Id. at 234.
229. U.S. Preventive Services Task Force, Screening for Breast Cancer: U.S. Preventive Services Task
Force Recommendation Statement, 151 Annals Internal Med. 716, 716 (2009).
230. Adrian Edwards, Communicating Risks, 327 BMJ 691–92 (2003).
of life and the timing of that risk, whether it occurs now or in the future.231
Informed decisionmaking occurs
when an individual understands the nature of the disease or condition being
addressed; understands the clinical service and its likely consequences, including
risks, limitations, benefits, alternatives, and uncertainties; has considered his or
her preferences as appropriate; has participated in decision making at a personally
desirable level; and either makes a decision consistent with his or her preferences
and values or elects to defer a decision to a later time.232
Shared decisionmaking occurs “when a patient and his or her healthcare
provider(s), in the clinical setting, both express preferences and participate in
making treatment decisions.”233
To assist with shared decisionmaking, health decision aids have been developed to help patients and their physicians choose among reasonable clinical
options together by describing the “benefits, harms, probabilities, and scientific
uncertainties.”234 In 2007, the legislature in the state of Washington became the
first to establish and recognize in law a role for shared decisionmaking in informed
consent.235 The bill also encourages the development, certification, use, and
evaluation of decision aids. The consent form provides written documentation
that the consent process occurred, but the crux of the medical consent process
is the discussion that occurs between a physician and a patient. The physician
shares his or her medical knowledge and expertise and the patient shares his or
her values (health goals) and preferences. It is an opportunity to strengthen the
patient–physician relationship through shared decisionmaking, respect, and trust.
V. Summary and Future Directions
With the human genome sequenced, medical research is poised for exponential growth as the code for human biology (genomics) is translated into proteins
(proteomics) and chemicals (metabolomics) to identify molecular pathways that
lead to disease or that promote health. With advances in medical technologies in
diagnosis and preventive and symptomatic treatment, the practice of medicine will
be profoundly altered and redefined. For example, consider lymphoma, a blood
cancer that used to be classified simply by appearance under the microscope as
231. Michael J. Barry, Health Decision Aids to Facilitate Shared Decision Making in Office Practice,
136 Annals Internal Med. 127–35 (2002).
232. Peter Briss et al., Promoting Informed Decisions About Cancer Screening in Communities and
Healthcare Systems, 26 Am. J. Preventive Med. 67, 68 (2004).
233. Id. at 68.
234. Annette M. O’Connor et al., Risk Communication in Practice: The Contribution of Decision
Aids, 327 BMJ 736, 736 (2003).
235. Bridget M. Kuehn, States Explore Shared Decision Making, 301 JAMA 2539–41 (2009).
either Hodgkin’s or non-Hodgkin’s lymphoma. As science has evolved, it is now
further classified by cellular markers that identify the underlying cancer cells as
one of two cells that help with immunity (protecting the body from infection and
cancer): T cells or B cells. Current research is attempting to characterize those cells
further by identifying underlying genetic and cellular markers and pathways that
may distinguish these lymphomas and provide potential therapeutic targets. The
growth in the research enterprise, both basic science and clinical translational (the
translation of bench research to the bedside or basic science research into novel
treatments or diagnostics), has greatly expanded research capacity to generate
scientific research of all types.
With greatly expanded knowledge, research, and specialization, judgments
about admissibility and about what constitutes expertise become increasingly
difficult and complex. The sifting of this research into sufficiently substantiated,
competent, and reliable evidence, however, relies on the traditional scientific
foundation: first, biological plausibility and prior evidence; and second, consistent, repeated findings. The practice of medicine at its core will continue to be a
physician and patient interaction with professional judgment and communication
central elements of the relationship. Judgment is essential because of uncertainties
in the underlying professional knowledge or because even if the evidence is credible and substantiated, there may be tradeoffs in risks and benefits for testing and
for treatment. Communication is critical because most decisions involve tradeoffs, in which case individual patient preferences for the possible outcomes, which may be unique to each patient and may affect decisionmaking, should be considered.
In summary, terms shared by the legal and medical professions often carry differing meanings, for example, differential diagnosis, differential
etiology, and general and specific causation. The basic concepts of diagnostic
reasoning and clinical decisionmaking and the types of evidence used to make
judgments as treating physicians or experts involve the same overarching theoretical issues: (1) alternative reasoning processes; (2) weighing risks, benefits, and
evidence; and (3) communicating those risks.
Glossary of Terms
adequacy. In diagnostic verification, testing a particular diagnosis for its adequacy
involves determining its ability to account for all normal and abnormal findings and the observed time course of the disease.
attending physician. The physician responsible for the patient’s care at the hospital in which the patient is being treated.
Bayes’ theorem (rule). A mathematical approach to integrating suspicion (pretest probability) with additional information such as from a test result (posttest
probability) by using test characteristics (sensitivity and specificity) to demonstrate how well the test performs in individuals with and without the disease.
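As a numerical sketch of the theorem (the prevalence and test characteristics here are illustrative assumptions, not drawn from the text):

```python
def posttest_probability(pretest, sensitivity, specificity):
    """Probability of disease given a positive test (positive predictive
    value), computed with Bayes' theorem."""
    true_positives = pretest * sensitivity
    false_positives = (1 - pretest) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

# Illustrative: 1% pretest probability, test 90% sensitive, 95% specific
p = posttest_probability(pretest=0.01, sensitivity=0.90, specificity=0.95)
print(f"{p:.0%}")  # about 15%: even a good test leaves substantial doubt
```

The sketch illustrates why a positive result on an accurate test can still leave the posttest probability low when the pretest probability (suspicion) is low.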
causal reasoning. For physicians, causal reasoning typically involves understanding how abnormalities in physiology, anatomy, genetics, or biochemistry lead
to the clinical manifestations of disease. Through such reasoning, physicians
develop a “causal cascade” or “chain or web of causation” linking a sequence
of plausible cause-and-effect mechanisms to arrive at the pathogenesis or
pathophysiology of a disease.
chief complaint. The primary or main symptom that caused the patient to seek
medical attention.
coherency. In diagnostic verification, testing a particular diagnosis for its coherency involves determining the consistency of that particular diagnosis with predisposing risk factors, physiological mechanisms, and resulting manifestations.
conditional probability. The probability or likelihood of something given that
something else has occurred or is present, for example, the likelihood of disease if a test is positive (posterior probability) or the likelihood of a positive
test if disease is present (sensitivity). See Bayes’ theorem or rule.
consulting physician. A physician, usually a specialist, asked by the patient’s
attending physician to provide an opinion regarding diagnosis, testing, or
treatment or to perform a procedure or intervention, for example, surgery.
diagnostic test. A test ordered to confirm or exclude possible causes of a patient’s
symptoms or signs (distinct from screening test).
diagnostic verification. The last stage of narrowing the differential diagnosis
to a final diagnosis by testing the validity of the diagnosis for its coherency,
adequacy, and parsimony.
differential diagnosis. A set of diseases that physicians consider as possible
causes for patients presenting with a chief complaint (hypothesis generation).
As additional symptoms with further patient history, signs found on physical
examination, test results, or specialty physician consultations become available, the likelihood of various diagnoses may change (hypothesis refinement)
or new ones may be considered (hypothesis modification) until the diagnosis
is nearly final (diagnostic verification).
differential etiology. Term used by the court or witnesses to establish or refute
external causation for a plaintiff’s condition. For physicians, etiology refers
to cause.
external causation. External causation is established by demonstrating that the
cause of harm or disease originates from outside the plaintiff’s body, for
example, defendant’s action or product.
general causation. General causation is established by demonstrating, usually
through scientific evidence, that a defendant’s action or product causes (or is
capable of causing) disease.
heuristics. Quick automatic “rules of thumb” or cognitive shortcuts often
involving pattern recognition that facilitate rapid diagnostic and treatment
decisionmaking. Although characteristic of experts, heuristics may predispose to known cognitive errors. See hypothetico-deductive.
hypothesis generation. A limited list of potential diagnostic hypotheses in
response to symptoms, signs, and lab test results. See differential diagnosis.
hypothesis modification. A change in the list of diagnostic hypotheses (differential diagnosis) in response to additional information, e.g., symptoms, signs,
and lab test results. See differential diagnosis.
hypothesis refinement. A change in the likelihood of the potential diagnostic
hypotheses (differential diagnosis) in response to additional information, e.g.,
symptoms, signs, and lab test results. As additional information emerges,
physicians evaluate those data for their consistency with the possibilities on
the list and whether those data would increase or decrease the likelihood of
each possibility. See differential diagnosis.
hypothetico-deductive. Deliberative and analytical reasoning involving hypothesis
generation, hypothesis modification, hypothesis refinement, and diagnostic
verification. Typically applied for problems outside an individual’s expertise or
difficult problems with atypical issues, it may avoid known cognitive errors.
See heuristics.
individual causation. See specific causation.
inductive reasoning. The process of arriving at a diagnosis based on symptoms,
signs, and lab tests. See differential diagnosis.
inferential reasoning. See inductive reasoning.
overdiagnosis. Screening can lead to “pseudodisease” or “overdiagnosis,” e.g.,
the identification of slow-growing cancers that even if untreated would never
cause symptoms or reduce survival because the screening test cannot distinguish the abnormal-appearing cells that would become cancerous from those
that would never do so. See overtreatment.
overtreatment. The treatment of patients with pseudodisease whose disease
would never cause symptoms or reduce survival. The treatment may place
patients at risk for treatment-related morbidity and possibly mortality. See
overdiagnosis.
parsimony. In diagnostic verification, testing a particular diagnosis for its parsimony involves choosing the simplest single explanation as opposed to requiring the simultaneous occurrence of two diseases to explain the findings.
pathogenesis. See causal reasoning.
pathology test. Microscopic examination of body tissue typically obtained by a
biopsy or during surgery to determine if the tissue appears to be abnormal (different from what would be expected for the source of the tissue). The visual components of the abnormality are typically described (e.g., types of cells, appearance
of cells, scarring, effect of stains or molecular markers that help facilitate identification of the components) and, on the basis of visual pattern, the abnormality
may be classified, e.g., malignancy (cancer) or dysplasia (precancerous).
posttest probability. See predictive value.
predictive value or posttest probability. The suspicion or probability of a
disease after additional information (such as from a test) has been obtained.
The predictive value positive or positive predictive value is the probability
of disease in those known to have a positive test result. The predictive value
negative or negative predictive value is the probability of disease in those
known to have a negative test result.
pretest probability. The suspicion or probability of a disease before additional
information (such as from a test) is obtained.
prior probability. See pretest probability.
screening test. A test performed in the absence of symptoms or signs to detect
disease earlier, e.g., cancer screening (distinct from diagnostic test).
sensitivity. Likelihood of a positive finding (usually referring to a test result but
could also be a symptom or a sign) among individuals known to have a disease
(distinct from specificity).
sign. An abnormal physical finding identified at the time of physical examination
(distinct from symptoms).
specific causation or individual causation. Established by demonstrating that
a defendant’s action or product is the cause of a particular plaintiff’s disease.
specificity. Likelihood of a negative finding (usually referring to a test result but
could also be a symptom or a sign) among individuals who do not have a
particular disease (distinct from sensitivity).
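Sensitivity, specificity, and the predictive values can all be read off a 2×2 table of test results against true disease status (the counts below are illustrative assumptions, not from the text):

```python
# Illustrative 2x2 table: 10,000 people, 1% prevalence,
# a test that is 90% sensitive and 95% specific.
true_pos, false_pos = 90, 495    # positive test: with / without disease
false_neg, true_neg = 10, 9405   # negative test: with / without disease

sensitivity = true_pos / (true_pos + false_neg)  # 90/100  = 0.90
specificity = true_neg / (true_neg + false_pos)  # 9405/9900 = 0.95
ppv = true_pos / (true_pos + false_pos)          # positive predictive value
print(f"sensitivity={sensitivity:.2f}, "
      f"specificity={specificity:.2f}, PPV={ppv:.0%}")
```

Sensitivity and specificity are computed down the disease columns, while the predictive values are computed across the test-result rows, which is why a highly sensitive and specific test can still have a modest positive predictive value at low prevalence.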
symptom. The patient’s description of a change in function, sensation, or appearance (distinct from sign).
syndrome. A group of symptoms, signs, and/or test results that together characterize a specific disease.
References on Medical Testimony
Lynn Bickley et al., Bates’ Guide to Physical Examination and History Taking
(10th ed. 2008).
Gerd Gigerenzer, Calculated Risks: How to Know When Numbers Deceive You (2002).
Trisha Greenhalgh, How to Read a Paper: The Basics of Evidence-Based Medicine (4th ed. 2010).
Gordon Guyatt et al., Users’ Guides to the Medical Literature: Essentials of
Evidence-Based Clinical Practice (2d ed. 2009).
Jerome P. Kassirer et al., Learning Clinical Reasoning (2d ed. 2009).
Harold C. Sox et al., Medical Decision Making (2006).
Sharon E. Straus et al., Evidence-Based Medicine (4th ed. 2010).
Reference Guide on
Neuroscience
HENRY T. GREELY AND ANTHONY D. WAGNER
Henry T. Greely, J.D., is Deane F. and Kate Edelman Johnson Professor of Law, Professor,
by courtesy, of Genetics, and the Director of the Center for Law and the Biosciences, Stanford
University, Stanford, California.
Anthony D. Wagner, Ph.D., is Associate Professor of Psychology and Neuroscience, Stanford
University, Stanford, California.
CONTENTS
I. Introduction, 749
II. The Human Brain, 749
A. Cells, 750
B. Brain Structure, 754
C. Some Aspects of How the Brain Works, 759
III. Some Common Neuroscience Techniques, 761
A. Neuroimaging, 761
1. CAT scans, 762
2. PET scans and SPECT scans, 763
3. MRI—structural and functional, 766
B. EEG and MEG, 772
C. Other Techniques, 773
1. Lesion studies, 773
2. Transcranial magnetic stimulation (TMS), 774
3. Deep brain stimulation (DBS), 775
4. Implanted microelectrode arrays, 775
IV. Issues in Interpreting Study Results, 776
A. Replication, 777
B. Problems in Experimental Design, 777
C. The Number and Diversity of Subjects, 779
D. Applying Group Averages to Individuals, 780
E. Technical Accuracy and Robustness of Imaging Results, 781
F. Statistical Issues, 782
G. Possible Countermeasures, 783
V. Questions About the Admissibility and the Creation of Neuroscience
Evidence, 784
A. Evidentiary Rules, 785
1. Relevance, 785
2. Rule 702 and the admissibility of scientific evidence, 785
3. Rule 403, 788
4. Other potentially relevant evidentiary issues, 789
B. Constitutional and Other Substantive Rules, 790
1. Possible rights against neuroscience evidence, 790
2. Possible rights to the creation or use of neuroscience
evidence, 795
3. The Fourth Amendment, 796
VI. Examples of the Possible Uses of Neuroscience in the Courts, 796
A. Criminal Responsibility, 799
B. Lie Detection, 802
1. Issues involved in the use of fMRI-based lie detection in
litigation, 803
2. Two cases involving fMRI-based lie detection, 805
3. fMRI-based lie detection outside the courtroom, 807
C. Detection of Pain, 807
VII. Conclusion, 811
References on Neuroscience, 812
I. Introduction
Science’s understanding of the human brain is increasing exponentially. We know
almost infinitely more than we did 30 years ago; however, we know almost
nothing compared with what we are likely to know 30 years from now. The results
of advances in understanding human brains—and of the minds they generate—are
already beginning to appear in courtrooms. If, as neuroscience indicates, our mental
states are produced by physical states of our brain, our increased ability to discern
those physical states will have huge implications for the law. Lawyers already are
introducing neuroimaging evidence as relevant to questions of individual responsibility, such as claims of insanity or diminished responsibility, either on issues of
liability or of sentencing. In May 2010, parties in two cases sought to introduce
neuroimaging in court as evidence of honesty; we are also beginning to see efforts
to use it to prove that a person is in pain. These and other uses of neuroscience
are almost certain to increase with our growing knowledge of the human brain
as well as continued technological advances in accurately and precisely measuring
the brain. This chapter strives to give judges some background knowledge about
neuroscience and the strengths and weaknesses of its possible applications in litigation in order to help them become better prepared for these cases.1
The chapter begins with a brief overview of the structure and function of the
human brain. It then describes some of the tools neuroscientists use to understand
the brain—tools likely to produce findings that parties will seek to introduce in
court. Next, it discusses a number of fundamental issues that must be considered
when interpreting neuroscientific findings. Finally, after discussing, in general, the
issues raised by neuroscience-based evidence, the chapter concludes by analyzing
a few illustrative situations in which neuroscientific evidence is likely to appear
in court in the future.
II. The Human Brain
This abbreviated and simplified discussion of the human brain describes the cellular basis of the nervous system, the structure of the brain, and finally our current
understanding of how the brain works. More detailed, but still accessible, information about the human brain can be found in academic textbooks and in popular books for general audiences.2
1. The Law and Neuroscience Project, funded by the John D. and Catherine T. MacArthur Foundation, is preparing a book about law and neuroscience for judges, which should be available by 2011. A Primer on Neuroscience (Stephen Morse & Adina Roskies eds., forthcoming 2011). The Project has already published a pamphlet written by neuroscientists for judges, with brief discussions of issues relevant to law and neuroscience. A Judge’s Guide to Neuroscience: A Concise Introduction (M.S. Gazzaniga & J.S. Rakoff eds., 2010). One early book on a broad range of issues in law and neuroscience also deserves mention: Neuroscience and the Law: Brain, Mind, and the Scales of Justice (Brent Garland ed., 2004).
A. Cells
Like most of the human body, the nervous system is made up of cells. Adult
humans contain somewhere between 50 trillion and 100 trillion human cells.
Each of those cells is both individually alive and part of a larger living organism.
Each cell in the body (with rare exceptions) contains each person’s entire
complement of human genes—his or her genome. The genes, found on very long
molecules of deoxyribonucleic acid (DNA) that make up a human’s 46 chromosomes, work by leading the cells to make other molecules, notably proteins and
ribonucleic acid (RNA). We now believe that there are about 23,000 human
genes. Cells are different from each other not because they contain different genes
but because they turn on and off different sets of genes. All human cells seem to
use the same group of several thousand “housekeeping” genes that run the cell’s
basic machinery, but skin cells, kidney cells, and brain cells differ in which other
genes they use. Scientists count different numbers of “types” of human cells, with
estimates ranging from a few hundred to a few thousand (depending largely on
how narrowly or broadly one defines a cell type).
The most important cells in the nervous system are called neurons. Neurons
pass messages from one neuron to another in a complex way that appears to be
responsible for brain function, conscious or otherwise.
Neurons (Figure 1) come in many sizes, shapes, and subtypes (with their
own names), but they generally have three features: a cell body (or “soma”),
short extensions called dendrites, and a longer extension called an axon. The cell
body contains the nucleus of the cell, which in turn contains the 46 chromosomes
with the cell’s DNA. The dendrites and axons both reach out to make connections with other neurons. The dendrites generally receive information from other
neurons; the axons send information.
Communication between neurons occurs at areas called synapses (Figure 2), where two neurons almost meet.
2. The Society for Neuroscience, the very large scholarly society that covers a wide range of brain science, has published a brief and useful primer about the human brain called Brain Facts. The most recent edition, published in 2008, is available free at www.sfn.org/index.aspx?pagename=brainfacts. Some particularly interesting books about various aspects of the brain written for a popular audience include Oliver W. Sacks, The Man Who Mistook His Wife for a Hat and Other Clinical Tales (1990); Antonio R. Damasio, Descartes’ Error: Emotion, Reason, and the Human Brain (1994); Daniel L. Schacter, Searching for Memory: The Brain, the Mind, and the Past (1996); Joseph E. LeDoux, The Emotional Brain: The Mysterious Underpinnings of Emotional Life (1996); Christopher D. Frith, Making Up the Mind: How the Brain Creates Our Mental World (2007); and Sandra Aamodt & Sam Wang, Welcome to Your Brain: Why You Lose Your Car Keys But Never Forget How to Drive and Other Puzzles of Everyday Life (2008).
Figure 1. Schematic of the typical structure of a neuron.
Source: Quasar Jarosz at en.wikipedia.
At a synapse, the two neurons will come within
less than a micrometer (a millionth of a meter) of each other, with the presynaptic
side, on the axon, separated from the postsynaptic side, on the dendrite, by a gap
called the synaptic cleft. At synapses, when the axon (on the presynaptic side)
“fires” (becomes active) it releases molecules, known as neurotransmitters, into
the synaptic cleft. Some of those molecules are picked up by special receptors on
the dendrite that is on the postsynaptic side of the cleft. More than 100 different
neurotransmitters have been identified; among the best known are dopamine,
serotonin, glutamate, and acetylcholine.
At the postsynaptic side of the cleft, neurotransmitters binding to the receptors can have a wide range of effects. Sometimes they cause the receiving (postsynaptic) neuron to “fire,” sometimes they suppress (inhibit) the postsynaptic
neuron from firing, and sometimes they seem to do neither. The response of the
receiving neuron is a complicated summation of the various messages it receives
from multiple neurons that converge, through synapses, on its dendrites.
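For readers who find a computational sketch helpful, the "complicated summation" just described is often idealized in computational neuroscience as a weighted sum of inputs compared against a firing threshold. The toy model below is our own illustration, not something drawn from this manual; the particular weights and threshold are invented for the example.

```python
def neuron_fires(inputs, weights, threshold):
    """Toy point-neuron model (a deliberate oversimplification).

    Each input is the activity of one presynaptic neuron; excitatory
    synapses carry positive weights, inhibitory synapses negative ones.
    The neuron "fires" only if the summed synaptic drive reaches its
    threshold.
    """
    drive = sum(i * w for i, w in zip(inputs, weights))
    return drive >= threshold

# Three presynaptic neurons: two excitatory (+0.6, +0.5), one inhibitory (-0.8).
weights = [0.6, 0.5, -0.8]
print(neuron_fires([1, 1, 0], weights, 1.0))  # both excitatory inputs active: fires
print(neuron_fires([1, 1, 1], weights, 1.0))  # inhibition suppresses firing
```

Real neurons integrate their inputs over time and space in far more intricate ways, but this threshold picture underlies many simple models of neural networks.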
A neuron that does fire does so by generating an electrical current that
flows down (away from the cell body) the length of its axon. We normally
think of electrical current as flowing in things like copper wiring. In that case,
free electrons move down the wire. The electrical currents of neurons are more
complicated. Molecules with a positive or negative electrical charge (ions) move
through the neuron’s membrane and create differences in the electrical charge
between the inside and outside of the neuron, with the current traveling along
the axon, rather like a fire brigade passing buckets of water in only one direction
down the line.
Figure 2. Synapse. Communication between neurons occurs at the synapse, where the sending (presynaptic) and receiving (postsynaptic) neurons meet. When the presynaptic neuron fires, it releases neurotransmitters into the synaptic cleft, which bind to receptors on the postsynaptic neuron.
Source: Carlson, Neil R., Foundations of Physiological Psychology (with Neuroscience Animations and Student Study Guide CD-ROM), 6th ed. © 2005. Printed and electronically reproduced by permission of Pearson Education, Inc., Upper Saddle River, New Jersey.
Firing occurs in milliseconds. This process of moving ions in and
out of the cell membrane requires that the cell use large amounts of energy. When
the current reaches the end of the axon, it may or may not cause the axon to
release neurotransmitters into the synaptic cleft. This complicated part-electrical,
part-chemical system is how information passes from one neuron to another.
The axons of human neurons are all microscopically narrow, but they vary
enormously in length. Some are micrometers long; others, such as neurons running from the base of the spinal cord to the toes, are several feet long. Longer
axons tend to be coated with a fatty substance called myelin. Myelin helps insulate
the axon and thus increases the strength and efficiency of the electrical signal,
much like the insulation wrapped around a copper wire. (The destruction of this
myelin sheathing is the cause of multiple sclerosis.) Axons coated with myelin
appear white; thus areas of the nervous system that have many myelin-coated
axons are referred to as “white matter.” Cell bodies, by contrast, look gray, and so
areas with many cell bodies and relatively few axons make up our “gray matter.”
White matter can roughly be thought of as the wiring that connects gray matter
to the rest of the body or to other areas of gray matter.
What we call nerves are really bundles of neurons. For example, we all have
nerves that run down our arms to our fingers. Some of those nerves consist of
neurons that pass messages from the fingers, up the arm, to other neurons in the
spinal cord that then pass the messages on to the brain, where they are analyzed
and experienced. This is how we feel things with our fingers. Other nerves are
bundles of neurons that pass messages from the brain through the spinal cord to
nerves that run down the arms to the fingers, telling them when and how to move.
Neurons can connect with other neurons or with other kinds of cells.
Neurons that control body movements ultimately connect to muscle cells—these
are called motor neurons. Neurons that feed information into the brain start
with specialized sensory cells (i.e., cells specialized for detecting different types
of stimuli—light, touch, heat, pain, and more) that fire in response to the appropriate stimulus. Their firings ultimately lead, directly or through other neurons,
into the brain. These are sensory neurons. Both kinds of neurons send information in only one direction: motor neurons carry it away from the brain, and sensory neurons carry it to the brain. The paralysis caused by, for example, severe damage to the spinal cord
both prevents the legs from receiving messages to move that would come from
the brain through the motor neurons and keeps the brain from receiving messages from sensory neurons in the legs about what the legs are experiencing. The
break in the spinal column prevents the messages from getting through, just as a
break in a local telephone line will keep two parties from connecting. (There are,
unfortunately, not yet any human equivalents to wireless service.)
Estimates of the number of cells in a human brain vary widely, from a few
hundred billion to several trillion. These cells include those that make up blood
vessels and various connective tissues in the brain, but most of them are specialized
brain cells. About 80 billion to 100 billion of these brain cells are neurons; the
other cells (and the source of most of the uncertainty about the number of cells)
are principally another class of cells referred to generally as glial cells. Glial cells
play many important roles in the brain, including, for example, producing and
maintaining the myelin sheaths that insulate axons and serving as a special immune
system for the brain. The full importance of glial cells is still being discovered;
emerging data suggest that they may play a larger role in mental processes than that of mere “support staff.” At this point, however, we concentrate on neurons, the brain
structures they form, and how those structures work.
B. Brain Structure
Anatomists refer to the brain, the spinal cord, and a few other nerves directly
connecting to the brain as the central nervous system. All the other nerves are
part of the peripheral nervous system. This reference guide does not focus on the
peripheral nervous system, despite its importance in, for example, assessing some
aspects of personal injuries. We also, less fairly, ignore the central nervous system
other than the brain, even though the spinal cord, in particular, plays an important
role in modulating messages going into and coming out of the brain.
The average adult human brain (Figure 3) weighs about 3 pounds and fills
a volume of about 1300 cubic centimeters. If liquid, it would not quite fill two standard (750-milliliter) wine bottles. Living brains have a consistency
about like that of gelatin. Despite the softness of brains, they are made up of
regular shapes and structures that are generally consistent from person to person.
Just as every nondamaged or nondeformed human face has two eyes, two ears,
one nose, and one mouth with standard numbers of various kinds of teeth, every normal brain has the same set of identifiable structures, both large and small.
Figure 3. Lateral (left) and mid-sagittal (right) views of the human brain.
Source: Courtesy of Anthony Wagner.
Neuroscientists have long worked to describe and define particular regions
of the brain. In some ways this is like describing parcels of land in property
documents, and, like property descriptions, several different methods are used.
At the largest scale, the brain is often divided into three parts: the brain stem, the
cerebellum, and the cerebrum.3
The brain stem is found near the bottom of the brain and is, in some ways,
effectively an extension of the spinal cord. Its various parts play crucial roles in
controlling the body’s autonomic functioning, such as heart rate and digestion. The
brain stem also contains important regions that regulate processing in the cerebrum.
For example, the substantia nigra and ventral tegmental area in the brain stem
consist of critical neurons that generate the neurotransmitter dopamine. While the
substantia nigra is crucial for motor control, the ventral tegmental area is important
for learning about rewards. The loss of neurons in the substantia nigra is at the core
of the movement problems of Parkinson’s disease.
The cerebellum, which is about the size and shape of a squashed tennis ball,
is tucked away in the back of the skull. It plays a major role in fine motor control
and seems to keep a library of learned motor skills, such as riding a bicycle. It was
long thought that damage to the cerebellum had little to no effect on a person’s
personality or cognitive abilities, but resulted primarily in unsteady gait, difficulty
in making precise movements, and problems in learning movements. More recent
studies of patients with cerebellar damage and functional brain imaging studies of
healthy individuals indicate that the cerebellum also plays a role in more cognitive
functions, including supporting aspects of working memory, attention, and language.
The cerebrum is the largest part of the human brain, making up about 85% of
its volume. The cerebrum is found at the front, top, and much of the back of the
human brain. The human brain differs from the brains of other mammals mainly
because it has a vastly enlarged cerebrum.
There are several different ways to identify parts of, or locations in, the cerebrum. First, the cerebrum is divided into two hemispheres—the famous left and
right brain. These two hemispheres are connected by tracts of white matter—of
axons—most notably the large connection called the corpus callosum. Oddly, the
right hemisphere of the brain generally receives messages from and controls the
movements of the left side of the body, while the left hemisphere receives messages from and controls the movements of the right side of the body.
Each hemisphere of the cerebrum is divided into four lobes (Figure 4): the frontal lobe in the front of the cerebrum (behind the forehead), the parietal lobe at
the top and toward the back, the temporal lobe on the side (just behind and above the ears), and the occipital lobe at the back. Thus, one could describe a particular region as lying in the left frontal lobe—the frontal lobe of the left hemisphere.
3. The brain also is sometimes divided into the forebrain, midbrain, and hindbrain. This classification is useful for some purposes, particularly in describing the history and development of the vertebrate brain, but it does not entirely correspond to the categorization of cerebrum, brain stem, and cerebellum, and it is not used in this reference guide.
Figure 4. Lobes of a hemisphere. Each hemisphere of the brain consists of four lobes: the frontal, parietal, temporal, and occipital lobes.
Source: http://commons.wikimedia.org/wiki/File:Gray728.svg. This image is in the public domain because its copyright has expired. This applies worldwide.
The surface of the cerebrum consists of the cortex, which is a sheet of gray
matter a few millimeters thick. The cortex is not a smooth sheet in humans, but
rather is heavily folded with valleys, called sulci (“sulcus” in the singular), and
bulges, called gyri (“gyrus”). The sulci and gyri have their own names, and so a
location can be described as in the inferior frontal gyrus in the left frontal lobe.
These folds allow the surface area of the cortex, as well as the total volume of the
cortex, to be much greater than in other mammals, while still allowing it to fit
inside our skulls, similar to the way the many folds of a car’s radiator give it a very
large surface area (for radiating away heat) in a relatively small space.
The cerebral cortex is extraordinarily large in humans compared with other
species and is clearly centrally involved in much of what makes our brains special, but the cerebrum contains many other important subcortical structures that
we share with other vertebrates. Some of the more important areas include the
thalamus, the hypothalamus, the basal ganglia, and the amygdala. These areas all
connect widely, with the cortex, with each other, and with other parts of the brain
to form complex networks.
The functions of all these areas are many, complex, and not fully understood,
but some facts are known. The thalamus seems to act as a main relay that carries
information to and from the cerebral cortex, particularly for vision, hearing,
touch, and proprioception (one’s sense of the position of the parts of one’s body).
It also is, importantly, involved in sleep, wakefulness, and consciousness. The
hypothalamus has a wide range of functions, including the regulation of body
temperature, hunger, thirst, and fatigue. The basal ganglia are a group of regions
in the brain that are involved in motor control and learning, among other things.
They seem to be strongly involved in selecting movements, as well as in learning
through reinforcement (as a result of rewards). The amygdala appears to be important in emotional processing, including how we attach emotional significance to
particular stimuli.
In addition, many other parts of the brain, in the cortex or elsewhere, have
their own special names, usually with Latin or Greek roots that may or may not
seem descriptive today. The hippocampus, for example, is named for the Greek
word for seahorse. For most of us, these names will have no obvious rhyme or reason, but merely must be learned as particular structures in the brain—the superior
colliculus, the tegmentum, the globus pallidus, the substantia nigra, the cingulate
cortex, and more. All of these structures come in pairs, with one in the left hemisphere and one in the right hemisphere; only the pineal gland is unpaired. Brain
atlases include scores of names for particular structures or regions in the brain and
detailed information about the structures or regions.
Some of these smaller structures may have special importance to human
behavior. The nucleus accumbens, for example, is a small subcortical region in
each hemisphere of the cerebrum that appears important for reward processing
and motivation. In experiments with rats that received stimulation of this region
in return for pressing a lever, the rats would press the lever almost to the exclusion of any other behavior, including eating. The nucleus accumbens in humans
appears linked to appetitive motivation, responding in anticipation of primary
rewards (such as pleasure from food and sex) and secondary rewards (such as
money). Through interactions with the orbital frontal cortex and dopamine-generating neurons in the midbrain (including the ventral tegmental area), the nucleus accumbens is considered part of a “reward network.” Because of its hypothesized role in addictive behavior and, more broadly, in reward computations, this putative reward network is a topic of considerable ongoing research.
All of these various locations, whether defined broadly by area or by the
names of specific structures, can be further subdivided using directions: front and
back, up and down, toward the middle, or toward the sides. Unfortunately, the
directions often are not expressed in a straightforward manner, and several different terminological conventions exist. Locations toward the front or back of the
brain can be referred to as either anterior or posterior or as rostral or caudal (literally, toward the nose, or beak, or the tail). Locations toward the bottom or top
of the brain are termed inferior or superior or, alternatively, as ventral or dorsal
(toward the stomach or toward the back). A location toward the middle of the
brain is called medial; one toward the side is called lateral. Thus, different locations could be described, for example, as in the left anterior cingulate cortex, in
the dorsal medial (or sometimes dorsomedial) prefrontal cortex, or in the posterior
hypothalamus.
Finally, one other method often is used, a method created by Korbinian
Brodmann in 1909. Brodmann, a neuroanatomist, divided the brain into about 50
different areas or regions (Figure 5). Each region was defined on the basis of the kinds of neurons found there and how those neurons are organized.
Figure 5. Brodmann’s areas. Brodmann divided the cortex into different areas based on the cell types and how they were organized.
Source: Prof. Mark Dubin, University of Colorado.
A location
described by Brodmann area may or may not correspond closely with a structural
location. Other organizational schemes exist, but Brodmann’s remains the most
widely used to describe the approximate locations of findings in modern human
brain imaging studies.
C. Some Aspects of How the Brain Works
Most of neuroscience is dedicated to finding out how the brain works; although much has been learned, considerably more remains unknown. What is known can be described in many different ways.
This section discusses a few important aspects of brain function and makes several
general points about the localization and distribution of functions, as well as brain
plasticity, before commenting on the effects of hormones and other chemical
influences on the brain.
Some brain functions are localized in, or especially dependent on, particular
regions of the brain. This has been known for many years as a result of studies of
people who, through traumatic injury, stroke, or cancer, have lost, or lost the use
of, particular regions of their brains. For example, in the 1860s, French anatomist
Paul Broca discovered through autopsies of patients that damage to a region in the
left inferior frontal lobe (now known as Broca’s area) caused an inability to speak.
It is now known that some functions cannot normally be performed when particular brain areas are damaged or missing. The visual cortex, located at the back of
the brain in the occipital lobes, is as necessary for vision as the eyes are; the hippocampus is necessary for the creation of many kinds of memory; and the motor
cortex is necessary for voluntary movements. The motor cortex and the parallel
somatosensory cortex, which is essential for processing sensory information such as
the sense of touch from the body, are further subdivided, with particular regions
necessary for causing motion or sensing feelings from the legs, arms, fingers, face,
and so on. Other brain regions also will be involved in these actions or sensations,
but these regions are necessary to them.
At the same time, the fact that a region is necessary to a particular class of
sensations, behaviors, or cognition does not mean either that it is not involved in
other brain functions or that other brain regions do not also contribute to these
particular abilities. The amygdala, for example, is involved in our feelings of fear,
but it is also involved broadly in emotional reactions, both positive and negative.
It also modulates learning, memory, and even sensory perception. Although some
functions are localized, others are widely distributed. For example, the visual cortex is essential to vision, but actual visual perception involves many parts of the
brain in addition to the occipital lobes. Memories appear to be stored over much
of the cortex. Networks of brain regions participate in many of these functions.
For example, if you touch something very hot with your left index finger,
your spinal cord, through a reflex loop, will cause you to pull your finger back
very quickly. Then the part of your right somatosensory cortex devoted to the
index finger will be involved in receiving and initially interpreting the sensation.
Other areas of your brain will recognize the stimulus as painful, your motor
regions will be involved in waving your hand back and forth or bringing your
finger to your mouth, widespread parts of your cortex may lead to your remembering other instances of burning yourself, and your hippocampus may play a role
in making a new long-term memory of this incident. There is no brain region
“for” burning your finger; many regions, both specific and general, contribute to
the brain’s response.
In addition, brains are at least somewhat “plastic” or changeable on both
small and large scales. Anyone who can see has a working visual cortex, and it is
always located in the back of the brain (in the occipital lobe), but its exact borders
will vary slightly from person to person. In other cases, the brain may adjust and
change in response to a person’s behavior or changes in that person’s anatomy. For
example, a right-handed violinist may develop an enlarged brain region for controlling the fingers of the left hand, used in fingering the violin. If a person loses
an arm to amputation, the parts of the motor and somatosensory cortices that had
dealt with that arm may be “taken over” by other body parts. In some cases, this
brain plasticity can be extreme. A young child who has lost an entire hemisphere
of his or her brain may grow up to have normal or nearly normal functionality as
the remaining hemisphere takes on the tasks of the missing hemisphere. Unfortunately, the possibilities of this kind of extreme plasticity do diminish with age,
but rehabilitation after stroke in adults sometimes does show changes in the brain
functions undertaken by particular brain regions.
The picture of the brain as a set of interconnected neurons that fire in networks or patterns in response to stimuli is useful but not complete. In addition
to neuron firings, other factors affect how the brain works, particularly chemical
factors.
Some of these are hormones, generated by the body either inside or outside
the brain. They can affect how the brain functions, as well as how it develops.
Sex hormones such as estrogen and testosterone can have both short-term and
long-term effects on the brain. So can other hormones, such as cortisol, associated with stress, and oxytocin, associated with, among other things, trust and
bonding. Endorphins, chemicals secreted by the pituitary gland in the brain,
are associated with pain relief and a sense of well-being. Still other chemicals,
brought in from outside the body, can have major effects on the brain, both in
the short term and the long term. Examples include alcohol, caffeine, nicotine,
morphine, and cocaine. These can trigger very specific brain reactions or can
have broad effects.
III. Some Common Neuroscience
Techniques
Neuroscientists use many techniques to study the brain. Some of them have been
used for centuries, such as autopsies and the observation of patients with brain
damage. Some, such as the intentional destruction of parts of the brain, can be
used ethically only in research on nonhuman animals. Of course, research with
nonhuman animals, although often helpful in understanding human brains, is of
less value when examining behaviors that are uniquely developed among humans.
The current revolution in neuroscience is largely the result of a revolution in the
tools available to neuroscientists, as new methods have been developed to image
and to intervene in living brains. These methods, particularly the imaging methods
that allow more precise measurements of human brain structure and function in
living people, are giving rise to increasing efforts to introduce neuroscientific
evidence in court.
This section of this chapter focuses on several kinds of neuroimaging—
computerized axial tomography (CAT) scans, positron emission tomography
(PET) scans, single photon emission computed tomography (SPECT) scans,
and magnetic resonance imaging (MRI), as well as an older method, electroencephalography (EEG), and its close relative, magnetoencephalography (MEG).
Some of these methods show the structure of the brain, others show the brain’s
functioning, and some do both. These are not the only important neuroscience
techniques; several others are discussed briefly at the end of this section. Genetic
analysis provides yet another technique for increasing our understanding of human
brains and behaviors, but this chapter does not deal with the possible applications
of human genetics to understanding behavior.
A. Neuroimaging
Traditional imaging technologies have not been very helpful in studying the
brain. X-ray images are the shadows cast by dense objects. Not only is the brain
surrounded by our very dense skulls, but there are no dense objects inside the
brain to cast these shadows. Although a few features of the brain or its blood
vessels could be seen through methods that involved the injection of air into
some of the spaces in the brain or of contrast media into the blood, these provided limited information. The opportunity to see inside a living brain itself only
goes back to about the 1970s, with the development of CAT scans. This ability
has since exploded with the development of several new techniques, three of which, along with CAT, are discussed on the following pages.
1. CAT scans
The CAT scan is a multidimensional, computer-assisted X-ray machine. Instead
of taking one X ray from a fixed location, in a CAT scan both the X-ray source
and (180 degrees opposite the source) the X-ray detectors rotate around the person being scanned. Rather than exposing negatives to make “pictures” of dense
objects, as in traditional X rays, the X-ray detectors produce data for computer
analysis. A complete modern CAT scan includes data sufficient to reconstruct the
scanned object in three dimensions. Computerized algorithms can then be used to
produce an image of any particular slice through the object. The multiple angles
and computer analysis make it possible to pick out the relatively small density differences within the brain that traditional X-ray technology could not distinguish
and to use them to produce images of the soft tissue (Figure 6).
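The reconstruction step can be illustrated with a deliberately crude sketch. The code below is not how clinical scanners work (they apply filtered back-projection or iterative methods over hundreds of angles); it simply shows, with made-up numbers and only two viewing angles, how projections taken from different directions can be combined to locate a dense region inside an object:

```python
# Toy "slice": an 8x8 density grid with a small dense blob.
image = [[0.0] * 8 for _ in range(8)]
for r in range(3, 5):
    for c in range(4, 6):
        image[r][c] = 1.0

# Detector readings at two angles: total density along each row (0 degrees)
# and along each column (90 degrees), analogous to what the rotating
# X-ray detectors record.
proj_rows = [sum(row) for row in image]
proj_cols = [sum(image[r][c] for r in range(8)) for c in range(8)]

# Unfiltered back-projection: smear each projection back across the image
# plane and add. With only two angles the result is blurry; real CAT
# reconstruction combines many angles plus a filtering step.
recon = [[proj_rows[r] + proj_cols[c] for c in range(8)] for r in range(8)]

# The brightest cell of the crude reconstruction falls inside the blob.
peak = max(((r, c) for r in range(8) for c in range(8)),
           key=lambda rc: recon[rc[0]][rc[1]])
```

Adding more angles sharpens the reconstruction, which is why the scanner rotates the source and detectors around the patient.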
Figure 6. CAT scan depicting axial sections of the human brain. The ventral-most
(bottom) surface of the brain is at the upper left and the dorsal-most
(top) surface is at the lower right.
Source: http://en.wikipedia.org/wiki/File:CT_of_brain_of_Mikael_H%C3%A4ggstr%C3%B6m_
large.png. Image in the public domain.
The CAT scan provides a structural image of the brain. It is useful for showing some kinds of structural abnormalities, but it provides no direct information
Reference Guide on Neuroscience
about the brain’s functioning. A CAT scan brain image is not as precise as the
image produced from an MRI, but because the procedure is both quick and (relatively) inexpensive, CAT scanners are common in hospitals. Medically, brain CAT
scans are used mainly to look for bleeding or swelling inside the brain, although
they also will record sizeable tumors or other large structural abnormalities. For
neuroscience, the great advantage of the CAT scan was its ability, for the first time,
to reveal some details inside the skull, an ability that has been largely superseded
for research by MRI. CAT scans have been used in courts to argue that structural
changes in the brain, shown on the CAT scan, are evidence of insanity or other
mental impairments. Perhaps their most notable use was in 1982 in the trial of
John Hinckley for the attempted assassination of President Ronald Reagan. A
CAT scan of Hinckley’s brain that showed widened sulci (the “valleys” in the
surface of the brain) was introduced into evidence to show that Hinckley suffered
from organic brain damage in the form of shrinkage of his brain.4
2. PET scans and SPECT scans
Traditional X-ray machines and their more sophisticated descendant, the
CAT scan, project X rays through the skull and create images based on how much
of the X rays are blocked or absorbed. PET scans and SPECT scans operate very
differently. In these methods, a substance that emits radiation is introduced into
the body. That radiation then is detected from outside the body in a way that can
determine the location of the radiation source. These scans generally are not used
for determining the brain’s structure, but for understanding how it is functioning.
They are, however, particularly good at measuring one aspect of brain structure—the density
of particular receptors, such as those for dopamine, at synapses in some areas of
the brain, such as the frontal lobes.
Radioactive decay of atoms can take several forms, producing alpha, beta, or
gamma radiation. PET scanners take advantage of isotopes of atoms that decay
by giving off positive beta radiation. Beta decay usually involves the emission
of an electron; positive beta decay involves the emission of a positron, the positively charged antimatter equivalent of an electron. When positrons (antimatter)
meet electrons (matter), the two particles are annihilated and converted into two
photons of gamma radiation with a known energy (511,000 electron volts) that
follow directly opposite paths from the site of the annihilation. Inside the body,
the collision between the positron and electron and the consequent production
of the gamma radiation photons takes place within a short distance (a millimeter
or two) of the site of the initial radioactive decay that produced the positron.
4. The effects of this evidence on the verdict are unclear. See Lincoln Caplan, The Insanity
Defense and the Trial of John W. Hinckley, Jr. (1984) for a discussion of the case and its consequences
for the law.
PET scans, therefore, start with the introduction into a person’s body of
a radioactive tracer that decays by giving off a positron. One common tracer
is fluorodeoxyglucose (FDG), a molecule that is almost identical to the simple
sugar, glucose, except that one of the oxygen atoms in glucose is replaced by an
atom of fluorine-18, an isotope of the element fluorine with nine protons and
nine neutrons. Fluorine normally found in nature is fluorine-19, with nine protons
and ten neutrons, and is stable. Fluorine-18 is very unstable and decays through
positive beta decay, half of its atoms decaying every 110 minutes (its
half-life). The body treats FDG as though it were glucose, and so the FDG is
concentrated where the body needs the energy supplied by glucose. A major
clinical use of PET scans derives from the fact that tumor cells use energy, and
hence glucose, at much higher rates than normal cells.
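The arithmetic of this decay is simple exponential halving. A rough illustration (the 110-minute half-life comes from the text; the specific elapsed times are chosen only for the example):

```python
def fraction_remaining(t_minutes, half_life_minutes=110.0):
    # Exponential decay: after each half-life, half of the remaining
    # fluorine-18 atoms have decayed.
    return 0.5 ** (t_minutes / half_life_minutes)

# After a roughly hour-long uptake period, about two-thirds of the
# tracer is still undecayed; a day later, almost none remains.
after_uptake = fraction_remaining(60)        # ~0.69
after_one_day = fraction_remaining(24 * 60)  # ~0.0001
```

This is why fluorine-18 tracers must be produced close (in time and distance) to the scanner where they will be used.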
After giving the FDG time to become concentrated in the body, which usually
takes about an hour, the person is put inside the scanner itself. There, the person
is entirely surrounded by a very sensitive radiation detector, tuned to respond to
gamma radiation of the energy produced by annihilated positrons. When two “hits”
are detected by two sensors at about the same time, the source is known to be
located on a line connecting the two. Very small differences in the timing of when
the radiation is detected can help determine where along that line the annihilation took place. In this way, as more gamma radiation from the decaying FDG is
detected, the general location of the FDG within the body can be determined and,
as a result, tissue that is using a lot of glucose, such as a tumor, can be located.
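The localization from timing differences rests on nothing more than the speed of light. The numbers below are illustrative, not drawn from the text (modern "time-of-flight" PET systems resolve arrival times to a few hundred picoseconds):

```python
SPEED_OF_LIGHT_M_PER_NS = 0.2998  # gamma photons travel ~30 cm per nanosecond

def offset_from_midpoint_m(arrival_time_difference_ns):
    # The photon heading toward the nearer detector arrives earlier.
    # Half the resulting path-length difference gives the distance of the
    # annihilation site from the midpoint of the line joining the two
    # detectors that registered the coincident "hits."
    return SPEED_OF_LIGHT_M_PER_NS * arrival_time_difference_ns / 2.0

# A 1-nanosecond arrival difference places the source about 15 cm
# from the midpoint of the line.
offset = offset_from_midpoint_m(1.0)  # ~0.15 m
```

Accumulating many such lines, each constrained by its timing, is what allows the scanner to build up a map of where the tracer is concentrated.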
In neuroscience research, PET scans also can be taken using different molecules that bind more specifically to particular tissues or cells. Some of these more
specific ligands use fluorine-18, but others use a different radioactive tracer that
also decays by emitting a positron—oxygen-15. This can be used to determine
what parts of the brain are using more or less oxygen. Oxygen-15, however, has
a much shorter half-life (2 minutes) and so is more difficult and expensive to use
than FDG. Similarly, carbon-11, with a half-life of 20 minutes, also can be used.
Carbon-11 atoms can be introduced into various molecules that bind to important receptors in the brain, such as receptors for dopamine, serotonin, or opioids.
This allows the study of the distribution and function of these receptors, both in
healthy people and in people with various mental illnesses or neurological diseases.
The result of a PET scan is a record of the locations of positron decay events
in the brain. Computer visualization tools can then create cross-sectional images of
the brain, showing higher and lower rates of decay, with differences in magnitude
typically depicted through the use of different colors (Figure 7).
PET scans are excellent for showing the location of various receptors in normal and abnormal brains. PET scans are also very good for showing areas of different glucose use and, hence, of different levels of metabolism. This can be very
useful, for example, in detecting some kinds of brain damage, such as the damage
that occurs with Alzheimer’s disease, where certain regions of the brain become
abnormally inactive, or in brain regions that have been damaged by a stroke.
Figure 7. PET scan depicting an axial section of the human brain.
Source: http://en.wikipedia.org/wiki/Positron_emission_tomography. Image in the public domain.
In addition, the comparison (subtraction) of two PET scan measurements,
one scan when a person is engaged in a task that is thought to require particular
brain functions and a second control (or baseline) scan that is not thought to
require these functions, allows researchers indirectly to measure brain function.
PET scans were initially used in this way in research to show what areas of the
brain were used when people experienced various stimuli or performed particular
tasks. PET has been substantially superseded for this purpose by functional MRI,
which is less expensive, does not involve radiation exposure, provides better spatial
resolution, and allows a longer period of testing.
SPECT scans are similar to PET scans. Each can produce a three-dimensional
model of the brain and display images of any cross section through the brain. Like
PET scans, they require the injection of a radioactive tracer material; unlike PET
scans, the radioactive tracer in SPECT directly emits gamma radiation rather than
emitting positrons. These kinds of tracers are more stable, more accessible, and
much cheaper than the positron-emitting tracers needed for PET scans. With
a PET scan, the gamma detector entirely surrounds the person; with a SPECT
scan, one to three gamma detectors are rotated around the body over about 15 to
20 minutes. As with PET scans, the SPECT tracers can be used to measure brain
metabolism or to attach to specific molecular receptors in the brain. The spatial
resolution of a SPECT scan, however, is poorer than with a PET scan, with an
uncertainty of about 1 cm.
Both PET and SPECT scans are most useful if coupled with good structural
images. Contemporary PET and SPECT scanners often include a simultaneous
CAT scan; there is some experimental work aimed at providing simultaneous PET
and MRI scans.
3. MRI—structural and functional
MRI was developed in the 1970s, first came into wide use in the 1980s, and is
currently the dominant neuroimaging technology for producing detailed images of
the brain’s structure and for measuring aspects of brain function. MRI operates on
completely different principles than either CAT scans or PET or SPECT scans; it
does not rely on X rays passing through the brain or on the decay of radioactive
tracer molecules inside the brain. Rather, MRI’s workings involve more complicated physics. This section discusses the general characteristics of MRI and then
focuses on structural MRI, diffusion tensor imaging, and finally, functional MRI.
The power of an MRI scanner is measured by the strength of its magnetic
field, measured in units called tesla (T). The magnetic field of a small bar magnet
is about 0.01 T. The strength of the Earth’s magnetic field is about 0.00005 T. The
MRI machines used for clinical purposes use magnetic fields of between 0.2 T
and 3.0 T, with 1.5 T or 3.0 T being the systems most commonly used today.
MRI machines for human research purposes have reached 9.4 T. In general, the
stronger the magnetic field, the better the image, although higher fields also can
create their own measurement difficulties, especially when imaging brain function.
MRI machines achieve these high magnetic fields through using superconducting
magnets, made by cooling the electromagnet with liquid helium at a temperature 4° (Celsius) above absolute zero. For this and other reasons, MRI systems
are complicated, with higher initial and continuing maintenance costs compared
with some other methods for functional imaging (e.g., electroencephalography;
see infra Section III.B).
In most MRI systems (Figure 8), the subject, on an examination table, slides
into a cylindrical opening in the machine so that the part of the body to be imaged
is in the middle of the magnet. Depending on the kind of imaging performed,
the examination or experiment can take from about 30 minutes to more than
2 hours; throughout the scanning process the subject needs to stay as motionless
as possible to avoid corrupting the images. The main sensations for the subject
are the loud thumping and buzzing noises made by the machine, as well as the
machine’s vibration.
MRI examinations appear to involve minimal risk. Unlike the other neuroimaging technologies discussed above, MRI does not involve any high-energy
Figure 8. MRI machine. Magnetic resonance imaging systems are used to
acquire both structural and functional images of the brain.
Source: Courtesy of Anthony Wagner.
radiation. The magnetic field seems to be harmless, at least as long as magnetizable
objects are kept away from it. MRI subjects need to remove most metal objects;
people with some kinds of implanted metallic devices, with tattoos with metal
in their ink, or with fragments of ferrous metal anywhere in their bodies cannot
be scanned because of the dangerous effects of the field on those bits of metal.
When the subject is positioned in the MRI scanner, the powerful field of the
magnet causes the nuclei of atoms (usually the hydrogen nuclei of the body’s water
molecules) to align with the direction of the main magnetic field of the magnet.
Using a brief electromagnetic pulse, these aligned atoms are then “flipped” out of
alignment from the main magnetic field, and, after the pulse stops, the nuclei then
rapidly realign with the strong main magnetic field. Because the nuclei spin (like
a top), they create an oscillating magnetic field that is measured by a receiver coil.
During structural imaging, the strength of the signal generated partially depends
on the relative density of hydrogen nuclei, which varies from point to point in the
body according to the density of water. In this manner, MRI scanners can generate
images of the body’s anatomy or of other scanned objects. Because an MRI scan
can effectively distinguish between similar soft tissues, MRI can provide very-high-resolution images of the brain's anatomy, which is, after all, made up of soft tissue.
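The "flip" works by resonance: the electromagnetic pulse must match the frequency at which the nuclei precess in the main field, and that frequency scales with field strength. The proportionality constant (the proton's gyromagnetic ratio, about 42.58 MHz per tesla) is a standard physical constant not given in the text; this sketch simply applies it to the field strengths mentioned above:

```python
GYROMAGNETIC_RATIO_MHZ_PER_T = 42.58  # hydrogen nucleus (proton)

def larmor_frequency_mhz(field_tesla):
    # Precession frequency of hydrogen nuclei in a given field, and hence
    # the radio frequency of the pulse used to flip them out of alignment.
    return GYROMAGNETIC_RATIO_MHZ_PER_T * field_tesla

clinical_1_5t = larmor_frequency_mhz(1.5)  # ~64 MHz
clinical_3t = larmor_frequency_mhz(3.0)    # ~128 MHz
```

The scaling also hints at why stronger magnets yield better images: a higher resonance frequency produces a stronger measurable signal from the realigning nuclei.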
Structural MRI scans produce very detailed images of the brain (Figure 9).
They can be used to spot abnormalities, large and small, as well as to see normal
variation in the size and shape of brain features. Structural MRI can be used, for
example, to see how brain features change as a person ages. Previously, getting that
kind of detailed information about a brain required an autopsy or, at a minimum,
extensive neurosurgery. This ability makes structural MRI both an important
clinical tool and a very useful technique for research that tries to correlate human
differences, normal and abnormal, with differences in brain structure, as well as
for research that seeks to understand brain development.
Another structural imaging application of brain MRI has become increasingly
prevalent over the past decade: diffusion tensor imaging (DTI). As noted above,
neuronal tissue in the brain can be divided roughly into gray matter (the bodies
of neurons) and white matter (neuronal axons that transmit signals over distance).
DTI uses MRI to see what direction water diffuses through brain tissue. Tracts
of white matter are made up of bundles of axons coated with fatty myelin. Water
will diffuse through that white matter along the direction of the axons and not,
generally, across them. This method can be used, therefore, to trace the location
of these bundles of white matter and hence the long-distance connections between
different parts of the brain. Abnormal patterns of these connections may be associated with various conditions, from Alzheimer’s disease to dyslexia, some of which
may have legal implications.
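The directionality that DTI measures is commonly summarized as "fractional anisotropy" (FA), a standard index computed from the three principal diffusion rates in each voxel. The formula below is the conventional definition; the eigenvalue numbers are invented for illustration:

```python
import math

def fractional_anisotropy(l1, l2, l3):
    # Standard FA index from the three diffusion-tensor eigenvalues:
    # 0 for perfectly isotropic diffusion, approaching 1 when diffusion
    # runs almost entirely along one axis, as in a white-matter tract.
    mean = (l1 + l2 + l3) / 3.0
    num = (l1 - mean) ** 2 + (l2 - mean) ** 2 + (l3 - mean) ** 2
    den = l1 ** 2 + l2 ** 2 + l3 ** 2
    return math.sqrt(1.5 * num / den)

# Diffusion equal in all directions (as in cerebrospinal fluid) -> FA = 0.
isotropic = fractional_anisotropy(1.0, 1.0, 1.0)   # 0.0
# Diffusion mostly along one axis (a coherent axon bundle) -> FA near 1.
tract_like = fractional_anisotropy(1.7, 0.2, 0.2)  # ~0.87
```

Maps of FA, voxel by voxel, are one common way the white-matter tracts described above are visualized and compared across subjects.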
Functional MRI (fMRI) is perhaps the most exciting use of MRI in neuroscience for understanding brain function. This technique shows what regions
of the brain are more or less active in response to the performance of particular
tasks or the presentation of particular stimuli. It does not measure brain activity
(the firing of neurons) directly but, instead, looks at how blood flow changes in
response to brain activity and uses those changes, through the so-called BOLD
response (the blood-oxygen-level dependent response), to allow the researcher to
infer patterns of brain activity.
Structural MRI generally creates its images through detecting the density of
hydrogen atoms in the subject and flipping them with radio pulses. For fMRI, the
scanner detects changes in the ratio of oxygenated hemoglobin (oxyhemoglobin)
and deoxygenated hemoglobin (deoxyhemoglobin) in particular locations in the
brain. Hemoglobin is the protein in red blood cells that carries oxygen from
the lungs to the body. On the basis of metabolic demands, hemoglobin molecules
Figure 9. Brain MRI scan depicting axial (upper), coronal (lower left), and
sagittal (lower right) images of the human brain.
Source: Courtesy of Anthony Wagner.
supply oxygen for the body’s needs. Accordingly, “fresher” blood will have a
higher ratio of oxyhemoglobin to deoxyhemoglobin than more “used” blood.
Importantly, because deoxyhemoglobin (which is found at a higher level in
“used” blood) causes the fMRI signal to decay, a higher ratio of oxyhemoglobin
to deoxyhemoglobin will produce a stronger fMRI signal.
Neural activity is energy intensive for neurons, and neurons do not contain
any significant reserves of oxygen or glucose. Therefore, the brain’s blood vessels
respond quickly to increases in activity in any one region of the brain by sending more fresh blood to that area. This is the basis of the BOLD response, which
measures changes in the ratio of oxyhemoglobin to deoxyhemoglobin in a brain
region several seconds after activity in that region. In particular, when a brain region
becomes more active, there is first, perhaps more intuitively, a decline in the ratio
of oxyhemoglobin to deoxyhemoglobin immediately after activity in the region,
apparently corresponding to the depletion of oxygen in the blood at the site of the
activity. This decline, however, is very small and very hard to detect with fMRI.
Immediately after this decrease, there is an infusion of fresh (oxyhemoglobin-rich)
blood, which can take several seconds to reach maximum; it is this infusion that
results in the increase in the oxy/deoxyhemoglobin ratio that is measured in BOLD
fMRI studies. Because even this subsequent increase is relatively small and variable, fMRI experiments typically involve many trials of the same task or class of
stimuli in order to be able to see the signal amidst the noise.
Thus, in a typical fMRI experiment the subject will be placed in the scanner
and the researchers will measure differences in the BOLD response throughout his
or her brain between different conditions. A subject might, for example, be told
to look at a video screen on which images of places alternate with images of faces.
For purposes of the experiment, the computer will impose a spatial map on the
subject’s brain, dividing it into thousands of little cubes, each a few cubic millimeters in size, referred to as “voxels.” Either while the data are being collected
(so-called “real-time fMRI”5) or after an entire dataset has been gathered, a computerized program will compare the BOLD signal for each voxel when the screen
was showing places to that when the screen contained faces. Regions that showed a
statistically significant increase in the BOLD response several seconds after the face
was on the video screen compared with the effects several seconds after a screen
showing a place appeared will be said to have been “activated” by seeing the face.
The researchers will infer that those regions were, in some way, involved in how
the brain processes images of faces. The results typically will be displayed as a structural brain image on which areas of more or less activation, as determined by a statistical
test, are indicated by different colors (Figure 10).6
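The voxel-by-voxel comparison at the heart of such an analysis is, at bottom, an ordinary statistical test repeated many times. The sketch below runs the comparison for a single hypothetical voxel with simulated numbers (real analyses test thousands of voxels at once and must correct the statistical threshold accordingly):

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is reproducible

# Simulated BOLD responses for one voxel over 100 "face" trials and 100
# "place" trials (arbitrary units; the face condition is given a slightly
# higher mean response, mimicking a face-selective voxel).
faces = [random.gauss(1.0, 1.0) for _ in range(100)]
places = [random.gauss(0.0, 1.0) for _ in range(100)]

def t_statistic(a, b):
    # Two-sample t statistic: difference in means divided by its
    # estimated standard error.
    se = (statistics.variance(a) / len(a)
          + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se

# Voxels whose statistic clears a threshold are labeled "activated";
# the threshold of 3.0 here is purely illustrative.
activated = abs(t_statistic(faces, places)) > 3.0
```

Because the per-trial signal is small and noisy, it is the averaging over many trials, visible in the shrinking standard error, that makes the effect detectable at all.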
5. Use of this “real-time” fMRI has been increasing, but it is not yet clear whether the claims
for it will stand up.
6. This example is actually a simplified version of experiments performed by Professor Nancy
Kanwisher at MIT in the early 2000s that explored a region of the brain called the fusiform face area,
which is particularly involved in processing visions of faces. See Kathleen M. O'Craven & Nancy
Kanwisher, Mental Imagery of Faces and Places Activates Corresponding Stimulus-Specific Brain Regions, 12
J. Cog. Neurosci. 1013 (2000).
Figure 10. fMRI image. Functional MRI data reveal regions associated with
cognition and behavior. Here, regions of the frontal and parietal
lobes that are more active when remembering past events relative to
detecting novel stimuli are depicted.
Source: Courtesy of Anthony Wagner.
Functional MRI was first proposed in 1990, and the first research results
using BOLD-contrast fMRI in humans were published in 1992. The past decade
has seen an explosive increase in the number of research articles based on fMRI,
with nearly 2500 articles published in 2008—compared with about 450 in 1998.7
7. See the census of fMRI articles from 1993 to 2008 in Carole A. Federico et al., Intersecting
Complexities in Neuroimaging and Neuroethics, in Oxford Handbook of Neuroethics (J. Illes & B.J.
Sahakian eds., 2011). This continued an earlier census from 1993 to 2001. Judy Illes et al., From
Neuroimaging to Neuroethics, 5 Nature Neurosci. 205 (2003).
MRI (functional and structural) is quite safe, and MRI machines are widespread
in developed countries, largely for clinical use but increasingly for research use
as well. Although fMRI research is subject to many questions and controversies
(discussed infra Section IV), this technique has been responsible for most of the
recent interest in applying neuroscience to law, from criminal responsibility to
lie detection.
B. EEG and MEG
EEG is the measurement of the brain’s electrical activity as exhibited on the scalp;
MEG is the measurement of the small magnetic fields generated by the brain’s
electrical activity. The roots of EEG go back into the nineteenth century, but its
use increased dramatically in the 1930s and 1940s.
The process uses electrodes attached to the subject’s head with an electrically
conductive substance (a paste or a gel) to record electrical currents on the surface
of the scalp. Multiple electrodes are used; for clinical purposes, 20 to 25 electrodes
are commonly used, although arrays of more than 200 electrodes can be used.
(In MEG, superconducting “squids”8 are positioned over the scalp to detect the
brain’s tiny magnetic signals.) The electrical currents are generated by the neurons
throughout the brain, although EEG is more sensitive to currents emerging from
neurons closer to the skull. It is therefore more challenging to use EEG to reveal
the functioning of structures deep in the brain.
Because EEG and MEG directly measure neural activity, in contrast to the
measures of blood flow in fMRI, the timing of the neural activity can be measured
with great precision (the temporal resolution), down to milliseconds. On the other
hand, in comparison to fMRI, EEG and MEG are poor at determining the location
of the sources of the currents (the spatial resolution). The EEG/MEG signal is a
summation of the activity of thousands to millions of neurons at any one time. Any
one pattern of EEG or MEG signal at the scalp has an infinite number of possible
source patterns, making the problem of determining the brain source of measured
EEG/MEG signal particularly challenging and the results less precise.
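The non-uniqueness of this "inverse problem" can be shown with a toy linear model. The lead-field numbers below are invented, and real head models are far more elaborate, but the point carries over: when the scalp signal is a weighted sum of more sources than there are sensors, different source patterns can yield identical measurements:

```python
# Hypothetical "lead field": how strongly each of five brain sources
# contributes to each of three scalp electrodes (made-up numbers).
# Sources 3 and 4 happen to project identically to the scalp.
LEAD_FIELD = [[1.0, 0.5, 0.2, 0.3, 0.3],
              [0.2, 1.0, 0.5, 0.1, 0.1],
              [0.1, 0.3, 1.0, 0.2, 0.2]]

def scalp_signal(sources):
    # Signal at each electrode: weighted sum of all source activities.
    return [sum(w * s for w, s in zip(row, sources))
            for row in LEAD_FIELD]

# Two very different source patterns...
pattern_a = [1.0, 0.0, 0.0, 2.0, 0.0]
pattern_b = [1.0, 0.0, 0.0, 0.0, 2.0]

# ...produce exactly the same measurements at the scalp.
same_measurements = scalp_signal(pattern_a) == scalp_signal(pattern_b)
```

With only a handful of sensors and a vast number of possible source configurations, additional assumptions (anatomical constraints, statistical priors) are always needed to pick one solution, which is why EEG/MEG source localization is inherently less precise.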
The results of clinical EEG and MEG tests can be very useful for detecting
some kinds of brain conditions, notably epilepsy, and are also part of the process
of diagnosing brain death. EEG and MEG are also used for research, particularly
in the form of event-related potentials, which correlate the size or pattern of the
EEG or MEG signal with the performance of particular tasks or the presentation
of particular stimuli. Thus, as with the hypothetical fMRI experiment described
above, one could look for any consistent changes in the EEG or MEG signal
when a subject sees faces rather than a blank screen. Apart from the determination
of brain death, where EEG is already used, the most discussed possible legally
relevant uses of EEG have been lie detection and memory detection.
8. SQUID stands for superconducting quantum interference device (and has nothing to do with
the marine animal). This device can measure extremely small magnetic fields, including those generated
by various processes in living organisms, and so is useful in biological studies.
EEG is safe, cheap, quiet, and portable. MEG is safe and quiet, but the technology is considerably more expensive than EEG and is not easily portable. EEG
methods can tolerate much more head movement by the subject than PET or
MRI techniques, although movement is often a challenge for MEG. EEG and
MEG have good temporal resolution, distinguishing between milliseconds, which
makes them very attractive for research, but their spatial resolution is inadequate
for many research questions. As a result, some researchers use a combination of
methods, integrating MRI and EEG or MEG data (acquired simultaneously or at
different times) using sophisticated data analysis techniques.
C. Other Techniques
Functional neuroimaging (especially fMRI) and EEG seem to be the techniques
that are most likely to lead to efforts to introduce neuroscience-based evidence
in court, but several other neuroscience techniques also might have legal applications. This section briefly describes four other methods that may be discussed in
court: lesion studies, transcranial magnetic stimulation, deep brain stimulation, and
implanted microelectrode arrays.
1. Lesion studies
One powerful way to test whether particular brain regions are associated with particular mental processes is to study mental processes after those brain regions have
been destroyed or damaged. Observations of the consequences of such lesions,
created by accidents or disease, were, in fact, the main way in which localization
of brain function was originally understood.
For ethical reasons, the experimental destruction of brain tissue is limited to
nonhuman animals. Nonetheless, in addition to accidental damage, on occasion
human brains will need to be intentionally damaged for clinical purposes. Tumors
may have to be removed or, in some cases, epilepsy may have to be treated by
removing the region of the brain that is the focus for the seizures. Valuable knowledge may be gained from following these subjects.
Our understanding of the role of the hippocampus in creating memories,
as one example, was greatly aided by study of a patient known as H.M.9 When
he was 27 years old, H.M. was treated for intractable epilepsy, undergoing an
9. H.M.’s name, not publicly released until his death, was Henry Gustav Molaison. Details of
his life can be found in several obituaries, including Benedict Carey, H.M., An Unforgettable Amnesiac,
Dies at 82, N.Y. Times, Dec. 4, 2008, at A1, and H.M., A Man Without Memories, The Economist,
Dec. 20, 2008. The first scientific report of his case was W.B. Scoville & Brenda Milner, Loss of Recent
Memory After Bilateral Hippocampal Lesions, 20 J. Neurol., Neurosurg. Psychiatry 11 (1957).
experimental procedure that surgically removed his left and right medial temporal
lobes, including most of his two hippocampi. The surgery was successful, but from
that time until his death in 2008, H.M. could not form new long-term memories, either of events or of facts. His short-term memory, also known as working
memory, was intact, and he could learn new motor, perceptual, and (some) cognitive skills (his “procedural memory” still functioned). He also could remember
his life’s events from before his surgery, although his memories were weaker the
closer the events were to the surgery. Those brain regions were clearly involved
in making new long-term memories for facts or events, but not in storing old ones.
2. Transcranial magnetic stimulation (TMS)
TMS is a noninvasive method of creating a temporary, reversible functional brain
“lesion.” Using this technique, researchers disrupt the organized activity of the
brain’s neurons by applying an electrical current. The current is formed by a rapidly changing magnetic field that is generated by a coil held next to the subject’s
skull. The field penetrates the scalp and skull easily and causes a small current in a
roughly conical portion of the brain below the coil. This current induces a change
in the typical responses of the neurons, which can block the normal functioning
of that part of the brain.
TMS can be done in a number of ways. In some approaches, TMS happens
at the same time as the subject performs the task to be studied. These concurrent
approaches include single pulses or paired pulses as well as rapid (more than once
per second) repetitive TMS that is delivered during task performance. Another
method uses TMS for an extended period, often several minutes, before the task
is performed. This sequential TMS uses slow (less than once per second) repetitive TMS.
The effects of single-pulse/paired-pulse and concurrent repetitive TMS are
present while the coil is generating the magnetic field, and can extend for a few
tens of milliseconds after the stimulation is turned off. By contrast, the effects of
pretask repetitive TMS are thought to last for a few minutes (about half as long as
the actual stimulation). When TMS is repeated regularly in nonhumans, long-term
effects have been observed. Therefore, guidelines regarding how much stimulation can be applied in humans have been established.
The Food and Drug Administration (FDA) has approved TMS as a treatment for otherwise untreatable depression. The neuroscience research value of
TMS stems from its ability to alter brain function in a relatively small area (about
2 cm) in an otherwise healthy brain, thus allowing for targeted testing of the role
of a particular brain region for a particular class of cognitive abilities. By blocking
normal functioning of the affected neurons, this can be equivalent, in effect, to
a temporary lesion of that area of the brain. TMS appears to have minimal risks,
but its long-term effects are not known.
3. Deep brain stimulation (DBS)
DBS is an FDA-approved treatment for several neurological conditions affecting movement, notably Parkinson’s disease, essential tremor, and dystonia. The
device used in DBS includes a lead that is implanted into a specific brain region,
a pulse generator (generally implanted under the shoulder or in the abdomen),
and a wire connecting the two. The pulse generator sends an electric current to
the electrodes in the lead, which in turn affect the functioning of neurons in an
area around the electrodes.
The precise manner by which DBS affects brain function remains unclear.
Even for Parkinson’s disease, for which it is widely used, individual patients
sometimes benefit in unpredictable ways from placement of the lead in different
locations and from different frequency or power of the stimulation.
Researchers are continuing to experiment with DBS for other conditions,
such as depression, minimally conscious state, chronic pain, and overeating that
leads to morbid obesity. The results are sometimes surprising. In a Canadian trial
of DBS for appetite control, the obese patient did not ultimately lose weight but
did suddenly develop a remarkable memory. That research group is now starting
a trial of DBS for dementia.10 Other surprises have included some negative side
effects from DBS, such as compulsive gambling, hypersexuality, and hallucinations. These kinds of unexpected consequences from DBS make it of continuing
broader research interest.
4. Implanted microelectrode arrays
Ultimately, to understand the brain fully one would like to know what each of its
100 billion neurons is doing at any given time, analyzed in terms of their collective
patterns of activity.11 No current technology comes close to that kind of resolution. For example, although fMRI has a voxel size of a few cubic millimeters, it is
looking at the blood flow responding to thousands or millions of neurons at each
point in the brain. Conversely, while direct electrical recordings allow individual
neurons to be examined, and manipulated, it is not easy to record from many
neurons at once. While still on a relatively small scale, recent developments now
offer one method for recording from multiple neurons simultaneously by using
an implanted microelectrode array.
A chip containing many tiny electrodes can be implanted directly into brain
tissue. Some of those electrodes will make useable connections with neurons and
can then be used either to record the activity of that neuron (when it is firing or
10. See Clement Hamani et al., Memory Enhancement Induced by Hypothalamic/Fornix Deep Brain
Stimulation, 63 Annals Neurol. 119 (2008).
11. See the discussion in Emily R. Murphy & Henry T. Greely, What Will Be the Limits of
Neuroscience-Based Mindreading in the Law? in The Oxford Handbook of Neuroethics (J. Illes & B.J.
Sahakian eds., 2011).
not) or to stimulate the neuron to fire. These kinds of implants have been used in
research on motor function, both in monkeys and in occasional human patients.
The research has aimed at understanding better what neuronal activity leads to
motion and hence, in the long run, perhaps to a method of treating quadriplegia
or other motion disorders.
These arrays have several disadvantages as research tools. Arrays require
neurosurgery for their implantation, with all of its consequent risks of infection or
damage. They also have a limited lifespan, because the brain’s defenses eventually
degrade the electrical connection between the electrode and the neuron, usually
over the span of a few months. Finally, the arrays can only reach a tiny number of
the billions of neurons in the brain; current arrays have about 100 microelectrodes.
IV. Issues in Interpreting Study Results
Lawyers trying to introduce neuroscience evidence will almost always be arguing
that, when interpreted in the light of some preexisting research study, some kind
of neuroscience-based test of the brain of a person in the case—usually a party,
though sometimes a witness—is relevant to the case. It might be a claim that a
PET scan shows that a criminal defendant was likely to have been legally insane at
the time of the crime; it could be a claim that an fMRI of a witness demonstrates
that she is lying. The judge will have to determine whether the scientific evidence
is admissible at all under the Federal Rules of Evidence, and particularly under
Rule 702. If the evidence is admissible, the finder of fact will need to consider
the validity and strength of the underlying scientific finding, the accuracy of the
particular test performed on the party or witness, and the application of the former
to the latter.
Neuroscience-based evidence will commonly raise several scientific issues
relevant to both the initial admissibility decision and the eventual determination
of the weight to be given the evidence. This section of the reference guide examines seven of these issues: replication, experimental design, group averages, subject
selection and number, technical accuracy, statistical issues, and countermeasures.
The discussion focuses on fMRI-based evidence, because that seems likely to be
the method that will be used most frequently in the coming years, but most of the
seven issues apply more broadly.
One general point is absolutely crucial. The various techniques discussed in Section III, supra, are generally accepted scientific procedures, both for use in research
and, in most cases, in clinical care. Each one is a good scientific tool in general. The
crucial issue is not likely to be whether the techniques meet the requirements for
admissibility when used for some purposes, but whether the techniques—when used
for the purpose for which they are offered—meet those requirements. Sometimes proponents of fMRI-based lie detection, for example, have argued that the technique
should be accepted because fMRI is the subject of more than 12,000 peer-reviewed
publications. That is true, but irrelevant—the question is the application of fMRI to
lie detection, which is the subject of far fewer, and much less definitive, publications.
A. Replication
A good general rule of thumb in science is never to rely on any experimental
finding until it has been independently replicated. This may be particularly true
with fMRI experiments, not because of fraud or negligence on the part of the
experimenters, but because, for reasons discussed below, these experiments are
very complicated. Replication builds confidence that those complications have
not led to false results.
In many scientific fields, including much of fMRI research, replication is
sometimes not as common as it should be. A scientist often is not rewarded for
replicating (or failing to replicate) another’s work. Grants, tenure, and awards tend
to go to people doing original research. The rise of fMRI has meant that such
original experiments are easy to conceive and to attempt—anyone with experimental expertise, access to research subjects (often undergraduates), and access
to an MRI scanner (found at any major medical facility) can try his or her own
experiments and, if the study design and logic are sound and the results are statistically significant, may well end up with published results. Experiments replicating,
or failing to replicate, another’s work are neither as exciting nor as publishable.
For example, as discussed in more detail below, more than 15 different laboratories have collectively published 20 to 30 peer-reviewed articles finding some
statistically significant relationship between fMRI-measured brain activity and
deception. None of the studies is an independent replication of another laboratory’s work. Each laboratory used its own experimental design, its own scanner,
and its own method of analysis. Interestingly, the published results implicate
many different areas of the brain as being activated when a subject lies. A few
of the brain regions are found to be important in most of the studies, but many of
the other brain regions showing a correlation with deception differ from publication to publication. Only a few of the laboratories have published replications of
their own work; some of those laboratories have actually published findings with
different results from those in their earlier publications.
That a finding has been replicated does not mean it is correct; different
laboratories can make the same mistakes. Neither does failure of replication mean
that a result is wrong. Nonetheless, the existence of independent replication is
important support for a finding.
B. Problems in Experimental Design
The most important part of an fMRI experiment is not the MRI scanner, but
the design of the underlying experiment being examined in the scanner. A poorly
designed experiment may yield no useful information, and even a well-designed
experiment may lead to information of uncertain relevance.
A well-designed experiment must focus on the particular mental state or brain
process of interest while minimizing any systematic biases. This can be especially
difficult with fMRI studies. After all, these studies are measuring blood flow in
the brain associated with neuronal responses in particular regions. If, for example,
in an experiment trying to assess how the brain reacts to pain, the experimental
subjects are consistently distracted at one point in the experiment by thinking
about something else, the areas of brain activation will include the areas activated
by the distraction. One of the earliest published lie detection experiments was
designed so that the experimental subjects pushed a button for “yes” only when
saying (honestly) that they held the card displayed; they pushed the “no” button
both when they did not hold the card displayed and when they did hold it but
were following instructions to lie. They were to say “yes” only 24 times out of
432 trials.12 The resulting differences might have come from the differences in
thinking about telling the truth or telling a lie—but they also may have come
from the differences in thinking about pressing the “no” button (the most common action) and pressing the “yes” button (the less frequent response). The results
themselves cannot distinguish between the two explanations.
Designing good experiments is difficult, but in some respects the better the
experiment, the less relevant it may prove to a real situation. A laboratory experiment attempts to minimize distractions and differences among subjects, but such
factors will be common in real-world settings. Perhaps more important, for some
kinds of experiments it will be difficult, if not impossible, to reproduce in the
laboratory the conditions of interest in the real world. As an extreme example, if
one is interested in how a murderer’s brain functions during a murder, one cannot conduct an experiment that involves having the subject commit a murder in
the scanner. For ethical reasons, that condition of interest cannot be tested in the
experiment.
The problem of trying to detect deception provides a different example. All
published laboratory-based experiments involve people who know that they are
taking part in a research project. Most of them are students and are being paid
to participate in the project. They have received detailed information about the
experiment and have signed a consent form. Typically, they are instructed to “lie”
about a particular matter. Sometimes they are told what the lie should be (to deny
that they see a particular playing card, such as the seven of clubs, on a screen in
the scanner); sometimes they are told to make up a lie (about their most recent
12. Daniel D. Langleben et al., Telling Truth from Lie in Individual Subjects with Fast Event-Related
fMRI, 26 Human Brain Mapping 262 (2005). See discussion in Nancy Kanwisher, The Use of fMRI in
Lie Detection: What Has Been Shown and What Has Not, in Emilio Bizzi et al., Using Imaging to Identify
Deceit: Scientific and Ethical Questions (2009), at 10, and in Anthony Wagner, Can Neuroscience
Identify Lies? in A Judge’s Guide to Neuroscience, supra note 1, at 30.
vacation, for example). In either case, they are following instructions—doing what
they should be doing—when they tell the “lie.”
This situation is different from the realistic use of lie detection, when a guilty
person needs to tell a convincing story to avoid a high-stakes outcome such as
arrest or conviction—and even an innocent person will be genuinely nervous
about the possibility of an incorrect finding of deception. In an attempt to parallel
these real-world characteristics, some laboratory-based studies have tried to give
subjects some incentive to lie successfully; for example, the subjects may be told
(falsely) that they will be paid more if they “fool” the experimenters. Although
this may increase the perceived stakes, it seems unlikely that it creates a realistic
level of stress. These differences between the laboratory and the real world do
not mean that the experimental results of laboratory studies are unquestionably
different from the results that would exist in a real-world situation, but they do
raise serious questions about the extent to which the experimental data bear on
detecting lies in the real world.
Few judges will be expert in the difficult task of designing valid experiments.
Although judges may be able themselves to identify weaknesses in experimental
design, more often they will need experts to address these questions. Judges will
need to pay close attention to that expert testimony and the related argument, as
“details” of experimental design may turn out to be absolutely crucial to the value
of the experimental results.
C. The Number and Diversity of Subjects
Doing fMRI scans is expensive. The total cost of performing an hour-long
research scan of a subject ranges from about $300 to $1000. Much fMRI research,
particularly work without substantial medical implications, is not richly funded. As
a result, studies tend to use only a small number of subjects—many fMRI studies
use 10 to 20 subjects, and some use even fewer. In the lie detection literature, for
example, the number of subjects used ranges from 4 to about 30.
It is unclear how representative such a small group would be of the general
population. This is particularly true of the many studies that use university students
as research subjects. Students typically are from a restricted age range, are likely
to be of above-average intelligence and socioeconomic background, may not
accurately reflect the country’s ethnic diversity, and typically will underrepresent
people with serious mental conditions. To limit possible confounding variables,
it can make sense for a study design to select, for example, only healthy, right-handed, native-English-speaking male undergraduates who are not using drugs.
But the very process of selecting such a restricted group raises questions about
whether the findings will be relevant to other groups of people. They may be
directly relevant, or they may not be. At the early stages of any fMRI research,
it may not be clear what kinds of differences among subjects will or will not be
important.
D. Applying Group Averages to Individuals
Most fMRI-based research looks for statistically significant associations between
particular patterns of brain activation across a number of subjects. It is highly
unlikely that any fMRI pattern will be found always to occur under certain
circumstances in every person tested, or even that it will always occur under
those circumstances in any one person. Human brains and their responses are too
complicated for that. Research is highly unlikely to show that brain pattern “A”
follows stimulus “B” each and every time and in every single person, although it
may show that A follows B most of the time.
Consider an experiment with 10 subjects that examines how brain activation
varies with the sensation of pain. A typical approach to analyzing the data is to
take the average brain activation patterns of all 10 subjects combined, looking for
the regions that, across the group, have the greatest changes—the most statistically
significant changes—when the painful stimulus is applied compared with when
it is absent. Importantly, though, the most significant region showing increased
activation on average may not be the region with the greatest increase in activation in any particular one of the 10 subjects. It may not even be the area with
the greatest activation in any of the 10 subjects, but it may be the region that was
most consistently active across the brains of the 10 subjects, even if the response
was small in each person.
Although group averages are appropriate for many scientific questions, the
problem is that the law, for the most part, is not concerned with “average” people,
but with individuals. If these “averaged” brains show a particular pattern of brain
activation in fMRI studies and a defendant’s brain does not, what, if anything,
does that mean?
It may or may not mean anything—or, more accurately, the chances that it is
meaningful will vary. The findings will need to be converted into an assessment
of an individual’s likelihood of having a particular pattern of brain activation in
response to a stimulus, and that likelihood can be measured in various ways.
Consider the following simplified example. Assume that 1000 people have
been tested to see how their brains respond to a particular painful stimulus. Each
is scanned twice, once when touched by a painfully hot metal rod and once
when the rod is room temperature. Assume that all of them feel pain from the
heated rod and that no one feels pain from the room temperature rod. And,
finally, assume that 900 of the 1000 show a particular pattern of brain activation when touched with the hot rod, but only 50 of the 1000 show the same
pattern when touched with the room temperature rod.
For these 1000 people, using the fMRI activation pattern as a test for the
perception of this pain would have a sensitivity of 90% (90% of the 1000 who felt
the pain would be correctly identified and only 10% would be false negatives).
The test would have a specificity
of 95% (95% of those who did not feel pain were correctly identified and only
5% were false positives). Now ask, of all those who showed a positive test result,
how many were actually positive? This percentage, the positive predictive value,
would be 94.7%—900 out of 950. Depending on the planned use of the test, one
might care more about one of these measures than another and there are often
tradeoffs between them. Making a test more sensitive (so that it misses fewer
people with the sought characteristic) often means making it less specific (so that
it picks up more people who do not have the characteristic in question). In any
event, when more people are tested, these estimates of sensitivity, specificity, and
positive predictive value become more accurate.
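The arithmetic behind these figures can be laid out explicitly. The sketch below (in Python) uses the counts from the hypothetical in the text; the variable names are ours, chosen for illustration:

```python
# Counts from the hypothetical: 1,000 subjects scanned under each condition.
true_positives = 900   # showed the pattern when touched with the hot rod
false_negatives = 100  # felt pain but did not show the pattern
false_positives = 50   # showed the pattern with the room-temperature rod
true_negatives = 950   # did not feel pain and did not show the pattern

sensitivity = true_positives / (true_positives + false_negatives)
specificity = true_negatives / (true_negatives + false_positives)
# Positive predictive value: of all positive test results, how many
# came from subjects who actually felt pain?
ppv = true_positives / (true_positives + false_positives)

print(f"sensitivity: {sensitivity:.1%}")                # 90.0%
print(f"specificity: {specificity:.1%}")                # 95.0%
print(f"positive predictive value: {ppv:.1%}")          # 94.7%
```

Note how the three measures answer different questions about the same four counts, which is why a test can look strong on one measure and weaker on another.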
There are other ways of measuring the accuracy of a test of an individual, but
the important point is that some such conversion is essential. A research paper that
reveals that the average subject’s brain (more accurately, the “averaged subjects’
brain”) showed a particular reaction to a stimulus does not, in itself, say anything
useful about how likely any one person is to have the same reaction to that stimulus. Further analyses are required to provide that information. Researchers, who
are often more interested in identifying possible mechanisms of brain action than in
creating diagnostic tests, will not necessarily have analyzed their data in ways that
make them useful for application to individuals—or even have obtained enough
data for that to be possible. At least in the near future, this is likely to be a major
issue for applying fMRI studies to individuals, in the courtroom or elsewhere.
E. Technical Accuracy and Robustness of Imaging Results
MRI machines are variable, complicated, and finicky. The machines come in
several different sizes, based on the strength of the magnet, with machines used
for clinical purposes ranging from 0.2 T to 3.0 T and research scanners going as
high as 9.4 T. Three companies dominate the market for MRI machines—General
Electric, Siemens, and Philips—although several other companies also make the
machines. Both the power and the manufacturer of an MRI system can make a
substantial difference in the resulting data (and images). These variations can be
more important with functional MRI (though they also apply to structural MRI)
so that a result seen on a 1.5-T Siemens scanner might not appear on a 3.0-T
General Electric machine. Similarly, results from one 3.0-T General Electric
machine may be different from those on an identical model.
Even the exact same MRI machine may behave differently from day to day
or month to month. The machines frequently need maintenance or adjustments
and sometimes can be inoperable for days or even weeks at a time. Comparing
results from even the same machine before and after maintenance—or a system
upgrade—can be difficult. This can make it hard to compare results across different studies or between the group average of one study and results from an
individual subject.
These issues concern not only the quality of the scans done in research, but, even more importantly, the credibility of the individual scan sought to be introduced at trial. If different machines were used, care must be taken to ensure that
the results are comparable. The individual scans also can have other problems. Any
one scan is subject not only to machine-derived artifacts and other problems noted
above, but also to human-generated artifacts, such as those caused by the subject’s
movements during the scan.
Finally, another technical problem of a different kind comes from the nature
of fMRI research itself. The scanner will record changes in the relative levels of
oxyhemoglobin to deoxyhemoglobin for thousands of voxels throughout the
brain. During data analysis, these signal changes will be tested to see if they show
any change in the response between the experimental condition and the baseline
or control condition. Importantly, with fMRI, there is no definitive way to quantify precisely how large a change there was in the neural response compared to
baseline; hence, the researcher must set a somewhat arbitrary statistical cutoff value
(a threshold) for saying that a voxel was activated or deactivated. A researcher who
wants only to look at strong effects will require a large change from baseline; a
researcher who wants to see a wide range of possible effects will allow smaller
changes from baseline to count.
Neither way is “right”—we do not know whether there is some minimum
change in the BOLD response that means an “important” amount of brain activation has taken place, and if such a true value exists, it is likely to differ across
brain regions, across tasks, and across experimental contexts. What this means is
that different choices of statistical cutoff values can produce enormous differences
in the apparent results. And, of course, the cutoff values used in the studies and
in the scan of the individual of interest must be consistent across repeated tests.
This important fact often may not be known.
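How much the choice of cutoff matters can be seen in a toy simulation (Python). The voxel counts, the effect size, and both thresholds below are invented for illustration and do not come from any fMRI study:

```python
import random

random.seed(0)

# Simulate z-scores for 10,000 voxels: 9,500 with no true effect
# (centered on 0) and 500 with a modest true effect (centered on 2).
null_voxels = [random.gauss(0.0, 1.0) for _ in range(9_500)]
active_voxels = [random.gauss(2.0, 1.0) for _ in range(500)]
z_scores = null_voxels + active_voxels

# Two researchers applying two arbitrary cutoffs to the same data.
strict_count = sum(z > 4.0 for z in z_scores)   # only very strong effects
lenient_count = sum(z > 2.0 for z in z_scores)  # a much wider net

print(strict_count, lenient_count)
```

The lenient cutoff sweeps in a substantial number of purely noisy voxels; the strict cutoff misses most of the genuinely responsive ones. Neither count is the "right" answer, which is the point made in the text.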
F. Statistical Issues
Interpreting fMRI results requires the application of complicated statistical
methods.13 These methods are particularly difficult, and sometimes controversial, for fMRI studies, partly because of the thousands of voxels being examined.
Fundamentally, most fMRI experiments look at many thousands of voxels and try
to determine whether any of them are, on average, activated or deactivated as a
result of the task or stimulus being studied. A simple test for statistical significance
asks whether a particular result might have arisen by chance more than 1 time in
20 (or 5%): Is it significant at the .05 level? If a researcher is looking at the results
for thousands of different voxels, it is likely that a number of voxels will show
an effect above the threshold just by chance. There are statistical ways to control
the rate of these false positives, but they need to be applied carefully. At the same
time, rigid control of false positives through statistical correction (or the use of
13. For a broad discussion of statistics, see David H. Kaye & David A. Freedman, Reference
Guide on Statistics, in this manual.
a very conservative threshold) can create another problem—an increase in the
false-negative rate, which results in failing to detect true brain responses that are
present in the data but that fall below the statistical threshold. The community of
fMRI researchers recognizes that these issues of statistical significance are difficult
to resolve.
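The scale of this multiple-comparisons problem is easy to make concrete. The sketch below (Python) uses a round whole-brain voxel count of 50,000 chosen for illustration, not taken from any particular study:

```python
n_voxels = 50_000   # a whole-brain analysis can test tens of thousands of voxels
alpha = 0.05        # the conventional per-test significance level

# If no voxel truly responded, each test would still cross the threshold
# about 5% of the time, so by chance alone we would expect:
expected_false_positives = n_voxels * alpha
print(expected_false_positives)   # 2500.0

# One classical remedy (a Bonferroni correction) divides the threshold
# by the number of tests; this controls false positives but, as noted
# in the text, makes false negatives more likely.
bonferroni_threshold = alpha / n_voxels
print(bonferroni_threshold)       # ≈ 1e-06
```

A threshold of roughly one in a million per voxel illustrates why rigid correction can bury true but modest brain responses below the statistical cutoff.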
Over the past decade, other statistical techniques have increasingly been used
in neuroimaging research, including techniques that do not look at the statistical
significance of changes in the BOLD response in individual voxels, but that instead
examine changes in the distributed patterns of activation across many voxels in a
region of the brain or across the whole brain. These techniques include methods
known as principal component analysis, multivariate analysis, and related machine
learning algorithms. These methods, the details of which are not reviewed in this
chapter, are producing
some of the most interesting results in the field. The techniques are fairly complex,
and determining how to interpret the results of these tests can be controversial.
Thus, these methods alone may require substantial and potentially confusing
expert testimony in addition to all the other expert testimony about the underlying neuroscience evidence.
G. Possible Countermeasures
When neuroimaging is being used to compare the brain of one individual—a
defendant, plaintiff, or witness, for example—to others, the individual undergoing
neuroimaging might be able to use countermeasures to make the results unusable
or misleading. And at least some of those countermeasures may prove especially
hard to detect.
Subjects can disrupt almost any kind of scanning, whether done for structural
or functional purposes, by moving in the scanner. Unwilling subjects could ruin
scans by moving their bodies, heads, or, possibly, even by moving their tongues.
Blatant movements to disrupt the scan would be apparent, both from watching the
subject in the scanner and from seeing the results, leading to a possible negative
inference that the person was trying to interfere with the scan. Nonetheless, that
scan itself would be useless.
More interesting are possible countermeasures for functional scans. Polygraphy
may provide a useful comparison. Countermeasures have long been tried in
polygraphy with some evidence of efficacy. Polygraphy typically looks at the differences in physiological measurements of the subject when asked anxiety-provoking
questions or benign control questions. Subjects can use drugs or alcohol to try to
dampen their body reactions when asked anxiety-provoking questions. They can
try to use mental measures to control or affect their physiological reactions, calming themselves during anxiety-provoking questions and increasing their emotional
reaction to control questions. And, when asked control questions, they can try to
increase the physiological signs the polygraph measures through physical means.
For example, subjects might bite their tongues, step on tacks hidden in their shoes,
or tighten various muscles to try to increase their blood pressure, galvanic skin
response, and so on. The National Academy of Sciences report on polygraphs
concluded that
Basic science and polygraph research give reason for concern that polygraph test
accuracy may be degraded by countermeasures, particularly when used by major
security threats who have a strong incentive and sufficient resources to use them
effectively. If these measures are effective, they could seriously undermine any
value of polygraph security screening.14
Some of the countermeasures used by polygraph subjects can be detected by, for
example, drug or alcohol tests or by carefully watching the subject’s body. But
purely mental actions cannot be detected. These kinds of countermeasures may be
especially useful to subjects seeking to beat neuroscience-based lie detection. For
example, some argue that deception produces different activation patterns than
telling the truth because it is mentally harder to tell a lie—more of the brain needs
to work to decide whether to lie and what lie to tell. If so, two mental countermeasures immediately suggest themselves: make the lie easier to tell (through,
perhaps, memorization or practice) or make the brain work harder when telling
the truth (through, perhaps, counting backward from 100 by sevens).
Countermeasures are not, of course, potentially useful only in the context of
lie detection. A neuroimaging test to determine whether a person was having the
subjective feeling of pain might be fooled by the subject remembering, in great
detail, past experiences of pain. The possible uses of countermeasures in neuroimaging have yet to be extensively explored, but at this point they cast additional
doubt on the reliability of neuroimaging in investigations or in litigation.
V. Questions About the Admissibility and
the Creation of Neuroscience Evidence
The admissibility of neuroscience evidence will depend on many issues, some of
them arising from the rules of evidence, some from the U.S. Constitution, and
some from other legal provisions. Another often-overlooked reality is that judges
may have to decide whether to order this kind of evidence to be created. Certainly, judges may be called upon to rule on requests by criminal defendants (or
convicts seeking postconviction relief) to be able to use neuroimaging. They may
also have to decide motions in civil or criminal cases to compel neuroimaging.
One could even imagine requests for warrants to “search the brains” of possible
14. See National Research Council, The Polygraph and Lie Detection 5 (2003). This report
is an invaluable resource for discussions of not just the scientific evidence about the reliability of the
polygraph, but also for general background about the application of science to lie detection.
witnesses for evidence. This guide does not seek to resolve any of these questions,
but points out some of the problems that are likely to be raised about admitting
neuroscience evidence in court.
A. Evidentiary Rules
This discussion looks at the main evidentiary issues that are likely to be raised in
cases involving neuroscience evidence. Note, though, that judges will not always be
governed by the rules of evidence. In criminal sentencing or in probation hearings,
among other proceedings, the Federal Rules of Evidence do not apply,15 and they apply
with limitations in other contexts.16 Nonetheless, even in those circumstances,
many of the principles behind the Rules, discussed below, will be important.
1. Relevance
The starting point for all evidentiary questions must be relevance. If evidence is
not relevant to the questions at hand, no other evidentiary concerns matter. This
basic reminder may be particularly useful with respect to neuroscience evidence.
Evidence admitted, for example, to demonstrate that a criminal defendant had
suffered brain damage sometime before the alleged crime is not, in itself, relevant. The proffered fact of the defendant’s brain damage must be relevant. It
may be relevant, for example, to whether the defendant could have formed the
necessary criminal intent, to whether the defendant should be found not guilty
by reason of insanity, to whether the defendant is currently competent to stand
trial, or to mitigation in sentencing. It must, however, be relevant to something
in order to be admissible at all, and specifying its relevance will help focus the
evidentiary inquiry. The question, for example, would not be whether PET scans
meet the evidentiary requirements to be admitted to demonstrate brain damage,
but whether they have “any tendency to make the existence of any fact that is of
consequence to the determination of the action more probable or less probable
than it would be without the evidence.”17 The brain damage may be relevant to
a fact, but that fact must be “of consequence to the determination of the action.”
2. Rule 702 and the admissibility of scientific evidence
Neuroscience evidence will almost always be “scientific . . . knowledge” governed
by Rule 702 of the Federal Rules of Evidence, as interpreted in Daubert v. Merrell
Dow Pharmaceuticals18 and its progeny, both before and after the amendments
to Rule 702 in 2000. Rule 702 allows the testimony of a qualified expert “if
15. Fed. R. Evid. 1101(d).
16. Fed. R. Evid. 1101(e).
17. Fed. R. Evid. 401.
18. Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579 (1993).
(1) the testimony is based upon sufficient facts or data, (2) the testimony is the product of reliable principles and methods, and (3) the witness has
applied the principles and methods reliably to the facts of the case.” In Daubert,
the Supreme Court listed several nonexclusive guidelines for trial court judges
considering testimony under Rule 702. The Committee that proposed the 2000
Amendments to Rule 702 summarized these factors as follows:
The specific factors explicated by the Daubert Court are (1) whether the expert’s
technique or theory can be or has been tested—that is, whether the expert’s theory
can be challenged in some objective sense, or whether it is instead simply a subjective, conclusory approach that cannot reasonably be assessed for reliability;
(2) whether the technique or theory has been subject to peer review and publication; (3) the known or potential rate of error of the technique or theory
when applied; (4) the existence and maintenance of standards and controls; and
(5) whether the technique or theory has been generally accepted in the scientific
community.19
The tests laid out in Daubert and in the evidentiary rules governing expert
testimony have been the subjects of enormous discussion, both by commentators
and by courts. And, to the extent some neuroscience evidence has been admitted in federal courts (and the courts of states that follow Rule 702 or Daubert), it
has passed those tests. We do not attempt to analyze them in detail here, but
merely point out a few aspects that seem especially relevant to
neuroscience evidence.
Neuroscience evidence should often be subject to tests, as long as the point
of the neuroscience evidence is kept in mind. An fMRI scan might provide evidence that someone was having auditory hallucinations, but it could not prove
that someone was not guilty by reason of insanity. The latter is a legal conclusion, not a scientific finding. The evidence might be relevant to the question of
insanity, but one cannot plausibly conduct a scientific test of whether a particular
pattern of brain activation is always associated with legal insanity. One might offer
neuroimaging evidence about whether a person is likely to have unusual difficulty
controlling his or her impulses, but that is not, in itself, proof that the person acted
recklessly. The idea of testing helps separate the conclusions that neuroscience
might be able to reach from the legal conclusions that will be beyond it.
Daubert’s stress on the presence of peer review and publication corresponds
nicely to scientists’ perceptions. If something is not published in a peer-reviewed
journal, it scarcely counts. Scientists only begin to have confidence in findings
after peers, both those involved in the editorial process and, more important, those
who read the publication, have had a chance to dissect them and to search intensively for errors either in theory or in practice. It is crucial, however, to recognize
that publication and peer review are not in themselves enough. The publications
need to be compared carefully to the evidence that is proffered.
19. Fed. R. Evid. 702 advisory committee’s note.
First, the published, peer-reviewed articles must establish the specific scientific
fact being offered. An (accurate) assertion that fMRI has been the basis of more
than 12,000 peer-reviewed publications will help establish that fMRI can be used
in ways that the scientific community finds reliable. By themselves, however,
those publications do not establish any particular use of fMRI. If fMRI is being
offered as proof of deception, the 20 or 30 peer-reviewed articles concerning its
ability to detect deception are most important, not the 11,980 articles involving
fMRI for other purposes.
Second, the existence of several peer-reviewed publications on the same
general method does not support the accuracy of any one approach if those publications are mutually inconsistent. There are now about 20 to 30 peer-reviewed
publications that, using fMRI, find statistically significant differences in patterns
of brain activation depending on whether the subjects were telling the truth or
(typically) telling a lie when instructed to do so. Many of those publications find
patterns that are different from, and often inconsistent with, the patterns described
in the other publications. Multiple inconsistent publications do not add weight to,
and may indeed subtract it from, a scientific method or theory.
Third, the peer-reviewed publication needs to describe in detail the method
about which the expert plans to testify. A commercial firm might, for example,
claim that its method is “based on” some peer-reviewed publications, but unless
the details of the firm’s methods were included in the publication, those details
were neither published nor peer reviewed. A proprietary algorithm used to generate a finding published in the peer-reviewed literature is not adequately supported
by that literature.
The error rate is also crucial to most neuroscience evidence, in two different
senses. One is the degree to which the machines used to produce the evidence
make errors. Although these kinds of errors may balance out in a large sample
used in published literature, any scan of any one individual may well be affected
by errors in the scanning process. Second, and more important, neuroscience
evidence will almost never give an absolute answer, but will give a probabilistic
one. For example, a certain brain structure or activation pattern will be found
in some percentage of people with a particular mental condition or state. These
group averages will have error rates when they are applied to individuals. Those
rates need to be known and presented.
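The gap between a group-level error rate and a conclusion about one individual can be made concrete with Bayes' rule. The sketch below uses purely hypothetical numbers for a marker's sensitivity, specificity, and base rate; it is an illustration of the general point, not of any actual neuroimaging technique:

```python
def positive_predictive_value(sensitivity, specificity, base_rate):
    """P(condition is present | positive finding), via Bayes' rule."""
    true_positives = sensitivity * base_rate
    false_positives = (1 - specificity) * (1 - base_rate)
    return true_positives / (true_positives + false_positives)

# Hypothetical numbers: a brain marker found in 80% of people with a
# condition, absent in 85% of people without it, where the condition
# occurs in 5% of the relevant population.
ppv = positive_predictive_value(sensitivity=0.80, specificity=0.85, base_rate=0.05)
print(f"Chance the condition is actually present: {ppv:.0%}")  # about 22%
```

Even with seemingly strong group-level accuracy, the positive finding in this hypothetical leaves a roughly four-in-five chance that the individual does not have the condition, which is why the error rates underlying a proffered scan need to be stated explicitly.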
The issue of standards and controls also is important in neuroscience. This
area is new and has not undergone the kind of standardization seen, for example,
in forensic DNA analysis. When trying to apply neuroscience findings to an
individual, evidence from the individual needs to have been acquired in the same
way, with the same standards and conditions, as the evidence from which the
scientific conclusions were drawn—or, at least, in ways that can be made readily
comparable. For example, there is no one standard in fMRI research for what
statistical threshold should be used for a change in the BOLD signal to “count”
as a meaningful activation or deactivation. An individual’s scan would need to
be analyzed under the same definition for activation as was used in the research
supporting the method, and the effects of the chosen threshold on finding a false
positive or false negative must be considered.
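How much the choice of statistical threshold matters can be shown with a back-of-the-envelope calculation. Assuming, hypothetically, a scan analyzed as 50,000 independent voxel tests with standard-normal statistics under the null, the number of voxels expected to "activate" by chance alone changes sharply with the chosen z threshold:

```python
import math

def chance_activations(z_threshold, n_voxels):
    """Expected number of voxels exceeding the threshold by chance alone,
    assuming independent standard-normal test statistics per voxel."""
    tail_prob = 0.5 * math.erfc(z_threshold / math.sqrt(2))  # one-sided P(Z > z)
    return n_voxels * tail_prob

# Hypothetical: 50,000 voxels, two illustrative z thresholds.
for z in (2.3, 3.1):
    print(f"z > {z}: about {chance_activations(z, 50_000):.0f} chance activations")
```

Under these assumptions the looser threshold admits roughly five hundred chance activations and the stricter one roughly fifty, which is one reason an individual's scan must be analyzed under the same threshold, and the same correction procedures, as the research supporting the method.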
The final consideration, general acceptance in the scientific community, also
needs to be applied carefully. There is clearly general acceptance in the scientific
community that fMRI can provide scientifically and sometimes clinically useful
information about the workings of human brains, but that does not mean there
is general acceptance of any particular fMRI application. Similarly, there may be
general acceptance that fMRI can provide some general information about the
physical correlates of a particular mental state, but without general acceptance that
it can do so reliably in an individual case.
3. Rule 403
Rule 702 is not the only test that neuroscience evidence will need to pass to be
admitted in court. Even evidence admissible under that rule must still escape the
exclusion provided by Rule 403:
Although relevant, evidence may be excluded if its probative value is substantially
outweighed by the danger of unfair prejudice, confusion of the issues, or misleading the jury, or by considerations of undue delay, waste of time, or needless
presentation of cumulative evidence.
As discussed in detail in a recent article,20 Rule 403 may be particularly
important with some attempted applications of neuroscience evidence because of
the balance it requires between the value of evidence to the decisionmaker and
its costs.
The probative value of such evidence may often be questioned. Neuroscience
evidence will rarely, if ever, be definitive. It is likely to have a range of uncertainties, from the effectiveness of the method in general, to questions of its proper
application in this case, to whether any given individual’s reactions are the same
as those previously tested.
The other side of Rule 403, however, is even more troublesome. The time
necessary to introduce such evidence, and to educate the jury (and judge) about
it, will usually be extensive. The possibilities for confusion are likely to be great.
And there is at least some evidence that jurors (or, to be precise, “mock jurors”)
are particularly likely to overestimate the power of neuroscience evidence.21
20. Teneille Brown & Emily Murphy, Through a Scanner Darkly: Functional Neuroimaging as
Evidence of a Criminal Defendant’s Past Mental States, 62 Stan. L. Rev. 1119 (2010).
21. See Deena Skolnick Weisberg et al., The Seductive Allure of Neuroscience Explanations, 20 J. Cog.
Neurosci. 470 (2008); David P. McCabe & Alan D. Castel, Seeing Is Believing: The Effect of Brain Images
on Judgments of Scientific Reasoning, 107 Cognition 343 (2008). These articles are discussed in Brown &
Murphy, supra note 20, at 1199–1202. But see N.J. Schweitzer et al., Neuroimages as Evidence in a Mens
Rea Defense: No Impact, Psychol. Pub. Pol’y & L. (in press) (presenting experimental results that seem
to indicate that showing neuroimages to mock jurors does not affect their decisions).
A high-tech “picture” of a living brain, complete with brain regions shown in
bright orange and deep purple (colors not seen in an actual brain), may have an
unjustified appeal to a jury. In each case, judges will need to weigh possibilities of
confusion or prejudice, along with the near certainty of lengthy testimony, against
the claimed probative value of the evidence.
4. Other potentially relevant evidentiary issues
Neuroscience evidence will, of course, be subject in individual cases to all evidentiary rules, from the Federal Rules of Evidence or otherwise, and could be
affected by many of them. Four examples follow in which the application of
such rules to this kind of evidence may raise interesting issues; there are undoubtedly many others.
First, in June 2009 the U.S. Supreme Court decided Melendez-Diaz v.
Massachusetts,22 where the five-justice majority held that the Confrontation Clause
required the prosecution to present the testimony at trial of state laboratory analysts who had identified a substance as cocaine. This would seem to apply to any
use by the prosecution in criminal cases of neuroscience evidence about a scanned
defendant or witness, although it is not clear who would have to testify. Would
testimony be required from the person who observed the procedure, the person
who analyzed the results of the procedure, or both? If the results were analyzed by
a computerized algorithm, would the individual (or group) that wrote that algorithm have to testify? These questions, and others, are not unique to neuroscience
evidence, of course, but will have to be sorted out generally after Melendez-Diaz.
Second, the Federal Rules of Evidence put special limits on the admissibility
of evidence of character and, in some cases, of predisposition.23 In some cases,
neuroscience evidence offered for the purpose of establishing a regular behavior
of the person might be viewed as evidence of character24 or predisposition (or
22. 129 S. Ct. 2527 (2009).
23. Fed. R. Evid. 404, 405, 412–415, 608.
24. Evidence about lie detection has sometimes been viewed as “character evidence,” introduced
to bolster a witness’s credibility. The Canadian Supreme Court has held that polygraph evidence is
inadmissible in part because it violates the rule limiting character evidence.
“What is the consequence of this rule in relation to polygraph evidence? Where such evidence is sought
to be introduced, it is the operator who would be called as the witness, and it is clear, of course, that the
purpose of his evidence would be to bolster the credibility of the accused and, in effect, to show him
to be of good character by inviting the inference that he did not lie during the test. In other words, it
is evidence not of general reputation but of a specific incident, and its admission would be precluded
under the rule. It would follow, then, that the introduction of evidence of the polygraph test would
violate the character evidence rule.” R. v. Béland, 60 C.R. (3d) 1, ¶¶ 71–72 (1987).
The Canadian court also held that polygraph evidence violated another rule concerning character
evidence, the rule against “oath-helping.”
“From the foregoing comments, it will be seen that the rule against oath-helping, that is, adducing
evidence solely for the purpose of bolstering a witness’s credibility, is well grounded in authority. It
lack of predisposition). Whether such evidence could be admitted might hinge on
whether it was offered in a civil case or a criminal case, and, if in a criminal case,
by the prosecution or the defendant.
Third, Federal Rule of Evidence 406 allows the admission of evidence about
a habit or routine practice to prove that the relevant person’s actions conformed to
that habit or routine practice. It is conceivable that neuroscience evidence might
be used to describe “habits of mind” and thus be offered under this rule.
The fourth example applies to neuroscience-based lie detection. Although
New Mexico is the only U.S. jurisdiction that generally allows the introduction of
polygraph evidence,25 several jurisdictions allow polygraph evidence in two specific situations. First, polygraph evidence is sometimes allowed when both parties
have stipulated to its admission in advance of the performance of the test. (This
does lead one to wonder whether a court would allow evidence from a psychic or
from a fortune-telling toy, like the Magic Eight Ball, if both parties stipulated to
it.) Second, polygraph evidence is sometimes allowed to impeach or to corroborate a witness’s testimony.26 If a neuroscience-based lie detection technique were
found to be as reliable as the polygraph, presumably those jurisdictions would have
to consider whether to extend these exceptions to such neuroscience evidence.
B. Constitutional and Other Substantive Rules
In many contexts, courts will be asked to admit neuroscience evidence or to
order, allow, or punish its creation. Such actions may implicate a surprisingly large
number of constitutional rights, as well as other substantive legal provisions. Most
of these would be rights against the creation or use of neuroscience evidence,
although some would be possible rights to its use. And one constitutional provision, the Fourth Amendment, might cut both ways. Again, this section will not
seek to discuss all possible such claims or to resolve any of them, but only to raise
some of the most interesting issues.
1. Possible rights against neuroscience evidence
a. The Fifth Amendment privilege against self-incrimination
Could a person be forced to “give evidence” through a neuroscience technology, or would that violate his or her privilege against self-incrimination? This has
is apparent that, since the evidence of the polygraph examination has no other purpose, its admission
would offend the well-established rule.” R. v. Béland, 60 C.R. (3d) 1, ¶ 67(1) (The court also ruled
against polygraph evidence as violating the rule against prior consistent statements and, because the jury
needs no help in assessing credibility, the rule on the use of expert witnesses).
25. Lee v. Martinez, 96 P.3d 291 (N.M. 2004).
26. See, e.g., United States v. Piccinonna, 885 F.2d 1529 (11th Cir. 1989) (en banc). See also
United States v. Allard, 464 F.3d. 529 (5th Cir. 2006); Thornburg v. Mullin, 422 F.3d 1113 (10th
Cir. 2005).
already begun to be discussed by legal scholars in the context of lie detection.27
One issue is whether the neuroscience evidence is “testimonial evidence.” If it
were held to be “testimonial” it would be subject to the privilege, but if it were
nontestimonial, it would, under current law, not be. Examples of nontestimonial
evidence for purposes of the privilege against self-incrimination include incriminating information from a person’s private diaries, a blood alcohol test, or medical
X rays. An fMRI scan is nothing more than a computer record of radio waves
emitted by molecules in the brain. It does not seem like “testimony.” On the
other hand, fMRI-based lie detection currently involves asking the subject questions to which he or she gives answers, either orally, by pressing buttons, or by
some other form of communication. Perhaps those answers would make the
resulting evidence “testimonial.”
It is possible, however, that answers may not be necessary. Two EEG-based
systems claim to be able to determine whether a person either recognizes or has
“experiential knowledge” of an event (a memory derived from experience as
opposed to being told about it).28 Very substantial scientific questions exist about
each system, but, assuming they were to be admitted as reliable, they would raise
this question more starkly because they do not require the subject of the procedure to communicate. The subject is shown photographs of relevant locations or
read a description of the events while hooked up to an EEG. The brain waves,
27. The law review literature, by both faculty and students, discussing the Fifth Amendment and
neuroscience-based lie detection is already becoming voluminous. See, e.g., Nita Farahany, Incriminating
Thoughts, 64 Stan. L. Rev., Paper No. 11-17, available at SSRN: http://ssrn.com/abstract=1783101
(2011); Dov Fox, The Right to Silence as Protecting Mental Control, 42 Akron L. Rev. 763 (2009);
Matthew Baptiste Holloway, One Image, One Thousand Incriminating Words: Images of Brain Activity
and the Privilege Against Self-Incrimination, 27 Temp. J. Sci. Tech. & Envtl. L. 141 (2008); William
Federspiel, Neuroscience Evidence, Legal Culture, and Criminal Procedure, 16 Wm. & Mary Bill Rts. J. 865
(2008); Sarah E. Stoller & Paul Root Wolpe, Emerging Neurotechnologies for Lie Detection and the Fifth
Amendment, 33 Am. J.L. & Med. 359 (2007); Michael S. Pardo, Neuroscience Evidence, Legal Culture, and
Criminal Procedure, 33 Am. J. Crim. L. 301 (2006); and Erich Taylor, A New Wave of Police Interrogation?
“Brain Fingerprinting,” the Constitutional Privilege Against Self-Incrimination, and Hearsay Jurisprudence, U.
Ill. J.L. Tech. & Pol’y 287 (2006).
28. The first system is the so-called Brain Fingerprinting, developed by Dr. Larry Farwell.
This method was introduced successfully in evidence at the trial court level in a postconviction relief
case in Iowa; the use of the method in that case is discussed briefly in the Iowa Supreme Court’s
decision on appeal, Harrington v. Iowa, 659 N.W.2d 509, 516 n.6 (2003). (The Court expressed no
view on whether that evidence was properly admitted. See id. at 516.) The method is discussed on the
Web site of Farwell’s company, Brain Fingerprinting Laboratories, www.brainwavescience.com. It is
criticized from a scientific perspective in J. Peter Rosenfeld, “Brain Fingerprinting”: A Critical Analysis,
4 Sci. Rev. Mental Health Practice 20 (2005). See also the brief discussion in Henry T. Greely &
Judy Illes, Neuroscience-Based Lie Detection: The Urgent Need for Regulation, 33 Am. J.L. & Med. 377,
387–88 (2007).
The second system is called Brain Electrical Oscillation Signature (BEOS) and was developed in
India, where it has been introduced in trials and has been important in securing criminal convictions.
See Anand Giridharadas, India’s Novel Use of Brain Scans in Courts Is Debated, N.Y. Times, Sept. 15,
2008, at A10.
it is asserted, demonstrate whether the subject recognizes the photographs or has
“experiential knowledge” of the events—no volitional communication is necessary. It might be harder to classify these EEG records as “testimonial.”
b. Other possible general constitutional protections against compulsory
neuroscience procedures
Even if the privilege against self-incrimination applies to neuroscience methods
of obtaining evidence, it only applies where someone invokes the privilege. The
courts and other government bodies force people to answer questions all the time,
often under penalty of criminal or civil sanctions or of the court’s contempt power.
For example, a plaintiff in a civil case alleging damage to his health can be compelled to undergo medical testing at a defendant’s appropriate request. In that case,
the plaintiff can refuse, but only at the risk of seeing his case dismissed. Presumably,
a party could similarly demand that another party, or a witness, undergo a neuroimaging
examination, looking for either structural or functional aspects of the person’s brain
relevant to the case. If the privilege against self-incrimination is not available, or
is available but not attractive, could the person asked have any other protection?
The answer is not clear. One might try to argue, along the lines of Rochin
v. California,29 that such a procedure violates the Due Process Clause of the Fifth
and Fourteenth Amendments because it intrudes on the person in a manner that
“shocks the conscience.” Alternatively, one might argue that a “freedom of the
brain” is a part of the fundamental liberty or the right to privacy protected by
the Due Process Clause.30 Or one might try to use language in some U.S. Supreme
Court First Amendment cases that talk about “freedom of thought” to argue that
the First Amendment’s freedoms of religion, speech, and the press encompass a
broader protection of the contents of the mind. The Court never seems to have
decided a case on that point. The closest case might be Stanley v. Georgia,31 where
the Court held that Georgia could not criminalize a man’s private possession of
pornography for his own use. None of these arguments is, in itself, strongly supported, but each draws some appeal from a belief that we should be able to keep
our thoughts, and, by extension, the workings of our brain, to ourselves.
c. Other substantive rights against neuroscience evidence
At least one form of possible neuroscience evidence may already be covered by
statutory provisions limiting its creation and use—lie detection. In 1988, Congress
29. 342 U.S. 165 (1952).
30. See Paul Root Wolpe, Is My Mind Mine? Neuroethics and Brain Imaging, in The Penn Center
Guide to Bioethics (Arthur L. Caplan et al. eds., 2009).
31. 394 U.S. 557 (1969). In the context of finding that the First Amendment forbids criminalizing
mere possession of pornography, in the home, for an adult’s private use, the Court wrote “Our whole
constitutional heritage rebels at the thought of giving government the power to control men’s minds.”
The leap from that language, or that holding, to some kind of mental privacy, is not small.
passed the federal Employee Polygraph Protection Act (EPPA).32 Under this Act,
almost all employers are forbidden to “directly or indirectly, . . . require, request,
suggest, or cause any employee or prospective employee to take or submit to any
lie detector test” or to “use, accept, refer to, or inquire concerning the results of
any lie detector test of any employee or prospective employee.”33 The Act defines
a “lie detector” broadly, as “a polygraph, deceptograph, voice stress analyzer,
psychological stress evaluator, or any other similar device (whether mechanical or
electrical) that is used, or the results of which are used, for the purpose of rendering a diagnostic opinion regarding the honesty or dishonesty of an individual.”34
The Department of Labor can punish violators with civil fines, and those injured
have a private right of action for damages.35 The Act does provide narrow exceptions for polygraph tests in some circumstances.36
In addition to the federal statute, many states have passed their own versions of the
EPPA, either before or after the federal act. The laws passed after EPPA generally
apply similar prohibitions to some employers not covered by the federal act (such
as state and local governments), but with their own idiosyncratic set of exceptions.
Many states have also passed laws regulating lie detection services. Most of these
seem clearly aimed at polygraphy, but, in some states, the language used is quite
broad and may well encompass neuroscience-based lie detection.37
States also may provide protection against neuroscience evidence that goes
beyond lie detection and could prevent involuntary neuroscience procedures.
Some states have constitutional or statutory rights of privacy that could be read to
include a broad freedom for mental privacy. And in some states, such as California,
such privacy rights apply not just to state action but to private actors as well.38
Most employment cases would be covered by EPPA and its state equivalents, but
such state privacy protections might be used to help decide whether courts could
32. Employee Polygraph Protection Act of 1988, Pub. L. No. 100-347, § 2, 102 Stat. 646
(codified at 29 U.S.C. §§ 2001–2009 (2006)). See generally the discussion of federal and state laws in
Greely & Illes, supra note 28, at 405–10, 421–31.
33. 29 U.S.C. § 2002(1)–(2) (2006) (The section also prohibits employers from taking action
against employees because of their refusal to take a test, because of the results of such a test, or for
asserting their rights under the Act); and id. § 2002(3)–(4) (2006).
34. Id. § 2001(3) (2006).
35. Id. § 2005 (2006).
36. Id. § 2006 (2006).
37. See generally Greely & Illes, supra note 28, at 409–10, 421–31 (for both state laws on
employee protection and state laws more broadly regulating polygraphy).
38. “All people are by nature free and independent and have inalienable rights. Among these are
enjoying and defending life and liberty, acquiring, possessing, and protecting property, and pursuing
and obtaining safety, happiness, and privacy.” (emphasis added). Calif. Const. art. I, § 1. The words,
“and privacy” were added by constitutional amendment in 1972. The California Supreme Court has
applied these privacy protections in suits against private actors: “In summary, the Privacy Initiative
in article I, section 1 of the California Constitution creates a right of action against private as well as
government entities.” Hill v. Nat’l Collegiate Athletic Ass’n, 865 P.2d 633, 644 (Cal. 1994).
compel neuroimaging scans or whether they could be required in nonemployment
relationships, such as school/student or parent/child.
d. Neuroscience evidence and the Sixth and Seventh Amendment rights to trial
by jury
One might also argue that some kinds of neuroscience evidence could be excluded
from evidence as a result of the federal constitutional rights to trial by jury in
criminal and most civil cases. In United States v. Scheffer,39 the Supreme Court
upheld an express ban in the Military Rules of Evidence on the admission of any
polygraph evidence against a criminal defendant’s claimed Sixth Amendment right
to introduce the evidence in his defense. Justice Thomas wrote the opinion of
the Court holding that the ban was justified by the questionable reliability of the
polygraph. Justice Thomas continued, however, in a portion of the opinion joined
only by Chief Justice Rehnquist and Justices Scalia and Souter, to hold that the
Rule could also be justified by an interest in the role of the jury:
It is equally clear that Rule 707 serves a second legitimate governmental interest:
Preserving the jury’s core function of making credibility determinations in
criminal trials. A fundamental premise of our criminal trial system is that “the
jury is the lie detector.” United States v. Barnard, 490 F.2d 907, 912 (CA9 1973)
(emphasis added), cert. denied, 416 U.S. 959, 40 L. Ed. 2d 310, 94 S. Ct. 1976
(1974). Determining the weight and credibility of witness testimony, therefore,
has long been held to be the “part of every case [that] belongs to the jury, who
are presumed to be fitted for it by their natural intelligence and their practical
knowledge of men and the ways of men.” Aetna Life Ins. Co. v. Ward, 140 U.S.
76, 88, 35 L. Ed. 371, 11 S. Ct. 720 (1891).40
The other four justices in the majority, and Justice Stevens in dissent, disagreed
that the role of the jury justified this rule, but the question remains open. Justice
Thomas’s opinion did not argue that exclusion was required as part of the rights to
jury trials in criminal and civil cases under the Sixth and Seventh Amendments,
respectively, but one might try to extend his statements of the importance of the
jury as “the lie detector” to such an argument.41
39. 523 U.S. 303 (1998).
40. Id. at 312–13.
41. The Federal Rules of Criminal Procedure effectively give the prosecution a right to a jury
trial, by allowing a criminal defendant to waive such a trial only with the permission of both the
prosecution and the court. Fed. R. Crim. P. 23(a). Many states allow a criminal defendant to waive a
jury trial unilaterally, thus depriving the prosecution of an effective “right” to a jury.
Reference Guide on Neuroscience
2. Possible rights to the creation or use of neuroscience evidence
a. The Eighth Amendment right to present evidence of mitigating circumstances
in capital cases
In one of many ways in which “death is different,” in Lockett v. Ohio,42 the U.S.
Supreme Court held that the Eighth Amendment guarantees a convicted defendant in a capital case a sentencing hearing in which the sentencing authority must
be able to consider any mitigating factors. In Rupe v. Wood,43 the Ninth Circuit,
in an appeal from the defendant’s successful habeas corpus proceeding, applied
that holding to find that a capital defendant had a constitutional right to have
polygraph evidence admitted as mitigating evidence in his sentencing hearing.
The court agreed that totally unreliable evidence, such as astrology, would not be
admissible, but that the district court had properly ruled that polygraph evidence
was not that unreliable. (The Washington Supreme Court had previously decided
that polygraph evidence should be admitted in the penalty phase of capital cases
under some circumstances.44) Thus, capital defendants may argue that they have
the right to present neuroscience evidence as mitigation even if it would not be
admissible during the guilt phase.
b. The Sixth Amendment right to present a defense
The Scheffer case arose in the context of another right guaranteed by the Sixth
Amendment, the right of a criminal defendant to present a defense. It seems
likely that neuroscience evidence will first be offered by parties who have been
its voluntary subjects and who will argue that it strengthens their cases. In fact,
the main use of neuroimaging in the courts so far, at least in criminal cases, has
been by defendants seeking to demonstrate through the scans some element of a
defense or mitigation. If jurisdictions were to exclude such evidence categorically,
they might face a similar Sixth Amendment challenge.
The Supreme Court has held that some prohibitions on evidence in criminal cases violate the right to present a defense. Thus, in Rock v. Arkansas,45 the
Court struck down a per se rule in Arkansas against the admission of hypnotically
refreshed testimony, holding that it was “arbitrary or disproportionate to the purposes [it is] designed to serve.” The Scheffer case probably provides the model for
how arguments about exclusions of neuroscience evidence would play out. Eight
of the Justices in Scheffer agreed that the reliability of polygraphy was sufficiently
42. 438 U.S. 586 (1978).
43. 93 F.3d 1434, 1439–41 (9th Cir. 1996). But see United States v. Fulks, 454 F.3d 410, 434
(4th Cir. 2006). See generally Christopher Domin, Mitigating Evidence? The Admissibility of Polygraph
Results in the Penalty Phase of Capital Trials, 41 U.C. Davis L. Rev. 1461 (2010), which argues that
the Supreme Court should resolve the resulting circuit split by adopting the Ninth Circuit’s position.
44. State v. Bartholomew, 101 Wash. 2d 631, 636, 683 P.2d 1079 (1984).
45. 483 U.S. 44, 56, 97 L. Ed. 2d 37, 107 S. Ct. 2704 (1987).
questionable as to justify the per se ban on its use. Justice Stevens, however, dissented, finding polygraphy sufficiently reliable to invalidate its per se exclusion.
3. The Fourth Amendment
The Fourth Amendment raises some particularly interesting questions. It provides,
of course, that:
The right of the people to be secure in their persons, houses, papers, and effects,
against unreasonable searches and seizures, shall not be violated, and no Warrants
shall issue, but upon probable cause, supported by Oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized.
On the one hand, an involuntary neuroscience examination would seem to
be a search or seizure, and thus “unreasonable” neuroscience examinations are
prohibited. To that extent, the Fourth Amendment would appear to be a protection against compulsory neuroscience testing.
On the other hand, if, say, an fMRI scan or an EEG were viewed as a “search
or seizure” for purposes of the Fourth Amendment, presumably courts could
issue a warrant for such a search or seizure, given probable cause and the relevant
procedural requirements. The use of such a warrant might (or might not) be limited by the privilege against self-incrimination or by some constitutional privacy
right, but, if such rights did not apply, would such warrants allow our brains to be
searched? This is, in a way, the ultimate result of the revolution in neuroscience,
which identifies our incorporeal “mind” with our physical “brain” and allows us
to begin to draw inferences from the brain to the mind. If the brain is a physical
thing or a place, it could be searchable, even if the goal in searching it is to find
out something about the mind, something that, as a practical matter, had never
itself been directly searchable.
VI. Examples of the Possible Uses of
Neuroscience in the Courts
Neuroscience may end up in court wherever someone’s mental state or condition
is relevant, which means it may be relevant to a vast array of cases. There are very
few cases, civil or criminal, where the mental states of the parties are not at least
theoretically relevant on issues of competency, intent, motive, recklessness, negligence, good or bad faith, or others. And even if the parties’ own mental states were
not relevant, the mental states of witnesses almost always will be potentially relevant—are they telling the truth? Are they biased against one party or another? The
mental states of jurors and even of judges occasionally may be called into question.
There are some important limitations on the use of neuroscience in the courtroom. First, it is unlikely to be used that often, particularly if it remains expensive.
With the possible exception of lie detection or bias detection, most cases will not
present a practical use for it. The garden variety breach of contract or assault and
battery is not likely to provide a plausible context for convincing neuroscience
evidence, especially if there is no evidence that the actor or the actions were odd
or bizarre. And many cases will not provide, or justify, the resources necessary for
a scan. Those costs could come down, but it seems unlikely that such evidence
would commonly be admitted without expert testimony, and the costs of that
seem likely to remain high.
Second, neuroscience evidence usually has a “time machine” problem. Neuroscience seems unlikely ever to be able to discern a person’s state of mind in the
past. Unless the legally relevant action took place inside an MRI scanner or other
neuroscience tool, the best it may be able to do is to say that, based on your current
mental condition or state, as shown by the current structure or functioning of your
brain, you are more or less likely than average to have had a particular mental state
or condition at the time of the relevant event. If the time of the relevant event is
the time of trial (or shortly before trial)—as would be the case with the truthfulness
of testimony, the existence of bias, or the existence of a particular memory—that
would not be a problem, but otherwise it would be.
Nonetheless, neuroscience evidence seems likely to be offered into evidence
for several issues, and in many of them, it already has been offered and even
accepted. In some cases it will be, and has been, offered as evidence of “legislative
facts,” of realities relevant to a broader legal issue than the mental state of any particular party or witness. Thus, amicus briefs in two Supreme Court cases involving the punishment of juveniles—one about capital punishment and one about
life imprisonment without possibility of parole—and to some extent the Court
itself, have discussed neuroscience findings about adolescent brains.46 Three of
46. In Roper v. Simmons, 543 U.S. 551 (2005), the Court held that the death penalty could
not constitutionally be imposed for crimes committed while a defendant was a juvenile. Two
amicus briefs argued that behavioral and neuroscience evidence supported this position. See Brief
of Amicus Curiae American Medical Association et al., Roper v. Simmons; and Brief of Amicus
Curiae American Psychological Association and the Missouri Psychological Association Supporting
Respondent (No. 03-633).
In Roper, the Court itself did not substantially rely on the neuroscientific evidence and does
not cite those amicus briefs. The Court’s opinion noted the scientific evidence only in passing as one
part of three relevant differences between adults and juveniles: “First, as any parent knows and as the
scientific and sociological studies respondent and his amici cite tend to confirm, ‘[a] lack of maturity and
an underdeveloped sense of responsibility are found in youth more often than in adults and are more
understandable among the young. These qualities often result in impetuous and ill-considered actions
and decisions.’ [citation omitted]” 543 U.S. at 569. Justice Scalia, however, did take the majority to
task for even this limited invocation of science and sociology. 543 U.S. at 616–18.
In Graham v. Florida, 2010 U.S. Lexis 3881, 130 S. Ct. 2011, 176 L. Ed. 2d 825 (2010), the Court
held that defendants could not be sentenced to life without the possibility of parole for nonhomicide
crimes committed while they were juveniles. Two amicus briefs similar to those discussed in Roper
were filed. See Brief of Amicus Curiae American Medical Association (No. 08-7412) and Brief Amicus
Curiae American Academy of Child & Adolescent Psychiatry (No. 08-7621) Supporting Neither
the handful of published cases in which fMRI evidence was offered in court concerned challenges to state laws requiring warning labels on violent videogames.47
The states sought, without success, to use fMRI studies of the effects of violent
videogames on the brains of children playing the games to support their statutes.48
These “wholesale” uses of neuroscience may (or may not) end up affecting the law, but the courts would be more affected if various “retail” uses of
neuroscience become common, where a party or a witness is subjected to neuroscience procedures to determine something relevant only to that particular case.
An incomplete list of some of the most plausible categories for such retail uses
includes the following:
• Issues of responsibility, certainly criminal and likely also civil;
• Predicting future behavior for sentencing;
• Mitigating (or potentially aggravating) factors on sentencing;
Party; and Brief of Amicus Curiae American Psychological Association et al. Supporting Petitioners
(Nos. 08-7412, 08-7621). The Court did refer more directly to the scientific findings in Graham,
directly citing the amicus briefs:
“No recent data provide reason to reconsider the Court’s observations in Roper about the nature of
juveniles. As petitioner’s amici point out, developments in psychology and brain science continue
to show fundamental differences between juvenile and adult minds. For example, parts of the brain
involved in behavior control continue to mature through late adolescence.” See Brief for American
Medical Association et al. as Amici Curiae 16–24; Brief for American Psychological Association et al.
as Amici Curiae 22–27.
Justice Thomas, in a dissent joined by Justice Scalia, reviewed some of the evidence from these amicus
briefs:
“In holding that the Constitution imposes such a ban, the Court cites ‘developments in psychology
and brain science’ indicating that juvenile minds ‘continue to mature through late adolescence,’ ante,
at 17 (citing Brief for American Medical Association et al. as Amici Curiae 16–24; Brief for American
Psychological Association et al. as Amici Curiae 22–27 (hereinafter APA Brief)), and that juveniles are
‘more likely [than adults] to engage in risky behaviors,’” id. at 7. But even if such generalizations
from social science were relevant to constitutional rulemaking, the Court misstates the
data on which it relies.
47. Entm’t Software Ass’n v. Hatch, 443 F. Supp. 2d 1065 (D. Minn. 2006); Entm’t Software
Ass’n v. Blagojevich, 404 F. Supp. 2d 1051 (N.D. Ill. 2005); Entm’t Software Ass’n v. Granholm, 404
F. Supp. 2d 978 (E.D. Mich. 2005). Each of the three courts held that the state statutes violated the
First Amendment.
48. The courts, sitting in equity and so without juries, all considered the scientific evidence
and concluded that it was insufficient to sustain the statutes’ constitutionality. In Blagojevich the court
heard testimony for the state directly from Dr. Kronenberger, the author of some of the fMRI-based
articles on which the state relied, as well as from Dr. Howard Nusbaum, for the plaintiffs, who attacked
Dr. Kronenberger’s study. After a substantial discussion of the scientific arguments, the district court
judge, Judge Matthew Kennelly, found that “Dr. Kronenberger’s studies cannot support the weight he
attempts to put on them via his conclusions,” and did not provide a basis for the statute. Blagojevich,
404 F. Supp. 2d at 1063–67. Judge Kennelly’s discussion of this point may be a good example of the kind
of analysis neuroscience evidence may force upon judges.
• Competency, now or in the past, to take care of one’s affairs, to enter into agreements or make wills, to stand trial, to represent oneself, and to be executed;
• Deception in current statements;
• Existence or nonexistence of a memory of some event and, possibly, some information about the status of that memory (true, false; new, old, etc.);
• Presence of the subjective sensation of pain;
• Presence of the subjective sensation of remorse; and
• Presence of bias against a party.
Many, but not all, of these issues have begun to be discussed in the literature.
A few of them, such as criminal responsibility, mitigation, memory detection, and
lie detection, are appearing in courtrooms; others, such as pain detection, have
reached the edge of trial. This chapter does not discuss all of these topics and does
not discuss any of them in great depth, but it will describe three of them—criminal
responsibility, detection of pain, and lie detection—in order to provide a flavor
of the possibilities.
A. Criminal Responsibility
Neuroscience may raise some deep questions about criminal responsibility. Suppose we had excellent scientific evidence that a defendant could not help but commit the criminal acts because of a specific brain abnormality.49 Should that affect the
defendant’s guilt and, if so, how? Should it affect his sentence or other subsequent
treatment? The moral questions may prove daunting. Currently the law is not very
interested in such deep questions of free will, but that may change.
Already, though, criminal law is concerned with the mental state of the defendant in many more specific contexts. A conviction generally requires both an actus
reus and a mens rea—a “guilty act” and a “guilty mind.” An unconscious person
cannot “act,” but even a conscious act is often not enough. Specific crimes often
require specific intents, such as acting with a particular purpose or in a knowing
or reckless fashion. Some crimes require even more defined mental states, such as
a requirement for premeditation in some murder statutes. And almost all crimes
can be excused by legal insanity. In these and other ways the mental state of the
defendant may be relevant to a criminal case.
Neuroscience may provide evidence in some cases to support a defendant’s
claim of nonresponsibility. For example, a defendant who claims to have been
insane at the time of the crime might try to support his or her claim by alleging
that he or she is experiencing visual and auditory hallucinations. Neuroimaging may be able
49. See Henry T. Greely, Neuroscience and Criminal Responsibility: Proving “Can’t Help Himself”
as a Narrow Bar to Criminal Liability, in Law and Neuroscience, Current Legal Issues 13 (Michael
Freeman ed. 2011).
to provide some evidence about whether the defendant is, in fact, hallucinating, at least at the time when he or she is in the scanner. Such imaging might
show that the defendant had a stroke or tumor in a particular part of the brain,
which then could be used to argue in some way against the defendant’s criminal
responsibility.50
Neuroimaging has been used more broadly in some criminal cases. For example, as noted above, in the trial of John Hinckley for the attempted assassination
of President Reagan, the defense used CAT scans of Hinckley’s brain to support the argument, based largely on his bizarre behavior, that he suffered from
schizophrenia. The scientific basis for that conclusion, offered early in the history of
brain CAT scans, was questionable at the time and has become even weaker since,
but Hinckley was found not guilty by reason of insanity. More recently, in November 2009, testimony about an fMRI scan was introduced in the penalty phase of a
capital case as mitigating evidence that the defendant suffered from psychopathy.
The defendant was sentenced to death, but after longer jury deliberations than
defense counsel expected.51 (This appears to have been the first time fMRI results
were introduced in a criminal case.52)
Neuroscience evidence also may be relevant in wider arguments about
criminal justice. Evidence about the development of adolescent brains has been
referred to in appellate cases concerning the punishments appropriate for people
who committed crimes while under age, including, as noted above, U.S. Supreme
Court decisions. More broadly, some have urged that neuroscience will undercut
much of the criminal justice system. The argument is that neuroscience ultimately
will prove that no one—not even the sanest defendant—has free will and that
this will fatally weaken the retributive aspect of criminal justice.53
50. In at least one fascinating case, a man who was convicted of sexual abuse of a child was
found to have a large tumor pressing into his brain. When the tumor was removed, his criminal
sexual impulses disappeared. When his impulses returned, so had his tumor. The tumor was removed
a second time and, again, his impulses disappeared. J.M. Burns & R.H. Swerdlow, Right Orbitofrontal
Tumor with Pedophilia Symptom and Constructional Apraxia Sign, 60 Arch. Neurology 437 (2003); Doctors
Say Pedophile Lost Urge After Tumor Removed, USA Today, July 28, 2003. See Greely, Neuroscience and
Criminal Responsibility, supra note 49 (offering a longer discussion of this case).
51. The defendant in this Illinois case, Brian Dugan, confessed to the murder but sought to avoid
the death penalty. See Virginia Hughes, Head Case, 464 Nature 340 (2010) (providing an excellent
discussion of this case).
52. Other forms of neuroimaging, particularly PET and structural MRI scans, have been more
widely used in criminal cases. Dr. Ruben Gur at the University of Pennsylvania estimates that he has
used neuroimaging in testimony for criminal defendants about 30 times. Id.
53. See, e.g., Robert M. Sapolsky, The Frontal Cortex and the Criminal Justice System, in Law
and the Brain (Semir Zeki & Oliver Goodenough eds., 2006); Joshua Greene & Jonathan Cohen,
For the Law, Neuroscience Changes Nothing and Everything, in Law and the Brain (Semir Zeki & Oliver
Goodenough eds., 2006).
This argument has been forcefully attacked by Professor Stephen Morse. See, e.g., Stephen J.
Morse, Determinism and the Death of Folk Psychology: Two Challenges to Responsibility from Neuroscience, 9
Minn. J.L. Sci. & Tech. 1 (2008); Stephen J. Morse, The Non-Problem of Free Will in Forensic Psychiatry
The application of neuroscience evidence to individual claims of a lack of
criminal responsibility should prove challenging.54 Such claims will suffer from the
time machine problem—the brain scan will almost always be from after, usually
long after, the crime was committed and so cannot directly show the defendant’s
brain state (and hence, by inference, his or her mental state) at or before the time
of the crime. Similarly, most of the neuroscience evidence will be from associations, not from experiments. It is hard to imagine an ethical experiment that would
scan people when they are, or are not, committing particular crimes, leaving only
indirect experiments. Evidence that, for example, more convicted rapists than
nonrapists had particular patterns of brain activation when viewing sexual material
might somehow be relevant to criminal responsibility, but it also might not.
Careful neuroscience studies, either structural or functional, of the brains
of criminals are rare. It seems highly unlikely that a “responsibility region” will
ever be found, one that is universally activated in law-abiding people and that is
deactivated in criminals (or vice versa). At most, the evidence is likely to show that
people with particular brain structures or patterns of brain functioning commit
crimes more frequently than people without such structures or patterns. Applying
this group evidence to individual cases will be difficult, if not impossible. All of
the problems of technical and statistical analysis of neuroimaging data, discussed in
Section IV, apply. And it is possible that the to-be-scanned defendants will be able
to implement countermeasures to “fool” the expert analyzing the scan.
The use of neuroscience to undermine criminal responsibility faces another
problem—identifying a specific legal argument. It is not generally a defense to a
criminal charge to assert that one has a predisposition to commit a crime, or even
a very high statistical likelihood, as a result of social and demographic variables,
of committing a crime. It is not clear whether neuroscience would, in any more
than a very few cases,55 provide evidence that was not equivalent to predisposition evidence. (And, of course, prudent defense counsel might think twice before
presenting evidence to the jury that his or her client was strongly predisposed to
commit crimes.)
We are at an early stage in our understanding of the brain and of the brain
states related to the mental states involved in criminal responsibility. At this point,
about all that can be said is that at least some criminal defense counsel, seeking to
represent their clients zealously, will watch neuroscience carefully for arguments
they could use to relieve their clients from criminal responsibility.
and Psychology, 25 Behav. Sci. & L. 203 (2007); Stephen J. Morse, Moral and Legal Responsibility and
the New Neuroscience, in Neuroethics: Defining the Issues in Theory, Practice, and Policy (Judy Illes
ed., 2006); Stephen J. Morse, Brain Overclaim Syndrome and Criminal Responsibility: A Diagnostic Note,
3 Ohio St. J. Crim. L. 397 (2005).
54. A good short discussion of these challenges can be found in Helen Mayberg, Does Neuroscience
Give Us New Insights into Criminal Responsibility? in A Judge’s Guide to Neuroscience, supra note 1.
55. See Greely, Neuroscience and Criminal Responsibility, supra note 49 (arguing for a very narrow
neuroscience-based defense).
B. Lie Detection
The use of neuroscience methods for lie detection probably has received more
attention than any other issue raised in this chapter.56 This is due in part to the
cultural interest in lie detection, dating back in its technological phase nearly
90 years to the invention of the polygraph.57 But it is also due to the fact that two
commercial firms currently are offering fMRI-based lie detection services for sale
in the United States: Cephos and No Lie MRI.58 Currently, as far as we know,
56. For a technology whose results have yet to be admitted in court, the legal and ethical
issues raised by fMRI-based lie detection have been discussed in an amazingly long list of scholarly
publications from 2004 to the present. An undoubtedly incomplete list follows: Nita Farahany, supra
note 27; Brown & Murphy, supra note 20; Anthony D. Wagner, supra note 12; Frederick Schauer,
Can Bad Science Be Good Evidence?: Neuroscience, Lie-Detection, and the Mistaken Conflation of Legal and
Scientific Norms, 95 Cornell L. Rev. 1191 (2010); Frederick Schauer, Neuroscience, Lie-Detection, and the
Law: A Contrarian View, 14 Trends Cog. Sci. 101 (2010); Emilio Bizzi et al., Using Imaging to Identify
Deceit: Scientific and Ethical Questions (2009); Joelle Anne Moreno, The Future of Neuroimaged Lie
Detection and the Law, 42 Akron L. Rev. 717 (2009); Julie Seaman, Black Boxes: fMRI Lie Detection
and the Role of the Jury, 42 Akron L. Rev. 931 (2009); Jane Campbell Moriarty, Visions of Deception:
Neuroimages and the Search for Truth, 42 Akron L. Rev. 739 (2009); Dov Fox, supra note 27; Benjamin
Holley, It’s All in Your Head: Neurotechnological Lie Detection and the Fourth and Fifth Amendments, 28
Dev. Mental Health L. 1 (2009); Brian Reese, Comment: Using fMRI as a Lie Detector—Are We Lying
to Ourselves? 19 Alb. L.J. Sci. & Tech. 205 (2009); Cooper Ellenberg, Student Article: Lie Detection:
A Changing of the Guard in the Quest for Truth in Court? 33 Law & Psychol. Rev. 139 (2009); Julie
Seaman, Black Boxes, 58 Emory L.J. 427 (2008); Matthew Baptiste Holloway, supra note 27; William
Federspiel, supra note 27; Greely & Illes, supra note 28; Sarah E. Stoller & Paul R. Wolpe, supra
note 27; Mark Pettit, FMRI and BF Meet FRE: Brain Imaging and the Federal Rules of Evidence, 33
Am. J.L. & Med. 319 (2007); Jonathan H. Marks, Interrogational Neuroimaging in Counterterrorism: A
“No-Brainer” or a Human Rights Hazard? 33 Am. J.L. & Med. 483 (2007); Leo Kittay, Admissibility of
fMRI Lie Detection: The Cultural Bias Against “Mind Reading” Devices, 72 Brook. L. Rev. 1351, 1355
(2007); Jeffrey Bellin, The Significance (if Any) for the Federal Criminal Justice System of Advances in Lie
Detector Technology, Temp. L. Rev. 711 (2007); Henry T. Greely, The Social Consequences of Advances in
Neuroscience: Legal Problems; Legal Perspectives, in Neuroethics: Defining the Issues in Theory, Practice
and Policy 245 (Judy Illes ed., 2006); Charles N.W. Keckler, Cross-Examining the Brain: A Legal Analysis
of Neural Imaging for Credibility Impeachment, 57 Hastings L.J. 509 (2006); Archie Alexander, Functional
Magnetic Resonance Imaging Lie Detection: Is a “Brainstorm” Heading Toward the “Gatekeeper”? 7 Hous. J.
Health L. & Pol’y (2006); Michael S. Pardo, supra note 27; Erich Taylor, supra note 27; Paul R. Wolpe
et al., Emerging Neurotechnologies for Lie-Detection: Promises and Perils, 5 Am. J. Bioethics 38, 42 (2005);
Henry T. Greely, Premarket Approval Regulation for Lie Detection: An Idea Whose Time May Be Coming,
5 Am. J. Bioethics 50–52 (2005); Sean Kevin Thompson, Note: The Legality of the Use of Psychiatric
Neuroimaging in Intelligence Interrogation, 90 Cornell L. Rev. 1601 (2005); Henry T. Greely, Prediction,
Litigation, Privacy, and Property: Some Possible Legal and Social Implications of Advances in Neuroscience, in
Neuroscience and the Law: Brain, Mind, and the Scales of Justice 114–56 (Brent Garland ed., 2004);
and Judy Illes, A Fish Story? Brain Maps, Lie Detection, and Personhood, 6 Cerebrum 73 (2004).
57. An interesting history of the polygraph can be found in Ken Alder, The Lie Detectors:
The History of an American Obsession (2007). Perhaps the best overall discussion of the polygraph,
including some discussion of its history, is found in the National Research Council report, supra note
14, commissioned in the wake of the Wen Ho Lee case, on the use of the technology for screening.
58. The Web sites for the two companies are at Cephos, www.cephoscorp.com (last visited
July 3, 2010); and No Lie MRI, http://noliemri.com (last visited July 3, 2010).
evidence from fMRI-based lie detection has not been admitted into evidence in
any court, but it was offered—and rejected—in two cases, United States v. Semrau59
and Wilson v. Corestaff Services, L.P.,60 in May 2010.61 This section will begin
by analyzing the issues raised for courts by this technology and then will discuss
these two cases, before ending with a quick look at possible uses of this kind of
technology outside the courtroom.
1. Issues involved in the use of fMRI-based lie detection in litigation
Published research on fMRI and detecting deception dates back to about 2001.62
As noted above, to date between 20 and 30 peer-reviewed articles from about
15 laboratories have appeared claiming to find statistically significant correlations
between patterns of brain activation and deception. Only a handful of the published studies have looked at the accuracy of determining deception in individual
subjects as opposed to group averages. Those studies generally claim accuracy rates
of between about 75% and 90%. No Lie MRI has licensed the methods used by
one laboratory, that of Dr. Daniel Langleben at the University of Pennsylvania;
Cephos has licensed the method used by another laboratory, that of Dr. Frank A.
Kozel, first at the Medical University of South Carolina and then at the University
of Texas Southwestern Medical Center. (The method used by a British researcher,
Dr. Sean Spence, has been used on a British reality television show.)
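The practical weight of such claimed accuracy rates depends heavily on base rates, a point a short Bayes’ rule sketch can make concrete. The sensitivity, specificity, and base-rate figures below are illustrative assumptions chosen by the editors, not results from any of the studies discussed:

```python
# Hypothetical illustration: even a test with seemingly high per-subject
# accuracy gives uncertain answers about an individual once base rates
# are taken into account. All numbers here are assumed, not reported.

def posterior_deception(sensitivity, specificity, base_rate):
    """P(subject is lying | test flags deception), by Bayes' rule."""
    true_pos = sensitivity * base_rate           # liars correctly flagged
    false_pos = (1 - specificity) * (1 - base_rate)  # truth-tellers wrongly flagged
    return true_pos / (true_pos + false_pos)

# Assume 85% accuracy for both lying and truthful subjects (within the
# 75%-90% range claimed), and suppose only 10% of examinees are lying.
p = posterior_deception(sensitivity=0.85, specificity=0.85, base_rate=0.10)
print(f"P(lying | flagged) = {p:.2f}")  # prints: P(lying | flagged) = 0.39
```

Under these assumed figures, a flagged subject is still more likely truthful than deceptive, which suggests why laboratory accuracy rates alone may say little about any particular examinee.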
All of these studies rely on research subjects, typically but not always college students, who are recruited for a study of deception. They are instructed to
answer some questions truthfully in the scanner and to answer other questions
inaccurately.63 In the Langleben studies, for example, right-handed, healthy, male
59. No. 07-10074 M1/P, Report and Recommendation (W.D. Tenn. May 31, 2010).
60. 2010 N.Y. Slip Op. 20176, 1 (N.Y. Sup. Ct. 2010); 900 N.Y.S.2d 639; 2010 N.Y. Misc.
LEXIS 1044 (2010).
61. In early 2009, a motion to admit fMRI-based lie detection evidence, provided by No Lie
MRI, was made, and then withdrawn, in a child custody case in San Diego. The case is discussed in a
prematurely entitled article, Alexis Madrigal, MRI Lie Detection to Get First Day in Court, WIRED SCI.
(Mar. 16, 2009), available at http://blog.wired.com/wiredscience/2009/03/noliemri.html (last visited
July 3, 2010). A somewhat similar method of using EEG to look for signs of “recognition” in the brain
was admitted into one state court hearing for postconviction relief at the trial court level in Iowa in
2001, and both it and another EEG-based method have been used in India. As far as we know, evidence
from the use of EEG for lie detection has not been admitted in any other U.S. cases. See supra note 28.
62. The most recent reviews of the scientific literature on this subject are Anthony D. Wagner,
supra note 12; and S.E. Christ et al., The Contributions of Prefrontal Cortex and Executive Control to
Deception: Evidence from Activation Likelihood Estimate Meta-Analyses, 19 Cerebral Cortex 2557 (2009).
See also Greely & Illes, supra note 28 (for discussion of the articles through early 2007). The following
discussion is based largely on those sources.
63. At least one fMRI study has attempted to investigate self-motivated lies, told by subjects
who were not instructed to lie, but who chose to lie for personal gain. Joshua D. Greene & Joseph M.
Paxton, Patterns of Neural Activity Associated with Honest and Dishonest Moral Decisions, 106 Proc. Nat’l
Acad. Sci. 12,506 (2009). The experiment was designed to make it easy for subjects to realize they
Copyright © National Academy of Sciences. All rights reserved.
Reference Manual on Scientific Evidence: Third Edition
University of Pennsylvania undergraduates were shown images of playing cards
while in the scanner and asked to indicate whether they saw a particular card.
They were instructed to answer truthfully except when they saw one particular
card. Some of Kozel’s studies used a different experimental paradigm, in which the
subjects were put in a room and told to take either a watch or a ring. When asked
in the scanner separately whether they had taken the watch and then whether they
had taken the ring, they were to reply “no” in both cases—truthfully once and
falsely the other time. When analyzed in various ways, the fMRI results showed
statistically different patterns of brain activation (small changes in BOLD response)
when the subjects were lying and when they were telling the truth.
In general, these studies are not guided by a consistent hypothesis about
which brain regions should be activated or deactivated during truth or deception. The results are empirical: researchers observe particular patterns that differ between
the truth state and the lie state. Some have argued that the patterns show greater
mental effort when deception is involved; others have argued that they show more
impulse control when lying.
Are fMRI-based lie detection methods accurate? As a class of experiments,
these studies are subject to all the general problems discussed in Section IV regarding fMRI scans that might lead to neuroscience evidence. So far there are only a
few studies involving a limited number of subjects. (The method used by No Lie
MRI seems ultimately to have been based on the responses of four right-handed,
healthy, male University of Pennsylvania undergraduates.64) There have been, to
date, no independent replications of any group’s findings.
The experience of the research subjects in these fMRI studies of deception
seems to be different from “lying” as the court system would perceive it. The
subjects knew they were involved in research, they were following orders to lie,
and they knew that the most harm that could come to them from being detected
in a lie might be lesser payment for taking part in the experiment. This seems hard
to compare to a defendant lying about participating in a murder. More fundamentally, it is not clear how one could conduct ethical but realistic experiments with
lie detection. Research subjects cannot credibly be threatened with jail if they do
not convince the researcher of the truth of their lies.
Only a handful of researchers have published studies showing reported accuracy rates with individual subjects and only with a small number of subjects.65
would be given more money if they lied about how many times they correctly predicted a coin flip.
Investigators could not, however, determine if a subject lied in any particular trial.
64. Daniel D. Langleben et al., Telling Truth from Lie in Individual Subjects with Fast Event-Related
fMRI, 26 Hum. Brain Mapping 262, 267 (2005).
65. See discussion in Anthony D. Wagner, supra note 12, at 29–35. Wagner analyzes 11 peer-reviewed, published papers. Seven come from Kozel’s laboratory; three come from Langleben’s. The
only exception is a paper from John Cacioppo’s group, which concludes “[A]lthough fMRI may
permit investigation of the neural correlates of lying, at the moment it does not appear to provide a
very accurate marker of lying that can be generalized across individuals or even perhaps across types
Reference Guide on Neuroscience
Some of the studies used complex and somewhat controversial statistical techniques. And although subjects in at least one experiment were invited to try to use
countermeasures against being detected, no specific countermeasures were tested.
Beyond the scientific validity of these techniques lies a host of legal questions.
How accurate is accurate enough for admissibility in court or for other legal system uses? What are the implications of admissible and accurate lie detection for
the Fourth, Fifth, Sixth, and Seventh Amendments? Would jurors be allowed to
consider the failure, or refusal, of a party to take a lie detector test? Would lie
detection be available in discovery? Would each side get to do its own tests—and
who would pay?
Accurate lie detection could make the justice system much more accurate.
Incorrect convictions might become rare; so might incorrect acquittals. Accurate lie
detection also could make the legal system much more efficient. It seems likely that
far fewer cases would go to trial if the witnesses could expect to have their veracity
accurately determined.
Inaccurate lie detection, on the other hand, holds the potential of ruining the
innocent and immunizing the guilty. It is at least daunting to remember some of
the failures of the polygraph, such as the case of Aldrich Ames, a Soviet (and then
Russian) mole in the Central Intelligence Agency, who passed two Agency polygraph tests while serving as a paid spy.66 The courts already have begun to decide
whether and how to use these new methods of lie detection in the judicial process;
the rest of society also will soon be forced to decide on their uses and limits.
2. Two cases involving fMRI-based lie detection
On May 31, 2010, U.S. Magistrate Judge Tu M. Pham of the Western District
of Tennessee issued a 39-page report and recommendation on the prosecution’s
motion to exclude evidence from an fMRI-based lie detection report by Cephos
in the case of United States v. Semrau.67 The report came after a hearing on
May 13–14 featuring testimony from Steve Laken, CEO of Cephos, for admission,
and from two experts arguing against admission. (The district judge adopted the
magistrate’s report during the trial.)
The defendant in this case, a health professional accused of defrauding Medicare, offered as evidence a report from Cephos stating that he was being truthful
of lies by the same individuals.” G. Monteleone et al., Detection of Deception Using fMRI: Better Than
Chance, But Well Below Perfection, 4 Soc. Neurosci. 528 (2009). However, that study only looked at
one brain region at a time, and it did not test combinations or patterns, which might have improved
the predictive power.
66. See Senate Select Committee on Intelligence, Assessment of the Aldrich H. Ames Espionage
Case and Its Implications for U.S. Intelligence (1994).
67. See supra note 59. The district court judge assigned to the case had a scheduling conflict on
the date of the hearing on the prosecution’s motion, and so the hearing was held before a magistrate
judge from that district.
when he answered a set of questions about his actions and knowledge concerning
the alleged crimes.
Judge Pham first analyzed the motion under Rule 702, using the Daubert
criteria. He concluded that the technique was testable and had been the subject of
peer-reviewed publications. On the other hand, he concluded that the error rates
for its use in realistic situations were unknown. Furthermore, he found there were
no standards for its appropriate use. To the extent that the publications relied on
by Cephos to establish its reliability constituted such standards, those standards had
not actually been followed in the tests of the defendant. Cephos actually scanned
Dr. Semrau twice on one day, asking questions about one aspect of the criminal
charges during the first scan and then about another aspect in the second scan.
The company’s subsequent analysis of those scans indicated that the defendant
had been truthful in the first scan but deceptive in the second scan. Cephos then
scanned him a third time, several days later, on the second subject but with revised
questions, and concluded that he was telling the truth that time. Nothing in the
publications relied upon by Cephos indicated that the third scan was appropriate.
Finally, Judge Pham found that the method was not generally accepted in the
relevant scientific community as sufficiently reliable for use in court, citing several
publications, including some written by the authors whose methods Cephos used.
The magistrate judge then examined the motion under Rule 403 and found
that the potential prejudicial effect of the evidence outweighed its probative value.
He noted that the test had been conducted without the government’s knowledge
or participation, in a context where the defendant risked nothing by taking the
test—a negative result would never be disclosed. He noted the jury’s central role
in determining credibility and considered the likelihood that the lie detection
evidence would be a lengthy and complicated distraction from the jury’s central
mission. Finally, he noted that the probative value of the evidence was greatly
reduced because the report only gave a result concerning the defendant’s general
truthfulness when responding to more than 10 questions about the events but did
not even purport to say whether the defendant was telling the truth about any
particular question.
Earlier that month, a state trial court judge in Brooklyn excluded another
Cephos lie detection report in a civil case, Wilson v. Corestaff Services, L.P.68 This
case involved a claim by a former employee under state law that she had been
subject to retaliation for reporting sexual harassment. The plaintiff offered evidence from a Cephos report finding that her main witness was truthful when he
described how defendant’s management said it would retaliate against the plaintiff.
That case did not involve an evidentiary hearing or, indeed, any expert testimony. The judge decided the lie detection evidence was not appropriate under
New York’s version of the Frye test, noting that, in New York, “courts have
advised that the threshold question under Frye in passing on the admissibility
68. Wilson v. Corestaff Services, L.P., supra note 60.
of expert’s testimony is whether the testimony is ‘within the ken of the typical
juror.’”69 Because credibility is a matter for the jury, the judge concluded that this
kind of evidence was categorically excluded under New York’s version of Frye. He
also noted that “even a cursory review of the scientific literature demonstrates that
the plaintiff is unable to establish that the use of the fMRI test to determine truthfulness or deceit is accepted as reliable in the relevant scientific community.”70
3. fMRI-based lie detection outside the courtroom
Lie detection might have applications to litigation without ever being introduced
in trials. As is the case today with the polygraph, the fact that it is not generally
admissible in court might not stop the police or the prosecutors from using it to
investigate alleged crimes. Similarly, defense counsel might well use it to attempt
to persuade the authorities that their clients should not be charged or should be
charged with lesser offenses. One could imagine the same kinds of pretrial uses
of lie detection in civil cases, as the parties seek to affect each other’s perceptions of the merits of the case.
Such lie detection efforts could also affect society, and the law, outside of
litigation. One could imagine prophylactic lie detection at the beginning of contractual relations, seeking to determine whether the other side honestly had the
present intention of complying with the contract’s terms. One can also imagine
schools using lie detection as part of investigations of student misconduct or parents seeking to use lie detection on their children. The law more broadly may
have to decide whether and how private actors can use lie detection, determining
whether, for example, to extend to other contexts—or to weaken or repeal—the
Employee Polygraph Protection Act.71
The current fMRI-based methods of lie detection provide one kind of protection for possible subjects—they are obvious. No one is going to be put into an
MRI for an hour and asked to respond, repeatedly, to questions without realizing
something important is going on. Should researchers develop less obtrusive or
obvious methods of neuroscience-based lie detection, we will have to deal with
the possibilities of involuntary and, indeed, surreptitious lie detection.
C. Detection of Pain
No matter where an injury occurs and no matter where it seems to hurt, pain
is felt in the brain.72 Without sensory nerves leading to the brain from a body
69. Id. at 6, citing People v. Cronin, 60 N.Y.2d 430, 458 N.E.2d 351, 470 N.Y.S.2d 110 (1983).
70. See Wilson, supra note 60, at 7.
71. See supra text accompanying note 32.
72. See Brain Facts, supra note 2, at 19–21, 49–50, which includes a useful brief description of
the neuroscience of pain.
region, there is usually no experience of pain. Without the brain machinery and
functioning to process the signal, no pain is perceived.
Pain turns out to be complicated—even the common pain that is experienced
from an acute injury to, say, an arm. Neurons near the site of the injury, called
nociceptors, transmit the pain signal to the spinal cord, which relays it to the
brain. But other neurons near the site of the injury will, over time, adapt to affect
the pain signal. Cells in the spinal cord can also modulate the pain signal that is
sent to the brain, making it stronger or weaker. The brain, in turn, sends signals
down to the spinal cord that cause or, at least, affect these modulations. And the
actual sensation of pain—the “ouch”—takes place in the brain.
The immediate and localized sensation is processed in the somatosensory
cortex, the brain region that takes sensory inputs from different body parts (with
each body part getting its own portion of the somatosensory cortex) and processes them into a perceived sensation. The added knowledge that the sensation
is painful seems to require the participation of other regions of the brain. Using
fMRI and other techniques, some researchers have identified what they call the
“pain matrix” in the brain, regions that are activated when experimental subjects,
in scanners, are exposed to painful stimuli. The brain regions identified as part of
the so-called pain matrix vary from researcher to researcher, but generally include
the thalamus, the insula, parts of the anterior cingulate cortex, and parts of the
cerebellum.73
Researchers have run experiments with subjects in the scanner receiving
painful or not painful stimuli and have attempted to find activation patterns
that appear when pain is perceived and that do not appear when pain is absent.
(The subjects usually are given nonharmful painful stimuli such as having their
skin touched with a hot metal rod or coated with a pepper-derived substance
that causes a burning sensation.) Some have reported substantial success, detecting pain in more than 80% of the cases.74 Other studies have found a positive
correlation between the degree of activation in the pain matrix and the degree
of subjective pain, both as reported by the subject and as possibly indicated by
the heat of the rod or the amount of the painful substance—the higher the temperature or the concentration of the painful substance, the greater the average
activation in the pain matrix.75
Other neuroscience studies of individual pain look not at brain function during painful episodes but at brain structure. Some researchers, for example, claim
that different regions of the brain have different average size and neuron densities
73. A good review article on the uses of fMRI in studying pain is found in David Borsook
& Lino R. Becerra, Breaking Down the Barriers: fMRI Applications in Pain, Analgesia and Analgesics, 2
Molecular Pain 30 (2006).
74. See, e.g., Irene Tracey, Imaging Pain, 101 Brit. J. Anaesth. 32 (2008).
75. See, e.g., Robert C. Coghill et al., Neural Correlates of Interindividual Differences in the Subjective
Experience of Pain, 100 Proc. Nat’l Acad. Sci. 8538 (2003).
in patients who have had long-term chronic pain than in those who have not
had such pain.76
Pain is clearly complicated. Placebos, distractions, or great need can sometimes cause people not to sense, or perhaps not to notice, pain that could otherwise be overwhelming. Conversely, some people can become hypersensitive to pain,
reporting severe pain when the stimulus normally would be benign. Amputees
with phantom pain—the feeling of pain in a limb that has been gone for years—
have been scanned while reporting this phantom pain. They show activation in
the pain matrix. In some fMRI studies, people who have been hypnotized to feel
pain, even when there is no painful stimulus, show activation in the pain matrix.77
And in one fMRI study, subjects who reported feeling emotional distress, as a
result of apparently being excluded from a “game” being played among research
subjects, also showed, on average, statistically significant activation of the pain
matrix.78
Pain also plays an enormous role in the legal system.79 The existence and
extent of pain is a matter for trial in hundreds of thousands of injury cases each
year. Perhaps more importantly, pain figures into uncounted workers’ compensation claims and Social Security disability claims. Pain is often difficult to prove,
and the uncertainty of a jury’s response to claimed pain probably keeps much litigation alive. We know that the tests for pain currently presented to jurors, judges,
and other legal decisionmakers are not perfect. Anecdotes and the assessments of
pain experts both are convincing that some nontrivial percentage of successful claimants are malingering and only pretending to feel pain; a much greater
percentage may be exaggerating their pain.
A good test for whether a person is feeling pain, and, even better, a “scientific” way to measure the amount of that pain—at least compared to other pains
felt by that individual, if not to pain as perceived by third parties—could help
resolve a huge number of claims each year. If such pain detection were reliable,
it would make justice both more accurate and more certain, leading to faster, and
76. See, e.g., Vania Apkarian et al., Chronic Back Pain Is Associated with Decreased Prefrontal and
Thalamic Gray Matter Density, 24 J. Neurosci. 10,410 (2004); see also Arne May, Chronic Pain May
Change the Structure of the Brain, 137 Pain 7 (2008); Karen D. Davis, Recent Advances and Future Prospects
in Neuroimaging of Acute and Chronic Pain, 1 Future Neurology 203 (2006).
77. Stuart W. Derbyshire et al., Cerebral Activation During Hypnotically Induced and Imagined Pain,
23 NeuroImage 392 (2004).
78. Naomi I. Eisenberger et al., Does Rejection Hurt? An fMRI Study of Social Exclusion, 302 Science
290 (2003).
79. The only substantial analysis of the legal implications of using neuroimaging to detect pain
is found in Adam J. Kolber, Pain Detection and the Privacy of Subjective Experience, 33 Am. J.L. & Med.
433 (2007). Kolber expands on that discussion in interesting ways in Adam J. Kolber, The Experiential
Future of the Law, 60 Emory L.J. 585, 595–601 (2011). The possibility of such pain detection was briefly
discussed earlier in two different 2006 publications: Henry T. Greely, Prediction, Litigation, Privacy, and
Property: Some Possible Legal and Social Implications of Advances in Neuroscience, supra note 56, at 141–42;
and Charles Keckler, Cross-Examining the Brain, supra note 56, at 544.
cheaper, resolution of many claims involving pain. The legal system, as well as the
honest plaintiffs and defendants within it, would benefit.
A greater understanding of pain also might lead to broader changes in the
legal system. For example, emotional distress often is treated less favorably than
direct physical pain. If neuroscience were to show that, in the brain, emotional
distress seemed to be the same as physical pain, the law might change. Perhaps
more likely, if neuroscience could provide assurance that sincere emotional pain
could be detected and faked emotional distress would not be rewarded, the law
again might change. Others have argued that even our system of criminal punishment might change if we could measure, more accurately, how much pain
different punishments caused defendants, allowing judges to let the punishment
fit the criminal, if not the crime.80 A “pain detector” might even change the
practice of medicine in legally relevant ways, by giving physicians a more certain
way to check whether their patients are seeking controlled substances to relieve
their own pain or whether they are seeking them to abuse or to sell for someone
else to abuse.
In at least one case, a researcher who studies the neuroscience of pain was
retained as an expert witness to testify regarding whether neuroimaging could provide evidence that a claimant was, in fact, feeling pain. The case settled before the
hearing.81 In another case, a prominent neuroscientist was approached about being
a witness against the admissibility of fMRI-based evidence of pain, but, before
she had decided whether to take part, the party seeking to introduce the evidence
changed its mind. This issue has not, as of the time of this writing, reached the
courts yet, but lawyers clearly are thinking about these uses of neuroscience. (And
note that in some administrative contexts, the evidentiary rules will not apply
in their full rigor, possibly making the admission of such evidence more likely.)
Do either functional or structural methods of detecting pain work and, if so,
how well? We do not know. These studies share many of the problems outlined
in Section IV. The studies are few in number, with few subjects (and usually sets
of subjects that are not very diverse). The experiments—usually involving giving college students a painful stimulus—are different from the experience of, for
example, older people who claim to have low back pain. Independent replication
is rare, if it exists at all. The experiments almost always report that, on average, the
group shows a statistically significant pattern of activation that differs depending
on whether they are receiving the painful stimulus, but the group average does
not in itself tell us about the sensitivity or specificity of such a test when applied
to individuals. And the statistical and technical issues are daunting.
In the area of pain, the issue of countermeasures may be the most interesting, particularly in light of the experiments conducted with hypnotized subjects.
Does remembered pain look the same in an fMRI scan as currently experienced
80. Adam J. Kolber, How to Improve Empirical Desert, 75 Brook. L. Rev. 429 (2009).
81. Greg Miller, Brain Scans of Pain Raise Questions for the Law, 323 Science 195 (2009).
pain? Does the detailed memory of a kidney stone pain look any different from
the present sensation of low back pain? Can a subject effectively convince himself
that he is feeling pain and so appear to the scanner to be experiencing pain? The
answers to these questions are clear—we do not yet know.
Pain detection also would raise legal questions. Could a plaintiff be forced
to undergo a “pain scan”? If a plaintiff offered a pain scan in evidence, could
the defendant compel the plaintiff to undergo such a scan with the defendant’s
machine and expert? Would it matter if the scan were itself painful or even dangerous? Who would pay for these scans and for the experts to interpret them?
Detecting pain would be a form of neuroscience evidence with straightforward and far-reaching applications to the legal system. Whether it can be done,
and, if so, how accurately it can be done, remain to be seen. So does the legal
system’s reaction to this possibility.
VII. Conclusion
Atomic physicist Niels Bohr is credited with having said “It is always hard to
predict things, especially the future.”82 It seems highly likely that the massively
increased understanding of the human brain that neuroscience is providing will
have significant effects on the law and, more specifically, on the courts. Just what
those effects will be cannot be accurately predicted, but we hope that this guide
will provide some useful background to help judges cope with whatever neuroscience evidence comes their way.
82. This quotation has been attributed to many people, especially Yogi Berra, but Bohr seems to
be the most likely candidate, even though it does not appear in anything he published. See discussion
in Henry T. Greely, Trusted Systems and Medical Records: Lowering Expectations, 52 Stan. L. Rev. 1585,
1591–92 n.9 (2000). One of the authors, however, recently had a conversation with a scientist from
Denmark, who knew the phrase (in Danish) as an old Danish saying and not something original with
Bohr.
References on Neuroscience
Fundamental Neuroscience (Larry R. Squire et al. eds., 3d ed. 2008).
Eric R. Kandel et al., Principles of Neural Science (4th ed. 2000).
The Cognitive Neurosciences (Michael S. Gazzaniga ed., 4th ed. 2009).
Reference Guide on Mental Health Evidence
PAUL S. APPELBAUM
Paul S. Appelbaum, M.D., is the Elizabeth K. Dollard Professor of Psychiatry, Medicine,
and Law, and Director, Division of Law, Ethics, and Psychiatry, Department of Psychiatry,
Columbia University and New York State Psychiatric Institute.
CONTENTS
I. Overview of Mental Health Evidence, 815
A. Range of Legal Cases in Which Mental Health Issues Arise, 815
1. Retrospective, contemporaneous, and prospective assessments, 817
2. Diagnosis versus functional impairment, 819
B. Mental Health Experts, 821
1. Psychiatrists, 821
2. Psychologists, 824
3. Other mental health professionals, 826
C. Diagnosis of Mental Disorders, 828
1. Nomenclature and typology—DSM-IV-TR and DSM-5, 828
2. Major diagnostic categories, 831
3. Approaches to diagnosis, 834
4. Accuracy of diagnosis of mental disorders, 839
5. Detection of malingering, 839
D. Functional Impairment Due to Mental Disorders, 841
1. Impact of mental disorders on functional capacities, 841
2. Assessment of functional impairment, 842
E. Predictive Assessments, 846
1. Prediction of violence risk, 846
2. Predictions of future functional impairment, 851
F. Treatment of Mental Disorders, 852
1. Treatment with medication, 853
2. Psychological treatments, 858
3. Treatment of functional impairments, 860
4. Electroconvulsive and other brain stimulation therapies, 861
5. Psychosurgery, 863
6. Prediction of responses to treatment, 863
G. Limitations of Mental Health Evidence, 865
1. Limits of psychodynamic theory, 865
2. Ultimate issue testimony, 867
II. Evaluating Evidence from Mental Health Experts, 869
A. What Are the Qualifications of the Expert? 869
1. Training, 870
2. Experience, 871
3. Licensure and board certification, 873
4. Prior relationship with the subject of the evaluation, 875
B. How Was the Assessment Conducted? 877
1. Was the evaluee examined in person? 877
2. Did the evaluee cooperate with the assessment? 879
3. Was the evaluation conducted in adequate circumstances? 880
4. Were the appropriate records reviewed? 881
5. Was information gathered from collateral informants? 882
6. Were medical diagnostic tests performed? 883
7. Was the evaluee’s functional impairment assessed directly? 884
8. Was the possibility of malingering considered? 884
C. Was a Structured Diagnostic or Functional Assessment Instrument or
Test Used? 885
1. Has the reliability and validity of the instrument or test been
established? 885
2. Does the person being evaluated resemble the population for
which the instrument or test was developed? 886
3. Was the instrument or test used as intended by its developers? 887
D. How Was the Expert’s Judgment Reached Regarding the Legally
Relevant Question? 889
1. Were the findings of the assessment applied appropriately to the
question? 889
III. Case Example, 892
A. Facts of the Case, 892
B. Testimony of the Plaintiff’s Expert on Negligence, 893
C. Questions for Consideration, 893
D. Testimony of the Plaintiff’s Expert on Damages, 893
E. Questions for Consideration, 894
References on Mental Health Diagnosis and Treatment, 895
References on Mental Health and Law, 895
I. Overview of Mental Health Evidence
A. Range of Legal Cases in Which Mental Health Issues Arise
Evidence presented by mental health experts is common to a broad array of legal
cases—criminal and civil. In the criminal realm, these include assessments of
defendants’ mental states at the time of their alleged offenses (e.g., criminal responsibility and diminished capacity1) and subsequent to the offenses, but prior to the
initiation of the adjudicatory process (e.g., competence to consent to a search
or waive Miranda rights2). As cases move toward adjudication, evaluation may
be required of defendants’ competence to stand trial or to represent themselves
at trial.3 Postconviction, mental health evidence may be introduced with regard
to sentencing, including suitability for probation and conditions of probation.4
Capital cases uniquely may raise questions regarding a condemned prisoner’s
competence to waive appeals or to be executed.5 Postconfinement, mental health
considerations may enter into parole determinations. Indeed, the development of
1. 18 U.S.C. § 17 (defining standard and burden of proof for insanity defense); Clark v. Arizona,
548 U.S. 735 (2006) (upholding restrictions on the use of mental-disease evidence to negate mens rea).
2. See Thomas Grisso, Evaluating Competencies: Forensic Assessments and Instruments (2002);
Miranda v. Arizona, 384 U.S. 436 (1966) (holding confessions inadmissible unless suspect made aware
of rights and waives them); Colorado v. Connelly, 479 U.S. 157 (1986) (holding that mental condition
alone will not make a confession involuntary under the Due Process Clause of the Fourteenth Amendment but may be used as a factor in assessing the voluntariness of a defendant’s confession); United States v. Elrod, 441 F.2d 353 (5th Cir. 1971) (holding
that a person of subnormal intelligence may be deemed incapable of giving consent). See Wayne R.
LaFave, Search and Seizure 92–93 (2004); Wayne R. LaFave, Criminal Procedure 363–65 (2004);
Brian S. Love, Comment: Beyond Police Conduct: Analyzing Voluntary Consent to Warrantless Searches by
the Mentally Ill and Disabled, 48 St. Louis U. L.J. 1469 (2004).
3. Dusky v. United States, 362 U.S. 402 (1960) (establishing standard for competence to stand
trial); Pate v. Robinson, 383 U.S. 375 (1966) (holding that the Due Process Clause of the Fourteenth
Amendment does not allow a mentally incompetent criminal defendant to stand trial); Faretta v.
California, 422 U.S. 806 (1975) (upholding defendant’s right to refuse counsel and represent himself);
Indiana v. Edwards, 554 U.S. 164 (2008) (finding that the standards for competency to stand trial and
to represent oneself need not be the same).
4. Roger W. Haines, Jr., et al., Federal Sentencing Guidelines Handbook §§ 5B1.3(d)(5),
5D1.3(d)(5), 5H1.3 (2007–2008).
5. See Ford v. Wainwright, 477 U.S. 399 (1986) (upholding the common law bar against
executing the insane and holding that a prisoner is entitled to a judicial hearing before he may be
executed); Stewart v. Martinez-Villareal, 523 U.S. 637 (1998) (holding that death row prisoners are
not barred from filing incompetence to be executed claims by dismissal of previous federal habeas
petitions); Panetti v. Quarterman, 551 U.S. 930 (2007) (ruling that defendants sentenced to death
must be competent at the time of their execution); Atkins v. Virginia, 536 U.S. 304 (2002) (finding
that executing the mentally retarded constitutes cruel and unusual punishment under the Eighth
Amendment); Rees v. Peyton, 384 U.S. 312 (1966) (formulating the test for competency to waive
further proceedings as requiring that the petitioner “appreciate his position and make a rational
choice with respect to continuing or abandoning further litigation or on the other hand whether he
is suffering from a mental disease, disorder, or defect which may substantially affect his capacity in
the premises.”).
specialty services for probationers and parolees with mental disorders suggests that
mental health professionals’ input at this stage is likely to increase in the future.6
Mental health evidence in civil litigation is frequently introduced in personal
injury cases, where emotional harms may be alleged with or without concomitant
physical injury.7 Issues of contract may turn on the competence of a party at the
time that the contract was concluded or whether that person was subject to undue
influence,8 and similar questions may be at the heart of litigation over wills and
gifts.9 Broader questions of competence to conduct one’s affairs are considered in
guardianship cases,10 and more esoteric ones may arise in litigation challenging a
person’s competence to enter into a marriage or to vote.11 Suits alleging infringement of the statutory and constitutional rights of persons with mental disorders
(e.g., under the Americans with Disabilities Act or the Civil Rights of Institutionalized Persons Act) often involve detailed consideration of psychiatric diagnosis and treatment and of institutional conditions.12 Allegations of professional
6. Jennifer Skeem & Jennifer Eno Louden, Toward Evidence-Based Practice for Probationers and
Parolees Mandated to Mental Health Treatment, 57 Psychiatric Servs. 333 (2006).
7. Dillon v. Legg, 441 P.2d 912 (Cal. 1968) (allowing recovery based on emotional distress not
accompanied by physical injury); Molien v. Kaiser Foundation Hospitals, 616 P.2d 813 (Cal. 1980)
(holding that plaintiff who is direct victim of negligent act need not be present when act occurs to
recover for subsequent emotional distress); Rodriguez v. State, 472 P.2d 509 (Haw. 1970) (permitting
recovery where a reasonable person would suffer serious mental distress as a result of defendant’s
behavior); Roes v. FHP, Inc., 985 P.2d 661 (Haw. 1999) (allowing assessment of damages for
negligent infliction of emotional distress when plaintiff was in actual physical peril, even if no injury
was suffered); Albright v. United States, 732 F.2d 181 (D.C. Cir. 1984) (holding that alleging mental
distress is sufficient to confer standing); Cooper v. FAA, No. 07-1383 (N.D. Cal. Aug. 2008), rev’d and
remanded, 596 F.3d 538 (9th Cir. 2010) (discussing mental distress as a result of disclosure of personal
information); Sheely v. MRI Radiology Network, P.A., 505 F.3d 1173 (11th Cir. 2007) (holding
damages available under § 504 of the Rehabilitation Act when emotional distress was foreseeable).
8. See generally E. Allan Farnsworth, Contracts 228–33 (2004); John Parry & Eric Y. Drogin,
Mental Disability Law, Evidence, and Testimony 151–52, 185–86 (2007).
9. See generally William M. McGovern, Jr. & Sheldon F. Kurtz, Wills, Trusts and Estates Including
Taxation and Future Interests 292–99 (2004); Parry & Drogin, supra note 8, at 149–51, 182–85.
10. Parry & Drogin, supra note 8, at 138–47, 177–81.
11. Id. at 54. Doe v. Rowe, 156 F. Supp. 2d 35 (D. Me. 2001) (finding a state law denying
the vote to anyone under guardianship by reason of mental disability in violation of the Equal
Protection Clause of the U.S. Constitution and Title II of the Americans with Disabilities Act (ADA));
Missouri Protection & Advocacy Servs. v. Carnahan, 499 F.3d 803 (8th Cir. 2007) (upholding a state
law allowing disenfranchisement of persons under guardianship because it permits individualized
determinations of capacity to vote).
12. Pennsylvania Dep’t of Corrections v. Yeskey, 524 U.S. 206 (1998) (holding that ADA
coverage extended to prisoners); Clark v. State of California, 123 F.3d 1267 (9th Cir. 1997) (finding
state not immune on Eleventh Amendment grounds to suit alleging discrimination under ADA by
developmentally disabled inmates); Gates v. Cook, 376 F.3d 323 (5th Cir. 2004) (upholding District
Court’s finding that prison conditions, including inadequate mental health provisions, violated the
Eighth Amendment of the U.S. Constitution); Gaul v. AT&T, Inc., 955 F. Supp. 346 (D.N.J. 1997)
(finding that depression and anxiety disorders may constitute a mental disability under the ADA);
Anderson v. North Dakota State Hospital, 232 F.3d 634 (8th Cir. 2000) (finding that a plaintiff’s fear
malpractice by mental health professionals, including failure to protect foreseeable
victims of a patient’s violence,13 invariably call for mental health expert testimony,
as do commitment proceedings for the hospitalization of persons with mental
disorders14 or who are alleged to be dangerous sexual offenders.15
1. Retrospective, contemporaneous, and prospective assessments
Depending on the questions at issue in a given proceeding, evaluators may be
asked to assess the state of mind—including diagnosis and functional capacities—of
a person at some point in the past, at present, or in the future.
Retrospective assessments are called for when criminal defendants assert
insanity or diminished responsibility defenses, claiming that their state of mind at
the time of the crime should excuse or mitigate the consequences of their behaviors, or when questions are raised about competence at some point in the past to
waive legal rights (e.g., waiver of Miranda rights).16 In civil contexts, challenges
to the capacity of a now-deceased testator to write a will or of a party to enter
into a contract, among other issues, will call for a similar look back at a person’s
functioning at some point in the past.17 A variety of sources of information are
available for such assessments. In some cases (e.g., in criminal proceedings), the
defendant is likely to be available for clinical examination, whereas in other
cases he or she will not be able to be assessed directly (e.g., challenges to a will).
Although the person being evaluated will usually have an interest in portraying
him- or herself in a particular light, a direct assessment can nonetheless be valuable
in assessing the consistency of the reported symptoms with other aspects of the
history and current status of the person. Whether or not the person can be assessed
directly, information from persons who were in contact with the person before
and during the time in question, including direct reports and contemporaneous
of snakes did not limit ability to work); Sinkler v. Midwest Prop. Mgmt., 209 F.3d 678 (7th Cir. 2000)
(holding driving phobia did not substantially limit major life activity of working and hence was not an
impairment under the ADA); McAlinden v. County of San Diego, 192 F.3d 1226 (9th Cir. 1999), cert.
denied, 120 S. Ct. 2689 (2000) (reversing summary judgment against plaintiff who alleged that anxiety
and somatoform disorders impaired major life activities of sexual relations and sleep); Steele v. Thiokol
Corp., 241 F.3d 1248 (10th Cir. 2001) (finding major life activity under the ADA of interacting with
others not substantially impaired by obsessive–compulsive disorder).
13. Tarasoff v. Regents of the Univ. of California, 551 P.2d 334 (Cal. 1976).
14. Addington v. Texas, 441 U.S. 418 (1979) (holding that standard of proof for involuntary
commitment is clear and convincing evidence); O’Connor v. Donaldson, 422 U.S. 563 (1975)
(holding unconstitutional the confinement of a nondangerous mentally ill person capable of surviving
safely in freedom alone or with assistance).
15. Kansas v. Hendricks, 521 U.S. 346 (1997); Kansas v. Crane, 534 U.S. 407 (2002).
16. Predicting the Past: Retrospective Assessment of Mental States in Litigation (Robert I.
Simon & Daniel W. Shuman eds., 2002); Bruce Frumkin & Alfredo Garcia, Psychological Evaluations
and Competency to Waive Miranda Rights, 9 The Champion 12 (2003).
17. See Thomas G. Gutheil, Common Pitfalls in the Evaluation of Testamentary Capacity, 35 J. Am.
Acad. Psychiatry & L. 514 (2007); Farnsworth, supra note 8, at 228–33.
records, is usually an essential part of the evaluation. Sometimes the available data
from all of these sources are so limited or contradictory that they will not allow a
judgment to be made of a person’s state of mind at a point in the past. However,
most experienced forensic evaluators appear to believe that conclusions regarding
past mental state can often be reached with a reasonable degree of certainty if sufficient information is available.18
The most straightforward task for a mental health professional is to evaluate a
person’s current mental state. In criminal justice settings, concerns about a person’s
current competence to exercise or waive rights will call for such evaluations (e.g.,
competence to stand trial or to represent oneself at trial).19 Civil issues calling for
contemporaneous assessments include workers’ compensation and other disability
claims and litigation alleging emotional harms due to negligent or intentional torts,
workplace discrimination, and other harm-inducing situations.20 At the core of an
assessment of current mental state is the diagnostic evaluation described below. As
in all evaluations in legal contexts, careful consideration needs to be given to the possibility that persons being assessed will manipulate their presentation for secondary gain.21
In contrast to contemporaneous assessments, the evaluation of a person’s future
mental state and consequent behaviors is fraught with particular difficulty, especially
when the outcome being predicted occurs at a relatively low frequency.22 Such
predictive assessments may come into play in the criminal process when bail is set,23
at sentencing,24 and as part of probation and parole decisions.25 They often involve
18. Robert I. Simon, Retrospective Assessment of Mental States in Criminal and Civil Litigation: A
Clinical Review, in Simon & Shuman, supra note 16, at 1, 8; McGregor v. Gibson, 248 F.3d 946, 962
(10th Cir. 2001) (stating that although disfavored, retrospective determinations of competence may
be allowed in cases when a meaningful hearing can be conducted).
19. See Dusky v. United States, 362 U.S. 402 (1960) (holding that a criminal defendant must
understand the charges and be able to participate in his defense); Godinez v. Moran, 509 U.S. 389
(1993) (holding that a defendant competent to stand trial is also sufficiently competent to plead guilty
or waive the right to legal counsel).
20. See, e.g., Kent v. Apfel, 75 F. Supp. 2d 1170 (D. Kan. 1999); Quigley v. Barnhart, 224 F.
Supp. 2d 357 (D. Mass. 2002); Rivera v. City of New York, 392 F. Supp. 2d 644 (S.D.N.Y. 2005);
Lahr v. Fulbright & Jaworski, L.L.P., 164 F.R.D. 204 (N.D. Tex. 1996).
21. See United States v. Binion, 132 F. App’x 89 (8th Cir. 2005) (upholding an obstruction of
justice conviction and sentencing determination based on a finding that defendant had feigned mental
illness). See discussion, infra, Section I.C.2.
22. Joseph M. Livermore et al., On the Justifications for Civil Commitment, 117 U. Pa. L. Rev.
75–96 (1968).
23. United States v. Salerno, 481 U.S. 739 (1987); United States v. Farris, 2008 WL 1944131
(W.D. Pa. May 1, 2008).
24. Tex. Code Crim. Proc. Ann. art. 37.071 (Vernon 1981); Barefoot v. Estelle, 463 U.S. 880
(1983).
25. See 28 C.F.R. § 2.19 (2008) for parole determination factors. For probation determination
factors, see 18 U.S.C.A. § 356 (2008). See generally Neil Cohen, The Law of Probation and Parole
§§ 2, 3 (2008).
estimates of the probable effectiveness of treatment, especially in the juvenile justice
system, where the lack of amenability of juveniles to mental health treatment is
frequently a key consideration in decisions regarding transfer to adult courts.26 Predictions regarding behavior related to mental disorders are also seen in civil cases,
for example, in the civil commitments of persons with mental disorders and in the
newer statutes authorizing the commitment of dangerous sex offenders.27 Damage
assessments in civil cases alleging emotional harms will usually call for some estimate
regarding the duration of symptoms and response to treatment.28 The inescapable
uncertainties of the course of mental disorders and their responsiveness to interventions create part of the difficulty in such assessments, but an equally important
contribution is made by the unknowable contingencies of life. Will a person’s
spouse leave or will the person lose his job or his home? As a consequence, will
the person return to drinking, stop taking medication, or reconnect with friends
who have continued to engage in criminal behaviors? At best, predictive assessments can lead to general statements of probability of particular outcomes, with an
acknowledgment of the uncertainties involved.29
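The difficulty of predicting low-frequency outcomes can be made concrete with a short worked calculation. The figures below are hypothetical, chosen only for illustration and not drawn from this guide: even an instrument that classifies correctly 90% of the time, for both those who will and those who will not exhibit the outcome, is wrong about most of the individuals it flags when the outcome occurs in only 1% of the population.

```python
# Illustration of the base-rate problem in predictive assessment.
# All numbers are hypothetical, chosen only to make the arithmetic concrete.

def positive_predictive_value(sensitivity, specificity, base_rate):
    """Probability that a positive prediction is correct (Bayes' rule)."""
    true_positives = sensitivity * base_rate
    false_positives = (1 - specificity) * (1 - base_rate)
    return true_positives / (true_positives + false_positives)

# Suppose an instrument correctly flags 90% of people who will exhibit the
# outcome (sensitivity) and correctly clears 90% of those who will not
# (specificity), but the outcome occurs in only 1% of the population.
ppv = positive_predictive_value(0.90, 0.90, 0.01)
print(f"{ppv:.2%}")  # prints 8.33% -- most positive predictions are wrong
```

In this sketch, more than 11 of every 12 people flagged by the instrument would never exhibit the predicted outcome, which is one reason predictive testimony is best framed as a general statement of probability rather than a categorical forecast.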
2. Diagnosis versus functional impairment
A diagnosis of mental disorder per se will almost never settle the legal question in
a case in which mental health evidence is presented. However, a diagnosis may
play a role in determining whether a claim or proceeding can go forward. The
clearest example in criminal law is embodied in the insanity defense, where the
impairments of understanding, appreciation, and behavioral control that comprise
the various standards must be based, in one popular formulation, on a “mental
disease or defect.”30 In the absence of a diagnosis of mental disorder (including
mental retardation and the consequences of injury to the brain), an affirmative
26. Michael G. Kalogerakis, Handbook of Psychiatric Practice in Juvenile Court 79–85 (1992).
27. See O’Connor v. Donaldson, 422 U.S. 563 (1975) (finding that a state may not confine a
citizen who is nondangerous and capable of living by herself or with aid); for an example of a sex
offender civil commitment statute, see Minn. Stat. § 253B.185 (2008). The constitutionality of civil
commitment for dangerous sex offenders was upheld in Kansas v. Hendricks, 521 U.S. 346 (1997)
(setting forth the procedures for the commitment of convicted sex offenders deemed dangerous due
to a mental abnormality).
28. Gary B. Melton et al., Psychological Evaluations for the Courts: A Handbook for Mental
Health Professionals and Lawyers 413–14 (2007).
29. For a more detailed discussion of predictive assessment regarding future dangerousness, see
Section I.E.
30. The American Law Institute standard for the insanity defense reads, “a person is not
responsible for criminal conduct if at the time of such conduct as a result of mental disease or defect he
lacks substantial capacity either to appreciate the criminality of his conduct or to conform his conduct
to the requirements of the law.” Model Penal Code and Commentaries § 4.01(1) (Official Draft and
Revised Comments 1985) (adopted by American Law Institute, May 24, 1962). The federal insanity
defense was codified in the Insanity Defense Reform Act of 1984, codified at 18 U.S.C. § 17. See also
Durham v. United States, 214 F.2d 862 (D.C. Cir. 1954) (“[A]n accused is not criminally responsible
defense of insanity will not prevail.31 Comparable situations exist in civil commitment proceedings and work disability determinations.32
Even where the presence of a mental disorder is not an absolute prerequisite
to claims involving mental state, it will often play a de facto threshold role. Thus,
evidence in cases involving claims of incompetence (e.g., to engage in a contractual relationship) or emotional harms will often address the presence of a diagnosis,
even though that may not strictly be required.33 In these cases, failure to establish a
diagnosis may be taken by a factfinder as an indicator of the probable lack of validity of the claim. That is, it may be assumed that unless an underlying disorder can
be identified, the claimed impairments are bogus. Thus, conflicting testimony over
the presence or absence of a diagnosis is common in cases in which mental health
evidence is offered, even when not mandated by the operative legal standard.
Notwithstanding the threshold role played by a mental disorder diagnosis in
many cases, the ultimate legal issue usually will turn on the impact of the mental
disorder on the person’s functional abilities.34 Those abilities may relate to the
person’s cognitive capacities, including the capacity to make a legally relevant
decision (e.g., granting consent for the police to conduct a warrantless search,
altering a will) or the capacity to behave in a particular way (e.g., conforming
one’s conduct to the requirements of the law, cooperating with an attorney in
one’s own defense, resisting undue influence), or both (e.g., skill as a parent,
competence to proceed with criminal adjudication). The former set of capacities can be denoted as decisional capacities and the latter set as performative capacities.
Many of the legal questions to which mental health evidence may be relevant will
involve a determination of the influence of a mental state or disorder on one or
both of these sets of capacities. The mere presence of a mental disorder will almost
always be insufficient for that purpose. Mental disorder in a criminal defendant,
for example, if it does not interfere substantially with competence to stand trial,
does not present a basis for postponing adjudication of the case.35 Some degree of
mental disorder, including dementia, without affecting relevant abilities, does not
provide grounds for voiding a will.36 The point can be generalized to all criminal
and civil competency determinations, most assessments of emotional harms, and
if his unlawful act was the product of mental disease or defect.”); note United States v. Brawner, 471
F.2d 969 (D.C. Cir. 1972), which overturned the Durham Rule (or “product test”).
31. Tennard v. Dretke, 542 U.S. 274 (2004); Bigby v. Dretke, 402 F.3d 551 (5th Cir. 2005).
32. Addington v. Texas, 441 U.S. 418 (1979) (setting the burden of proof required for
involuntary civil commitment as requiring clear and convincing evidence); and Social Security
Administration Listing of Impairments, available at http://www.ssa.gov/disability/professionals/
bluebook/listing-impairments.htm.
33. Farnsworth, supra note 8, §§ 4.6–4.8, at 228–34.
34. Grisso, supra note 2.
35. United States v. Passman, 455 F. Supp. 794 (D.D.C. 1978); United States v. Valierra, 467
F.2d 125 (9th Cir. 1972).
36. Rossi v. Fletcher, 418 F.2d 1169 (D.C. Cir. 1969); In re Estate of Buchanan, 245 A.D.2d
642 (3d Dept. 1997).
probably to the majority of cases in which mental health testimony is offered:
Unless a mental disorder can be shown to have affected a person’s functional
capacity, decisional or performative, a diagnosis of mental disorder per se will not
be determinative of the outcome.37
Despite its importance to the adjudicative process, mental health evidence
is often introduced in the context of a serious stigma that attaches to mental disorders38 and considerable confusion regarding their nature, consequences, and
susceptibility to treatment.39 Diagnoses of mental disorders often are perceived to
be less reliable and more subjective than diagnoses of other medical conditions.40
Symptoms of mental disorders may be seen as reflections of moral weakness or lack
of will, and the impact of disorders on functional abilities may not be recognized, or
occasionally may be exaggerated.41 The potential impact and limits of current treatments are not widely understood. Indeed, even the distinctions among the various types of mental health professionals are a frequent source of confusion.42 The remainder of Section I of this reference
guide provides background to clarify these issues; Section II considers questions
specifically related to the introduction of evidence by mental health experts.
B. Mental Health Experts
Evidence related to mental state and mental disorders may be presented by experts
from a number of disciplines, but it is most commonly introduced by psychiatrists
or psychologists.
1. Psychiatrists
Psychiatrists are physicians who specialize in the diagnosis and treatment of mental disorders.43 After college, they complete 4 years of medical school, during
37. For a brief overview of competency evaluations, see Patricia A. Zapf & Ronald Roesch,
Mental Competency Evaluations: Guidelines for Judges and Attorneys, 37 Ct. Rev. 28 (2000), available at
http://aja.ncsc.dni.us/courtrv/cr37/cr37-2/CR37-2ZapfRoesch.pdf. For the underlying standard for
competency to stand trial, see Dusky v. United States, 362 U.S. 402 (1960).
38. Bruce G. Link et al., Measuring Mental Illness Stigma, 30 Schizophrenia Bull. 511 (2004).
39. Bruce G. Link et al., Stigma and Coercion in the Context of Outpatient Treatment for People with
Mental Illnesses, 67 Soc. Sci. & Med. 409 (2008).
40. Thomas A. Widiger, Values, Politics, and Science in the Construction of the DSMs, in Descriptions
and Prescriptions: Values, Mental Disorders, and the DSMs 25 (John Z. Sadler ed., 2002).
41. Michael L. Perlin, “Half-Wracked Prejudice Leaped Forth”: Sanism, Pretextuality, and Why and
How Mental Disability Law Developed as It Did, 10 J. Contemp. Legal Issues 3 (1999); Michael L. Perlin,
“You Have Discussed Lepers and Crooks”: Sanism in Clinical Teaching, 9 Clinical L. Rev. 683 (2003);
Michael L. Perlin, The Hidden Prejudice: Mental Disability on Trial (2000).
42. The degree of popular confusion is underscored by the results of a Web-based search for
“psychiatrist vs. psychologist,” which turns up a remarkably large number of Web sites attempting to
explain the differences between the two professions.
43. Narriman C. Shahrokh & Robert E. Hales, American Psychiatric Glossary 157 (2003).
which they spend approximately 2 years in preclinical studies (e.g., physiology,
pharmacology, genetics, pathophysiology), followed by 2 years of clinical rotations in hospital and clinic settings (e.g., medicine, surgery, pediatrics, obstetrics/
gynecology, orthopedics, psychiatry).44 Graduating medical students who elect to
specialize in psychiatry enter residency programs of at least 4 years’ duration.45
Accredited residencies must currently offer at least 4 months in a primary care
setting in internal medicine, family medicine, or pediatrics, and at least 2 months
of training in neurology.46 The remainder of a resident’s time is spent learning
psychiatry, including inpatient, outpatient, emergency, community, and consultation settings, and with exposure to the subspecialty areas of child and adolescent,
geriatric, addiction, and forensic psychiatry. Residents will be taught how to use
treatment techniques, among them medications and various forms of psychotherapy. Elective time is usually available to pursue particular interests in greater
depth or to engage in research. Didactic seminars, including sessions on neuroscience, genetics, psychological theory, and treatment, and supervision sessions with
experienced psychiatrists (and sometimes mental health professionals from other
disciplines) complement the clinical experiences.47
After completion of 4 years of residency training, a psychiatrist is designated as
“board eligible,” that is, able to take the certification examination of the American
Board of Psychiatry and Neurology in adult psychiatry.48 Successful completion of
this examination process results in the psychiatrist being designated “board certified.” Psychiatrists who desire more intensive training in a subspecialty area of
psychiatry—for example, child and adolescent or addiction psychiatry—can take a
1- or 2-year fellowship in that area. The psychiatrist who has completed an accred-
44. Medical schools in the United States are accredited by the Liaison Committee on Medical
Education, which establishes general curricular and other standards that all schools must meet.
Standards are available at http://www.lcme.org/standard.htm. Students can elect to extend their
medical school training by taking additional time to conduct research or to obtain complementary
training (e.g., in public health).
45. Residents who choose to combine adult and child psychiatry training can do so in a 5-year
program, or can follow their 4 years of adult residency with 2 years of child training. Some residents
will also extend their residency training by adding a year or more during which they conduct
laboratory or clinical research.
46. Psychiatric residencies are accredited by the Accreditation Council on Graduate Medical
Education. Program requirements are available at http://www.acgme.org/acwebsite/rrc_400/400_
prindex.asp.
47. See descriptions of several leading psychiatry residency training programs on their Web
sites: Columbia University (http://www.cumc.columbia.edu/dept/pi/residency/index.html); Johns
Hopkins University (http://www.hopkinsmedicine.org/Psychiatry/for_med_students/residency_
general/); Harvard/Longwood Psychiatry Residency (http://harvardlongwoodpsychiatry.org/).
48. Information regarding qualifications for board certification and the examination process is
available from the American Board of Psychiatry and Neurology at http://www.abpn.com/Initial_
Psych.htm.
ited fellowship49 is eligible for additional board certification in that subspecialty.50
Although fellowship training and board certification indicate expertise in a particular area of psychiatry, some psychiatrists are recognized by the courts as having
developed equivalent levels of expertise by virtue of extensive clinical experience
and self-designed instruction (e.g., continuing education courses, remaining current
with the professional literature).51
Forensic psychiatry is the subspecialty that focuses on the interrelationships
between psychiatry and the law.52 Hence, forensic psychiatrists are particularly
likely to offer evidence as part of court proceedings. Fellowship training in
forensic psychiatry involves a 1-year program in which fellows are taught forensic
evaluation for civil and criminal litigation and become involved in the treatment
of persons with mental disorders in the correctional system.53 They also learn
about the rules and procedures for providing evidence in legal proceedings and for
working with attorneys. However, training and/or board certification in forensic
psychiatry are not necessarily the best qualification for expertise in a particular
case. Although forensic psychiatrists are likely to have more expertise than general
psychiatrists for certain kinds of evaluations that are the focus of forensic training
(e.g., competence to stand trial, emotional harms), when issues are raised concerning other substantive areas of psychiatry (e.g., the effects of psychopharmacological
agents on a civil defendant’s ability to drive at the time of an accident that allegedly
resulted in injury to the plaintiff), a psychiatrist who specializes in that area will
often have greater expertise than someone with forensic training.
49. Accredited subspecialty training is currently available in addiction, child and adolescent,
forensic, and geriatric psychiatry, and in psychosomatic medicine. Psychiatrists are also eligible for
training in hospice and palliative medicine, pain medicine, and sleep medicine. See accreditation
standards at http://www.acgme.org/acwebsite/rrc_400/400_prindex.asp. Fellowship programs also
exist in some subspecialty areas for which accreditation and board certification are not available, e.g.,
research, psychopharmacology, and public and community psychiatry.
50. Typically, when new subspecialties are recognized and accreditation standards are developed,
a certain period of time (e.g., 5 years) is allowed for psychiatrists who have gained expertise in that area
by virtue of experience or alternative training to achieve board certification. Thus, many psychiatrists
who are today board certified in a subspecialty have not completed a fellowship.
51. For a comparable determination involving a counselor, see Leblanc v. Coastal Mech. Servs.,
LLC, 2005 WL 5955027 (S.D. Fla. Sept. 7, 2005) (quoting Jenkins v. United States, 307 F.2d 637
(D.C. Cir. 1962) for the proposition that the determination of a psychologist’s competence to render
an expert opinion is a case-by-case matter based on knowledge, not claim to a professional title).
52. See the definition of forensic psychiatry offered by the American Academy of Psychiatry
and the Law: “Forensic psychiatry is a medical subspecialty that includes research and clinical practice
in the many areas in which psychiatry is applied to legal issues,” available at http://www.aapl.org/
org.htm. Psychiatrists who have been certified in adult or child psychiatry by the American Board
of Psychiatry and Neurology, and who have completed a forensic psychiatry fellowship, can take the
examination for subspecialty certification in forensic psychiatry. A description of the requirements for
certification can be found at http://www.abpn.com/fp.htm. Board certification must be renewed by
taking a recertification examination every 10 years.
53. See the accreditation standards in forensic psychiatry at http://www.acgme.org/acWebsite/
downloads/RRC_progReq/406pr703_u105.pdf.
2. Psychologists
Psychologists have received graduate training in the study of mental processes and
behavior.54 Only a subset of psychologists evaluate and treat persons with psychological or behavioral problems; they may be termed clinical, counseling, health,
neuro-, rehabilitation, or school psychologists. In contrast, many psychologists
teach and/or pursue research in one of the academic aspects of the field (e.g.,
cognitive, developmental, or social psychology), or provide consultation of a
nonclinical nature (e.g., organizational or industrial psychology).55 Independent
practice in psychology requires licensure from the appropriate state licensing
board and generally requires a doctoral degree and postgraduate clinical experience. Although use of the term psychologist is restricted in many jurisdictions to
licensed psychologists,56 the term may be applied in some settings to persons with
master’s-level training in psychology.57
After college, students who enter graduate doctoral programs generally
require 4 to 6 years to complete their training. Those who intend to pursue clinical work generally receive training in clinical, counseling, or school psychology.58
Accredited programs in these areas are required to provide a minimum of three
academic years of graduate study, and students are required in addition to take
a year of clinical internship.59 Course work must include study of biological
aspects of behavior, cognitive and affective aspects of behavior, social aspects of
behavior, history and systems of psychology, psychological measurement, research
methodology, and techniques of data analysis. Students also must be taught about
54. The American Psychological Association defines the field of psychology in this way:
“Psychology is the study of the mind and behavior. The discipline embraces all aspects of the human
experience—from the functions of the brain to the actions of nations, from child development to
care for the aged. In every conceivable setting from scientific research centers to mental health care
services, ‘the understanding of behavior’ is the enterprise of psychologists.” http://74.125.45.104/
search?q=cache:JKti-_3SfkQJ:www.apa.org/about/+psychologist+definition&hl=en&ct=clnk&cd=9
&gl=us.
55. See id. for the American Psychological Association’s characterization of the subspecialties
in psychology.
56. See, e.g., Mass. Gen. Laws ch. 112, § 122; N.Y. Educ. Law § 7601.
57. Note that the American Psychological Association urges that the use of the term be restricted
to persons with doctoral degrees in psychology: “Psychologists have a doctoral degree in psychology
from an organized, sequential program in a regionally accredited university or professional school . . .
it is [the] general pattern to refer to master’s-level positions as counselors, specialists, clinicians, and so
forth (rather than as ‘psychologists’).” http://74.125.45.104/search?q=cache:JKti-_3SfkQJ:www.apa.
org/about/+psychologist+definition&hl=en&ct=clnk&cd=9&gl=us.
58. Other psychology programs offer training in experimental, social, and cognitive psychology,
for example, with the intent of producing graduates who will pursue research or teaching careers, but
will not engage in clinical work. United States v. Fishman, 743 F. Supp. 713, 723 (N.D. Cal. 1990)
(excluding the expert testimony of a social psychologist holding a Ph.D. in sociology).
59. Accreditation of programs in clinical, counseling, and school psychology is undertaken by
the Commission on Accreditation of the American Psychological Association. Accreditation standards
are available at http://www.apa.org/ed/accreditation/.
individual differences in behavior, human development, dysfunctional behavior
or psychopathology, and professional standards and ethics. A practicum experience, which usually involves placement in an agency or clinic that provides
psychological services, is part of the training experience. Prior to receiving their
degrees, students must also complete a 1-year clinical internship, which is often
taken at a clinical facility that is separate from their graduate school.60
Psychology graduate programs award either the Ph.D. or Psy.D. (“professional psychology”) degree.61 Ph.D. programs generally place greater emphasis on
research training, with students required to complete a research project and write
a dissertation. Psy.D. programs ordinarily stress clinical issues and training and
have less rigorous research requirements.62 Supervised work may involve some
combination of psychological treatment (e.g., individual or group psychotherapy)
and the use of standardized testing techniques (i.e., “psychological tests”). Once
licensed, psychologists can practice independently. At present, two states permit
psychologists who complete additional training requirements to prescribe medications, although physicians’ groups remain strongly opposed to the practice.63
Fellowships in subspecialty areas of psychology are becoming more common,
although they are not always linked to subspecialty certification processes. Among
the areas in which fellowships have been developed is forensic psychology, generally a 1-year program, with didactic and clinical training in forensic evaluation.64
Certification in forensic psychology through an examination process is available
for psychologists who have completed a fellowship in the field or who have at
least 5 years of experience in forensic psychology.65 As with psychiatry, whether
the expertise of a forensic psychologist is relevant to a particular legal issue will
vary and needs to be considered on a case-by-case basis.
60. Id.
61. See sample Ph.D. program curricula for programs at University of Illinois at Urbana-Champaign (http://www.psych.uiuc.edu/divisions/clinicalcommunity.php); Indiana University
(http://bl-psy-appsrv.ads.iu.edu:8080/graduate/courses/clinical.asp); and University of California at
Los Angeles (http://www.psych.ucla.edu/Grads/Areas/clinical.php). See sample Psy.D. curricula for
programs at Massachusetts School of Professional Psychology (http://www.mspp.edu/academics/
degree-programs/psyd/default.asp); and Wisconsin School of Professional Psychology (http://www.
wspp.edu/courseswspp.html).
62. For a discussion of the so-called “Vail model” on which Psy.D. training is based, see John C.
Norcross & Patricia H. Castle, Appreciating the PsyD: The Facts, 7 Eye on Psi Chi 22 (2002), available
at http://www.psichi.org/pubs/articles/article_171.asp.
63. N.M. Stat. Ann. § 61-9 (2002); La. Rev. Stat. § 37:2371-78 (2004). Note that the New
Mexico statute is set to expire in 2010 under a sunset provision. N.M. Stat. Ann. 61-9-19 (2002).
64. See, e.g., the description of the program at the University of Massachusetts Medical School
at http://www.umassmed.edu/forensicpsychology/index.aspx.
65. Certification is provided by the American Board of Forensic Psychology. Requirements for
candidates are available at: http://www.abfp.com/.
3. Other mental health professionals
Persons with a variety of other forms of training provide mental health services,
including services that generally are referred to as psychotherapy or counseling,
with individuals, couples, or groups. The best established of these professions is
social work. Schools of social work offer 2-year programs that lead to a master’s
degree (MSW), and students can elect a track that is often referred to as psychiatric social work, which involves instruction and experience in psychotherapy.66
Graduate social workers can obtain state licensure after the completion of a
period of supervised practice and an examination, resulting in their designation
as a “licensed independent clinical social worker (LICSW),” with variation in
nomenclature across the states.67 Social workers may offer psychotherapeutic or
counseling services through social service agencies or in private practice. Recently,
a subspecialty of forensic social work has begun to develop, involving social
workers with experience in the criminal justice system.68
Another group that offers mental health services, which may include psychotherapy or counseling and medications, consists of master’s- or doctoral-level nurses. A
growing number of nursing schools are developing programs that are termed
“psychiatric nursing.”69 Nursing practice is regulated by state law and hence varies
across jurisdictions, but master’s-level nurses (sometimes referred to as “nurse practitioners”) can achieve a status that allows them to provide psychotherapy and to
dispense medications, although they may need to have a supervisory arrangement
with a physician for the latter.70
Other master’s-level mental health professionals include persons who may
be called psychologists, counselors, marital and family therapists, group therapists,
and a variety of other terms.71 Because state law generally does not regulate the
66. See, e.g., the curricula for social work training at Columbia (http://www.columbia.edu/cu/
ssw/admissions/pages/programs_and_curriculum/index.html) and at Smith (http://www.smith.edu/
ssw/geaa/academics_msw.php).
67. The Association of Social Work Boards provides an overview of state licensure requirements
at http://www.datapathdesign.com/ASWB/Laws/prod/cgi-bin/LawWebRpts2DLL.dll/EXEC/0/
0j6ws4m1dqx37r1ce43dq091bxya.
68. See the description of forensic social work offered by the National Association of Forensic
Social Work at http://www.nofsw.org/html/forensic_social_work.html. Postgraduate certification
programs for forensic social workers are also beginning to be developed, e.g., at the University of
Nevada at Las Vegas (http://socialwork.unlv.edu/PGC_forensic_social_work.html).
69. See the listing of training programs in psychiatric nursing, with links to their curricula,
provided by the American Psychiatric Nurses Association at http://www.apna.org/i4a/pages/index.
cfm?pageid=3311.
70. Sharon Christian et al., Overview of Nurse Practitioner Scopes of Practice in the United
States—Discussion, Center for Health Professions, University of California, San Francisco (2007), at
http://www.acnpweb.org/i4a/pages/index.cfm?pageid=3465.
71. See, e.g., the variety of mental health professionals listed by the National Alliance on Mental
Illness at http://www.nami.org/Content/ContentGroups/Helpline1/Mental_Health_Professionals_
Who_They_Are_and_How_to_Find_One.htm. United States v. Huber, 603 F.2d 387, 399 (2d Cir.
1979) (affirming trial court’s rejection of expert testimony on defendant’s mental state from a professor
of economics who was also a certified psychoanalyst).
practice of psychotherapy—although the use of titles such as “psychologist” or
“psychotherapist” may be restricted—there is no barrier to persons with variable
levels of training in mental health opening independent practices.72 This includes
persons with degrees in educational psychology (M.Ed. or Ed.D.), clergypersons
(who may have had some training in pastoral counseling in seminary), and members of disciplines unrelated to mental health. Because of the unregulated nature
of their practices, they are largely beyond the reach of professional oversight and
discipline.
Although psychiatrists and doctoral-level psychologists generally provide
expert evidence related to mental health issues, courts will sometimes admit testimony from other mental health professionals.73 Given that training and experience
vary considerably, and titles may be used inconsistently, an individualized inquiry
into the qualifications of the proposed expert is usually required.
72. The classic study, albeit now somewhat outdated, is Daniel Hogan, The Regulation of
Psychotherapists (1979); see also Geoffrey Marczyk & Ellen Wertheimer, The Bitter Pill of Empiricism:
Health Maintenance Organizations, Informed Consent and the Reasonable Psychotherapist Standard of Care,
46 Vill. L. Rev. 33 (2001).
73. Leblanc v. Coastal Mech. Servs., LLC, 2005 WL 5955027 (S.D. Fla. Sept. 7, 2005) (finding
a marriage and family counselor holding a Ph.D. in family therapy, bachelor’s and master’s degrees in
psychology, and a record of relevant publications may be qualified to offer helpful testimony about
a plaintiff’s alleged psychological condition); Jenkins v. United States, 307 F.2d 637, 646 (D.C. Cir.
1962) (“The critical factor in respect to admissibility is the actual experience of the witness and the
probable probative value of his opinion. . . . The determination of a psychologist’s competence to
render an expert opinion based on his findings as to the presence or absence of mental disease or defect
must depend upon the nature and extent of his knowledge. It does not depend upon his claim to the
title ‘psychologist.’”); United States v. Azure, 801 F.2d 336, 342 (8th Cir. 1986) (“The social worker
was most likely qualified as an expert under Rule 702”); see also United States v. Raya, 45 M.J. 251
(1996) (finding that trial court’s admission of expert testimony from a social worker on whether the
victim suffered from PTSD was not an abuse of discretion) and United States v. Johnson, 35 M.J. 17
(1992) (holding social worker qualified to render opinion that child suffered trauma). Note, however,
that not all courts have been receptive to social worker testimony offered as expert opinion on the
diagnosis of PTSD, e.g., Neely v. Miller Brewing Co., 246 F. Supp. 2d 866 (S.D. Ohio 2003); Blackshear
v. Werner Enters., Inc., 2005 WL 6011291 (E.D. Ky. May 19, 2005). For more restrictive approaches
to testimony by non-Ph.D. psychologists, see also State v. Bricker, 321 Md. 86 (Md. Ct. App. 1990)
(rejecting expert testimony from a nonpracticing psychologist who did not hold a doctorate and did
not qualify for a reciprocal license under state law); People v. McDarrah, 175 Ill. App. 3d 284, 291
(1988) (affirming the trial court’s rejection as an expert witness of a doctoral candidate who did not
have the experience level required for state registration as a psychologist); Parker v. Barnhart, 67 F.
App’x 495 (9th Cir. 2003) (finding error in an administrative law judge’s failure to call a licensed
psychologist, rather than another expert, as an expert witness for appropriate testimony); Earls v.
Sexton, 2010 U.S. Dist. LEXIS 52980 (M.D. Pa. May 28, 2010) (allowing a nurse practitioner to
testify in a negligence action concerning whether a motor vehicle accident caused psychiatric injuries).
C. Diagnosis of Mental Disorders
1. Nomenclature and typology—DSM-IV-TR and DSM-5
The standard nomenclature and diagnostic criteria for mental disorders in use in
the United States are embodied in the Diagnostic and Statistical Manual of Mental
Disorders, published by the American Psychiatric Association, and now in its fourth
edition with revised text (DSM-IV-TR).74 It is anticipated that the next edition
of the manual (DSM-5) will appear in 2013.75 According to the DSM framework,
the presence of a mental disorder is typically diagnosed by a combination of the
symptoms reported by the patient (e.g., sadness, difficulty falling asleep, anxiety)
and signs observed by the clinician (e.g., attentional difficulties, sad affect, crying).
To qualify for a DSM diagnosis, persons must meet a set of criteria that are characteristic of the disorder.76 The presence of certain signs and symptoms may be
74. American Psychiatric Association, Diagnostic and Statistical Manual of Mental Disorders (4th
ed. text rev. 2000) (hereinafter DSM-IV-TR).
75. An alternative nomenclature and set of criteria used internationally can be found in the
International Classification of Diseases, now in its 10th edition (ICD-10), published by the World Health
Organization. Although the DSM-IV-TR and ICD-10 nomenclature and criteria are generally similar,
there are differences that can result in diagnostic variations in particular cases, depending on which
criteria are applied.
76. E.g., a diagnosis of obsessive–compulsive disorder requires the following:
A. Either obsessions or compulsions:
Obsessions as defined by (1), (2), (3), and (4):
(1) recurrent and persistent thoughts, impulses, or images that are experienced, at some time during
the disturbance, as intrusive and inappropriate and that cause marked anxiety or distress
(2) the thoughts, impulses, or images are not simply excessive worries about real-life problems
(3) the person attempts to ignore or suppress such thoughts, impulses, or images, or to neutralize
them with some other thought or action
(4) the person recognizes that the obsessional thoughts, impulses, or images are a product of his or
her own mind (not imposed from without as in thought insertion)
Compulsions as defined by (1) and (2):
(1) repetitive behaviors (e.g., hand washing, ordering, checking) or mental acts (e.g., praying,
counting, repeating words silently) that the person feels driven to perform in response to an
obsession, or according to rules that must be applied rigidly
(2) the behaviors or mental acts are aimed at preventing or reducing distress or preventing some
dreaded event or situation; however, these behaviors or mental acts either are not connected
in a realistic way with what they are designed to neutralize or prevent or are clearly excessive
B. At some point during the course of the disorder, the person has recognized that the obsessions or
compulsions are excessive or unreasonable.
NOTE: This does not apply to children.
C. The obsessions or compulsions cause marked distress, are time consuming (take more than 1 hour
a day), or significantly interfere with the person’s normal routine, occupational (or academic) functioning, or usual social activities or relationships.
D. If another Axis I disorder is present, the content of the obsessions or compulsions is not restricted
to it (e.g., preoccupation with food in the presence of an Eating Disorder; hair pulling in the presence of Trichotillomania; concern with appearance in the presence of Body Dysmorphic Disorder;
preoccupation with drugs in the presence of a Substance Use Disorder; preoccupation with having
a serious illness in the presence of Hypochondriasis; preoccupation with sexual urges or fantasies in
the presence of a Paraphilia; or guilty ruminations in the presence of Major Depressive Disorder).
mandatory for a diagnosis to be made, but in most cases—given the variable presentation of most mental disorders—only some proportion of signs and symptoms
must be present (e.g., five out of nine).77
Since the influential third edition of DSM in 1980,78 the manual has taken a
“multiaxial” approach to diagnosis. That is, it recognizes that multiple aspects of a
person’s situation—not just the signs and symptoms of disorder—may be relevant
to a full understanding of his or her situation. Currently, there are five DSM axes:
Axis I is for the designation of most mental disorders, including substance abuse;
Axis II covers disorders of personality and mental retardation, which may be present together with or independent of an Axis I disorder; Axis III addresses concurrent medical disorders; Axis IV allows the designation of stressors confronting the
person; and Axis V is a structured scale that speaks to the person’s overall level of
functioning.79 A complete diagnosis in the DSM system requires some notation
regarding all five axes, although clinicians commonly focus on Axes I–III. More
than one condition may be indicated on Axes I–IV; for example, major depressive
disorder and alcohol abuse may coexist on Axis I, and more than one personality
disorder may be noted on Axis II.
E. The disturbance is not due to the direct physiological effects of a substance (e.g., a drug of abuse,
a medication) or a general medical condition.
DSM-IV-TR at 462–463.
77. E.g., among the criteria required to be met for a diagnosis of Major Depressive Episode are:
A. Five (or more) of the following symptoms have been present during the same 2-week period and
represent a change from previous functioning; at least one of the symptoms is either (1) depressed
mood or (2) loss of interest or pleasure.
(1) depressed mood most of the day, nearly every day, as indicated by either subjective report (e.g.,
feels sad or empty) or observation made by others (e.g., appears tearful).
NOTE: In children and adolescents, can be irritable mood.
(2) markedly diminished interest or pleasure in all, or almost all, activities most of the day, nearly
every day (as indicated by either subjective account or observation made by others)
(3) significant weight loss when not dieting or weight gain (e.g., a change of more than 5% of body
weight in a month), or decrease or increase in appetite nearly every day. Note: In children,
consider failure to make expected weight gains.
(4) insomnia or hypersomnia nearly every day
(5) psychomotor agitation or retardation nearly every day (observable by others, not merely subjective feelings of restlessness or being slowed down)
(6) fatigue or loss of energy nearly every day
(7) feelings of worthlessness or excessive or inappropriate guilt (which may be delusional) nearly
every day (not merely self-reproach or guilt about being sick)
(8) diminished ability to think or concentrate, or indecisiveness, nearly every day (either by subjective account or as observed by others)
(9) recurrent thoughts of death (not just fear of dying), recurrent suicidal ideation without a specific
plan, or a suicide attempt or a specific plan for committing suicide.
DSM-IV-TR at 356.
78. American Psychiatric Association, Diagnostic and Statistical Manual of Mental Disorders
(3d ed. 1980).
79. DSM-IV-TR at 27–37.
The DSM approach has been criticized on a number of grounds, at least one
of which is relevant to the evidence likely to be presented in legal proceedings. By
requiring that persons being evaluated meet a certain number of particular criteria
(e.g., five out of nine signs and symptoms of major depressive disorder), the DSM
all but guarantees that there will be people who fall just short of qualifying for a
diagnosis but may nonetheless be experiencing significant symptoms and impairment.80 When a mental disorder diagnosis is required as a threshold determination
for legal purposes, this may preclude a claim or defense based on the presence of a
disorder. In part, DSM compensates for this problem by allowing alternative “not
otherwise specified” diagnoses to be assigned to persons who fail to meet the full
criteria set (e.g., “depressive disorder, not otherwise specified” for persons who
fall short of meeting criteria for major depressive disorder),81 but the problem
remains. Suggestions that a more dimensional approach to diagnosis be adopted,
that is, one that recognizes a spectrum of extent and severity of symptoms along
a continuum associated with a given disorder,82 have so far been rejected in favor
of continuing with the current categorical system.
The goal of the DSM is to provide a typology that is useful to clinicians and
researchers and that reflects the latest psychiatric understanding of mental disorders.83
Periodic revisions, such as the process now under way that will result in DSM-5, are
accomplished by groups of experts, mostly psychiatrists but including some experts
from other disciplines, and are ultimately subject to the review and approval of the
Board of Trustees and Assembly of the American Psychiatric Association. Hence,
the process is sometimes criticized as reflecting social or political biases, as opposed
to science.84 Although such effects cannot be ruled out, to the extent that they exist,
they are likely to be associated with a small number of controversial categories and
proposed categories (e.g., premenstrual dysphoric disorder,85 paraphilic rapism86). In
addition, the DSM itself recognizes—in a cautionary statement in the introduction
to the text—that diagnostic criteria that are appropriate for clinical or research purposes may not map directly onto legally relevant categories.87 Caution is therefore
required in moving between clinical diagnoses and legal conclusions.
80. Harold A. Pincus et al., Subthreshold Mental Disorders: Nosological and Research Recommendations,
in Advancing DSM: Dilemmas in Psychiatric Diagnosis 129 (Katharine A. Phillips et al. eds., 2002).
81. DSM-IV-TR at 381–82.
82. See the papers on dimensional approaches to psychiatric diagnosis published in the
International Journal of Methods in Psychiatric Research, vol. 16, supplement.
83. Because of confusion regarding the connotations of the term “mental illness,” the DSM
eschews its use. All DSM conditions are referred to as “mental disorders.” DSM-IV-TR at xxx–xxxi.
84. Widiger, supra note 40, at 25–41.
85. Anne E. Figert, Women and the Ownership of PMS: The Structuring of a Psychiatric
Disorder (1996).
86. Herb Kutchins & Stuart A. Kirk, Making Us Crazy: DSM, the Psychiatric Bible and the
Creation of Mental Disorders (2003).
87. The cautionary statement reads, in part: “The purpose of DSM-IV is to provide clear
descriptions of diagnostic categories in order to enable clinicians and investigators to diagnose,
Given that publication of DSM-5 is not anticipated until 2013,88 it
is not possible at this writing to specify the changes that will appear in the new
edition. However, current indications are that the major categories of diagnoses
described in the following section will be retained, although specific changes may
be made to individual diagnostic criteria.89 The task force directing the revision
process is considering potential changes to the five-axis structure that has existed
since 1980,90 minor modifications to the core definition of a mental disorder,91
and the introduction of structured assessments of dimensions that cut across
diagnostic categories, such as depressed mood, anxiety, substance use, or sleep
problems.92 Proposed changes prior to publication can be tracked on a Web site
established by the American Psychiatric Association, which also offers a time line
of the steps in the process.93
2. Major diagnostic categories
Some hint of the number and diversity of mental disorders embodied in the
current diagnostic typology is provided by the fact that DSM-IV-TR is approximately 900 pages long. However, the characteristics of the major categories of
disorders that are likely to be relevant in legal proceedings can be summarized
more concisely.94
communicate about, study, and treat people with various mental disorders. It is to be understood
that inclusion here, for clinical and research purposes, of a diagnostic category such as Pathological
Gambling or Pedophilia does not imply that the condition meets legal or other nonmedical criteria
for what constitutes mental disease, mental disorder, or mental disability. The clinical and scientific
considerations involved in categorization of these conditions as mental disorders may not be wholly
relevant to legal judgments, for example, that take into account such issues as individual responsibility,
disability determination, and competency.” DSM-IV-TR at xxxvii.
88. American Psychiatric Association, DSM-5 Development, Timeline, available at http://www.
dsm5.org/about/Pages/Timeline.aspx.
89. American Psychiatric Association, DSM-5 Development, Proposed Draft Revisions to DSM
Disorders and Criteria, available at http://www.dsm5.org/ProposedRevisions/Pages/Default.aspx.
90. American Psychiatric Association, DSM-5 Development, Classification Issues Under Discussion,
available at http://www.dsm5.org/ProposedRevisions/Pages/ClassificationIssuesUnderDiscussion.aspx.
91. American Psychiatric Association, DSM-5 Development, Definition of a Mental Disorder,
available at http://www.dsm5.org/ProposedRevisions/Pages/proposedrevision.aspx?rid=465.
92. American Psychiatric Association, DSM-5 Development, Cross-Cutting Dimensional
Assessment in DSM-5, available at http://www.dsm5.org/ProposedRevisions/Pages/Cross-Cutting
DimensionalAssessmentinDSM-5.aspx.
93. American Psychiatric Association, DSM-5 Development, DSM-5: The Future of Psychiatric
Diagnosis, available at http://www.dsm5.org.
94. These brief summaries of complex and variable conditions are meant to provide an
orientation to the nature and course of major mental disorders. The current edition of the DSM itself
or standard psychiatric textbooks should be consulted for more complete descriptions. Note that for a
diagnosis of any disorder to be made per the DSM, the symptoms must be deemed to “cause clinically
significant distress or impairment in social, occupational, or other important areas of functioning.”
DSM-IV-TR at 7.
• Schizophrenia is a complex psychotic95 disorder, involving delusions, hallucinations, disorganization of thought, speech, and behavior, and social
withdrawal. Social and occupational functioning are markedly impaired.
The course is chronic, marked by periodic exacerbations, and often by
slow deterioration over time.96
• Bipolar disorder (formerly called manic-depressive disorder) is a disturbance
of mood marked by episodic occurrence of both mania and depression.
During manic periods, persons experience elevated, expansive, or irritable
mood, accompanied by such symptoms as grandiosity, racing thoughts and
pressured speech, decreased sleep, and hypersexuality. The course is chronic,
but intermittent, though some patients experience a downward trajectory.97
• Major depressive disorder involves one or more episodes of depression, typically involving depressed mood, loss of pleasure, weight loss, insomnia,
feelings of worthlessness, diminished ability to think or concentrate, and
thoughts of death. Episodes are often, but not always, recurrent.98
• Substance disorders include both substance abuse and substance dependence,
the most common of which are alcohol abuse and dependence. Abuse
consists of “a maladaptive pattern of substance use leading to clinically
significant impairment or distress.”99 Dependence involves, in addition,
signs of tolerance, withdrawal, and lack of success in restricting substance
use. These are chronic, and often relapsing, disorders, though successful
recovery, with or without treatment, is possible.100
• Personality disorders are inflexible, maladaptive, and enduring patterns of
perceiving and relating to oneself, other people, and the external world
that cause functional impairment and distress.101
• Antisocial personality disorder is often seen in criminal courts, because it is
marked by a pervasive pattern of disregard for and violation of the rights of
others. Personality disorders tend to be longstanding and difficult to treat.102
• Dementia is marked by progressive impairment of cognitive abilities,
including memory, language, motor functions, recognition of objects,
and executive functioning.103 The most common form of dementia is
Alzheimer’s disease, the incidence of which increases with age and the
cause of which remains unclear, although in many cases genetics seem
95. Psychotic conditions involve some degree of detachment from reality, characterized by
delusional thinking and hallucinatory perceptions. Id. at 770.
96. Id. at 297–317.
97. Id. at 382–92.
98. Id. at 369–76.
99. Id. at 199.
100. Id. at 192–98.
101. Id. at 686.
102. Id. at 701–06.
103. Id. at 147–71.
Reference Guide on Mental Health Evidence
to play a role.104 Other causes of dementia include multiple small strokes
(“multi-infarct dementia”), trauma, and infection with certain virus-like
agents.
Additional disorders that may have special legal relevance include anxiety disorders
(including post-traumatic stress disorder (PTSD)), dissociative disorders (such as
dissociative identity disorder, formerly multiple personality disorder), impulse
control disorders (such as kleptomania and pyromania), sexual disorders (especially
the paraphilias, such as pedophilia), delirium, and mental retardation.105
The causes of mental disorders remain to be elucidated. However, as a
general proposition, it appears that many mental disorders may derive from
a genetic predisposition that is activated by particular environmental circumstances.106 This hypothesis is supported by extensive studies of the genetics of
mental disorders107 and epidemiological studies showing a relationship between
various environmental factors and occurrence of illness.108 Only rarely at this
point, however, have particular genes and given stressors been linked to a
particular disorder. For example, a genetic variant in an enzyme that regulates
neurotransmitter reuptake has been shown to predispose to depression, but
only when the susceptible person has been exposed to stressful life events.109
104. Matthew B. McQueen & Deborah Blacker, Genetics of Alzheimer’s Disease, in Psychiatric
Genetics: Applications in Clinical Practice (Jordan W. Smoller et al. eds., 2008).
105. Rebrook v. Astrue, 2008 WL 822104 (N.D. W. Va. Mar. 26, 2008) (anxiety disorder);
United States v. Holsey, 995 F.2d 960 (10th Cir. 1993) (dissociative disorder); Coe v. Bell, 89 F. Supp.
2d 922 (M.D. Tenn. 2000) (dissociative identity disorder); United States v. Miller, 146 F.3d 1281 (11th
Cir. 1998) (impulse control disorder); United States v. McBroom, 991 F. Supp. 445 (D.N.J. 1998)
(person receiving treatment for bipolar disorder and impulse control disorder sentenced for possession
of child pornography); United States v. Silleg, 311 F.3d 557 (2d Cir. 2002) (pedophilia determination
in a child pornography case); Fields v. Lyng, 705 F. Supp. 1134 (D. Md. 1988) (kleptomania); United
States v. Warr, 530 F.3d 1152 (9th Cir. 2008) (sentencing of an arsonist diagnosed with pyromania
upheld); Kansas v. Hendricks, 521 U.S. 346 (1997) (upholding commitment of man unable to control
pedophilic impulses); United States v. Gigante, 996 F. Supp. 194 (E.D.N.Y. 1998) (dementia); Johnson
v. City of Cincinnati, 39 F. Supp. 2d 1013 (S.D. Ohio 1999) (estate of man who died from police
restraint during a seizure sued the city under 42 U.S.C. § 1983); Bertl v. City of Westland, 2007 WL
3333011 (E.D. Mich. Nov. 9, 2007) (finding that delirium tremens is an objectively serious medical
need); Atkins v. Virginia, 536 U.S. 304 (2002) (banning the execution of the mentally retarded as
a violation of the Eighth Amendment); In re Hearn, 418 F.3d 444 (5th Cir. 2005); Hamilton v.
Southwestern Bell Tel. Co., 136 F.3d 1047, 1050 (5th Cir. 1998) (recognizing PTSD as a mental
impairment for the purposes of the Americans with Disabilities Act).
106. Michael Rutter & Judy Silberg, Gene-Environment Interplay in Relation to Emotional and
Behavioral Disturbance, 53 Ann. Rev. Psychol. 463 (2002).
107. Jordan W. Smoller et al., Psychiatric Genetics: Applications in Clinical Practice (2008).
108. Ezra Susser et al., Psychiatric Epidemiology: Searching for the Causes of Mental Disorders
(2006).
109. Avshalom Caspi et al., Role of Genotype in the Cycle of Violence in Maltreated Children, 297
Science 851 (2002).
Active efforts are under way to explore this “diathesis/stress hypothesis” in other
mental disorders as well.110
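The diathesis/stress pattern described above, in which risk rises sharply only when a genetic vulnerability and an environmental stressor co-occur, can be illustrated with a short numerical sketch. The illness rates below are entirely hypothetical (they are not drawn from the studies cited) and serve only to show the signature of a gene-environment interaction:

```python
# Hypothetical lifetime illness rates for four groups, chosen only to
# illustrate a diathesis/stress (gene x environment) interaction.
risk = {
    ("no variant", "no stressor"): 0.05,
    ("no variant", "stressor"):    0.07,
    ("variant",    "no stressor"): 0.06,
    ("variant",    "stressor"):    0.25,  # elevated only when both are present
}

def relative_risk(genotype):
    """Risk ratio associated with stressor exposure within one genotype group."""
    return risk[(genotype, "stressor")] / risk[(genotype, "no stressor")]

print(round(relative_risk("no variant"), 1))  # → 1.4: stressor alone, modest effect
print(round(relative_risk("variant"), 1))     # → 4.2: variant plus stressor
```

Because neither the variant nor the stressor alone raises risk much in this sketch, a study that ignored genotype would understate the stressor's effect in the susceptible subgroup, which is the methodological point of the gene-environment studies cited above.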
3. Approaches to diagnosis
The solicitation of symptoms and the observation of signs necessary for a mental
disorder diagnosis can be accomplished with a variety of techniques.
a. Clinical examination
Direct clinical examination of the person whose condition is at issue is still the
core of most mental health evaluations.111 In contrast to general medicine, where
examination involves the laying on of hands, evaluation of mental disorders is
accomplished by careful elicitation of symptoms and observation of signs. A typical
sequence of clinical examination involves exploring with the person being evaluated: the current presenting problem, including the specific symptoms experienced
and the duration of such symptoms; past history of similar symptoms or other
disorders and of treatment for those disorders; developmental history; social and
occupational history; family history; medical history, including a review of current
medical symptoms, medications taken, and substances used (e.g., alcohol, street
drugs, cigarettes); and mental status examination.112 The last category involves
a structured assessment of the person’s mental state, including motor function,
speech, mood and affect, thought process and content, cognitive functioning,
judgment, and insight, along with the presence of ideation or history of self-harm
or harm toward others. Simultaneously, the clinician is observing the person’s
behavior and appearance to glean signs associated with mental disorders.113 If indicated, a physical examination may be performed by the evaluator, if he or she is a psychiatrist who has maintained general clinical skills, or requested from another physician.
The duration of a clinical examination sufficient to diagnose the person’s
condition will vary depending on the complexity of the case, the cooperativeness
of the evaluee, and the questions being addressed. Examinations may take from
one to several hours, sometimes spread over multiple sessions. When previous
records of contact with mental health professionals are available, the clinician will
ordinarily want to review them prior to the clinical examination, so that questions can be targeted more efficiently, and previous conclusions confirmed or
110. Margit Burmeister et al., Psychiatric Genetics: Progress Amid Controversy, 9 Nat. Rev. Genetics
527 (2008).
111. For an overview of the evaluation of mental health problems, see Linda B. Andrews, The
Psychiatric Interview and Mental Status Examination, in The American Psychiatric Publishing Textbook
of Clinical Psychiatry 3 (Robert E. Hales et al. eds., 2008).
112. American Psychiatric Association Work Group on Psychiatric Evaluation, Practice Guideline
for the Psychiatric Evaluation of Adults (Supplement), 163 Am. J. Psychiatry 7 (2006) [hereinafter
Psychiatric Evaluation of Adults].
113. Paula T. Trzepacz & Robert W. Baker, The Psychiatric Mental Status Examination (1993).
rejected.114 Information from collateral sources (e.g., spouses, family members,
friends, other health professionals) can be valuable in confirming the account given
by an evaluee or in providing information not communicated by the evaluee,
especially when an incentive may exist for the person being examined to exaggerate or downplay the nature and extent of symptoms.115 In difficult cases, it
may not be possible to distinguish with reasonable clinical certainty among two or
more possible diagnoses; in such cases, clinicians may assign “rule out” diagnoses,
indicating the range of possibilities and deferring a definitive diagnosis until more
information is available.
b. Structured diagnostic interviews
When a diagnosis is based solely on a clinical examination, which is still most
frequently the case, the clinician is being relied upon to conduct a complete evaluation and to apply the diagnostic criteria accurately. Studies that showed considerable variation in the results of clinical evaluations motivated the development,
largely for research purposes, of structured diagnostic interviews.116 Structured
interviews provide a fixed set of questions—ensuring that important issues are not
omitted from consideration—and a schema for applying the results to the diagnostic framework. Hence, they tend to show increased reliability over unassisted
clinical evaluations. More complete diagnostic interviews may allow consideration
of a large number of diagnostic categories;117 focal interviews clarify whether a
single disorder (e.g., obsessive–compulsive disorder118) or category of disorders is
present (e.g., dissociative disorders119).
The disadvantages of structured diagnostic assessments include the time that
may be required (i.e., more extensive instruments may take several hours to complete) and the fact that many persons respond negatively to an evaluation with a
series of preset questions.120 Many instruments require that the person conducting
the interview be trained in their use; administration by untrained personnel may
not achieve the level of reliability or validity demonstrated in research studies.121
114. Psychiatric Evaluation of Adults, supra note 112, at 16.
115. Id.
116. Robert Spitzer & Joseph Fleiss, A Re-analysis of the Reliability of Psychiatric Diagnosis, 125
Brit. J. Psychiatry 341 (1974).
117. See generally Michael B. First et al., User’s Guide for the Structured Clinical Interview for
DSM-IV Axis I Disorders: SCID-1 Clinician Version (1997).
118. See, e.g., Wayne K. Goodman et al., The Yale Brown Obsessive Compulsive Scale
(YBOCS), 1: Development, Use and Reliability, 46 Arch. Gen. Psychiatry 1006 (1989).
119. See, e.g., Marlene Steinberg, Structured Clinical Interview for DSM-IV Dissociative
Disorders (SCID-D) (1995).
120. Deborah Blacker, Psychiatric Rating Scales, in Comprehensive Textbook of Psychiatry 1032 (Benjamin J. Sadock, Virginia A. Sadock & Pedro Ruiz eds., 9th ed. 2009).
121. See, e.g., the recommended training requirements for the SCID, available at http://www.
scid4.org/training/overview.html.
Although structured diagnostic interviews do not reflect the current standard of
care for clinical purposes, there may be some value in their use for purposes
of forensic evaluation in cases with particularly difficult diagnostic questions.
Diagnostic interviews should be distinguished from instruments that assess the
nature and extent of psychiatric symptomatology.122 The former yield a conclusion about the presence or absence of a psychiatric diagnosis; the latter allow estimates of the type and magnitude of symptoms experienced, regardless of diagnosis.
As with diagnostic interviews, symptom measures may be broad in their scope or
assess a single type of symptom.123 Although they are not likely to be used for the
purpose of diagnosis per se, the results of applying such instruments may be introduced in evidence to establish the severity of symptoms associated with a disorder.
c. Psychological and neuropsychological tests
Formal testing of psychological functions may be used to complement the clinical diagnostic process, but often it is not necessary for a diagnosis to be made.124
Psychological tests such as the Minnesota Multiphasic Personality Inventory (MMPI)
assess multiple dimensions of personality and mental state; research over many years
of use has established correlations between patterns of performance on the MMPI
and particular mental disorders, which may be helpful in establishing or confirming
a diagnosis, particularly when the results of a clinical examination are inconclusive.125
Tests of intelligence, such as the Wechsler Adult Intelligence Scale (WAIS-III),
are important in establishing the presence of mental retardation and determining
its severity.126 Projective tests, such as the famed Rorschach ink-blot test or the
Thematic Apperception Test, were once used more widely than they are today as a
means of probing the nature and content of a person’s thought processes; although
results were said to be helpful for diagnostic purposes, questions about the reliability
and validity of projective measures have limited their use.127 Other tests target personality traits, such as psychopathy, or behavioral characteristics, such as impulsivity,
and may be helpful but not determinative in making a diagnosis of mental disorder.128
122. See, e.g., John E. Overall & Donald R. Gorham, The Brief Psychiatric Rating Scale (BPRS):
Recent Developments in Ascertainment and Scaling, 24 Psychopharmacology Bull. 97 (1988).
123. Compare the BPRS, supra note 122, with the Beck Depression Inventory, in Aaron T. Beck
et al., Manual for the Beck Depression Inventory-II (1996).
124. See discussion in John F. Clarkin et al., The Role of Psychiatric Measures in Assessment and
Treatment, in Hales et al., supra note 111, at 73.
125. Starke R. Hathaway & John C. McKinley, Minnesota Multiphasic Personality Inventory-2
(1989).
126. David Wechsler, Wechsler Adult Intelligence Scale-III Administrative and Scoring Manual
(1997).
127. Scott O. Lilienfeld et al., The Scientific Status of Projective Techniques, 1 Psychol. Sci. Publ.
Int. 27 (2000).
128. See, e.g., Robert Hare, The Psychopathy Checklist-Revised (1991); Ernest S. Barratt,
Impulsiveness and Aggression, in Violence and Mental Disorder: Developments in Risk Assessment 61
(John Monahan & Henry J. Steadman eds., 1996).
No bright line distinguishes psychological from neuropsychological tests, but
in general the latter are focused on assessing the integrity of the functioning of the
brain itself.129 Hence, they may be helpful—and sometimes even essential—in
the diagnosis of states of impaired brain function, such as may occur in the wake
of traumatic brain injury, infections such as meningitis, and learning disabilities.
Neuropsychological testing usually involves the administration of a battery of
measures, each targeting a relatively discrete area of function, such as attention,
memory, verbal abilities, visual recognition, spatial perception, and the like. The
tests selected by neuropsychologists as part of a battery will often vary on the basis
of the person’s history and suspected condition; thus, it is important before accepting a conclusion that “neuropsychological testing showed no signs of abnormality”
to ascertain precisely which functions were specifically assessed.
Neuropsychological testing can be particularly helpful in the diagnosis of
dementia, a condition that may lead to legal challenges to a person’s decisional or
performative capacities. Although the diagnosis may be suggested by elements of
the person’s history (e.g., forgetfulness, disorientation), serial testing of cognitive
functions can provide strong evidence for a progressive disorder.130 The most frequently used test is the Mini-Mental State Examination (MMSE), a brief screening tool, scored out of 30 points, that can be applied by primary care and other clinicians in ordinary
treatment settings. Structural and functional brain imaging can be helpful in ruling
out other causes of the person’s cognitive decline.131
d. Imaging studies
Progress has been made in recent years in the use of radiological techniques to
assist in the diagnosis and evaluation of mental disorders. With the development
of computer-assisted tomography (CAT or CT), a noninvasive technique became
available for clinicians to visualize aspects of the gross structure of the brain.132
CT scans, which use traditional X rays to provide computer-reconstructed pictures of “slices” through the brain, especially when combined with injection of
radio-opaque dye into the bloodstream, permit the detection of intracranial masses
(e.g., tumors), stroke, atrophy (e.g., associated with Alzheimer’s disease and other
dementias), and other deformations of brain structure. More recently, magnetic
resonance imaging (MRI) has replaced CT scans in many of the situations in
which they previously would have been used. MRI offers higher resolution of
129. Clarkin et al., supra note 124.
130. Diane B. Howieson & Muriel D. Lezak, The Neuropsychological Examination, in The
American Psychiatric Publishing Textbook of Neuropsychiatry and Clinical Neurosciences 215–43
(Stuart C. Yudofsky & Robert E. Hales eds., 5th ed. 2008).
131. Marshall F. Folstein et al., Mini-Mental State: A Practical Method for Grading the Cognitive
State of Patients for the Clinician, 12 J. Psychiatric Res. 189 (1975). See discussion infra Section I.C.3.d.
132. Robin A. Hurley et al., Clinical and Functional Imaging in Neuropsychiatry, in Yudofsky &
Hales, supra note 130, 245.
brain structures without exposure to X-rays.133 Regardless of whether CT or
MRI is used, however, it is important to note that, despite evidence for the localization of some brain functions (e.g., speech, vision), the general tendency for the
brain to function as an integrated network limits the conclusions that can be drawn
about a person’s functional abilities on the basis of structural studies alone.134
Functional imaging techniques have augmented the ability of clinicians to
get inside the “black box” of the brain to more directly assess aspects of brain
function. These include functional MRI (fMRI), single-photon emission computerized tomography (SPECT), and positron emission tomography (PET).135 What
they have in common is the capacity to detect changes in such characteristics of
the brain as blood flow or oxygen saturation of the blood that presumably correlate with the activity of a given brain area. Thus, functional imaging can identify
regions with aberrant patterns of activity that may be associated with impaired
function in that area of the brain. Again, however, conclusions relevant to diagnosis or impairment of capacities are limited by the frequent absence of a tight
correlation between functional imaging findings and actual functional impairment
of a sort likely to have legal relevance.136
e. Laboratory tests
Use of standard laboratory tests may be helpful in ruling out general medical
causes of abnormal mental states and behavior. For example, low levels of thyroid
hormone may be associated with a state that resembles a major depressive episode,
vitamin B-12 deficiency can lead to psychosis, and disturbance of the balance of
electrolytes in the blood can cause states of delirium.137 Each of these conditions
is responsive to treatment of the underlying disorder, and all can lead to more
severe and permanent impairments if untreated. Infectious diseases such as HIV,
syphilis, and Lyme disease can present as mental disorders otherwise indistinguishable from depression, mania, and acute psychosis; all can be detected with appropriate blood tests.138 Behavioral abnormalities that may be mistaken for mental
disorders can be caused by several forms of epilepsy, which are usually detectable
133. Id.
134. William R. Uttal, The New Phrenology: The Limits of Localizing Cognitive Processes
in the Brain (2003).
135. Hurley et al., supra note 132, at 261–62.
136. Stephen J. Morse, Moral and Legal Responsibility and the New Neuroscience, in Neuroethics:
Defining the Issues in Theory, Practice and Policy 33 (Judy Illes ed., 2006).
137. H. Florence Kim et al., Laboratory Testing and Imaging Studies in Psychiatry, in Hales et al.,
supra note 111, at 19–49.
138. Glenn J. Treisman et al., Neuropsychiatric Aspects of HIV Infection and AIDS, in Sadock,
Sadock, & Ruiz, supra note 120, at 506–31; Brian A. Fallon, Neuropsychiatric Aspects of Other Infectious
Diseases (non-HIV), in Sadock, Sadock, & Ruiz, supra note 120, at 532–41.
by electroencephalogram (EEG).139 When there is any reason to suspect, on the
basis of a person’s history or the findings of a clinical evaluation, that a general
medical disorder may exist, laboratory testing is an essential aspect of a complete
evaluation. On the other hand, despite many years of investigation of possible
correlates of the major mental disorders in blood, urine, and other bodily fluids,
there are no laboratory tests that can identify schizophrenia, bipolar disorder,
major depression, or other mental disorders.140
f. Previous medical and mental health records
Among the most helpful adjunctive sources of information for a diagnostic assessment are the person’s records of previous contact with the medical and mental
health systems.141 Past records can confirm a person’s account or point to discrepancies that require further exploration (which is particularly important, as
described in Section I.C.4, infra, when malingering is suspected). Such factors
as age of onset, progression of illness, and variability of symptoms, all of which
may affect diagnostic choices, can be determined from records of previous medical and mental health evaluations, as can susceptibility to treatment and need for
other supportive interventions.
4. Accuracy of diagnosis of mental disorders
Diagnostic accuracy has two separate aspects: (1) reliability—the extent to which
two or more examiners of the same person would derive the same diagnosis,
and (2) validity—the extent to which the diagnosis corresponds to the person’s
actual mental state.142 It is axiomatic that reliability is a necessary, but not sufficient, condition for validity. Prior to the introduction of DSM-III in 1980,
several influential studies showed poor reliability of psychiatric diagnosis, even
for major disorders such as schizophrenia.143 Reliability improved with the new,
criteria-based categories that were introduced at that point, but remains greater
for broader categories of diagnosis, such as psychosis, than for finer distinctions,
such as whether a person suffers from schizophrenia or the similar but not identical
syndrome of schizoaffective disorder.144 For most purposes, however, as discussed
139. H. Florence Kim et al., Neuropsychiatric Aspects of Seizure Disorders, in Yudofsky & Hales,
supra note 132, at 649–76.
140. Barry H. Guze & Martha J. Love, Medical Assessment and Laboratory Testing in Psychiatry, in
Sadock, Sadock, & Ruiz, supra note 120, at 996.
141. Psychiatric Evaluation of Adults, supra note 112, at 16.
142. See Robert E. Kendell, Five Criteria for an Improved Taxonomy of Mental Disorders, in Defining
Psychopathology in the 21st Century 3 (John E. Helzer & James J. Hudziak eds., 2002).
143. Spitzer & Fleiss, supra note 116.
144. Robert L. Spitzer et al., DSM-III Field Trials: I. Initial Interrater Diagnostic Reliability, 136
Am. J. Psychiatry 815 (1979); see also DSM-III at 467–72; Joseph D. Matarazzo, The Reliability of
Psychiatric and Psychological Diagnosis, 3 Clinical Psychol. Rev. 103–45 (1983); Peter E. Nathan & James
below (see Section I.D), it is unlikely that differences within broader categories of
diagnosis will have significance for the legal issue at stake.
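The interexaminer agreement that defines reliability is conventionally quantified not as raw percent agreement but with a chance-corrected coefficient such as Cohen's kappa. The sketch below uses hypothetical counts for two examiners who each classify the same 100 persons as having a disorder present or absent:

```python
# Cohen's kappa: agreement between two examiners, corrected for the
# agreement expected by chance given each examiner's marginal rates.
def cohens_kappa(both_yes, a_only, b_only, both_no):
    n = both_yes + a_only + b_only + both_no
    p_observed = (both_yes + both_no) / n
    a_yes = (both_yes + a_only) / n   # examiner A's rate of "disorder present"
    b_yes = (both_yes + b_only) / n   # examiner B's rate of "disorder present"
    p_chance = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical: 100 evaluees; the examiners agree on 40 "present" and
# 40 "absent," and disagree on the remaining 20.
print(round(cohens_kappa(40, 10, 10, 40), 2))  # → 0.6
```

Here the examiners agree on 80% of cases, but because half of that agreement would be expected by chance alone, kappa is only 0.6, which is why the reliability literature reports chance-corrected coefficients rather than raw agreement.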
The validity of a diagnosis of mental disorder depends on the underlying
validity of the diagnostic criteria, that is, the extent to which they accurately
characterize a particular psychiatric disorder; and on the validity of the judgment
of the diagnosing clinician in a given case, that is, how well the clinician has
applied the criteria. Diagnostic criteria can be judged on how well they identify
a syndrome whose symptomatology, heritability, course, and treatment response,
among other variables, differentiate it from similar disorders.145 DSM-IV-TR
diagnostic criteria vary along these dimensions, and it is impossible to make a
general statement about the validity of the diagnostic framework as a whole.146
Again, however, for most legal determinations it is the presence or absence of
any mental disorder and associated levels of functional impairment that will be
at issue, rather than distinctions among similar disorders. How well the criteria
have been applied in a particular case can be determined more easily, whether by
means of cross-examination or by virtue of conflicting expert testimony offered
by the adverse party.
5. Detection of malingering
Because the diagnosis of mental disorders rests heavily on the elicitation of symptoms from the person being evaluated and observations of the person’s behavior,
the possibility of malingering—the deliberate simulation of symptoms of mental
disorder—must always be considered.147 Most commonly, the likelihood of malingering is assessed as part of a clinical evaluation. The pattern of symptoms reported
by the person is compared with known syndromes, and the consistency of his or
her behaviors is observed. Contrary to common belief, mental disorders are not
easy to fake, especially when the deception must be sustained over a period of
time.148
When deception is suspected, efforts to confirm it should begin during the
clinical examination, as the person is offered the opportunity to endorse symptoms
that are unlikely to occur naturally (e.g., “Do you ever feel as though the cars
on the street are talking about you?”) or do not fit the condition from which the
W. Langenbucher, Psychopathology: Description and Classification, 50 Ann. Rev. Psychol. 79 (1999).
Reliability in actual clinical practice may well be less than has been demonstrated in research settings,
especially when the latter make use of structured assessment instruments. See, e.g., M. Katherine Shear
et al., Diagnosis of Nonpsychotic Patients in Community Clinics, 157 Am. J. Psychiatry 581 (2000).
145. The Validity of Psychiatric Diagnosis (Lee N. Robins & James E. Barrett eds., 1989).
146. Kendell, supra note 142.
147. Phillip J. Resnick, Malingering, in Principles and Practice of Forensic Psychiatry 543
(Richard Rosner ed., 2003).
148. Id. at 544.
patient is claiming to suffer.149 Psychological testing can be helpful in detecting
deception; the MMPI-2, for example, has scales that can detect both “faking bad” (i.e., fabricating symptoms) and “faking good” (i.e., hiding symptoms that actually exist).150 Other instruments specifically for the assessment
of malingering also have been developed, with varying degrees of validation.151
Information from records of previous psychiatric or psychological evaluations
can be helpful in determining the congruence of the person’s current symptoms
with past reports and behaviors. In addition, given the difficulty in maintaining a
consistent pattern of deception over a sustained period, data provided by collateral
sources (e.g., family members, roommates, prisoners in adjoining cells, correctional officers, nurses and other hospital staff, and others who have been in contact
with the person) who have observed the person informally outside of the evaluator’s presence can be crucial in distinguishing real from malingered disorders.152
The difficulty of simulating a mental disorder does not imply that it is impossible to do. Indeed, a skilled and determined person can sometimes fool even an
experienced evaluator. Thus, the only honest response that a clinician can give
in almost every circumstance to a question about the possibility of malingering is
that it is always possible, but is more or less likely in this particular case, given the
characteristics of the person being evaluated.153
D. Functional Impairment Due to Mental Disorders
1. Impact of mental disorders on functional capacities
Mental disorders can affect functional capacities in a variety of ways. Among these,
attention and concentration may be impaired by the preoccupations that appear in
anxiety and depressive disorders, or the grosser distractions (e.g., auditory hallucinations) of psychotic disorders.154 Perception is often distorted in psychotic condi-
149. Paul S. Appelbaum & Thomas G. Gutheil, Clinical Handbook of Psychiatry and the Law
248–49 (2007).
150. Roger L. Greene, Malingering and Defensiveness on the MMPI-II, in Clinical Assessment
of Malingering and Deception 159 (Richard Rogers ed., 2008). These scales, especially the most
prominent of them, the “Fake Bad Scale (FBS),” are not without controversy that has sometimes led
courts to rule them inadmissible. David Armstrong, Malingerer Test Roils Personal-Injury Law: “Fake
Bad Scale” Bars Real Victims, Its Critics Contend, Wall Street J., Mar. 5, 2008, at A1. However, the
bulk of the psychological literature appears to support the validity of the FBS and many of the other
MMPI-based malingering scales. Nathaniel W. Nelson et al., Meta-Analysis of the MMPI-2 Fake Bad
Scale: Utility in Forensic Practice, 20 Clinical Neuropsychologist 39 (2006).
151. Richard Rogers, Structured Interviews and Dissimulation, in Clinical Assessment of Malingering
and Deception 301–22 (Richard Rogers ed., 2008).
152. Appelbaum & Gutheil, supra note 149.
153. Resnick, supra note 147.
154. Ronald A. Cohen et al., Neuropsychiatric Aspects of Disorders of Attention, in Yudofsky &
Hales, supra note 132, at 405–44.
tions, as manifest by hallucinations of the auditory, visual, tactile, or other sensory
systems.155 Cognition, encompassing both the process and content of thought, is
also often affected: Thought processes can be impeded by the slowing of thought
in depression, its acceleration in mania, or the scrambling of thought experienced
by persons with schizophrenia or other psychotic disorders; thought content may
be altered by the odd reasoning to which persons with delusions appear to be
prone.156 Motivation to act, even in one’s self-interest, is often globally reduced in
states of intense depression and in schizophrenia.157 Judgment and insight may be
altered under the pressure of delusions.158 Control of behavior can be weakened by
the impulsivity seen in mania and psychosis, the drives of the impulse disorders,
and the use of disinhibiting substances, especially alcohol.159 Any of these impairments in principle could affect a person’s relevant decisional and performative
capacities.
This necessarily incomplete list of the ways in which mental disorders can
affect functional capacities illustrates the vulnerability of almost every aspect of
mental functioning to perturbation. Moreover, although it is common to divide
mental functions into categories such as these for heuristic purposes, most neuroscientists recognize that the brain operates as a unified entity.160 Thus, it is rare
that impairments are limited to a single area of functioning. Impaired concentration, for example, inherently affects cognitive abilities, which in turn may alter
judgment and therefore the person’s choice of behaviors. Although focal deficits
may occur (for example, the anxiety associated with exposure to a phobic stimulus
such as a spider), more severe disorders will have a broader impact on a person’s
functional capacities as a whole.161
2. Assessment of functional impairment
Determining the nature and extent of past, present, or future functional impairment, therefore, is usually the most critical aspect of a mental health evaluation
and subsequent presentation of mental health evidence.
155. André Aleman & Frank Larøi, Hallucinations: The Science of Idiosyncratic Perception
(2008).
156. Ann A. Matorin & Pedro Ruiz, Clinical Manifestations of Psychiatric Disorders, in Sadock,
Sadock, & Ruiz, supra note 120, at 1076–81.
157. Id. at 1092–93.
158. Philippa A. Garety, Insight and Delusions, in Insight and Psychosis 66, 66–77 (Xavier F.
Amador & Anthony S. David eds., 1998).
159. Eric Hollander et al., Neuropsychiatric Aspects of Aggression and Impulse-Control Disorders, in
Yudofsky & Hales, supra note 132, at 535–66.
160. Uttal, supra note 134.
161. The pervasive impact of schizophrenia on all aspects of personality and functioning is the
most extreme example. See Michael J. Minzenberg et al., Schizophrenia, in Hales et al., supra note 111,
at 407–56.
Reference Guide on Mental Health Evidence
a. Clinical examination
As in establishing a diagnosis, the core of the assessment of functional impairment
remains the clinical examination.162 A diagnostic assessment may be integral to the
functional assessment process, suggesting to the examiner areas of possible impairment to be explored in greater depth (e.g., attentional and concentration abilities
in an anxiety disorder; impairments in motivation in a depressive disorder). Beginning in the 1970s, however, there was growing recognition among the mental
health professions that merely establishing a diagnosis is insufficient to permit a
conclusion to be drawn about a legally relevant capacity, because a broad range
of functional impairments can be associated with almost any mental disorder.163
Thus, in addition to a diagnostic assessment, an adequate examination
will explore the person’s perspective on the alleged functional impairment and will
probe for symptoms associated with such impairment. The process involves more
than simply taking the person’s word for the issue in question, for example, that
she was not able to comprehend the details of the contract to which she is a
party, or that he remains incapable of the careful calculations required in his job.
Assessors compare the claimed impairments with the person’s overall history and
other areas of function, looking for congruence or incongruence. For example,
the assertion by a plaintiff that because of being harassed on the job he has been
unable to concentrate sufficiently to work will be more or less plausible depending on the consistency and extent of his symptoms and the degree to which the
impairment may generalize to other areas of his life. Degrees of impairment that
are out of scale with the extent of symptoms or the person’s functional history are
inherently suspect.164
In addition to direct questioning of the evaluee, the use of collateral information can be essential to a valid assessment, particularly when the person has an
incentive to malinger, which will often be the case in legal proceedings.165 Family
members, coworkers, and others who have had an opportunity to observe the
person can provide invaluable information about the nature and extent of impairments, although one must always be alert to the possibility that informants will
be motivated to assist the person by distorting or exaggerating their accounts.
Records of performance, such as educational test results and work evaluations,
especially if generated prior to the filing of the legal claim, may shed somewhat
more objective light on the person’s capacities.166 To the extent that impairments
162. See supra Section I.C.2.a.
163. Michael Kindred, Guardianship and Limitations upon Capacity, in The Mentally Retarded Citizen and
the Law 62 (The President’s Committee on Mental Retardation, 1976); Laboratory of Community
Psychiatry, Harvard Medical School, Competency to Stand Trial and Mental Illness (1973).
164. Richard Rogers, Detection Strategies for Malingering and Defensiveness, in Clinical Assessment
of Malingering and Deception 14 (Richard Rogers ed., 2008).
165. Appelbaum & Gutheil, supra note 149.
166. Melton et al., supra note 28, at 53–55.
may be rooted in disruptions of brain functions per se, neuropsychological testing
can also be helpful in documenting their nature and extent. Increasingly, however, it has been accepted that an unstructured clinical evaluation, even when
supplemented by collateral information, is not necessarily the most accurate tool,
standing on its own, for determining functional capacity.167
b. Structured assessment techniques
As with determination of diagnosis, the evaluation of the limitations of function
due to mental disorders increasingly involves the use of structured assessment techniques.168 Most commonly, these are standardized interviews or data-gathering
protocols (e.g., based on a person’s psychiatric record) designed to ensure that
all relevant information is obtained. In addition, where research has established
the validity of the instruments by demonstrating a correlation between the results
and actual impairments, these techniques may allow a quantitative estimate to
be made of the extent of actual functional deficiencies. A recent compendium
of assessment instruments included structured evaluations that address criminal
defendants’ competence to stand trial, waiver of rights to silence and legal counsel,
criminal responsibility, persons’ parenting capacity, competence to manage
one’s affairs (i.e., need for a guardian or conservator), and competence to consent
to medical treatment and research.169 Given that this area is a rapidly developing
focus of research, instruments to address other legally relevant functional capacities and states—propensity to commit violent or sexual offenses comes quickly to
mind170—are continuously being tested and developed.
Although most assessment techniques rely on information gathered from the
person being evaluated or from existing records, some approaches involve direct
testing of the person’s capacity to perform particular tasks. Examples include
computerized assessment of driving capacities,171 observation of tasks involving
167. Grisso, supra note 2. Surprisingly few studies exist of the reliability of clinical forensic
evaluations. The only U.S. study of actual assessments showed good interrater reliability of evaluations
of competence to stand trial, although many of the reports were deficient in other ways. Jennifer L.
Skeem et al., Logic and Reliability of Evaluations of Competence to Stand Trial, 22 Law & Hum. Behav.
519 (1998). A more recent Australian study found only fair to moderate reliability across assessments
of competence to stand trial, but moderate to good reliability of criminal responsibility evaluations.
Matthew Large et al., Reliability of Psychiatric Evidence in Serious Criminal Matters: Fitness to Stand Trial
and the Defence of Mental Illness, 43 Austl. & N.Z. J. Psychiatry 446 (2009).
168. Id.
169. Id.
170. See, e.g., Christopher D. Webster et al., HCR-20 Assessing Risk for Violence Manual,
Version 2 (1997); Vernon L. Quinsey et al., Violent Offenders: Appraising and Managing Risk (1998);
John Monahan et al., COVR—Classification of Violence Risk, Professional Manual (2005).
171. Maria T. Schultheis et al., The Neurocognitive Driving Test: Applying Technology to the
Assessment of Driving Ability Following Brain Injury, 48 Rehabilitation Psychol. 275 (2003).
the handling and management of money172 and of parenting skills,173 and direct
measurement of such capacities as understanding and reasoning about medical
information when a person’s competence to decide about medical treatment is at
issue.174 In general, these approaches reduce the degree of inference required in
drawing conclusions about a person’s functioning because the person is observed
performing something close to the precise tasks in question. Of course, such techniques may not be relevant when the legal issue relates to the impact of mental
disorder on functional abilities at some time in the past or in the future, especially
if the person’s mental state at present may be different from what it was or will
be. Nonetheless, these can be useful approaches to evaluation in appropriate legal
contexts.
The advantages that attend the use of structured assessment instruments
include the thoroughness of the evaluation, because the likelihood is reduced that
important variables will be omitted, and, in many cases, the existence of a research
base from which conclusions can be drawn regarding the degree of functional
impairment of the person being assessed.175 Indeed,
in some jurisdictions, the use of structured assessments is required for particular
purposes (e.g., evaluation of sexual offenders).176 However, it remains true that
the use of structured assessments performed for the purpose of being introduced
in legal proceedings is variable and far from universal.177 Grisso, the leading
scholar in this area, suggests three reasons why this is still true: (1) it is easier and
may be more lucrative (i.e., where a fixed rate is being paid per evaluation) for
an examiner to avoid the frequently time-consuming use of a structured instrument; (2) many cases involve persons whose functional impairments—or lack of
impairment—are obvious, and use of a structured assessment instrument would
be “overkill”; and (3) perhaps paradoxically, the use of an assessment tool makes
experts more vulnerable to attack on cross-examination.178 To this list should be
added the lack of knowledge of many expert witnesses regarding the existence of
these instruments and a sense that their use denigrates the evaluator’s expertise.
The vulnerability of testimony based on assessment instruments to cross-examination is worth special emphasis. Opinions offered on the basis of “clinical
172. Dan Marson et al., Assessing Financial Capacity in Patients with Alzheimer’s Disease: A
Conceptual Model and Prototype Instrument, 57 Archives Neurology 877 (2000).
173. Marc J. Ackerman & Kathleen Schoendorf, Ackerman-Schoendorf Scales for Parent
Evaluation of Custody Manual (1992).
174. Thomas Grisso & Paul S. Appelbaum, MacArthur Competence Assessment Tool for
Treatment (MacCAT-T) (1998).
175. Grisso, supra note 2, at 45–47.
176. See, e.g., Va. Code Ann. § 37.2-903(C): “Each month the Director shall review the
database and identify all such prisoners who are scheduled for release from prison within 10 months
from the date of such review who receive a score of five or more on the Static-99 or a like score on
a comparable, scientifically validated instrument designated by the Commissioner. . . .”
177. Grisso, supra note 2, at 481.
178. Id. at 481–82.
experience,” which appears to be the norm, are difficult to challenge when expert
witnesses in fact have appropriate training and a good deal of experience with
the condition in question.179 On the other hand, assessment instruments can be
subjected to scrutiny with regard to the empirical database that supports their use,
including their reliability and validity, their acceptance by the relevant professional community, and their probative value in a particular case. There may also
be questions regarding the examiner’s training and experience with the instrument
and whether it was administered in the manner intended by its developers. All of
these are legitimate questions, of course, and an argument can be made that the
introduction of data from assessment instruments into evidence should be held to
a more rigorous standard, because factfinders may give such data greater credence
than unassisted clinical judgment.180 But the undoubted consequence is that the
arguably more reliable and perhaps more valid data from empirically derived
assessment techniques are less likely to be introduced in evidence than evaluators’
subjective judgments of unknown validity.181
E. Predictive Assessments
As noted above,182 predictive assessments are the most challenging evaluations
performed by mental health professionals.183 The most common tasks involve the
prediction of violence risk and of future functional impairment and responses to
treatment.
1. Prediction of violence risk
The probability that a person may commit a violent act at some point in the future
may come into play in the criminal process regarding determinations of suitability
for diversion, bail, sentencing, probation, and parole, and in the civil process in
hearings for civil commitment to psychiatric facilities and sexual offender treat-
179. See 4 Jack B. Weinstein, Weinstein’s Federal Evidence § 702.02 n.1 (2d ed. 2008) on the
liberal admissibility of expert testimony under Federal Rule of Evidence 702; § 702.02[4] nn.25–27
on the trial judge’s broad discretion to admit or exclude expert testimony, to determine its helpfulness
and relevancy, and the application of the “abuse of discretion” standard of review to determinations
of whether a witness qualifies as an expert; § 702.04[1][c] on the typical “academic credentials plus
experience” combination. Bryan v. City of Chicago, 200 F.3d 1092 (7th Cir. 2000) (an expert may
qualify based on academic expertise and practical experience).
180. Christopher Slobogin, Experts, Mental States, and Acts, 38 Seton Hall L. Rev. 1009 (2008).
181. Grisso, supra note 2, at 482.
182. See supra Section I.A.1.
183. Yogi Berra, New York Yankees’ Hall of Fame catcher and philosopher of everyday life, is
purported to have said, “It’s tough to make predictions, especially about the future.” See http://www.famous-quotes-and-quotations.com/yogi-berra-quotes.html. For a discussion of the origin of this
phrase, see Henry T. Greely & Anthony D. Wagner, Reference Guide on Neuroscience, Section VII,
in this manual.
ment programs and when considering the imposition of liability on clinicians
and facilities for failing to protect victims of patients’ violence.184 Although not
all persons for whom such assessments must be made will have mental disorders,
many will, and, in any event, psychiatrists and psychologists are seen by the courts
as having expertise in this area and hence are almost invariably called upon for
these evaluations.185
Persons with serious mental disorders, such as schizophrenia or bipolar disorder, are often considered by the general public to be at high risk for violence.186 However, data on the relationship between serious mental disorders
(schizophrenia is the disorder most frequently studied) and violence are variable.
Although most studies suggest a moderately elevated risk, the proportion of violence accounted for by serious mental disorders is small, probably 3% to 5%, based
on the best available U.S. estimates.187 Data also suggest that the stereotype of
violent mental patients who assault strangers in public places is inaccurate: Most
violence by persons with serious mental disorders is directed at family members
and friends and usually occurs in the living quarters of the perpetrator or the victim.188 Much higher rates of violence are associated with substance use, especially
alcohol use, and with traits such as psychopathy, often found in antisocial personality disorder.189 Indeed, most of the strongest predictors of violence are common
to both persons with serious mental disorders and those without, suggesting that
the impact of the disorders per se is slight.190
a. Approaches to prediction of violence risk
Clinical evaluation of violence risk ordinarily focuses on those variables that have
been shown in empirical research to have the strongest relationship to future
violence,191 whether the information is gleaned directly from the person or
184. See, e.g., Kansas v. Hendricks, 521 U.S. 346 (1997); White v. Johnson, 153 F.3d 197 (5th
Cir. 1998). A unanimous U.S. Supreme Court pointed to the importance of considering empirical
data in identifying circumstances associated with increased risk of violence in Chambers v. United States,
129 S. Ct. 687, 691–93 (2009).
185. See Joanmarie Ilaria Davoli, Psychiatric Evidence on Trial, 56 SMU L. Rev. 2191 (2003).
186. Bernice Pescosolido et al., The Public’s View of the Competence, Dangerousness and Need for
Legal Coercion Among Persons with Mental Illness, 89 Am. J. Pub. Health 1339 (1999).
187. Jeffrey W. Swanson, Mental Disorder, Substance Abuse, and Community Violence: An
Epidemiologic Approach, in Violence and Mental Disorder: Developments in Risk Assessment 101,
101–36 (John Monahan & Henry J. Steadman eds., 1994); Paul S. Appelbaum, Violence and Mental
Disorders: Data and Public Policy (editorial), 163 Am. J. Psychiatry 1319 (2006).
188. Henry J. Steadman et al., Violence by People Discharged from Acute Psychiatric Inpatient Facilities
and by Others in the Same Neighborhoods, 55 Archives Gen. Psychiatry 393 (1998).
189. John Monahan et al., Rethinking Risk Assessment: The MacArthur Study of Mental
Disorder and Violence (2001).
190. Id. at 37–90; Simon Wessely, The Epidemiology of Crime, Violence, and Schizophrenia, 170
Brit. J Psychiatry 11 (1997).
191. Appelbaum & Gutheil, supra note 149, at 56.
derived from collateral informants or from a review of relevant records. These
variables include a history of previous violence, age (violence risk peaks in the
late teens and early twenties, declines slowly through the twenties and thirties,
and drops off precipitously after age 40), male gender, lower socioeconomic
status and employment instability, substance abuse, psychopathic personality traits,
and childhood victimization.192 The evaluation process is complicated by the
fact that literally scores of variables show some significant correlation with future
violence, but each usually has little predictive power.193 However, beginning with the variables noted above, the evaluator estimates the baseline risk
of violence for the person and then adjusts that value by taking into account
foreseeable perturbations to the current equilibrium. When previous violence
has occurred, the risk estimate is adjusted to include those specific variables that
have been associated with violence by this person in the past (e.g., being left by a
girlfriend), including whether they are present at the time of evaluation or likely
to recur in the future.194
The past two decades have seen the development of a growing number of
structured assessment instruments specific to the prediction of future violence risk.
Among the best known of these are the HCR-20,195 the Violence Risk Appraisal Guide (VRAG),196 and the computerized Classification of Violence Risk
(COVR).197 A set of instruments also exists for the prediction of the risk of future
sexual offenses.198 Violence risk-assessment instruments have been developed in
one of two ways: either by assembling known predictors from the research literature and combining them with variables drawn from clinical experience (e.g.,
HCR-20), or by statistical analysis of research data from large subject
populations (e.g., VRAG and COVR). Attempts are then made to validate the
instruments on populations similar to the ones with which it is anticipated they
will be used. The more sophisticated measures yield estimates of the degree of
risk, rather than dichotomous predictions that violence will or will not occur. In
general, the most commonly used instruments have shown a correlation between
the estimated degree of risk and future violence.199
192. Id.
193. Monahan et al., supra note 189, at 163–68.
194. Appelbaum & Gutheil, supra note 149.
195. Webster et al., supra note 170.
196. Quinsey et al., supra note 170.
197. Monahan et al., supra note 170.
198. Calvin M. Langton et al., Actuarial Assessment of Risk for Reoffense Among Adult Sex Offenders:
Evaluating the Predictive Accuracy of the Static-2002 and Five Other Instruments, 34 Crim. Just. & Behav.
37 (2007).
199. Kevin S. Douglas et al., Assessing Risk for Violence Among Psychiatric Patients, 67 J. Consulting
Clinical Psychol. 917 (1999); Grant T. Harris et al., Prospective Replication of the Violence Risk Appraisal
Guide in Predicting Violent Recidivism Among Forensic Patients, 26 Law & Hum. Behav. 377 (2002); John
Monahan et al., An Actuarial Model of Violence Risk Assessment for Persons with Mental Disorders, 56
Psychiatric Servs. 810 (2005).
The literature on prediction is marked by strong and unresolved differences
of opinion over the best basis for the ultimate risk estimate. Partisans of exclusive
reliance on the quantitative predictions generated by structured assessment instruments, which is often referred to as “actuarial” prediction, argue that any attempts
to modify the resulting risk estimates necessarily reduce accuracy.200 Proponents
of clinical evaluation note that exclusive reliance on instrumentation is unwise
because of the inevitable questions about the applicability of the group data on
which an instrument is based to the person being evaluated; the failure of a fixed
set of questions ever to capture all the variables that may be relevant in a particular situation; and the potential uncooperativeness of evaluees with a structured
process.201 Compromise approaches include anchoring the estimate in the actuarial prediction, but allowing clinical judgment to modify the results on the basis
of additional considerations, or using an instrument to structure the evaluation
and ensure its completeness, but allowing the evaluator to reach a judgment on
the basis of the totality of the information. This last approach has been termed
“structured professional judgment,”202 and at least one study has suggested that it
is capable of yielding predictions with reasonable degrees of accuracy.203 It is fair
to say that the question of which approach is best remains unresolved.
b. Limitations of violence risk prediction
A voluminous research literature exists on violence risk prediction. Studies of
predictions by psychiatrists and psychologists in the 1960s and 1970s showed poor
accuracy in judging whether persons with mental disorders and sex offenders would
be likely to be violent at some point after release.204 Indeed, the most frequently
cited conclusion was Monahan’s statement that when mental health professionals
predicted that a person would be violent, they were twice as likely to be wrong as
right.205 The cumulative impact of these findings stimulated a great deal of research
to identify variables that predict violence and their incorporation into both clinical
predictions and the structured assessment instruments described above.206
200. N. Zoe Hilton et al., Sixty-Six Years of Research on the Clinical Versus Actuarial Prediction of
Violence, 34 Counseling Psychologist 400 (2006).
201. Thomas R. Litwack, Actuarial Versus Clinical Assessments of Dangerousness, 7 Psychol. Pub.
Pol’y & L. 409 (2001); Andrew Carroll, Are Violence Risk Assessment Tools Clinically Useful? 41 Austl.
& N.Z. J. Psychiatry 301 (2007).
202. Kevin S. Douglas & P. Randall Kropp, A Prevention-Based Paradigm for Violence Risk
Assessment: Clinical and Research Applications, 29 Crim. Just. & Behav. 617 (2002).
203. Kevin S. Douglas et al., Evaluation of a Model of Violence Risk Assessment Among Forensic
Psychiatric Patients, 54 Psychiatric Servs. 1372 (2003).
204. John Monahan, The Clinical Prediction of Violent Behavior (1981).
205. Id. at 60.
206. John Monahan, Clinical and Actuarial Predictions of Violence: II. Scientific Status, in Modern
Scientific Evidence: The Law and Science of Expert Testimony, vol. 1, at 122, 122–47 (David L.
Faigman et al. eds., 2007).
At this point, it is possible to identify several items of consensus from the
research literature. Violence is not a unitary phenomenon; that is, it occurs for
different reasons, related both to the motivations of the perpetrator and to the
environmental context.207 A bar-room brawl has different roots than a mugging;
the precipitants of spouse abuse bear little similarity to the motivations underlying
a killing that has been premeditated as an act of revenge. Thus, no single variable
or set of variables can be relied upon in all cases to ascertain violence risk. Long-term prediction of violence is inherently inaccurate, due both to the intrinsic
limitations in the prediction of low-frequency events208 and to the difficulty that
clinicians have in anticipating changes in the person and the environment over
time and their effects on the person’s behavior.209 However, shorter-term prediction (i.e., days to weeks) holds greater potential for accuracy. Indeed, recent
studies focused on shorter-term prediction, often from hospital emergency rooms,
have found accuracies for predictions of violence in the range of 40% to 60%.210 It
is worth noting that even when the leading actuarial instruments are used to make
dichotomous judgments of future violence—that is, a cutoff is set to simulate the
clinical prediction process—their rates of accuracy are similar.211 Mental health
professionals, therefore, have been encouraged to move away from attempting to
make dichotomous judgments of dangerousness and toward predictions couched
in terms of the risk of future violence.212 Even here, though, precision has not
yet been attained—and may be unattainable. The state of the art probably allows
well-trained clinicians, especially if they are using structured assessment instruments, to assign persons into high-, medium-, and low-risk groups with reasonable
accuracy. At present, for most categories of persons with mental disorders, the hope
of designating risk categories with greater precision than that is likely illusory.213
When quantitative data are available, however, precision in communication of risk
207. Paul S. Appelbaum, Preface, in Clinical Assessment of Dangerousness: Empirical Contributions
ix–xiv (Georges-Franck Pinard & Linda Pagani eds., 2001).
208. Paul E. Meehl, Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review
of the Evidence (1954).
209. Jennifer L. Skeem et al., Building Mental Health Professionals’ Decisional Models into Tests of
Predictive Validity: The Accuracy of Contextualized Predictions of Violence, 24 Law & Hum. Behav. 607
(2000).
210. That is, 40% to 60% of those who have been predicted to be violent go on to commit
violent acts. Note that because interventions to prevent the predicted violence (e.g., hospitalization)
may be taken with many of these subjects, the figures probably underestimate the proportion of true
positive predictions. In addition, when clinicians predicted that a person would not be violent, they
were almost always correct, with well over 90% of such predictions in most studies being
accurate. See, e.g., Charles W. Lidz et al., The Accuracy of Predictions of Violence to Others, 269 JAMA
1007 (1993); Dale E. McNiel & Renée L. Binder, Clinical Assessment of the Risk of Violence Among
Psychiatric Inpatients, 148 Am. J. Psychiatry 1317 (1991).
211. See studies cited supra note 199.
212. Henry J. Steadman et al., From Dangerousness to Risk Assessment: Implications for Appropriate
Research Strategies, in Mental Disorder and Crime (Sheilagh Hodgins ed., 1993).
213. Webster et al., supra note 170, at 10.
would undoubtedly be enhanced if those data were utilized and if assessors specified
their definitions of the categories being employed.214
The studies on the accuracy of prediction, whether clinical or actuarial, have
typically involved the direct evaluation of the person about whom the prediction
was being made. In many cases, considerable additional information about the
person was available. Opinions about the risk of future violence by persons whom
the evaluator has not examined have never been validated, and there are persuasive
reasons to believe that such predictions are not likely to be highly accurate.215
Such opinions have been introduced, for example, in death penalty cases in which
the prosecution sought to prove that further violence was likely, but the defense
denied the prosecution expert direct access to the defendant.216 If such evidence
is to be introduced, at a minimum, one would expect that the limitations on the
assessor’s knowledge of the evaluee and on the certainty with which conclusions
can be reached would be noted.
2. Predictions of future functional impairment
Cases involving claims of emotional harms, along with disability and workers’
compensation claims, often require that efforts be made to estimate the plaintiff’s
future functional impairment so that damages can be determined accordingly.217
Techniques for the assessment of function were described above. See discussion
supra Section I.D.2. However, these cases call for something more: predictions of
the degree of change in functional impairment due to mental disorders that are
likely to occur over time. In contrast to the structured assessment tools that assist
in the prediction of future violence risk, no instruments have been developed
214. See Kelly M. Babchishin & R. Karl Hanson, Improving Our Talk: Moving Beyond the “Low,”
“Moderate,” and “High” Typology of Risk Communication, 16 Crime Scene 11 (2009). Suggestions for
improving the clarity of risk communications include distinguishing between the likelihood of future
violence and the anticipated severity of the offense, specifying the period for which the prediction is
being made (e.g., “over the next 6 months”), indicating the comparison population for the estimate
(e.g., “risk is high compared with the general population” or “risk is high compared with the
population of persons with similar histories of violence”), and providing both absolute and relative risks
when quantitative data are available (e.g., “risk of future violence over the next year is between 8 and
12%, which is between 4 and 6 times greater than would be expected for the general population”).
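The arithmetic behind the absolute-versus-relative risk example in the preceding note can be sketched as follows. This is a minimal illustration, not part of the source: the 2% general-population base rate used here is an assumed figure chosen so the numbers match the note's example, and the function name is hypothetical.

```python
def relative_risk(absolute_risk_pct, base_rate_pct):
    """Relative risk: the individual's absolute risk divided by the
    rate in the comparison population (here, the general population)."""
    return absolute_risk_pct / base_rate_pct

# Illustrative numbers only: if the general-population rate of violence
# over the next year were 2%, an absolute risk of 8% to 12% would
# correspond to a relative risk of 4 to 6 times the general-population rate.
print(relative_risk(8, 2))   # -> 4.0
print(relative_risk(12, 2))  # -> 6.0
```

Presenting both figures, as the note recommends, guards against the fact-finder hearing "4 to 6 times the risk" without learning that the absolute likelihood remains near 10%.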
215. Brief of Amici Curiae American Psychiatric Association, Barefoot v. Estelle, 463 U.S. 880
(1983) (No. 82-6080).
216. See Barefoot v. Estelle, 463 U.S. 880 (1983); Ron Rosenbaum, Travels with Doctor Death,
Vanity Fair, May 1990, at 141.
217. 20 C.F.R. § 404.1520a (2008). See generally Thomas P. Harding, Psychiatric Disability
and Clinical Decision Making: The Impact of Judgment Error and Bias, 24 Clinical Psychol. Rev. 707
(2004); Harold A. Pincus et al., Determining Disability Due to Mental Impairment: APA’s Evaluation
of Social Security Administration Guidelines, 148 Am. J. Psychiatry 1037 (1991); Cille Kennedy, SSA’s
Disability Determination of Mental Impairments: A Review Toward an Agenda for Research, in The Dynamics
of Disability: Measuring and Monitoring Disability for Social Security Programs 241 (Gooloo S.
Wunderlich et al. eds., 2002); Dan B. Dobbs, The Law of Torts 1048–53, 1087–1110 (2000).
851
Copyright © National Academy of Sciences. All rights reserved.
Reference Manual on Scientific Evidence: Third Edition
Reference Manual on Scientific Evidence
and validated to predict future functional status at this writing. Such predictions
are complicated by the need for simultaneous estimates of several parameters that
affect long-term functional outcome: variables intrinsic to the person (e.g., symptomatic fluctuation, changes in motivation to work), variables that relate to the
environment (e.g., divorce, availability of new categories of jobs), and responses
to treatment (see discussion infra Section I.F.6). Research aimed at identifying
variables associated with some types of future functional impairment exists, but is
largely focused on progressive disorders (e.g., Alzheimer’s disease), and even here
the accuracy of the predictions of forensic evaluators has not been determined.218
Hence, acknowledgment of the uncertainties inherent in these predictions would
appear to be unavoidable for experts undertaking this task.
F. Treatment of Mental Disorders
The nature of available treatments for mental disorders, the probability that
they will be effective, the side effects that they may induce, and the existence
of alternatives are likely to be material to a variety of legal cases. In criminal
proceedings, for example, the continued confinement of a defendant in a psychiatric hospital on the basis of incompetence to stand trial will be based in part
on the probability that treatment of the person will restore capacity;219 involuntary treatment of the defendant will turn on a number of factors, including
the likelihood of success and the side effects and their potential for impairing
the defendant’s defense.220 Decisions about probation and parole of mentally
disordered offenders may also relate to the likelihood that symptoms will remain
in check, and courts may order ongoing treatment as a condition of release.221
Among the civil cases for which treatment-related questions will be at issue are
liability claims for malpractice and failure to protect third parties from patient
violence, claims involving emotional harms (e.g., in calculating the cost of future
care), and issues related to the deprivation of rights of prisoners in correctional
facilities to have adequate mental health treatment.222 Treatment of mental
disorders today offers multiple options for most disorders, often with different
levels of likely effectiveness and varying side-effect profiles. Planning treatment
has become an increasingly complex task.
218. See, e.g., Roy Martin et al., Declining Financial Capacity in Patients with Mild Alzheimer
Disease: A One-Year Longitudinal Study, 16 Am. J. Geriatric Psychiatry 209 (2008).
219. See Jackson v. Indiana, 406 U.S. 715 (1972).
220. See Sell v. United States, 539 U.S. 166 (2003).
221. See, e.g., United States v. Holman, 532 F.3d 284 (4th Cir. 2008).
222. For an overview of the considerable body of case law on this issue, see Michael L. Perlin,
4 Mental Disability Law § 11-4.3 (2d ed. 1989).
Reference Guide on Mental Health Evidence
1. Treatment with medication
The past 50 years have seen the ongoing introduction of new medications for the
treatment of mental disorders. Currently, medications are a mainstay in the treatment of schizophrenia and bipolar disorder; indeed, it is a rare patient who can be
treated successfully for these disorders without medication as part of the treatment
plan.223 Medications are also used commonly to treat and prevent the recurrence
of depression, anxiety disorders, attention-deficit/hyperactivity disorder, and a
large number of other conditions.224 The field of psychopharmacology, as the
treatment of mental disorders with medications is known, has become a complex
and challenging part of psychiatric practice.
a. Targets of medication treatment
As a general rule, medications are targeted at the symptoms of mental disorders,
which may occur in a large number of conditions, rather than being specific for
the treatment of a given disorder. Psychotic phenomena such as delusions and
hallucinations, for example, are generally responsive to antipsychotic medications,
whether the underlying disorder is schizophrenia or bipolar disorder.225 Antianxiety
medications can be effective in primary anxiety disorders (e.g., agoraphobia) or in
anxiety that develops secondary to another condition (e.g., depression).226 Mood
stabilizers, first introduced for bipolar disorder and its variants, can be helpful to
some patients with personality disorders that are marked by fluctuations in mood.227
Medications that aid patients in falling asleep work in many different disorders.228
Moreover, the same drug can have multiple effects. The best-known example
is the selective serotonin reuptake inhibitors (SSRIs), the first and most famous
of which is Prozac (the generic name is fluoxetine).229 Originally introduced for
the treatment of depression, for which they proved effective, SSRIs have since
also proved helpful for anxiety, even in the absence of depression.230 The newer
antipsychotic medications, intended to target psychotic symptoms, can also be
helpful for mania, even when psychosis per se is absent.231 Indeed, one of the
223. Minzenberg et al., supra note 161; Steven L. Dubovsky et al., Mood Disorders, in Hales et
al., supra note 111.
224. See generally Alan F. Schatzberg et al., Manual of Clinical Psychopharmacology (2003).
225. Stephen M. Stahl, Stahl’s Essential Psychopharmacology: Neuroscientific Basis and Practical
Applications 425 (2008).
226. Id. at 726.
227. C. Robert Cloninger & Dragan M. Svrakic, Personality Disorders, in Sadock, Sadock, &
Ruiz, supra note 120, at 2236.
228. Stahl, supra note 225, at 831–39.
229. Prozac was the first SSRI to be introduced to the market, and its use was widely discussed
in popular media, including Peter Kramer’s bestseller, Listening to Prozac (1993).
230. Norman Sussman, Selective Serotonin Reuptake Inhibitors, in Sadock, Sadock, & Ruiz, supra
note 120, at 3191.
231. Stahl, supra note 225, at 689–94.
antipsychotics is often prescribed to aid in sleep, as is one of the newer antidepressants.232 These multiple effects of a single medication are probably due to
its impact on more than one neurotransmitter system.
Another reality of contemporary psychopharmacology is that medications are
often used for indications that have not been approved by the Food and Drug
Administration (FDA).233 FDA approval is required for a new medication to be
marketed in the United States, and approval is granted only after evidence from
clinical trials is presented to the agency demonstrating the efficacy of the drug
for a particular purpose, within a given dosage range, and often with a particular
population.234 Once FDA has granted approval for a compound to be marketed,
however, physicians are free to prescribe it for any purpose for which they believe
it to be indicated, at a dosage of their choosing, and for whichever patients they
believe will benefit—although pharmaceutical companies can advertise its use only
for FDA-approved purposes. Because approval of a single indication for drug use
makes the medication generally available for other purposes as well, and over time
drugs lose patent protection, pharmaceutical companies often have little incentive
to pursue FDA approval for additional indications.235 Thus, many medications
have long been used for purposes other than the one endorsed by FDA, often with
impressive bodies of clinical experience supporting such use.236
As is true for many classes of medications, the precise mechanisms of action
of most psychopharmacological compounds have not yet been established. Most
appear to block or stimulate neuronal receptors in the brain, which trigger or
inhibit the propagation of electrical impulses, and it has been assumed that this
represents their primary mechanism of action.237 Indeed, many compounds interact with multiple receptor systems, perhaps accounting for their efficacy against
a variety of symptoms, as well as the diverse side effects they produce. But other
232. The antidepressant trazodone is a popular sleep-inducing medication, id. at 845; the
antipsychotic quetiapine is also used for this purpose, id. at 848.
233. David C. Radley et al., Off-Label Prescribing Among Office-Based Physicians, 166 Archives
Internal Med. 1021 (2006).
234. Celia J. Winchell, Drug Development and Approval Process in the United States, in Sadock,
Sadock, & Ruiz, supra note 120, at 2988–96; FDA regulatory information on new drug approvals can
be accessed at http://www.fda.gov/Drugs/DevelopmentApprovalProcess/default.htm.
235. Steven R. Salbu, Off-Label Use, Prescription, and Marketing of FDA-Approved Drugs: An
Assessment of Legislative and Regulatory Policy, 51 Fla. L. Rev. 181 (1999); Rebecca Dresser, The Curious
Case of Off-Label Use, 37 Hastings Center Rep. 9 (2007).
236. For example, the various formulations of valproic acid are among the most commonly used
treatments for bipolar disorder, including maintenance treatment to prevent recurrence. Although
an FDA indication was obtained for the treatment of acute mania, long-term maintenance use
is “off-label.” Schatzberg et al., supra note 224. See also Norman Sussman, General Principles of
Psychopharmacology, in Sadock, Sadock, & Ruiz, supra note 120, at 2972–3.
237. Stahl, supra note 225, at 91–122.
mechanisms, such as initiating changes in DNA transcription, are also possible and
remain to be fully explored.238
b. Categories of medications
Although a large number of medications are used to treat the symptoms of mental
disorders, several major categories account for the largest number of prescriptions.
• Antipsychotic medications, first introduced in the 1950s, appear to have
selective effects on psychotic symptoms such as delusions, hallucinations,
and disordered thoughts.239 The first generation of antipsychotics, marked
by the introduction of chlorpromazine, often caused acute neuromuscular
side effects, such as spasms of the muscles, along with a long-term risk of
tardive dyskinesia, a condition characterized by involuntary movements
of the muscles in the face, trunk, and extremities.240 A second generation
of these medications, introduced in the 1990s with great fanfare, presents
lower risks of neuromuscular problems, but several of the most popular
members of this group can cause weight gain, along with diabetes, hyperlipidemia, and increased cardiac risk.241 There does not appear to be a
difference in efficacy between the earlier and later medications.242
• Mood stabilizers were introduced for the treatment of bipolar disorder,
which is characterized by episodic mood swings from mania to depression.243 The first of these drugs was lithium, whose effect was discovered
in the 1940s, but which was not widely adopted in the United States until
the 1970s. Lithium can be very effective, but it often causes problematic
side effects.244 Subsequently, a number of medications that are also effective as treatment for seizure disorders were found to have mood stabilizing
effects as well, and they are generally better tolerated.245
238. Id. at 41–89.
239. Id. at 425.
240. Irene Hurford & Daniel P. van Kammen, First-Generation Antipsychotics, in Sadock, Sadock,
& Ruiz, supra note 120, at 3105–27.
241. Stephen R. Marder, Irene Hurford, & Daniel P. van Kammen, Second-Generation
Antipsychotics, in Sadock, Sadock, & Ruiz, supra note 120, at 3206–40.
242. Jeffrey A. Lieberman et al., Effectiveness of Antipsychotic Drugs in Patients with Chronic
Schizophrenia, 353 New Eng. J. Med. 1209 (2005); Peter B. Jones et al., Randomized Controlled Trial of
Effect on Quality of Life of Second- vs. First-Generation Antipsychotic Drugs in Schizophrenia: Cost Utility of the
Latest Antipsychotic Drugs in Schizophrenia Study (CUtLASS 1), 63 Archives Gen. Psychiatry 1079 (2006).
243. Stahl, supra note 225, at 667–719.
244. James W. Jefferson & John H. Greist, Lithium, in Sadock, Sadock, & Ruiz, supra note 120,
at 3132–45.
245. Robert M. Post & Mark A. Frye, Carbamazepine, in Sadock, Sadock, & Ruiz, supra note
120, at 3073–89; Robert M. Post & Mark A. Frye, Valproate, in Sadock, Sadock, & Ruiz, supra
note 120, at 3278.
• Antidepressants include the older class of tricyclic compounds, which
offered the first effective medication treatment for depression.246 Again,
a less-than-optimal side-effect profile led to efforts to discover alternatives. The SSRI medications turned out to have equal efficacy, but are
generally better tolerated.247 They too, though, have adverse side effects,
including diminished sexual function, a numbing of emotional intensity,
or increased anxiety.248 Data suggesting that SSRIs may lead to suicidal
ideation in some patients remain controversial, but have led to FDA-mandated “black-box” warnings for the drugs.249 A group of non-SSRI,
but chemically related, compounds has effects and side effects similar to
those of the SSRIs, and the medications are often used interchangeably.250
• Antianxiety medications, which began with nonspecific sedatives,
soon moved on to drugs with targeted effects on anxiety per se.251
Benzodiazepines, including the well-known Valium and Librium, were
used as mainstays of anxiety treatment for many years, but carry liabilities that include the potential for abuse and addiction. Today, the much
safer SSRIs and related compounds are the drugs of choice for long-term
treatment of anxiety, as they are for depression, with benzodiazepines
often reserved for situations in which immediate effects are a priority.252
Newer agents have been introduced from entirely different chemical
classes specifically for anxiety.253
This is by no means a complete list of medications for the treatment of mental
disorders, but represents a brief introduction to the major classes that are likely to
be the focus of evidence presented in legal proceedings.
c. Polypharmacy
The use of more than one psychiatric medication for a patient—often called
“polypharmacy”—is common for several reasons.254 First, because medications
246. J. Craig Nelson, Tricyclics and Tetracyclics, in Sadock, Sadock, & Ruiz, supra note 120, at 3259.
247. Sussman, supra note 230.
248. Id.
249. See FDA guidance at http://www.fda.gov/cder/drug/antidepressants/default.htm.
250. These medications include drugs that selectively target the brain’s norepinephrine
transporters, the so-called selective norepinephrine reuptake inhibitors (SNRIs), along with medications
that appear to act on both serotonin and norepinephrine systems. Michael E. Thase, Selective Serotonin-Norepinephrine Reuptake Inhibitors, in Sadock, Sadock, & Ruiz, supra note 120, at 3184–90.
251. Steven Dubovsky, Benzodiazepine Receptor Agonists and Antagonists, in Sadock, Sadock, &
Ruiz, supra note 120, at 3044.
252. Stahl, supra note 225, at 765–71.
253. See, e.g., Anthony J. Levitt, Ayal Schaffer, & Krista Lanctot, Buspirone, in Sadock, Sadock,
& Ruiz, supra note 120, at 3060.
254. Sussman, supra note 236.
typically target symptoms rather than underlying disorders, and most disorders
present with multiple symptoms, there may be an obvious rationale for the use of
more than one agent (e.g., an antidepressant along with a sleep medication for a
patient with depression who is experiencing insomnia). Second, some disorders
that are imperfectly responsive to a single, initial medication may respond to an
augmentation strategy involving the addition of a second medication, often from a
different chemical class (e.g., an antidepressant medication can be augmented with
lithium, thyroid hormone, or a second unrelated antidepressant).255
Although greater efficacy often can be obtained from combined treatment,
there are risks as well. Multiplying medications increases the chance of adverse
effects from both the individual medications and their interactions.256 Hence,
polypharmacy is best reserved for situations in which documented evidence of
benefit exists or a compelling theoretical rationale is present. Failure to apply
these principles accounts for the vaguely disreputable connotation that the term
“polypharmacy” conveys.
d. Side effects
The specific side effects of several classes of medications have been referred to
earlier. A general point to be noted, however, is that all medications have side
effects, even commonly used drugs that are generally thought of as harmless, such
as aspirin or acetaminophen.257 Prescribers balance the positive effects of medication against the range of possible side effects in making recommendations for
treatment to patients, who, of course, retain the right to decide that the adverse
consequences do not warrant the possibility of therapeutic gains.258 It is a reality,
however, that the side effects of psychiatric medications limit the tolerability of
many drugs, even among people who are benefiting from them. Moreover, some
medications may have adverse effects with particular significance in legal settings.259 These include sedation, which may be associated with antipsychotic or
antianxiety medications and sometimes with other classes of drugs, and restricted
expression of emotion, occasionally experienced with the first generation of antipsychotic medications. In the absence of previous exposure to a given medication,
it is difficult to anticipate the side effects that may arise. Clinicians typically monitor those effects and adjust dosage or change medications accordingly.
255. Charles DeBattista & Alan F. Schatzberg, Combination Pharmacotherapy, in Sadock, Sadock,
& Ruiz, supra note 120, at 3322.
256. Sussman, supra note 236.
257. Id. at 2684.
258. See generally Jessica W. Berg et al., Informed Consent: Legal Theory and Clinical Practice
(2001).
259. Sell v. United States, 539 U.S. 166 (2003); Dora W. Klein, Curiouser and Curiouser: Involuntary
Medications and Incompetent Criminal Defendants After Sell v. United States, 13 Wm. & Mary Bill Rts. J.
897 (2005); Debra A. Breneman, Forcible Antipsychotic Medication and the Unfortunate Side Effects of Sell v.
United States, 539 U.S. 166, 123 S. Ct. 2174 (2004), 27 Harv. J.L. & Pub. Pol’y 965 (2004).
e. Efficacy and effectiveness
Efficacy refers to a medication’s ability to reduce or eliminate its target symptoms; effectiveness denotes the extent to which that effect can be achieved in
ordinary clinical treatment.260 An illustration of the difference is evident with
antipsychotic medications, the efficacy of which in controlling the positive symptoms of psychosis has been demonstrated in numerous studies.261 However, in
real-world clinical settings, the effectiveness of these medications, particularly over
the long term, is substantially limited by patients’ reluctance to continue taking
them, despite symptomatic relief.262 This may be due in part to the nature of
some mental disorders, especially schizophrenia, given that affected persons often
deny their impairments.263 But there is no question that the side effects of the
medications lead many patients to stop them, because of their unwillingness to
tolerate the weight gain, lethargy, sexual dysfunction, neuromuscular manifestations, or other side effects that often accompany the use of the drugs.264 Because
demonstrations of efficacy are required for FDA approval to market medications,
it can be assumed that drugs for mental disorders are efficacious for their approved
indications. However, their effectiveness may be more limited, and this can be
an important consideration when predictions of long-term symptom control are
called for in both criminal and civil contexts.
2. Psychological treatments
Although medications are a mainstay for treatment of serious mental disorders, a
variety of psychological treatments may be important as either primary or adjunctive treatments.
a. Psychoanalysis and psychodynamic psychotherapy
Psychoanalysis was developed as a therapeutic technique by Sigmund Freud
and is probably the form of psychotherapy that comes first to mind for most lay
people.265 It involves three to four sessions a week for many years, during which
patients recline on a couch and free associate, with little direction from the analyst, whose job it is to analyze patients’ developing unconscious attachment (or
transference) to the analyst.
260. Gerard E. Hogarty et al., Efficacy Versus Effectiveness, 48 Psychiatric Servs. 1107 (1997).
261. Philip G. Janicak et al., Principles and Practice of Psychopharmacotherapy 118–27 (1993).
262. Lieberman et al., supra note 242.
263. Xavier F. Amador & Henry Kronengold, The Description and Meaning of Insight in Psychosis,
in Insight and Psychosis 15, 15–32 (Xavier F. Amador & Anthony S. David eds., 1998).
264. Diana O. Perkins, Predictors of Noncompliance in Patients with Schizophrenia, 63 J. Clinical
Psychiatry 1121 (2002).
265. T. Byram Karasu & Sylvia R. Karasu, Psychoanalysis and Psychoanalytic Psychotherapy, in
Sadock, Sadock, & Ruiz, supra note 120, at 2746.
Despite its ubiquity in New Yorker cartoons, psychoanalysis is used with only a tiny percentage of patients, usually those who have less
severe disorders, live in major urban centers, and can afford to pay for their own
extended care. Efforts to demonstrate the efficacy of psychoanalysis have run into
resistance from practitioners and considerable logistical problems; data supporting
its use are therefore hard to find.266 Thus, it is likely to have limited relevance
when mental disorders are at issue in legal proceedings.
Psychodynamic psychotherapies, which are offshoots of psychoanalysis, are
used more frequently and hence have more relevance to the law.267 Based on
similar notions of a dynamic unconscious (that is, that processes outside the person’s awareness affect mood and behavior), psychodynamic therapies generally involve
sessions once or twice a week, for periods ranging from months to several years,
with patients sitting upright and greater activity on the part of the therapist in
identifying conflicts and maladaptive behaviors. As in psychoanalysis, the underlying premise is that when unconscious motivations are made conscious, they
become susceptible to control and alteration by the patient.
Psychodynamic therapies are easier to study and have a somewhat more
robust set of data speaking to their efficacy—for example, in anxiety and depression.268 It is often difficult, though, for patients with more severe disorders, such
as schizophrenia and bipolar disorder, to tolerate the in-depth exploration and
uncovering of intrapsychic conflicts that accompany the therapeutic process. But
many patients with personality disorders, depression, and other conditions will
attribute their stability to ongoing therapy.
b. Cognitive behavioral and related therapies
In contrast to the premises of psychodynamic therapies that mood and behavior are
affected by unconscious conflicts, cognitive behavioral therapy (CBT) is based on
the idea that conscious patterns of thought determine how one feels and behaves.269
CBT is generally shorter term (weeks to months), highly structured, and focused on
helping patients recognize and control maladaptive patterns of thinking. Patients are
often given homework assignments to complete between sessions. A strong database
supports its use in anxiety disorders, in depression (where it can be as effective as medications and may be more likely to prevent relapse), and in controlling some psychotic symptoms, and its use is steadily being extended to additional conditions.270
Specialized forms of CBT have been developed for use with sex offenders, based on
266. Glen O. Gabbard et al., The Place of Psychoanalytic Treatments Within Psychiatry, 59 Archives
Gen. Psychiatry 505 (2002).
267. Karasu, supra note 265.
268. Falk Leichsenring & Sven Rabung, Effectiveness of Long-Term Psychodynamic Psychotherapy:
A Meta-Analysis, 300 JAMA 1551 (2008).
269. Cory F. Newman & Aaron T. Beck, Cognitive Therapy, in Sadock, Sadock, & Ruiz, supra
note 120, at 2857–8.
270. Andrew C. Butler et al., The Empirical Status of Cognitive-Behavioral Therapy: A Review of
Meta-Analyses, 26 Clinical Psychol. Rev. 17 (2006).
a model often termed “relapse prevention,” which teaches patients to recognize
situations that are likely to lead to recidivism and avoid them.271 Dialectical behavior
therapy is an offshoot of CBT that has shown success with patients with borderline
personality disorder, an otherwise difficult condition to treat.272
c. Other psychological therapies
Hundreds of forms of talking therapies have been catalogued, but it would be
impossible to review them all here. Many have shown efficacy with particular
disorders, and efforts have been made to identify common therapeutic elements,
which may include the relationship with the therapist and the ability to instill hope
for the future in the patient.273 In addition to individual therapies, persons with
mental disorders may benefit from group therapies of a variety of orientations,
including psychodynamic and cognitive.274 Group therapies can be especially
helpful when socialization and relationships with other people are among the person’s problems. Family and couples therapies generally target relationships within
the family unit or marital dyad; because mental disorders are often disruptive to
relationships, such approaches may be helpful adjuncts to treatments focused on
the affected person’s primary disorder.275 Severely ill patients, including those
with schizophrenia, may benefit from what is termed supportive therapy, which
involves regular contacts aimed at identifying concrete problems in the person’s
life and helping to find solutions. It may also provide a nonthreatening outlet for
social interaction when other relationships are limited.276
3. Treatment of functional impairments
Control of positive symptoms does not necessarily address deficits in function, particularly in the psychotic disorders. What may be required are techniques that focus
on functional difficulties per se. Persons with schizophrenia, for example, given
that the disorder often affects ability to function socially and occupationally, may
need to be taught how to interact with other people, an approach known as social
skills therapy.277
271. See, e.g., D. Richard Laws, Relapse Prevention with Sex Offenders (1989).
272. M. Zachary Rosenthal & Thomas R. Lynch, Dialectical Behavior Therapy, in Sadock, Sadock,
& Ruiz, supra note 120, at 2884.
273. Jerome D. Frank & Julia B. Frank, Persuasion and Healing: A Comparative Study of
Psychotherapy (1993).
274. Henry I. Spitz, Group Psychotherapy, in Sadock, Sadock, & Ruiz, supra note 120, at 2832.
275. Henry I. Spitz & Susan Spitz, Family and Couple Therapy, in Sadock, Sadock, & Ruiz, supra
note 120, at 2584.
276. Peter J. Buckley, Applications of Individual Supportive Psychotherapy to Psychiatric Disorders:
Efficacy and Indications, in Textbook of Psychotherapeutic Treatments (Glen O. Gabbard ed., 2009).
277. Melinda Stanley & Deborah C. Beidel, Behavior Therapy, in Sadock, Sadock, & Ruiz, supra
note 120, at 2795–96.
Occupational therapy can provide them with a graded introduction (or reintroduction) to the workplace, with patients taught how to maintain
focus and deal with the demands of the work setting.278 More focal impairments
can be addressed, as well. Thus, defendants found incompetent to stand trial can be
taught about the nature of the courtroom and the expectations they must meet to
be found competent to proceed. Studies of such programs have shown higher rates
of restoration of competence than occur with treatment of the primary disorder
alone.279 Comparable programs are available for anger management,280 control of
spousal abuse,281 and training in parenting skills,282 among other areas of function
that are often the target of legal proceedings.
4. Electroconvulsive and other brain stimulation therapies
The therapeutic effect of seizure induction by electrical or chemical means on
psychosis and depression was first demonstrated in the 1930s.283 Electroconvulsive
therapy (ECT) became the most popular of these approaches in the era before
efficacious medications existed for mental disorders. The early techniques for ECT
involved application of an electrical current to the brain of patients while they were
awake. Not only was this often terrifying for the patients, but the resulting violent
seizures could cause bone fractures and other complications. Contemporary use
of ECT is quite different, with patients anesthetized prior to the procedure and
paralyzing agents used to prevent muscular contractions.284 Although temporary
confusion and memory loss often occur, long-term adverse effects are uncommon,
making ECT a safe procedure—indeed, for elderly patients with complex medical problems, it may be preferable to the use of medications. Unfortunately, the
graphic images associated with early ECT use, embodied in novels and films, dominate the popular mind and often lead to a distorted perception of the treatment.285
278. See generally Jennifer Creek, Occupational Therapy and Mental Health: Principles, Skills
and Practice (2002).
279. See, e.g., Alex M. Siegel & Amiram Elwork, Treating Incompetence to Stand Trial, 14 L. &
Hum. Behav. (1990); Barry W. Wall et al., Restoration of Competency to Stand Trial: A Training Program
for Persons with Mental Retardation, 31 J. Am. Acad. Psychiatry L. 189 (2003).
280. Raymond DiGiuseppe & Raymond C. Tafrate, Anger Treatment for Adults: A Meta-Analytic
Review, 10 Clinical Psychol.: Sci. & Prac. 70 (2003).
281. Julia C. Babcock et al., Does Batterers’ Treatment Work? A Meta-Analytic Review of Domestic
Violence Treatment, 23 Clin. Psychol. Rev. 1023 (2004). Note that in contrast to anger management and
parenting training, the data on the efficacy of treatment for batterers indicate that effects are limited at best.
282. Kathryn M. Bigelow & John R. Lutzker, Training Parents Reported for or at Risk for Child
Abuse and Neglect to Identify and Treat Their Children’s Illnesses, 15 J. Fam. Violence 311 (2000).
283. Joan Prudic, Electroconvulsive Therapy, in Sadock, Sadock, & Ruiz, supra note 120, at
3285–3301.
284. Id.
285. Ken Kesey, One Flew Over the Cuckoo’s Nest (1962); the popular movie version
appeared in 1975 (see description at http://www.imdb.com/title/tt0073486/); Garry Walter &
Andrew McDonald, About to Have ECT? Fine, but Don’t Watch It in the Movies: The Sorry Portrayal of
ECT in Film, 21 Psychiatric Times 65 (2004), available at http://www.psychiatrictimes.com/display/
Copyright © National Academy of Sciences. All rights reserved.
Reference Manual on Scientific Evidence: Third Edition
Reference Manual on Scientific Evidence
ECT is used today primarily for the acute treatment of depression, for which
it has been demonstrated to be effective.286 Although it can also have a therapeutic effect on psychotic symptoms, it is not commonly used for that purpose. An
exception involves states of catatonic stupor or excitement, both of which can be
life threatening and for which ECT can provide immediate relief.287 For patients
responsive to ECT but not to medications, maintenance ECT (i.e., periodic, perhaps monthly, treatments) can be used.288 In most cases, though, ECT is reserved
for patients who have not responded to one or more medications or whose conditions are sufficiently severe (e.g., acute suicidal urges) that a more rapidly acting
intervention than medication—which can take 6 to 8 weeks before an effect
is seen—is indicated. ECT’s history continues to haunt its current use, with many
states imposing statutory or regulatory restrictions.289 However, it can be a safe
and effective treatment—and in some cases a life-saving one. The mechanism of
effect for ECT remains unclear.
Given that brain function is integrally linked to electrical transmission of
impulses between nerve cells, it is not surprising that other efforts have been
made to use electrical stimulation for therapeutic purposes. Electrical stimulation
of the vagus nerve has been approved by FDA for the treatment of depression,
although the supporting data are generally thought to be weak.290 The therapeutic use of transcranial magnetic stimulation, in which a strong magnetic field is
applied externally, is being explored, including for depression, autism, and other
disorders.291 Successful use of implanted devices for deep brain stimulation (DBS)
for Parkinson’s disease has led to trials of DBS for obsessive–compulsive disorder
and depression;292 further experimentation in other disorders seems likely.
article/10168/48111; C. Lauber et al., Can a Seizure Help? The Public’s Attitude Toward Electroconvulsive
Therapy, 134 Psychiatry Res. 205 (2005); Balkrishna Kalayam & Melvin J. Steinhart, A Survey of
Attitudes on the Use of Electroconvulsive Therapy, 32 Hosp. Community Psychiatry 185 (1981); Richard
Abrams, Electroconvulsive Therapy (1997).
286. Daniel Pagnin et al., Efficacy of ECT in Depression: A Meta-Analytic Review, 20 J.
Electroconvulsive Therapy 13 (2004).
287. Barbara M. Rohland et al., ECT in the Treatment of the Catatonic Syndrome, 29 J. Affective
Disorders 255 (1993).
288. Prudic, supra note 283, at 3297.
289. For a review, though now somewhat out of date, see William J. Winslade et al., Medical,
Judicial, and Statutory Regulation of ECT in the United States, 141 Am. J. Psychiatry 1349 (1984).
Restrictive regulations appear to reduce the incidence of ECT use in the United States; Richard C.
Hermann et al., Variation in ECT Use in the United States, 152 Am. J. Psychiatry 869 (1995).
290. Although approved by the FDA for use in depression, concern over the weak database for
vagus nerve stimulation led the Centers for Medicare & Medicaid Services to withhold approval for payment for the
procedure. Miriam Shuchman, Approving the Vagus-Nerve Stimulator for Depression, 356 New Eng. J.
Med. 1604 (2007).
291. Philip B. Mitchell & Colleen K. Loo, Transcranial Magnetic Stimulation for Depression, 40
Austrl. N.Z. J. Psychiatry 406 (2006).
292. Helen S. Mayberg et al., Deep Brain Stimulation for Treatment-Resistant Depression, 45 Neuron
651 (2005); Benjamin D. Greenberg et al., Three-Year Outcomes in Deep Brain Stimulation for Highly
5. Psychosurgery
Direct surgical intervention to alter brain function in mental disorders has an
unfortunate history.293 Prefrontal leucotomy or lobotomy was developed in the
1930s as a treatment for intractable disorders such as schizophrenia, and became
popular in the United States after World War II. Although there was never persuasive evidence of its efficacy, lobotomies were performed in many facilities, often in
primitive conditions, on thousands of patients. Consequences frequently included
a dulling of sensation and emotion. Interest in lobotomies faded in the late 1950s,
because it became clear that the procedures were not having a positive effect, and
they are not used today. Surgical interventions are used only rarely for psychiatric
disorders, and only then for otherwise untreatable conditions. The most common
procedures today involve parallel focal lesions in each of the two halves of the
brain, which seems to help intractable and disabling obsessive–compulsive disorder
and depression.294 But psychosurgery for the treatment of psychiatric disorders is,
in any form, extremely uncommon.
6. Prediction of responses to treatment
In a number of legal contexts, experts are called on to anticipate the responses of
persons with mental disorders to treatment. For example, likely effectiveness must
be considered before a court orders treatment over objections for a defendant who
is incompetent to stand trial,295 and the probable impact of future treatment may
need to be estimated in determining damages in emotional harm cases.296 The
difficulty with these projections relates to several parameters that are inherently
challenging to predict:
• Effectiveness of treatment. Even highly effective treatments for mental disorders do not work in all cases, and when they do work, they may provide
varying levels of relief.297
Resistant Obsessive-Compulsive Disorder, 31 Neuropsychopharmacology 2384 (2006).
293. Elliot S. Valenstein, Great and Desperate Cures: The Rise and Decline of Psychosurgery
and Other Radical Treatments for Mental Illness (1987).
294. Scott L. Rauch et al., Neurosurgical Treatments and Deep Brain Stimulation, in Sadock, Sadock,
& Ruiz, supra note 120, at 2983–90.
295. Sell v. United States, 539 U.S. 166 (2003).
296. Melton et al., supra note 28.
297. For example, only 45% to 60% of patients receiving antidepressant medication for
uncomplicated major depression show clinically significant responses to the first medication they
receive, and of those who fail to respond, a similar percentage will respond positively to a second
medication. A. John Rush & Andrew A. Nierenberg, Mood Disorders: Treatment of Depression, in
Sadock, Sadock, & Ruiz, supra note 120, at 1734–9. Rates of response in unselected populations of
patients with depression are lower. Madhukar H. Trivedi et al., Evaluation of Outcomes with Citalopram
for Depression Using Measurement-Based Care in STAR*D: Implications for Clinical Practice, 163 Am. J.
Psychiatry 28 (2006).
• Adherence. Treatment has no chance of being effective if a person declines
to pursue or to continue it, a particular issue in cases where the court lacks
control over the person’s future behavior.298 Tolerability of side effects
may play an important role in these decisions.
• Fluctuations in the course and responsiveness of the disorder. Many mental disorders are chronic, and tend to wax and wane in intensity. Although
adjustments in treatment can sometimes bring more severe symptoms
under good control, that is not always possible. Moreover, for reasons that
are not understood, previously responsive disorders may become resistant
to the therapeutic effects of medication.299
• Environmental conditions. Unpredictable stresses in a person’s life may
exacerbate symptoms, reduce the effectiveness of treatment, or lead to
diminished adherence.
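The sequential response figures cited in note 297 lend themselves to a simple compounding calculation. The sketch below is illustrative only and is not part of the manual: the function name is hypothetical, and independence of per-trial response rates is assumed purely for the arithmetic. If 45% to 60% of patients respond to a first antidepressant, and a similar share of non-responders responds to a second, cumulative response after two trials is p1 + (1 − p1)·p2.

```python
# Hypothetical sketch (not from the manual): compounding per-trial response
# rates across sequential medication trials, assuming independent trials.

def cumulative_response(per_trial_rates):
    """Probability of responding to at least one trial in a sequence,
    given independent per-trial response rates."""
    remaining = 1.0  # fraction of patients who have not yet responded
    for p in per_trial_rates:
        remaining *= (1.0 - p)
    return 1.0 - remaining

low = cumulative_response([0.45, 0.45])   # 0.45 + 0.55 * 0.45 = 0.6975
high = cumulative_response([0.60, 0.60])  # 0.60 + 0.40 * 0.60 = 0.84
print(f"two-trial cumulative response: {low:.0%} to {high:.0%}")
```

On the note-297 figures, roughly 70% to 84% of patients would respond by the end of a second trial, which is consistent with the manual's point that even effective treatments leave a meaningful fraction of cases unrelieved.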
However, given that estimates sometimes must be made of probable treatment effects, there are several indicators to which clinicians can turn.300 Previous treatment response is the best predictor of future response; it is likely, for
example, that someone whose previous delusions have rapidly resolved with
antipsychotic medication will have a similar response in the future. In the
absence of a documented history of successful treatment, estimates should be
based on evidence indicating base rates of response for the person’s disorder,
along with any specific prognostic factors present in the person’s case (e.g., a
schizophrenic disorder that develops slowly over many years and that is associated with gradual functional decline generally has a poorer prognosis than one
with rapid onset and good premorbid functioning). To a greater or lesser extent,
however, it needs to be acknowledged that there is always uncertainty associated
with these predictions.
298. Rates of nonadherence to medications among patients with psychiatric disorders are
in the range of 50% or more. Although these figures are perhaps somewhat higher than those
seen in other chronic conditions, long-term treatment with medication in general is marked by
high rates of noncompliance with prescribed medications. Lars Osterberg & Terrence Blaschke,
Adherence to Medication, 353 New Eng. J. Med. 487 (2005).
299. So-called “poop-out” during treatment of depression is a commonly encountered example.
See, e.g., Sarah E. Byrne & Anthony J. Rothschild, Loss of Antidepressant Efficacy During Maintenance
Therapy: Possible Mechanisms and Treatments, 59 J. Clin. Psychiatry 279 (1998).
300. See, e.g., for predictors of response to treatment for depression, Stuart M. Sotsky et al.,
Patient Predictors of Response to Psychotherapy and Pharmacotherapy: Findings in the NIMH Treatment of
Depression Collaborative Research Program, 148 Am. J. Psychiatry 997 (1991); for predictors of response
to treatment for schizophrenia, Delbert G. Robinson et al., Predictors of Treatment Response from a First
Episode of Schizophrenia or Schizoaffective Disorder, 156 Am. J. Psychiatry 544 (1999).
G. Limitations of Mental Health Evidence
Certain limitations exist where mental health evidence is concerned that may not
come into play with other types of scientific evidence. Both retrospective assessments of past mental states and prospective estimates of future behavior depend
on judgments about variables that are inherently difficult to know with a high degree
of certainty. Even contemporaneous assessments of functional abilities depend, in
part, on the evaluee’s self-report of such difficult-to-measure attributes as distress,
motivation, and judgment. Where empirically validated assessment tools are used,
the usual concerns about measurement error are present. Two additional problematic areas involve the use of psychodynamic theory and testimony that speaks
to the ultimate legal issue.
1. Limits of psychodynamic theory
Psychoanalysis developed a complex theory of the mind that included both
functional elements (i.e., ego, superego, and id) and processes by which unconscious motivations are brought to bear on conscious thought and behavior (e.g.,
displacement, projection, reaction formation), largely in the service of protecting
the conscious mind from unbearable conflict.301 Freud’s basic schemata, which
underwent evolution even during his lifetime, subsequently have been subject to
permutation and elaboration by a large number of theorists. These schemata form
the theoretical basis for the dynamic psychotherapies and have been incorporated
into popular culture, as reflected in the work of historians, literary theorists, novelists, and cartoonists, among others.302 However, although these concepts have
proven useful in a variety of fields, many of them have been resistant to empirical
testing. Even when ample evidence exists to support a psychodynamic construct—
e.g., recovery of unconscious, nontraumatic memories303 or repression304—it has
been difficult to prove the postulated functional role for the process. Nonetheless, psychodynamic concepts—and the use of psychodynamic therapies—remain
mainstays in many psychiatry and psychology training programs. Testimony based
on these concepts is often introduced, for example, in discussions of a defendant’s
mental state at the time of the crime, in relation to defenses of insanity, dimin-
301. William W. Meissner, Classical Psychoanalysis, in Sadock & Sadock, supra note 120, at
701–46.
302. See, e.g., Psychoanalytic Literary Criticism (Maud Ellmann ed., 1994); Peter Loewenberg,
Psychoanalytic Models of History: Freud and After, in Psychology and Historical Interpretation (William
M. Runyan ed., 1980).
303. Matthew H. Erdelyi, The Recovery of Unconscious Memories: Hypermnesia and
Reminiscence (1996).
304. David S. Holmes, The Evidence for Repression: An Examination of Sixty Years of Research,
in Repression and Dissociation: Implications for Personality Theory, Psychopathology, and Health
85–102 (Jerome L. Singer ed., 1990).
ished capacity, self-defense, provocation, duress, and entrapment.305 It may also
play a role in civil cases, regarding questions as disparate as a parent’s capacity to
raise a child and whether a testator was subject to undue influence.306 Because
these concepts were generally accepted in the relevant fields (although there have
always been skeptics), the test of admissibility under Frye v. United States and similar state rules was usually met.307 The reinvigorated admissibility requirements
promulgated under Daubert v. Merrell Dow Pharmaceuticals, Inc. and Kumho Tire
Co. v. Carmichael, with their emphasis on empirical verification of the bases for
the expert’s testimony, have called the future of testimony based on most psychodynamic concepts into question.308
Questions about testimony based on psychodynamic theory can be raised
with regard both to the legitimacy of the underlying constructs (e.g., displacement of affect) and to the techniques by which the examiner can know that such
a mechanism came into play in a particular case (e.g., the displacement of the
defendant’s unconscious rage at his mother led to a loss of behavioral control
that resulted in an assault on another woman). Slobogin has argued, with regard
to criminal defendants, that frankly speculative testimony about psychodynamic
influences on the crime should be held to a lesser standard of admissibility than
required under Daubert.309 In part, he suggests that the very concepts on which
the law relies—such as extreme emotional stress and reasonable apprehension of
harm—are themselves not easily susceptible to determinations that would meet
Daubert’s reliability considerations. Thus, if defendants are to be able to introduce
evidence that would overcome the presumptions against them, testimony that
relies on accepted but inherently unprovable constructs is essential. Moreover,
305. Christopher Slobogin, Proving the Unprovable: The Role of Law, Science, and Speculation
in Adjudicating Culpability and Dangerousness (2007).
306. Robertson v. McCloskey, 676 F. Supp. 351 (D.D.C. 1988) (declining to admit psychodynamic testimony under the Frye standard); United States v. Libby, 461 F. Supp. 2d 3 n.6 (D.D.C.
2006) (noting that although psychodynamic testimony was not admissible under the Frye standard, that
does not necessarily hold under Daubert, and that “there can be little doubt that today . . . the science
of memory is well established and accepted in the scientific community . . . has been well tested and
subjected to peer review”); United States v. Fishman, 743 F. Supp. 713 (N.D. Cal. 1990) (excluding
testimony on “thought reform” theory from a qualified mental health professional).
307. Frye v. United States, 293 F. 1013 (1923).
308. Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579 (1993); Kumho Tire Co. v.
Carmichael, 526 U.S. 137 (1999).
309. Slobogin, supra note 305, at 39–57. Note that Slobogin’s argument is not limited to
testimony rooted in psychodynamic concepts, but extends to other mental health evidence that
intends to speak to aspects of a person’s mental state at some point in the past, knowledge of which
is unlikely ever to meet scientific standards of proof. Under the Daubert standard, the judge serves
as the gatekeeper for scientific testimony. The admissibility of evidence is determined on the basis
of relevance and reliability. Reliability factors offered as examples include falsifiability, peer review,
the known or potential rate of error, and general acceptance by the relevant scientific community.
Daubert, 509 U.S. at 593–95.
Slobogin claims that this is, in fact, why trial courts usually resist efforts to exclude
mental health testimony.310
Granting the legitimacy of Slobogin’s analysis, there is still reason for caution
in a wholesale embrace of psychodynamic theories. Because persuasive empirical
demonstrations of either the concepts themselves or their application in particular
cases is unlikely, their speculative—even if plausible—nature should be recognized. Moreover, to say that such testimony should not be held to reliability-based
standards of admissibility is not to say that no relevant standards exist. Idiosyncratic
concepts and conclusions that would not be generally accepted by clinicians with
appropriate training might well run afoul of prevailing rules for admissibility,
because they lack support even under the older standard of acceptance in the relevant professional community,311 and to the extent that techniques for generating
testable data are available, those techniques would appear preferable. This appears, in
fact, to be the way in which courts generally approach such evidence.312
2. Ultimate issue testimony
Whether mental health experts should testify—or be permitted to testify—to the
ultimate legal issue in a case has been the subject of longstanding controversy.313
The question arises, for example, in criminal cases where experts often have commented directly on whether a defendant is competent to stand trial or whether
the legal standard for insanity has been met.314 Similar issues can arise in civil settings, in which experts may be asked to testify directly about a person’s capacity
to manage affairs or to serve as a custodial parent, or regarding whether a person
was competent to sign a contract at an earlier point in time.315 Some mental health
experts find themselves encouraged or pressured by attorneys to draw conclusions
about the ultimate issue, and judges have been known to exclude testimony when
experts are unwilling to take that step, on the grounds that the evidence they
would otherwise provide lacks probative value.316 Concerns arise because
conclusions about the ultimate issue in a case are matters to be decided by the
factfinder; an expert who speaks to the issue may encroach on the factfinder’s
legitimate territory and preempt its deliberations.317
310. For a response to Slobogin’s argument, see Edward J. Imwinkelried, The Case Against
Abandoning the Search for Substantive Accuracy, 38 Seton Hall L. Rev. 1031 (2008).
311. Frye v. United States, 293 F. 1013 (D.C. Cir. 1923).
312. Slobogin, supra note 305, at 21–29.
313. See Fed. R. Evid. 704. See Anne Lawson Braswell, Resurrection of the Ultimate Issue Rule:
Federal Rule of Evidence 704(b) and the Insanity Defense, 72 Cornell L. Rev. 620 (1987).
314. But see discussion below regarding the current prohibition on this practice in federal courts.
315. See Restatement (Second) of Contracts § 15.
316. Appelbaum & Gutheil, supra note 149, at 221.
317. Insanity Defense Workgroup, American Psychiatric Association Position on the Insanity Defense,
140 Am. J. Psychiatry 681, 686 (1983); American Bar Association, ABA Criminal Justice Standards:
Proponents of ultimate issue testimony often include attorneys and judges,
who may be concerned that an expert who provides a clinical formulation without
tying it directly to the ultimate legal issue will leave a group of confused jurors
unable to discern the connection on their own.318 Many experts themselves share
similar concerns or worry that mental health issues will simply be ignored if their
relevance to the legal question at hand is not made clear; moreover, they note
that courts have applied such rules erratically.319 They counter concerns about
such testimony having an undue impact on jurors’ deliberations by noting that
members of juries appear to be little influenced by whether or not ultimate issue
testimony is offered by an expert.320
Moreover, efforts to restrict testimony on the ultimate issue often quickly run
into line-drawing problems. As an example, after a jury found John W. Hinckley,
Jr. not guilty by reason of insanity of the attempted assassination of President
Reagan, the verdict led to wholesale revision of laws governing the insanity defense
at the federal and state levels.321 Among the changes wrought by the Federal
Insanity Defense Reform Act of 1984 was a prohibition on experts directly addressing the question of insanity.322 The Federal Rules of Evidence were amended to
effect this change: “No expert witness testifying with respect to the mental state
or condition of a defendant in a criminal case may state an opinion or inference as
to whether the defendant did or did not have the mental state or condition constituting an element of the crime or of a defense thereto.”323 Although it seems
clear that, according to the terms of the rule, the expert is precluded from opining
directly that a defendant lacked criminal responsibility, it is less clear whether the
expert could say that the defendant could not “appreciate the wrongfulness of his
acts,” the language used in the statute to define the relevant standard.324 And if that,
too, were prohibited, could the expert say that the defendant “could not grasp how
wrong his behavior was,” and if so, would that language be likely to have any different impact on a jury than simply speaking in the words of the statute? Empirical
data exist to suggest that the answer to that question is no.325
Still, a large number of mental health and legal scholars oppose experts
addressing the ultimate legal question, and during the high-pitched debate follow-
Mental Health, Standard 7-6.6 (1984). Note that the APA position was recently withdrawn as outdated
and replaced by a briefer statement that does not address the question of ultimate issue testimony.
318. Ralph Slovenko, Commentary: Deceptions to the Rule on Ultimate Issue Testimony, 34 J. Am.
Acad. Psychiatry & L. 22 (2006).
319. Alec Buchanan, Psychiatric Evidence on the Ultimate Issue, 34 J. Am. Acad. Psychiatry & L.
14 (2006).
320. Solomon M. Fulero & Norman J. Finkel, Barring Ultimate Issue Testimony: An “Insane”
Rule? 15 L. & Hum. Behav. 495 (1991).
321. Henry J. Steadman, Before and After Hinckley: Evaluating Insanity Defense Reform (1993).
322. 18 U.S.C. § 17.
323. Fed. R. Evid. 704(b).
324. Id.
325. Fulero & Finkel, supra note 320.
ing the Hinckley trial, both the American Psychiatric Association and the American Bar Association adopted positions against ultimate issue testimony.326 In
addition to the argument that such testimony trenches on the function of the
jury, opponents often point to the legal and moral nature of the question whether
someone is criminally responsible.327 Although mental health expertise may be
helpful in determining the person’s mental state at the relevant time, determining
whether the resulting impairment was sufficient to negate responsibility requires
the application of the relevant legal standard and a moral judgment of the fairness
or unfairness of punishing the person for his or her behavior. Psychiatrists and
psychologists have no particular expertise on legal or moral issues; hence, opponents of ultimate issue testimony urge that they should not be permitted to speak
to those issues. Such preclusion may also reduce the much bemoaned “battle of
the experts,” because a good deal of disagreement may derive from views of how
data from the evaluation should be applied to the ultimate legal question, rather
than from differences regarding the person’s mental state. Although testimony on
the ultimate legal issue is now barred in federal courts in insanity defense cases (18
U.S.C. § 17), it remains common in many states, and even in federal jurisdictions
it may be offered in other sorts of cases.328
II. Evaluating Evidence from Mental Health
Experts
To this point, we have considered the kind of evidence that is likely to be offered
by mental health experts and some of the challenges that such testimony presents.
The remainder of the chapter addresses those factors that should enter into consideration of the value and impact of such testimony.
A. What Are the Qualifications of the Expert?
The appropriate qualifications of a mental health professional whose testimony
is proffered will depend on the nature of the evidence that will be presented.
However, a number of relevant parameters can be identified.
326. Insanity Defense Workgroup, supra note 317; American Bar Association, supra note 317.
See also Grisso, supra note 2, at 208; Fulero & Finkel, supra note 320, at 496.
327. Mark S. Brodin, Behavioral Science Evidence in the Age of Daubert: Reflections of a Skeptic, 73
U. Cin. L. Rev. 867 (2005); Michele Cotton, A Foolish Consistency: Keeping Determinism Out of the
Criminal Law, 18 B.U. Pub. Int. L.J. 1, 21–23 (2005); Ric Simmons, Conquering the Province of the Jury:
Expert Testimony & the Professionalization of Fact-Finding, 74 U. Cin. L. Rev. 1013 (2006).
328. Fed. R. Evid. 704. Pennsylvania’s law represents a typical formulation: “Testimony in
the form of an opinion or inference otherwise admissible is not objectionable because it embraces an
ultimate issue to be decided by the trier of fact.” Pa. R. Evid. 704.
1. Training
Most mental health expert testimony is given by psychiatrists or doctoral-level
clinical psychologists. Given the differences in the education and training of
each profession, their testimony is not necessarily interchangeable. As a rule,
psychiatrists are prepared by their training to speak to the diagnosis of mental
disorders, including medical issues that may play a role in a particular case, and
to treatment approaches, including psychopharmacological treatment.329 They
should be capable of testifying, within the limits of existing knowledge and the
information available to them, regarding the impact of a disorder on a person’s
behavior and functional abilities. Psychologists’ training, in contrast, may provide
deeper knowledge of the theoretical and experimental bases for understanding the
function of the mind, both normal and abnormal.330 As a general matter, doctoral-level clinical psychologists will be prepared by their training to provide evidence
regarding diagnosis and psychotherapeutic treatment of mental disorders, the
results of psychological and neuropsychological testing, and the roots of normal
and abnormal behavior.
However, although the core elements of training in psychiatry and psychology may be similar across training programs, the variability is substantial.331
Moreover, variation in subspecialty (in psychiatry) or specialty (in psychology)
training—for example, in geriatric psychiatry or neuropsychology—contributes
to further differentiation among experts. Thus, inquiries regarding the specific
training afforded an expert may be necessary. This is particularly true when an
expert is testifying about topics that would ordinarily fall outside disciplinary
boundaries, for example, a psychiatrist discussing the results of psychological
testing or a psychologist offering evidence regarding the effect of medication on
a person’s behavior. The same is true for experts who are testifying beyond the
range of their specialty or subspecialty training. In addition, in recent years, expert
testimony on mental health issues has been admitted at times from nonpsychiatric
physicians and mental health professionals of other disciplines.332 These include
329. See discussion of psychiatrists’ training in Section I.B.1, supra.
330. See discussion of psychologists’ training in Section I.B.2, supra.
331. See, e.g., Khurshid A. Khurshid et al., Residency Programs and Psychotherapy Competencies:
A Survey of Chief Residents, 29 Academic Psychiatry 452 (2005); Committee on Incorporating
Research into Psychiatry Residency Training, Institute of Medicine, Research Training in Psychiatric
Residency: Strategies for Reform 91–132 (Michael T. Abrams et al. eds., 2003); Charles J. Gelso, On
the Making of a Scientist-Practitioner: A Theory of Research Training in Professional Psychology, S(1) Training
and Education in Professional Psychology 3–16 (2006); Brendan A. Maher, Changing Trends in Doctoral
Training Programs in Psychology: A Comparative Analysis of Research-Oriented Versus Professional-Applied
Programs, 10 Psychol. Sci. 475 (1999).
332. Campbell v. Metropolitan Prop. & Cas. Ins. Co., 239 F.3d 179 (2d Cir. 2001) (professor
of pediatrics with substantial relevant publications found qualified to testify on neurological injuries
resulting from lead paint exposure); Carroll v. Otis Elevator Co., 896 F.2d 210 (7th Cir. 1990)
(experimental psychologist found qualified to give expert testimony on likelihood that product design
social work and nursing, and in the future arguably could include master’s-level
psychologists, marriage and family therapists, physician assistants, and additional
disciplines as well. Specific inquiry into relevant training will probably be needed
at least until testimony from such disciplines becomes more widely accepted and
their specific qualifications more generally known.
2. Experience
Experience is relevant to the qualifications of mental health experts in at least two
ways. First, as the Federal Rules of Evidence recognize, experience may substitute for training as a basis for concluding that a witness has special expertise.333
Many experts in forensic psychiatry and forensic psychology, for example, lack
formal training in conducting evaluations of the sort provided in forensic fellowships, because such training programs have become widely available only fairly
recently. In addition, formal training is simply unavailable (or at least difficult to
acquire) in a number of substantive areas of clinical psychiatry and psychology.
For example, most professionals who acquire special knowledge about particular
mental disorders will do so by pursuing their interest through reading and following the literature and by means of clinical contact with patients with the disorders,
as opposed to formal training. Thus, experience must often be relied upon as a
stand-in for more conventional credentials.
The second way in which experience can be material to expert qualifications
relates to the attrition of skills and knowledge over time. Mental health professionals often complete their training within several years of their 30th birthdays
and may engage in practice, including the provision of expert testimony, over the
subsequent four or five decades. Brief exposure to information about a particular
disorder334 or some experience in evaluating and treating the condition may
fade from memory several decades later unless reinforced in a direct way. Just as
problematic is the possibility that additional knowledge about the condition has
would cause children to press escalator’s emergency stop button); United States v. Withorn, 204 F.3d
790 (8th Cir. 2000) (trial court properly admitted testimony from midwife on alleged sexual assault
on basis of bachelor’s degree, some postgraduate work, and clinical experience). But see United States
v. Moses, 137 F.3d 894 (8th Cir. 1998) (social worker lacked expertise to opine that victim of alleged
child abuse would suffer trauma from facing the accused abuser in the courtroom).
333. “If scientific, technical, or other specialized knowledge will assist the trier of fact to
understand the evidence or to determine a fact in issue, a witness qualified as an expert by knowledge,
skill, experience, training, or education, may testify thereto in the form of an opinion or otherwise,
if (1) the testimony is based upon sufficient facts or data, (2) the testimony is the product of reliable
principles and methods, and (3) the witness has applied the principles and methods reliably to the facts
of the case.” Fed. R. Evid. 702 (2000).
334. Although this discussion is framed in terms of a particular disorder, the condition in issue
may not constitute a disorder in a formal sense. Rather it may involve a symptom (e.g., auditory
hallucinations), a mental state not linked to a specific disorder (e.g., dissociation), or a behavioral
propensity (e.g., violent behavior). The argument in this section is generally applicable to all these
categories of phenomena.
been gained in the interim, familiarity with which might alter an expert’s evaluation or opinion. Training regarding a mental disorder or treatment, therefore,
may be a necessary but not sufficient aspect of an expert’s qualifications, in the
absence of ongoing experience. Indicia of such experience may include evaluating or treating patients with the disorder, teaching trainees how to assess or treat
the disorder, systematically reviewing the literature on the disorder, attending
continuing education sessions concerning the disorder, and conducting research
on the disorder.
Although experience, including ongoing experience, with the condition at
issue is important in establishing expertise for the purpose of providing evidence
in a case, there is a danger that experience can be overemphasized as a criterion
of expertise as well. Assuming a baseline degree of adequate training and some
ongoing experience in a field or with a condition, it is not clear that additional
experience necessarily enhances an expert’s authoritativeness. Experts will sometimes boast of the number of evaluations they have performed of a particular
type of evaluee (e.g., alleged or convicted murderers) or of a given kind (e.g.,
assessments of competence to stand trial). However, if evaluations are performed
inadequately or used as the basis for invalid conclusions, especially if there is no
feedback loop to correct the expert’s errors, mere experience may only have the
effect of reinforcing bad clinical habits. Indeed, studies of diagnostic performance
by mental health professionals divided into groups by the duration of their clinical
experience have shown no consistent correlation between years of experience and
reliability.335 An explanation for the failure to find a consistent effect of experience
may be that, despite less clinical experience, recently trained clinicians are more
familiar with the contemporary diagnostic framework and are less tempted to
use their clinical experience as a substitute for generally accepted criteria (e.g., “I
know schizophrenia when I see it, regardless of what the criteria say”). It is of
interest that few studies have compared the performance of experienced forensic
psychiatrists and psychologists to their nonforensic colleagues.336 Although it
might be expected that experts with forensic training would be more sensitive
to the unique aspects of forensic examinations discussed above (for example, the
importance of maintaining a level of suspicion regarding secondary gain and of
confirming the evaluee’s account, when possible, with collateral information), that
hypothesis remains to be tested. One small study has shown that forensic psychiatrists may be less susceptible to some kinds of hindsight bias than their clinical
335. Here reliability is being used in its technical sense of agreement across more than one rater.
For an example of the failure to find a consistent effect of previous experience, see, e.g., Sean H. Yutzy
et al., DSM-IV Field Trial: Testing a New Proposal for Somatization Disorder, 152 Am. J. Psychiatry 97
(1995).
336. There are, however, data to suggest, as might be expected, that clinicians with forensic
training have higher levels of knowledge regarding relevant legal issues, e.g., Gary B. Melton et al.,
Community Mental Health Centers and the Courts: An Evaluation of Community-Based Forensic
Services 43–55 (1985).
colleagues,337 but additional research would be helpful before firm conclusions
are drawn.
3. Licensure and board certification
a. Licensure
Possession of a valid professional license is usually considered a threshold requirement in the qualification of an expert in legal proceedings. Licensure of physicians
(including psychiatrists) is governed by a licensure board in each state.338 Although
criteria may differ somewhat, generally a physician who has graduated from an
accredited American medical school, passed a sequence of tests designed to ensure
adequate levels of knowledge and clinical judgment,339 and completed 1 or 2 years
of residency training is eligible for full licensure.340 Prior to that point, a temporary
license, allowing practice under supervision, is usually issued. Graduates of medical
schools that are not in the United States are usually subject to a different set of
requirements, often requiring longer periods of residency training and individual
review of qualifications. Once licensure is attained in one state, the process for
acquiring a license in another varies. Some states will grant such a license fairly
easily; others, such as California, will require the physician to take and pass a test
of general medical knowledge if a certain period of time (e.g., 10 years in California)
has passed since the original sequence of testing was completed.341
For clinical psychologists, standards for licensure differ somewhat by state, but
generally after completion of an accredited Ph.D. program in the United States
(including a 1-year internship), they are required to complete 2 years of clinical work under the supervision of a licensed psychologist and to pass a national
licensure examination.342 Because the states do not restrict the practice of psychotherapy per se, but regulate the use of professional titles instead, an unlicensed
psychologist can engage in many aspects of the clinical practice of psychology,
including all forms of psychotherapy, but will not be able to use the title of psychologist. For psychologists who are seeking licensure in another jurisdiction,
many states will grant reciprocity—that is, they will not engage in an independent
337. Herbert W. LeBourgeois et al., Hindsight Bias Among Psychiatrists, 35 J. Am. Acad. Psychiatry
& L. 67 (2007).
338. A summary of the requirements for medical licensure in each jurisdiction is available from
the Federation of State Medical Boards at http://www.fsmb.org/usmle_eliinitial.html.
339. See a description of the tests and the examination process at http://www.usmle.org/
General_Information/general_information_about.html.
340. Federation of State Medical Boards, supra note 338.
341. Cal. Bus. & Prof. Code §§ 2080–99, 2184.
342. Details of requirements in each state can be found at the Web site of the Association of
State and Provincial Psychology Boards at http://www.asppb.net.
process of reviewing the applicant’s credentials, relying instead on the review
conducted by the initial licensure board.
b. Board certification
Board certification represents a level of qualifications beyond those required for
licensure in either medicine or psychology. Although well-trained, competent
psychiatrists may have reasons for not attaining board certification (e.g., examination anxiety that interferes with performance, a career centered on nonclinical
research for which clinical board certification is thought to be unnecessary), the
tests are designed to be passed by a competent psychiatrist and do not require
exceptional levels of clinical skill. Thus, in most cases board certification can be
viewed as reflecting attainment of an adequate level of clinical competence to
engage in independent psychiatric practice. Whether a court chooses to admit
testimony from a psychiatrist who has not been board certified may depend on
the reasons why certification has not been achieved and on the specific question(s)
that will be addressed in the psychiatrist’s testimony.343
Professional psychology also has a board certification process, administered by
the American Board of Professional Psychology.344 Certification is only offered
in psychology specialties, but these include such general clinical fields as clinical
psychology, counseling psychology, and group psychology. As with subspecialty certification in psychiatry, candidates are expected to exhibit advanced competence
in the specialty area, defined specifically for each specialty. Board certification
is less common among psychologists than among psychiatrists, in part perhaps
because the process is more recent.345 Given this, it is less likely that certification
will be applied as a minimum standard for expert testimony in psychology than
in psychiatry or other areas of medicine.
343. For examples of the scope of judicial discretion on this issue, see, e.g., Hall v. Quarterman,
534 F.3d 365 (5th Cir. 2008) (finding that a state requirement that only a licensed expert may
testify in a civil commitment hearing as to mental retardation did not extend to expert testimony
on the same topic); Oberlander v. Oberlander, 460 N.W.2d 400 (1990) (reversing as abuse of
discretion the trial court’s exclusion of expert testimony from a psychologist who was licensed
in the neighboring state); Williams v. Brown, 244 F. Supp. 2d 965 (N.D. Ill. 2003) (finding that
psychiatrists who were not board-certified child psychiatrists may nonetheless testify about the
condition of juvenile plaintiffs).
344. A description of the process and eligibility requirements for the examination process can
be found at http://www.abpp.org/abpp_certification_specialties.htm.
345. A recent study suggests that approximately 85% of psychiatrists become board certified in
the 8 years following completion of residency training. Dorthea Juul et al., Achieving Board Certification
in Psychiatry: A Cohort Study, 160 Am. J. Psychiatry 563 (2003). In contrast, it was estimated that in
2000 only 3.5% of psychologists had achieved board certification. Frank M. Dattilio, Board Certification
in Psychology: Is It Really Necessary? 33 Prof. Psychol.: Res. & Prac. 54 (2002).
4. Prior relationship with the subject of the evaluation
A presumption may exist among some attorneys, judges, and jurors that a mental
health professional who has had a treatment relationship with the person whose
mental state is in question is better qualified to testify about aspects of that mental
state than an evaluator who is meeting the person for the first time. The logic
seems strong: A professional who has known the person for some period of time,
perhaps a substantial one, should be better able to offer conclusions about the
person’s diagnosis, treatment requirements, and the impact of the person’s mental
state on the person’s function and behavior. Thus, it may seem surprising that the
ethics guidelines produced by both the American Academy of Psychiatry and the
Law, the leading organization of forensic psychiatrists, and the American Psychological Association’s division of forensic psychologists point to problems inherent in such situations.346 Although neither set of guidelines construes testimony
involving current or former patients as unethical, they both have words of caution
to offer and discourage clinicians from playing both clinical and expert roles.347
The professional literature on this issue, and the ethics guidelines themselves,
cite several reasons why having a treating professional perform the evaluation for
346. American Academy of Psychiatry and the Law: Ethics Guidelines for the Practice of
Forensic Psychiatry, May 2005, https://www.aapl.org/ethics.htm; Committee on Ethical Guidelines
for Forensic Psychologists (Division 41 of the American Psychological Association and the American
Academy of Forensic Psychology), Specialty Guidelines for Forensic Psychologists, 15 L. & Hum. Behav.
655 (1991).
347. The forensic psychiatry guidelines are explicitly discouraging of this practice:
Psychiatrists who take on a forensic role for patients they are treating may adversely affect the therapeutic relationship with them. Forensic evaluations usually require interviewing corroborative sources,
exposing information to public scrutiny, or subjecting evaluees and the treatment itself to potentially
damaging cross-examination. The forensic evaluation and the credibility of the practitioner may also
be undermined by conflicts inherent in the differing clinical and forensic roles. Treating psychiatrists
should therefore generally avoid acting as an expert witness for their patients or performing evaluations
of their patients for legal purposes.
American Academy of Psychiatry and the Law: Ethics Guidelines for the Practice of Forensic Psychiatry,
Sec. IV (May 2005), available at https://www.aapl.org/ethics.htm. In contrast, the forensic psychology
guidelines could be seen as being somewhat more permissive:
“D. Forensic psychologists recognize potential conflicts of interest in dual relationships with parties to
a legal proceeding, and they seek to minimize their effects.
1. Forensic psychologists avoid providing professional services to parties in a legal proceeding
with whom they have personal or professional relationships that are inconsistent with the anticipated
relationship.
2. When it is necessary to provide both evaluation and treatment services to a party in a legal
proceeding (as may be the case in small forensic hospital settings or small communities), the forensic
psychologist takes reasonable steps to minimize the potential negative effects of these circumstances on
the rights of the party, confidentiality, and the process of treatment and evaluation.”
Committee on Ethical Guidelines for Forensic Psychologists (Division 41 of the American Psychological
Association and the American Academy of Forensic Psychology): Specialty Guidelines for Forensic
Psychologists, 15 Law & Hum. Behav. 655 (1991).
legal purposes may not be prudent.348 First, offering testimony, even if it is supportive of the patient’s legal claim, may interfere with the therapeutic relationship.
Not only will it often come as a shock to a patient to hear herself described in
diagnostic terms, but details of the treating clinician’s view of the patient revealed
under both direct and cross-examination may alienate the person from the clinician. The treating clinician, in fact, may possess more information irrelevant to
the legal question than an evaluator retained specifically to provide evidence, and
hence may be more likely to reveal such information during testimony. At best,
when this happens it impedes the therapeutic process and takes
time away from the primary therapeutic goals; at worst, it may lead the person
to abandon treatment. This effect is likely to be exacerbated if the testimony is
adverse to the patient’s legal position.
Second, the underlying assumption regarding the desirability of having the
clinician testify may be flawed. That is, although the clinician may have known
the person for a long time as a patient, the clinical process may never have
required the clinician to collect the type of information that would be relevant to
the legal question. Even if that information was discussed, the treating clinician is
less likely to have approached it with the degree of caution that a forensic evaluator would be likely to employ or to have attempted to verify the information
through collateral sources. Indeed, even after agreeing to participate as an expert
witness, a clinician may be unaware of the importance of assessing the veracity
of the person’s claim or afraid that doing so may lead to strains in the therapeutic
relationship.
A third problem is that the clinician, having formed an alliance with the
person as a patient, perhaps over a considerable period of time, may feel a natural
allegiance to the person and a desire, even if not a conscious one, to support the
person’s contentions in the case. Thus, presentation of evidence may undergo
subtle distortion, or may be subject to conscious manipulation by a clinician who
sees his or her role as being the patient’s advocate. Fourth, there is an ethical
problem when the clinician is subpoenaed to testify over the patient’s objection.
The preexisting therapeutic relationship was premised on the understanding that
information the patient revealed would be used for treatment purposes. Being
compelled now to use that information to the patient’s detriment places the clinician whose testimony cannot support the person’s legal claim in an extremely
awkward position.349
348. Larry H. Strasburger et al., On Wearing Two Hats: Role Conflict in Serving as Both
Psychotherapist and Expert Witness, 154 Am. J. Psychiatry 448 (1997); Ronald Schouten, Pitfalls of
Clinical Practice: The Treating Clinician as Expert Witness, 1 Harv. Rev. Psychiatry 405 (1993); Stuart
Greenberg & Daniel Shuman, Irreconcilable Conflict Between Therapeutic and Forensic Roles, 28 Prof.
Psychol.: Res. & Prac. 50 (1997); Appelbaum & Gutheil, supra note 149, at 236–39.
349. Although all states have psychotherapist–patient and/or physician–patient testimonial
privilege statutes that limit testimony by treating psychiatrists and psychologists (and often other mental
health professionals) without the patient’s consent, the exceptions in many of these statutes—including
the so-called patient-litigant exception that is invoked when patients place their mental state at issue
Thus, in contrast to what might seem the logical assumption—that the treating
clinician is the best qualified person to testify regarding the patient—there are
multiple reasons to avoid relying on the treater, and in fact to discourage that
person from serving as an expert witness in the case.
B. How Was the Assessment Conducted?
The reliability and validity of an expert opinion on mental health issues depend
heavily on the manner in which the assessment that forms the basis for the
conclusions was conducted.
1. Was the evaluee examined in person?
Given the range of cases in which mental health experts provide testimony and
the various questions to which they are asked to respond, situations arise in which
the experts are providing evidence without having examined the person about
whom they are testifying.350 Such circumstances may arise when direct evaluation
is impossible, for example, in contests over testamentary capacity, where often
only after the testator is deceased will a claim regarding the person’s capacity be
litigated. Other types of civil litigation in which the state of mind of a deceased
person may be at issue include claims involving contractual capacity, wrongful death, and medical malpractice.351 Testimony regarding a person who cannot be evaluated
directly is less likely to occur in criminal cases, but a highly contentious example
occurs in death penalty cases in Texas; defendants have the right to decline evaluation by prosecution experts,352 but such experts frequently testify on the basis of
a hypothetical question that reflects some of the facts regarding the defendants’
history and behavior.353
in a case—are sufficiently numerous that this situation cannot be ruled out. Jaffee v. Redmond, 518
U.S. 1 n.13 (1996); Bruce J. Winick, The Psychotherapist-Patient Privilege: A Therapeutic Jurisprudence
View, 50 U. Miami L. Rev. 249 (1996).
350. In addition, on some occasions, testimony will provide contextual information for the
decisionmaker, for example, how a person in a given situation or with a given disorder would usually
respond, without being applied directly to a specific person. John Monahan & Laurens Walker, Social
Authority: Obtaining, Evaluating, & Establishing Social Science in Law, 134 U. Pa. L. Rev. 477 (1986); John
Monahan & Laurens Walker, Social Science Research in Law: A New Paradigm, 43 Am. Psychol. 465 (1988).
351. Farnsworth, supra note 8, § 3:11. For a case study of the use of postmortem analysis in the
USS Iowa explosion investigation, see Charles Patrick Ewing & Joseph T. McCann, Minds on Trial:
Great Cases in Law and Psychology 129–39 (2006); see also Norman Poythress et al., APA’s Expert
Panel in the Congressional Review of the USS Iowa Incident, 48 Am. Psychol. 8 (1993). See Moon v.
United States, 512 F. Supp. 140 (D. Nev. 1981) (finding that hospital psychiatrists were negligent in
diagnosing as schizophrenic a patient who later committed suicide); Urbach v. United States, 869 F.2d
829 (5th Cir. 1989) (finding no medical malpractice where a mental patient on furlough from a VA
hospital was arrested and beaten to death in a Mexican prison).
352. Estelle v. Smith, 451 U.S. 454 (1981).
353. Barefoot v. Estelle, 463 U.S. 880 (1983); Satterwhite v. Texas, 486 U.S. 249 (1988).
Conclusions about persons who have not been directly examined may be
drawn on the basis of available records, including medical, mental health, police,
educational, armed services, and other records; information from informants who
have been or are in contact with the person, which may derive from interviews by
the expert, prior testimony, depositions, police reports, and other sources; and on
some occasions observations by the expert of the person’s behavior, for example,
in a prison or courtroom setting.354 Although it may be possible to draw valid
conclusions on the basis of such data, conclusions generally are more limited and
have a lesser degree of certainty than when a direct evaluation has taken place.
The ethics statements of the major forensic psychiatry and forensic psychology
organizations offer words of caution about such testimony.355 There are several
reasons why caution is warranted.
Expert knowledge in mental health can be viewed as comprising two components: the knowledge of how to conduct an evaluation to obtain relevant data
and the knowledge of how to weigh those data to reach a conclusion.356 When a
direct examination of the person cannot be carried out, the expert must rely on
information accumulated by others, sometimes for other purposes. The likelihood
354. Kirk Heilbrun et al., Third Party Information in Forensic Assessment, in Handbook of
Psychology, Vol. 11: Forensic Psychology 69 (Alan M. Goldstein ed., 2003). Testimony offered in
capital sentencing contexts without examination of the defendant has been particularly controversial,
see, e.g., Bennett v. State, 766 S.W.2d 227, 232 (Tex. Crim. App. 1989) (Teague, J., dissenting)
(“[W]hen Dr. Grigson testifies at the punishment stage of a capital murder trial he appears to the
average lay juror . . . to be the second coming of the Almighty. . . . Dr. Grigson is extremely good at
persuading jurors to vote to answer the [future dangerousness] issue in the affirmative.”); “They Call
Him Dr. Death,” Time, June 1, 1981; Rosenbaum, supra note 216.
355. The Ethics Guidelines for the Practice of Forensic Psychiatry of the American Academy of
Psychiatry and Law (available at https://www.aapl.org/ethics.htm) note:
For certain evaluations (such as record reviews for malpractice cases), a personal examination is not
required. In all other forensic evaluations, if, after appropriate effort, it is not feasible to conduct a personal examination, an opinion may nonetheless be rendered on the basis of other information. Under
these circumstances, it is the responsibility of psychiatrists to make earnest efforts to ensure that their
statements, opinions and any reports or testimony based on those opinions, clearly state that there was
no personal examination and note any resulting limitations to their opinions.
The comparable guidelines for forensic psychology state:
Forensic psychologists avoid giving written or oral evidence about the psychological characteristics
of particular individuals when they have not had an opportunity to conduct an examination of the
individual adequate to the scope of the statements, opinions, or conclusions to be issued. Forensic
psychologists make every reasonable effort to conduct such examinations. When it is not possible or
feasible to do so, they make clear the impact of such limitations on the reliability and validity of their
professional products, evidence, or testimony.
Committee on Ethical Guidelines for Forensic Psychologists (Division 41 of the American Psychological
Association and the American Academy of Forensic Psychology), Specialty Guidelines for Forensic
Psychologists, 15 Law & Hum. Behav. 655 (1991).
356. Paul S. Appelbaum, Hypotheticals, Psychiatric Testimony, and the Death Sentence, 12 Bull.
Am. Acad. Psychiatry & L. 169 (1984); see also American Psychiatric Ass’n amicus brief in Barefoot,
supra note 215.
that all the data that the expert would have wanted to obtain will be available in
such circumstances is remote. This is true even when the data have been generated
by another mental health professional, for example, in medical or mental health
records, both because that person may not have asked all the questions that the
testifying expert would have asked and because all of the person’s responses may
not have been fully recorded. The intangible aspects of an evaluation, including the person’s relatedness, affect, and degree of cooperation, may be especially
difficult to convey. Because many of the diagnostic categories require that other
possibilities have been excluded first,357 the absence of pertinent negative information (e.g., the person does not abuse substances) can restrict the ability to make
definitive diagnoses. Moreover, to the extent that the data available to the expert
have been shaped by someone with an interest in the outcome of the case, as
when an expert testifies in sole reliance on information in a hypothetical question
that is designed to mirror the defendant’s or plaintiff’s situation, these problems
are compounded.
Thus, the major professional organizations in forensic mental health agree
that evidence based on sources other than a direct evaluation of the person should
be framed with due regard for its limitations and that those limitations should be
made clear in reports or testimony by the expert. Failure to do so may represent
unethical behavior on the part of the expert witness358 and should probably cast
doubt on the credibility of the evidence presented.
2. Did the evaluee cooperate with the assessment?
Even when a direct evaluation has taken place, the degree of cooperativeness
of the person may affect the validity of the data obtained.359 Civil plaintiffs and
criminal defendants have obvious reasons to distrust experts who are examining
them on behalf of adverse parties, and may be less than forthcoming in such interactions. However, even when an evaluation is being conducted by an expert hired
by the person’s own attorney, his or her cooperativeness may be limited by the
symptoms of the disorder. For example, the person who is experiencing paranoid
delusions may be suspicious and fearful even of an expert with whom his or her
attorney encourages cooperation (indeed, even of the attorney). As a consequence,
it is important for the expert to clarify, in the presentation of the evidence and
357. For example, DSM-IV-TR criteria for Major Depressive Episode require both that the
symptoms on which a diagnosis is based not be due to the direct physiological effects of a drug (licit or
illicit) that has been ingested or to a general medical condition; and that they not be better accounted
for by a diagnosis of Bereavement after the death of a loved one. DSM-IV-TR at 356. Other major
diagnostic categories carry similar requirements to rule out the possibility that the person’s presentation
is due to other causes before making the diagnosis in question.
358. For one highly publicized case of a psychiatric expert witness who was expelled from the
American Psychiatric Association on these grounds, see Ron Rosenbaum, supra note 216; Estelle v.
Smith, 451 U.S. 454 (1981).
359. Melton et al., supra note 28, at 46.
879
Copyright © National Academy of Sciences. All rights reserved.
Reference Manual on Scientific Evidence: Third Edition
Reference Manual on Scientific Evidence
conclusions based on the evaluation, the extent to which the evaluee cooperated
with the examination process.
3. Was the evaluation conducted in adequate circumstances?
Mental health evaluations often involve discussions of sensitive material, including histories of abuse, use of illegal substances, sexual practices, intimate fears and
fantasies, and potentially embarrassing symptoms. Although some persons may be
reluctant to speak freely about these issues with an evaluator whom they barely
know—and who may reveal this information in the courtroom—the reassurance
that they are talking with a mental health professional often substantially mitigates
those concerns.360 However, when the evaluation takes place in a setting that is
less than private, the likelihood of such disclosures is reduced.361 This is often
a problem in correctional institutions, where interviews may take place where
guards or other inmates can overhear them. Medical hospitals are another location
where privacy may be compromised, with nursing staff or other patients nearby.
Even if no one is within earshot, interview sites that are noisy or subject to other
distractions may interfere with the evaluee’s ability to attend to the questions
and respond accurately; this can be a particular problem for people with mental
disorders that may impair concentration and attention. Whenever possible, a competent evaluator tries to obtain a venue that is free of these intrusions, and when
it is not possible, the situation should be noted as a limitation on the completeness
of the evaluation in the report or testimony.
Attorneys sometimes ask to sit in on the evaluation. Their presence can raise
similar concerns, even when they are representing the person being evaluated,
because the type of information discussed in a mental health evaluation may be quite
different from what a client usually discloses to an attorney.362 Particularly when the
examination is being conducted by an expert for an adverse party, attorneys may
be tempted to object to questions or to signal the person regarding their answers.
Thus, if an attorney is present, as will sometimes be unavoidable, the ground rules
should include having the attorney sit out of the line of sight of the evaluee and
not interrupt the examination. An alternative is to have the evaluation audiotaped
or videotaped, a technique that some experts now use routinely. Empirical data on
360. Indeed, a considerable literature exists on the question of whether evaluees may too easily
be induced to speak frankly with someone who is introduced as a mental health professional, but whose
role is very different than would obtain in treatment settings and who may reach opinions adverse to
the person’s interests. See, e.g., Daniel Shuman, The Use of Empathy in Forensic Evaluations, 3 Ethics &
Behav. 289 (1993); Strasburger et al., supra note 348; Greenberg & Shuman, supra note 348.
361. Melton et al., supra note 28, at 47. Distraction can be a particular problem when formal
psychological tests are used; see, e.g., Kirk Heilbrun, The Role of Psychological Testing in Forensic
Assessment, 16 Law & Hum. Behav. 257 (1992).
362. Robert I. Simon, “Three’s a Crowd”: The Presence of Third Parties During the Forensic
Psychiatric Examination, 27 J. Psychiatry & L. 3 (1999); Robert L. Goldstein, Consequences of Surveillance
of the Forensic Psychiatric Examination: An Overview, 145 Am. J. Psychiatry 1234 (1988).
the impact of taping on evaluees’ willingness to be forthcoming are lacking, but
experienced forensic examiners have expressed the view that evaluees rapidly adjust
to the recording equipment, with little impact on the evaluation.363
A final consideration is the time available for the examination.364 Time constraints may result from correctional rules (e.g., prisoners are only available during
given periods of time), medical illnesses or mental disorders (e.g., the evaluee has
limited strength or attention), or limitations on resources (e.g., the party employing the expert only has funds for a certain number of hours of work). Appropriate
duration of a direct examination is difficult to specify for all situations. It is likely
to depend on the question being asked, the complexity of the person’s history and
presentation, and the person’s degree of cooperation with the evaluation. Needless to say, the duration of an examination, standing alone, is not a good indicator
either of its quality or of the validity of the conclusions that were drawn. However, an expert should be able to assess the time necessary to perform an adequate
evaluation and, if sufficient time is not available, should indicate the limitations
on the resulting opinions that are offered.
4. Were the appropriate records reviewed?
The importance for the evaluator of having access to the person’s records will
vary somewhat depending on the legal question being addressed, but can often
be critical to the validity of the evaluation.365 When retrospective assessments
are being conducted—for example, an evaluation of a defendant’s state of mind
at the time of a crime that occurred months to years before the examination, or
an assessment of a person’s capacity to enter into a contract at some distant prior
date—reviewing contemporary or nearly contemporary records can provide crucial insights into the person’s symptoms and functioning at that time. However,
even when contemporaneous function or future behavior is being assessed, having
access to available records may still be of great importance. Because distinctions
between mental disorders can depend in part on the pattern of symptoms over
time, accurate diagnosis often is dependent on having a view of the person’s prior
psychiatric history.366 In addition, when malingering is a consideration, as it will
frequently be, the consistency of the person’s presentation over time can be an
important datum in the assessment.367 And given that past behavior is generally the
363. AAPL Task Force, Videotaping of Forensic Psychiatric Evaluations, 27 J. Am. Acad. Psychiatry
& L. 345 (1999).
364. Melton et al., supra note 28, at 47.
365. Kirk Heilbrun et al., supra note 354; see also discussion in Section I.C.3.f, supra.
366. Diagnosis and subcategorization of bipolar disorder, for example, is dependent not only
on assessing the person’s current symptoms—whether manic or depressed—but also on ascertaining
whether mania or depression was present in the past if it is not apparent at present. See DSM-IV-TR
at 388–89.
367. See generally Section I.C.5, supra.
best predictor of future behavior, especially where violence is concerned, knowledge of a person’s previous history can be essential for predictions of reasonable
accuracy.368 Thus, regardless of the focus of the evaluation, an effort should be
made to obtain all relevant available records.
Which records are relevant will depend somewhat on the nature of the legal
question being asked.369 Whenever possible, records of past mental health evaluations or treatment should be obtained. Medical records often contain information
about patients’ psychiatric symptoms, alcohol and drug use, and functional levels,
and thus can be useful as well. Light can be shed on both patterns of symptoms and
functional impairment by educational, work, and military records. Educational
records may be especially helpful where disorders of early onset are suspected, and
work and military records are often illuminating when occupational disability is
at issue. In criminal cases, particularly those involving assessments of the defendant’s state of mind at the time of the crime, police records can often be valuable,
including interviews with witnesses or the defendant, and the results of physical
evaluations—including pictures—of the crime scene. It can be helpful to compare the data obtained by these means with the defendant’s own accounts of the
episode that led to the arrest. Diaries or other accounts written by the person
whose mental state is at issue are sometimes available and, to the extent that they
were generated prior to the initiation of legal proceedings, can be enlightening
regarding the person’s state of mind and motivation, the influence of third parties,
and the like. When there has been previous litigation involving the person being
evaluated, depositions or transcripts of testimony can be helpful for information
about state of mind and factual data.
5. Was information gathered from collateral informants?
In addition to reviewing records, interviewing informants with relevant data can
provide important perspectives on the person being evaluated.370 Family members and friends, including coworkers, often can report on patterns of behavior
indicative of symptoms of mental disorder or of functional impairment. They
may know about prior treatment for mental disorders, including hospitalization,
or histories of involvement with the criminal justice system. Current or former
therapists can share useful impressions of diagnosis and comment on levels of function, although to the extent that their interactions with the person are subsumed
under a psychotherapist–patient or physician–patient privilege, and do not fall
under one of the exceptions in that jurisdiction, it may not be possible to contact
them without the person’s consent. Witnesses to an alleged crime or workplace
368. See generally Section I.E.1.a, supra.
369. See, e.g., Deborah Giorgi-Guarnieri et al., AAPL Practice Guideline for Forensic Psychiatric
Evaluation of Defendants Raising the Insanity Defense, 30 J. Am. Acad. Psychiatry & L. 22 (Supplement)
(2002).
370. Heilbrun et al., supra note 354.
harassment can similarly round out a picture of the person and help to confirm
or disconfirm the evaluator’s impressions. Access to collateral informants may be
complicated by legal restrictions or, if they are close to the person being evaluated, by their reluctance to speak to an expert working for an adverse party. When
contact does occur, the assessor needs to take into account possible distortions by
the informant in the service of helping, or sometimes of harming, the interests
of the person who is the subject of the evaluation.
6. Were medical diagnostic tests performed?
Dualistic views of human behavior, in which mind and body are seen as distinctly
separate entities, have been rejected by scientists who study thought and behavior, and clinicians who treat mental disorders.371 The relevant fields, including
cognitive science, neuroscience, psychology, psychiatry, and philosophy, now
acknowledge the brain as the seat of mentation and behavior, and recognize that
all mental phenomena, including abnormal mental states, result from perturbations
in the function of the brain. At some level, there must be a physical concomitant
of every mental phenomenon, and sometimes the physical influences on abnormal
behavior are gross enough to be detected by existing techniques, which may reveal
potentially treatable conditions. Thus, identification of the causes of abnormal
thought or behavior and formulation of a diagnosis may require an evaluation of
the person’s physical state, along with the mental state.372 If there is any reason to
suspect that an identifiable general medical disorder lies at the root of the person’s
condition (e.g., a sudden and unprecedented appearance of symptoms, disproportionate impairment of aspects of cognitive function), medical testing, including
EEGs and imaging studies, may be indicated.373
371. See, e.g., DSM-IV-TR, supra, at xxx, “the term mental disorder unfortunately implies a
distinction between ‘mental’ and ‘physical’ disorders that is a reductionistic anachronism of mind-body dualism.” See also Kenneth S. Kendler, Toward a Philosophical Structure for Psychiatry, 162 Am. J.
Psychiatry 433 (2005).
372. See generally Section I.C.3.e, supra.
373. Identification of structural or electrical abnormalities, however, does not necessarily imply
that they impaired the person’s functioning or were responsible for the person’s behavior. For
discussion of a well-known case in which this issue was raised, see Stephen Morse, Brain and Blame,
84 Geo. L.J. 527 (1996). For a more general discussion of the introduction of findings of abnormalities
demonstrated on brain imaging in court, see Dean Mobbs et al., Law, Responsibility and the Brain, 5
PLoS Biology 693 (2007). Moreover, as with structural findings, the mere presence of a functional
abnormality is not sufficient to establish a causal link to the person’s mentation or behavior. Growing
legal and neuroscience literatures are being generated on the use of functional imaging data in court.
See, e.g., Neal Feigenson, Brain Imaging and Courtroom Evidence: On the Admissibility and Persuasiveness
of fMRI, 2 Int’l J.L. Context 233 (2006); Hal S. Wortzel et al., Forensic Applications of Cerebral Single
Photon Emission Computed Tomography in Mild Traumatic Brain Injury, 36 J. Am. Acad. Psychiatry &
L. 310 (2008).
7. Was the evaluee’s functional impairment assessed directly?
As previously discussed, mental health evidence will often focus on the extent to
which a person is capable of performing a particular task or set of tasks, that is,
testimony will relate to a person’s impairment on one or more functional abilities.374 Sometimes an evaluator will be able to infer from an examination of the
person’s mental state and information from other sources whether the person is or
was capable of performing the task at hand (e.g., standing trial, returning to work,
managing property). However, another option for evaluation exists, namely direct
assessment of the relevant function.375 Where a functional ability that relates to
a discrete task or set of tasks is at issue, a competent evaluator should have considered direct assessment of performance on those tasks and be able to explain a decision not to use such a technique. It should be noted, though, that conclusions
drawn even from direct assessments of function involve a degree of inference.
A person claiming occupational impairment as a result of anxiety induced by
longstanding harassment on the job, for example, might respond very differently
to the demands of a work-related task in the actual workplace compared with
the safe confines of a mental health professional’s office. Therefore, when actual
observation of functional capacity is employed, the evaluator should be prepared
to comment on the ecological validity of the test, that is, the degree to which the
environment in which the test took place resembled the real-world environment
in the person’s life.376 Although observations in very different settings may have
some value as part of the broader dataset available in an evaluation, they do not
carry the same weight as conclusions reached in environments similar to those at
issue in the case.
8. Was the possibility of malingering considered?
In almost every mental health evaluation for legal purposes, the person being
evaluated has an incentive to exaggerate or confabulate symptoms or to distort
the impact of actual symptoms on his or her functional abilities.377 Thus, the possibility of malingering should be considered by the evaluator in every assessment.
Techniques for detecting malingering are described above.378 Although such
374. See generally Section I.D, supra.
375. See Section I.D.2.b, supra.
376. Additional issues related to the use of functional tests are discussed in Section II.C, infra.
377. There are situations in which the incentive runs in the opposite direction. For example,
a defendant facing relatively minor charges for whom an evaluation of competence to stand trial was
ordered may have every reason to minimize his or her level of symptoms, preferring to go to trial
rapidly rather than spend an extended period of time in a psychiatric facility being treated to restore
competence. A second example is a defendant whose risk for violence is being evaluated prior to a bail
hearing, who also has a powerful incentive to downplay the presence of risk factors associated with
violence and to minimize a past history of violence.
378. See Section I.C.5, supra.
techniques are not foolproof, and well-prepared evaluees can sometimes mislead
mental health professionals regarding the existence or severity of disorders, successful malingering over time is a difficult task. However, uncovering distortions
of the degree of actual symptoms or exaggerations of their impact is usually more
challenging than detecting wholesale invention of disorders that are not present.
Competent evaluators should be able to explain how they took into account the
possibility of malingering and why they believe that their conclusions are valid and
to acknowledge that their degree of certainty can never be absolute.
C. Was a Structured Diagnostic or Functional Assessment
Instrument or Test Used?
Notwithstanding the advantages of structured assessment techniques, they raise a
set of concerns that must be addressed to determine their relevance to the question
at issue and the weight that should be given to their results.
1. Has the reliability and validity of the instrument or test been established?
Reliability and validity are key concepts in test development.379 Each contains
several subcategories. Reliability refers to the reproducibility of results obtained
with a particular test. That is, it is an estimate of the precision of an assessment
technique. Interrater reliability is a measure of whether different examiners using
the same test or instrument with the same subject come out with similar results,
an important characteristic for an assessment approach that will be used by many
raters. Test-retest reliability assesses the stability of results from an instrument or test
over time; poor correspondence of results between time periods may indicate
either an unreliable technique or a condition subject to periodic changes in status.
It is an axiom of test and instrument development that good reliability is a prerequisite for having a valid assessment technique, but does not in itself guarantee
validity.
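The interrater and test-retest reliability estimates described above are routinely quantified; the sketch below illustrates one common statistic for each (Cohen's kappa for interrater agreement and the Pearson correlation for test-retest stability). All functions, raters, and scores here are invented for illustration and are not drawn from any published instrument.

```python
# Illustrative reliability statistics computed on hypothetical rating data.
from collections import Counter

def cohens_kappa(a, b):
    """Interrater reliability: chance-corrected agreement between two raters."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n               # raw agreement
    ca, cb = Counter(a), Counter(b)
    p_exp = sum(ca[c] * cb[c] for c in set(a) | set(b)) / n**2  # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)

def pearson_r(x, y):
    """Test-retest reliability: correlation of scores across two sessions."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Two hypothetical raters classifying ten evaluees: 1 = disorder present, 0 = absent
rater_a = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
rater_b = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
print(round(cohens_kappa(rater_a, rater_b), 2))  # → 0.58, only moderate agreement

# The same hypothetical symptom scale administered twice, two weeks apart
time_1 = [12, 18, 7, 22, 15, 9, 27, 14]
time_2 = [13, 17, 9, 21, 16, 8, 25, 15]
print(round(pearson_r(time_1, time_2), 2))  # → 0.99, highly stable scores
```

Note that a low test-retest figure would be ambiguous in exactly the way described above: it could reflect either an unreliable technique or a genuinely fluctuating condition.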
Validity connotes the degree to which an instrument or test yields results
that accurately reflect reality. Construct validity refers to the extent that an instrument or test reflects the theoretical construct that it purports to measure (e.g.,
anxiety or depression). Elements of construct validity include discriminant validity,
which is the degree to which the test distinguishes between related conditions or
states, and convergent validity, the extent to which the results of this test resemble
results of other instruments that assess the same or a similar construct. Content
validity describes the adequacy or thoroughness with which a test has sampled the
variables associated with a given domain (e.g., does a measure of ability to work
assess all relevant aspects of a given occupation?). Finally, predictive validity denotes
379. For the discussion in the following two paragraphs, see generally American Psychological
Association, Standards for Educational and Psychological Testing (1999).
the ability of an instrument or test to foretell a person’s condition or behavior at
some point in the future.
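Predictive validity of the kind just described is often summarized with statistics such as sensitivity (the proportion of eventual positive outcomes the test flagged) and specificity (the proportion of eventual negatives it correctly cleared). The sketch below is a hedged illustration; the cutoff, scores, and outcomes are all invented, and real instruments report such figures in their validation studies.

```python
# Hypothetical illustration of predictive validity via sensitivity/specificity.

def sensitivity_specificity(scores, outcomes, cutoff):
    """Treat score >= cutoff as a positive prediction; compare with later outcomes."""
    tp = sum(s >= cutoff and o for s, o in zip(scores, outcomes))      # correctly flagged
    fn = sum(s < cutoff and o for s, o in zip(scores, outcomes))       # missed
    tn = sum(s < cutoff and not o for s, o in zip(scores, outcomes))   # correctly cleared
    fp = sum(s >= cutoff and not o for s, o in zip(scores, outcomes))  # falsely flagged
    return tp / (tp + fn), tn / (tn + fp)

# Invented risk-scale scores at evaluation, and whether the predicted
# behavior was in fact later observed
scores = [12, 30, 25, 8, 19, 28, 15, 33, 10, 22]
outcomes = [False, True, True, False, True, True, False, False, False, True]

sens, spec = sensitivity_specificity(scores, outcomes, cutoff=20)
print(sens, spec)  # → 0.8 0.8
```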
When the results of an evaluation using an instrument or test are offered
in evidence, clarification of the extent to which reliability and validity have
been demonstrated is an essential aspect of determining admissibility and weight.
Indeed, based on its discussion in Daubert, when the U.S. Supreme Court referred
to the “reliability” of a scientific technique, it was encompassing both reliability
and validity as usually understood in the social sciences.380 Which aspects of reliability and validity are relevant to a particular case will depend on the purpose for
which the data from the test are being introduced. For example, if the evidence
is addressing change in a person’s test results over time, a measure’s test-retest
reliability becomes crucial. If more than one evaluator was involved, interrater
reliability may be key. Discriminant validity will be relevant when two states or
conditions must be distinguished from each other and predictive validity when
forecasts of future mental state or behavior are being made. Careful evaluators will
only use instruments or tests that have had the relevant types of reliability and
validity confirmed in peer-reviewed publications and will be prepared to cite such
data should questions be raised. Of course, some tests are so widely used over a
sustained period that their reliability and validity are generally accepted (e.g., the
MMPI-2) and do not ordinarily need to be demonstrated again prior to introducing data based on an evaluation in which they were employed. However, the reliability and validity of some longstanding tests (e.g., the Rorschach ink-blot test)
remain controversial,381 and data even from established tests can be used to reach
conclusions of uncertain validity. Thus, novel uses of instruments or tests may also
require that their psychometric characteristics for that purpose be demonstrated.
2. Does the person being evaluated resemble the population for which the
instrument or test was developed?
Reliability and validity, once established, are not necessarily universally applicable.
If an assessment technique is being used on someone drawn from a different
population than the one for which the instrument or test was developed, and the
new group is likely to differ in some material way, reliability and/or validity may
need to be reestablished. An example with regard to reliability might be the use
with a child of an instrument that was developed to measure symptoms of mental
disorders in adults.382 Either the nature of the symptoms that adults experience or
380. Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579, 589 (1993).
381. Lilienfeld et al., supra note 127.
382. The frequently differing presentations of mental disorders in children have led to the
development of instruments intended specifically for use in that population. See, e.g., David Shaffer et
al., NIMH Diagnostic Interview Schedule for Children, Version IV (NIMH DISC-IV): Description, Differences
from Previous Versions, and Reliability of Some Common Diagnoses, 39 J. Am. Acad. Child & Adolescent
Psychiatry 28 (2000).
the ability of adults to describe their symptoms could be substantially different with
children, leading to greater difficulty in applying the instrument or test. Thus, it
might be prudent for an evaluator to ascertain that data exist showing good reliability in this new population before using this assessment approach. An example
involving validity is the use of predictive scales, such as instruments to assess risk
of future violence, with a different group than the one from which the predictive
algorithm was derived.383 Concretely, if a predictive test is based on a criminal, but
nonmentally disordered sample, applying it to persons with mental disorders—for
whom very different variables may affect their behavior—is dubious in the absence
of data demonstrating that it is valid in the latter group and vice versa.
It should be emphasized, however, that reestablishing reliability and validity is
only necessary when the original group and the new population are likely to differ
in some relevant way. Why an instrument developed in California, for example,
would not be as reliable and valid when used in Texas is not at all clear. Moreover,
the nature of the instrument or test will play a role. Diagnostic tests are likely to
differ in their characteristics across populations only if the disorders or the ways
in which they manifest themselves are different, which will not usually be the
case. Predictive tests, however, may be more sensitive to cultural, socioeconomic,
geographic, and other considerations that could introduce new predictors of future
conditions or behaviors into the mix. In addition, tests that involve comparisons
with broader populations are said to be “normed” against those groups,384 and the
comparative data (e.g., the evaluee is in the lowest quartile of performance) may
be invalid unless the test is renormed for the group of which the person being
evaluated is a member. Thus, whether additional reliability and validity testing is
required for a new use, or whether a test must be renormed before being used in
this way, is necessarily a fact-specific determination.
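The effect of norms on comparative statements like "lowest quartile" can be made concrete with a small sketch. The normative scores below are invented; the point is only that the identical raw score maps to very different percentile ranks depending on which reference group supplies the norms.

```python
# Hypothetical illustration of why the normative reference group matters.

def percentile_rank(score, norms):
    """Percentage of the normative sample scoring at or below `score`."""
    return 100.0 * sum(s <= score for s in norms) / len(norms)

raw_score = 24

# Invented norms from the population the test was developed on
original_norms = [10, 14, 17, 19, 21, 23, 26, 29, 33, 38]
# Invented norms from a different population that tends to score higher
other_norms = [22, 25, 28, 30, 33, 35, 38, 41, 44, 47]

print(percentile_rank(raw_score, original_norms))  # → 60.0 (above average)
print(percentile_rank(raw_score, other_norms))     # → 10.0 (lowest decile)
```

The same evaluee would thus be described as average against one set of norms and markedly impaired against the other, which is why renorming may be required before comparative conclusions are drawn.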
3. Was the instrument or test used as intended by its developers?
Established reliability and validity are necessary but not sufficient to determine whether an instrument or test has yielded reliable and valid results. Unless
the assessment approach was applied in the manner intended by the developers,
the data on reliability and validity may simply not be applicable to a particular
use. Three possible areas of deviation relate to training in, administration of, and
scoring of the assessment tool.
a. Training
Some instruments and tests are so straightforward in their use that little or no training is required. Reading the instructions accompanying the assessment tool might be
383. See, e.g., John Monahan et al., The Classification of Violence Risk, 24 Behav. Sci. & L. 721
(2006).
384. For a good discussion of norming in the forensic context, see Grisso, supra note 2, at 56–59.
sufficient. In some cases, though, training may be required to ask the questions properly, especially when followup probing of responses is necessary or when evaluees
are asked to perform tasks that must be conducted in a particular way. Diagnostic
instruments, in particular, may have complex “skip-out” rules, that is, procedures
for determining when to include or omit certain questions based on the person’s
responses to previous questions.385 When information is acquired at least in part
from existing records, rather than from the evaluee directly, rules may exist for how
the information should be identified and abstracted. All of these characteristics of an
assessment approach may require elaborate training for proper implementation.386
Sometimes the training can be acquired from test manuals, but for more complex
instruments or tests, face-to-face training with an opportunity to practice administration is necessary. Developers of such instruments or tests may offer such training
in 1-day or multiday seminars that professionals can arrange to take.387 Thus, a
key question in assessing data based on an instrument or test is whether proper use
requires special training, and if so, whether the assessor was trained in the technique.
b. Administration
Even if training was obtained, the reliability and validity of an instrument or test
will depend on whether the assessor administered the test in the proper way. Many
assessment tools require that questions be asked in a given sequence and that they
be phrased in a particular way. After an incorrect response, it may be permissible
to ask the question again, but only a certain number of times. Probing of responses
may be needed, but only certain probes may be permitted. Some tests are timed,
with a given period allotted for the completion of a particular task. Deviations
from any of these requirements could make the published data on the psychometric
characteristics of the tool inapplicable to its use in a particular instance. Thus, a
second crucial question is whether the instrument or test was administered in the
same way as it was when its reliability and validity were established.
c. Scoring
Assessment tools generally require that evaluees’ responses be scored in some
way. For some instruments and tests, the scoring is simple and self-evident, for
example, the number of positive responses is totaled to yield the score for the test,
or evaluees themselves are asked to indicate the severity of their symptoms on a
385. The Diagnostic Interview Schedule, which is widely used in epidemiological studies of
mental disorders in the United States, is an example. See a description of the latest version of the
instrument at http://epi.wustl.edu/CDISIV/dishome.aspx.
386. Indeed, some psychological and neuropsychological tests should be administered only by
psychologists trained in their use.
387. The creator of the popular Psychopathy Check List (PCL-R), for example, offers an
extensive training program for clinicians and researchers desiring to learn proper administration of the
instrument. See the Web site at http://www.hare.org/training/.
1-to-7 scale. Or the results could be calculated by a computer program that automatically applies the relevant algorithm, generates statistical data, and even draws
comparisons with broader groups, such as the general population or persons with a
particular disorder. Often, however, particularly when evaluees’ verbal or narrative
responses are elicited, more complex scoring rules exist. An instrument assessing
the severity of symptoms, for example, may require the person administering it
to categorize responses along a numerical scale,388 and specific capacity assessment
tools frequently require similar judgments to be applied.389 Published data on the
reliability of scoring may indicate that it is possible for an instrument to be scored
in the same way by many different raters, but unless the person administering the
instrument in this particular circumstance adheres to the usual rules, the results of
the evaluation may not be comparable to those that would be obtained by another
rater and may be invalid as well. Hence, a third important question when such
evidence is introduced deals with whether the rules for scoring responses were
properly applied.
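The two simple scoring schemes mentioned at the start of this subsection can be sketched as follows; the items, responses, and scale bounds are invented for illustration and correspond to no published instrument.

```python
# Hypothetical examples of simple, self-evident scoring rules.

def score_yes_no(responses):
    """Total the number of positive ("yes") responses."""
    return sum(1 for r in responses if r == "yes")

def score_likert(ratings, low=1, high=7):
    """Sum self-rated severity on a bounded scale, rejecting out-of-range values."""
    for r in ratings:
        if not low <= r <= high:
            raise ValueError(f"rating {r} falls outside the {low}-to-{high} scale")
    return sum(ratings)

symptom_screen = ["yes", "no", "yes", "yes", "no"]  # invented responses
severity_items = [3, 5, 2, 7, 4]                    # invented 1-to-7 self-ratings

print(score_yes_no(symptom_screen))  # → 3
print(score_likert(severity_items))  # → 21
```

This illustrates only the "simple and self-evident" end of the spectrum; the more complex, judgment-laden scoring the text goes on to describe cannot be reduced to a few lines of arithmetic, which is precisely why adherence to the published scoring rules matters.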
D. How Was the Expert’s Judgment Reached Regarding the
Legally Relevant Question?
In evaluating testimony from mental health experts, as noted in the preceding
sections, their training and the manner in which they conduct their assessments is
vital information. However, the value of an expert’s opinion also depends on the
process by which the data were assessed and a conclusion was reached.
1. Were the findings of the assessment applied appropriately to the question?
a. Were diagnostic and functional issues distinguished?
Mental health professionals without experience in performing particular forensic
evaluations may fail to recognize that the legal question being asked deals with a
person’s functional capacity, not with some aspect of their clinical state per se.390
As a result, they may mistakenly base their opinions on the presence of a particular diagnosis or symptom cluster rather than on the person’s capacity to perform
in the legally relevant manner. Studies over many years indicate that this has
occurred frequently in testimony regarding defendants’ competence to stand trial,
in which experts often conflated the presence of psychosis with incompetence,
and concluded that any psychotic defendant was ipso facto incapable of proceeding to trial.391
388. E.g., the Brief Psychiatric Rating Scale. See Overall & Gorham, supra note 122.
389. E.g., the MacArthur Competence Assessment Tool for Treatment; Thomas Grisso & Paul S. Appelbaum, MacArthur Competence Assessment Tool for Treatment (MacCAT-T) (1998).
390. See Dusky v. United States, 362 U.S. 402 (1960); Thomas Grisso, Competency to Stand Trial Evaluations: A Manual for Practice 1–23 (1988).
Similar problems may occur in hearings on guardianship or contests
regarding testimonial capacity, where the person’s ability to manage or dispose
of assets might be thought incorrectly to turn solely on the clinical question of
whether dementia is present, as opposed to the legal issue of whether the person
retains the necessary capacities despite his or her condition.392 This problem may
be more likely to occur—and to go undetected—when experts are allowed or
encouraged to address the ultimate legal issue in their testimony.393 When experts
are permitted to testify to the ultimate question, the importance of probing their
reasoning is magnified.394 Experts can be asked to identify the relevant functional
capacities and to speak directly to the impact of the person’s mental state on those
capacities.395 That allows their reasoning processes and the correctness of their
assumptions about the relevant functional standard to be tested.
b. Were the limitations of the assessment and the conclusions acknowledged?
Most assessments are imperfect. Evaluees are less than cooperative. Records are
unavailable. Evidence from witnesses is conflicting. Inadequate time is available. Or the evaluator may simply have forgotten to ask about some piece of
information that would have been helpful. Experts should be able to identify the
limitations of their evaluations, and the possible impact of those less-than-optimal
aspects of the assessments. It is unlikely that an expert would be prepared to offer
testimony if he or she believed that the limitations rendered the opinions invalid.
But competent experts should be able to explain why, despite the limitations
(which can occur even in the best evaluations by the most experienced experts),
their evaluations were adequate to allow them to draw the conclusions that they
intend to present.
A comparable set of limitations can occur when conclusions are drawn and
opinions formulated. Just as all assessment tools have error rates, so do expert witnesses, although their rates are difficult to subject to statistical analysis. Errors may
be introduced by inadequacies in the data available or the uncertainties inherent
in particular determinations, especially predictions of future mental states and
behaviors. As noted above, it is often impossible to specify the contingencies that
may arise in a person’s life that could influence their mental states and actions.
Thus, any prediction, no matter how firmly grounded in available data, has a
391. See, e.g., A. Louis McGarry, Competence for Trial and Due Process Via the State Hospital, 122
Am. J. Psychiatry 623 (1965). More recent studies suggest that this is now a less common problem, as
educational efforts among mental health professionals who do such work have had a positive impact.
Robert A. Nicholson & Karen E. Kugler, Competent and Incompetent Criminal Defendants: A Quantitative
Review of Comparative Research, 109 Psychol. Bull. 355 (1991).
392. See Parry & Drogin, supra note 8, at 149–51.
393. See Section I.G.2, supra.
394. See Parry & Drogin, supra note 8, at 429–31.
395. Buchanan, supra note 319.
degree of uncertainty attached to it that a competent expert should be expected
to acknowledge.
c. Are opinions based on valid empirical data rather than theoretical formulations?
From the development of Freud’s theories in the late nineteenth and early twentieth centuries until the present, many mental health professionals have based
their clinical approaches on psychoanalytically inspired concepts. Some of these
concepts have been confirmed scientifically (e.g., the existence of unconscious
mental states), whereas others have not (e.g., dreams always represent the fantasied
fulfillment of wishes). Although psychoanalytical theories and the psychodynamic
psychotherapies that derive from them have declined in popularity in recent
decades, many mental health professionals have received psychodynamic training
and use the concepts they have learned to assess and treat their patients. Regardless of the possible utility of these theories from a clinical perspective, which is
controversial and may depend on the condition being treated, they are arguably
more problematic when they serve as the basis for conclusions offered as part of
legal proceedings. Nor are psychoanalytical theories the only ones that mental
health professionals use; alternative approaches may be based on theories that have
a greater or lesser degree of empirical support.
To the extent that expert opinions are introduced to inform the judgments
of legal factfinders, it is important for them to be based, insofar as possible, on
empirically validated conclusions rather than on untested or untestable theories.
That appears to be the import of the U.S. Supreme Court’s decision in Kumho
Tire.396 As Slobogin plausibly maintains, some legal questions (such as those
concerning past mental states) may not easily lend themselves to approaches
based on scientific methods, but expert opinions may nonetheless be of assistance to the finders of fact.397 At a minimum, it would seem fair for an expert
to indicate when that is the case, so that the factfinder can make an informed
judgment about the appropriate degree of reliance to be had on that opinion.
And when empirically tested approaches are available, it would appear to be
incumbent on an expert to use them or to be prepared to explain why they
were not employed.
396. Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999) (holding that the Daubert standard
for admitting expert testimony also applies to nonscientists).
397. Christopher Slobogin, supra note 305.
III. Case Example
A. Facts of the Case
John, a 25-year-old Army veteran who saw combat in Iraq, had begun to have
anomalous experiences in the 4 years since his discharge from active duty. At first,
he believed that people were staring at him, though he was not sure why. Later,
he came to the conclusion that they thought he was a drug addict or a criminal,
ideas confirmed when he heard voices coming through the walls of his apartment,
which he attributed to the neighbors, saying, “He’s using drugs” and “He steals
things.” To avoid people’s stares, John left his apartment less often, spending most
of his time listening to loud music, which helped to drown out the voices. He
also found that alcohol made it easier to ignore the voices, and began to drink up
to a gallon of wine each day.
One evening when the voices were particularly loud and insistent, he began
banging on the walls of his apartment and yelling that he would kill the neighbors
if they did not stop talking about him. Thirty minutes later, the police arrived to
take him to the local Department of Veterans Affairs (VA) hospital, where he was
admitted to the psychiatric unit. Over the course of his hospitalization, he received
antipsychotic medication and participated in group therapy. By the end of his
hospital stay, although he still wondered whether people were staring at him
oddly, he no longer heard people’s voices making derogatory statements about
him. He denied having thoughts of hurting himself and other people. When asked
whether he would continue taking his medication and would attend outpatient
sessions, he said he would. Fourteen days after admission, John was discharged to
outpatient care.
Immediately after discharge, John stopped his medication, and he never saw
his outpatient therapist. As he became more suspicious of his neighbors, he again
began to hear them talking about him, and he resumed drinking several bottles
of wine each day to deal with the situation. Three weeks after discharge, while
he was on his way to the grocery store to pick up more wine, a passerby accidentally bumped into John. Reacting with fury, John pummeled the older man
with his fists, then began beating him with a broomstick that he found on the
sidewalk nearby. It took four people who lived in nearby buildings to pull John
off his victim.
In the wake of the assault, the victim brought suit against the VA for negligence in John’s treatment. The suit alleged that VA mental health staff should
have known that John was dangerous as a result of his mental disorder and not
fit for discharge. Damages were claimed as a result of physical injuries and the
development of PTSD.
B. Testimony of the Plaintiff’s Expert on Negligence
At trial, the plaintiff introduced testimony from a board-certified forensic psychiatrist, Dr. A, who was 20 years out of residency training and had not directly
treated patients for the past 13 years. Dr. A had reviewed the medical records of
John’s treatment and the police records of the assault, but he had not examined
John directly. On direct examination, he testified that John had a diagnosis of
schizophrenia, with a number of risk factors for violence, including having killed
enemy combatants in Iraq, excessive alcohol consumption, and delusions of persecution. It was Dr. A’s opinion that the VA treatment team had failed to abide
by the standard of care because they had not used a structured violence risk-assessment instrument to determine John's dangerousness. Moreover, although
they had obtained a CT brain scan that had shown frontal lobe injury from an
old automobile accident, the team had failed to recognize that this constituted
an additional risk factor for violence. However, Dr. A believed that, even on the
basis of the available information, at the time of hospital discharge it was reasonably foreseeable that John would be violent, and thus he should not have been
allowed to leave the hospital.
C. Questions for Consideration
1. Given that Dr. A had devoted himself entirely to forensic evaluations and
had not actually treated a patient for 13 years, should he have been considered qualified to offer opinions about whether John’s evaluation and
treatment had conformed to the standard of care?
2. How reliable were Dr. A’s conclusions regarding John’s diagnosis and likelihood of committing an act of violence, given that he did not examine
John or speak directly to anyone who had been in contact with him, but
relied solely on hospital and police records?
3. What information would be needed to determine whether the failure to
use a structured violence risk-assessment tool should be considered evidence of negligence? What information would be needed to determine
whether the alleged failure to recognize the relationship between CT
evidence of frontal brain damage and the risk of violence should be considered evidence of negligence?
4. Is the assertion that John’s violence was reasonably foreseeable sufficient
to establish a prima facie case for the plaintiff? If not, what type of data
should Dr. A have presented to support his testimony?
D. Testimony of the Plaintiff’s Expert on Damages
A second expert, Dr. B, a clinical psychologist in general clinical practice, offered
testimony on the mental health consequences of the assault. Dr. B had been treating the victim prior to the assault and had been seeing him weekly for cognitive
behavioral therapy since the assault. She testified that the patient described having intrusive thoughts about the attack, nightmares, difficulty concentrating, and
startle responses when people came near him without his having noticed them.
He also felt overwhelming anxiety walking down the street where the attack
had occurred. Dr. B diagnosed the victim as suffering from PTSD and had used
a structured assessment tool to help make the diagnosis. On cross-examination,
she admitted that she had only seen three or four cases of PTSD in her 5 years
of practice and that the diagnosis was based entirely on the victim’s report of
his symptoms. Although she had not considered the possibility that the victim
was malingering, she considered it very unlikely. Because of his symptoms, she
concluded to a reasonable degree of psychological certainty that he was disabled
from working in his job as a middle manager for a utility company. On cross-examination, she admitted that she did not know exactly what his job entailed
and had not determined how each of his symptoms might interfere with his
work—but she nonetheless believed that normal work performance was not possible given his condition.
E. Questions for Consideration
1. Should Dr. B be qualified as an expert with regard to the damages suffered
by the plaintiff?
2. To what extent should the following considerations affect the weight given
to Dr. B’s testimony:
a. Dr. B had been treating the plaintiff prior to the attack, and continued
to treat him afterward.
b. Dr. B has seen only three or four cases of PTSD in her practice.
c. Dr. B’s diagnosis was made on the basis of the patient’s self-report,
without corroboration from collateral informants, and she had not
considered the possibility that he might be malingering.
3. What information regarding the structured assessment tool that was used
in making the diagnosis of PTSD would be needed to determine whether
the results of the assessment should be admissible?
4. Was an appropriate evaluation done with regard to the extent of the victim’s work disability? If not, what additional information should have been
obtained and by what means? Should the testimony as offered have
been admissible?
References on Mental Health Diagnosis and
Treatment
American Psychiatric Association, American Psychiatric Association Practice
Guidelines for the Treatment of Psychiatric Disorders: Compendium 2006
(2006).
American Psychiatric Association, Diagnostic and Statistical Manual of Mental
Disorders DSM-IV-TR (4th ed. Text Rev. 2000).
American Psychiatric Publishing Textbook of Clinical Psychiatry (Robert E.
Hales et al. eds., 5th ed. 2008).
Kaplan and Sadock’s Comprehensive Textbook of Psychiatry (Benjamin J. Sadock
et al. eds., 9th ed. 2009).
Alan F. Schatzberg et al., Manual of Clinical Psychopharmacology (6th ed. 2007).
Stephen M. Stahl, Essential Psychopharmacology: The Prescriber’s Guide (3d ed.
2009).
References on Mental Health and Law
Paul S. Appelbaum, A Theory of Ethics for Forensic Psychiatry, 25 J. Am. Acad.
Psychiatry L. 233 (1997).
Paul S. Appelbaum & Thomas G. Gutheil, Clinical Handbook of Psychiatry and
the Law (4th ed. 2007).
Deborah Giorgi-Guarnieri et al., American Academy of Psychiatry and the Law Practice Guideline for Forensic Psychiatric Evaluation of Defendants Raising the Insanity
Defense, 30 J. Am. Acad. Psychiatry L. S1 (2002).
Thomas Grisso, Evaluating Competencies: Forensic Assessments and Instruments
(2d ed. 2002).
Gisli H. Gudjonsson, The Psychology of Interrogation and Confessions (2003).
Glenn J. Larrabee, Forensic Neuropsychology: A Scientific Approach (2005).
Gary B. Melton et al., Psychological Evaluations for the Courts: A Handbook for
Mental Health Professionals and Lawyers (3d ed. 2007).
Douglas Mossman et al., American Academy of Psychiatry and the Law Practice Guideline for the Forensic Psychiatric Evaluation of Competence to Stand Trial, 35 J. Am.
Acad. Psychiatry L. S3 (2007).
Mental Disorder, Work Disability, and the Law (Richard J. Bonnie & John
Monahan eds., 1997).
John Monahan, The Scientific Status of Research on Clinical and Actuarial Predictions
of Violence, in Modern Scientific Evidence: The Law and Science of Expert
Testimony (David L. Faigman et al. eds., 2007).
Michael L. Perlin, Mental Disability Law, Civil and Criminal (2d ed. 2002).
Retrospective Assessment of Mental States in Litigation: Predicting the Past (Robert I. Simon & Daniel W. Shuman eds., 2002).
Richard Rogers, Clinical Assessment of Malingering and Deception (3d ed. 2008).
Christopher Slobogin, Proving the Unprovable: The Role of Law, Science, and
Speculation in Adjudicating Culpability and Dangerousness (2006).
Robert M. Wettstein, Treatment of Offenders with Mental Disorders (1998).
Reference Guide on Engineering
CHANNING R. ROBERTSON, JOHN E. MOALLI, AND DAVID L. BLACK
Channing R. Robertson, Ph.D., is Ruth G. and William K. Bowes Professor, School of
Engineering, and Professor, Department of Chemical Engineering, Stanford University, Stanford, California.
John E. Moalli, Sc.D., is Group Vice President & Principal, Exponent, Menlo Park, California.
David L. Black, J.D., is Partner, Perkins Coie, Denver, Colorado.
CONTENTS
I. What Is Engineering? 899
A. Thinking About Engineering and Science, 899
B. Engineering Disciplines and Fields of Practice, 900
C. Cross-Disciplinary Domains, 900
II. How Do Engineers Think? 902
A. Problem Identification, 902
B. Solution Paradigms, 903
III. How Do Engineers Make Things? 904
A. The Design Process—How Engineers Use This Guiding Principle, 904
B. The Design Process—How Engineers Think About Safety and Risk
in Design, 908
1. What is meant by “safe”? 908
2. What is meant by “risk”? 910
3. Risk metric calculation assumptions, 912
4. Risk metric evaluation, 914
5. What is meant by “acceptable risk”? 915
C. The Design Process—Examples in Which This Guiding Principle
Was Not Followed, 920
1. Inadequate response to postmarket problems: Intrauterine
devices (IUD), 920
2. Initial design concept: Toxic waste site, 921
3. Foreseeable safety hazards: Air coolers, 922
4. Failure to validate a design: Rubber hose for radiant
heating, 922
5. Proper design—improper assembly: Kansas City Hyatt
Regency Hotel, 923
6. Failure to validate a design: Tacoma Narrows Bridge, 924
7. Failure to conform to standards and validate a design: Automotive lift, 924
8. Lack of sufficient information and collective expertise to
consummate a design: Dam collapse, 925
9. Operation outside of design intent and specifications: Space
shuttle Challenger, 926
10. Foreseeable failure and lack of design change in light of field
experience: Air France 4590, 928
IV. Who Is an Engineer? 929
A. Academic Education and Training, 929
B. Experience, 930
C. Licensing, Registration, Certification, and Accreditation, 931
V. Evaluating an Engineer’s Qualifications and Opinions, 932
A. Qualification Issues and the Application of Daubert Standards, 932
B. Information That Engineers Use to Form and Express Their
Opinions, 933
1. Observations, 933
2. Calculations, 936
3. Modeling—mathematical and computational, 936
4. Literature, 938
5. Internal documents, 938
VI. What Are the Types of Issues on Which Engineers May Testify? 939
A. Product Liability, 939
1. Design, 939
2. Manufacturing, 941
3. Warnings, 941
4. Other issues, 942
B. Special Issues Regarding Proof of Product Defect, 943
C. Intellectual Property and Trade Secrets, 945
D. Other Cases, 946
VII. What Are Frequent Recurring Issues in Engineering Testimony? 948
A. Issues Commonly in Dispute, 948
1. Qualifications, 949
2. Standard of care, 949
3. State of the art, 950
4. Best practice, 950
5. Regulations, standards, and codes, 951
6. Other similar incidents, 952
B. Demonstratives and Simulations, 956
VIII. Epilogue, 958
IX. Acknowledgments, 959
“Scientists investigate that which already is; Engineers create that which has never been.”
Albert Einstein
I. What Is Engineering?
A. Thinking About Engineering and Science
Although this is a reference manual on scientific evidence, the Supreme Court in
Kumho Tire Co., Ltd. v. Carmichael1 extended the Daubert v. Merrell Dow Pharmaceuticals, Inc.2 decision on admissibility of scientific evidence to encompass nonscientific expert testimony as well.3 Put another way, experts not proffered as
“scientists” also are held to the Daubert standard.4 So then we might ask, who are
these nonscience experts and where do they come from? Many emerge from the
realm of engineering and hence the relevance of “engineering” or “technical”
expert testimony to this manual.
The Court’s distinction between these two kinds of expert testimony might
suggest that there is a bright line dividing science and engineering. Indeed, a great
deal has been written and discussed about this matter and arguments made for
why science and engineering are either similar or different. It is a conversation
that resonates among philosophers, historians, “scientists,” “engineers,” politicians,
and lawyers. Apparently even Albert Einstein had a point of view on this issue as
attested to by the above quotation. Perhaps this deceptively attractive dichotomy
is best resolved by recognizing that at the end of the day engineering and science
can be as different as they are alike.
There is no shortage of “sound bites” that attempt to categorize science from
engineering and vice versa. Consider, for instance, the notion that engineering
is nothing more than "applied science." This oft-recited view is simplistic and
uninformed, and it has long been discredited.5 Indeed, it is not the
case that science is only about knowing and experimentation, and that engineering is only about doing, designing, and building. These are false asymmetries that
defy reality. The reality is that who is in science or who is in engineering or who
is doing science or who is doing engineering are questions to be answered based
on the merit of accomplishments and not on pedigree alone.
1. 526 U.S. 137 (1999).
2. 509 U.S. 579 (1993).
3. See Margaret A. Berger, The Admissibility of Expert Testimony, in this manual.
4. See David Goodstein, How Science Works, in this manual, for a discussion of science and
scientists.
5. Walter G. Vincenti, What Engineers Know and How They Know It (1990).
B. Engineering Disciplines and Fields of Practice
One can think of engineering in terms of its various disciplines as they relate to the
academic enterprise and the names of departments or degrees with which they are
associated, for instance electrical engineering or chemical engineering. One also
can consider the technological context in which engineering is practiced as in the
case of nanotechnology, aerospace engineering, biotechnology, green buildings,
or clean energy.
In the same sense that some struggle trying to identify the differences and
likenesses between science and engineering, others pursue a different kind of
identity crisis by staking out their turf through title assignment. It is pointless to
list titles of engineering disciplines because such a list would be incomplete and not
stand the test of time as disciplines come and go, merge, diverge, and evolve. Bioengineering, biochemical engineering, molecular engineering, nanoengineering,
and biomedical engineering are relative newcomers and have emerged in response
to discoveries in the sciences that underlie biological and physiological processes.
Software engineering and financial engineering are two other examples of disciplines that have developed in recent years.
In the end, it is not the names of disciplines that are critical, they being no
more than labels. Names of disciplines are at best imprecise descriptors of the activities taking place within those disciplines and ought not to be relied on for accurate
characterizations of pursuits that may or may not be occurring within them.
C. Cross-Disciplinary Domains
Whereas engineering disciplines are often associated with their scientific roots (e.g.,
mechanical engineering and physics, electrical engineering and physics, chemical
engineering and chemistry, bioengineering and biology, biomedical engineering
and physiology), some lack this kind of direct association (e.g., aerospace engineering, materials engineering, civil engineering, polymer engineering, marine
engineering). Indeed, there are software engineers, hardware engineers, financial
engineers, and management engineers. There is no shortage of adjectives here.
Nonetheless, these and many other such discipline titles have meant or mean
something to someone, and new ones are emerging all the time as the historical barriers that once separated and defined the “classic” engineering disciplines
continue to disintegrate and become a thing of the past. No longer can we rely
on discipline names to inform us of specific enterprises and activities. There is,
after all, nothing wrong with this as long as it is recognized that they ought not
be used as reliable descriptors to subsume all possible activities that might be
occurring within a domain. One must reach into a domain and investigate what
kind of engineering is being conducted and resist the temptation to draw conclusions based on name only. Doing otherwise could easily lead to an unreliable and
inaccurate characterization.
To provide a tangible example, consider cases involving personal injury in
which central questions often revolve around the specifics of how a particular
trauma occurred. In situations where proximate cause is an issue, the trier of
fact can benefit from a thorough understanding of the mechanics that created
an injury. The engineering and scientific communities are increasingly called on
to provide expert testimony that can assist courts and juries in coming to this type
of understanding. What qualifies an individual to offer expert opinions in this area
is often a matter of dispute. As gatekeepers of admission of scientific evidence,
courts are required to evaluate the qualifications of experts offering opinions
regarding the physical mechanics of a particular injury. As pointed out earlier,
however, this gatekeeping function should not rise and fall on whether a person
is referred to or refers to himself or herself as a scientist or engineer.
Specifically, one cross-disciplinary domain deals with the study of injury
mechanics, which spans the interface between mechanics and biology. The traditional role of the physician is the diagnosis (identification) of injuries and their
treatment, not necessarily a detailed assessment of the physical forces and motions
that created injuries during a specific event. The field of biomechanics (alternatively called biomechanical engineering) involves the application of mechanical
principles to biological systems, and is well suited to answering questions pertaining to injury mechanics. Biomechanical engineers are trained in principles of
mechanics (the branch of physics concerned with how physical bodies respond to
forces and motion), and also have varying degrees of training or experience in the
biological sciences relevant to their particular interest or expertise. This training
or experience can take a variety of forms, including medical or biological coursework, clinical experience, study of real-world injury data, mechanical testing of
human or animal tissue in the laboratory, studies of human volunteers in noninjurious environments, or computational modeling of injury-producing events.
Biomechanics by its very nature is diverse and multidisciplinary; therefore
courts may encounter individuals being offered as biomechanical experts with seemingly disparate degrees or credentials. For example, qualified experts may have one
or more advanced degrees in mechanical engineering, bioengineering, or related
engineering fields, the basic sciences or even may have a medical degree. The
court’s role as gatekeeper requires an evaluation of an individual’s specific training and experience that goes beyond academic degrees. In addition to academic
degrees, practitioners in biomechanics may be further qualified by virtue of laboratory research experience in the testing of biological tissues or human surrogates
(including anthropomorphic test devices, or “crash-test dummies”), experience in
the reconstruction of real-world injury events, or experience in computer modeling of human motion or tissue mechanics. A record of technical publications in the
peer-reviewed biomechanical literature will often support these experiences. Such
an expert would rely on medical records to obtain information regarding clinical
diagnoses, and would rely on engineering and physics training to understand the
mechanics of the specific event that created the injuries. A practitioner whose experience spans the interface between mechanics (i.e., engineering) and biology (i.e.,
science), considered in the context of the facts of a particular case, can be of significant assistance in answering questions pertaining to injury mechanism and causation.
This example illustrates the futility of trying to untangle engineering from
science and vice versa, and the inappropriateness of using semantics, dictionary
definitions, or labels (i.e., degree names) to parse, dissect, or portray the intellectual activities of an expert witness. In the end, it is experts' background and experience that are the dominant defining factors—not whether they are scientists or engineers, and not the titles they hold.
II. How Do Engineers Think?
A. Problem Identification
Although a somewhat overworked part of our lexicon, it is indeed the case that
“necessity is the mother of invention.” Engineering breeds a culture of technological responsiveness. All the “science” explaining a solution to a problem need
not be known before an engineer can solve a problem.
Take steam engines, for example. Their history goes back several thousand
years and their utility forged the beginning of the industrial revolution late in the
seventeenth century. It was not until the middle of the nineteenth century that
the science of thermodynamics began to gain firm ground and offer explanations for the how and why of steam power.6 In this instance, technology came
first—science second. This, of course, is not always the case, but it demonstrates that one does not necessarily precede the other, and notions to the contrary ought to be discarded. So here the problem was one of wanting to produce mechanical
motions from a heat source, and engineers designed and built systems that did this
even though the science base was essentially nonexistent.
To reinforce the point that technology can precede science, consider the design
of the shape of aircraft wings. This, of course, was driven by the desire of humans
to fly, a problem already solved in nature since the time of the dinosaurs but one
that had eluded humankind for tens of thousands of years. Practical solutions to this
problem began to emerge with the Wright brothers’ first powered flight and
continued into the twentieth century before the “science” of fluid flow over wing
structures had been fully elucidated. Once that happened, wings could be designed
to reduce drag and increase lift using a set of “first principles” rather than relying
solely on the results of empirical testing in wind tunnels and prototype aircraft.7
6. Pierre Perrot, A to Z of Thermodynamics (1998).
7. The pioneering aerodynamicist Walter Vincenti provides a detailed and fascinating account
of this. See Vincenti, supra note 5, ch. 2; see also John D. Anderson, Ludwig Prandtl’s Boundary Layer,
Physics Today, December 2005, at 42–48.
So, in short, engineers create, design, and construct because interesting and
challenging problems arise in the course of human events and emergent societal
needs. Whether a science base exists, or only partially exists, is just one of a myriad of constraints that shape the process. Other constraints might include, but are not limited to, the availability of materials; device shape, size, and/or weight; cost; demand; efficiency; safety; robustness; and utility. It has been said, perhaps with some overstatement, but it does make the point, that if engineers waited until scientists completed their work, they might well still be starting fires with flint stones.
B. Solution Paradigms
So when faced with a vexing and challenging problem, along with its particular or
peculiar constraints, an engineer seeks a path to follow that has a reasonable chance
of leading to a solution. In so doing an engineer must contend with uncertainty
and be comfortable with it. In very few instances will everything be known that
is required to proceed with a project. Assumptions need to be made, and here it is
critical that the engineer understand the difference between what is incidental
and what is essential. There are excellent assumptions, good assumptions, fair
assumptions, poor assumptions, and very bad assumptions. Along this spectrum, the engineer must carefully choose those assumptions that ensure the robustness, safety, and utility of a design without undue compromise. This
is the sort of wisdom that comes from experience and is not often well honed in
the novice engineer.
The imprecision that accompanies uncertainty can be portrayed as a disadvantage for the engineer in the role of expert witness. Yet it is this very uncertainty that lies at the heart of technological innovation, and it should be viewed not so much as a weakness as a strength. To overcome uncertainty in design under the burden of constraints is the hallmark of great design, and although this is subtle and not always well understood by those who seek precision (i.e., why can’t you define your error rate?), this is the way the world works and one must accept it for what it is. Assumptions and approximations are key elements of
the engineering enterprise and must be regarded as such. And as with all things,
hindsight might suggest that a particular assumption or approximation was not
appropriate. Even so, given what was known, it may well have been the right
thing to do at the time it was made.
In addition to evolving business opportunities and changing financial markets, technological innovation results from the continuing, and often unexpected, advances in science and technology that occur as time passes. Buildings constructed in Los Angeles in the 1940s would never be built there in the same way now; we have a much better understanding of earthquakes and the forces they exert on structures than we did then. Airbags were not placed in automobiles until recently because we did not have cost-effective systems and materials in place to accurately measure deceleration and acceleration forces, trigger explosives, contain the explosion, and do all of this on a timescale that was effective without harming an occupant more than the impending collision would. It is unavoidable that, as we learn from new discoveries about the natural world and accumulate more experience with our designed systems, products, and infrastructure, engineers will be increasingly well placed to move forward with improved and new designs. It is both an evolutionary and a revolutionary process, one that produces both failures and successes.
III. How Do Engineers Make Things?
A. The Design Process—How Engineers Use This Guiding
Principle
The genesis of nearly every object, thing, or environment conceived by engineers
is the design process. Surprisingly, although products designed using it can be
incredibly complex, the general tenets of the design process are relatively simple,
and are illustrated in Figure 1.
The progression is iterative from two perspectives: (1) Changes in the design
resulting from testing and validation lead to new formulations that are retested.
(2) After the design is complete, performance data from the field can also lead to
design changes.
As a first step, engineers begin with a concept—an idea that addresses a need,
concern, or function desired by society. The concept is refined through research,
appropriate goals and constraints are identified, and one or more prototypes are
constructed. Although confined to a sentence here, this stage can take a significant
amount of time to complete.
In the next phase of the design process, the prototypes are tested and evaluated against the design requirements, and refinements, perhaps even significant
changes, are made. The process is iterative, as faults identified during the testing
phase manifest themselves as changes in the concept, and the testing and evaluation process is restarted after having been reset to a higher point on the learning
curve. As knowledge is gained with each iteration, the design progresses and is eventually validated. As alternative solutions are considered, however, it is possible that certain undesirable characteristics in the design cannot be completely mitigated through changes in design; these must be guarded against to minimize their impact on safety or other constraints. A classic example of this step in the
design process is the installation of a protective shield over the blade in a table
saw; although the saw may have the unwanted characteristic of cutting fingers
or arms, the blade clearly cannot be eliminated (designed out) in a functioning
product. As a last resort, anomalies that cannot be designed out or guarded against
can be addressed through warnings. Not every design is amenable to guarding or warning; in those cases, the iterative process of testing and prototype revision is relied upon to perfect designs. Indeed, in some instances, an acceptable design solution cannot be found and the work is abandoned.

Figure 1. Schematic of the engineering design process.
The testing process itself can be complex, ranging from simple evaluations
to examine a certain characteristic to multifaceted procedures that evaluate the
prototype in conditions it is anticipated to see in the real world. The latter type
of evaluation is often denoted as end-use testing and is very effective in identifying faults in the prototype. Because of time constraints, many designs cannot be evaluated over their anticipated life cycle (a product expected to last for 20 years cannot be tested for 20 years in the development process), so the testing cycle is often accelerated. For example, if it is known that a pressure vessel will
see 50,000 cycles over a 10-year lifetime, those cycles can be performed in several months and the resultant effects on vessel performance established. Another
method of accelerating the evaluation cycle involves testing at an elevated temperature and using scientific theory and principles to equate the temperature increase
to a timescale reduction. The efficacy of this approach is highly dependent on
correct execution, but done properly and with appropriate care, it allows product
development to go forward rather than having good or even great designs languish
on the drawing boards because there is no feasible way to validate them under the
exact end-use environment.
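The temperature-acceleration approach described above can be made concrete with a small numerical sketch. The text does not name a particular model; a common one in reliability engineering is the Arrhenius relation, in which an acceleration factor depends on an assumed activation energy for the degradation mechanism. The activation energy and temperatures below are illustrative assumptions, not values from the text.

```python
import math

# Arrhenius acceleration factor: AF = exp[(Ea/k) * (1/T_use - 1/T_test)]
Ea = 0.7                  # assumed activation energy of the degradation mechanism, eV
k = 8.617e-5              # Boltzmann constant, eV/K
T_use = 298.0             # normal service temperature, K (25 degrees C)
T_test = 358.0            # elevated test temperature, K (85 degrees C)

af = math.exp((Ea / k) * (1.0 / T_use - 1.0 / T_test))

# A 10-year service life compressed by the acceleration factor
service_days = 10 * 365
test_days = service_days / af
print(f"acceleration factor ~ {af:.0f}")
print(f"10 years of service ~ {test_days:.0f} days at the elevated temperature")
```

Under these assumed values the factor is roughly 100, compressing a decade of service into weeks. The factor is exquisitely sensitive to the assumed activation energy, which is why, as the text notes, the efficacy of the approach is highly dependent on correct execution.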
Regulations, standards, and guidelines also play an important role in testing of products during the design process. Federal requirements are imposed on
design and testing of aircraft, medical devices, and motor vehicles, for example,
and mostly govern how those products are evaluated by engineers. Standards
organizations such as the American Society for Testing and Materials (ASTM),
the American National Standards Institute (ANSI), and the European Committee
for Standardization (CEN) promulgate test methods and associated performance
requirements for a large number of objects and materials, and are relied on by
engineers as they evaluate their designs. It is critical to understand, however,
that ASTM, ANSI, CEN, and other such national and international standards
organizations describe testing methods that engineers use to obtain reliable data about the products they are evaluating (or components thereof), but most often these methods do not in and of themselves provide a means to evaluate a finished
product in its actual end-use environment. It is also important to understand the
difference between a performance standard and a testing standard—the former
actually specifies values (strength, ductility, environmental resistance) that a
product must achieve, whereas the latter simply describes how a test to measure
a parameter should be conducted. It is the engineer’s job to use the correct
testing procedures from those that have been approved and on which he or she
can rely. Or, alternatively, if no approved test exists, the engineer must create
one that is reproducible, repeatable, reliable, and efficacious. Furthermore, it
is the engineer’s job to ensure the relevance of such testing to the overall and
final product performance in its end-use environment. No testing or standards
organization can foresee, nor do they claim to do so, all possible combinations
of product components, design choices, and functional end-use requirements.
Therefore, testing of a design in accordance with a testing standard does not
necessarily validate the design, nor does it necessarily mean that the design will
function in its end-use environment.
After testing and validation are complete, and the product is introduced to
the market, the design process is still not finished. As field experience is gained,
and products are used by consumers and sometimes returned to the manufacturer,
engineers often fine-tune and perfect designs based on newly acquired data. In
this part of the design process, engineers will analyze failures and performance
problems in products returned from the field, and adjust product parameters
appropriately.8 The process of continual product improvement, illustrated by an
arrow from the “Go” stage to the “Design/Formulate” and “Test/Validate” stages
in Figure 1, is taught to engineers as a method to effectively optimize designs.
Such refinements of product design are often the topic of inquiry in depositions
of engineers and others involved in product design, and are frequently misunderstood
as an indication that the initial design was defective.9 The engineering design process anticipates review and ongoing refinement of product design as a means of
developing better and safer products. In fact, retrospective product modification
is mandated as company practice in some industries, and regulated or suggested
by the government in others. For example, examination of FDA guidelines for
medical device design will show a process that mirrors the one described above.
Another important component of the design process relates to changes in
technology that render a design, design feature, or even tools used by an engineer
obsolete. Engineers consider obsolescence to be a consequence of advancement,
and readily adjust designs, or create new designs, as new technology becomes
available. This concept is apparent in the automotive industry, where tremendous
advances in restraint systems and impact protection have greatly reduced the
risk of fatal injuries from driving (see discussion below). Although vehicles with
lap belts as the sole means of occupant protection would today be considered
unacceptable, they were by no means deficient when introduced in the 1950s.
From the engineer’s perspective, errors and omissions in the design process can
render a design defective; however, changes in technology can render a design
obsolete, not retrospectively defective.
Of course even well-designed products can fail, especially if they are not manufactured or used in the manner intended by the design engineer. For example,
a steel structure may be adequately designed, but if the welds holding it together
are not properly made, the structure can fail. Similarly, a well-designed plastic
component manufactured in such a way as to overheat and degrade its constituents
may also be prone to premature failure. In terms of misuse of a product, most
engineers are trained to consider foreseeable misuse as part of the design process,
and one can generally expect to encounter a debate over what is reasonably foreseeable and what is not.
8. Although feedback on product performance and failure analysis on returned products is most
often used to perfect designs, the iterative nature of the process can also cause the design to progress
toward failure when cost becomes the driving factor.
9. Although the reasons for subsequent refinements in product design may be explored in
depositions, Federal Rule of Evidence 407 bars the introduction of evidence of such improvements at
trial as evidence of a defect in a product or a product’s design.
B. The Design Process—How Engineers Think About Safety
and Risk in Design
Almost everything that an engineer designs involves some aspect of safety, and the
elegance and efficiency of designs are often forced to balance safety with competing parameters such as cost and physical constraints. The legal dilemmas that
often arise from this balance are a direct result of the way an engineer must deal
with safety in the reality of the engineering world (i.e., assertions that safety must
be considered over everything else or that a particular design should or could be
safer). Therefore, a discussion of how safety factors into design, and “how safe
is safe enough” is prudent for an understanding of engineering and engineering
design. It is critical that the reader note that in the framework of this discussion,
risk is something engineers constantly face, and while we discuss what levels of
risk are acceptable, the context is clearly engineering, and no legal construct is
intended.
There is practically no product that cannot be made safer by reducing the
product benefits (making it more inconvenient) or increasing the product cost, or
both. In product design, safety is just one of the many variables factored into the design, as is cost, and often safety and cost trade off directly against the product price point. There are rarely instances where small cost changes yield a substantial improvement in risk. Safety always has a cost; the question is whether the consumer will find it reasonable in the face of what else the
design has to offer. Conversely, the claim that the product is as safe as possible is
almost never true either.
The simple and completely correct answer to the question “How safe is safe
enough?” is “It depends.” Exactly what safety is, and what conditions determine
its adequacy, that is, what adequate safety depends on, are the topics briefly discussed in the following sections.
1. What is meant by “safe”?
Few words are used more often in the context of a product liability tort than the
words “safe” and “unsafe,” and their close cousin, “defective.” Because the word
“safe” is commonly used in so many different contexts, it is seldom, if ever, used
with precision. Indeed, its common use has given it a number of meanings, some
of which are in conflict.
Intuitively we understand the word and have a grasp of what a speaker probably means when declaring a product or environment “safe.” We have to say
“probably” because some would mean by a “safe” product one that presents no
risk to the user under normal circumstances, and others would mean no risk to
the user under any circumstances. Still others who ask the question “how safe is
safe enough?” clearly evidence an understanding that safety is a continuum and
not an absolute. Although “safe” is a simple word, it is used in so many ways that
rigorous definition presents much of the complexity of other deceptively simple but widely used four-letter words, for example, “good.”
Fortunately, there is a whole field of scholarship, science, and technology
related to the study of “safety.” The field was spawned during the industrial
revolution, when it came to be recognized that preventable industrial accidents
were simply economically, if not morally, unacceptable.10 For the remainder of
this discussion, we examine the concept of safety as it relates to the possibility
of physical harm to persons.
Safety is technically defined, and empirically measured, by the concept of
“risk.” And often a speaker who declares a product or environment “safe” does
indeed mean to say that the product or environment is risk-free. However, as
we will discuss in more detail, there is no product or product environment that
attains the ideal status of “risk-free.”11 Every product manufactured by man, with
his imperfections, and every environment, no matter how carefully constructed,
presents some risk in its use, even if this risk is extremely small. This fact of life
is easily illustrated.
For example, the U.S. Consumer Product Safety Commission (CPSC) estimates that nationally in the year 2007 alone, there were approximately 42,000
injuries serious enough to require treatment at a hospital emergency room associated with the use, and more often the misuse, of first-aid equipment. Thousands
of these injuries were associated with the use of first-aid kits. The CPSC maintains
the National Electronic Injury Surveillance System (NEISS), which monitors a
statistically selected sample of all the emergency rooms in the United States, so
that data collected on each consumer injury associated with the categories of consumer products that fall under the jurisdiction of the CPSC can be extrapolated
to a national estimate.
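The extrapolation step that NEISS performs can be sketched in miniature: each emergency-room-treated case in the statistical sample carries a weight reflecting how many injuries nationwide it is estimated to represent, and a national estimate for a product category is the sum of the weights of the matching cases. The cases and weights below are invented for illustration and are not CPSC data.

```python
# Hypothetical NEISS-style extrapolation. Each sampled case carries a
# statistical weight (how many national injuries it is estimated to
# represent); the national estimate for a product category is the sum
# of weights over the matching cases.
sample_cases = [
    {"product": "first-aid kit", "weight": 110.4},
    {"product": "first-aid kit", "weight": 87.2},
    {"product": "table saw",     "weight": 96.9},
    {"product": "first-aid kit", "weight": 120.5},
]

def national_estimate(cases, product):
    """Sum the sampling weights of cases involving the given product."""
    return sum(c["weight"] for c in cases if c["product"] == product)

print(round(national_estimate(sample_cases, "first-aid kit"), 1))
```

Real NEISS weights come from the CPSC’s stratified sample design; the point here is only that a handful of sampled cases, properly weighted, yields a national figure such as the 42,000 estimate quoted above.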
It is not immediately obvious how so many injuries could be associated with
first-aid equipment. And this reaction is an excellent lesson about the unreliability of intuition for determining risk.12 One soon learns that the cotton swabs in a
first-aid kit can puncture eardrums; the ointments, pills, and antibiotic creams can
be ingested by infants; the ice packs can cause thermal burns to the skin; and the
cotton can become lodged in all sorts of unintended places.
With the understanding that there are no risk-free products, we have no choice but to define “safe” in terms of the amount of risk. Of course, with “risk”
defining “safe,” the task of defining “safety,” or “safe enough,” has been replaced
10. The rigorously scientific portion of this field is a product of the past 50 years. Although it
has no single father, the seminal contributions of Dr. Chauncey Starr, ultimately recognized through
his receipt in 1990 of the National Medal of Technology from President George H.W. Bush, deserve
mention. http://www.rpi.edu/about/hof/starr.html.
11. S.C. Black & F. Niehaus, How Safe Is “Too” Safe, 22 IAEA Bull. 1 (1980); Water Quality:
Guidelines, Standards and Health 207 (Lorna Fewtrell & Jamie Bartram eds., 2001).
12. The emergency room treatment records of CPSC’s NEISS can be retrieved by anyone online
at the CPSC Web site, www.cpsc.gov.
with the task of defining “risk” or “acceptable risk.” This then is the definition of
safe; something is “safe” when it presents “acceptable risk”13 (the reader is again
reminded that we are discussing an engineering, not legal, construct).
2. What is meant by “risk”?
“Risk” is another of those deceptively simple four-letter words that society uses in
a wide variety of ways. Perhaps a few decades ago, when the field was developing
its rigorous intellectual underpinnings, a term other than “risk” could have been
chosen. But now there is a Harvard Center for Risk Analysis,14 a Food Safety Risk
Analysis Clearinghouse at the University of Maryland,15 and many other academically oriented risk analysis organizations too numerous to name. There is a convenient Web site, Risk World, that lists many of the other Web sites that reference
risk analysis.16 Risk as the technical term for safety is now too institutionalized to
be changed, and for the remainder of this discussion we are concerned with safety
as the risk of physical harm, that is, health or safety risk.
The concept of risk is slightly more complicated and significantly more rigorous than the concept of safety. Again we have an intuitive understanding of “risk”: it involves some notion of probability, more specifically the probability of some “bad” thing. In the case of safety, the “bad” thing is injury or physical harm.
Risk is often empirically measured and expressed quantitatively, and a “risk”
number always contains units of frequency (or probability) and severity. This is a
substantial advantage over the concept of “safe.” It would make no sense to say,
“this product was found to be 2.73 safe.” Risk, on the other hand, is the measure of safety. For example, the fatal risk of driving in the United States in 2007 was 1.36 fatalities for every hundred million vehicle-miles traveled (note: 100 million = 100,000,000 = 10^8).17 This is a risk number because it contains a severity, “fatal,” and a frequency, per every 10^8 miles. This fatality risk is not a complete measure of the safety or risk of U.S. vehicular travel in 2007, because the same 10^8 vehicle-miles traveled that produced the 1.36 fatalities also produced 82 injuries, with “injuries” as defined by the National Highway Traffic Safety Administration (NHTSA).18
13. International Organization for Standardization & International Electrotechnical Commission,
ISO/IEC Guide 51: Safety Aspects—Guidelines for Their Inclusion in Standards (2d ed. 1999); William W.
Lowrance, Of Acceptable Risk: Science and the Determination of Safety 8 (1976); Fred A. Manuele,
On the Practice of Safety 58 (3d ed. 2003); National Safety Council, Accident Prevention Manual—
Engineering & Technology 6 (Philip Hagan et al. eds., 12th ed. 2000).
14. http://www.hcra.harvard.edu/.
15. http://www.foodriskclearinghouse.umd.edu/.
16. http://www.riskworld.com/websites/webfiles/ws5aa013.htm.
17. National Highway Traffic Safety Administration, Motor Vehicle Traffic Crash Fatality
Counts and Estimates of People Injured for 2007, DOT HS 811 034 (Sept. 2008, updated Feb. 2009)
(hereinafter “NHTSA, Motor Vehicle Traffic Crash Fatality Counts”).
18. Id., slide 9.
These two different risk metrics, one for injuries and one for fatalities, naturally
invite the question of a single metric that characterizes the risk, and therefore
safety, of highway travel in the United States.
Sadly, the answer is that no such single metric exists. For decades the
risk analysis community has worked on developing some calculus through which
injuries of differing severity could be rigorously combined and expressed as a
defensible “average” severity. Some safety data are collected in a form that naturally lends itself to this exercise. When occupational injury severity is characterized
by a “lost workday” metric (the more severe injuries obviously result in more lost days of work), the average number of lost workdays is a defensible
average severity with which one can characterize a population of occupational
injuries. But this exercise quickly breaks down in the face of permanently disabling
occupational injuries and deaths. Obviously, one could impute an entire career’s
worth of lost workdays in the case of fatal injury or permanent injury, but then
these injuries would completely overwhelm all other types of occupational injury.
And the issue of whether a permanently disabling injury is really of the same severity as a fatal injury remains unresolved.
Similarly, the CPSC attempted, soon after its creation in the early 1970s, to develop a geometric sliding scale to numerically categorize the differing consumer
product–associated injury severities being treated in the hospital emergency rooms
that the agency monitored. The CPSC scale had six to eight severity categories
over the years, to which numerical weights were applied, ranging from 10 for
severity category 1, mild injuries and sprains, to 34,721 for severity category 8,
all deaths, in its original configuration. The weighting for deaths has changed and
has been as low as 2516. An amputation was accorded a weight of 340, and fell
into category 6, unless it resulted in hospitalization, at which time it became a
category 7 with a weight of 2516. In the end, this scheme has proved generally
unsatisfactory, but it still appears in the occasional CPSC document, and is used
to generate a “mean severity” for emergency room–treated injuries.19 Even if
somehow a calculus for comparing and combining various injury severities could
be developed, the challenge of how to compare the risk of differing injury frequencies at different severity levels would remain. There is practically no chance
that the relationship would be linear, and the nonlinear characteristics would be
highly subjective.
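The CPSC weighting scheme described above amounts to a weighted average of injury counts. Using the category weights quoted in the text (10 for category 1, 340 for a category 6 amputation, 2516 for category 7), a “mean severity” would be computed roughly as follows; the injury counts are invented for illustration only.

```python
# Weighted "mean severity" in the style of the CPSC scale discussed above.
# Weights are those quoted in the text; injury counts are hypothetical.
severity_weights = {1: 10, 6: 340, 7: 2516}   # category -> weight (from the text)
injury_counts = {1: 900, 6: 80, 7: 20}        # category -> count (invented)

total_weighted = sum(severity_weights[c] * n for c, n in injury_counts.items())
total_injuries = sum(injury_counts.values())
mean_severity = total_weighted / total_injuries

print(round(mean_severity, 1))
```

Note that with a death weight of 34,721 even a handful of fatalities would dominate such an average, which is precisely the difficulty with severity calculi that the text describes.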
Instead of trying to develop a calculus to combine severities with differing
frequency, it has become the custom and practice in the risk analysis community
to express risk frequency or probability by stratified severity. That is, if a level of
severity is specified, then the risk likelihood is stated. There is no agreement on
the proper stratification, but rather a de facto consensus that fatal injuries are the
most severe, and fatality risk is commonly measured. In addition, calculations of
accident risk with no injury, injury risk, and hospitalized injury risk are often seen
19. U.S. Consumer Product Safety Commission, 1995 Annual Report to Congress A-5 (1995).
in the risk literature. Rather than being combined into a single metric, these risks
are expressed as independent risk frequencies or probabilities. Specialized average
severities, such as average number of lost workdays as a risk metric for the severity
of average occupational injury, are occasionally used.
The upshot of our inability to develop a severity calculus is that risk metrics cease to be parameters with units of both frequency and severity and become merely frequencies, or likelihoods, of an injury of a given severity. Frequencies are easier to compute and merely require what is called “numerator”
data (i.e., the actual number of adverse events for which the risk is being calculated) and “denominator” data (i.e., some measure of the opportunity to have the
adverse event). In the previously cited 1.36 fatal injuries per 10^8 miles of vehicle travel, the numerator datum was the 41,059 deaths in 2007 traffic accidents and the denominator datum was the 3,029,822 million vehicle-miles traveled by all vehicles in 2007.20 The division of these two numbers gives 1.36 deaths per 10^8 miles. Vehicle-miles traveled (VMT) is one obvious measure of the opportunity
to have a vehicular accident. However, it is not the only measure. If the data are
available, vehicle hours can be substituted for vehicle miles, and then the fatal
risk can be expressed as a frequency per vehicle hour. Measures such as miles and
hours are often called “exposure” data, and must be some empirical measure of
the opportunity to encounter the hazard (the adverse event itself) for which the
risk is being calculated. The “correct” exposure measure is usually determined by
the analysis being performed. Miles is appropriate for on-road vehicles, because
travel is what the automotive products are intended to produce. For off-road
recreational vehicles, where recreation as opposed to travel is the purpose of the
product, hours of use would probably be a more appropriate exposure measure.
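The arithmetic behind the 1.36 figure is simply numerator data divided by denominator (exposure) data, scaled to 10^8 vehicle-miles. Using the NHTSA figures quoted in the text, the computation can be expressed in a few lines of Python (an editorial illustration, not part of NHTSA’s methodology):

```python
# Gross U.S. traffic fatality risk for 2007, per 10^8 vehicle-miles traveled,
# using the figures quoted in the text.
deaths_2007 = 41_059           # numerator datum: all 2007 traffic deaths
vmt_2007_miles = 3_029_822e6   # denominator datum: 3,029,822 million miles

risk_per_1e8_vmt = deaths_2007 / (vmt_2007_miles / 1e8)
print(round(risk_per_1e8_vmt, 2))   # 1.36, the figure quoted in the text
```

Substituting vehicle-hours for vehicle-miles in the denominator would yield the same risk expressed against a different exposure measure.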
3. Risk metric calculation assumptions
Having determined that the fatality risk of driving in the United States is 1.36 deaths per 10^8 VMT, does that mean one’s risk of dying in a traffic accident is 1.36 × 10^-8 every time a mile is driven? No. At the risk of presenting too
much detail, we can use this one risk parameter as a tool to briefly illustrate that
many assumptions are inherent in any risk calculation, and that questions should
arise in the court’s mind when encountering any number that purports to represent the “risk” and therefore “safety” of any product or activity.
First, the number 1.36 is the gross fatality risk for vehicular travel in the
United States. It is the risk we as a society de facto accept for the benefits of
vehicular travel. Some of those deaths are pedestrians, motorcyclists, passengers,
and bicyclists, and their deaths are part of the risk society must accept to have
motorized vehicular travel. But the fatal risk to you as a driver, whose “exposure” is driving a mile, clearly does not involve any pedestrian risk or bicycle risk
20. NHTSA, Motor Vehicle Traffic Crash Fatality Counts, supra note 18, slide 40.
or motorcycle risk or vehicle passenger risk. Thus, fatally injured pedestrians
(4654),21 pedal cyclists (698),22 motorcyclists (5154),23 vehicle passengers (8657),24
and others (147 skateboarders, etc.)25 have to be subtracted from the numerator
datum of 41,059 traffic deaths in 2007 to compute a “fatal risk of you being a
driver” number, because none of them was driving a car. That leaves us with
21,647 fatally injured vehicle drivers in 2007.26 Because all of the vehicles had to
have a driver to go even a mile, it might be tempting to just use the 3,029,822 million
vehicle miles number in 2007 as the denominator without adjustment. But, to
be accurate, the motorcycle operators were “vehicle” drivers, and so we cannot
remove their 5154 deaths from the numerator without removing the approximately 13,610 million vehicle miles27 those motorcycles were driven from the
denominator datum. Because the motorcycle operator fatal injury risk per mile is
37.86 per 10⁸ VMT,28 more than 52 times that of an automobile driver, removing
the motorcycle data entirely when trying to compute an automotive risk number
is sound. If we do the appropriate adjustments, then we compute a fatality risk for
a nonmotorcycle vehicle driver of 0.718 deaths per 10⁸ VMT.
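The adjustment described above can be sketched in a few lines of Python, using the counts quoted in the text:

```python
# Driver-only fatality risk for 2007, per the adjustment described in the text
driver_deaths = 21_647        # fatally injured nonmotorcycle vehicle drivers
total_vmt = 3_029_822         # all vehicle miles traveled, in millions
motorcycle_vmt = 13_610       # motorcycle miles traveled, in millions

# Remove motorcycle mileage from the denominator to match the numerator
rate = driver_deaths / ((total_vmt - motorcycle_vmt) * 1e6) * 1e8
print(round(rate, 3))  # → 0.718
```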
Now, we can again ask the question, “Is 0.718 × 10⁻⁸ one’s risk of being
killed every time a mile is driven?” The answer is now “Possibly, but unlikely.”
This risk number is the composite risk for all drivers in society for 2007. And,
because of lifestyle choices, this number might serendipitously be accurate for
some but not for everyone. Every driver has significant control over the majority
of his or her risk of being killed on the road. For example, 33.6% of the fatally
injured drivers, 7283, almost exactly one-third, had blood alcohol levels at or
above 0.08 g/dL.29 Exactly how much a blood alcohol level of 0.08 increases one’s
risk of dying per mile driven is a topic of some debate, but the consensus would
fall somewhere between 3 and 5 times. You are much more likely to be killed if
you drive on the weekends during the early morning hours. Even restricting ourselves to passenger vehicle fatalities in the daytime, when 82% of vehicle occupants
wear their seat belts, 45% of the drivers killed in the daytime were unrestrained by
their seat belts.30 Numerous other decisions that we make concerning our driving
or circumstances that affect us, such as the size of the car we drive, cell phone use,
21. Id.
22. National Highway Traffic Safety Administration, Traffic Safety Facts 2007 Data: Pedestrians,
DOT HS 810 994, at 3 (hereinafter “NHTSA, Pedestrians”).
23. National Highway Traffic Safety Administration, Traffic Safety Facts 2007 Data: Motorcycles,
DOT HS 810 990, at 1 (hereinafter “NHTSA, Motorcycles”).
24. NHTSA, Motor Vehicle Traffic Crash Fatality Counts, supra note 18, slides 52, 74, 85.
25. Id., slide 9.
26. Id., slide 40.
27. Id.
28. NHTSA, Motorcycles, supra note 23, at 1.
29. National Highway Traffic Safety Administration, Traffic Safety Facts: 2007 Traffic Safety
Annual Assessment—Alcohol-Impaired Driving Fatalities, DOT HS 811 016, at 2 (Aug. 2008).
30. NHTSA, Pedestrians, supra note 22, at 3.
regard for yellow lights, aggressiveness, medication, vision correction, etc., may
contribute in some way to the likelihood that we will be fatally injured driving
the next mile, but are beyond the scope of this brief discussion.
4. Risk metric evaluation
With some understanding of what comprises the calculation of a risk metric, we
can now turn to the more important questions related to its meaning. A fair question about the vehicular risk we just examined might be: “Is a fatal motor vehicle
risk of 1.36 deaths per 10⁸ VMT good or bad?” Should society be ashamed or proud?
This question for vehicle safety, and every other arena of risk analysis, can only
be answered comparatively. The only absolute risk standard is “zero,” but this
ideal can never be achieved. So, to answer the question of how “good” the 1.36
number is, we can look to several comparisons. A logical starting point might be
previous years; are we getting better or worse? Fortunately, with a few singular
exceptions (such as motorcycles), everything is getting safer, and has been for the
past 100 years. Although 1.36 people dead for every 10⁸ VMT is surely not desirable, in 1966, that same number was over 5.31
The data in Table 1 are for the previous decade:32
Table 1. Fatalities per 100 Million Vehicle Miles Traveled

Year    Fatalities per 10⁸ VMT
1996    1.69
1997    1.64
1998    1.58
1999    1.55
2000    1.53
2001    1.51
2002    1.51
2003    1.48
2004    1.44
2005    1.45
2006    1.42
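As a quick check of the trend, the rates in Table 1 imply roughly a 16% decline over the decade; a sketch using the table’s own values:

```python
# Fatalities per 10^8 VMT, from Table 1
rates = {1996: 1.69, 1997: 1.64, 1998: 1.58, 1999: 1.55, 2000: 1.53,
         2001: 1.51, 2002: 1.51, 2003: 1.48, 2004: 1.44, 2005: 1.45,
         2006: 1.42}

decline = (rates[1996] - rates[2006]) / rates[1996]
print(f"{decline:.0%}")  # → 16%
```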
These data illustrate the fact that a risk number such as 1.36 deaths per 10⁸
VMT in isolation is practically meaningless. But when put in a historical context,
or in the context of other products or activities, a perspective is gained to evaluate
the magnitude of the risk. As can be seen in Table 1, we as a society are making
steady progress on reducing the fatal risk of driving, and our current risk number
31. Matthew L. Wald, Deaths on Motorcycles Rise Again, N.Y. Times, Aug. 15, 2008, at A11.
32. NHTSA, Motor Vehicle Traffic Crash Fatality Counts, supra note 18, slides 52, 74, 85.
does not look too bad. Similarly, if our risk number were presented in the context
of the fatal highway risk of other industrialized nations, it would compare very
favorably as well.
5. What is meant by “acceptable risk”?33
With an understanding of how risk is calculated, and that risk must necessarily be
viewed in a comparative context, we now turn our discussion back to our original
question “how safe is safe enough?” which in light of what we have learned must
be rephrased “how much risk is acceptable?”
How much risk is acceptable is not a simple question; books are written
with “acceptable risk” in the title. However, a simple answer to this question is
typically another question: “acceptable to whom?” As individuals we exhibit radically different de facto risk acceptance, and the same individual will exhibit significantly different risk acceptance throughout his or her lifetime. Certainly, as
compared with a stuntman, the average person would have widely divergent views
on what is an acceptable risk. And, neither could nor probably should make this
decision for society as a whole. There is no absolute standard of how much risk
is too much or too little, but innumerable federal, state, and voluntary standards
prescribe maximum risk levels, and we touch on them briefly.
Risk acceptance has been studied extensively, and there are more than a
dozen factors that influence how much risk is acceptable either to an individual, or
to society as a whole, in a given situation. And they are not always the same factors. Examining and discussing all these factors is beyond the scope of this guide,
but a few of the most important are illustrated.
Probably the single most important factor for determining how much risk is
“acceptable” is how much “benefit” we gain from accepting the risk. We are willing
to accept substantial risk for substantial benefit. Motorized vehicular transportation
confers tremendous benefits in our society and almost the entire population participates, and by our participation we indisputably evidence our de facto “acceptance”
of the known risks for the known benefits, even if we do not find the risks of driving
intellectually “acceptable.” That does not mean we have to like the level of risk, or
that we “accept” the current level of risk in the sense that we do not need to do
anything about it. Indeed we spend billions and billions of dollars trying to reduce
the level of risk associated with motorized vehicular travel. That being said, the
overwhelming majority of the current population finds the current level of motorized vehicular travel risks low enough, given the benefits, to participate. This would
not be too surprising if the level of risk associated with motorized transportation
were low, because the benefits of motorized transport are clearly high. However,
the fatal risk associated with motorized vehicle travel is not low.
33. Although acceptable risk is also a legal concept, we are merely using engineering vernacular
in this chapter, and no legal construct is intended.
Returning to our fatality risk for a nonmotorcycle vehicle driver of 0.718
deaths per 10⁸ VMT, this translates into a fatal risk of 0.718 × 10⁻⁸ for each mile.
That 10⁻⁸ term makes this number quite small, and the fatal risk per mile low.
However, very few people drive just a mile in a week or year. In fact, according
to the Federal Highway Administration, the average U.S. driver logs about 13,500
miles behind the wheel every year.34 That means for the year, the average U.S.
driver faces 0.718 × 1.35 × 10⁴ × 10⁻⁸ = 0.97 × 10⁻⁴ risk of a fatal accident every
year, or about 1/10,000. But, very few people drive for a year. It is not uncommon to drive 60 years. Certainly the mileage we drive when young and old is less
each year, and when middle-aged, more, but for the purpose of calculation let’s
assume the average value for 60 years. Then the risk of driving for one’s adult
lifetime, on average, is 0.97 × 10⁻⁴ × 60 = 5.82 × 10⁻³. Stated another way,
if we drive for a lifetime, even at the low fatal risk of 2007, the average driver
runs a risk of 0.00582 of being killed in his lifetime in a vehicular accident, a little
more than a chance of 1/200. So, one out of every 200 drivers will die in his or
her lifetime from the activity of driving a vehicle.
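The annual and lifetime figures above can be reproduced directly; note this is a sketch, and the simple multiplication slightly overstates the exact probability 1 − (1 − p)ⁿ, though the difference is negligible at these risk levels:

```python
per_mile = 0.718e-8      # driver fatality risk per mile (from the text)
annual_miles = 13_500    # average annual mileage (FHWA figure cited above)
years = 60               # assumed driving lifetime (assumption from the text)

annual = per_mile * annual_miles   # ≈ 0.97 × 10⁻⁴, about 1 in 10,000 per year
lifetime = annual * years          # ≈ 0.00582, a bit worse than 1 in 200
print(round(lifetime, 5))  # → 0.00582
```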
Needless to say, this number does not look so small any more. This brings
us to the most commonly advanced argument against de facto “risk acceptance”
being the measure of “acceptable risk” or “safe enough.” Critics argue that no one
can be said to “accept” a risk if they do not know what the risk is. Logically this
is true, but it ignores the fact that even if we cannot cite a specific risk parameter,
that does not mean we do not have an intuitive grasp of the risk. For example,
in the case of motorized vehicle travel above, relatively few people can go through
the calculation above and derive the number 1/200. But, we all have personally
known in our lifetime more than one person (not just luminaries such as James
Dean, Jayne Mansfield, Princess Grace Kelly, General George Patton, and Princess
Diana, and even Barack Obama Sr., father of our current President) who has died
in a vehicular accident. For the 1/200 number to be true, since we all know a
few hundred people, we must know at least a couple who have died in vehicular
accidents. Therefore, even though we may not be able to calculate the number,
society has an excellent grasp of the risk associated with vehicular travel.
This 1/200 risk of fatal vehicular injury also illustrates the important difference between a “unit of participation risk” and a “lifetime risk.” Because, fortunately, average lifetime is so long, when a risk to which we are constantly exposed
is summed over a lifetime, the resulting fraction can become uncomfortably large.
For example, the lifetime risk of developing cancer from merely exposure to the
background levels of environmental chemicals has been estimated at between
1/1000 and 1/100.35
Indeed, people who study common perceptions of risk have found that
people do a fair job of estimating the national death toll from a great many
34. http://www.fhwa.dot.gov/ohim/onh00/bar8.htm.
35. C.C. Travis & S.T. Hester, Global Chemical Pollution, 25 Envtl. Sci. & Tech. 814–19 (1991).
common risks, such as vehicular travel, but typically overestimate risks, such as airplane
crashes, that have significant publicity associated with them.36 We all know there
is a small, but highly controllable, risk of drowning when we go swimming. Yet
most of the U.S. population participates in this activity on some basis.
After the benefits gained from assuming the risk, probably the second most
important factor determining the acceptability of a given risk level is “control.”
Is the risk under our control or in the hands of fate? We are willing to assume
up to 1000 times more risk if we perceive we are assuming the risk voluntarily
and that it is under our control. This is certainly a substantial factor in the
acceptance of the risk of motorized vehicular travel. It is also observed very commonly in sports recreation activities. We perceive that the overwhelming majority
of this risk is under our direct control, so we are almost universally willing to
accept it for the perceived benefits on a societal basis. On the other hand, if we
perceive the risk is imposed on us involuntarily, and it is out of our control, such
as a nuclear power plant being built in our city, then the amount of risk we are
willing to “accept” being imposed on us is dramatically less.
Another important factor in determining if a particular risk is “acceptable” is
the cost of reducing or eliminating the risk. This issue is commonly encountered
in product-related injury tort litigation, and it is often not a simple one. As mentioned above, there is practically no product that cannot be made safer by reducing
the product benefits or increasing the product cost, or both.
Unfortunately, plaintiffs and defendants often muddy the intellectual landscape
related to safety in products litigation. Plaintiffs will often assert that the product
should be completely risk free, an impossible ideal to achieve, even if the product is
being misused. Defendants will often assert that safety is the “highest” priority in
their product’s design. However, this cannot be true either. If safety were the highest priority in any product’s design, the cost would be uneconomical, because at no
point in the design, no matter how low the risk, would the level of risk be as low
as could be achieved with more cost. Everyone knows that a big car is safer than
a small car. And this is demonstrably true. It is particularly true when the big car
hits the small car, and death risk in the small car is commonly 8 to 10 times higher
than that of occupants of the big car in such collisions. Big cars also present less risk
to their occupants, even hitting stationary obstacles such as trees. But, big cars cost
more than small cars. If safety were the highest priority in vehicle design, we would
all have to pay for vehicles with the weight, complexity of design, and handwork
found in nameplates such as Mercedes. In the real world we can choose among
more than 300 car models. Some of the very smallest and lightest mass-produced
models are very inexpensive relative to a Mercedes, but they also do not remotely
protect their occupants to the degree of a Mercedes. All cars must provide their
occupants a minimum level of protection by compliance with the Federal Motor
36. Baruch Fischhoff et al., Acceptable Risk (1981).
Vehicle Safety Standards (FMVSS). But even with the FMVSS, there is demonstrably more risk associated with driving a small car.
How much risk is “acceptable” is further complicated by the fact that risk
cannot be spread uniformly in society. The fatal risk of motorized vehicle travel
is borne by those relatively few who die, and by the rest of us only by taxes and
insurance premiums. Unfortunately, some purchasers are willing to assume the
additional risk of a small car in the showroom for the very substantial cost savings, but change their minds after a collision demonstrates the complete cost of
the tradeoff. Our economic system permits purchasers to trade cost for safety
in innumerable other products, from helmets, to tools, to furniture and houses.
In reality, consumers and manufacturers must engage in consideration of cost
versus safety virtually every day, because there are few products where a safer
and more expensive model is not available, and no products exist that cannot be
made safer by being made less convenient and/or more expensive. Denying or
obfuscating this process does not advance safety, science, engineering, or justice.
In light of all the preceding considerations, we last examine the question of
whether there is any absolute level of risk low enough that it almost always is
regarded as “acceptable” and therefore “safe.” Unfortunately, there are a multitude
of such levels from a myriad of sources. In the United States, Chauncey Starr in
1969 quantified the risk of disease to the general population as one fatality per
million hours of exposure, and after studying risk acceptance and participation of
society in many activities concluded that “the statistical risk set by disease appears
to be a psychological yardstick for establishing the level of acceptability of other
risks.”37 Starr observed the de facto level of risk people accept, not necessarily that
which they would say is “acceptable,” was about one in a million chance of fatality
per hour, or unit of, exposure. If an activity presents this level of fatal risk, and a
person wants the benefit of that activity, he or she will almost always accept this
level of risk for the perceived benefit. As a consequence of this initial observation,
“one in a million risk” calculations are now commonplace in the risk literature.38
As the risk level rises above this threshold, a decreasing fraction of the population will find the risk worth the benefits. This is why very high risk sports,
such as skydiving, have many fewer participants. Let us return one more time to
our driver fatality risk of 0.718 × 10⁻⁸ for each mile. This can be conveniently
converted into a risk per hour by recognizing that the average driving speed in
the United States is about 30 miles per hour.39 That means in an hour, the fatal
risk to the average driver is 30 × 0.718 × 10⁻⁸ = 0.215 × 10⁻⁶ or about 0.2 per
million hours or 2 in 10 million hours. It is perhaps more appropriate to return
37. C. Starr, An Overview of the Problems of Public Safety, in Proceedings of Symposium on Public
Safety 18 (1969).
38. R. Wilson & C. Crouch, Risk-Benefit Analysis 208–09 (2d ed. 2001).
39. See, for example, government calculations at http://www.epa.gov/OMS/models/ap42/
apdx-g.pdf or http://nhts.ornl.gov/briefs/Is%20Congestion%20Slowing%20us%20Down.pdf.
to the 1.36 × 10⁻⁸ per mile traveled for the overall risk to society of motorized
vehicular travel, not just the driver risk, to compute the risk level that society de
facto accepts for the benefits of motorized vehicular transport. Then the risk per
hour becomes 30 × 1.36 × 10⁻⁸ = 0.408 × 10⁻⁶ or a little less than half a fatality
per million hours of exposure. This is well below the “one in a million” threshold,
and thus 98%+ of society will participate in this activity. As a final sanity check
on our work, let’s return to the 37.86 risk of fatal injury per 10⁸ VMT for motorcycles. This translates into a risk per hour of 30 × 37.86 × 10⁻⁸ = 1.136 × 10⁻⁵
or more than 11 fatal injuries per million exposure hours. This is above the one-in-a-million threshold, and, understandably, motorcycle riding is regarded as an
unacceptable risk by a large fraction of the population.
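The per-hour conversions above, and the comparison against Starr’s one-in-a-million-per-hour benchmark, can be sketched together:

```python
# Convert per-mile fatality risks to per-hour, assuming ~30 mph average speed
mph = 30
per_mile = {"driver": 0.718e-8, "society": 1.36e-8, "motorcycle": 37.86e-8}
per_hour = {k: mph * v for k, v in per_mile.items()}

threshold = 1e-6  # Starr's de facto "one in a million" fatality per exposure hour
for activity, risk in per_hour.items():
    verdict = "above" if risk > threshold else "below"
    print(f"{activity}: {risk:.3e} per hour ({verdict} threshold)")
```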
This threshold of “one in a million” as the “acceptable” risk level has many
variants. In the United Kingdom, for example, the Health and Safety Executive40
adopted the following levels of risk, in terms of the probability of an individual
dying in any one year:
• 1 in 1000 as the “just about tolerable risk” for any substantial category of
workers for any large part of a working life;
• 1 in 10,000 as the “maximum tolerable risk” for members of the public
from any single nonnuclear plant;
• 1 in 100,000 as the “maximum tolerable risk” for members of the public
from any new nuclear power station;
• 1 in 1,000,000 as the level of “acceptable risk” at which no further
improvements in safety need to be made.
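As an illustration only, the UK bands above can be expressed as a small classifier; the function name and the simplified band boundaries are our own shorthand, not the Health and Safety Executive’s:

```python
def hse_band(annual_fatality_risk: float) -> str:
    """Map an annual fatality risk to the HSE-style tolerability bands above."""
    if annual_fatality_risk <= 1e-6:
        return "acceptable; no further improvement needed"
    if annual_fatality_risk <= 1e-5:
        return "maximum tolerable for the public (new nuclear station)"
    if annual_fatality_risk <= 1e-4:
        return "maximum tolerable for the public (nonnuclear plant)"
    if annual_fatality_risk <= 1e-3:
        return "just about tolerable for workers"
    return "beyond all tolerability bands"

print(hse_band(5e-7))  # → acceptable; no further improvement needed
```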
There are essentially innumerable regulations promulgated by different agencies
within states and the federal government that are beyond the scope of this guide,
but which mandate expenditures to maintain certain maximum risk levels either
implicitly or explicitly. These regulations cover everything from food additives to
acceptable levels of remediation at toxic Superfund sites. Regrettably, there is little
or no coordination among regulating agencies, and no standardized procedures for
addressing risk within the federal government, or within the states. As a result, the
amount spent to “save a life,” which should be termed “forestall a fatality” (because
everyone eventually dies) varies by six orders of magnitude. Table 2 lists a number
of regulations, the year that they were mandated, their issuing agency, and the cost
they effectively mandate be expended “per life saved.” Needless to say, these are
estimates, and the data are somewhat dated, but the relative costs will be approximately the same. Executive orders from recent Presidents starting with Reagan
have attempted to introduce “cost-effectiveness” in one form or another into the
regulatory process, but with little observable effect at this writing.
40. Water Quality: Guidelines, Standards and Health 208–09 (L. Fewtrell & J. Bartram eds.,
2001).
Table 2. Relative Cost of Selected Regulations as a Function of Lives Saved

Regulation                                    Year   Agency   Cost per Life Saved
                                                             (Millions of 1990 Dollars)
Unvented space heater ban                     1980   CPSC       0.1
Aircraft cabin fire protection                1985   FAA        0.1
Aircraft seat cushion flammability            1984   FAA        0.5
Trenching and excavation standards            1989   OSHA       1.8
Rear lap/shoulder belts for cars              1989   NHTSA      3.8
Asbestos occupational exposure limit          1972   OSHA       9.9
Ethylene oxide occupational exposure limit    1984   EPA       24.4
Acrylonitrile occupational exposure limit     1978   OSHA      61.3
Note: CPSC = Consumer Product Safety Commission; EPA = Environmental Protection Agency;
FAA = Federal Aviation Administration; NHTSA = National Highway Traffic Safety Administration;
OSHA = Occupational Safety and Health Administration.
Source: W. Kip Viscusi & Ted Gayer, Safety at Any Price? Regulation 54, 58 (Fall 2002).
Finally, although we acknowledge that this section on safety is quite extensive, we also believe it is extremely important for the court to recognize how
engineers think about safety. Engineers are dedicated to making safe products.
At the same time, they recognize that every increment in safety has an expense
associated with it. Just as there is no product or environment that is risk-free,
there is no bright-line threshold that universally divides safe and unsafe products;
safety is not binary. For each properly designed product, there is a unique set of
constraints (including cost), and a safe-enough level exists that balances constraints
with acceptable risk.
C. The Design Process—Examples in Which This Guiding
Principle Was Not Followed
To illustrate ways in which flawed design processes lead to adverse outcomes,
a number of examples are selected covering a range of incidents that occurred
during the past century. In each instance, the link in the design process that was
either missing or corrupted is highlighted and discussed. The reader may wish to
refer to Figure 1 when considering these examples.
1. Inadequate response to postmarket problems: Intrauterine devices (IUDs)
Insertion of objects into a woman’s uterus has long been a means of contraception.
In the twentieth century, IUDs were designed, manufactured, and mass marketed
around the world. Many of them were associated with adverse health consequences,
in particular, pelvic inflammatory disease, which led to long-term disabilities and
even death in substantial numbers of women. An example was the Dalkon Shield,
marketed and sold by A.H. Robins. The health problems of wearers of this device
put its manufacturer in bankruptcy and led Congress to pass legislation to enhance
medical device regulation generally, including most IUDs. Thus, those authorities
corrected the flawed design process employed by A.H. Robins, which had led it to
conclude that the product could be initially marketed and even continued to be
marketed in the face of reports of serious health problems and death.
The Copper-7 IUD, marketed and sold by G.D. Searle, represented a somewhat different situation. That device received FDA approval as a drug. After
it reached the market, Searle received reports of health problems. In litigation
brought by women who used the product, some courts concluded that the risk
associated with its use was “unacceptable.”41
With all IUDs, the inserted device has a “string” attached to it that passes
from the uterus through the cervix and into the vagina. The “string” is used for
the purposes of removal as well as to provide certainty to the woman that the
IUD remains in place and has not been expelled. But, to provide these functions,
it compromises a biological firewall that ensures sterility of the uterus—the cervix.
Therefore, in choosing the string material and fabrication method, designers had
to assess choices that if properly made, reduced, if not eliminated, the potential for
bacteria to migrate from the vagina into the uterus. With both the Dalkon Shield
and the Copper-7, the designers set aside this consideration and traded it for the
ability to enhance manufacturability and appearance by using strings that resulted
in the unacceptable transmission of infectious agents into the uterus. These design
choices were made for the purpose of reducing expense and gaining a competitive
marketing edge, not to enhance consumer safety, and therefore led to unacceptable risk. They turned out to be lethal choices, two more examples of failures to
adhere to the well-established and time-honored design process.
2. Initial design concept: Toxic waste site
For 17 years, over 35 million gallons of industrial waste were deposited in pits
dug into the ground in what had been presumably certified to be a granite-lined
impermeable geological formation that would not leak. These were known as the
Stringfellow Acid Pits, located near the Riverside suburb of Glen Avon, California, some 50 miles east of Los Angeles. History proved otherwise, and millions
of gallons of toxic materials escaped containment and contaminated groundwater
supplies and exposed local inhabitants to chemical vapors.42
41. See Robinson v. G.D. Searle & Co., 286 F. Supp. 2d 1216 (N.D. Cal. 2003); Kociemba v.
G.D. Searle & Co., 683 F. Supp. 1577 (D. Minn. 1988).
42. See State v. Underwriters at Lloyd’s London, 54 Cal. Rptr. 3d 343 (Cal. Ct. App. 2006), pet.
for review granted, 156 P.3d 1014 (Cal. 2007) for general overview and United States v. Stringfellow, No.
In this instance, the design process was flawed from the very beginning (i.e.,
the incomplete and incorrect geological analysis) and led to an “engineered” site
for the containment of toxic wastes, which had no chance of performing properly.
This is an excellent example showing that once the design process is corrupted,
everything that follows in the design cascade, although perhaps done correctly,
will most likely not lead to a successful design outcome.
3. Foreseeable safety hazards: Air coolers
In low-humidity locales, it is possible to “air condition” a structure using evaporative cooling of water. This is done in devices known as “swamp coolers.” They
either sit beside or in most instances on the roofs of the structures being cooled.
The operation is simple. They consist of an enclosure or a box in which a small
pump is used to saturate porous panels through which air is drawn thereby
evaporating the water and cooling the air that is directed into the interior spaces.
The pumps are electrically powered and are known to short-circuit and fail, thus
becoming a potential source of ignition and fire. A simple design solution is to
make the box noncombustible. This, of course, is the case when the box is metal,
but then one has to be concerned with corrosion and subsequent maintenance.
To obviate the corrosion issue, the box can be made of plastic. Plastic does not
corrode but it is potentially flammable unless flame retardants are added as part of
the materials formulation. Foreseeing this occurrence and making the conscious
choice not to add flame retardants is an abdication of the design process and with
that can come tragic consequences.
This scenario was played out in Vanasen v. Tradewinds,43 where a 5-year-old
girl was killed as the result of a foreseeable pump failure, subsequent electrical
short circuit, and ignition of a non-fire retardant plastic swamp cooler attached
to the roof of her home. Again, failure to adhere to the straightforward tenets
of the design process (i.e., designing out the known tendency of many plastics
to burn) is tantamount to “rolling the dice” and hoping for the best. Experience
teaches us time and again that taking design “shortcuts” seldom translates into an
acceptable design outcome.
4. Failure to validate a design: Rubber hose for radiant heating
Radiant heating has been in use since Roman times, and a common variant of this
heating method involves placement of tubes that circulate heated fluids beneath
floors, thus warming the floors that then in turn heat the surrounding structure.
Although metallic tubes were once common in this application, their cumbersome
CV 83-2501 JMI, 1993 WL 565393 (C.D. Cal. Nov. 30, 1993) for discussion of specific findings of fact
by the special master; P. Kemezis, Stringfellow Cleanup Settlement: Companies Agree to Pay $150 Million,
Chemical Week, Aug. 12, 1992, at 11; http://www.dtsc.ca.gov/PressRoom/upload/t-01-99.pdf.
43. Tulare County, CA Sup. Ct., No. 93-161828.
installation and susceptibility to corrosion led to the development of plastic tubes.
One manufacturer recognized that rubber hose would be even easier to install
than the somewhat rigid plastic conduits, and engaged a major rubber company
to design a hose for the radiant heating market. The rubber company supplied a
hose formulation that was designed for and used in automotive cooling applications, which made some sense given that similar fluids at similar temperatures are
circulated in both cases. The rubber company failed to test the newly developed
hose under end-use conditions, and thereby neglected to detect a failure mode
caused by hose hardening and embrittlement. Engineering experts for the plaintiffs
conducted a simple end-use test that verified that the hose would degrade under
foreseeable conditions, thus completing the step in the design process that was not
performed by the rubber company.44
5. Proper design—improper assembly: Kansas City Hyatt Regency Hotel
On July 17, 1981, during a tea dance in the vast atrium at the Hyatt Regency
Hotel in Kansas City, two elevated walkways collapsed onto the people celebrating
in the lobby, killing 114 of them and injuring more than 200.
The determination of what happened focused on the design and construction of the walkways. The 40-story complex featured a unique main lobby design
consisting of a 117-foot by 145-foot atrium that rose to a height of 50 feet.
Three walkways spanned the atrium at the second, third, and fourth floors. The
second-floor walkway was directly below the fourth, and the third was offset to
the side of the other two walkways. The third- and fourth-floor walkways were
suspended directly from the atrium roof trusses, while the second-floor walkway
was suspended from the fourth-floor walkway. During construction, the design,
fabrication, and installation of the walkway hanger system were changed from
that originally intended by the design engineer. Instead of one hanger rod connecting the second- and fourth-floor walkways to the roof trusses, two rods were
used—one to connect the second- to the fourth-floor walkway, and another to
connect the fourth-floor walkway to the roof, thus doubling the stresses in the
ill-conceived connection.
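The load doubling described above can be sketched with simple statics; the unit load below is illustrative, not the actual walkway weight:

```python
# Each walkway transfers a load P to its hanger connection (illustrative units).
P = 1.0

# Original design: one continuous rod from roof truss down to the second floor.
# The fourth-floor box-beam connection supports only its own walkway.
original_connection_load = P

# As built: the rod was split at the fourth floor, so that connection
# must carry its own walkway plus the second-floor walkway hung below it.
as_built_connection_load = P + P

print(as_built_connection_load / original_connection_load)  # 2.0
```

The same conclusion holds for any walkway load P, which is why the change doubled the stress on the connection regardless of occupancy.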
Just prior to the collapse, about 2000 people had gathered in the atrium
to participate in and watch a dance contest, including dozens who filled the
walkways. At 7 p.m., the walkways on the second, third, and fourth floor were
packed with visitors as they looked down to the lobby, also full of people. It was
the second- and fourth-floor walkways—the ones that experienced the design
changes—that collapsed. Clearly then, in the iterative cycle of the design process,
modifications to the original design need to be validated, and failure to do so can have severe consequences. Further details of this event can be found in the second edition of this manual.45
44. http://www.entraniisettlement.com/PDFs/PreliminaryApprovalAmended.pdf; J. Moalli et al., Failure Analysis of Nitrile Radiant Tubing, ANTEC 2006 Plastics: Annual Technical Conference, Society of Plastics Engineers, May 7–11, 2006, Charlotte, NC (2006).
6. Failure to validate a design: Tacoma Narrows Bridge
Spanning a strait, the third longest suspension bridge of its time, the Tacoma
Narrows Bridge opened on July 1, 1940. In November of that same year, it collapsed into Puget Sound. During the design process, engineers failed to adequately
account for the effects of aerodynamic flutter on the structure, a phenomenon
in which forces exerted by winds couple with the natural mode of vibration of
the structure to establish rapid and growing oscillations. In essence, the bridge
self-destructed.
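The flutter mechanism can be illustrated with a toy single-mode calculation; the frequency and damping values below are assumed for illustration only, not measured properties of the bridge. When wind coupling acts as negative damping, the oscillation envelope grows instead of decaying:

```python
import math

# Toy model: for x'' + 2*zeta*omega*x' + omega^2*x = 0, the amplitude
# envelope is exp(-zeta * omega * t). Values below are assumed, not bridge data.
omega = 2 * math.pi * 0.2   # natural frequency, assumed 0.2 Hz
t = 120.0                   # two minutes of oscillation

ordinary_damping = math.exp(-0.01 * omega * t)   # zeta = +0.01: decays
flutter = math.exp(-(-0.01) * omega * t)         # zeta = -0.01: grows

print(round(ordinary_damping, 2))  # 0.22 -- oscillation dies out
print(round(flutter, 2))           # 4.52 -- oscillation grows without bound
```

The sign of the effective damping, not the size of the wind force, is what distinguishes ordinary buffeting from self-destructive flutter.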
It is fair to say, however, that aerodynamic flutter was not well understood
at the time this bridge was constructed. Indeed, the term was not coined until
the late 1940s, years after the bridge collapsed. The root cause of this unfortunate
circumstance was a desire to build a bridge with enhanced visual elegance (i.e.,
long and narrow) and to use an untested girder system that offered significant
cost savings. This should have led to a thorough testing and validation program
to ensure that venturing into uncharted waters in bridge design would not result
in unintended or unanticipated consequences. Indeed, after the bridge was constructed and put into use on July 1, 1940, it gained a reputation for its unusual
oscillations and was known as “Galloping Gertie.” It was only then that engineers
built a scale model of the bridge and began testing its behavior in a wind tunnel.
Those studies were completed and remedies proposed in November 1940, just
days before the bridge fell into the Tacoma Narrows channel.
A substantial departure from the norm of appropriate testing and validation is
an unacceptable application of the design process, and the collapse of this bridge
is an all too sobering reminder of this. Stated another way, end-use testing should not be performed by the “consumer”; where this occurs, a clear violation of the tenets of the design process has taken place.46
7. Failure to conform to standards and validate a design: Automotive lift
Automotive lifts are often used in dealerships and service stations to raise vehicles
and provide access to components on the bottom of the vehicle for service. To
reduce the propensity for injury, ANSI and the Automotive Lift Institute (ALI)
promulgate standards that specify, among other things, the minimum resistance
on the horizontal swing-arm restraints. The lift in question had a label on the lift
support structure indicating that it was in compliance with these specifications, so when a Jeep Wrangler fell from the lift and injured the owner of a service station, the plaintiff set out to verify the lift’s conformity to the standards.
45. Henry Petroski, Reference Guide on Engineering Practice and Methods, in Reference Manual on Scientific Evidence 577, 601–02 (2d ed. 2000).
46. This example is also further discussed in the second edition of this manual.
Testing by the plaintiff’s expert revealed that the swing-arm lift restraints provided only 30% of the resistance specified in the standard, and that simple reconfiguration of the restraint components could create a conforming lift. Furthermore, the plaintiff’s expert calculated that for the vehicle-lift configuration in
question, the amount of force required to provide positive restraint was less than
that required by the standards, and therefore the accident would have been prevented had the standards been met. Finally, the plaintiff’s expert opined that the
label on the lift that claimed compliance with the standard would tend to convey
to the end-user of the product that the presence of the swing-arm restraint added
a layer of insurance for the operator in the event that there was an imperfect
placement of the vehicle over the lifting pads.
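The conformity test just described reduces to a simple comparison; the normalized numbers below are illustrative, with only the 30% figure taken from the testing described above:

```python
# Normalized conformity check (illustrative; only the 30% figure is from the case).
required_resistance = 1.0                          # minimum specified by the standard
measured_resistance = 0.30 * required_resistance   # what the expert's testing found

def conforms(measured, required):
    """A restraint conforms only if it meets or exceeds the standard's minimum."""
    return measured >= required

print(conforms(measured_resistance, required_resistance))  # False
```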
In response, the lift manufacturer claimed that the intended swing-arm
restraining forces arose from the friction created when the lifting pad contacted
the vehicle undercarriage, and further argued that the swing-arm restraint was
nothing but “fluffery” forced upon lift manufacturers to remain competitive in the
marketplace. The jury found for the plaintiff, implicitly recognizing the tenet of the design process that calls for testing and validation of design claims and features.
8. Lack of sufficient information and collective expertise to consummate a
design: Dam collapse
After 2 years of construction, the St. Francis dam in southern California was completed in 1926, and the reservoir behind it began to fill. As the reservoir reached
near capacity behind the 195-foot-high concrete arch dam, the eastern abutment
gave way shortly before midnight on March 12, 1928, unleashing a wall of water
over 100 feet high that eventually dissipated into the Pacific Ocean some 50 miles
downstream. The flood killed at least 600 people, and most likely more. The
collapse of the St. Francis dam is one of the worst American civil engineering
failures of the twentieth century.47
The dam was designed and certified by a single individual, William Mulholland,
chief engineer and general manager of the Los Angeles Department of Water &
Power (at the time known as the Bureau of Water Works & Supply). Mulholland had no formal education and was self-taught. Although the ultimate physical cause of the failure was the proximity of a paleomegalandslide to the eastern dam abutment (a geological anomaly whose detectability in the 1920s geologists still debate), the inquest that followed the disaster placed responsibility for the tragedy on improper engineering, design, and governmental inspection.
47. St. Francis Dam Disaster Revisited (Doyce B. Nunis. Jr., ed., 2002).
Indeed, we now know that the design of this structure failed to meet accepted
design principles already in place in the 1920s. The dam height was increased by
10 feet at the start of construction, and another 10 feet midway through construction, bringing the final capacity to 38,000 acre-feet. No modifications were made
to the base to accommodate this additional capacity, and there were a number of
weaknesses in the design of the base. It is estimated that the factor of safety, which
was meant to be above 4 in the initial design, may have been as low as 0.77 on
the dam that was actually constructed.
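A rough calculation shows why raising the dam without modifying the base mattered. Under the common simplification that the overturning moment from hydrostatic pressure on a gravity dam scales with the cube of water height (the heights below follow the text; this is a sketch, not Rogers’ actual analysis):

```python
# Hydrostatic resultant per unit width: F = 0.5 * w * h^2, acting h/3 above
# the base, so the overturning moment scales as h^3.
# (Simplified sketch, not a full dam stability analysis.)
w = 62.4  # unit weight of water, lb/ft^3

def overturning_moment(h_ft):
    force = 0.5 * w * h_ft ** 2
    return force * h_ft / 3.0

m_original = overturning_moment(175.0)  # design height before the two 10-ft raises
m_as_built = overturning_moment(195.0)  # final height per the text

print(round(m_as_built / m_original, 2))  # 1.38 -- ~38% more demand on the same base
```

Even this crude scaling makes plain how a factor of safety intended to exceed 4 could erode toward, and below, 1.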
Geoforensics expert J. David Rogers enumerated many other design deficiencies associated with the St. Francis Dam, among them the lack of hydraulic
uplift theory being incorporated into the dam’s design; lack of uplift relief wells
on the sloping abutment sections of the dam; failure to batter the upstream face
of the dam to reduce tensile forces via cantilever action; failure to analyze arch
stresses of the main dam; failure to remove high-water-content cement paste
(laitance layer) between concrete lifts; failure to account for the mass concrete
heat-of-hydration; failure to recognize the tendency of the Vasquez formation to
slake upon submersion; failure to provide the dam with grouted contraction
joints; failure to recognize that the dam concrete would eventually become saturated; and failure to wash concrete aggregate before incorporation in the dam’s
concrete.48
In this instance there simply was no credible design process from concept,
through design, execution, and postconstruction surveillance. As a result, a massive failure ensued.49
9. Operation outside of design intent and specifications: Space shuttle
Challenger
On January 28, 1986 the space shuttle Challenger and its accompanying liquid
hydrogen and oxygen external tank (ET) disintegrated over the Atlantic Ocean
after only about 70 seconds of flight. The two attached solid rocket boosters
(SRB) separated from the shuttle and ET and were remotely destructed by the
range safety officer. All seven of the NASA crewmembers were killed.
We now know the physical reason for this catastrophe. Two rubber “O”-rings
placed at the aft joint where two sections of the right SRB came together had
failed to “extrude” themselves as the SRB metal shell deformed during the early
moments of ignition. Because of this, hot gases escaped through the breach created by the ineffective seal at the O-ring joint and led to the separation of the aft
strut that attached the right SRB to the ET. This was followed by failure of the aft dome of the liquid hydrogen tank.
48. J. D. Rogers, The St. Francis Dam Disaster Revisited, 77 Southern California Q. (1-2) (2003); J. D. Rogers, The St. Francis Dam Disaster Revisited, 40 Ventura County Q. (3-4) (2003).
49. Donald C. Jackson & Norris Hundley, Privilege and Responsibility: William Mulholland and the St. Francis Dam Disaster, California History (Fall 2004).
The massively uneven thrust created by the
escaping hydrogen gas altered the trajectory of the shuttle and aerodynamic forces
destroyed it. The failure of the “O”-rings to alter their conformations with SRB
shell deformation was attributed to the low ambient temperature at the time of
launch. The O-rings had “hardened” and as a result lost their required flexibility.
Two investigations into the circumstances surrounding this disaster took
place. Reports and findings were issued by the Presidential Rogers Commission50
and the U.S. House Committee on Science and Technology.51 While both reports
agreed on the technical causes of the catastrophe (failure of the “O”-rings to perform as intended), their conclusions as to the root cause were stated somewhat
differently but in the end pointed to the same basic issue. The Rogers Commission concluded that the National Aeronautics and Space Administration (NASA)
and the O-ring manufacturer, Morton Thiokol, failed to respond adequately to
a known design flaw in the O-ring system and communicated poorly in reaching the decision to launch the shuttle under extremely low ambient temperature
conditions. The House Committee concluded that there was a history of poor
decisionmaking over a period of several years by NASA and Morton Thiokol in
that they failed to act decisively to solve the increasingly serious anomalies in the
SRB joints.
Another way of stating what both reports essentially say is that the design
process resulting in the double O-ring (now a triple O-ring system) was flawed.
Moreover, NASA managers knew of this problem as early as 1977. Warnings by
engineers not to launch that cold morning were disregarded. Each SRB consisted of six pieces, three welded together in the factory and the remaining three
fastened together at the launch facility in Florida using the double O-ring seal
system. Thiokol engineers lacked sufficient data to guarantee seal performance of
the O-rings below 53 degrees Fahrenheit (°F). Temperature at launch hovered at
31°F. When originally designed, the O-rings were intended to remain in circumferential grooves. After several shuttle launches, it became evident that the SRB
shell was deforming and that hot gases could escape but that the O-rings were
“extruding” to seal these temporary breaches. As a result, the design specifications
were changed to accommodate this process. The design itself, however, remained
unchanged.
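The launch decision can be framed as an operating-envelope check that was skipped; the 53°F figure is from the record described above, while the function itself is a hypothetical illustration:

```python
# Hypothetical envelope check: Thiokol had seal-performance data only down to 53 F.
QUALIFIED_MIN_TEMP_F = 53

def within_qualified_envelope(ambient_f):
    """True only when the ambient temperature is inside the tested data range."""
    return ambient_f >= QUALIFIED_MIN_TEMP_F

print(within_qualified_envelope(53))  # True: coldest condition with supporting data
print(within_qualified_envelope(31))  # False: ambient temperature at launch
```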
If one considers that the original design concept was to ensure a seal between
the SRB field-joined sections using two O-rings, the question on the table is
whether the actual design and subsequent execution were consistent with the
design process. Clearly this was not the case. First, the system performed differently than expected (i.e., extrusion occurred); validation and testing to ensure that this aberrant behavior of the original seal system was actually acceptable were never performed, other than monitoring shuttle launches and hoping for the best. Second, the O-rings were known to have insufficient resiliency at temperatures substantially higher than those encountered on the day of the Challenger launch; launching at such a low ambient temperature therefore amounted to misuse of the system. The unfortunate truth is that an unsound design process will almost certainly produce a flawed product.
50. Rogers Commission, Report of the Presidential Commission on the Space Shuttle Challenger Accident (1986).
51. Committee on Science and Technology, Investigation of the Challenger Accident, H.R. Rep. No. 99-1016 (Oct. 29, 1986).
10. Foreseeable failure and lack of design change in light of field
experience: Air France 4590
On July 25, 2000, Air France flight 4590, a Concorde supersonic passenger jet, departed Charles de Gaulle Airport and crashed into a nearby hotel, killing 100 passengers, 9 crew, and 4 others on the ground. The physical cause was readily
determined. The Concorde was designed to take off without flaps or leading-edge
slats as a weight-saving measure. Because of this, it required a very high takeoff
roll speed to become airborne. This placed unusually high stresses on the tires. A
piece of titanium metal approximately 1 × 16 inches was lying on the departure
runway. It had fallen from a thrust reverser assembly on a Continental Airlines
DC-10 that had departed minutes earlier. During its takeoff roll, the Concorde
struck the metal debris and this punctured and subsequently shredded one of its
tires. The tire remnants broke an electrical cable and created a shock wave that
fractured a fuel tank. The fuel ignited and an engine caught fire. The plane had
reached a ground speed at which the pilot judged it prudent to continue
the takeoff rather than abort. The crew shut down the burning engine. Unable
to retract the landing gear, and now experiencing problems with the remaining
engines, the crew was unable to climb and the aircraft rolled substantially to the
left and contacted the ground.52
In this instance, a design decision was made to save weight by not having
retractable flaps and slats. This led to higher than normal landing and takeoff
speeds. This in turn placed additional demands on the tires. They would be rotating at higher speeds and contain much increased kinetic energy. This meant that
when one or more failed, the rubber shrapnel would be released with additional
force. This led to a greater risk of puncture of the aircraft structure and therefore called for special consideration to ensure that the aircraft skin could maintain integrity in the foreseeable event of a tire rupture. Making the skin more resilient to puncture
implied additional weight and this would work against the primary reasoning for
not having the slats and flaps. And there we have the design conundrum.
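The tire-energy point follows from kinetic energy scaling with the square of speed; the speeds and fragment mass below are assumed round numbers for illustration, not certified Concorde figures:

```python
# KE = 0.5 * m * v^2, so debris energy grows quadratically with takeoff speed.
def kinetic_energy_j(mass_kg, speed_ms):
    return 0.5 * mass_kg * speed_ms ** 2

fragment = 1.0  # kg, hypothetical tire fragment mass

conventional = kinetic_energy_j(fragment, 80.0)   # assumed ordinary takeoff speed
concorde = kinetic_energy_j(fragment, 100.0)      # assumed higher Concorde roll speed

print(round(concorde / conventional, 2))  # 1.56 -- over 50% more energy per fragment
```

A modest increase in takeoff speed thus translates into a disproportionate increase in the damage a shed fragment can do, which is the heart of the design conundrum.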
Having made what was initially regarded as a reasonable compromise in
the aircraft design, the manufacturer subsequently gained experience with the
Concorde, learning that tire failures could be potentially catastrophic (the type of experience illustrated by the “Performance” arrow from the “Go” stage to the “Design/Formulate” and “Test/Validate” stages in Figure 1).
52. http://www.bea-fr.org/docspa/2000/f-sc000725a/htm/f-sc000725a.html.
Between July
1979 and February 1981, there were four documented tire ruptures on takeoff.
In two of these instances, substantial damage was done to the aircraft structure,
but the planes were able to land without incident. Despite having these critical
data related to the initial design assumptions and associated compromises in hand,
no remedial changes were made to either the tire or aircraft design. After the
2000 crash, design changes were made to the electrical cables, the fuel tanks were
lined with Kevlar, and specially designed burst-resistant tires were put into use.
The Concorde fleet was retired from service in 2003, with declining passenger
revenues cited as the major cause.
In the case of the Concorde, the record appears to indicate that designers
chose not to alter the design, even in the face of significant data, until a fatal
accident occurred. Although these actions may be consistent with the above
discussion on risk, and how it is perceived, the crash is illustrative of how the
fundamentally simple design process works, and that departures from it can have
serious consequences.
IV. Who Is an Engineer?
A. Academic Education and Training
Having earned a bachelor’s degree in an engineering curriculum is generally sufficient to enter the professional workplace and begin to immediately solve a wide
variety of problems. This is less often the case for students who graduate with degrees in
the basic sciences such as physics, chemistry, or biology or in mathematics. Typically,
but not always, these basic science students will go on to earn graduate degrees.
It is also the case that some students who have earned an engineering degree
will continue to the master’s or even doctorate level of study. In 2004, U.S. colleges and universities awarded approximately 75,000 bachelor’s degrees, 36,000
master’s degrees, and 6000 doctoral degrees in all areas of engineering.53
One can think of the educational process as providing engineering students
with a toolkit from which they select “tools” to enable them to either individually
or in teams participate in scientific and technological innovation. Because these
students are educated, as opposed to having been trained, one can never be quite
sure how they will choose to use their tools, or add to the kit, or delete from the
kit. Although carpenters share a common toolkit, we know the structures they
build can be appreciably different in size, shape, and scope. So it is with engineers.
53. Report 1004D: Total Numbers of Bachelor’s, Master’s and Doctoral Degrees Awarded per Million
Population Since AY1945-46—Including Data for Degrees Awarded to US Citizens Since AY1970-71,
Engineering Trends Quarterly Newsletter, Oct. 2004, at 1.
That scientists and engineers can be one and the same is epitomized by Renaissance humanism almost five centuries past.
There, Leonardo da Vinci, with a minimalist toolkit by today’s standards, lived
a life equally as an engineer and a scientist, and indeed an artist. Four centuries
later, Buckminster Fuller seamlessly combined elements of geometry (aka “science”), structures (aka “engineering”), and architecture (aka “art”) to conceive
and develop an entirely new approach to architectural design. Architects Norman
Foster and Frank Gehry seized on recent advances in computer science and engineering to provide innovative platforms for architectural design that paved the way
for radical changes in structural and visual renderings. Striking examples include
the Guggenheim Museum Bilbao, the Walt Disney Concert Hall Los Angeles,
the Experience Music Project Seattle, the City Hall London, the Beijing Airport,
and the Reichstag Berlin.
Searching for ways to create or define the “bright line” that classifies da Vinci,
Fuller, Foster, or Gehry as engineers, scientists, architects, or artists is as empty an
exercise today as it would have been five centuries ago. This, of course, does not
preclude one from considering himself or herself as an “engineer” or a “scientist”;
however, the subtler point is that one can also be both or either at different points
in time or at the same time. This can be overlooked or ignored in the quest for
limiting or excluding expert testimony.
B. Experience
Without knowing how an engineer or scientist will use his or her toolkit and to
what extent it will be replenished or modified as time goes on, it is not possible
to begin to even second-guess what any particular individual may do to shape
his or her career as time passes. There is a great deal of truth to the notion of
“learning on the job.” Indeed, as one’s career unfolds, the number of opportunities expands and with that comes additional skills and an ever-increasing ability
to make wise and informed choices and decisions. Being an engineer affords one
the opportunity to continually remodel oneself as new and unexpected problems
and challenges become evident.
And so it is with the passage of time that the “title” of one’s degree becomes
an increasingly murky description of who one is and what one does. This is why
it is so critical when evaluating whether an “engineer” is testifying within his
or her realm of expertise that titles do not overshadow the actual context of a
degree (i.e., the name may not reflect the knowledge attributes accurately) and the
experience base at hand. Even though it is an all too common tactic to attempt
to confine expert witness testimony to the asserted domain of his or her named
academic credentials, it is one that may well lead to less-informed testimony
than otherwise would be the case. This is a high price to pay when the desired
outcome is finding the right path to both truth and justice.
C. Licensing, Registration, Certification, and Accreditation
Licenses are required for engineering professionals in all 50 states and the District
of Columbia, if their services are offered directly to the public and they would affect public health and safety. Licensed engineers are called professional engineers (PEs).
In general, to become a PE, a person must have a degree from an ABET54-accredited engineering college or university, have a specified time of practical and
pertinent work experience, and pass two examinations. The first examination—
Fundamentals of Engineering (FE)55—can be taken after 3 years of university-level
education, or can be waived on the basis of pertinent experience. The FE examination
is a measure of minimum competency to enter the profession. Many colleges and
universities encourage students to take the FE exam as an outcome assessment
tool following the completion of the education coursework. Students who pass
this examination are called engineering interns (EIs) or engineers in training (EIT)
and take the second examination after some work experience. This is the Principles and Practice of Engineering examination. The earmark that distinguishes
a licensed/registered PE is the authority to sign and seal or “stamp” documents
(reports, drawings, and calculations) for a study, estimate, design, or analysis, thus
taking legal responsibility for it.
Many engineering professionals do not seek a PE license because their services
are not offered directly to the public or they have no need to sign, seal, or “stamp”
engineering documents. Whether an individual is licensed as a PE is neither sufficient nor necessary to establish his or her competency as an engineer. Furthermore, the two examinations test only for knowledge gained and assimilated at
the undergraduate level. It is therefore common for professors of engineering in
colleges and universities not to have PE licensure—indeed, they are the ones who
teach and prepare those who do take these examinations. Despite this, a common
litigation practice is to attempt to preclude “engineering” testimony offered by
professionals who have had no need to obtain PE licensure as if this were intended
to be some sort of requirement for practicing in the profession or for testifying in
court. Such an approach is unwarranted and inconsistent with the way in which
engineers behave and think about the work they do.
54. Founded in 1932 as the Engineer’s Council for Professional Development (ECPD), it was
later renamed ABET (Accreditation Board for Engineering and Technology). In the United States,
accreditation is a nongovernmental, peer-review process that ensures the quality of the postsecondary
education that students receive. Educational institutions or programs volunteer to undergo this review
periodically to determine if certain criteria are being met. ABET accreditation is assurance that a
college or university program meets the quality standards established by the profession for which it
prepares its students. The quality standards that programs must meet to be ABET-accredited are set by
the ABET professions themselves. This is made possible by the collaborative efforts of many different
professional and technical societies. These societies and their members work together through ABET
to develop the standards, and they provide the professionals who evaluate the programs to make sure
that they meet those standards.
55. In the past, this examination was known as the Engineer in Training (EIT) exam.
PE licensure is quite different from board certification for a physician or bar
certification for a lawyer. Physicians and lawyers may not practice their professions
without having such board certification. Such is not always the case for engineers
and therefore it is not appropriate or correct to construe this to be so. The title
“engineer” is legally protected in many states, meaning that it is unlawful to use it
to offer engineering services to the public unless permission is specifically granted by that state through a professional engineering license or an “industrial exemption,” or unless the title is a nonengineering one such as “operating engineer.” Employees
of state or federal agencies may also call themselves engineers if that term appears
in their official job title. In some states, a business generally cannot offer engineering services to the public or have a name that implies that it does so unless it employs at least one PE. For example, New York requires that the owners of a
company offering engineering services be PEs. In summary, licensing procedures
and requirements are state specific, but such licensure is not a requirement to
testify in federal court.
As a postscript to this discussion, civil engineers often seek PE registration because of their association with public works projects. This can be traced
directly back to the failure and subsequent legacy of the St. Francis dam collapse
in southern California in the late 1920s. More about this disaster is discussed in
Section III.C.8.
V. Evaluating an Engineer’s Qualifications
and Opinions
A. Qualification Issues and the Application of Daubert
Standards56
Engineers are treated like other witnesses when it comes to determining whether
they can testify as factual or expert witnesses. Thus, if they have information
regarding facts in dispute, an engineer can be a fact witness describing that information. In the context of the design of a product or the conception of an allegedly protectable method or device, that may take the form of describing what the
engineer did to create the product or construct at issue, how he or she conceived
of the subject of that product or construct, and how the product or allegedly protectable property compares to other designs or intellectual property that are claimed to relate to the subject of the dispute.
56. Daubert standards were established in the trilogy of cases, Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999), General Elec. Co. v. Joiner, 522 U.S. 136 (1997), and Daubert v. Merrell Dow Pharms., Inc., 509 U.S. 579 (1993), and refer to factors to be considered when assessing the admissibility of expert testimony. See generally Margaret A. Berger, The Admissibility of Expert Testimony, in this manual.
As discussed above, engineers can also be expert witnesses. Like any other
proffered expert, an engineer’s training, background, and experience play a role
in qualifying him or her to provide expert opinions. Education, licensing, professional activities, patents, professional society involvement, committee work,
standards development, professional consulting experience, and involvement in
a business based on similar technology or engineering principles can all help to
fortify an engineering expert’s qualifications. The work that the engineer did to
acquire facts about the matter at issue (described below) and to test the engineer’s
hypothesis as to how the incident in question occurred and what caused it provide
still stronger bases for allowing the engineer to testify as an expert witness.
Ultimately, the court’s application of Daubert standards to the qualifications
asserted by the engineer and the opinions that the engineer seeks to give determines whether the engineer may testify. In the role of gatekeeper of scientific
or technical testimony, the trial judge determines whether the engineering testimony is both “relevant” and “reliable.” The relevance and reliability of engineering testimony are judged in the context of the design process and the way that
engineers approach a problem as described above. And as the court clarified in
Kumho Tire, Daubert extends to all expert testimony, including testimony based
on experience alone.57
B. Information That Engineers Use to Form and Express
Their Opinions
Under Federal Rule of Civil Procedure 26(a)(2)(B)(i), the expert report must
contain the basis and reasons for all opinions expressed, and certainly the expectation is that oral testimony will do the same. Apart from opinions based purely
on knowledge, skill, experience, training, or education, nearly all expert opinion
is based on observations, calculations, experimentation, or some combination
thereof.
1. Observations
a. Inspections
When called as an expert in a products liability case, engineers will often complete
a physical inspection of a failed product or accident scene. ASTM and the National
Fire Protection Association (NFPA) have published several standard practices that
57. See Margaret A. Berger, The Admissibility of Expert Testimony, in this manual.
offer guidance on inspections and related issues.58 Although it is not required that
engineers adopt and follow these standards, if the court has questions as to whether
the techniques or procedures used by an engineer are reasonable, reference to the
standards can certainly be helpful.
As a first step in the inspection process, engineers will typically document
evidence or the accident scene using photography and videography. It may be worth noting that, just as 2009 was the first year that the official presidential portrait was taken digitally, most engineers now record photos and video digitally. Other
measurements and readings can also be made at the initial inspection, as engineers
establish the state of the evidence and attempt to determine if it has been altered
subsequent to the incident.
One important issue that often arises during an inspection is the destruction of
evidence, and engineers sometimes argue as to whether testing is truly destructive.
ASTM E 860 provides some guidance that could be useful to the court in terms
of providing a reference to engineers:
Destructive testing—testing, examination, re-examination, disassembly, or other
actions likely to alter the original, as-found nature, state or condition of items of
evidence, so as to preclude or adversely affect additional examination or testing.
In terms of inspections, destruction of evidence typically relates to disassembly
or displacement of parts, and disputes can usually be resolved by establishing an
agreed-on protocol between parties. If items that have physically broken or separated
are at issue, it should be remembered that two fracture surfaces are created, each a
mirror image of the other, and one can be preserved while the other is evaluated.
Microscopic examination of failure surfaces, also known as fractography, is commonly used by engineers to determine the cause of failure. Fractography can be
used to establish such things as how the product failed (overload versus a fatigue
or time-dependent failure) and whether manufacturing defects (poor welds, voids,
inclusions) exist.
b. Experiments and testing
After performing inspections of the evidence, engineers develop hypotheses as
to the cause of what they are investigating and evaluate these hypotheses. One
common method of testing a hypothesis is experimentation, and engineers are
educated and trained to conduct experiments, often to the displeasure of their
58. Although not intended to be an exhaustive list, these standards include:
• ASTM E 860—Standard Practice for Examining and Preparing Items That Are or May Become
Involved in Criminal or Civil Litigation,
• ASTM E 2332—Standard Practice for Investigation and Analysis of Physical Component Failures,
• ASTM E 1188—Standard Practice for Collection and Preservation of Information and Physical Items
by a Technical Investigator, and
• NFPA 921—Guide for Fire and Explosion Investigations.
client-attorneys who would rather not perform any test for which the outcome
is uncertain. Engineers can design tests to study kinematics (motions) and kinetics
(forces) and to recreate accidents; to evaluate physical, mechanical, and chemical
properties of materials; or to assess specific characteristics against claims in a patent.
Because the circumstances surrounding accident and product failure investigation
can be quite complex, and often novel as well, engineers sometimes must design
experiments that have never before been performed. This notion, experiments
conducted for the first time for purposes of litigation, has been the topic of much
debate.
Although it is typically suggested that such work is biased and therefore ought
to be excluded, an experiment that is designed and executed for the purposes of
litigation is not inherently suspect. If the experiment has a well-defined protocol
that can be interpreted and duplicated by others, articulates underlying assumptions, uses instrumentation and equipment that is properly calibrated, and is demonstrated to be reliable and reproducible, it should not be summarily discarded
simply because it is new. It is often the case that the precise matter in dispute has
not been the subject of engineering or scientific studies, because in the normal
course of events, the problem at hand was never addressed in a public forum and
no peer-reviewed literature spoke directly to it. In typical engineering problems,
because a multitude of factors can vary, it is often difficult to find suitable preexisting information, and the question at hand may not have been asked in the form in which it comes before the court.
The fact that problem identification occurs within the course of a legal
dispute does not mean that the problem cannot then be explored directly using
either the scientific method or the engineering design process or both to ascertain
and understand the physical or chemical behavior of the issue at hand. In point of
fact, an experiment that is designed for litigation will better fit the issues standing
before the court, and either the plaintiff or the defendant is free to pursue this
and to subsequently criticize the results. Not only will experiments designed to
specifically address the matter at issue be more directly relevant to questions at
hand, they will also provide data the court can use in thoughtful deliberation.
Indeed, in our personal experience, such work has not only been helpful in adjudicating complex issues for which no directly relevant prior work had been done; in some instances, after the litigation was completed, peer-reviewed articles were published about the work that had been done to study an issue for litigation.59
59. Richard D. Hurt & Channing R. Robertson, Prying Open the Door of the Tobacco Industry’s
Secrets About Nicotine: The Minnesota Tobacco Trial, 280 JAMA 1173 (1998); John Moalli et al., supra
note 44; Monique E. Muggli et al., Waking a Sleeping Giant: The Tobacco Industry’s Response to the
Polonium-210 Issue, 98 Am. J. Pub. Health 1643 (2008); M.S. Warner et al., Performance of Polypropylene
as the Copper-7 Intrauterine Device Tailstring, 2 J. Applied Biomaterials 73 (1991); Richard Hurt et al.,
Open Doorway to Truth: Legacy of the Minnesota Tobacco Trial, 84 Mayo Clinic Proc. 444 (2009).
Of course not all situations require novel techniques to be developed, and in
those instances an abundance of standards for testing materials and products exists.
Typically promulgated by organizations such as ASTM, ANSI, CEN, and others,
these standards envelop everything from sample preparation, to sampling procedures,
to test equipment operation and calibration, to analysis of data acquired during testing. Although such a broad array of standards and guidelines exists, it is possible that
some portion of even the more novel test may not be covered. It is also common
for engineers to follow a standard to the maximum extent allowed by the circumstances and state of the evidence, and to note deviations from that standard in their
protocols and reports.
2. Calculations
A substantial portion of an engineer’s education is spent learning how to calculate
things, so it should come as no surprise that when litigation is involved, engineers
will make calculations as well. As part of this education, engineers learn
how to derive equations based on scientific and mathematical principles, and consequently become aware of the limitations of a particular equation or expression.
Although it would be convenient if a single equation could be used to solve every
engineering problem, this is clearly not the case, and so engineers must learn what
principles to apply, and when to apply them.
The difference between a good calculation and a marginal one is related to
how applicable the equations used in the calculation are to the situation at hand,
and how valid the underlying assumptions are. As mentioned above, it is the rare
case in which an engineering analysis contains no assumptions. For example, there
are well-known equations that relate the pressure inside a cylindrical vessel to the
stresses in the wall of that vessel. These equations assume, however, that the wall
thickness of the pressure vessel is small compared with the inner diameter, and
if this is not the case, significant error may result. If an engineer uses the simplified approach, he or she should assess whether the analysis is conservative (i.e., how the assumptions affect the overall calculated result).
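The effect of the thin-wall assumption can be illustrated numerically. The sketch below compares the thin-wall hoop-stress formula with the exact Lamé thick-wall solution for a pressurized cylinder; it is a hypothetical illustration (the pressure and dimensions are invented for the example), not a calculation from any case discussed in this chapter:

```python
# Hoop stress in a pressurized cylinder: thin-wall approximation
# versus the exact (Lame) thick-wall solution at the inner surface.
# All numbers below are invented for illustration.

def hoop_stress_thin(p, r_i, t):
    """Thin-wall approximation: sigma = p * r_i / t (assumes t << r_i)."""
    return p * r_i / t

def hoop_stress_lame(p, r_i, r_o):
    """Exact Lame solution for hoop stress at the inner wall."""
    return p * (r_o**2 + r_i**2) / (r_o**2 - r_i**2)

p, r_i = 10.0, 100.0           # internal pressure (MPa), inner radius (mm)
for t in (5.0, 50.0):          # a thin wall, then a thick wall
    approx = hoop_stress_thin(p, r_i, t)
    exact = hoop_stress_lame(p, r_i, r_i + t)
    err = abs(approx - exact) / exact * 100
    print(f"t/r_i = {t / r_i:.2f}: thin-wall {approx:.1f}, "
          f"Lame {exact:.1f}, error {err:.0f}%")
```

For the thin wall (t/r_i = 0.05) the two results differ by only a few percent; for the thick wall (t/r_i = 0.50) the approximation is off by roughly 23 percent, precisely the kind of error the text warns about.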
In the modern age, it is simple to download programs from the Internet
that will make calculations based on input variables. These programs can save
engineers considerable time, because they can reduce hours of “paper” calculation to minutes. Used blindly, though, without proper understanding of core
assumptions or approximations, these programs can be hazardous. Computer
programs should always be validated, and the simplest way to accomplish that
task is to have the program calculate a range of solutions for which the result is
already known. The program is then validated within that range.
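A minimal sketch of this validation strategy follows, with a simple trapezoid-rule integrator standing in for a downloaded calculation program (the routine and the test cases are invented for illustration):

```python
import math

def trapezoid(f, a, b, n=1000):
    """Numerical integrator standing in for a downloaded program."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

# Validate against cases whose answers are known in closed form;
# the program is then trusted only within the validated range.
known_cases = [
    (lambda x: x**2, 0.0, 1.0, 1.0 / 3.0),  # integral of x^2 on [0, 1]
    (math.sin, 0.0, math.pi, 2.0),          # integral of sin on [0, pi]
]
for f, a, b, expected in known_cases:
    result = trapezoid(f, a, b)
    assert abs(result - expected) < 1e-4, (result, expected)
print("all known cases reproduced; program validated within this range")
```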
3. Modeling—mathematical and computational
When hand calculations become overly tedious, or are too simplified to handle a
highly complex problem, engineers will often use computer models to examine
systems, processes, or phenomena. Quite distinct from the simple programs mentioned above, which solve only an equation or two, these computer models employ
enormous bodies of code that can solve thousands of equations. One of the most
common techniques employed by these programs is the finite element method
(FEM), which can be used to solve problems in stress analysis, heat transfer, and
fluid flow behavior.
FEM is dependent on the computational power of computers, and basically
divides the system or component into small units, or elements, of uniform geometry.
This mesh, as it is called, reflects the geometry of the actual system or component
as closely as possible. Boundary conditions are established on the basis of known
applied loads, and the fundamental equations of Newtonian mechanics are solved by
iterative calculations for each individual cell. The resulting loads and displacements
(or stresses and strains) in each cell are then summed at each increment of time to
give an overall picture of the load/displacement (or stress/strain) history of the system or component. The literally millions of calculations required for each time step
can only be handled by a computer. These data can then be used to determine the
loads and displacements at the time of failure, information that otherwise could not
be obtained from hand (or “back of the envelope”) calculations.
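The method can be shown in miniature. The sketch below is a deliberately tiny, hypothetical example (not production FEM code): a bar fixed at one end and pulled at the other is divided into two axial elements, the stiffness matrix is assembled, and the system K u = f is solved for the nodal displacements, which can then be validated against the closed-form answer:

```python
# A two-element finite element model of an axially loaded bar.
# All material properties and loads are invented for illustration.

E = 200e9      # Young's modulus (Pa)
A = 1e-4       # cross-sectional area (m^2)
L = 1.0        # total bar length (m)
P = 1000.0     # axial load at the free end (N)

n_elems = 2
Le = L / n_elems
k = E * A / Le                 # element stiffness

# Assemble the global stiffness matrix for 3 nodes.
K = [[0.0] * 3 for _ in range(3)]
for e in range(n_elems):
    K[e][e] += k
    K[e][e + 1] -= k
    K[e + 1][e] -= k
    K[e + 1][e + 1] += k

# Boundary condition: node 0 is fixed, leaving a 2x2 reduced system
# [2k -k; -k k][u1; u2] = [0; P], solved here by Cramer's rule.
a11, a12, a21, a22 = K[1][1], K[1][2], K[2][1], K[2][2]
f1, f2 = 0.0, P
det = a11 * a22 - a12 * a21
u1 = (f1 * a22 - a12 * f2) / det
u2 = (a11 * f2 - f1 * a21) / det

exact_tip = P * L / (E * A)    # closed-form result used for validation
print(f"FEM tip displacement: {u2:.3e} m, exact: {exact_tip:.3e} m")
```

Real FEM codes do the same assembly and solution for millions of unknowns; the point of the sketch is only the workflow: mesh, assemble, apply boundary conditions, solve, validate.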
In its early stages, FEM code could only be found in universities and corporate and governmental laboratories, and was executed by doctoral-level engineers
who used separate programs to postprocess results into usable graphical output.
Today, commercial FEM programs are widely available, and are capable of generating eye-catching graphics that appeal to juries. Other software programs are
available that create similar graphics for car-crash or mechanical simulations. This
tool is as much an accepted part of the engineering design community as the slide
rule was in the 1960s. In addition, engineers involved in determining the cause of
failure of mechanical systems have been using FEM since the 1980s to determine
the loads and strains at critical points in complex geometries as part of root-cause
analysis efforts. This is often a principal means of determining what actually caused something to break, and ultimately whether a design defect, a manufacturing defect, overload, or abuse was at fault. FEM can, in certain
circumstances, be a valuable tool to assess the cause of a design failure.
To be sure, FEM, like any scientific tool, must be properly applied and
interpreted within its limitations. It can be abused and misused, and because the
output from these models can be made to appear extremely realistic, especially
when coupled with computer graphics, their use needs to be carefully considered.
To summarily reject FEM as a simulation, though, would be to deprive a modern-day engineer of a tool that is regularly used. There is an old adage in the modeling world, called “garbage in—garbage out,” or GIGO, that gets to the heart of
the issue. No matter how sophisticated the software, or how realistic the output
seems, if the data fed to the program are inaccurate, the results will be poor, and
thus can be misleading. The proper way to evaluate the efficacy of the model or
simulation is to validate it, and this is usually done by processing known scenarios
or input conditions, and making certain the results are representative of the known
output within the validated range. Regardless of the qualifications of the engineer,
if any mathematical model has not been validated within the boundaries at issue,
its use in the courtroom should be carefully considered. Additionally, once the
model is used in litigation, engineers should be prepared to provide a fully executable copy of the model if requested during discovery.
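One simple way to honor the validated-range caution in software is to make the range explicit. The sketch below is hypothetical (the wrapper class and the stand-in model are invented for illustration): a wrapper that flags any input falling outside the range over which the model was validated, rather than silently extrapolating.

```python
# A model wrapper that refuses inputs outside the range over which
# the underlying model was validated. Both the wrapper and the
# stand-in model are invented for illustration.

class ValidatedModel:
    def __init__(self, model, lo, hi):
        self.model, self.lo, self.hi = model, lo, hi

    def __call__(self, x):
        if not (self.lo <= x <= self.hi):
            raise ValueError(f"input {x} is outside the validated "
                             f"range [{self.lo}, {self.hi}]")
        return self.model(x)

# Stand-in model validated only for inputs between 0 and 30.
model = ValidatedModel(lambda v: 0.5 * v**2, 0.0, 30.0)
print(model(10.0))   # inside the validated range
try:
    model(100.0)     # outside the range: flagged, not silently computed
except ValueError as err:
    print(err)
```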
4. Literature
Engineers are trained to rely on literature as part of their work, and the literature
they employ is nearly as varied as engineers themselves. Structural and mechanical
engineers use codes and regulations when they design everything from buildings
to bridges, and pressure vessels to heating systems (an extended discussion on the
use and misuse of codes is provided below). Engineers rely on published standard
methods when they conduct run-of-the-mill tests, scientific literature to test the
efficacy of complex calculations and experiments, and textbooks to validate techniques and methods from their educational training.
It is common for engineers to gather literature that addresses an issue about
which they are testifying. Industrial engineers may gather literature related to
warnings, materials engineers may collect literature related to development and
processing of a compound, and mechanical engineers may assemble literature
related to stress analysis. Inevitably, some literature will not concur with the engineer’s perspective; a proper analysis of the available literature should include this as well, with the engineer addressing discrepancies directly.
Engineers may also rely on scientific and technical literature to assess the state
of knowledge at a given period in time. This is especially useful in matters involving intellectual property (discussions related to prior art, best mode, and the like)
or product design (state-of-the-art analysis). The appropriateness of reliance on this type of literature should be weighed not only by its applicability to the case in discussion, but also by the engineer’s mastery and frequency of use of the particular
subject. The topic of peer review is often raised concerning scientific and technical
literature, and although the peer review process aids in the promotion of sound
science and engineering, its presence does not ensure accuracy or validity, and its
absence does not imply that a reference is scientifically unsound.
5. Internal documents
Engineers called as experts by either party in a products or personal injury case
will likely review documents produced during discovery that relate to the design
process of the product in question. From these documents, engineers can often
assess whether appropriate actions were taken during the product design process,
including product development, product testing and validation, warning and
risk communication, and safety and risk assessment. Because the specific constraints imposed on a design are not always apparent from internal engineering
documents, and because understanding those constraints can be essential to an effective critical review, engineers called as experts may need to review deposition testimony relating to the design to supplement what they learn from the documents
themselves.
VI. What Are the Types of Issues on Which
Engineers May Testify?
Because engineers are problem solvers, their work frequently becomes the subject
of disputes, which eventually involve lawyers and courtrooms. Many times these
disputes involve the sort of “scientific, technical or other specialized knowledge”
that may be best understood with the help of one or more engineers.60 Stated
differently, these issues may be difficult for a jury of laypersons, or even judges, to
understand and resolve without the assistance of an engineer who was not directly
involved in the facts of the case. As a result, disputes involving engineering concepts and principles may be properly the subject of expert testimony from one or
more witnesses qualified in the field of engineering.
Just as there are a multitude of disciplines within engineering, there are a multitude of issues on which engineers may be called to testify. Some
examples follow.
A. Product Liability
Generally speaking, a product may be defective if it contains a design defect, a
manufacturing defect, or inadequate warnings or instructions. Therefore, disputes
regarding the efficacy or safety of products typically involve questions regarding
whether the product was properly designed, tested, manufactured, sold, or marketed. These issues are examined from the perspective of what was known at the
time of first sale and also what was done after information became available about
the product’s performance.
1. Design
The conception and design of a product is often a focus of dispute in a product
liability case. An understanding of the way that engineers think and of the engineering design process described above is essential to determine the nature and extent of the engineering testimony that should be admitted. For example, in
medical device litigation, it may be significant to know the purpose for which
the medical device was designed and the process by which the design at issue was
60. Fed. R. Evid. 702.
achieved. To gain that understanding, testimony from the product designer as well
as testimony by engineers with experience in design may be helpful.61
The adequacy of testing done on a product is closely related to the issue
of design defect. This is true whether the testing in question occurred before
the product was first sold (“premarket”) or after the product had been on the
market for a time and information regarding its performance became available
(“postmarket”).62 Engineering testimony may be helpful to the court and to the
trier of fact in these circumstances as well.
An engineer’s examination of products that have failed in use may result in
valuable evidence for a court and trier of fact to consider. For example, an engineer skilled in fractography can testify regarding how and why a product failed.63
Such testimony may prove helpful to the court and a trier of fact on such issues
as whether the subject product was defective as originally designed, whether an alternative design could have been used, what the cost of such an alternative would be, and whether the manufacturer’s response to such incidents of product failure was reasonable.
61. Russell v. Howmedica Osteonics Corp., No. C06-4078-MWB, 2008 WL 913320 (N.D.
Iowa 2008) (biomechanical expert allowed to testify that medical device’s inability to handle weight
loads was a design defect and hence caused the plaintiff’s injuries, and that the defendant’s failure to
warn surgeons of this fact caused the failure of the device); Poust v. Huntleigh Healthcare, 998
F. Supp. 478 (D.N.J. 1998) (engineer with expertise in medical device use, safety, and design allowed
to testify about defects concerning lack of instructions, the alarm, lack of fail-safe mechanism, and
lack of pressure gauge in pneumatic compression device); see also Dunton v. Arctic Cat, Inc., 518 F.
Supp. 2d 296 (D. Me. 2007) (admitting expert testimony of mechanical engineer and product designer
regarding, among other things, purpose and design of certain components of allegedly defectively
designed snowmobile); Floyd v. Pride Mobility Prods. Corp., No. 1:05-CV-00389, 2007 WL 4404049
(S.D. Ohio 2007) (three engineering experts, including mechanical engineer with expertise as product
designer, allowed to testify about defects in design of scooter); Tunnell v. Ford Motor Co., 330 F.
Supp. 2d 731 (W.D. Va. 2004) (engineer allowed to testify about feasibility of proposed safer auto
design).
62. See, e.g., Smith v. Ingersoll-Rand Co., 214 F.3d 1235 (10th Cir. 2000) (human factors
engineering expert allowed to testify that defendant’s failure to conduct human factors analysis of
milling machine was inadequate); Montgomery v. Mitsubishi Motors Corp., 448 F. Supp. 2d 619
(E.D. Penn. 2006) (engineer allowed to testify improper or deficient testing rendered vehicle design
defective and unsafe, based in part on his review of test results of another engineer); accord Phelan
v. Synthes (U.S.A.), 35 F. App’x 102 (4th Cir. 2002) (biomedical engineer not allowed to testify
about inadequacy of premarket testing of medical device when underlying opinion that device was
unreasonably dangerous was not supported by reliable methodology).
63. See Parkinson v. Guidant Corp., 315 F. Supp. 2d 754 (W.D. Pa. 2004) (metallurgist who
reviewed fractographs was allowed to testify in product liability action that manufacturing flaws caused
the premature fracture of guidewire used in angioplasty); Hickman v. Exide, Inc., 679 So. 2d 527 (La.
Ct. App. 1996) (expert in, among other things, fracture analysis was allowed to testify about cause
of explosion of car battery in product liability action); Reif v. G & J Pepsi-Cola Bottlers, Inc., No.
CA87-05-041, 1988 WL 14052 (Ohio Ct. App. Feb. 15, 1988) (fractography expert was allowed to
testify about cause of break in broken glass bottle).
2. Manufacturing
The manufacture of a product and the quality process through which uniformity of ingredients, processes, and the final product is ensured may properly be the
subject of product safety litigation. Testimony of engineers with experience in
designing and implementing manufacturing systems to ensure product quality may
be critical in resolving product disputes and helpful to the court and trier of fact.64
3. Warnings
Warnings issues in product liability cases are at the intersection of factual evidence
and legal standards and thus are particularly difficult for the court and/or other
trier of fact to resolve. Many product disputes involve claims concerning the
adequacy of warnings that accompanied the product when it was first sold. In
these cases, the focus may be on what was known through the conception and
design phases of the design process and the necessity for and adequacy of warnings
that accompanied the product in view of that knowledge. Other disputes center
upon the warnings that were added or could have been added after the product
had been used and the company received feedback from users of the product.
The reasonableness of the company’s response to these reports may be an issue.
Thus, the case may be decided on the basis of whether the company conducted,
or failed to conduct, design and testing activities in view of that information or
whether the company modified the product or communicated to users of the
product what it knew.
But not all warnings issues are properly the subject of expert testimony,
particularly with respect to products that are regulated by federal law.65 Properly
qualified engineers may be able to provide opinions that could help the court
and the trier of fact to understand such issues with respect to such products, but
64. See, e.g., Galloway v. Big G Express, Inc., 590 F. Supp. 2d 989 (E.D. Tenn. 2008)
(defendant’s expert with significant experience in engineering fields, including product design, allowed
to testify about manufacturing process used by the defendant); Schmude v. Tricam Indus., Inc., 550
F. Supp. 2d 846 (E.D. Wis. 2008) (discussing generally the propriety of admitting testimony of expert
who studied mechanical engineering and had degree in product design regarding manufacturing
process for rivets used in ladder that collapsed); Yanovich v. Sulzer Orthopedics, Inc., No. 1:05 CV
2691, 2006 WL 3716812 (N.D. Ohio 2006) (discussing testimony of engineering experts regarding
manufacture of medical device). See also Pineda v. Ford Motor Co., 520 F.3d 237 (3d Cir. 2008)
(metallurgical engineer allowed to testify about explicit procedure for replacing allegedly defective
product in order to reduce likelihood of product failure).
65. The FDA’s drug approval process may preempt state law product liability claims based on
a failure to warn. See, e.g., Riegel v. Medtronic, Inc., 552 U.S. 312, 128 S. Ct. 999 (2008). See also
Bates v. Dow Agrosciences LLC, 332 F.3d 323 (2005) (discussing preemption of state law product
liability claims by the Federal Insecticide, Fungicide, and Rodenticide Act). But see Wyeth v. Levine,
555 U.S. 555, 129 S. Ct. 1187 (2009) (holding that FDA approval of a drug did not preempt state law
tort claim based on inadequate drug warnings).
nonetheless may not be allowed to testify based on the substantive law applicable
to such products.66
For example, industrial engineers, or engineers educated in human factors, may have training that allows them to testify not only about when warnings are necessary from an engineering perspective (recall the discussion above about the design process), but also about the efficacy of warnings and the development of risk communications, including text, pictures, and auditory or visual signals.
4. Other issues
Issues regarding the sale and marketing of products often concern promises made
regarding the expected performance of the product, including both the positive results that a product is able to achieve and, especially, what possible harm a product may cause. The efficacy of a product may be proved by a straightforward
comparison between premarket data on product performance and sales and marketing claims, and engineers may provide helpful testimony regarding the interpretation of such data. There may be a dispute about whether the claims made about the
product’s safety exceeded the testing results that had been obtained for the product
or led to a hazardous situation because the product was not properly tested. These issues may also be the subject of appropriately qualified engineering testimony.
Common personal injury cases may also present issues on which engineering
testimony may be helpful. Such disputes often turn on testimony as to how a particular trauma occurred. Our discussion of biomechanical engineering highlights
some of these issues.67 In a car accident case, properly qualified engineers may provide opinion testimony regarding how an accident occurred, including reconstructing the conduct of each of the parties and how that conduct affected the accident.
In a slip-and-fall case, engineering testimony can concern such basic issues as why
the injured person slipped and what could have been done to prevent it.
In addition to the above, engineers may also testify about various aspects of
a party’s damages and give an opinion about whether those alleged damages were
caused by the conduct in question. Testimony about causation in a products dispute often involves both factual and legal questions. Through experience, training,
and activities in the case, engineers may have the ability to understand the interrelationship between events and thus can provide helpful testimony on whether
the asserted damages had a relationship to the asserted misconduct so as to have
66. See Pineda v. Ford Motor Co., 520 F.3d 237 (3d Cir. 2008) (metallurgical engineer permitted
to testify that safety manual should have contained warning about glass failure in SUV); Michaels v.
Mr. Heater, Inc., 411 F. Supp. 2d 992 (W.D. Wis. 2006) (human factors engineering expert allowed
to testify about the adequacy of product warning); Nesbitt v. Sears, Roebuck & Co., 415 F. Supp. 2d
530 (E.D. Pa. 2005) (expert with practical experience as engineer allowed to testify that the plaintiff
would have responded to an additional warning); Santoro v. Donnelly, 340 F. Supp. 2d 464 (S.D.N.Y.
2004) (mechanical engineer allowed to testify about adequacy of warnings for fireplace heater).
67. Supra Section I.C.
been “caused” by it.68 But issues regarding the standard to apply for the sufficiency
of causal proof may be both scientific and legal issues. Thus, the adequacy or
admissibility of an engineer’s opinion on causation will be evaluated in light of the
law, as well as the adequacy of the science that forms the basis for the opinion.69
Situations where property damages are asserted may pose special problems
on which engineering testimony may be appropriate. For example, determining
whether a product problem is an isolated occurrence or part of a widespread product problem may be difficult to resolve in the absence of engineering testing and analysis aimed at identifying a product defect and the process by which the product failed.
B. Special Issues Regarding Proof of Product Defect
Although the definition of what is defective may be the subject of a jury instruction at trial,70 proof of a product defect may involve identifying key facts that
68. See, e.g., Nemir v. Mitsubishi Motors Corp., 381 F.3d 540 (6th Cir. 2004) (automotive
safety engineer allowed to testify that defective seatbelt latching mechanism caused plaintiff’s injuries);
Babcock v. General Motors, 299 F.3d 60 (1st Cir. 2002) (structural and mechanical engineer allowed
to give testimony about impact speed, cause of injuries, how the product allegedly ultimately failed,
and testing procedures for the product); McCullock v. H.B. Fuller Co., 61 F.3d 1038 (2d Cir. 1995)
(engineer allowed to testify regarding whether plaintiff was within “breathing zone” for hot-melt
glue in workplace); Perez v. Townsend Eng’g Co., 545 F. Supp. 2d 461 (M.D. Pa. 2008) (engineer
allowed to testify that product was defective, that defect caused plaintiff’s injury, and that alternative
design would have prevented injury); Farmland Mut. Ins. Co. v. AGCO Corp., 531 F. Supp. 2d 1301
(D. Kan. 2008) (electrical engineer allowed to testify about cause of farm equipment fire); Phillips
v. Raymond Corp., 364 F. Supp. 2d 730 (N.D. Ill. 2005) (biomechanical engineer testified as to the
mechanics of plaintiff’s injury resulting from allegedly defective forklift); Tunnell v. Ford Motor Co.,
330 F. Supp. 2d 731 (W.D. Va. 2004) (engineer allowed to testify there was an absence of evidence
that the accident was caused by electrical arcing); Figueroa v. Boston Scientific Corp., 254 F. Supp.
2d 361 (S.D.N.Y. 2003) (expert with substantial experience, education, and knowledge in engineering
field allowed to testify about cause of damage to plaintiff); Yarchak v. Trek Bicycle Corp., 208 F. Supp.
2d 470 (D.N.J. 2002) (expert in forensic and safety engineering, among other subjects, allowed to
testify that bicycle seat caused the plaintiff’s erectile dysfunction); Traharne v. Wayne Scott Fetzer Co.,
156 F. Supp. 2d 690 (N.D. Ill. 2001) (electrical engineer allowed to testify about cause of deceased’s
electrocution); Bowersfield v. Suzuki Motor Corp., 151 F. Supp. 2d 625 (E.D. Pa. 2001) (engineer
allowed to testify about causation of automobile passenger’s injuries).
69. See generally Margaret A. Berger, The Admissibility of Expert Testimony, in this manual; see
also Michael D. Green et al., Reference Guide on Epidemiology, Section V, in this manual.
70. Restatement (Third) of Torts § 2 (1998) provides that the general definition of a product
defect is as follows:
A product is defective when, at the time of sale or distribution, it contains a manufacturing defect,
is defective in design, or is defective because of inadequate instructions or warnings. A product:
(a) contains a manufacturing defect when the product departs from its intended design even though
all possible care was exercised in the preparation and marketing of the product;
(b) is defective in design when the foreseeable risks of harm posed by the product could have been
reduced or avoided by the adoption of a reasonable alternative design by the seller or other distributor,
relate to that definition. These issues may be the subjects of the testimony of an
engineer. The issue of whether a product is unreasonably dangerous may involve
proof of available alternative designs, the existence of modifications to the product that would make it safer (which directs us back to the discussion on risk, above; all products can be made safer at the expense of cost and/or convenience), and what consumers expect the product to do or not to do, to identify a few such
issues.71 Engineers may be asked to engage in testing to determine a cause and/
or mechanism of failure and as a basis for an opinion regarding product defect.
Such testing may include accelerated testing and end-use testing to replicate the
conditions that the products see in use.
To understand the product and its expected or anticipated uses, engineers
may review documents regarding the product at issue and published literature
about like products or product elements. Visits to sites where the product is or
was in use may provide information to engineers about recurring characteristics
of product performance and aspects of the environment of use, which bear on
that performance. Visual examination, measurements made at the site and experiments conducted at the site and in the laboratory may provide valuable information regarding the characteristics of the product that affect product performance
or nonperformance.
In sum, the engineer’s problem-solving approach using the design process as
described above can provide valuable information about the nature and cause of
product problems and the limitations of the design of the product at issue, including the characteristics of the environment of use and the choice of materials for
the subject product. Armed with this and other information, properly qualified
engineers can provide valuable opinions on issues going to the heart of the question of whether the product at issue is defective and caused the claimed damages.72
or a predecessor in the commercial chain of distribution, and the omission of the alternative design
renders the product not reasonably safe;
(c) is defective because of inadequate instructions or warnings when the foreseeable risks of harm
posed by the product could have been reduced or avoided by the provision of reasonable instructions
or warnings by the seller or other distributor, or a predecessor in the commercial chain of distribution,
and the omission of the instructions or warnings renders the product not reasonably safe.
See also Restatement (Second) of Torts § 402A (1965), which defines a defect as one that makes a
product “unreasonably dangerous.”
71. Martinez v. Triad Controls, Inc., 593 F. Supp. 2d 741 (E.D. Pa. 2009) (engineer allowed
to testify about design defects and warnings); Page v. Admiral Craft Equip. Corp., No. 9:02-CV-15,
2003 WL 25685212 (E.D. Tex. 2003) (mechanical engineer allowed to testify about defect in design
of bucket and safer alternative).
72. “While an expert’s legal conclusions are not admissible, an opinion as to the ultimate issue of
fact is admissible, so long as it is based upon a valid scientific methodology and supported by facts. See
Fed. R. Evid. 704. The ‘ultimate issue of fact,’ as used in Rule 704, means that the expert furnishes
an opinion about inferences that should be drawn from the facts and the trier’s decision on such issue
necessarily determines the outcome of the case.” Strickland v. Royal Lubricant Co., Inc., 911 F. Supp.
1460, 1469 (M.D. Ala. 1995).
C. Intellectual Property and Trade Secrets
Engineering testimony may be helpful in disputes regarding patents and other
forms of intellectual property. Knowledge has become a key source of wealth in
our economy,73 and we increasingly depend on innovation and the protection of
innovation.74 The federal government’s power to protect patents and copyrights is
one of only a handful of enumerated powers in the U.S. Constitution.75 Engineers
are at the very heart of technology and innovation and therefore often become
natural contributors to the resolution of disputes involving these subjects.
The issues for factual or expert engineering testimony in this area are closely
allied to those highlighted in the above description of the product design process.
Key issues concern conception and development of the invention or protected
trade secret, commercialization and sales/marketing of the protected concept,
infringement or theft of the protected concept, and damages, including proof of
willfulness or bad intent. There are a myriad of situations in which engineering
testimony may be received. We will highlight a few of them.
The patentability of an idea is measured by its advance over prior art in the
relevant field. Almost all new inventions are combinations or uses of known elements. What constitutes prior art and what is the relevant field for such art are
thus questions that relate to the conception stage of the design process. Whether an engineer is qualified to testify about these issues is answered under the Daubert standard. Thus, engineering testimony may be helpful on such issues as whether the
invention is new or novel and whether it is non-obvious to one who has ordinary
skill in the art. And engineers can help to define the description of the person
with ordinary skill and interpret what such a person would learn from the art in
question. Prior art is meant to include all prior work in the field. It sometimes
connotes “public” prior art, not hidden or unknown art. A properly qualified
engineer witness can provide relevant and reliable testimony regarding these and
other prior art–related questions.
The rules for using engineering experts in patent infringement proceedings in
federal courts are reasonably well defined. For example, under the U.S. Supreme
Court’s decision in KSR International Co. v. Teleflex, Inc.,76 non-obviousness is
73. See, e.g., Thomas A. Stewart, Intellectual Capital (1997).
74. “There is established within the [National Institute of Standards and Technology] a program
linked to the purpose and functions of the Institute, to be known as the ‘Technology Innovation
Program’ for the purpose of assisting United States businesses and institutions of higher education
or other organizations, such as national laboratories and nonprofit research institutions, to support,
promote, and accelerate innovation in the United States through high-risk, high-reward research in
areas of critical national need.” 15 U.S.C. § 278n. See also Prioritizing Resources and Organization for
Intellectual Property (PRO-IP) Act of 2008, Pub. L. No. 110-403, 122 Stat. 4256 (2008); America
Creating Opportunities to Meaningfully Promote Excellence in Technology, Education, and Science
Act, Pub. L. No. 110-69 (2007).
75. U.S. Const. art. I, § 8.
76. 550 U.S. 398 (2007). See also Dennison Mfg. Co. v. Panduit Corp., 475 U.S. 809 (1986).
ultimately a question of law for the court to decide. However, the underlying
factual determinations, including the secondary factors involved in determining
patent validity, remain jury questions concerning which expert engineering testimony may be admitted.77 Questions regarding the scope and teachings of prior
art also may invite engineering testimony and interpretation.78
A similar analysis applies to trade secret matters. The nature and scope of the claimed trade secret, including the existence of steps taken to protect the trade secret, are all issues where the engineer as a witness may be involved. This is true
of protected processes and methods as well as devices or other products of the
subject trade secret.79
When an assertion is made that intellectual property or protected trade secrets have been infringed, engineering testimony may be necessary on a number of issues as courts attempt to identify the features of the challenged device, methods, or processes that infringe the protected property.80
Additional issues may relate to knowledge regarding the protected property and
conception and design of the subject of the alleged infringement.81
Proof of damages may involve a number of issues that relate to or derive from
an engineer’s analysis of the scope of claims or protected methods and processes,
commercial viability of the subject intellectual property, and scope of protection
afforded by the subject patent. Qualification of engineers as witnesses to provide
testimony in these areas may present its own challenges under Daubert as to both
reliability and relevance.
D. Other Cases
There are many other areas where engineering testimony may be helpful to the
court and trier of fact. Because the range of such possible situations is virtually
limitless, we will list only a few examples.
77. See Finisar Corp. v. DirecTV Group, Inc., 523 F.3d 1323, 1338 (Fed. Cir. 2008).
78. See, e.g., Rosco, Inc. v. Mirror Lite Co., 506 F. Supp. 2d 137 (E.D.N.Y. 2007) (mechanical
engineer allowed to present testimony of his review of patent for teaching or suggestion as to meaning
of the claims).
79. Am. Heavy Moving & Rigging Co. v. Robb Technologies, LLC, No. 2:04-CV-00933-JCM (GWF), 2006 WL 2085407 (D. Nev. 2006) (in case involving misuse of trade secrets, engineer
appointed by the court to assist in making discovery rulings).
80. See, e.g., The Post Office v. Portec, Inc., 913 F.2d 802 (10th Cir. 1990).
81. See State Contracting & Eng’g Corp. v. Condotte Am., Inc., 346 F.3d 1057 (Fed. Cir. 2003)
(expert in civil and structural engineering testified about whether different pieces of prior art are in
the same field of endeavor as patents at issue); Philips Indus., Inc. v. State Stove & Mfg. Co., 522 F.2d
1137 (6th Cir. 1975) (use of engineering expert to establish the presence of design concept in prior
art); Mayview Corp. v. Rodstein, 385 F. Supp. 1122 (C.D. Cal. 1974) (tool engineer testifying about
concept of balance in hand-tool design in prior art).
• Claims of personal injury or property damage resulting from the spread
of a toxic substance may involve a number of issues where engineering
testimony may be both reliable and relevant.82
• Environmental disputes regarding the necessity for and nature of an environmental problem and the responsibility for and cost of its cleanup involve numerous issues concerning which properly qualified engineers may provide reliable and relevant evidence.83
• The testimony of an engineer can be helpful in determining causation in both product liability and nonproduct liability cases. Recent
cases in the electrical engineering area demonstrate the range of possible
situations where such issues may arise and engineering testimony may
be admitted.84 For example, an electrical engineer was allowed to testify
about lightning in Walker v. Soo Line Railroad Co.85 The plaintiff in that
case filed suit under the Federal Employers’ Liability Act. Claiming that he
had been injured by lightning while working in a railroad tower, the
plaintiff sought to introduce the testimony of the chairman of the electrical engineering department at the University of Florida to the effect that
lightning could have struck a number of places in the yard and penetrated
the tower without a direct hit. The district court excluded the evidence
and the Seventh Circuit reversed, finding that the jury would have been
helped by hearing the engineer’s testimony about the ways in which lightning could have struck the tower, even if he could not testify which of the
locations was struck or if any of them were struck at all.
• In a slightly different context than might be expected, an electricity
transmission line planning engineer testified as an expert at an administrative hearing in California Public Utilities Commission v. California Energy
Resources Conservation & Development Commission.86 The dispute in that
case was the extent of the CERCDC’s jurisdiction over transmission line
siting, and more specifically the interpretation of a section in California’s
82. See, e.g., Jaasma v. Shell Oil Co., 412 F.3d 501 (3d Cir. 2005) (civil and environmental
engineer permitted to testify about environmental status of real property, which was relevant to
damages and efforts to mitigate); In re Train Derailment Near Amite La., No. Civ. A. MDL. 1531,
2006 WL 1561470 (E.D. La. 2006) (court relied on declaration of environmental engineer regarding
exposure to airborne contaminants in concluding that claims of potential class were not based on
actual physical harm).
83. Olin Corp. v. Lloyd’s London, 468 F.3d 120 (2d Cir. 2006) (admission of environmental
civil engineer testimony on issue of property damage in pollution liability insurance coverage case not
an abuse of discretion).
84. See, e.g., Newman v. State Farm Fire & Cas. Co., 290 F. App’x. 106 (10th Cir. 2008)
(electrical engineer was allowed to testify about the origin of fire that destroyed insureds’ house);
McCoy v. Whirlpool Corp., 287 F. App’x 669 (10th Cir. 2008) (electrical engineer was allowed to
testify in product liability case that manufacturing defect in dishwasher caused fire).
85. 208 F.3d 581 (7th Cir. 2000).
86. 50 Cal. App. 3d 437 (Cal. Ct. App. 1984).
Public Resources Code which defined “electric transmission line” as “any
electric power line carrying electric power from a thermal power plant
located within the state to a point of junction with any interconnected
transmission system.”87 The engineer, employed by Pacific Gas & Electric,
testified at the hearing before the CERCDC about the use of certain terms
in the industry related to that definition and about electricity transmission principles. The court subsequently relied on that testimony in part in
determining that “electric transmission line” had a plain meaning, and that
the plain meaning cut off the CERCDC’s jurisdiction at the first point at
which a power line emanating from a thermal power plant joined to the
interconnected transmission grid.
• Civil engineers have been allowed to testify in a broad range of circumstances as well, including those involving an improper application of the design process in the building of a bridge. Numerous examples of engineers’ roles in building bridges that ultimately failed have been described in an earlier version of this guide.88 In
each of those situations, engineering testimony from engineers who were
involved in the design of the bridge or who had experience in designing other bridges or who had experience with design generally could be
qualified to testify in an inquiry or lawsuit about the causes and financial
implications of the failures.
VII. What Are Frequent Recurring Issues in
Engineering Testimony?
A. Issues Commonly in Dispute
Following are several issues with which engineers frequently are confronted in
the course of attempting to give testimony as experts. Each of them is controlled
by the specific fact pattern that gives rise to the case and the way in which the
case is presented. In this section we describe the issues as they are perceived by
the engineer in the courtroom. Because this is not a treatise on the procedural or
substantive law at issue, we do not summarize the state of the law on each issue.
We assume that the court and other readers of this guide are familiar with the
applicable law on these issues.
87. Id. at 440.
88. Petroski, supra note 45, at 593–94, 597–600, 604–06, 608–09, 612–13.
1. Qualifications
In an earlier section of this guide (Section III.C.2), we referred to the Stringfellow
case.89 An important aspect of this litigation revolved around estimates of the mass
of toxic materials released over a long time period some 20 to 30 years after the
fact. To reconstruct events long past, a chemical engineer used aerial photographs
of the site taken during its period of operation to estimate the surface areas of the
toxic waste ponds. His qualifications were challenged under Frye v. United States90
because he was not a photogrammetrist. Although he had the background and credentials to support the work he did using the aerial photographs, he lacked the pedigree, and the court found that a photogrammetrist would need to confirm his findings. In the end a photogrammetrist corroborated the engineer’s work. This is but one example of an
all too common situation where an engineering expert’s qualifications have been
challenged based on “name” rather than on relevant and documented experience.
Under Daubert, there may be even more pressure on the court to assess who can
or cannot testify as an expert. But this example illustrates that a court should be
cautious about drawing conclusions about an expert’s qualifications based solely
on titles, licenses, registration, and other such documentation.
2. Standard of care
Another common issue for engineers to confront in their testimony is the standard
of care. Engineers do not think of the concepts of standard of care and duty of
care as they relate to tort law, particularly negligence. Instead, for many engineers,
“standard of care” means “how we do it in my office” or some variation thereof.
Following the Oklahoma City bombing, a structural engineering expert
prepared a report regarding blast damage and progressive collapse for the U.S.
Attorney prosecuting McVeigh. In an attempt to block this testimony, McVeigh’s
defense team obtained an affidavit from an engineer with a well-known structural engineering firm to the effect that the prosecution expert’s report did not
meet minimal standards for a building condition report because it did not include
detailed architectural and structural drawings, measurements, and specifications,
all of which were irrelevant to the issues at bar. The defense expert engineer
argued essentially that he and his firm were leaders in the field of building assessment reports and therefore what they did set the standard. He was wrong on
two counts: (1) the practice of any single firm or office does not establish the
standard of care, and (2) the standard of care for one technical purpose (condition
assessment of commercial buildings with leaky curtain walls) cannot be applied
to another technical purpose (determination of number of bombs employed to
destroy a building) just because both involve buildings and engineers.
89. United States v. Stringfellow, No. CV 83-2501 JMI, 1993 WL 565393 (C.D. Cal. Nov. 30,
1993).
90. 293 F. 1013 (D.C. Cir. 1923).
The phrase “standard of care” has various meanings and connotations to
engineers that are somewhat discipline specific. Standard of care in the medical
sciences may be different than standard of care in some other context. In engineering, it can be said that the standard of care is met whenever the design process
was properly employed at the point in time that the event or incident happened.
Although the design process itself is “fixed,” when properly applied to a problem
in the 1940s and again to the same problem in 2009, the design outcome can be
quite different and indeed might be expected to be so. Even so, the standard of
care may be met each time.
3. State of the art
“State of the art” has a specific meaning in the law and may be the subject of a
particular statute in many jurisdictions. In addition, state of the art can be a distinct defense in many states.91 To engineers, however, its meaning may be slightly
different.
Simply put, this phrase refers to the current stage of development of a particular technology or technological application. It does not imply that it is the best
one can ever hope for but is merely a statement that at whatever point in time
referenced, technology was in a certain condition or form. For instance, the Intel
4004 4-bit microprocessor was state of the art in 1971 whereas the Intel 64-bit
microprocessor was the state of the art in 2006. Of course, there is the question
as to whether in either of these cases those microprocessors were state of the art
for just Intel, for all American semiconductor companies, or for all semiconductor
companies in the world. The question of the context in which this phrase is used
often lies at the heart of disputes. Because appropriate context may be difficult
to pin down, experts are often challenged with defining the “state of the art” in
relation to a particular technology or application. The answer from an engineering
perspective is often an assumption, nothing more, nothing less. As such, from an
engineering perspective, it is best to accept this phrase as a general colloquialism
that is difficult to define even though it is simple to state.
4. Best practice
Although this term is used colloquially and oftentimes in “business” activities, to engineers it is not easily quantifiable, and it suffers from meaning different things to different people. Despite this, it generally refers to the notion
that at any point in time there exists a method, technique, or process that is preferred over any other to deliver a particular outcome. That being said, there is
great latitude in how one goes about determining that preference and associating
it with the desired outcome. So, although it sounds good, this phrase is fraught
91. See, e.g., Ariz. Rev. Stat. § 12-683(1) (2009); Colo. Rev. Stat. § 13-21-403(1)(a) (2009);
Ind. Code § 34-20-5-1(1) (2009).
with ambiguity. In the end, the more important issue is whether there was adherence to the design process.
5. Regulations, standards, and codes
An issue that often arises in matters involving buildings and structures is the
distinction between design codes and physics (political laws vs. physical laws) in
the context of failure analysis. Design codes and standards are very conservative
political documents. They are conservative because they are intended to address
the worst-case scenario with a comfortable margin of safety. But buildings do not
fail because of code violations—they fail according to the laws of physics. They
do not fail when the code-prescribed loads exceed the code-prescribed strength
of the materials—they fail when the actual imposed loads exceed the physical
strength of the components. Buildings fail not when the laws of man are ignored
but when the laws of physics are violated. Examples of this are most common in
the context of earthquake-damaged structures. Buildings are not designed to resist
100% of expected earthquake forces. Rather, they are designed to resist only a
fraction of the expected load (typically about one-eighth) without permanently
deforming. The code implicitly recognizes that buildings are much stronger than
assumed in design and also have considerable ability to absorb overloads without
failure or collapse. Yet following an earthquake, engineers may inappropriately
compare the ground accelerations recorded by the U.S. Geological Survey with
design values in the code.
In the Northridge, California earthquake, recorded acceleration values were
2–3 times greater than the design code values. Many engineers concluded that the
buildings had been “overstressed” by 200–300% and were thus extensively damaged, even if that damage was not visually apparent. In a line of reasoning remarkably similar to that of the plaintiff’s expert in Kumho,92 the damage was “proved”
analytically, even though it could not be physically seen (or disproved) in the
building itself. (If the same logic were applied to cars, every car that sustained an
impact greater than the design capacity of the bumper would be a total loss.) If this
approach were accepted, the determination of damage could be done only by a few
wizards with supposedly sophisticated, yet often unproven, analytical tools. The
technical issues in the Northridge situation were thus removed from the realm of
observation and common sense (where a jury has a chance of understanding the
issues) to the realm of arcane analysis where the experts have the final say.
This is not to say that standards and codes do not have their place in the
courtroom. We described above how standards are often used by engineers to
conduct tests, and cases that involve malpractice or standard-of-care claims often critically examine whether a particular code was followed in the course of a design. On
92. 526 U.S. 137 (1999). In Kumho, the expert inferred the defect from an alleged set of conditions, even though the alleged defect was not observed.
the other hand, failure to use a code, or a comparison of code values to actual values, does not guarantee that a disaster will occur. Common sense is often the best judge
in these situations—if a code value is exceeded, yet no damage is observed, it is
likely that the conservative nature of the code met its objectives and protected
its subject.
6. Other similar incidents
From an evidentiary perspective, evidence of similar or like circumstances has a number of hurdles to overcome before it can be admitted into evidence.93 To an engineer, however, the concept of similarity or “other similar incidents” (OSIs) has a somewhat different meaning and describes the types of
circumstances and documentation of such circumstances that an engineer can
rely on as a basis for his or her opinions. Although this section focuses primarily
on product design issues, the underlying theme is nonetheless broadly applicable
across the domain of engineering forensics.
Sometimes these other events are recorded in documentary form and relate
to events regarding product performance characteristics, product failures, product anomalies, product performance anomalies, operational problems associated
with product use, product malfunctions, or other types of product failures. These
events are sometimes alleged by a party to a dispute to be substantially similar in
kind to an event or circumstance that had precipitated the subject case. Alleged
OSIs can be documented in multiple forms: (1) written narratives from various
sources (consumers, employees of the manufacturer, bystanders to a reported
event, insurers’ representatives, investigators, law enforcement personnel, owners
of a location involved in the dispute at bar, etc.) who might prepare and submit
a record of observation to a legal entity who retains those records of submission;
(2) telephonic reports of the same character and source as written reports, but
documented through telephone reports made to a recording representative or
office staff responsible for collecting event reports of interest to a legal entity;
(3) electronic submissions of the same character and source as written narratives;
(4) reports in a standardized format that are intended to record and document
events of interest (the forms may be in written or electronic media); (5) images of
events in film or electronic media that may or may not also have been recorded
and submitted in alternative formats. As a result, each may have its evidentiary
hurdles to overcome before it is admitted into evidence.
Similarly, each OSI may have legal issues regarding authentication, which
may be overcome by the repository where the underlying documentation is
93. For evidence of other similar incidents (OSIs) to be admissible, the proponent must show
that the OSIs are (1) relevant, see Fed. R. Evid. 401; (2) “substantially similar” to the defect alleged in
the case at bar; and (3) that the probative value of the evidence outweighs its prejudicial effect, see Fed. R.
Evid. 403. Some courts merge the first two requirements; to be relevant, the OSIs must be substantially
similar to the incident at issue.
found. The repositories of documents and reports that may be alleged to be OSIs
to an issue at bar can have many original purposes, and a collection of such documents may serve multiple purposes for the owner institution. Such document collections may be used by the owner of the repository for, among other things, administrative purposes, accounting, claims management and resolution, archiving of information and/or data, database management, institutional knowledge building, warranty management, in-service technology performance assessment and discovery, service records, customer interactions, and satisfaction of regulatory specifications or requirements. Discovery requests may call for the owner of
the materials to search and retrieve records, documents, and reports from such
repositories even if the collections and repositories themselves may not have
been constructed for the purposes of document search and retrieval. Sometimes
engineers can be of use in searching and retrieving potentially relevant materials.
OSIs are discovered and may be offered into evidence to (1) demonstrate
prior knowledge on the part of the record owner regarding an alleged defect or
danger manifest to the consuming public that is causally related to the issue at bar;
(2) demonstrate by the number, volume, or rate of reports that a defect exists; and/
or (3) demonstrate careless disregard for the safety of others.94 For an OSI to be admitted or relied upon by an engineering expert, the proponent must demonstrate that the event recorded and reported is “substantially similar” to the issue at bar.95 Testifying engineers can be useful in identifying and describing the specific characteristics that must be known and shown to assess similarity, including specifying objective parameters for determining the degree of similarity or dissimilarity and detailing the physical measurements necessary and sufficient to determine substantial similarity. The conditions that are
necessary and sufficient to demonstrate substantial similarity include the following:
(1) the product or circumstance in the alleged OSI must be of like design to the
product or condition at issue in the instant case; (2) the product or circumstance in
the alleged OSI must be of like function to the product or condition at issue in the
instant case; (3) the application to which the product had been subjected must be
like the application to which the product at issue in the instant case was subjected;
and (4) the condition of the product, its state of repair, and/or its relevant state of
wear must be like the state of repair and the relevant state of wear of the product
that had been involved in the instant case.96 Engineers can contribute to a technical understanding of each of these dimensions and, in some cases, they may be able to apply objective measures to questions of substantial similarity and thus quantify the level of similarity between an event proffered as an OSI and the instant case.
94. See, e.g., Sparks v. Mena, No. E2006-02473-COA-R3-CV, 2008 WL 341441, at *2 (Tenn. Ct. App. Feb. 6, 2008); Francis H. Hare, Jr. & Mitchell K. Shelly, The Admissibility of Other Similar Incident Evidence: A Three-Step Approach, 15 Am. J. Trial Advoc. 541, 544–45 (1992).
95. See, e.g., Bitler v. A.O. Smith Corp., 391 F.3d 1114, 1126 (10th Cir. 2004); Whaley v. CSX Transp., Inc., 609 S.E.2d 286, 300 (S.C. 2005); Cottrell, Inc. v. Williams, 596 S.E.2d 789, 793–94 (Ga. Ct. App. 2004).
96. See, e.g., Brazos River Auth. v. GE Ionics, Inc., 469 F.3d 416, 427 (5th Cir. 2006); Steele v. Evenflo Co., 147 S.W.3d 781, 793 (Mo. Ct. App. 2004).
The reverse is also true. Failure to establish likeness in any of these dimensions
is failure to demonstrate substantial similarity to the circumstances of the subject
case.97 If one or more of the necessary and sufficient conditions are unknown or
unknowable, the test of substantial similarity also fails; the lack of demonstrable
similarity is a lack of substantial similarity.
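The rule that an unknown or unknowable condition defeats the showing can be sketched as a small decision procedure. This is a hypothetical illustration, not drawn from any case or from the manual itself; the dimension names are invented for the example.

```python
# Hypothetical sketch: the four likeness conditions treated as a checklist
# in which a False OR an unknown (None) condition defeats the test, since a
# lack of demonstrable similarity is a lack of substantial similarity.

def substantially_similar(conditions):
    """conditions maps each dimension to True, False, or None (unknown)."""
    required = ("like_design", "like_function",
                "like_application", "like_state_of_repair")
    # Every dimension must be affirmatively shown; False or None fails.
    return all(conditions.get(dim) is True for dim in required)

proffered_osi = {
    "like_design": True,
    "like_function": True,
    "like_application": True,
    "like_state_of_repair": None,  # state of wear unknown or unknowable
}
print(substantially_similar(proffered_osi))  # False: unknown condition defeats the showing
```

The point of the `None` value is that silence in the record is treated the same as a demonstrated dissimilarity: the proponent carries the burden on every dimension.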
To demonstrate like design, a product or condition need not be identical in
all aspects of form.98 It must simply be similar in form to the product or condition
at issue in the instant case.99 Consider a machine control design with a feature
alleged to have been the proximate cause of an injury-producing event that gave
rise to a product liability lawsuit. Events proffered as OSIs that involve products
having an identical control design meet the test of “likeness” in design. In addition, other control designs that differ in aspects not related to the feature that is
alleged to have served as the proximate cause for the instant injury event may also
be considered to be “like” if the relevant design elements on the two products
cannot be differentiated. Engineers can assess the design elements of the control,
determine which features may be relevant to questions of design likeness, and
provide testimony to answer such questions.
Like function can be demonstrated if the operational purpose of the product
or condition defined in the alleged OSI is similar to the function of the product or
condition in the instant case. In the control design hypothesized above, a control
that is applied to command the dichotomous functional states to start and stop
(either “on” or “off”) a crane winch might serve the same operational purpose to
start and stop another type of equipment or winch. In such a case, the functions
and purpose of the control design may be alike. If, however, that same control design is applied to a machine whose operational purpose is not simply to command a dichotomous “on” or “off” signal but rather to provide a modulated signal, one to which the machine response is a continuously variable function of control placement, then the control design function is unlike the purpose of
dichotomous positioning. Engineers can provide assessments and analyses of the
functions embedded in a specific design and assist in the determination of likeness
or lack of likeness between an instant condition and one proffered as an OSI.
Like application can be demonstrated if it can be shown that the operational
conditions to which the product is subject are alike in the proffered OSI and in
the instant case.100 The environmental exposure to which a product is subjected must be of like condition. A control design function can vary with temperature, air or water exposure, reactions to corrosive elements, reactions to acid or base contaminants, and potential interactions with surrounding materials and components of differing electrochemical potential.
97. See, e.g., Peters v. Nissan Forklift Corp. N. Am., No. 06-2880, 2008 WL 262552, at *2 (E.D. La. Feb. 1, 2008); Whaley v. CSX Transp., Inc., 609 S.E.2d 286, 300 (S.C. 2005).
98. See, e.g., Bitler v. A.O. Smith, 391 F.3d 1114, 1126 (10th Cir. 2004).
99. Id.
100. See, e.g., Steele v. Evenflo Co., 147 S.W.3d 781, 793 (Mo. Ct. App. 2004).
Engineers with the
appropriate technical background can evaluate operating conditions and applications and determine whether the conditions that obtain for a proffered OSI are similar to
those that had obtained in an instant case, thereby assisting the determination of
substantial similarity.
Differing environmental exposures resulting from differing applications may
render an event proffered as an OSI unlike and not substantially similar. Further, a showing of like application must establish that the load and stress conditions to which a product or condition is subjected are substantially similar to the circumstances that obtained in the instant case to which the OSIs are being proffered for comparison.
In our control design identified above, the control device may be manually actuated through a lever. Levers of differing length will apply differing forces to the
control device and produce differing operational stresses upon the control device
itself. The durability and performance of the control design itself can be affected
by these differing operating applications, and anomalies or failures under one
application may not be at all similar to those that obtain under differing circumstances in which the operating loads and applied stresses are different. Engineers
are well qualified to assess conditions of comparative loading and applied stresses.
A like state of repair can be demonstrated if there is reasonable evidence
that products involved in the proffered OSI are (1) in a specific working order,
(2) in a condition of adjustment (if possible to adjust), (3) in a state of wear, and
(4) within an expected range of tolerance that would not differentiate the product
or condition from that which obtained in the product or condition involved in
the instant case. Additionally, the products or conditions reported in the proffered OSIs must be shown to be free of modification from an original design
state, or must be shown to be in a state of modification that is reflective of the
product or condition involved in the instant case.101 An absence of evidence
to demonstrate a state of likeness in application, operating environment, state
of repair and wear, or state of modification is not sufficient to show similarity.
Engineers with appropriate background can review data and information about
modifications and service conditions related to wear and wear rates, as well as
assess information related to the state of repair or disrepair, and thereby contribute
to understanding of the level of similarity or dissimilarity among specific events
and operational conditions.
For evidentiary reasons, OSIs generally are not admissible to demonstrate
the truth of the matter recorded therein.102 Event records are necessarily reports
of noteworthy events made after the fact by parties who may or may not have
an interest in establishing a specific fact pattern, may or may not be qualified to make the observations and assertions included in such reports, and may or may not have any specialized training necessary to evaluate proper system function or state of repair.
101. See, e.g., Cottrell, Inc. v. Williams, 596 S.E.2d 789, 791, 794 (Ga. Ct. App. 2004).
102. Fed. R. Evid. 801 & 802.
The persons who report events collected and offered as OSIs may
not be fully informed of the set of circumstantial conditions that are necessary
and sufficient to determine causation of the reported event. Thus, reported events often have incomplete or insufficient data and information to determine substantial similarity. Even if informed, persons reporting events may not have the observational powers, tools, and insights necessary for accurate evaluation
and reporting. The individuals who make reports regarding recorded events may
be unable to factually assess and accurately report all of the conditions relevant to
determination of event causation and resolution of questions regarding substantial
similarity. Reports of events made by parties who may have an interest in economic recovery or other compensation may not always accurately disclose known
or knowable facts that could bear on determinations of causation and substantial
similarity. Furthermore, some parties may have an economic or other interest in
the outcome of a report or claim. Therefore, such reports, if offered to prove the
truth of the other incidents, are typically excluded as hearsay (unless the business
records exception applies).103
B. Demonstratives and Simulations
Computer animations, simulation models, and digital displays have become
more common in television and movies, especially in entertainment media concerning forensic investigation, law enforcement, and legal drama. The result is
an increased expectation among the court and juries that visual graphics and displays will be used by engineering experts and other expert witnesses to explain
and illustrate their testimony. Additionally, off-the-shelf presentation software, such as PowerPoint, is often used. Attorneys and their clients typically
expect their experts will use computer animations, simulations, and/or exhibits
to educate the jury and demonstrate the bases for their opinions. When used
correctly, these tools can make the expert’s testimony understandable and can
leave a lasting impression with the trier of fact of that party’s theory of the case.
For that very reason, the role of the court as the gatekeeper for use of these
demonstratives has become increasingly critical. As the technology underlying
these tools rapidly advances, the court’s task likewise becomes more difficult. In
assessing the validity of these tools, the court is often forced to decide whether
the visual display accurately represents the evidence and/or is supported by the expert’s opinions and qualifications.104 To assist the court in this difficult task, we present some guidance regarding the types of technology presently in use and the strengths and weaknesses of each.
103. See Willis v. Kia Motors Corp., No. 2:07CV062-PA, 2009 WL 2351766 (N.D. Miss. July 29, 2009) (finding customer complaints of similar accidents were not hearsay because they were offered to show notice, not the truth of the matter asserted, and even if they were hearsay, they fell under the business records exception of Fed. R. Evid. 803(6)).
A primary basis for misunderstanding and uncertainty is the difference
between a computer animation and a computer simulation. An animation is a sequence of still images, graphically illustrated (two dimensions) or modeled (three dimensions), that are often textured and rendered to create the illusion of motion.
A cartoon is a simple example. There are no constraints inherent in an animation,
and the laws of physics, or any other science, do not necessarily apply (a black
mouse can be dressed in red shorts with yellow shoes and be made to dance, sing,
and fly). The lack of imposed restriction does not make the animation deficient a
priori; if the still images that comprise the animation are accurate in their representation of individual snapshots of time, then the animation itself can be proven
precise. The converse, of course, is also true.
Animations contain key frames that define the starting and ending points
of actions, with sequences of intermediate frames defining how movements are
depicted. For example, a series of still photographs can depict the path of a vehicle
vaulting off an embankment, with a single image at the takeoff, mid-flight, and
landing positions each correct in its representation. However, when an animation
of the event is created, the intermediate frames fill in the missing areas, and if so
desired, contrary to known physical phenomena, the animation could show the
vault trajectory of the vehicle to remain flat and then suddenly drop, similar to
the inaccurate representation of motion experienced by a cartoon coyote momentarily contemplating his fate after chasing a bird off a cliff. Thus, in an animation,
some of the inputs (stills) may represent reality, but the sum of the parts (intermediate frames) may not.
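The key-frame point can be illustrated with a toy calculation (all numbers hypothetical, chosen only for the illustration): the takeoff and landing frames of a vaulting vehicle can both be exactly right, yet straight-line “tweening” between them misplaces every intermediate frame relative to the ballistic path.

```python
# Toy illustration with hypothetical numbers: key frames at takeoff and
# landing can each be individually accurate, yet linear interpolation
# ("tweening") between them misplaces the intermediate frames relative
# to the physically correct ballistic (parabolic) trajectory.

G = 9.81            # gravitational acceleration, m/s^2
VY = 5.0            # hypothetical vertical launch velocity, m/s
T_LAND = 2.0        # hypothetical time of flight, s

def ballistic_y(t):
    # Height of the projectile under gravity at time t.
    return VY * t - 0.5 * G * t * t

def linear_y(t):
    # Straight-line interpolation between the takeoff frame (t = 0) and
    # the landing frame (t = T_LAND) -- what naive tweening would draw.
    y0, y1 = ballistic_y(0.0), ballistic_y(T_LAND)
    return y0 + (y1 - y0) * (t / T_LAND)

for t in (0.0, 1.0, 2.0):
    print(f"t={t:.1f}s  physics={ballistic_y(t):+7.2f} m  tween={linear_y(t):+7.2f} m")
```

The two paths agree exactly at the key frames but diverge by several meters mid-flight, which is precisely the failure mode described above: accurate stills, inaccurate sum of the parts.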
Unlike an animation, a simulation is a computer program that relies on
source data and algorithms to mathematically model a particular system (see, e.g.,
the discussion on finite element modeling, above), and allows the user to rapidly
and inexpensively gain insight into the operation and sensitivity of that system
to certain constraints. Perhaps the most common example of a simulation can be
found daily as a computer-generated image showing the predicted growth of a
storm system.
On the surface, a simulation would seem to provide more accuracy than an
animation. However, this is not necessarily the case. The simulation model is only
as accurate as its input data and/or constraining variables and the equations that
form its calculation stream. Simulation models also require a sensitivity analysis—
just because a model produces an answer does not mean that it is the best model or the most correct answer.
104. See Lorraine v. Markel Am. Ins. Co., 241 F.R.D. 534 (D. Md. 2007) (distinguishing between demonstrative computer animations and scientific computer simulations and discussing the evidentiary requirements, including authentication, for each); People v. Cauley, 32 P.3d 602 (Colo. Ct. App. 2001) (same).
For example, a computer model depicting the motions of
a vehicle prior to and after an impact with a pole may be correct if it matches the
known physical evidence (e.g., tire marks and vehicle damage). However, whether
the model is accurate depends on the accuracy of the inputs for tire friction, vehicle
stiffness, vehicle weight, location of the vehicle’s center of gravity, etc. Even if the
inputs are accurate, once a solution is found, other solutions may exist that also
match the evidence. Assessing the accuracy of each solution requires an iterative
process of making changes to those variables believed to have the greatest effect
on the output. Simply put, the difference between a vehicle accident simulation model that predicts 10 inches of crush deformation and two complete revolutions post impact and one that predicts 14 inches of crush and three complete revolutions may depend on just a few selected vehicle characteristics. Thus, compared to an animation, in
a simulation model, the sum of the constraining variables and equations may represent reality, but some of the user-selected inputs may not.
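The sensitivity-analysis idea can be sketched with a deliberately simple, hypothetical model: an energy-balance crush estimate in which varying a single assumed input (vehicle stiffness) shifts the predicted crush depth. The numbers and the one-parameter model are illustrative only, not any real vehicle or crash code.

```python
# Hypothetical one-at-a-time sensitivity sketch: a toy energy-balance
# crush model (0.5*m*v^2 = 0.5*k*x^2, so x = v*sqrt(m/k)) shows how
# modest changes in an assumed stiffness shift the predicted crush depth.
import math

def crush_depth(mass_kg, speed_ms, stiffness_n_per_m):
    # Crush depth from equating kinetic energy to spring energy.
    return speed_ms * math.sqrt(mass_kg / stiffness_n_per_m)

MASS, SPEED = 1500.0, 10.0     # hypothetical: 1500 kg vehicle at 10 m/s
BASE_STIFFNESS = 1.2e6         # hypothetical linear stiffness, N/m

for scale in (0.8, 1.0, 1.2):  # vary the assumed stiffness +/- 20%
    k = BASE_STIFFNESS * scale
    x = crush_depth(MASS, SPEED, k)
    print(f"stiffness {k:.2e} N/m -> predicted crush {x * 100:.1f} cm")
```

A 20% change in one assumed input moves the predicted crush by several centimeters, which is the kind of spread the iterative sensitivity process described above is meant to expose.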
The difficulty for the court is the need to decide whether some or all of the
computer animation or simulation accurately represents the facts and/or opinions
of the expert.105 This is not an easy endeavor, but it can usually be executed in a
reasonable fashion for simulations by evaluating whether the simulation has been
validated. If the underlying program predicts the behavior of vehicles in a crash, it
can be validated by crashing vehicles under controlled conditions, and comparing
the actual results to those predicted by the simulation. If the software in question
predicts the response of a complex object to applied forces, it can be validated by
modeling a simple object, the response of which can be calculated by hand, and
comparing the simulation to those known results.106
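The hand-calculation check can itself be sketched in a few lines. This is an illustrative example under our own assumptions, not any particular simulation package: a small time-stepped model of a dropped object is compared against the closed-form distance 0.5·g·t².

```python
# Hedged sketch of the validation idea: before trusting a numerical model
# on a complex problem, run it on a simple case with a known hand
# calculation. Here a small time-stepped (semi-implicit Euler) model of a
# dropped object is checked against the closed-form distance 0.5*g*t^2.

G = 9.81  # gravitational acceleration, m/s^2

def simulated_fall(t_end, dt=1e-4):
    # Step velocity and position forward in small increments of time.
    v, y, t = 0.0, 0.0, 0.0
    while t < t_end - 1e-12:
        v += G * dt   # update velocity from acceleration
        y += v * dt   # then position from the new velocity
        t += dt
    return y

def hand_calc(t_end):
    # Closed-form result an engineer can verify by hand.
    return 0.5 * G * t_end ** 2

t = 2.0
sim, exact = simulated_fall(t), hand_calc(t)
print(f"simulated: {sim:.4f} m   hand calculation: {exact:.4f} m")
```

Agreement within a small tolerance on the simple case builds confidence in the method; a large gap would signal an error in the model before it is applied to the disputed facts.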
Similarly, for animations, engineers need to establish authenticity, relevance,
and accuracy in representing the evidence using visual means.107 They may rely
on blueprint drawings, CAD (computer-aided design) drawings, U.S. Geological
Survey data, photogrammetry, geometric databases (vehicles, aircraft, etc.), eyewitness statements, and field measurements to establish accuracy of an animation.
105. See id.
106. See Lorraine v. Markel Am. Ins. Co., 241 F.R.D. 534 (D. Md. 2007); Livingston v. Isuzu Motors, Ltd., 910 F. Supp. 1473 (D. Mont. 1995) (finding computer simulation of rollover accident by expert to be reliable and admissible under Daubert where the computer program was made up of various physical laws and equations commonly understood in science, the program included case-specific data, and the expert’s computer simulation methodology had been peer reviewed).
107. See, e.g., Friend v. Time Mfg. Co., No. 03-343-TUC-CKL, 2006 WL 2135807 (D. Ariz. July 28, 2006); People v. Cauley, 32 P.3d 602 (Colo. Ct. App. 2001).
VIII. Epilogue
Most engineers are not educated in the law, and to them the setting of a deposition or a courtroom is peculiar and often uncomfortable. The rules are different from those to which they are accustomed. The conversations are somewhat alien.
Treading in this unfamiliar territory is a challenge. And so, although it is important for the engineer to “fit” into this environment, it is equally important for
the triers of fact and the court to understand the engineer’s world. We hope this
chapter has provided a glimpse into that world, and by considering it, the reader
will have some insight as to why engineers respond to questions as they do. The
foundation that underlies and supports essentially all that has been done and all that
will be done by engineers is the design process. It is the roadmap for innovation,
invention, and reduction to practice that characterizes those who do engineering
and who call themselves “engineers.” It is the key metric against which products
and processes can be and should be evaluated.
IX. Acknowledgments
The authors would like to thank the following for their significant contributions:
Dr. Roger McCarthy, Robert Lange, Dr. Catherine Corrigan, Dr. John Osteraas,
Michael Kuzel, Dr. Shukri Souri, Dr. Stephen Werner, Dr. Robert Caligiuri,
Jeffrey Croteau, Kerri Atencio, and Jess Dance.
Appendix A
Biographical Information of
Committee and Staff
Jerome P. Kassirer (Co-Chair) served as the Editor-in-Chief of the New
England Journal of Medicine (1991–1999). He is currently Distinguished Professor of
Medicine at Tufts University School of Medicine where he has also served as vice
chairman of the Department of Medicine. Dr. Kassirer has served on the American
College of Physician’s Board of Governors and Board of Regents, chaired the
National Library of Medicine’s Board of Scientific Counselors, and is past chairman of the American Board of Internal Medicine. He is a member of the Association of American Physicians, the Institute of Medicine of the National Academies,
and the American Academy of Arts and Sciences. Dr. Kassirer’s current interests
are reliable approaches to the assessment of the quality of health care, professionalism, ethical scientific conduct, and patient involvement in decisionmaking. He has
been highly critical of for-profit medicine, abuses of managed care, and political
intrusion into medical decisionmaking. Dr. Kassirer received his M.D. from the
University of Buffalo School of Medicine and trained in internal medicine at
Buffalo General Hospital. He trained in nephrology at the New England Medical
Center. His latest book, on financial conflicts of interest in medicine, entitled On
the Take: How Medicine’s Complicity with Big Business Can Endanger Your Health,
was published by Oxford University Press in 2004. He has also published extensively
on nephrology, medical decisionmaking, and the diagnostic process.
Gladys Kessler (Co-Chair) was appointed to the United States District Court
for the District of Columbia in July 1994. She received a B.A. from Cornell
University and her LL.B. from Harvard Law School. Following graduation, Judge
Kessler was employed by the National Labor Relations Board, served as legislative
assistant to a U.S. senator and a U.S. congressman, worked for the New York
City Board of Education, and then opened a public interest law firm. In June
1977, she was appointed Associate Judge of the Superior Court of the District
of Columbia. From 1981 to 1985, Judge Kessler served as Presiding Judge of the
Family Division and was a major architect of one of the nation’s first Multi-Door
Courthouses. She was president of the National Association of Women Judges
from 1983 to 1984, served on the Executive Committee of the ABA’s Conference
of Federal Trial Judges and the U.S. Judicial Conference’s Committee on Court
Administration and Management. She is a board member and has been chair of
the board of directors of Our Place, D.C., an organization devoted to serving the
needs of incarcerated women returning to the community. She now chairs the
District of Columbia Commission on Disabilities and Tenure.
Ming W. Chin was appointed to the California Supreme Court in March 1996.
Before being named to the high court, Justice Chin served from 1990 to 1996
on the First District Court of Appeal, Division Three, San Francisco. Prior to
his appointment to the Court of Appeal, Justice Chin served on the bench of
the Alameda County Superior Court. Previously, Justice Chin was a partner in
an Oakland law firm specializing in business and commercial litigation. He also
served as a prosecutor in the Alameda County District Attorney’s office. Justice
Chin earned his bachelor’s degree in political science and law degree from the
University of San Francisco. After his graduation from law school, Justice Chin
served 2 years as a Captain in the U.S. Army, including a year in Vietnam, where
he was awarded the Army Commendation Medal and the Bronze Star. Justice
Chin chairs the Judicial Council of California’s Court Technology Advisory
Committee, as well as the California Commission for Impartial Courts. He frequently lectures on DNA, genetics, and the courts. Justice Chin served as chair
of the Judicial Council’s Science and the Law Steering Committee. In 2009 the
Judicial Council named him California Jurist of the Year. He is an author of
California Practice Guide: Employment Litigation (The Rutter Group 2011). He is
also an author of California Practice Guide: Forensic DNA (The Rutter Group, to
be published in 2012).
Pauline Newman is a Judge on the United States Court of Appeals for the Federal
Circuit. She received a B.A. degree from Vassar College in 1947, M.A. in pure science from Columbia University in 1948, Ph.D. in chemistry from Yale University
in 1952, and LL.B. from New York University School of Law in 1958. She was
admitted to the New York bar in 1958 and to the Pennsylvania bar in 1979. Judge
Newman worked as research scientist for the American Cyanamid Company from
1951 to 1954; as patent attorney and house counsel for the FMC Corp. from 1954
to 1984; and, since 1969, as director of the FMC Patent, Trademark, and Licensing
Department. On leave from FMC Corp. in 1961–62, she worked for the United
Nations Educational, Scientific and Cultural Organization as a science policy specialist in the Department of Natural Sciences. Offices in scientific and professional
organizations include member of Council of the Patent, Trademark and Copyright Section of the American Bar Association, 1982–84; board of directors of the
American Patent Law Association, 1981–84; vice president of the United States
Trademark Association, 1978–79, and member of its board of directors, 1975–76,
1977–79; member of board of governors of the New York Patent Law Association, 1970–74; president of the Pacific Industrial Property Association, 1978–80;
member of executive committee of the International Patent and Trademark Association, 1982–84; member of board of directors of the American Chemical Society, 1973–75, 1976–78, 1979–81; member of board of directors of the American
Institute of Chemists, 1960–66, 1970–76; member of the board of trustees of
Philadelphia College of Pharmacy and Science, 1983–84; member of patent policy
board of State University of New York, 1983–84; member of national board of
Medical College of Pennsylvania, 1975–84; and member of board of directors
of Research Corp., 1982–84. Service on government committees included State
Department Advisory Committee on International Intellectual Property, 1974–84;
Advisory Committee to the Domestic Policy Review of Industrial Innovation,
1978–79; Special Advisory Committee on Patent Office Procedure and Practice,
1972–74; and member of the U.S. Delegation to the Diplomatic Conference on
the Revision of the Paris Convention for the Protection of Industrial Property,
1982–84. Judge Newman received the Wilbur Cross Medal of Yale University
Graduate School, 1989; the Jefferson Medal of the New Jersey Patent Law Association, 1988; the Award for Outstanding Contributions in the Intellectual Property
Field of the Pacific Industrial Property Association, 1987; Vanderbilt Medal of New
York University School of Law, 1995; and Vassar College Distinguished Achievement Award, 2002. She was Distinguished Professor of Law, George Mason
University (adjunct faculty), served on the Council on Foreign Relations, and was
appointed judge of the U.S. Court of Appeals for the Federal Circuit by President
Reagan and entered upon duties of that office on May 7, 1984.
Kathleen McDonald O’Malley was appointed to the United States Court of
Appeals for the Federal Circuit by President Barack H. Obama on December 27,
2010. Prior to joining the Federal Circuit, Judge O’Malley was a District Judge on
the United States District Court for the Northern District of Ohio, a position to
which she was appointed by President William J. Clinton on October 12, 1994.
Prior to her appointment to the bench, Judge O’Malley served as First Assistant
Attorney General and Chief of Staff in the Office of the Attorney General for the
State of Ohio from 1992 to 1994, and Chief Counsel in that office from 1991
to 1992. From 1983 to 1991, Judge O’Malley was in private practice, where she
focused on complex corporate and intellectual property litigation; she was with
Porter, Wright, Morris & Arthur from 1985 to 1991 and with Jones Day from
1983 to 1985. As an educator, Judge O’Malley has taught patent litigation at Case
Western Reserve University School of Law and is a regular lecturer on issues arising in complex litigation, including intellectual property matters. Judge O’Malley
began her legal career as a law clerk to the Honorable Nathaniel R. Jones, United
States Court of Appeals for the Sixth Circuit, from 1982 to 1983. She received her
J.D. degree from Case Western Reserve University School of Law, Order of the
Coif, in 1982, and her undergraduate degree from Kenyon College in Gambier,
magna cum laude and Phi Beta Kappa, in 1979. She received an honorary Doctor
of Laws degree from Kenyon College in 1995.
Jed S. Rakoff has been a United States District Judge for the Southern District
of New York since 1996. Prior to his appointment, he was a partner at Fried,
Frank, Harris, Shriver & Jacobson LLP. From 1980 to 1990, he was a partner at
Mudge, Rose, Guthrie, Alexander & Ferdon LLP. Judge Rakoff was an Assistant
U.S. Attorney for the Southern District of New York from 1973 to 1980 and
chief of the Business and Securities Fraud Prosecutions Unit from 1978 to 1980.
Before joining the U.S. Attorney’s Office, Judge Rakoff spent 2 years in private
practice as an associate attorney at Debevoise & Plimpton LLP. He served as a
law clerk to the Honorable Abraham L. Freedman, U.S. Court of Appeals for the
Third Circuit, in 1969–70. Judge Rakoff is coauthor of five books and author of
more than 110 published articles, more than 375 speeches, and more than 900
judicial opinions. He has been a lecturer in law at Columbia Law School since
1988. He was a member of the Board of Managers, Swarthmore College, from
2004 to 2008. Judge Rakoff currently serves as a Trustee for the William Nelson
Cromwell Foundation and from 2007–10 served as a member of the Governance
Board for the MacArthur Foundation Initiative on Law and Neuroscience. From
1998–2011, he was chair of the Criminal Justice Advisory Board, Southern District of New York; from 2003–11 he was chair of the Second Circuit Bankruptcy
Committee; and from 2006–09 he was chair of the Honors Committee of the
New York City Bar Association. Since 2001 he has served as chair of the Grievance Committee of the Southern District of New York. He is a Judicial Fellow
at the American College of Trial Lawyers and was chair of the Downstate New
York Chapter in 1993–94. Judge Rakoff is the former director of the New York
Council of Defense Lawyers and former chair of the Criminal Law Committee,
New York City Bar Association. He has been a Judicial Fellow at the American Board of Criminal Lawyers since 1995. Judge Rakoff received a B.A. from
Swarthmore College in 1964, an M.Phil. from Oxford University in 1966, and a
J.D. from Harvard Law School in 1969. He was awarded honorary LL.D.s from
Swarthmore College in 2003 and St. Francis University in 2005.
Channing R. Robertson is Ruth G. and William K. Bowes Professor and former Dean of Faculty and Academic Affairs, School of Engineering, and Professor,
Department of Chemical Engineering, Stanford University. He was named a Bass
University Fellow in Undergraduate Education in 2010. Dr. Robertson received
his B.S. in Chemical Engineering from the University of California, Berkeley; M.S. in Chemical Engineering from Stanford University; and Ph.D. in Chemical Engineering, with emphasis on fluid mechanics and transport phenomena, from
Stanford University. Professor Robertson began his career at the Denver Research
Center of the Marathon Oil Company and worked in the areas of enhanced oil
recovery, geophysical chemistry, and polyurethane chemistry. Since 1970, he has
been on the faculty of Stanford’s Department of Chemical Engineering. He has
educated and trained over 40 Ph.D. students, holds seven patents, and has published over 140 articles. He is past director of the Stanford-NIH Graduate Training Program in Biotechnology. He was co-director of the Stanford initiative in
biotechnology known as BioX, which in part includes the Clark Center for Biomedical Engineering and Sciences. He directed the summer Stanford Engineering
Executive Program. He received the 1990 Stanford Associates Award for service
to the University, the Stanford Associates Centennial Medallion Award in 1991,
the 1991 Richard W. Lyman Award, the Society of Women Engineers Award for
Teacher of the Year 2000 at Stanford, the Stanford Society of Chicano/Latino
Engineers & Scientists Faculty of the Year Award in 2004, and the Lloyd W.
Dinkelspiel Award for Distinctive Contributions to Undergraduate Education in
2009. He is a Founding Fellow of the American Institute of Medical and Biological Engineering. Professor Robertson serves on the Scientific Advisory Committee
on Tobacco Product Regulation of the World Health Organization and served on
the Panel on Science, Technology, and Law, National Research Council, National
Academy of Sciences, 1999–2006. Because of his interests in biotechnology, he
has consulted widely in the design of biomedical diagnostic devices. He has also
served as an expert witness in several trials, including the Copper-7 intrauterine
contraceptive cases (United States and Australia), the Stringfellow Superfund case,
and most recently the Minnesota tobacco trial. He has cofounded 2 Silicon Valley startups and consulted with over 30 others during the past three decades.
Joseph V. Rodricks is an internationally recognized expert in the field of
toxicology and risk analysis, and their uses in the regulation and evaluation of toxic
tort and product liability cases. Since 1980, he has consulted for hundreds of manufacturers, government agencies, and the World Health Organization, and he
has served on 30 boards and committees of the National Academy of Sciences and
the Institute of Medicine. He has more than 120 publications on toxicology
and risk analysis, and has lectured nationally and internationally on these topics.
Dr. Rodricks was formerly Deputy Associate Commissioner, Health Affairs,
and Toxicologist, U.S. Food and Drug Administration (1965–80), and is a visiting professor at The Johns Hopkins University School of Public Health. He
has been certified as a Diplomate, American Board of Toxicology, since 1982.
Dr. Rodricks’ experience includes chemical products and contaminants in foods,
food ingredients, air, water, hazardous wastes, the workplace, consumer products,
and medical devices and pharmaceutical products. He is the author of Calculated
Risks (Cambridge University Press), a nontechnical introduction to toxicology and
risk analysis that is now available in a fully revised and updated second edition, for
which he won an award from the American Medical Writers Association.
Allen Wilcox is Senior Investigator, Epidemiology Branch at the National Institute of Environmental Health Sciences, NIH, and Editor-in-Chief of Epidemiology.
His research is primarily on human reproduction, with topics ranging from fertility and early pregnancy loss to fetal growth and birth defects.
Dr. Wilcox earned his undergraduate and medical degrees at the University of
Michigan, Ann Arbor, and his M.P.H. (maternal and child health) and Ph.D.
(epidemiology) at the University of North Carolina School of Public Health at
Chapel Hill, where he is an adjunct professor in the Department of Epidemiology.
The school recognized him with its Distinguished Alumni Award in 2006. Other
distinctions include the Distinguished Service Medal (highest award of the U.S.
Public Health Service); election as a Fellow of the American College of Epidemiology; and election as president of the Society of Epidemiologic Research, the
American Epidemiologic Society, and the Society for Pediatric Epidemiologic
Research. In 2008, he received the National Maternal and Child Health Epidemiology Award. He holds an honorary doctorate from the University of Bergen
(Norway). He is the author of Fertility and Pregnancy: An Epidemiologic Perspective
(Oxford University Press 2010).
Sandy L. Zabell is Professor of Mathematics and Statistics at Northwestern University. He received his A.B. from Columbia College in 1968, his A.M. in biochemistry and molecular biology from Harvard University in 1971, and his Ph.D.
in mathematics from Harvard University in 1974. He was Assistant Professor of
Statistics at the University of Chicago from 1974 to 1979, and joined Northwestern University as Associate Professor of Mathematics in 1980. He is a Fellow of
the American Statistical Association and the Institute of Mathematical Statistics. In
the past he has served as an associate editor of the American Mathematical Monthly
and the Journal of Mathematical Analysis and Applications, and book review editor
of the Annals of Probability. His principal research interests revolve around mathematical probability (in particular, large deviation theory) and Bayesian statistics
(in particular, the study of exchangeability). He has also written extensively on the
history and philosophical foundations of probability and statistics, is an affiliated
faculty member of the Northwestern Philosophy Department, and the author of
Symmetry and its Discontents (Cambridge University Press, 2006). Professor Zabell
has had a longstanding involvement in the legal applications of statistics, including
serving on two panels of the National Research Council, and teaching courses on
statistics at both the University of Chicago and Northwestern Law Schools. One of
his primary interests at present is forensic science, in particular, the statistical issues
arising from the use of DNA in human identification. He has spoken numerous
times at forensic science conferences, and lectured on forensic DNA identification
in courses at Northwestern. He is also interested in the statistical proof of employment discrimination and the legal uses of sampling. In addition to his scholarly
interests, he has assisted legal counsel over the years in more than 200 cases, both
civil and criminal.
Staff
Joe S. Cecil is a Senior Research Associate and Project Director in the Division
of Research at the Federal Judicial Center. Currently, he is directing the Center’s
Program on Scientific and Technical Evidence. As part of this program, he served
as principal editor of the first and second editions of the Center’s Reference
Manual on Scientific Evidence. He has published several articles on the use of
court-appointed experts and is currently examining changes in dispositive motion
practice in federal district courts over the past 30 years. Dr. Cecil received his
J.D. and a Ph.D. in psychology from Northwestern University. He serves on the
editorial boards of social science and legal journals. He has served as a member
of several panels of NAS, and currently is serving as a member of the National
Academies Committee on Science, Technology, and Law. Other areas of research
interest include federal civil and appellate procedure, jury competence in complex
civil litigation, claim construction in patent litigation, and judicial governance.
Anne-Marie Mazza is the Director of the Committee on Science, Technology, and Law. Dr. Mazza joined the National Research Council in 1995.
She has served as Senior Program Officer with both the Committee on Science, Engineering, and Public Policy and the Government-University-Industry
Research Roundtable. In 1999, she was named the first director of the Committee on Science, Technology, and Law, a newly created activity designed to
foster communication and analysis among scientists, engineers, and members
of the legal community. Dr. Mazza has been the study director on numerous
Academy reports including Review of the Scientific Approaches Used During the FBI’s
Investigation of the 2001 Anthrax Mailings (2011); Managing University Intellectual
Property in the Public Interest (2010); Strengthening Forensic Science in the United
States: A Path Forward (2009); Science and Security in a Post-9/11 World (2007);
Daubert Standards: Summary of Meetings (2006); Reaping the Benefits of Genomic and
Proteomic Research: Intellectual Property Rights, Innovation, and Public Health (2005);
Intentional Human Dosing Studies for EPA Regulatory Purposes: Scientific and Ethical
Issues (2004); and Ensuring the Quality of Data Disseminated by the Federal Government
(2003). Dr. Mazza received an NRC distinguished service award in 2008. In
1999–2000, Dr. Mazza divided her time between the National Academies and
the White House Office of Science and Technology Policy (OSTP), where she
served as a Senior Policy Analyst responsible for issues associated with a Presidential Review Directive on the government-university research partnership.
Before joining the Academy, Dr. Mazza was a Senior Consultant with Resource
Planning Corporation. Dr. Mazza received a B.A., an M.A., and a Ph.D. from
the George Washington University.
Steven Kendall is Associate Program Officer for the Committee on Science,
Technology, and Law. Mr. Kendall has contributed to numerous Academy reports
including Review of the Scientific Approaches Used During the FBI’s Investigation of the
2001 Anthrax Mailings (2011); Managing University Intellectual Property in the Public
Interest (2010); and Strengthening Forensic Science in the United States: A Path Forward
(2009). He is currently a Ph.D. candidate in the Department of the History of
Art and Architecture at the University of California, Santa Barbara, where he is
completing a dissertation on nineteenth-century British painting. Mr. Kendall
received his M.A. in Victorian Art and Architecture at the University of London.
Prior to joining the NRC in 2007, he worked at the Smithsonian American Art
Museum and the Huntington in San Marino, California.
Guruprasad Madhavan is a Program Officer with the Board on Population
Health and Public Health Practice, and the Committee on Science, Engineering,
and Public Policy at the National Academies. Previously, he served as a Program
Officer for the Committee on Science, Technology, and Law and as a Christine
Mirzayan Science and Technology Policy Fellow with the Board on Science,
Technology and Economic Policy. He has worked on such National Academies’ publications as Direct-to-Consumer Genetic Testing: Summary of a Workshop
(2010); Managing University Intellectual Property in the Public Interest (2010); and
Rising Above the Gathering Storm, Revisited (2010). Dr. Madhavan completed his
Ph.D. in biomedical engineering at the State University of New York (SUNY)
at Binghamton where his research was directed toward developing noninvasive,
nonpharmacological, neuromuscular stimulation approaches for enhancing circulation. He received his B.E. (honors with distinction) in instrumentation and
control engineering from the University of Madras, and M.S. in biomedical engineering from SUNY Stony Brook. Following his medical device industry experience as a research scientist at AFx, Inc. and Guidant Corporation in California,
Dr. Madhavan completed his M.B.A. in leadership and healthcare management
from SUNY Binghamton. Among other honors, he was selected as an outstanding
young scientist to attend the 2008 World Economic Forum Annual Meeting of
the New Champions, and was one of 14 people named among the “New Faces of Engineering”
of 2009 in USA Today. He is co-editor of Career Development in Bioengineering and
Biotechnology (Springer 2008) and Pathological Altruism (Oxford University Press
2011).
Index
A
Abuse-of-discretion standard, 14, 16, 17, 18, 19, 21, 25, 35-36, 100 n.279, 101 n.282, 104 n.303, 112 n.353, 226 n.36, 308 n.18, 563-564 n.44, 565 n.48, 693, 827 n.73, 846 n.179, 874 n.343, 947 n.83
Academy of Toxicological Sciences (ATS), 677
Accreditation
  engineering education, 931
  laboratories, 28, 62 n.30, 66, 68-69, 70 n.83, 98, 154, 156, 171 n.98, 538
  medical education, 695, 696, 697, 700, 701, 822, 823 n.49, 824, 873
Accreditation Board for Engineering and Technology (ABET), 931
Accreditation Council for Continuing Medical Education (ACCME), 700
Accreditation Council for Graduate Medical Education (ACGME), 696, 697
Acute myelogenous leukemia, 20 n.51, 26, 505, 655, 656 n.65, 663 n.81, 668-669, 670 n.97
Additive effects, 615 n.200, 673, 680
Admissibility of expert testimony, generally (see also individual disciplines)
  applying Daubert, 22-26, 63 n.39
  class certification proceedings, 30-32, 307 n.7, 365, 463, 489
  credibility issues, 21-22, 36, 99, 318 n.41, 376 n.75, 741, 781-782, 789-790 n.24, 794, 806, 807, 875 n.347, 879
  Daubert hearings, 6, 14, 23 n.61, 31, 35-36, 74 n.105, 76-77, 122
  discovery, 32-35
  Frye test, 12, 53, 60, 63, 82, 102 n.291, 103 n.300, 110 n.343, 133 n.7, 166, 173 n.102, 186, 189, 195 n.183, 197, 367, 368, 806-807, 866, 867, 949
  interpreting Daubert, 19-22
  procedural issues, 30-36
  qualifications of expert witness, 22-23
  relevancy standard, 13
  reliability standard, 13
  scientific foundation of studies, 23-25
  standard of review, 14, 16, 17, 18, 19, 21, 25, 100 n.279, 101 n.282, 104 n.303, 112 n.353, 226 n.36, 563-564 n.44, 565 n.48, 693, 827 n.73, 846 n.179, 874 n.343, 947 n.83
  sufficiency conflated with, 20-21
  Supreme Court cases, 12-19 (see also Daubert; General Electric; Kumho; Weisgram)
  synthesizing multiple studies vs. piecemeal examination, 19-20, 21, 23-24
  technical and other specialized knowledge, 16-18
Advertising
  costs, 321 n.48, 322, 326
  deceptive, 224, 231-233, 363 n.10, 366, 398-399, 400, 403-404, 410, 441
Advisory Committee on Civil Rules, 33
Agency for Healthcare Research and Quality, 701, 723, 728 n.174
Agent Orange litigation, 507 n.8, 520 n.38, 565 n.48, 583 n.100, 592 n.130, 609 n.179
Ake v. Oklahoma, 29 n.85, 127
Alcohol, blood levels, 228, 373 n.64, 791, 913
Alleles
  binning, 200
  defined, 139, 199
  drop in, drop out, 151, 152, 153, 160
  electropherogram, 144, 145-146, 182-183
  genetic typing, 139-140, 152, 159, 182, 196 n.185
  haplotype, 178, 181, 182, 204
  Hardy-Weinberg equilibrium, 165, 166, 204, 207
  heterozygosity, 139, 140, 147, 183 n.139, 199, 204
  homozygosity, 139, 140, 183 n.139, 199, 204
  kinship and, 163, 190
  ladders, 146, 147, 199
  linkage equilibrium, 166, 205, 207
  location description, 200
  match, 205
  mixtures of DNA, 182-183, 184-185
  multilocus genotype, 166, 204
  nonhuman DNA, 195, 196, 197, 198
  null, 144
  population frequencies, 148, 155, 163, 164-165, 166, 191, 195, 196 n.185, 197, 200, 203, 204-205, 207
  preferential amplification, 144
  probes, 140, 207, 209
  randomly mating population, 165, 198, 204, 208
  sex-typing test, 146-147
  single-locus genotype, 204
  size considerations, 153
  at STR loci, 141-143, 144, 145-147, 153, 159, 182-183
  three-allele locus, 183 n.140
  variation, 142-143
  at VNTR loci, 142, 199, 200, 202
Alternative hypotheses
  beta error calculation in epidemiology, 582
  DNA profiling, 205
  hypothesis testing, 205, 254 n.106, 255 n.110, 257, 276, 278, 283, 297, 299, 300, 319-321, 353
  multiple regression models, 319-321, 353
American Academy of Clinical Toxicologists, 678
American Academy of Environmental Medicine, 677 n.115
American Academy of Family Physicians, 735
American Academy of Forensic Sciences, 125
American Academy of Psychiatry and the Law, 823 n.52, 875
American Association for Public Opinion Research, 417
American Association for the Advancement of Science, 8, 39 n.3, 46
American Association on Mental Retardation, 371
American Bar Association, 8, 869
American Board of Bariatric Medicine, 699
American Board of Criminalistics, 156 n.52
American Board of Emergency Medicine, 676-677 n.114
American Board of Forensic Odontology (ABFO), 107
American Board of Forensic Psychology, 825 n.65
American Board of Forensic Toxicology, 69 n.78
American Board of Medical Specialties (ABMS), 676, 677 n.114, 698, 699
American Board of Medical Toxicology, 676 n.114
American Board of Pediatrics, 677 n.114, 697 n.42
American Board of Preventive Medicine, 677 n.114
American Board of Professional Psychology, 874
American Board of Psychiatry and Neurology, 697 n.42, 822, 823 n.52
American Board of Toxicology, 677, 678
American Cancer Association, 735
American Chemical Society, 46
American College of Medical Toxicology, 678
American College of Physicians, 735
American College of Radiology, 727
American Conference of Governmental Industrial Hygienists, 529 n.65
American Honda Motor Co. v. Allen, 31
American Industrial Hygiene Association, 539, 540
American Law Institute, 890 n.30
American Lift Institute (ALI), 924
American Medical Association, 677 n.115, 735
American National Standards Institute (ANSI), 906, 924
American Osteopathic Association (AOA), 697-698, 699, 700
American paddlefish, 194
American Petroleum Institute, 678
American Physical Society, 46
American Psychiatric Association, 828, 830, 831, 869, 879 n.358
American Psychiatric Nurses Association, 826 n.69
American Psychological Association, 367, 824 nn.54, 57, & 59, 875
American Society for Testing and Materials (ASTM), 906
American Society of Crime Lab Directors/Laboratory Accreditation Board (ASCLD/LAB), 68, 69 nn.76 & 78, 154 n.48, 156 n.52
American Society of Internal Medicine, 735
American Urological Association, 727, 735
Americans with Disabilities Act, 816, 833 n.105
Ames, Aldrich, 805
Amicus curiae briefs, 5, 30, 371, 797-798
Anecdotal evidence, 59 n.17, 85, 217, 218, 220, 310, 677 n.115, 809
Anthrax, 194, 713
Antibodies, 199
Antigens, 199, 202, 203, 735
Antitrust litigation, 22, 31 n.90, 213, 226 n.36, 260, 305, 306, 307 n.7, 313, 320, 321 n.48, 326, 328, 348 n.90, 365, 366 n.25, 373, 429 n.1, 431, 439, 475, 491 n.89, 498, 728
Aplastic anemia, 561, 724, 731
Appraisal approaches, 242-244, 248-249, 278, 444, 445-446, 447, 501
Asbestos, 248 n.93, 489, 519 n.36, 523, 532 n.67, 551 n.3, 573 n.68, 585 n.104, 587, 588 n.114, 606, 607 n.171, 609 n.178, 614, 615, 626, 627, 635, 640, 643-644 n.28, 652, 653, 669, 672, 676, 694, 724, 920
Association of American Medical Colleges, 695, 696 n.34
Association of Firearm and Tool Mark Examiners (AFTE), 93, 94, 95, 97 n.258, 100 n.273
Association of Social Work Boards, 826 n.67
Association of State and Provincial Psychology Boards, 873
Associations (see also Causation)
  aggregation of data from multiple sources and, 235
  biological plausibility, 20, 573, 600, 604-605, 606, 620, 664-665
  causation and, 20, 218, 221, 222, 262, 264, 552-553, 559, 562, 566, 567, 570, 571, 574, 577 n.81, 578 n.85, 584, 591, 592-593, 604-605, 610 n.184, 664-665
  confounders, 262-264
  correlation coefficients, 213, 227, 228, 260, 261-264, 265, 266, 286, 290, 301, 333
  defined, 552 n.7
  ecological correlations, 266, 267
  exposure–disease, 552-553, 554, 555-556, 557, 559, 561, 566, 567-568, 570, 572, 573, 574-575, 576, 577, 578 n.85, 579, 580, 581, 582, 583, 584, 585, 586 nn.107 & 108, 588 n.115, 589, 590, 591-593, 595, 597-606, 610 n.184, 611-612, 613, 622
  income–education, 219, 260-262, 264-266, 312
  linear, 261, 262, 264-268, 286, 321, 348, 352
  negative, 566 n.51
  statistical, between variables, 213, 217-218, 219, 221-222, 230, 233-235, 252 n.103, 253, 254, 260-263, 264, 265 n.129, 266, 285, 286, 291, 295, 298, 312, 321, 352, 356
  true or real, 559, 568, 572, 574, 575, 581-582, 590, 591, 592 n.126, 625, 627, 629
Atkins v. Virginia, 369-371, 815 n.5, 833 n.105
Attributable risk, 566, 570-571, 612 n.191, 619
Autoradiograph, 141 n.17, 199
B
Bacon, Francis, 39-40, 42, 43, 45, 50
Ballistics evidence
  ammunition, 92, 93, 99, 120-121, 125-126
  automated identification systems, 95-96
  cartridge identification, 27 n.79, 92, 94-95, 98
  case law development, 58, 91, 100-103
  clarity of testimony, 120-121
  class characteristics, 72 n.93, 92, 97, 100-101
  computer imaging of bullets, 99
  consecutive matching striae, 94
  Daubert and, 101
  empirical record, 61, 65, 97-100, 121
  error rates, 97, 98
  firearms, 65, 72 n.93, 91-92
  individual characteristics, 72 n.93, 93-94, 97, 99
  inductively coupled plasma-atomic emission spectrometry, 120 n.415
  Integrated Ballistics Information System (IBIS), 95
  lands and grooves, 91-92
  limits on testimony, 27 n.79, 101-102, 122, 123 n.440
  neutron activation analysis, 120 n.415, 123 n.440, 126
  pretrial discovery, 125-126
  proficiency testing, 97-98
  subclass characteristics, 93
  techniques, 91-97, 120
  toolmarks, 72 n.93, 93 n.241, 96-97, 98, 99, 103 n.300
Bayes, Thomas, 241 n.84
Bayesian approach (Bayes’ theorem; subjectivist approach)
  to conditional probabilities of related events, 259 n.122, 274
  to decision theory, 242 n.84
  defined, 200, 283, 742
  to DNA matches, 173, 174, 188, 189, 190-191, 200, 209
  to empirical distributions, 259 n.123
  in epidemiology, 611 n.188
  to error rates, 259 n.122, 282
  frequentists compared with, 273-275
  inference writ large, 242 n.84
  limitations, 174
  medical decisionmaking, 259 n.122, 706 n.78, 707-714, 725, 742
  “objective,” 259 n.123
  to posterior probabilities, 241, 242, 258, 259
  to prior probabilities, 259, 283
  to probative value, 259 n.122
  to statistical inference, 173, 174, 242 n.48, 273-275
Bayh-Dole Act, 48
Bendectin litigation, 13-14, 562 n.38, 565 n.48, 578 n.85, 579 n.86, 604 n.164, 638
Benzalkonium chloride, 507 n.8
Benzene, 20 n.51, 26, 217 n.14, 505-506, 514, 526 n.27, 532, 539, 543, 587 n.112, 606 n.169, 617-618 n.214, 646 nn.34 & 35, 649 n.44, 653, 655, 656 n.65, 657 n.67, 663-664 nn.81 & 82, 668-669, 670 n.97
Bias (see also Confounding factors; individual disciplines)
  aggregation, 623
  ascertainment, 187
  cognitive, 29, 79-80, 169 n.89, 706, 743
  conceptual errors, 590
  contextual (expectation), 29, 67 n.63, 80
  controlling for/minimizing, 68 n.70, 225, 246, 573-575
  expectation, 411
  information, 585-590, 624
  jury pool, 365, 403
  misclassification, 588 n.115, 589-590, 622, 624, 625
  nonresponse, 225, 226, 249, 290, 332, 362 n.8, 383-385, 407, 408, 416
  observer effects, 67-68, 160
  order effects, 395-396
  publication, 590
  recall, 249, 585, 586, 626
  selection, 98, 187, 224-225, 226 n.36, 249, 290, 293, 296, 370, 386, 408, 512 n.22, 583-585, 591, 627
  systematic, 394, 572 n.67, 573
Biomarkers, 509, 536 n.76, 586 n.110
Bipolar disorder, 832, 833 n.105, 839, 847, 853, 854 n.236, 855, 859, 881 n.366
Birth defects, 13-14, 249, 552 n.4, 562, 563, 570 n.63, 578 n.85, 579 n.86, 585 n.106, 587 n.112, 590, 614, 618, 620, 638, 984
Bite mark evidence
  ABFO guidelines, 107, 123
  case law development, 105, 110-112
  comparison methods, 106-107
  computer-generated overlays, 106 n.317
  crimes involving, 103-104
  Daubert and, 112
  DNA exonerations, 62 n.32, 109-110
  empirical record, 61, 65, 108-111
  proficiency testing, 109
  specificity of expert’s opinion, 111, 123, 215
  technique, 71, 104-107
  uniqueness of dentition, 105-106
Blood bank samples, 164
Blood evidence
  ABO typing, 72, 132 n.3, 275
  alcohol levels, 228, 373 n.64, 791, 913
  animal, 197
  DNA analysis, 143, 151, 155, 156, 158, 160 n.60, 164, 169 n.89, 173 n.103, 182, 197
  exposure, 508, 509, 518-519, 535-537, 544, 656, 657, 672
  preservative for, 202
  serology analysis, 58, 62 n.32, 132 n.3
  spatter examinations, 71 n.88
  toxicology, 508, 509, 518-519, 535-537, 544, 635, 636, 637 n.8, 653, 656, 657, 662, 667, 672
Bootstrap simulation, 284, 469
Brain (see also Neuroimaging; Neuroscience evidence)
  brain stem, 755
  cellular structure, 750-754
  cerebellum, 755, 808
  cerebrum, 755, 756, 757
  cortex, 756-758, 759-760, 808
  deep brain stimulation, 773, 775, 862
  frontal lobe, 755, 756, 757, 759, 763, 771, 893
  functional aspects, 759-760
  implanted microelectrode arrays, 775-776
  lesion studies, 774
  neurons, 750-754, 755, 757, 758-759, 760, 768, 770, 772, 774, 775-776, 778, 808-809, 854
  neurotransmitters, 751, 752, 753, 755, 763, 764, 833, 854
  occipital lobe, 756, 759, 760
  parietal lobe, 755, 756, 771
  structure, 754-759
  synapse, 750-751, 752, 763
  temporal lobe, 755-756, 774
  transcranial magnetic stimulation, 773-774
Breach of contract, 433, 434, 436, 437, 461 n.54, 466 n.68, 797
Breach-of-warranty action, 31
Breast cancer, 259 n.122, 562, 607 n.170, 617 n.211, 704-705, 708, 710, 711-712, 719, 721, 727, 733-734, 736-737, 738-739
Bundy, Ted, 112
Bureau of Economic Analysis, 484
Bureau of Labor Statistics (BLS), 471, 484
Burke v. Town of Walpole, 110, 123
C
California Public Utilities Commission v. California Energy Resources Conservation & Development Commission, 947-948
Canadian General Social Survey, 408 n.212
Cancer risk, 635, 638 n.12, 642-643, 644-645, 649 n.46, 650, 653, 654, 655, 656, 659, 660 n.74, 665, 668-669, 670, 683
Capital Asset Pricing Model (CAPM), 459, 469
Capital punishment (see Death penalty)
Carbon monoxide, 513 n.27, 540 n.88, 587 n.111, 635-636, 637 n.8, 651 n.52, 652 n.56, 672, 681
Carbon tetrachloride, 543, 544, 653, 662
Carcinogens/carcinogenicity, 643 n.29, 644, 645, 647 nn.37 & 38, 649 n.44, 650 n.49, 651, 655-656, 658 n.70, 659, 660 n.74, 670 n.97, 673 n.105, 680
Carcinogenicity bioassay, 644, 654-655, 680
Case management (see also Disclosures to opposing parties; Discovery)
  amicus curiae briefs, 5, 30, 371, 797-798
  bifurcation, 476
  closing arguments, restriction on, 124
  court-appointed experts, 6-8, 14, 35, 311, 329, 489, 599 nn.141 & 143
  cross-examination, 169 n.84
  Daubert hearings, 6, 14, 23 n.61, 31, 35-36, 74 n.105, 76-77, 122
  in limine motions, 14, 22, 414 n.213
  jury instructions, 29, 168 n.84, 170 n.95, 383 n.104, 455, 943
  pretrial conferences, 6, 488
  pretrial Daubert hearings, 6, 18, 30, 311, 362
  pretrial lie detection, 807
  protective orders, 487
  special masters or expert assistants, 6, 7, 35, 135, 488, 489
  structuring expert testimony, 23-24
  survey uses, 366-367
  videotaped testimony, 7, 880-881
Case reports, 23 n.59, 25 n.69, 108 n.329, 217 n.14, 639, 714, 724
Case-control studies, 556, 557, 558, 559-560, 568, 569, 583-584, 585-586, 587 n.112, 588 n.115, 589-590, 591 n.122, 607, 620, 625
Cats, 196, 197
Causation (see also specific disciplines)
  abuse-of-discretion standard and, 24
  alternative explanations, 552-553 n.7, 570 n.63, 582, 595, 598, 600, 605, 672-673
  anecdotal evidence, 217, 218, 220
  association and, 20, 218, 221, 222, 262, 264, 552-553, 559, 562, 566, 567, 570, 571, 574, 577 n.81, 578 n.85, 584, 591, 592-593, 604-605, 610 n.184, 664-665
  biodistribution of toxic agents, 667-668
  biological plausibility of associations, 20, 573, 600, 604-605, 606, 620, 664-665
  but-for analysis and, 429, 431, 432, 433, 436, 438-439, 440-443, 449-450, 455, 460, 461, 470, 471, 472, 473, 475, 476-477, 491, 492, 493-494, 496-497, 498, 501, 597, 598 n.136
  cessation of exposure and, 605
  conflicting research, 606, 674-675
subsequent unexpected events and,
438, 480-481, 495, 500
synthesizing multiple studies, 19-20,
21, 23-24, 217 n.14
target organ specificity, 662-663
temporal relationship, 217 n.14, 323
n.52, 558, 560-561, 562-563, 587
n.111, 600-601, 606, 669 n.94,
714 n.100
“weight-of-the-evidence”
methodology, 565 n.48
Censuses, undercount litigation, 2-3,
213, 223-224, 247 n.90, 268, 275
n.149, 307, 308
Centers for Disease Control and
Prevention (CDC), 418 n.246,
536-537, 561-562 n.36, 672
Centers for Medicare & Medicaid
Services, 862 n.290
Central Intelligence Agency, 805
Charter on Medical Professionalism, 703
Chloramphenicol, 724, 731
Chlordane, 543, 643 n.28
Chromosomes (see also Genes)
allele variations on, 142
anomalies, 183 n.140
autosomes, 200, 201, 204
cytogenetic analysis, 655
defined, 201
diploid number, 202, 204
haploid number, 204
homologous, 204
inheritance, 137-138, 142, 183 n.139
loci used for profiling, 142-143, 144,
145-146, 147, 148, 151, 153, 155,
159, 162, 163, 164, 165, 166, 175176, 182, 183, 188 n.157, 190,
191, 192, 196, 197, 198, 199, 201,
202, 204, 205, 206, 207, 209
monomorphic loci, 139
mutations, 206, 655, 683
recombination, 138
reduction process, 137-138
structure, 136-137, 142, 750
X, 136, 137, 138, 147, 201
Y, 136, 137, 138, 147, 181-182, 184,
201
confounding factors, 218, 220, 221,
222, 591, 592-593, 598, 672-673
consistency of trends, 606
correlation and, 309
Daubert trilogy and, 12
defined, 552 n.7
differential diagnosis, 217 n.14, 512
n.21
direction of, 322-323
dose–response relationship, 603
ecological studies, 561 n.34, 562
epidemiological studies, 23, 217 n.14,
218, 597-606
excretion routes for toxic agents and,
668
exposure evidence, 25-26, 558, 587
n.111, 588, 597-606, 666-667
extrapolation issues, 23, 222, 223,
563-565, 661-662, 664
general, 24, 551 n.2, 552, 565 n.48,
578 n.85, 597-606, 637 n.7, 638,
657 n.87, 659, 660-665
generalizability of studies, 222, 564,
595 n.133, 623
guidelines for assessing, 599-600
latency period for disease and,
668-669
level of exposure and, 669-670
medical evidence, 217 n.14, 438,
670-671
metabolism of toxic agents and, 668
observational studies, 215-216, 218,
220-222
preponderance of the evidence
standard, 565 n.48, 610 n.182
proximate cause, 463, 464
randomized controlled experiments,
218, 220, 221, 222
replication of results, 604
specific, 24, 25-26, 551 n.2, 552, 608-618, 637 n.7, 638, 645 n.31, 659
n.72, 665-666, 669-670 n.95
specificity of association, 605-606
statistical studies, 213, 216-223, 249,
260-272, 288
strength of association, 602
structure–activity relationships, 663
975
Copyright © National Academy of Sciences. All rights reserved.
Reference Manual on Scientific Evidence: Third Edition
Reference Manual on Scientific Evidence
Computer assisted tomography (CAT
scan), 718 n.117, 719, 720, 762-763, 837-838, 893
Confessions, coerced, 59 n.16
Confidence intervals (see specific disciplines)
Conflicts of interest, 8, 21-22, 48-49,
590, 728, 875
Confounding factors (see also specific
disciplines)
controlling for, 596-597
identifying, 595
lurking variables, 262-264
preventing or limiting, 595
Confrontation Clause, 26-27, 30, 789
“Consistent with” testimony, 70, 104
n.302, 111, 113, 116, 120, 121,
160 n.60, 184, 604-605, 606, 927
Consumer Product Safety Commission
(CPSC), 650, 909, 911, 920
Convenience samples, 164, 224-225, 248,
285, 287
Costs of expert testimony, 19
Council of American Survey Research
Organizations, 382 n.102, 416
n.240
Council on Continuing Medical
Education, 700
Coupon settlements, 491
Credibility issues (see also Conflicts of
interest)
Daubert and, 21-22
Crime Laboratory Proficiency Testing
Program, 97
Criminal Justice Act of 1964, 127
Cross-sectional studies, 319, 345, 352,
556, 560-561, 621-622, 716,
736-737
Cruel and unusual punishment, 3, 369,
815 n.5
Current Population Survey, 260 n.125,
266
Cyanide, 651-652
Chronic Lyme disease (CLD), 728
Chronic lymphocytic leukemia, 505 n.4
Civil Rights Act, 228, 350
Class-action cases, 7, 238 n.72, 247 n.90,
248 n.93, 429, 462, 463, 483, 486,
489-491, 649 n.47
Class certification proceedings, 30-32,
307 n.7, 365, 463, 489
Classification of Violence Risk (COVR),
848
Clean Air Act, 666
Clinical studies, 510, 555, 556, 575, 590,
607, 621, 640, 648 n.42, 656 n.64,
658, 659, 661
Cocaine, 126, 536 n.76, 760, 789
Cohort studies, 556, 557-559, 560, 567,
568, 573, 583, 584, 585 n.104, 589,
590, 592, 593, 594, 607, 621, 624,
625, 626, 628, 657, 658-659, 716
Coker v. Georgia, 370
Collaborative Testing Services, Inc., 69
n.82, 78, 85, 87, 88, 98
Commission on Osteopathic College
Accreditation, 696
Common-law fraud action, 31
Commonwealth v. Patterson, 81-82
Competency
confinement based on, 852
to consent to treatment and research,
844, 845
to enter into contracts or make wills,
816, 817, 820, 867
evaluations, 817-819, 820-821, 823,
844 n.167, 872, 880, 884, 885,
889, 890
to manage one’s affairs, 816, 844, 867
to marry or to vote, 816
of medical patients, 735
neuroscience applications, 796, 799
parenting capacity, 820, 844, 867
to represent oneself, 3, 799, 815, 818
restoration of, 852, 861, 863
to stand trial, 3, 785, 799, 815, 818,
820, 821 n.37, 823, 844, 852, 861,
863, 867, 872, 885 n.377, 889
to waive rights, 815, 817, 844
D
DeLuca v. Merrell Dow Pharmaceuticals, Inc.,
247 n.92, 551 n.2, 553, 567 n.55,
572 n.67, 575 nn.73 & 75, 577
n.81, 578 n.85, 579-580 n.88, 582
n.91, 599 n.143, 609 n.178, 610
n.184
Department of Commerce v. United States
House of Representatives, 2-3
Department of Defense, 46
Department of Energy, 46
Department of Health and Human
Services, 46
Department of Justice, 80, 117, 411, 491
Department of Labor, 793
Department of Veterans Affairs, 696 n.33,
892
Diagnosis of mental disorders
accuracy, 839-840
approaches, 834-839
clinical examination, 834-835
functional impairment vs., 819-821
laboratory tests, 838-839, 883
major diagnostic categories, 831-834
malingering detection, 840-841
neuroimaging studies, 837-838
nomenclature and typology (DSM-IV-TR and DSM-V), 828-831
psychological and neuropsychological
tests, 836-837
records of previous assessments, 839
structured interviews, 835-836
Dioxins, 522, 536 n.76, 545, 643 n.28,
652, 653, 667 n.92
Disclosures to opposing parties (see also
Discovery)
analytical methods and nonsupporting
analyses, 216
damages data, 486-488
data dictionaries, 487-488
database information and analytical
procedures, 331-332
dispute resolution, 488
drafts and communications, 33
format standardization, 487
unretained testifying experts, 32 n.96,
34
Damages (see Economic damages)
Daubert v. Merrell Dow Pharmaceuticals (see
also individual disciplines)
admissibility conflated with sufficiency,
20-21
application issues, 22-26
atomization, 19-20
characteristics of scientific knowledge,
49 n.16 (see also Reliability of
scientific testimony)
and civil cases, 63
and class certification proceedings,
30-32
credibility issues, 21-22
definition of science, 39 n.3
and empirical testing, 62-64
evidentiary (Daubert) hearings on
admissibility, 6, 14, 18, 30, 31,
35-36, 74 n.105, 76-77, 122, 125-126, 216, 362
exposure assessment, 22, 25-26
and Fed. R. Evid. 702, 12
and forensic identification evidence,
62-64, 101, 112
and Frye test, 12
gatekeeping function of trial judges,
6, 12-13, 16, 17, 102 n.291, 866
n.309, 901, 933, 956
and in limine motions, 14
interpretive issues, 19-22
overview and impact, 12-14
pretrial hearings, 6, 14, 23 n.61, 31,
35-36, 74 n.105, 76-77, 122
qualifications of expert witness, 22-23
and scientific foundation of studies,
23-25
scientist’s view of, 52-54
sufficiency conflated with admissibility,
20-21
Death penalty, 3, 27 n.78, 126, 216, 220,
221, 223, 307, 308, 369 n.45, 370-371, 797, 800 n.51, 851, 877
Decision theory, 242 n.84
DNA databases and database searches
all-pairs matching, 191-192
Arizona offender database, 191-192
Australian offender database, 192
birthday problem, 192 n.170
British National DNA Database,
144-145
CODIS (Combined DNA Index
System), 61, 62 n.30, 145, 201
comprehensive population-wide
database, 163 n.73
disclosure of trawling to juries, 189-190 n.164
judicial opinions on adjustment, 189
laboratory quality assurance
requirements, 154
mitochondrial DNA, 178-180, 190
near-miss (familial) searching, 189-191
New Zealand offender database, 191,
192
population databases for validation of
new loci, 148, 155, 163-164, 197-198, 199
probative value of matches, 165 n.76,
186-189, 190
proficiency testing for participants,
69-70, 156
representativeness of populations, 179
sampling error, 178
selection effects, 187
statistical analyses of adjustments, 165
n.76, 179, 186, 187-188
trawling, 174 n.109, 186-191
verification of random-match
probabilities, 191-192
DNA Identification Act of 1994, 61, 69,
70 n.83, 154 n.46, 156
DNA identification evidence
admissibility, 131, 132-133, 140, 166,
173 n.102, 181 n.134, 182, 186,
189, 195 n.183, 197
Bayesian approach to matches, 173,
174, 188, 189, 190-191, 200, 209
bite marks, 62 n.32, 109-110, 151
blood, 143, 151, 155, 156, 158, 160
n.60, 164, 169 n.89, 173 n.103,
182, 197
Discovery (see also Disclosures to
opposing parties)
amended rules, 32-34
“assumptions” provision, 34
DNA evidence, 125-126, 191
e-discovery, 34-35
improving the process, 330-331
laboratory reports, 125
mass torts litigation, 366-367
motions to compel, 34-35, 373 n.62
opinion work product, 33, 374
pretrial, 57, 125-126, 216, 310 n.24
procedural issues, 32-35, 125-126
statistical evidence, 310 n.24, 330-331
of summary of expert’s opinion, 125
undue burden or cost, 33, 34
Discrimination (see Racial discrimination;
Sex discrimination)
DNA (deoxyribonucleic acid) (see also
Alleles; Chromosomes; Genes;
Genome)
base pairs, 138, 139, 140, 141, 142,
143, 146, 147, 149, 152, 153, 176,
177, 180, 200, 201, 202, 203, 204,
206, 209
chemical structure, 131, 136-139, 202
complementary sequences, 143, 150,
201, 204, 207, 208
damage from toxic chemicals, 645,
654-655, 656, 663, 682
defined, 202
D-loop, 177, 201, 202
environmental insult, 153 n.44,
202-203
individual variation, 135-136, 137
introns and exons, 138 n.16
mitochondrial, 143 n.23, 202, 206,
651
polymorphisms, 139-143, 148, 177,
182, 197, 199, 207, 208, 209
pseudogenes, 138 n.16
repetitive sequences, 141, 142-143 (see
also STR under DNA sequencing
and testing)
DNA Advisory Board (DAB), 61-62, 154
n.46, 187
NRC reports, 60-61, 125, 127, 133,
134 n.12, 141 n.19, 143, 161, 162,
163 n.72, 164 n.75, 166-167, 168
n.84, 169 n.89, 170 n.95, 174
n.110, 175, 176 n.114, 185, 187-188, 192 n.170
objections to, 135
population frequencies, 134 n.12, 148,
155, 163-165, 166, 178, 182, 191,
195, 196 n.185, 197, 200, 203,
204-205, 207
population structure adjustments, 166-167, 179, 182, 192, 207
posterior probabilities, 172, 173-174
prejudicial testimony, 167-170, 171
n.97, 181 n.136, 185-186, 189,
190 n.164
pretrial discovery, 125-126
prior probabilities, 173, 174
probability sampling, 184
product rule, 165-167, 198, 199, 204-205, 207
qualifications of experts, 134-135, 156
n.52
random match probabilities, 60, 135,
155, 164, 165, 167-171, 172, 173,
175-176, 181 n.134, 182, 186, 187,
188, 189, 190, 191-192, 196, 197,
198 n.194, 205, 208, 251 n.99
random sample/sampling, 164-165,
178
in randomly mating populations, 165-166, 179, 198, 204, 208
“rarity” or “strength” testimony, 175
reappraisal of, 60-62
relatives as sources (kinship
hypothesis), 161, 162-163, 170,
172, 173 n.105, 174, 175-176, 184
n.143, 190, 192, 202
reliability, 60, 62 n.32, 73, 227
semen, 143, 151, 155, 159 n.58, 169
n.89
source attribution, 156-157, 161-162,
175-176
case law development, 131, 132-133
ceiling principles, 167 n.80, 200,
204-205
chain of custody, 157, 162
coincidence hypothesis, 161, 163-167,
172, 173 n.106
contact, 151
database matches, 165 n.76, 179, 186-189, 190, 191-192
Daubert and, 166, 167, 171 n.98, 173
n.102, 181, 186, 189, 194 n.176
defendant’s fallacy, 168 n.89
defense experts, 127, 162, 168 n.84
empirical testing, 60-62, 66, 148
error rates, 162, 170, 171 nn.96-98
exclusions, 116, 133, 135, 144 n.25,
156, 158 n.56, 159-160, 167-168,
169, 171, 173, 175 n.111, 177-178, 179, 180, 181, 184, 185, 186,
188, 190 n.164, 196 n.185
exonerations, 27, 62, 109-110, 116,
117, 119, 124, 125 n.450, 134, 157
n.55
Frye (general acceptance) test, 133 n.7,
166, 167, 173 n.102, 181, 186,
189, 195 n.183, 197
hair, 113, 116, 143, 149 n.133, 151,
155, 170, 177, 178 n.123, 179,
180, 181 n.134
history, 132-134
hypotheses for matching profiles,
160-161
jury comprehension of, 167-171, 175
n.111, 189-190 n.164
laboratory errors, 160-162
likelihood ratios, 169 n.89, 172-173,
174, 175, 177-178, 185-186, 205,
206
matches/inclusions, 74 n.107, 159-160
mishandling or mislabeling, 156-158,
175
mitochondrial DNA, 113, 116
multilocus profile frequency, 164, 166,
202, 204
degraded samples, 147, 149, 151, 152-153, 155, 157 n.55, 158, 160, 177,
201
denaturation, 143, 201
electropherograms, 144, 145, 146,
147, 182-183, 184, 199, 202, 208
emerging (next-generation)
technologies, 140, 148-150
extraction, 132, 143-144, 148, 151,
153 n.44, 177, 183
false-negative rates, 115-116, 162
false-positive rates, 161-162, 170-171
gel electrophoresis, 141, 200, 203, 209
heteroplasmy and, 177, 179, 180, 181,
204
high-throughput sequencing, 149-150
“lab-on-a-chip” devices, 148-149,
200-201
limitations, 141
low copy number (LCN) or low
template (LT), 151-152
measurement error, 141, 205
microarrays, 150
mitochondrial DNA, 71, 113, 116,
140, 150, 176-181, 201, 204
mixtures of DNA, 155, 158, 172-173,
182-186
multiplexing, 142, 144, 145-146, 206
phylogenetic analysis, 193, 194, 195
polymerase chain reaction (PCR),
133, 140, 142, 143-144, 145, 146,
147, 148, 151, 152 n.41, 153, 158,
177, 182, 196, 199, 202, 205, 206,
207, 209
population genetics, 133, 135, 148,
164, 181, 182, 191, 192, 198, 207
primers, 143, 144, 153, 182, 196, 204,
206, 207
quality of sample and, 152-153
quantity of DNA in sample and,
151-152
random amplified polymorphic DNA
(RAPD) analysis, 196
regions for forensic sequencing, 139
relative fluorescent units, 145, 146,
208
RFLP testing, 132, 140-141, 208
statistical conclusions, 131, 133, 134,
135, 155, 160, 163, 166, 167, 168
n.85, 169 n.91, 171, 172, 174,
178, 179, 181, 182-183, 185, 186-189, 193, 197
transposition fallacy, 168-169, 170
n.92, 173, 209
“uniqueness” testimony, 175-176
unrelated person as source, 163-167
vaginal swabs, 147, 151, 158, 182, 183
verbal expressions of probative value,
174-176, 182
wrongful convictions on, 62 n.32, 141
n.18
DNA laboratories
accreditation, 62, 154, 156, 171 n.98
certification, 154 n.48, 156
documentation requirements, 154-155
errors in matches, 160-162, 171
performance standards, 153-159
population genetics research, 192
proficiency testing, 60-62, 69-70, 148,
154, 155-156, 160-161, 162, 171,
196, 207
quality assurance and quality control,
61-62, 143-144, 153-156
retention of samples, 157
sample handling, 156-159
validation of procedures, 155
DNA sequencing and testing (see also
Nonhuman DNA testing)
allele-specific oligonucleotide (ASO)
probes, 140, 207, 209
amplification, 142, 143-144, 148, 151,
152 n.41, 153, 158, 182, 183, 196,
199, 205, 208
amplified fragment length
polymorphism, 199
artifacts, 153, 156-157, 185
autoradiography, 141 n.17, 199
capillary electrophoresis, 144-147,
200, 202
contaminated samples, 143-144, 153,
155, 156-157, 158, 160-161, 170,
181 n.134
low-dose risk curve, 673 n.107
maximum tolerated dose (MTD), 644-645, 682
target site/organ, 507, 519, 535, 536,
547, 636, 646
toxicology, 525, 636-637, 638, 641,
642, 644-645, 646, 647, 648, 651,
657 n.67, 658-659, 660, 661, 664,
665, 667 n.91, 668, 670 n.96, 673
n.107, 674, 677 n.115, 680, 681,
682, 684
Dose-response curve, 646, 651, 673
n.107, 681
Dose-response relationships, 4-5, 563,
565, 585, 600, 603, 613, 616
n.204, 622, 635, 636, 639, 641,
642-643, 645, 646, 648, 649, 651,
658, 663 n.82, 669, 670, 676, 680,
681
Drugs, illegal, 59
Due process, 44 n.10, 59 n.16, 104
n.306, 119, 127, 134 n.10, 157
n.55, 170 n.92, 171 n.97, 186
n.151, 226 n.35, 792, 815 n.3
Duke University Private Adjudication
Center, 8
sample collection and preservation,
151-153
sequence-specific oligonucleotide
(SSO) probes, 140, 209
sex-typing test, 146-147
size of sample, 141, 151-152
SNP chips, 140, 201
SNPs (single nucleotide
polymorphisms), 139, 140, 142,
148, 149 n.33, 150, 181, 182-183
n.138, 197, 201, 206, 208, 209
Southern blotting, 141 n.17, 209
special issues, 150, 176-192
STR (short tandem repeat or
microsatellite) profiling, 132, 133,
134, 141-142, 143, 144-147, 148,
149, 151-152, 153, 159, 160, 164,
170, 175, 176 n.115, 181-182,
183, 184 n.143, 189, 190, 191,
192, 196, 197, 198, 200, 201, 205,
209, 210
validation of methods and procedures,
133, 134, 148, 150, 153, 154, 155,
185, 193, 195
VNTR (variable number of tandem
repeats) testing, 140-143, 147, 166,
199, 200, 202, 205, 209-210
Y chromosomes (Y-STRs and
Y-SNPs), 132, 160 n.60, 181-182,
184 n.143, 190
Dogs
bite marks, 104
DNA profiling, 193, 197, 198 n.193
extrapolation of studies to humans,
535 n.75, 662 n.78
scent evidence, 62 n.32
Dose, dosage
benchmark, 642, 670 n.96, 680
exposure, 507, 508, 509, 513 n.26,
518-520, 525-528, 529, 531 n.67,
533-534, 535, 536, 538, 539, 541,
544, 545, 546, 547
extrapolation, 4-5, 603 n.160, 636,
641, 645, 648, 651
lethal dose 50 (LD50), 641, 682
limits, 512 n.22, 521 n.43, 528-529,
536
E
Ecological fallacy, 623
Ecological studies, 556-557, 561-563, 623
Economic damages
actual earnings of plaintiff after
harmful event, 450-451, 493
antitrust, 22, 260, 305, 320, 328, 348
n.90, 365, 366 n.25, 373, 429 n.1,
431, 439, 475, 491 n.89, 498
apportionment, 477, 479-480, 617
n.209
appraisal approaches, 444, 445-446,
447, 501
assets and liabilities (balance sheet)
approach, 443, 446
avoided costs, 429, 449-450, 456,
466-467, 501, 502
escalation, 451, 452, 453, 454, 495,
501
examples of calculations, 491-500,
893-894
exclusionary conduct, 441-442, 476
expectation damages, 433, 434, 435,
436, 437, 442-443, 449, 467, 497,
501
expected value approach, 437, 459,
462, 463, 468-469
fairness of settlement, 490-491
fixed cost, 446, 450, 499, 502
foreseeability rule, 463, 464
fringe benefits, 455, 470, 471-473,
492-494
future dollars, 451, 501
general approach to quantification,
432-443
harmful act/event analyzed, 432, 439,
440, 442
hypothetical acts of defendant,
439-440
hypothetical property, 447-448
individual class members, 490
inflation, 429, 444, 451-454, 469,
495, 499, 501, 502
intellectual property damages, 307,
311, 429 n.1, 439, 440, 441,
498-499, 501, 502, 932-933, 938,
945-946
interest for losses, 429, 430, 436 n.19,
441 n.26, 444, 452-453, 454, 457-459, 460, 490, 495, 500, 501, 502
life expectancy, 471, 472, 473, 474,
492, 493, 495-496
lifetime income calculations, 470-471,
474, 493, 495-496
limitations on recoverable damages,
461-467
liquidated damages, 429, 461, 467,
476
marginal costs, 446, 499
market approach based on prices and
values, 429, 431, 440, 443, 444-448, 459-460, 469, 481, 498, 501
market effect of adverse information,
448
avoided losses, 449, 464-465, 466-467, 478
but-for analysis, 311, 319, 429, 431,
432, 433, 436, 438-439, 440-443,
449-450, 455, 460, 461, 470, 471,
472, 473, 475, 476-477, 491, 492,
493-494, 496-497, 498, 501
capitalization factor, 444, 459-460,
469, 501
causation, 22, 438, 463-464, 480-481,
495, 500, 942-943
class actions, 7, 238 n.72, 247 n.90,
248 n.93, 429, 462, 463, 483, 486,
489-491
class certification and, 30-31, 32 n.95,
463, 489
compensatory, 238 n.71, 239-240,
388, 433, 434 n.10, 437, 455
compound interest, 457, 458, 501
constant dollars, 451, 452, 453, 454,
495, 501
court-appointed experts and special
masters, 7, 488, 489
data used to measure damages,
482-486
Daubert and, 22, 30-31, 32 n.95, 431-432, 461, 462
defined benefit plan, 472-473
defined contribution plan, 472, 473
disaggregation, 429, 438, 475-477,
479
disclosure standards, 429, 481,
486-488
discount rates, 452-454, 459, 471,
493, 495
discounted lost cash flows, 429, 430,
443, 444, 448-460, 469, 471, 500
double-counting, 442, 443 n.29
earnings losses, 429, 431, 443, 453-454, 455 n.44, 458, 460, 465, 472,
490, 491-497, 501
earnings projections, 430, 438, 439-440, 454 n.40, 470-471, 492,
496-499
electronic data, 482, 486, 487, 488
employment cases, 434 n.11, 457, 478
engineering testimony, 942-943
reliance damages, 433, 434, 435, 436
n.18, 437, 442-443, 449, 497, 502
restitution, 433, 435, 497, 502
retirement issues and benefits, 465,
470, 471, 472-473, 492, 493,
495-496
sampling data, 482-483, 485, 486
securities litigation, 429 n.1, 431, 448
services lost, 474
speculative, 454 nn.40 & 43, 461-463,
468
startup businesses, 447, 449, 456, 465-466, 468-470
statutory damages, 433, 457, 500
stock options, 429, 456
subsequent unexpected events, 438,
451, 480-481, 495, 500
supervening events, 480-481
survey and market research data, 389,
431, 469-470, 482, 483, 484, 486
taxes, 429, 447, 454-456, 458, 459,
460, 470, 500
unjust enrichment, 433
validity of data, 449, 469-470, 483-485
variable costs, 450, 502
wrongful death, 238 n.71, 470, 471,
473-474, 475 n.77
wrongful termination, 470, 471, 475,
491
zero damages, 438-439, 460, 461, 463,
468, 476, 477, 479
Economic loss rule, 435, 436, 437 n.22
Education Commission for Foreign
Medical Graduates, 696
EEOC v. Sears, Roebuck & Co., 257
n.115, 308, 313 n.36, 365 n.20
Eighth Amendment cases, 3, 369-370,
795, 815 n.5, 816 n.12, 833 n.105
Eisen v. Carlisle & Jacquelin, 31-32
Electric Power Research Institute, 678
Electroencephalography (EEG), 761,
766, 772-773, 791-792, 796, 803,
838-839
Electropherogram, 144, 145-146,
182-183
Eleventh Amendment, 816 n.12
Emphysema, 592, 593, 595, 606, 618
market friction adjustments, 446-447
mean vs. median awards, 238
measures of losses, legally prescribed,
433-439
medical expenses, 472, 474, 505 n.3
medical insurance benefits, 471-472
medical malpractice, 474
missing data, 319, 485-486
mitigation of losses, 450-451, 461,
464-466, 470, 481, 496, 497, 498,
499-500, 502
multiple challenged acts, 429, 438,
475-477, 479
nominal (ordinary) interest rate, 453,
495, 502
offsets, 443, 448, 451, 453-454, 465,
471, 474, 475
pain and suffering, 429 n.2, 434, 475
paper data, 482, 487, 488
partial losses, 445-446, 477-478
patent infringement cases, 311, 319,
440 n.24, 441
personal income losses, 470-475,
491-497
prejudgment interest, 430, 441 n.26,
444, 454 n.42, 457-458, 459, 502
present value, 443, 444, 448-449, 452,
455, 469, 472, 473, 474, 475, 495,
500, 502
price erosion, 440, 441, 502
profit losses, 429, 439, 440 n.24, 441,
443, 453-456, 459, 461 n.64, 464,
466, 468, 469, 478, 491, 497-500
profitability of business, 326, 442-443, 444, 449, 460, 461 n.64, 462,
468-470
proximate cause (remoteness of
damages), 463-464
punitive, 239-240, 433, 436, 437
qualifications of experts, 431-432
quality of life, 428 n.2, 475
real interest rate, 444, 453, 495, 502
reasonable certainty standard, 434
n.11, 461-463, 468
regression analysis, 305-306, 308 n.12,
311, 319, 326, 348, 431, 446, 450,
481, 499, 502
experience, 901, 902, 903, 904, 906,
922, 928-929, 930
experiment-based evidence, 933, 934-936, 938, 944
Federal Motor Vehicle Safety
Standards (FMVSS), 917-918
finite element modeling (FEM), 937,
957
formulation of products and materials,
907, 922, 923, 929
Fundamentals of Engineering (FE)
exam, 931
gatekeeping role of judges, 901, 933,
956
inspections, 925, 933-934
intellectual property disputes, 932-933, 938, 945-946
internal documents, 938-939
intrauterine device (IUD) flaws,
920-921
issues for litigation, 939-948
Kansas City Hyatt Regency Hotel
disaster, 389, 923-924
licensure, 931-932, 933, 949
literature analysis, 938
manufacturing issues, 906, 907, 918,
920-921, 923, 925, 927, 928-929,
934, 937, 939, 940, 941, 943 n.70,
947 n.84, 952
modeling, mathematical and
computational, 901, 936-938, 957
observational evidence, 933-936, 952,
955-956
obsolescence, 907
opinion testimony, 901, 933-939, 940
n.62, 941-942, 943, 944, 952, 956-957, 958
other similar incidents (OSIs) concept,
952-956
patent cases, 935, 945-946
personal injury cases, 942, 947
presentation of evidence, 901, 936-937, 956-958
problem identification, 902-903
product defects, 907, 908, 934, 937,
939-944, 947 n.84, 951 n.92, 952,
953
Empirical testing (see also specific disciplines)
Daubert and, 62-64
Engineering evidence
accelerated testing, 905-906, 944
acceptable risk, 908, 909-910, 915-920
accreditation, 931-932
administrative hearings, 947-948, 953
admissibility, 899, 932-933, 943, 944
n.72, 952 n.93, 955, 958 n.106
air cooler flaws, 922
Air France 4590 disaster, 928-929
animated presentations, 956-958
approximations, 903, 913, 919, 936
assumptions, 903, 912-913, 935, 936,
950
automotive lift design, 924-925
best practice, 950-951
certification, 932
Challenger space shuttle disaster,
926-928
computer simulation and digital
displays, 901, 936-937, 956-958
Concorde design flaws, 928-929
cross-disciplinary domains, 900-902
dam collapse, 925-926
Daubert and, 899, 932-933, 945, 946,
949, 958 n.106
“defect” testimony, 907, 908, 934,
937, 939-944, 947 n.84, 951 n.92,
952, 953
design issues, 920-929, 939-940, 948
design process, 904-929
disciplines and fields of practice, 900
disputed issues commonly occurring,
948-956
education and training, 929-930, 931
end use testing, 905-906, 923, 924,
925, 944
engineering calculations, 936-937,
938, 957
engineering interns, 931
engineers in training, 931
environmental disputes, 906, 916, 947,
954-955
examples of flawed design processes,
920-929
Epidemiology
adjustments for noncomparable study
groups, 571-572
admissibility of evidence, 551 n.2,
553, 555 n.14, 562 n.38, 565 n.48,
579 n.85, 581 n.89, 583 n.100,
601 n.153, 606 n.169, 609 n.181,
610, 618 nn.213 & 214
agent, defined, 551 n.3, 619
alpha, 576, 577, 578 n.84, 579-581,
582, 619
alternative explanations, 552-553 n.7,
582, 595, 598, 600, 605
animal studies of toxicity, 563-565,
603 n.160, 625
association (exposure–disease), 552-553, 554, 555-556, 557, 559, 561,
566, 567-568, 570, 572, 573,
574-575, 576, 577, 578 n.85, 579,
580, 581, 582, 583, 584, 585, 586
nn.107 & 108, 588 n.115, 589,
590, 591-593, 595, 597-606, 610
n.184, 611-612, 613, 622
attributable risk, 566, 570-571, 612
n.191, 619
Bayesian approach, 611 n.188
beta, 576 n.80, 581, 582
biases, 24, 553 n.9, 554, 567-568,
572, 573-574, 575 n.74, 583-591,
592 n.127, 595 n.133, 598, 602,
605, 610 n.184, 612-613, 615,
620, 622, 624, 625, 626, 627
biological markers, 586, 620
biological plausibility of association,
573, 600, 604-605, 606, 620
case-control studies, 556, 557, 558,
559-560, 568, 569, 583-584,
585-586, 587 n.112, 588 n.115,
589-590, 591 n.122, 597-606, 607,
620, 625
causation, 23, 24, 217 n.14, 218,
551-552, 553, 554, 558, 559, 560-563, 564 n.48, 566, 570, 574, 577
n.81, 579 n.85, 584, 585-586, 587
n.111, 588, 591, 592-593, 597-618
clinical studies, 555, 556, 575, 590,
607, 621
product liability litigation, 901, 938,
939-943, 947
professional engineers (PEs), 931, 932
property damage, 943, 947
qualifications of experts, 831, 901,
932-939, 949
radiant heating hose flaws, 922-923
reasoning processes, 902-904
registration, 931, 932, 949
regulatory context, 919, 947-948,
951-952, 953
reliability of evidence, 900, 906, 933,
935, 940 n.62, 945, 946, 947, 958
n.106
retrospective product modification,
907
risk assessment, 909, 910-914
risk calculations, 911-912, 918
“safe” products, 908-909
safety considerations, 908-910
severity of injury, 910, 911-912
solution paradigms, 902-903
standard of care, 949-950
standards and codes, 924-925, 947-948, 951-952
state of the art, 950
Tacoma Narrows Bridge collapse, 924
testing-related evidence, 934-936
toxic waste site design, 921-922
trade secret disputes, 946
uncertainty, 903
validation, 904, 906, 907, 922-925,
927-928, 929, 936, 937-938, 944
n.72, 958
vehicle-miles-traveled (VMT), 912,
913-914, 916, 919
warning issues, 904, 938, 939, 941-942, 943-944 n.70
Environmental Defense Fund, 678
Environmental Protection Agency (EPA),
506, 530 n.66, 532, 577-578, 640,
648, 649 n.46, 650, 656 n.64, 663,
665 n.88, 673 n.108, 679, 920
Ephedra litigation, 23-24 n.61, 25 n.67,
577 n.81, 579 n.85, 617 n.212,
664 n.84, 669 n.94, 676 n.113
hospital-based studies, 584
human (in vitro) studies, 555-556,
564, 623
incidence of disease, 551, 557, 558,
560, 561 n.34, 562, 566, 567, 568
n.58, 569 n.61, 570, 571, 577
n.83, 582, 592, 594, 595, 598, 603,
605, 608, 612, 613, 615, 616, 619,
621, 622, 623, 624, 625, 628
information bias, 585-590, 624
interpretation of study results, 566-572
meta-analysis, 15, 19-20, 23, 579, 581
n.89, 606-608, 624
misclassification bias, 588 n.115, 589-590, 622, 624, 625
missing data, 595-596
multivariate analysis, 596, 597, 625
null hypothesis, 574, 575, 576 n.78,
577 nn.81 & 82, 579, 581-582,
619, 620, 625, 626, 628, 629
observational studies, 555-563, 566,
581, 590, 592, 593, 607, 608, 624,
625, 627
odds ratio, 566, 568-569, 573, 584,
589, 625
power of hypothesis tests, 580 n.88,
582-583, 607
prevalence, 560, 567 n.54, 602 n.157,
622, 625, 626
publication bias, 590
p-values, 575 n.74, 576, 577 n.81,
578-580, 626, 628
random assignment, 555, 556 n.15,
592, 607, 623, 626
random error, 556 n.19, 572 n.67,
573, 574-582, 585 n.106, 589,
612-613, 621, 623, 624, 626, 627,
628
random sample/sampling, 572 n.67
randomized controlled studies, 398,
555, 556, 581 n.89, 592, 607, 621
randomness, generally, 555, 626
relative risk (RR), 566-568, 569, 570
n.62, 572, 573, 574, 575-576, 577
n.82, 578 n.85, 579, 580, 581,
582, 592, 594, 602, 611, 612, 614-615, 616, 619, 621, 626, 627
cohort study, 556, 557-559, 560, 567,
568, 573, 583, 584, 585 n.104, 589,
590, 592, 593, 594, 607, 621, 624,
625, 626, 628, 657, 658-659, 716
conceptualization problems, 589-590
confidence intervals, 573, 578 n.85,
579-580, 581, 582 n.94, 585
n.106, 608, 621, 628
confounding factors, 24, 554, 563,
567-568, 572, 574, 575 n.74, 584
n.103, 585, 591-597, 598, 602, 605,
612-613, 621, 622, 628, 657 n.67
cross-sectional studies, 556, 560-561,
621-622
Daubert and, 551 n.2, 574 n.72, 575
n.74, 576 n.77, 578-579 n.85, 581
n.89, 610-611 n.185, 617 n.212
design of studies, 556-563, 589-590
differential diagnosis/etiology, 217
n.14, 512 n.21, 589-590, 591, 610
n.183, 613 n.193, 617-618
dose–response relationship, 563, 565,
585, 600, 603, 613, 616 n.204, 622
ecological fallacy, 623
ecological studies, 556-557, 561-563,
623
error sources, 556 n.19, 572-597, 612-613, 619, 620, 621, 623, 624, 626,
627, 628, 629
experimental studies, 555-556
and exposure science, 505, 506, 508,
509, 512, 518, 519, 533, 535, 536,
537, 538, 540, 547
extrapolation, 535 n.75, 563, 565, 603
n.160, 613
false negatives (beta errors or Type II
errors), 577 n.81, 581-582, 620,
629
false positives (alpha errors or Type I
errors), 575-581, 589, 619, 627,
628, 629
false results (erroneous association),
572-597
general causation, 24, 551 n.2, 552,
565 n.48, 578 n.85, 597-606
generalizability of studies, 564, 595
n.133, 623
Evidence, defined, 51
Evolution, theory of, 50
Ex parte contact, 329-330
Experimental studies, 555-556
Experts/expertise (see also individual
disciplines)
consulting vs. testifying, 33
court-appointed, 6-8, 14, 35, 311,
329-330, 489, 599 nn.141 & 143
defense due process rights to, 29, 127
ethical responsibilities, 125
examiners and witnesses, 30, 86, 369,
672-673, 693 n.26, 823 n.51, 827
n.73, 890
national register of, 8
resources for identifying, 8
secondary experts, 375-376
of survey interviewers, 409
unretained, 32, 373 n.62
Exposure science (see also Epidemiology;
Toxicology; specific substances)
absorption, 518, 523, 524, 526, 532,
534, 535, 543, 544, 545, 547
acute exposure, 545, 657 n.66, 671
admissibility of evidence, 25-26, 506
n.5, 512 n.22
analytical chemistry (direct
measurement), 528-530, 535, 540
analytical detection limits, 530
assessment of exposure, 510, 511, 512,
513 n.26, 519, 529, 531 n.66, 533-534, 539, 543-544, 656-657
biomonitoring, 505 n.3, 535-537, 559,
587, 649 n.47, 657, 667, 680
blood evidence, 508, 509, 518-519,
535-537, 544, 656, 657, 672
blood lead levels (BPb), 536-537, 662,
667, 672
body burdens, 534-535, 537, 538, 545
causation analysis, 22, 25-26, 505,
509, 511-513, 518, 519 n.35, 525
n.53, 528 n.63, 534, 535, 538,
539, 543-544, 597-606
certification programs, 540
children and infants, 518, 521 n.40,
524 n.49, 527, 531 n.67, 536-537,
544, 552 n.4
replication of results, 604
selection bias, 583-585, 591, 627
specific causation, 24, 551 n.2, 552,
606-618
specificity of association, 605-606
standardized mortality (or morbidity)
ratio (SMR), 572, 607 n.171, 624,
628
statistical significance, 24, 573, 575-581, 582, 585 n.106, 607 n.171,
619, 620, 621, 626, 628
stratification, 593, 596, 628
strength of association, 557, 566, 600,
602, 611 n.186, 626
sufficiency of evidence, 552-553 nn.7
& 9, 565 n.48, 579 n.85, 600
n.146, 604 n.164, 605 n.167, 610,
611, 612, 616-617
temporal relationships, 601-602, 606
time-line (secular trend) studies, 562-563, 627
toxic agents, 555-556, 563-565
toxicology studies compared, 563-565,
603 n.160, 628, 636, 639, 644 n.29,
645-646 n.33, 647, 650-651, 655,
656, 657-660, 664-665, 674, 681
true association, 559, 568, 572, 574,
575, 581-582, 590, 591, 592
n.126, 625, 627, 629
two-tailed tests, 577 n.83
types of studies, 555-565
Equal Protection Clause, 2, 816 n.11
Error, defined, 51-52
Error rates (see also specific disciplines)
and reliability of methodology, 13, 29,
49 n.16, 64, 65, 69 n.80, 78, 79,
86, 87, 88-89, 97, 99, 102 n.290,
103 n.300, 122, 162, 171 nn.96-98, 214, 217, 259 n.122, 628, 724-725, 787, 806, 890, 903
zero, 29, 79, 99, 122, 171 n.97
European Committee for Standardization
(CEN), 906, 936, 956-957
European Union Regulation on
Registration, Evaluation,
Authorisation and Restriction of
Chemicals (REACH), 648, 649
lungs/respiratory tract/inhalation, 507,
509, 516, 517, 518, 519 n.36, 520,
522-523, 524, 526, 534, 544, 545,
546, 547
metabolism/metabolites, 518, 527
n.62, 534-535
modeling, 508, 521, 528, 530-533,
538, 544, 546
occupational exposures, 26, 217 n.14,
505-506 n.5, 511, 516, 517, 526,
529, 536, 539, 540, 652 n.54, 663
n.81
organic chemicals, 508-509, 513-514,
515, 520-521, 522, 528 n.63, 537
overview, 506-507
particulate matter (PM), 511 n.17, 512
n.22, 515, 523-524, 528 n.63, 532,
533 n.72
pathways of exposure, 512 n.26, 517,
518, 519-523, 524, 527-528, 531,
532, 533, 536, 538, 543, 544, 546
persistence, 521-522, 535-536, 537
pharmacokinetics, 535, 540
presentation of data, 541-542
processes of human exposure, 516-525
product contaminants, 505, 506, 507,
509-511, 515, 516, 519-520, 524,
526, 527, 528, 533, 537, 545, 546
qualifications of experts, 508, 539-540
quality of assessment, 537-539
quantification of exposure, 525-534
reconstruction of exposure, 512, 513
n.26, 529, 531 n.66, 539, 657 n.57
regulatory context, 505, 506, 509-511,
517 n.31, 528-529, 534
risk assessment, 505, 506, 507, 510-511, 525, 526, 528, 533, 534, 535,
536, 537, 538 n.84, 547
routes of exposure, 507, 509, 518,
519, 520, 522-524, 526, 533, 536,
537 n.80, 538, 547, 588, 650 n.48,
682
scope of, 507-508
skin (dermal exposure), 507, 509, 516,
517, 518, 520, 523, 524, 526, 534,
536 n.76, 543, 544, 546, 547, 650,
660, 682
concentration, 506 n.5, 519 n.35, 525,
527-529, 530, 531, 532, 533, 534,
535, 536, 540 n.88, 541-542, 543,
544, 546
consumer products, 505, 506, 507,
509-511, 515, 516, 519-520, 524,
526, 527, 528, 533, 537, 545, 546
contexts for, 509-512
Daubert and, 25-26
distribution, 511, 518, 532, 534, 535,
546
dose, 507, 508, 509, 513 n.26, 518-520, 525-528, 529, 531 n.67, 533-534, 535, 536, 538, 539, 541, 544,
545, 546, 547
duration of exposure, 518, 519, 525,
526
environmental contaminants, 25, 508,
509, 510-511, 515-516, 520-522,
536, 537
environmental degradation, 521-522,
524, 531, 532
environmental media, 507, 508, 509,
510, 518, 521-522, 524, 527-528,
530, 533, 541, 546
environmental models, 530-533
environmental sampling, 528, 529,
530, 533, 538, 541
epidemiology and, 505, 506, 508, 509,
512, 518, 519, 533, 535, 536, 537,
538, 540, 547
excretion, 518, 523, 527, 532, 534-535
fate and transport, 507, 521, 532
gastrointestinal (GI) tract/ingestion,
518, 523-524, 525, 526, 531 n.67,
534, 546
general causation, 26
goal of, 318-319
hazardous waste sites, 528, 531, 536
n.78, 543-544, 546
indirect pathways, 512 n.26, 517, 518,
520, 527-528, 533, 546
industrial chemistry/chemicals, 507,
510, 511, 514-516, 521
inorganic chemicals, 513-514, 515,
520, 522
laboratory certification, 529-530
ballistics analysis, 120 n.415
CODIS, 61, 62 n.30, 145, 201
crime laboratory, 58
DNA analysis, 61-62, 69-70, 116,
145, 154, 156, 157, 178, 179
n.128, 180, 187, 194
fingerprint analysis, 67 n.66, 74, 75,
77, 78, 79-80, 81, 83 n.175
hair analysis, 112 n.357
toolmark identification, 203 n.300
Uniform Crime Reports, 231
Federal Communications Commission,
378
Federal Employees Liability Act, 947
Federal Highway Administration, 916
Federal Insanity Defense Reform Act, 868
Federal Judicial Center, 8-9, 63 n.39
Federal Rules of Civil Procedure
Rule 23, 30-31, 32
Rule 26, 32, 33, 34, 35, 374, 414,
417-418 n.246, 486-487, 696 n.33
Rule 53, 35
Federal Rules of Criminal Procedure
Rule 16, 125
Rule 23(a), 794 n.41
Federal Rules of Evidence
Rule 104(a), 12
Rule 401, 27 n.79, 101 n.281, 123,
785, 952 n.93
Rule 403, 29, 121, 167, 174, 181, 214,
610 n.185, 788-789, 806, 952 n.93
Rule 404, 171 n.96, 189-190 n.164,
789
Rule 406, 790
Rule 702, 12-13, 16, 17, 18, 22-23,
34-35, 63, 82 n.169, 90 n.221,
101, 102 n.291, 121 n.426, 122,
167, 174, 181, 214, 551 n.2, 579
n.85, 696 n.33, 776, 785-788, 806,
827 n.73, 846 n.179, 871 n.333,
939
Rule 703, 214, 361, 363-364, 610
n.184
Rule 704, 868, 944 n.72
Rule 706, 24 n.66, 135, 329-330
Rule 803, 363-364 n.12
Federal Trade Commission, 399 n.178
sources of exposure, generally, 516-517
specific causation, 26
standards for dose limits, 512 n.22,
521 n.43, 528-529, 536
target site (systemic) dose, 507, 519,
535, 536, 547
toxicology and, 505, 506, 508, 509,
518, 519, 533, 535, 537, 538, 540,
547
units of concentration, 525, 541-542
volatile chemicals, 514, 520, 521, 531,
650 n.48, 657 n.66, 668
Extrapolation
animals to humans, 4-5, 15, 23, 223
n.26, 535 n.75, 563, 565, 636,
641, 645, 646-647, 648, 658, 661-662, 669-670, 692 n.21
in class actions, 238 n.72
of damages from past earnings, 438
defined, 682
dose–response relationships, 4-5, 603
n.160, 636, 641, 645, 648, 651
exclusion of evidence, 23, 692 n.21
representativeness of populations for,
613, 727
in risk assessment, 547, 603 n.160,
651, 661-662
from samples to populations, 226, 238
n.72, 244, 299, 909
from short exposures to multiyear
estimates, 648
statistical models, 222, 223, 226, 238,
244, 645 n.31
from tissue or cell cultures to humans,
564, 646-647, 652, 664
Exxon Shipping Co. v. Baker, 239
Eyewitness testimony, 13 n.10, 59, 62, 72
n.95, 328-329, 365 n.22, 384, 958
F
Federal Bureau of Investigation
anthrax investigation, 194
Automated Fingerprint Identification
System, 77
Firearms identification (see Ballistics
evidence)
First Amendment, 792
Food additives, 5, 216, 510, 515, 517,
525 n.56, 919
Food and Drug Administration, 25, 217
n.14, 461-462, 525, 581 n.89, 640,
644 n.29, 647-648, 650, 670 n.97,
678, 696 n.33, 720, 728, 730, 731,
774, 775, 854, 856, 858, 862, 907,
921, 941 n.65
Food Safety Risk Analysis Clearinghouse,
910
Forensic dentistry, 57, 101-102, 105 (see
also Bite mark evidence)
Forensic identification expertise, 55-127
(see also Bite mark evidence; DNA;
Fingerprint evidence; Handwriting
evidence; Microscopic hair
evidence; other specific disciplines)
accreditation of crime laboratories, 28,
66, 68-69, 98, 154 n.48
admissibility of, 26-30, 57, 59-60,
61, 62-64, 65, 71, 72, 74 n.105,
76-77, 82, 85-86, 89-90, 101, 102,
103, 110-111, 112, 117, 118-119,
121 n.426, 122, 123, 124, 127
certification of examiners, 28, 66,
68-69, 79, 80 n.157, 89
clarity of testimony, 70, 120-121
class characteristics, 57, 72, 84, 92, 93,
94, 96, 97, 101, 114
closing arguments, 124
confirmation bias, 67 n.66, 80
Confrontation Clause and, 26-27, 30
contextual (expectation) bias, 29, 67
n.63, 80
court-appointed experts, 29-30
cross-examination, 30
Daubert and, 26-30, 57, 60, 62-64, 72,
74 n.105, 76-77, 82, 85-86, 89, 90
n.220, 101, 112, 118-119, 122, 124
defense experts, 80 n.157, 111 n.351,
124, 125 nn.450 & 454, 127
development of techniques, 58-60
DNA exonerations, 27, 62, 109-110,
116, 117, 119, 125 n.450, 134
Federation of American Societies for
Experimental Biology, 46
Federation of State Medical Boards, 698
Fiber analysis, 57, 61, 62 n.32, 71 n.88,
112, 119 n.409
Fifth Amendment, 790-792
Fingerprint evidence
admissibility, 27 n.78, 73, 81-83
analysis, comparison, evaluation, and
verification (ACE-V), 75, 76, 79,
81, 82-83
artifacts, 74 n.105, 75
Automated Fingerprint Identification
System (AFIS), 65, 77
“black box” approach, 81
case law development, 58, 71, 73, 81-83
confirmation bias, 67 n.66, 79-80
Daubert and, 74 n.105, 82, 122
DNA exonerations, 62 n.32
DNA profiling compared, 60 n.23, 61,
73, 74 n.107
empirical record, 59 n.17, 61 n.25,
76-81, 109 n.331
error rates, 78, 79, 122
expectation (context) bias, 80
FBI examiners, 28 n.82, 67 n.66, 74,
75, 77, 78, 79-80, 81, 83 n.175
Frye test, 82-83
Galton details, 73, 75 n.113
history, 72-73
individuation, 57 n.1, 72, 73, 82, 84,
117 n.393
latent prints, 74, 75-77, 78 n.143, 79,
80 n.156, 81, 82 n.169, 109 n.331
“match” opinions, 73-74, 75-76, 81
n.167, 82 n.169, 84, 175
Mayfield case, 28 n.82, 67, 75 n.116,
79-81
minimum threshold approach, 80
observer bias, 67
“one-discrepancy” rule, 75 n.111
population frequency data, 73
proficiency testing, 27 n.78, 67-68,
76-77, 78-79, 78 n.143
simultaneous impressions, 83
technique, 71, 73-76, 575 n.74
validity, 76-77
Forensic Sciences Foundation, 69 n.82
Formaldehyde, 522 n.47, 528 n.64, 653,
656 n.65, 658 n.70, 673 n.107,
674 n.110
Fourteenth Amendment, 792
Fourth Amendment, 796
Framingham Study, 708
Freud, Sigmund, 858
Frye (general acceptance) test, 12, 53, 60,
63, 82, 102 n.291, 103 n.300, 110
n.343, 133 n.7, 166, 173 n.102,
186, 189, 195 n.183, 197, 367,
368, 806-807, 866, 867, 949
Functional impairments
assessment, 842-846
clinical examination, 843-844
diagnosis of mental disorders vs.,
819-821
from mental disorders and, 841-846,
851-852, 860-861
predictive assessments, 851-852
structured assessment techniques,
844-846
treatment of, 860-861
Functional MRI (fMRI), 765, 766, 768-772, 773, 775, 776-777, 778, 779,
780, 781, 782-783, 786, 787-788,
791, 796, 797-798, 800, 803-807,
809, 810-811, 838
Furman v. Georgia, 370
drug analysis, 59
empirical testing, 60-64, 66 n.62
ethical responsibilities of experts, 125
false-positive errors, 77, 78, 100, 109,
115
Fed. R. Evid. 401 and, 101 n.281, 123
Fed. R. Evid. 702 and, 12-13, 16, 17,
18, 22-23, 34-35, 63, 82 n.169, 90
n.221, 101, 102 n.291, 121 n.426,
122
Frye and, 60, 63, 82, 102 n.291, 103
n.300, 110 n.343
individual characteristics, 57, 72, 73,
84, 92, 93-94, 96, 97, 99, 113
individuation opinions, 57, 59 n.12,
60, 66, 71 n.91, 72, 74, 82, 84, 89,
90, 94, 106, 113, 114
interpretation of evidence, 27, 29
Kumho Tire and, 62-63, 89 n.214
laboratories, 28-29, 58-59, 66
laboratory report format, 28-29, 70
limitations on testimony, 121-124,
126-127
match probability, 72
NRC Forensic Science Report, 27,
30, 60 n.23, 64-70, 71, 74 n.106,
75-76, 77-78, 79, 82 n.171, 84,
85, 97, 100, 105 n.314, 108, 113-114, 115-116, 119-120, 121, 122,
126
objectionable testimony, 29
observer effects, 67-68
prejudicial, 29
pretrial discovery, 125-126
procedural issues, 124-127
proficiency of experts, 28, 61, 62
n.30, 68, 69-70, 76, 78-79, 85,
87-89, 97-98, 116
reappraisal of, 60-64
recurrent problems, 120-124
reliability, 71-72
research recommendations, 66
terminology, 70, 71-72
testifying beyond the report, 126-127
validity, 27-28, 71-72
wood evidence, 58
Forensic Science Service (UK), 144
G
Gatekeeping role of judges, 6, 12-13,
16, 17, 102 n.291, 866 n.309,
901, 933, 956 (see also Case
management)
Gender discrimination (see Sex
discrimination)
General Electric v. Joiner, 14-16, 17, 18,
19-20, 24 n.65, 53, 82-83, 364
n.17, 563-564 n.44, 565 n.48, 579
n.85, 638 n.9, 661 n.77, 692-693,
932 n.56
observer effects, 67
proficiency studies, 85-89
technique, 83-84
Haplotypes, 178, 181, 182, 204
Hardy-Weinberg equilibrium, 165, 166,
204, 207
Harvard Center for Risk Analysis, 910
Harvard Medical Practice Study, 700-701
HCR-20, 848
Healthcare Facilities Accreditation
Program (HFAP), 700
Healthcare Research and Quality Act of
1999, 701
Hearsay, 214, 225 n.32, 227 n.37, 363-364, 610 n.184, 956
Hemangiosarcoma, 672
Henricksen v. ConocoPhillips Co., 26
Heterozygosity, 139, 140, 147, 183 n.139,
199, 204
Hinckley, John, 800
HIV, 194, 195 n.181, 385, 402, 635 n.1,
713, 720, 730, 838
Homozygosity, 139, 140, 183 n.139, 199,
204
Hooke’s law, 269, 271, 281
Hughes, Howard, will dispute, 83
Hunt v. Cromartie, 2
Genes (see also Alleles; Chromosomes;
DNA)
amelogenin, 146, 147, 183
coding and noncoding sequences, 138,
139, 177, 201
defined, 138, 142
DQA, 202
Genome (see also Chromosomes; DNA)
Alu sequences, 199
defined, 203
environmental impacts, 663
Human Genome Project, 144, 149
mitochondrial, 150, 176, 177-178,
202
nuclear, 137, 138, 139, 140, 141, 142,
144, 148, 149-150, 176, 177, 204
structure, 137, 138, 204, 209
Genome-wide association studies, 581
n.90
Glass analysis, 57, 69 n.81, 116 n.383
Global Utilization of Streptokinase and
tPA for Occluded Coronary
Arteries Trial, 732
Gregg v. Georgia, 307 n.10, 370
H
Habeas corpus proceedings, 118, 119,
126, 170 n.92, 795, 815 n.5
Hair (see Microscopic hair evidence)
Handwriting evidence
case law development, 58, 63 n.41,
83, 89-90, 101 n.284
class characteristics, 84
Daubert and, 63 n.41, 85-86, 89, 90
n.220
empirical record, 59 n.17, 85-89
error rates, 86, 87, 88, 89
experts compared to laypersons, 86-87
Fed. R. Evid. 901(b)(3), 89
individual characteristics, 84
individuation opinions, 84
Kumho Tire and, 89 n.214
limits on testimony, 27 n.79, 89-90,
101 n.284, 121 n.426, 123 n.440
I
Impeachment by bias, 21
Infectious Disease Society of America
(IDSA), 728
Informed consent (see Medical informed
consent)
Innocence Project, 27 n.77
Insanity defense, 29 n.85, 222, 749, 763,
785, 786, 799, 800, 815 n.1, 817,
819-820, 865-866, 867, 868, 869
Institute of Healthcare Improvement, 701
Institute of Medicine, 20, 46, 49, 701,
702, 723, 728
Integrated Ballistics Information System
(IBIS), 95
questionnaires, 226 n.36
right to a trial by, 5, 21, 226 n.35,
794-795
selection, 225, 226 n.36, 249, 253
n.104
Jury Selection and Service Act, 226 n.36,
252 n.102
Justice for All Act, 62 n.30, 154 n.48
Intellectual property litigation, 307, 311,
429 n.1, 439, 440, 441, 498-499,
501, 502, 932-933, 938, 945-946
(see also Patents)
“Intelligent Design” litigation, 50
International Academy of Forensic
Toxicology, 678
International Agency for Research on
Cancer (IARC), 20, 564 n.46, 646,
656 nn.64 & 65, 660 n.75, 665, 678
International Association of Identification,
80 n.187
International Organization for
Standardization, 68
International Society of Regulatory
Toxicology and Pharmacology, 678
Intrauterine device (IUD) litigation, 574
n.70, 920-921
K
Knoll v. State, 112
Kochert v. Greater Lafayette Health Serv.
Inc., 22
KSR International Co. v. Teleflex, Inc.,
945-946
Kuhn, Thomas, 41-43, 44-45, 49, 50
Kumho Tire Co. v. Carmichael, 16-18, 35,
53, 62, 63, 89 n.214, 214 n.4, 308,
431, 575 n.74, 692-693, 866, 891,
899, 932 n.56, 933, 951
J
Joint Commission, 701
Jones & Laughlin Steel Corp. v. Pfeifer, 454
Jury
change-of-venue surveys, 376-377, 403
comprehension of evidence, 5, 29,
91 n.226, 118 n.400, 124, 126,
134, 167, 168, 169, 171, 173, 175
n.111, 189-190 n.164, 329, 463,
475, 693 n.26, 788, 868, 901, 937,
939, 947, 951, 956
damage awards, 238-239, 240 n.82,
475
death penalty cases, 370
decisions as indicators of national
opinion, 370
discrimination in composition, 7, 234,
249-250, 253 nn.104 & 105, 275-278, 365
impartiality, 223 n.27, 224, 365, 370,
376-377, 403
instructions to, 29, 168 n.84, 383
n.104, 455, 943
law enforcement officers excluded
from, 252 n.102
L
Laboratories (see also DNA laboratories;
Laboratory reports)
accreditation, 28, 62 n.30, 66, 68-69,
70 n.83, 98, 154, 156, 171 n.98,
538
growth in number of, 58-59
practices, 28-29
quality assurance and quality control,
68, 529-530
regulation, 61, 66
Laboratory Proficiency Testing Program,
69, 97, 116
Laboratory reports
format standardization, 70
jury comprehension, 126
pretrial discovery, 125-126
testifying beyond the report, 126-127
Lanham Act, 363 n.10, 366, 382, 400,
422
Law, language of science compared, 51-52
Major depressive disorder, 828 n.76, 829,
830, 832, 838, 839, 863 n.297, 879
n.357
Mammography, 259 n.122, 708, 710,
711-712, 726-727, 738-739
Management of expert evidence (see Case
management)
Manganese, 505 n.3, 515, 522 n.47, 528
n.64, 529 n.65, 653, 670 n.95
Mapes Casino, Inc. v. Maryland Casualty
Co., 257
Mass torts litigation, 366-367
Maximum Contaminant Levels (MCLs),
528-529
Mayfield, Brandon, 28 n.82, 67, 75
n.116, 79-81
Mayo Lung Project, 718
McNeilab, Inc. v. American Home Products
Corp., 392 n.146, 415
Medical education and training
accreditation, 695, 696, 697, 700, 701,
822, 823 n.49, 824, 873
continuing medical education, 700
licensure and credentialing, 695, 696
n.33, 697, 698-699
medical school, 695-696, 821, 822
n.44, 873
postgraduate training, 697-698, 824
Medical history
consistency with toxicology expert’s
opinion, 670-671
differential diagnosis, 672-673
disclosure of contradictory data,
674-675
interaction of chemicals, 673
laboratory tests, 672
susceptibility to environmental agents,
674
symptoms of toxic exposure, 671-672
Medical informed consent
principles and standards, 734-737
risk communication, 737-739
shared decisionmaking, 739-740
Medical malpractice litigation, 474, 689,
692, 694 n.27, 695, 698, 816-817,
852, 877, 878, 951
Law Enforcement Assistance
Administration (LEAA), 69, 97
n.262, 116
Lead exposure, 510 n.15, 521 n.41, 522,
524 n.49, 531 n.67, 536-537, 640,
653, 654, 662, 667, 870 n.332
Leapfrog Group, 701
Lethal injection, 3
Lie detection (see also Polygraph evidence)
fMRI, 801-807
nonlitigation applications, 807
Life expectancy, 365, 471, 472, 473, 474,
492, 493, 495-496, 702, 719, 727,
733
Likelihood ratios, 169 n.89, 172-173,
174, 175, 177-178, 185-186, 205,
206, 710-711
Lindbergh kidnapping trial, 58, 83, 89
Linear
associations, 261, 262, 264-268, 286,
321, 348, 352, 603 n.160
combinations, 271 n.139, 280, 287,
289, 290, 298
low-dose risk curve, 673 n.107
no-threshold model, 642-643 n.28
regression, 260 n.124, 264, 298, 316,
317 n.36, 336-339, 347, 353, 354
Linkage equilibrium, 166, 205, 207
Lockett v. Ohio, 795
Lockheed-Martin, 77-78
Lung cancer, 14, 15 n.26, 218, 219, 221,
552, 558-559, 570 n.63, 578 n.85,
585 n.104, 593, 597, 600, 602,
605, 606, 613, 615, 620, 635, 718-719, 724, 733
M
Madrid terrorist train bombing, 28 n.82,
67, 75 n.116, 79-81
Magnetic resonance imaging (MRI), 720,
761, 763, 766-772, 773, 777-778,
781, 797, 800 n.52, 837-838
Magnetoencephalography (MEG), 761,
772-773
malpractice litigation, 474, 689, 692,
694 n.27, 695, 698, 816-817, 852,
877, 878, 951
medical decisionmaking, 704-740
medical reasoning, 693-695
meta-analysis, 722-723, 724, 725
natural frequency statements, 712, 737
negative predictive value, 710, 711,
744
null hypothesis, 724, 725
odds ratio, 738
pathophysiological reasoning, 715
personal injury cases, 689, 690 n.8
positive predictive value, 710, 711,
712, 744, 781
posterior probabilities, 710, 742
posttest probability, 742, 744
predictive value, 710, 711, 712, 744,
781
pretest probability, 710-711, 742, 744
prior probability, 708, 710-711
probabilistic reasoning, 707-714, 715
product liability cases, 22, 689, 693,
703, 731 n.189
prognostic testing, 721
prosecutor’s fallacy, 712
p-value, 724
qualifications of experts, 698, 700, 729
randomized controlled studies, 716,
718, 722-723, 724, 725, 729, 730-731, 732, 736
“reasonable medical certainty/
probability,” 123, 691-692, 693,
694 n.27, 716
relative risk, 737, 738
screening tests, 708, 712-713, 714,
717-719, 726-727, 735, 736-737,
738-739, 743, 744
sensitivity of tests, 706 n.78, 708, 709,
710-711, 712, 713, 714, 719-720,
737, 742, 744
single event probability, 737
specificity of tests, 706 n.78, 708, 709,
710-711, 712, 713, 714, 719-720,
737, 742, 744
terminology, 689-692
testing, 717-721
Medical practice
delivery of care, 700-701, 721-722
education and training for, 695-700
outcome goals, 702-703
patient–physician encounters, 703-704
quality of care, 702-704, 721-722
Medical Practice Acts, 698
Medical testimony
absolute risk, 738
Bayes’ rule, 259 n.122, 706 n.78, 707-714, 725, 742
causal reasoning, 23, 714-717, 742
clinical practice guidelines, 726-728
clinical reasoning process, 689, 705-707, 741
conditional probability, 710, 712, 714,
737, 742
cross-sectional studies, 716, 736-737
Daubert and, 690, 692-693, 696 n.33
diagnostic reasoning, 689, 693, 704-717, 719, 741
diagnostic testing, 704, 717, 719-720,
724, 742
diagnostic verification, 706, 715, 719,
742, 743, 744
differential etiology, 690-691, 741,
742-743
etiology, defined, 691, 743
evidence-based medicine, 722-723,
726, 733, 738
false-negative test, 708, 711, 724
false-positive test, 708, 710-711, 713,
714, 717-718, 720, 724, 725, 735
General Electric Co. v. Joiner, 692-693
hierarchy of evidence, 723-725
hypothesis generation, 705, 715, 742,
743
hypothesis modification, 706, 742, 743
hypothesis refinement, 706, 742, 743
hypothesis testing, 724-725
hypothetico-deductive approaches,
705, 707, 743
judgment and uncertainty in, 693-694,
721-734
Kumho Tire Co. v. Carmichael, 692-693
likelihood ratios, 711
Mental health evidence (see also Mental
health assessment; Competency;
Diagnosis of mental disorders;
Treatment of mental disorders)
admissibility of, 815-821, 827, 846
n.179, 866, 867, 869 n.328, 886,
894
board certification of experts, 822-823, 825, 826 n.68, 827, 874, 893
case example, 892-894
criminal responsibility, 815, 819 n.30,
844, 868, 869
damages assessments, 816 n.7, 819,
851, 863, 892, 893-894
Daubert and, 866, 886, 891 n.396
diagnostic vs. functional issues, 889-890
disability, 816 nn.11 & 12, 818, 819-820, 831 n.87, 837, 851, 882, 894
disclosures, 890-891
emotional harm or distress, 432 n.22,
496, 810, 816, 818, 819, 820-821,
823, 851, 852, 863, 866
empirical support, 846, 847, 865, 866,
867, 868, 880-881, 891
experience of experts, 818, 822, 823,
824, 825, 826, 827, 841, 846, 848,
854, 871-873, 881, 889, 890
Frye and, 866, 867
functional impairments, 3, 819-821,
832, 840, 841-842, 883 n.373
Kumho Tire and, 866, 891
legal cases involving, 815-821
licensure of experts, 824, 825, 826,
827 n.73, 873-874
limitations of, 818, 836, 849-851, 859,
865-869
negligence actions, 816 n.7, 818, 827
n.73, 877 n.351, 892, 893
overview, 815-869
prior professional relationship, 875-877, 893-894
psychiatric nurses, 826, 870-871
psychiatric social workers, 826, 870-871
psychiatrists, 821-823, 827, 834, 841,
847, 849, 853, 865, 870, 872-873,
874, 875, 876-877 n.349, 878, 879
n.358, 893
treatment-related, 728-734
true-negative test, 708, 710, 711, 724
true-positive test, 708, 711, 712-713,
717, 724
variations in care, 721-722
Medications for mental disorders (see also
specific medications)
categories of, 855-856
efficacy and effectiveness, 858
polypharmacy, 856-857
side effects, 857
targets of treatment, 853-855
Melendez-Diaz v. Massachusetts, 30, 62
n.33, 64 n.48, 71 n.88, 126, 789
Mental health assessment (see also
Competency; Diagnosis of mental
disorders)
adequacy of circumstances for, 880-881
collateral informants, interviews with,
835, 841, 843, 844, 847-848, 872,
876, 882-883, 894
conduct of, 877-885
contemporaneous, 817-818, 865, 881
cooperativeness of evaluee, 834, 849,
879-880, 881, 890
disclosure of limitations, 890-891
functional impairments, 817, 838,
842-846, 851-852, 864, 865, 882,
884, 885-886, 889-890
in-person examination, 851, 877-879,
884
malingering considerations, 839, 840-841, 843, 881-882, 884-885, 894
predictive assessment, 818-819, 846-852, 858, 863-864, 865, 881-882,
885-886, 887, 890
process of expert in reaching
conclusions, 889-891
records review, 817-818, 834-835,
839, 841, 843-844, 847-848, 878,
879, 881-882, 888, 890, 893
retrospective, 817, 818 n.18, 865, 881
structured instruments, 835, 836, 840
n.144, 841, 844, 845-846, 848,
849, 850, 851-852, 885-889, 893
individual characteristics, 113, 114-115
neutron activation analysis, 114, 123
n.437
proficiency testing, 69 n.81, 116
random-match probability, 118 n.401
technique, 112-113
Military Rules of Evidence, 794
Military Service Act, 249
Millsap v. McDonnell-Douglas, 490
Mini-Mental Status Examination
(MMSE), 837
Minnesota Multiphasic Personality
Inventory (MMPI), 836, 841, 886
Monte Carlo simulations, 284, 469
“More likely than not” standard, 27 n.79,
102, 123, 463, 577 n.81, 612, 613
n.194, 617
Morton Thiokol, 927
Multiple regression
admissibility, 308-309, 314 n.33, 319
n.43, 324 n.54, 330-332
antitrust litigation, 305, 306, 307 n.7,
313, 320, 321 n.48, 326, 328, 348
n.90
basics of, 333-351
bias, 312, 314, 315, 322 n.50, 325,
327, 352
causality, 309, 310, 311, 312, 314, 321
n.48, 322-324, 327, 353
census undercounting litigation, 307,
308
class certifications, 306-307
computer output interpretation,
346-348
confidence intervals, 321, 332 n.69,
342-343, 352
correlation, 309, 310, 311, 314, 315,
322, 324-325, 333-334, 342 n.79,
352, 354, 355, 356
court-appointed neutral experts,
329-330
covariates, 311-312, 322, 336-337,
350, 351, 352
cross-sectional analysis, 319, 345, 352
database information and analytical
procedures, 331-332
Daubert and, 308-309
psychodynamic theory, 859, 865-867,
891
psychology/psychologists, 822, 823
n.51, 824-825, 826-827, 847, 849,
865, 869, 870-871, 872, 873-874,
875, 876 n.349, 878, 888, 893-894
psychopharmacology (see Medications
for mental disorders)
psychotherapists/psychotherapy, 822,
825, 826-827, 858-859, 870, 873-874, 876 n.349, 882, 891
qualifications of experts, 821-829,
869-877
relative risk, 851 n.214
social workers, 826
structured instrument reliability and
validity, 885-889
theoretical basis, 865, 885, 891
training of expert witnesses, 870-871
ultimate issue testimony, 867-869
violence risk (future dangerousness), 3,
819, 846-851, 878 n.354, 893
Mental retardation, 369-371, 815 n.5,
819, 829, 833, 836, 874 n.343 (see
also Competency)
Mercury, 515, 536 n.76, 642 n.27, 646,
653
Mesothelioma, 585 n.104, 588 n.114,
606, 635, 672
Meta-analysis, 217 n.14, 254 n.107, 289,
579, 581 n.89, 606-608, 624, 657
n.67, 722-723, 724, 725
Meyers v. Arcudi, 367 n.32, 368
Microscopic hair evidence (see also Fiber
analysis)
admissibility, 118-119, 178 n.123, 181
n.134
bias, 67
case law development, 112, 117-119
class characteristics, 114
Daubert and, 118-119
DNA analysis, 116, 143, 149 n.133,
151, 155, 170, 177, 178 n.123,
179, 180, 181 n.134
DNA exonerations, 62 n.32, 116, 117,
124
empirical record, 61, 113-117
job aptitude test–performance
correlation, 333-336
justifying choice of model, 317
least squares method, 326-327, 335,
336, 337, 339, 340, 341, 342, 345-346, 347, 354, 355
linear associations, 321, 348, 352
linear regression model, 260 n.124,
264, 298, 316-317, 336-339, 347,
353, 354
mean, 332, 341, 342, 343, 350, 351,
354, 356
mean squared error (MSE), 344, 347,
354
measurement error, 327-328
misleading data, 349
model specification (research design),
311-317, 337
multicollinearity, 324-325, 354
nonlinearities, 339
nonprobability sampling, 332 n.68
normal curve, 342
normal distribution, 343 n.82, 354
null hypothesis, 319-321, 342-343,
348, 353, 354, 356
observational studies, 312, 318, 332,
340, 342, 347
one-tailed tests, 321, 354
outlier, 327 n.58, 345, 346, 354, 355
overview, 305-311
parameters, 312, 314 n.32, 315, 316-317, 320, 324, 325, 326, 327, 332,
336, 337, 338, 340-341, 342, 343,
344, 347, 348 n.91, 352, 353, 354,
355, 356
patent infringement cases, 306, 307
n.12, 309, 311, 319, 321
percentages, 318 n.40, 320, 341, 345,
355
perfect collinearity, 324
practical significance of results, 318-321, 355
precision of results, 340-346
presentation of statistical evidence,
330-332
probability sample/sampling, 332, 355
probative value, 315 n.33
death penalty deterrence analysis, 307,
308
dependent variable, 305, 308, 311-313, 314, 315, 316, 317, 321, 322-324, 325 n.55, 326, 327, 332, 334,
335, 336, 337, 338, 339, 340, 345,
347, 348, 352, 353, 354, 355, 356
design of research, 308-309, 310,
311-317
discovery process, 330-331
discrimination cases, 305, 306, 309-310, 312-314, 315-317, 318-320,
323-324, 336-347, 350-351
dispute resolution over statistical
studies, 331-332
economic damages, 305-306, 308
n.12, 311, 319, 328, 348, 431, 446,
450, 481, 499, 502
errors in modeling, 325-326
error terms, 325, 326, 336, 337, 339,
342, 344, 347, 348 n.91, 352, 355
explanatory (independent) variable(s),
305, 308, 310, 311-312, 313-316,
317 n.36, 322-325, 326, 327, 334,
335, 336, 337, 338, 339-340, 341,
342, 345, 346 n.88, 348, 349, 350-352, 353, 354, 355, 356
fitted values, 270, 295, 298, 337, 338,
339, 353
forecasting, 348-349
formulation of research question, 311
functional form of model, 316-317
goodness of fit, 338, 340, 344-345,
347, 355
heteroscedasticity, 296, 326 nn.56 &
57, 353
hypothesis tests, 319-321, 342-343,
348, 353, 354, 356
hypothetical example, 350-351
independence, 325, 326, 353
influential data points, 327, 345, 346,
353
interaction variables, 316-317, 339
n.79, 351, 353
intercepts, 335, 338, 345, 347, 348,
353, 354
interpreting results, 318-328, 339-340
N
p-values, 320-321, 324 n.54, 347,
350-351, 354, 356
qualifications of experts, 328-329
quasi-experiments, 312, 355
random (sampling) error, 314 n.30,
336, 337, 339, 342 n.79, 355
random variables, 355, 356
ranges, 312 n.26, 333, 343, 345, 353
rate regulation cases, 307
recording data, 327, 330-331
regression line, 337-339
regression residuals, 336, 339, 344,
347, 354, 355, 356
reliability, 340, 341
robustness of results, 310-311, 322-328, 346, 355
R-squared statistic, 314 n.31, 345,
347, 348, 350, 351, 353, 355
scatterplots, 333, 335, 337, 355
sensitivity to individual data points,
326-327, 345-346
serial correlation, 326 nn.56 & 57, 355
significance levels, 320-321, 324 n.54,
347, 350-351, 354, 356
size of sample, 316, 318-319
spurious correlations, 309, 310 n.24,
311, 322, 356
standard deviation (SD), 341, 343
n.83, 344, 348, 354, 356
standard errors (SE), 316 n.35, 326,
340-344, 347, 348, 349, 350, 354,
356
statistical significance, 318-319, 320-321, 324 n.54, 347, 350-351, 354,
356
surveys, 307 n.8, 332
t-statistics, 320, 340-344, 347, 356
t-test, 320, 356
theory development, 311-317
time-series analysis, 317 n.37, 319,
323 n.52, 326, 345, 356
trends, 345
two-tailed tests, 321, 356, 577 n.83
voting rights cases, 307
Myelofibrosis, 217 n.14
National Academy of Engineering, 46
National Academy of Sciences, 46, 47, 649
National Aeronautics and Space
Administration, 927
National Ambient Air Quality Standards
(NAAQS), 529, 666
National Cancer Institute, 564 n.48
National Conference of Lawyers and
Scientists, 8
National Electronic Injury Surveillance
System, 909
National Fire Protection Association,
933-934
National Forensic DNA Review Panel,
156 n.53
National Geographic Society, 408 n.212
National Health and Nutrition Examination
Survey (NHANES), 536-537
National Highway Traffic Safety
Administration, 910-911, 920
National Human Genome Research
Institute, 149
National Institute for Environmental
Health Sciences, 20
National Institute of Forensic Sciences
(proposed), 66
National Institute of Justice, 156 n.53
National Institute of Standards and
Technology, 81
National Institutes of Health (NIH), 45,
646, 678, 733
National Quality Forum, 701
National Research Council, 20, 46
Ballistic imaging report, 99
DNA reports, 60-61, 125, 127, 133,
134 n.12, 141 n.19, 143, 161, 162,
163 n.72, 164 n.75, 166-167, 168
n.84, 169 n.89, 170 n.95, 174
n.110, 175, 176 n.114, 185, 187-188, 192 n.170
Forensic Science Report, 27, 30, 60
n.23, 64-70, 71, 74 n.106, 75, 76,
77-78, 79, 82 n.171, 84, 85, 97,
100, 105 n.314, 108, 113-114,
115-116, 119-120, 121, 122, 126
National Rifle Association, 411
National Science Foundation, 45
National Toxicology Program (NTP),
655, 656 n.64
Natural experiments, 290, 312, 355
Natural Resources Defense Council, 678
Negligence, 893
Neuroimaging
accuracy and robustness of results,
781-782
computer assisted tomography (CAT),
762-763, 837-838, 893
countermeasures, 776, 783-784, 801,
805, 810
diffusion tensor imaging, 768
electroencephalography (EEG), 761,
766, 772-773, 791-792, 796, 803,
838-839
false negatives, 780, 782-783, 788
false positives, 780-781, 782-783, 788
functional MRI (fMRI), 765, 766,
768-772, 773, 775, 776-777, 778,
779, 780, 781, 782-783, 786,
787-788, 791, 796, 797-798, 800,
803-807, 809, 810-811, 838
magnetic resonance imaging (MRI),
761, 763, 766-772, 773, 777-778,
781, 797, 800 n.52, 837-838
magnetoencephalography (MEG), 761,
772-773
positron emission tomography (PET),
761, 763-765, 766, 773, 776, 785,
800 n.52, 838
single photon emission computed
tomography (SPECT), 761, 765-766, 838
Neuroscience evidence (see also Brain)
admissibility issues, 784-796
blood oxygen level dependent
(BOLD) response, 768, 770, 771,
782, 783, 787-788, 804
character evidence, 789-790
Confrontation Clause and, 789
criminal responsibility (culpability)
determinations, 749, 772, 798,
799-801
Daubert and, 785-786, 806
Due Process Clause and, 792
Eighth Amendment and, 795
Employee Polygraph Protection Act,
792-794
examples of uses in litigation, 796-811
experimental design issues, 777-779
Fed. R. Evid. 403 and, 788-789, 806
Fed. R. Evid. 406 and, 790
Fed. R. Evid. 702 and, 776, 785-788,
806
Fifth Amendment and, 790-792
First Amendment and, 792
Fourteenth Amendment and, 792
Fourth Amendment and, 796
Frye, 806-807
group averages applied to individuals,
780-781
“habits of mind,” 788-789
insanity defense, 749, 763, 785, 786,
799, 800
interpreting study results, 776-784
lie detection, 772, 773, 776-777, 778-779, 784, 787, 789 n.24, 790-791,
792-795, 797, 799, 801-807
mitigation in capital cases, 785, 795,
798, 799, 800
pain detection and management, 749,
753, 775, 778, 780-781, 799,
807-811
predisposition evidence, 789-790, 801
privacy rights, 792, 793-794, 796
privilege against self-incrimination,
790-792
relevance, 749, 773, 776, 777-778,
779, 785, 786, 788, 789-790, 792,
796, 797, 798, 799, 800, 801, 810
replication of finding, 776, 777, 804,
810
representativeness of studies, 779
Seventh Amendment and, 794-795,
805
Sixth Amendment and, 794-796
statistical issues, 770, 776, 777, 780,
782-783, 787, 801, 803, 804, 805,
809, 810
techniques, 760-776 (see also
Neuroimaging)
Neutron activation analysis, 120 n.415,
123 n.440, 126
New Scotland Yard, 73, 78 n.143
Newton, Isaac, 42, 43
Non-Hodgkin’s lymphoma, 552 n.4,
740-741
Nonhuman DNA testing
affinal model, 198 n.194, 199
cats, 196, 197
dogs, 193, 197, 198 n.193
HIV, 194, 195 n.181
individual organisms, 195-198
microbial bioterrorism agents, 149, 194
mitochondrial, 193, 194, 195 n.180
phylogenetic analysis, 193, 194-195
sequencing/profiling methods, 196,
197-198
species and subspecies, 193-195
Nurses’ Health Study, 717
O
Objections to expert testimony, 29 (see
also Daubert)
Obscenity cases, 224, 365
Observational studies
causation, 215-216, 218, 220-222
epidemiology, 555-563, 566, 581, 590,
592, 593, 607, 608, 624, 625, 627
natural experiments, 312
Observer effects, 67-68
Occupational exposures, 26, 217 n.14,
505-506 n.5, 511, 516, 517, 526,
529, 536, 539, 540, 663 n.81
Occupational Safety and Health
Administration (OSHA), 528 n.64,
529 n.65, 649 n.44, 650, 658
n.70, 666 n.88, 670 n.97, 672, 673
n.107, 920
Odds ratio, 235, 289, 291, 566, 568-569,
573, 584, 589, 625, 738
Oklahoma City bombing, 949-950
Otero v. Warnick, 109-110
P
Paradigm, defined, 41-42
Parlodel litigation, 25 n.69, 562 n.38, 564
n.47, 591 n.122
Partial-birth abortion cases, 7
Patents
assessment of claims in, 935, 946
conflicts of interest, 48-49
drugs, 854
expert testimony, 933, 945-946
infringement cases, 3-4, 6, 193, 306,
307 n.12, 309, 311, 319, 321, 440
n.24, 441, 945-946
Paternity cases, 132, 158 n.56, 164, 172,
174, 206
Peer review, 13, 44-45, 48, 49 n.16, 50,
53, 64, 66, 102 n.290, 103 n.300,
108, 328, 375, 508 n.11, 608,
617-618 n.212, 677, 678, 679,
776-777, 786, 787, 803, 806, 866
nn.306 & 309, 886, 901, 931, 935,
938, 958 n.106
People v. Collins, 712
People v. Jennings, 58, 73, 81-82
People v. Linscott, 117 n.387, 124
People v. Marx, 110
People v. Pizarro, 147, 183, 184
Perfluorooctanoic acid, 505 n.3
Permissible Exposure Limits (PELs), 529
Personal injury cases, 435, 455, 470, 475,
665, 689, 690 n.8, 816, 901, 938,
942, 947
Pesticides, 4, 510, 516, 517, 520, 522,
527 n.60, 528-529, 530 n.66, 537,
543, 546, 586, 643 n.28, 648 n.42,
650 n.48, 652, 662 n.78, 671-672
n.102
Pfizer, Inc. v. Astra Pharmaceutical Products,
Inc., 387
Pharmacokinetics, 535, 540
Pheochromocytoma, 706
Philip Morris, Inc. v. Loew’s Theatres, Inc.,
231
Physician-assisted suicide, 3
Polychlorinated biphenyls (PCBs), 14,
15, 522, 545, 565 n.48, 581 n.89,
586 n.109, 640, 652, 661 n.77, 665
n.88, 668
Polygraph evidence (see also Lie detection)
accuracy level, 369, 805
admissibility of, 60, 790, 794, 795-796
as character evidence, 789 n.24
countermeasures, 783-784
Frye test, 60
history, 802
for mitigation in sentencing, 795
pretrial uses, 807
Sixth Amendment and, 794
statutory protections, 792-793, 807
surveys of general acceptance, 365
n.22, 367-368, 369
validity, 228
Popper, Karl, 40-41, 43, 44, 49, 50, 53, 64
Population
frequency data, 74, 85, 113, 119, 148,
155, 163, 164-165, 166, 178-179,
191, 195, 196 n.185, 197, 200,
203, 204-205, 207, 275
randomly mating, 165, 198, 204, 208
statistics, 217, 221 n.23, 223-225, 226-227, 241-250, 275-277, 278-279,
292, 295, 296, 298, 299, 300, 301
Positron emission tomography (PET), 718
n.117, 761, 763-765, 766, 773,
776, 785, 800 n.52, 838
Post hoc ergo propter hoc fallacy, 217 n.14,
669 n.94
Preponderance of the evidence standard,
32 n.95, 252 n.103, 271 n.138,
314 n.33, 319 n.43, 565 n.48, 577,
610 n.182, 611 n.187, 659 n.72,
665, 692, 714
Presidential Recordings and Materials
Preservation Act of 1974, 242
Price fixing, 31 n.91, 306 n.6, 321 n.48,
322 n.50, 349 n.90, 491
Probability
conditional, 205, 209, 259 n.122, 273-274, 284, 710, 712, 742
expressing testimony as, 123
forensic evidence match, 72, 118 n.401
medical probabilistic reasoning,
707-714
posterior, 172, 173-174, 241-242,
258-259, 274, 275, 283, 710, 742
prior, 173, 174, 241-242, 258, 259,
274, 275, 282, 283, 710, 711, 744
random-match, 60, 118 n.401, 135,
155, 164, 165, 167-171, 172, 173,
175-176, 181 n.34, 182, 186, 187,
188, 189, 190, 191-192, 196, 197,
198 n.194, 205, 208, 251 n.99
sampling, 184, 226, 230, 238 n.71,
241, 246, 248, 283, 293, 295, 299,
332, 355, 361, 362 n.8, 380, 381,
382, 385, 398 n.175, 408, 416,
419, 420, 421
theory, 173, 214, 258 (see also
Bayesian approach)
Product liability
causation, 24, 947
conflicts of interest, 21
damages, 22, 942-943
design disputes, 939-940
engineering testimony, 939-943, 947
exposure to contaminants, 505, 506,
507, 509-511, 515, 516, 519-520,
524, 526, 527, 528, 533, 537, 545,
546
manufacturing issues, 941
medical testimony, 22, 689, 693, 703,
731 n.189
personal injury, 22, 942
proof of defect, 943-944
sale and marketing concerns, 942-943
warning issues, 941-942
Proficiency testing (see also individual
disciplines)
blind, 61, 68, 69, 70 n.83, 80, 81, 98,
156, 160-161, 162, 171 n.98, 196,
207
and error rate, 69 n.80, 161-162,
171-172
external, 156
internal, 155-156
mandatory, 69-70
obstacles to, 28
programs, 69 n.82, 78, 85, 87, 88, 98
quality of tests, 78-79, 86-87, 98
reporting results in court, 171
types, 69
Proof (see Causation; Standard of Proof)
Prosecutor’s fallacy, 712
Prostate-specific antigen (PSA) test,
735-736
Publication of research, 13, 45 n.11, 50,
64, 155, 328, 375, 384 n.108,
431, 590, 617 n.211, 678, 776-777, 786, 787, 806, 827 n.23, 870
n.332, 886, 901
p-value (see individual disciplines)
Q
Qualifications of experts (see also
Proficiency testing; individual
disciplines)
advisory panel memberships, 678-679
certification, 68-69, 677-678, 874
conflicts of interest, 8, 21-22, 875
consideration of contrary conclusions,
22-23
Daubert and, 22-23, 431-432
discretion of judges in evaluating, 135
education and training, 328, 375, 431,
675-677, 870-871
experience, 431, 818, 822, 823, 824,
825, 826, 827, 841, 846, 848, 854,
871-873, 881, 889, 890
Federal Rule of Evidence 22, 702
knowledge requirement, 22-23
licensure, 873-874
prior relationships, 875-877
professional autonomy, 215-216
professional memberships, 328,
677-678
publications, 328, 375, 678
research grants, 678
university appointments, 679
Questioned document examinations,
83-84, 85, 87 n.198, 88, 89 n.215,
90, 121-122 (see also Handwriting
evidence)
R
R. v. Beamish, 196-197
Racial discrimination, 2, 7, 233, 267-268
n.133, 287, 306, 313, 314 n.33,
319, 324 n.53, 365 n.19
Radioactive iodine, 654, 662, 663, 666-667
Randomized Aldactone Evaluation Study,
729-730
Randomly mating population, 165, 198,
204, 208
Rape cases, 104, 131, 132, 159 n.58, 183,
370, 689 n.6 (see also Sexual assault
cases)
Rate regulation cases, 307
“Reasonable medical certainty”
testimony, 123
“Reasonable scientific certainty”
testimony, 122-124
Redistricting litigation, 2, 267-268 n.133,
307 n.9
Regulatory contexts
engineering evidence, 919, 947-948,
951-952, 953
exposure science and, 505, 506, 509-511, 517 n.31, 528-529, 534
presumptive validity of government
testimony, 638 n.11
toxicology, 635, 636, 637, 638 n.10,
639, 640, 642, 644-645, 646, 647,
648 n.40, 649, 650-651, 654, 656
n.54, 657 n.67, 660, 665-666, 669
n.95, 670 n.96, 678
Reliability of scientific testimony (see also
specific disciplines)
between-observer variability, 228
correlation coefficients, 213, 227, 228,
260, 261-264, 265, 266, 286, 290,
301, 333
DNA identification, 60, 62 n.32, 73,
227
error rates and, 13, 29, 49 n.16, 53,
64, 65, 69 n.80, 78, 79, 86, 87,
88-89, 97, 99, 102 n.290, 103
n.300, 122, 162, 171 nn.96-98,
214, 217, 259 n.122, 628, 724-725, 787, 806, 890, 903
falsifiability or testability, 13, 16,
40-41, 43-44, 48, 49, 53, 63, 64,
866 n.309
general acceptance of methodology,
12, 13, 52 n.20, 53, 59-60, 63, 82,
102 n.291, 103 n.300, 110 n.343,
133 n.7, 166, 173 n.102, 186, 189,
195 n.183, 197, 367, 368, 617
n.211, 806-807, 866, 867, 949
peer review of research and, 13, 44-45,
48, 49 n.16, 50, 53, 64, 66, 102
n.290, 103 n.300, 108, 328, 375,
508 n.11, 608, 617 n.211, 677,
678, 679, 776-777, 786, 787, 803,
806, 866 nn.306 & 309, 886, 901,
931, 935, 938, 958 n.106
publication of research and, 13, 45
n.11, 50, 64, 155, 328, 375, 384
n.108, 431, 590, 617 n.211, 678,
776-777, 786, 787, 806, 827 n.23,
870 n.332, 886, 901
validity distinguished from, 13 n.11,
71-72
within-observer variability, 227-228
Reproducibility of research, 48
Rex v. Castleton, 73
Rhodes v. E.I. du Pont de Nemours & Co.,
31, 32, 505 n.3, 649 n.47
“Right-to-die” cases, 3, 5
Risk
absolute, 738
acceptable, 908, 909-910, 915-920
assessment, 505, 506, 507, 510-511,
525, 526, 528, 533, 534, 535, 536,
537, 538 n.84, 547, 603 n.160,
637, 642-643, 648-651, 657 n.67,
661-662, 663, 673 n.108, 678
n.116, 679, 683, 909, 910-914
attributable, 566, 570-571, 612 n.191,
619
calculations, 911-912, 918
cancer, 635, 638 n.12, 642-643, 644-645, 649 n.46, 650, 653, 654, 655,
656, 659, 660 n.74, 665, 668-669,
670, 683
characterization, 637, 638 n.12, 645
n.31, 659, 683
communication, 737-739
extrapolation, 547, 603 n.160, 651,
661-662
low-dose risk curve, 673 n.107
medical, 737, 738
relative (RR), 234 n.62, 247 n.91,
295, 566-568, 569, 570 n.62, 572,
573, 574, 575-576, 577 n.82, 578
n.85, 579, 580, 581, 582, 591, 594,
601, 611, 612, 614-615, 616, 619,
621, 626, 627, 659, 737, 738, 851
n.214
violence (future dangerousness), 3,
819, 846-851, 878 n.354, 893
Risk World Web site, 910
R.J. Reynolds Tobacco Co. v. Loew’s
Theatres, Inc., 231
Rochin v. California, 792
Rock v. Arkansas, 795
Rofecoxib (Vioxx), 731-732
Rorschach ink-blot test, 836, 886
Royal College of Physicians, 705
Rupe v. Wood, 795
S
Sacco and Vanzetti trial, 58, 91
Schizophrenia, 800, 832, 839, 842, 847,
853, 858, 859, 860, 863, 864, 872,
877 n.351, 893
Science and scientists
Daubert’s definition, 39 n.3
evidence defined, 51
“good science,” 64 n.45
historical background, 38-39, 42
honesty and integrity, 43, 45, 50
importance in litigation, 3-4
indicators of good science, 13, 49 n.16
institutions, 45-46
language of, 51-52
law compared with, 51-52
legal training and education, 9
myths and facts about, 47-50
objectives, 52
peer review and, 44-45, 48, 49 n.16,
50, 53
as profession or career, 45-47
pseudoscience, 49, 52
reward system and authority structure,
46-47
theories, 39-45, 50, 51
view of Daubert, 52-54
Science, State, Justice, Commerce, and
Related Agencies Appropriations
Act, 64
Scientific method
criteria for judging, 53 (see also
Reliability of scientific testimony)
Francis Bacon and, 39-40, 42, 43, 50
Kuhn’s paradigm shifts, 41-43
Popper’s falsification theory, 40-41
synthesis of multiple studies, 20, 23
Scientific revolutions, 41-43
Scientific Working Group for Firearms
and Toolmarks, 93 n.236, 154
n.46, 156, 157 n.54, 159
Scientific Working Group on DNA
Analysis Methods (SWGDAM),
62 n.29
Scotland Yard, 73
Securities and Exchange Commission, 484
Securities litigation, 429 n.1, 431, 448
Semen evidence, 58, 62 n.32, 143, 151,
155, 159 n.58, 169 n.89
Seventh Amendment rights, 5, 21, 794-795, 805
Sex discrimination, 233, 234-235, 257
n.115, 270-271, 272, 279-282,
305-306, 313, 315-317, 318 n.42,
319-320, 323-324, 336-347, 350-351, 365-366
Sex offenders, 819, 849, 859-860
Sex-typing test, 146-147
Sexual assault cases, 104, 147, 158, 182,
184, 689 n.86, 800, 871 n.332
Shoe print evidence, 57, 62 n.32
Silicone gel breast implant litigation, 6, 7,
247 n.91, 589 n.117, 638
Single photon emission computed
tomography (SPECT), 761, 765-766, 838
Sixth Amendment, 794-796
Society for Neuroscience, 750 n.2
Society for Psychophysiological Research,
367
Society of Breast Imaging, 727
Society of Occupational and
Environmental Health, 678
Society of Toxicology (SOT), 678
Soil contaminants, 507, 510 n.15, 517,
524 n.49, 531, 532, 536 n.78, 537,
667, 669 n.95
Soil evidence, 57, 62 n.32, 669 n.95
St. Valentine’s Day Massacre, 91
Standard of proof
“more likely than not,” 27 n.79, 32
n.95, 102, 123, 463, 577 n.81,
611, 612, 616
preponderance of the evidence, 32
n.95, 252 n.103, 271 n.138, 314
n.33, 319 n.43, 565 n.48, 610
n.182, 611 n.187, 659 n.72, 665,
692, 714
reasonable certainty, 123, 434 n.11,
461-463, 468, 691-692, 693, 694
n.27, 716
Standard of review, 14, 16, 17, 18, 19,
21, 25, 100 n.279, 101 n.282, 104
n.303, 112 n.353, 226 n.36, 563-564 n.44, 565 n.48, 693, 827 n.73,
846 n.179, 874 n.343, 947 n.83
Standard Table of Mortality, 365
Stanley v. Georgia, 792
State of Arizona v. Bogan, 195-196
State of Arizona v. Garrison, 107 n.319,
215
State of Arizona v. Krone, 109, 110 n.341
State of Connecticut v. Pappas, 178, 179,
180 n.133, 181 nn.135 & 136
State of Louisiana v. Schmidt, 194, 195
State of Washington v. Leuluaialii, 197
Statistics, 211-302 (see also Multiple
regression)
absolute value, 275, 282, 283
admissibility and weight of studies,
214
aggregation from multiple sources,
235, 254 n.107
alternative hypotheses, 205, 254 n.106,
255 n.110, 257, 276, 278, 283,
297, 299, 300, 319-321, 353, 582
anecdotal evidence, 217, 218, 220, 310
applied statistics, 214, 229, 328
appraisal approaches, 242-244, 248-249, 278
artifacts from multiple testing, 256
associations between variables, 213,
217-218, 219, 221-222, 230, 233-235, 252 n.103, 253, 254, 260-263, 264, 265 n.129, 266, 285,
286, 291, 295, 298, 312, 321, 352,
356
averages, 219, 226 n.35, 238 n.70, 241,
242-243, 244-245, 246, 248 n.95,
264, 265 n.129, 266, 269, 278-279,
284, 287, 289, 294, 298, 300
base of a percentage, 233
Bayesian approach (subjectivist), 173,
174, 242 n.48, 273-275
benchmarks, 230-231
bias, 220, 224-225, 226, 227, 240,
241, 246, 249, 256, 266 n.130,
283, 285, 290, 293, 296, 312, 314,
315, 322 n.50, 325, 327, 352
categories for comparison, 231-232
causality inferred from, 213, 216-223,
249, 260-272, 288
census undercount litigation, 2-3,
213, 223-224, 247 n.90, 268, 275
n.149, 307, 308
center-of-distribution measure,
238-239
central limit theorem, 276, 278, 279,
284, 285
changes in data collection, 232
coding reliability, 227-228
collection of data, 216-230, 231
comparisons, 233
conditional probability, 205, 209, 273-274, 284, 710, 712, 742
confidence intervals, 165 n.76, 213,
230, 240 n.83, 241, 243-246,
247, 248, 249, 252-253, 255, 259
n.121, 284-285, 289, 321, 332
n.69, 342-343, 352
confounding variables, 219, 220, 221,
222, 240, 257 n.115, 262-264,
265, 266, 285, 286, 288, 289,
298-299
convenience sample, 224-225, 248,
285, 287
correlation coefficients, 213, 227, 228,
260, 261-264, 265, 266, 286, 290,
301, 333
Daubert and, 214, 217 n.14, 227 n.37
dependent variables (outcome variable,
response variable), 219, 264, 268,
270, 274, 285, 286, 287, 288, 290,
294, 295
design of research, 213, 214, 216-230,
231, 240, 243, 246, 279, 301, 308-309, 310, 311-317
disclosure of methods and
nonsupporting analyses, 216
discovery process, 216, 217 n.14, 310
n.24, 330-332
discrimination cases, 213, 228, 233-234, 250, 253 n.105, 257 nn.115 &
116, 260, 270-271, 272, 275, 276,
279-282
distributions, 236-239, 248 n.95, 251,
259 n.123, 275, 276, 278, 279,
283-284, 286, 287, 288, 290-291,
292, 293, 294, 296, 297, 298, 300,
301
DNA profiling, 74 n.105, 134 n.12
ecological regression, 266, 267
enhancing statistical testimony,
215-216
epidemiological study analysis,
574-582
estimation/estimators, 213, 226 n.35,
228, 229, 230, 232 n.56, 241,
242-249, 252-253, 256-257, 258
n.117, 259 n.123, 260, 264, 265,
267, 269-270, 271, 273 n.146,
278, 279, 280, 281-282, 283, 284,
285, 287, 289, 292, 295, 296, 298,
299-300, 301
expertise/experts, 214, 215-216, 222,
248 nn.93 & 94, 251, 271, 272
external validity, 222, 301
false negatives, 115-116, 162, 232, 254
n.106, 259 n.122, 301, 577 n.81,
581-582, 620, 708, 711, 724, 780,
782-783, 788
false positives, 161-162, 170-171, 232,
251 n.100, 254 n.106, 575-581,
589, 619, 627, 708, 710-711, 713,
714, 717-718, 720, 724, 725, 735,
780-781, 782-783, 788
frequentist approach (objectivist), 187,
189, 190, 241-242, 247, 252, 254
n.106, 258, 259, 273, 275, 287,
611 n.188
gender discrimination, 233, 234-235,
257 n.115, 270-271, 272, 279-282
generalization of results, 222-223, 301
graphs, 213, 230, 236-237, 240 n.82,
260-261, 272 n.144, 276, 294, 296
histograms, 236-237, 276-279, 283,
284, 288
hypothesis (significance) tests, 40, 163,
213, 241, 249-253, 254, 255, 297,
319-320, 352, 353, 354, 356, 574-582, 626, 724, 725
income–education association, 219,
260-262, 264-266, 312
independence, 227, 228, 269, 275,
288, 325, 326, 353, 714
independent variables, 219, 221, 268,
285, 287, 288, 289, 290, 294, 295,
297, 305 n.2, 308, 353
individual measurements, 227-228
inferences drawn from data, 213, 217,
220, 221, 222, 227, 240-259, 264,
266, 270, 283
integrity of data, 229
intercept, 265-266, 267, 269-270
n.135, 280, 335, 338, 345, 347,
348, 353, 354
internal validity, 222, 228, 229, 288
interquartile range, 239
least squares method, 269-270, 271,
280, 289, 295
linear associations, 261, 262, 264-268,
286, 321, 348, 352
linear combinations, 271 n.139, 280,
287, 289, 290, 298
linear regression, 260 n.124, 264, 298,
316, 317 n.36, 336-339, 347, 353,
354
mean, 213, 230, 238, 239, 240, 247
n.92, 269, 278, 279, 284, 286, 289,
290, 291, 293, 296, 297, 298, 332,
341, 342, 343, 350, 351, 354, 356
median, 213, 230, 238, 239, 240 n.82,
289, 292, 492
misleading data, 220, 230, 231, 247
n.92, 265, 349
missing data, 223-224, 229, 332
mode, 238, 289
models and model development, 241,
253-257, 268-272, 279-281
multiple hypothesis tests, 256-257
Nixon papers valuation, 242-246, 247,
248-249, 278-279
nonresponse bias, 225, 226, 249, 290,
332
normal curve, 239 n.81, 244, 246,
255, 276, 277-278, 279, 284, 287,
298, 303, 342
normal distribution, 239 n.81, 284,
290, 292, 294, 298, 343 n.82, 354
null hypothesis, 241, 249-251, 252,
253-254, 257, 271 n.138, 275,
276, 278, 282, 283, 287, 288, 290,
291, 292, 296, 297, 299-301, 319-321, 342-343, 348, 353, 354, 356,
574, 576, 577 nn.81 & 83, 579,
581-582, 619, 620, 625, 626, 628,
629, 724, 725
observational studies, 213, 217-218,
219, 220-222, 241, 248, 269, 285,
288, 290, 291, 312, 318, 332, 340,
342, 347
odds ratio, 235, 289, 291, 566, 568-569, 573, 584, 589, 625, 738
one-tailed tests, 255-256, 291, 297,
321, 354, 577 n.83
outliers, 238, 239 n.76, 240, 262, 263,
291, 327 n.58, 345, 346, 354, 355
parameters, 241, 247, 248, 254 n.106,
255, 269-270, 271, 275, 280,
281-282, 283, 284, 287, 291, 292, 293,
295, 298, 299, 300, 312, 314 n.32,
315, 316-317, 320, 324, 325, 326,
327, 332, 336, 337, 338, 340-341,
342, 343, 344, 347, 348 n.91, 352,
353, 354, 355, 356
percentages, 88-89, 213, 224 n.31,
230-232, 233-234, 247, 248 n.93,
249 n.96, 257, 267-268, 284, 293,
318 n.40, 320, 341, 345, 355, 381,
382-383, 499, 626, 781, 787
percentiles, 2, 239, 288, 289, 292,
293, 525
population, 217, 221 n.23, 223-225,
226-227, 241-250, 275-277,
278-279, 292, 295, 296, 298, 299, 300,
301
posterior probabilities, 172, 173-174,
241-242, 258-259, 274, 275, 283,
710, 742
power of statistical tests, 174, 181
n.136, 253-254, 255, 276-277,
283, 292, 296, 579 n.88, 581-582,
607, 626, 644, 645, 724, 730, 805
n.65, 848
practical significance of results, 252,
292, 318-321, 355
presentation and analysis of data,
230-240
prior probability, 173, 174, 241-242,
258, 259, 274, 275, 282, 283, 710,
711, 744
probability sampling, 184, 226, 230,
238 n.71, 241, 246, 248, 283, 293,
295, 299, 332, 355, 361, 362 n.8,
380, 381, 382, 385, 398 n.175,
408, 416, 419, 420, 421
probability theory, 173, 214, 258 (see
also Bayesian approach)
product rule, 165-167, 198, 199,
204-205, 207, 273-274
professional autonomy, 215-216
public school funding litigation, 2
p-values, 213, 230, 240 n.83, 241,
249-256, 257, 258, 271 n.138,
281, 287, 288, 289, 290, 291-292,
296-297, 299, 300, 320-321, 324
n.54, 347, 350-351, 354, 356, 575
n.74, 576, 577 n.81, 578-579, 626,
628, 724
qualifications of statistical experts,
215-216, 275
quotas, 225, 361 n.4
random (sampling) error, 240, 241,
243, 244, 246, 248, 249 n.96, 252,
256, 257, 258, 269, 271, 280, 282,
287, 293, 295, 296, 314 n.30, 336,
337, 339, 342 n.79, 355, 388, 556
n.19, 572 n.67, 573, 574-582, 585
n.106, 589, 612-613, 621, 623,
624, 626, 627, 628
random sample/sampling, 164-165
nn.75 & 76, 178, 225-226 &
nn.32-35, 230, 241, 242, 247, 248,
249, 250, 275, 276, 277, 278, 283,
284, 288, 290, 296, 297, 299, 332,
363 n.12, 380-382, 383, 385-386,
412, 420, 421
random-start systematic sample, 299
random variables, 288, 289, 293-294,
295, 355, 356
randomized controlled studies, 218,
220, 221, 222, 230, 241, 248, 285,
294, 301, 398, 555, 556, 580 n.89,
592, 607, 621, 648 n.42, 658, 716,
718, 722-723, 724, 725, 729,
730-731, 732, 736
randomness, generally, 222, 230, 240
n.83, 285, 290, 555, 626
ranges, 237, 239, 245, 247, 250, 253,
276, 284, 288, 292, 293, 294, 298,
299, 300, 312 n.26, 333, 343, 345,
353
rates, 218, 220, 221, 226, 230-233,
234, 235, 236, 243, 250, 253
n.105, 258 n.119, 259 n.122, 266,
267, 268, 275, 279, 284, 288, 290,
291, 294, 298
recording data, 217, 229, 327, 330
redistricting litigation, 2, 267-268
n.133, 307 n.9
regression lines, 213, 260, 264-268,
294
regression models/analysis, 213, 221
n.21, 248 n.94, 256, 257 n.115,
260-272, 279-282, 284, 285, 286,
288, 289, 293, 294-295, 298 (see
also Multiple regression)
relative risk (RR), 234 n.62, 247 n.91,
295, 566-568, 569, 570 n.62, 572,
573, 574, 575-576, 577 n.82, 578
n.85, 579, 580, 581, 582, 591, 594,
601, 611, 612, 614-615, 616, 619,
621, 627, 659, 737, 738, 851 n.214
reliability of measurements, 217, 223,
227-229, 247 n.91, 248 n.94, 269,
270, 291, 295, 301, 340, 341
rival hypotheses, 163, 174, 257
sample size, 243, 246-247, 252-253,
254-255, 318-319
scatter diagrams, 213, 240 n.82, 260-262, 263, 264, 265, 267, 296, 333,
335, 337, 355
selection bias, 224-225, 226 n.36, 249,
290, 293, 296
selection ratio, 234, 235, 275
significance levels, 213, 230, 240
n.83, 241, 249-256, 257, 258, 271
n.138, 281-282, 287, 288, 289,
290, 291-292, 296-297, 299, 300,
320-321, 324 n.54, 347, 350-351,
354, 356, 575 n.74, 576, 577 n.81,
578-579, 626, 628, 724
significance testing, 40, 163, 213, 220
n.19, 241, 249-253, 254, 255-256,
291, 297, 300, 319-320, 321, 352,
353, 354, 356, 574-582, 626, 724,
725
slopes and intercepts, 265-266, 267,
269-270 n.135, 280, 335, 338,
345, 347, 348, 353, 354
Spock jury example, 249-250, 275-278
standard deviation (SD), 126, 213,
230, 239, 240, 242, 243, 247 n.92,
248, 251-252 nn.101 & 103, 278,
279 n.153, 286, 293, 298, 301,
341, 343 n.83, 344, 348, 354, 356
standard error (SE), 213, 230, 240
n.83, 241, 243-246, 248, 249, 251,
255, 258, 276, 278, 279, 281-282,
284-285, 289, 290, 293, 294, 298,
300, 316 n.35, 326, 340-344, 347,
348, 349, 350, 354, 356
stratification, 221 n.23, 226, 288, 290,
299, 308
subfields, 214
surveys, 213, 214 nn.4 & 5, 223-227,
229 n.45, 257 n.115, 290, 307 n.8,
332
technical difficulties, 247-249
theoretical statistics, 214, 241-242,
247, 250, 255-256, 258, 270, 273-275, 277-278, 279, 284, 290 (see
also Bayesian approach)
time-series analysis, 231, 317 n.37,
319, 323 n.52, 326, 345, 356
trademark infringement, 363 n.10,
366, 373, 376, 378, 379, 382
n.101, 387, 396 n.165, 397-398,
399-400, 401, 410, 413-414, 421
transposition fallacy, 250-251 n.100,
258 n.119, 259 n.122
trends, 233, 236, 264-265, 345
t-statistics, 281-282, 297, 299-300,
320, 340-344, 347, 356
two-expert cases, 215, 329
two-tailed tests, 255-256, 297, 300,
321, 356, 577 n.83
Type I (alpha) error, 251 n.100, 283
Type II (beta) error, 254 n.106, 283,
301
units measured, 226-227
units of analysis, 217, 223-225,
266-268
validity of measurement process, 222-223, 228-229, 241, 288, 301
variability measures, 239-240
voting rights cases, 213, 266-268, 307
Statutes of limitations, 134
Stearns, Richard, 6
Structured instruments and tests
administration, 888
mental status assessment, 885-889
population appropriateness, 886-887
reliability and validity, 885-886
scoring, 888-889
training considerations, 887-888
Summary judgment, 3, 14, 16, 18, 24
n.64, 309 n.20, 315 n.33, 384
n.110, 547, 644 n.29, 669 n.95,
673 n.105, 817 n.12
Survey research
acquiescence, 394, 400
admissibility under Daubert, 214, 226
n.36, 361, 363-369, 410
Atkins v. Virginia, 369-371
attorney independence, 374
audio computer-assisted self-interviewing (ACASI), 402
bias, 225, 226, 249, 290, 332, 362,
364 n.16, 373, 374, 379, 381 n.96,
383-386, 394, 395 n.160, 396,
407, 408, 410, 411-412, 416, 417
causal inferences, 398
causal propositions, 392, 397-401, 421
census undercounts, 307 n.8
change of venue, 365, 376-377 n.76,
388, 403, 413 n.228
children and other special populations,
377
clarity of questions, 362, 387-388,
389, 402-403, 406, 410
closed-ended questions, 392-394, 395,
399, 419
cluster sampling, 380, 419
community standards assessment, 224,
369-371
computer-assisted interview (CAI),
402, 412, 419
computer-assisted personal
interviewing (CAPI), 403, 405,
410, 419
computer-assisted telephone
interviewing (CATI), 402, 405,
410, 419
confidence interval, 380, 381, 383,
419
confidentiality, 405, 417, 418
consumer impressions, 361, 366, 373,
377 n.79, 378, 386-387, 393, 397,
399-400, 410, 413 n.228
consumer preferences, 231-232, 365,
377, 382, 385-386, 396 n.166,
416, 470
control groups, 394, 397-401, 421
control questions, 368, 369, 394, 401
convenience sampling, 224-225, 285,
382, 383 n.104, 385, 398 n.175,
419, 420
coverage or noncoverage error, 362
n.8, 378, 407, 419, 420
data entry, 229, 363, 405, 412-413
Daubert and, 214 n.5
design of survey, 362, 363, 367,
373-376, 381 n.97, 384, 386, 389, 394,
396, 399, 400 n.185, 406, 409, 414
n.231, 415, 416, 420
disclosure of methodology and results,
362, 373 n.62, 389 n.132, 405,
410, 413-415, 416-417
“don’t know” or “no opinion”
options, 362 n.7, 389-391, 421
economic damages determinations,
389, 431, 469-470, 482, 483, 484,
486
error and bias minimization, 362 n.7,
382, 406-407, 411-412
ethical obligations of survey research
organization, 417
on expert acceptance of, 63 n.39,
367-369
expertise in design, conduct, and
analysis, 364 n.16, 372, 375,
385-386, 398, 399, 409
expertise in testimony, 362 n.8, 367,
372, 375-376, 381 n.96, 382, 383,
385, 408, 414, 416
extrapolation of data from, 226
filters to prevent guessing by
respondents, 389-391, 420, 421
on general acceptance of scientific
expertise, 365, 367-369
identifying the appropriate population,
367, 376-377, 379, 383-384
individual testimony compared to, 372
in-person (face-to-face) interviews,
363, 382, 383 n.106, 385 n.112,
392, 396, 401, 402-403, 404, 405,
419
instrument design and structure,
387-409
internet surveys, 382 n.102, 401, 403,
405, 406-408
interviewer errors and bias, 402, 406407, 411
interviewer training and qualifications,
376, 386, 388 n.129, 389, 394,
395, 402-403, 409-410, 411
mail surveys, 383 n.106, 384, 396,
401, 402 n.193, 403, 405-406
mall intercept surveys, 386, 398 n.175
marketing, 364 n.13, 373 n.63, 382
measurement error, 362 n.8, 401, 420,
422
missing data, 229, 376, 385
mixed-mode design, 409
monitoring administration, 411-412
nonprobability sampling, 361, 382,
383 n.104, 420
nonresponse bias, 225, 226, 249, 290,
332, 362 n.8, 383-385, 407, 408,
416
objectivity in administration, 410-411
objectivity of, 362, 374, 387, 393
open-ended questions, 391-394, 406,
413, 420
order of questions, 395-396, 402, 403,
406-407, 408-409 n.217, 411, 420
pilot tests, 388, 389, 416-417
population definition and sampling,
223-225, 361, 362 n.8, 380, 381,
382, 383 n.104, 385, 398 n.175,
408, 416, 419, 420, 421
pretests, 388-389, 414 n.231, 430
primacy effect, 396, 420
probability sampling, 226, 361, 362
n.8, 380, 381, 382, 385, 398 n.175,
408, 416, 419, 420, 421
probes to clarify ambiguous responses,
389, 394-395, 402-403, 406, 410,
421
professional standards for survey
researchers, 371, 389 n.131, 417
public opinion, 369, 370-371, 403
n.195
purpose of survey, 373
qualifications of experts, 375-376, 381
n.96
questions, 368, 369, 373, 391-394,
395, 397-401, 419, 421
random assignment, 398 n.175
random error, 314 n.30, 336, 337,
339, 342 n.79, 355, 388
random sample/sampling, 332, 363
n.12, 380-382, 383, 385-386, 412,
420, 421
random selection, 398 n.175, 408
random-digit dialing, 404, 408
randomized controlled studies, 398
recency effect, 396, 420, 421
relevance of survey, 362, 363, 367,
368, 370, 373, 374, 375, 376, 377-378, 379, 380-383, 386, 407, 413
report content, 362, 364 n.13, 371,
372, 373, 376 n.75, 377, 386, 401
n.186, 405, 413 n.228, 415-417
representativeness of respondents, 226,
362, 367, 370, 379, 380-383, 384,
405, 406, 407, 409, 417
response grouping, 413, 416
response rates, 226, 362, 367-368,
383, 384-385, 390, 405-406, 407,
408, 409, 416
sample surveys, 223, 361-363, 365
n.18, 381, 382
sampling error, 362 n.8, 380-381, 382,
398, 416, 419, 421
sampling frame (or universe), 224,
225, 226, 267, 283, 292, 293, 296,
297, 377-379, 404 n.198, 406,
415, 419, 420, 421
screening respondents, 386-387, 401
n.188, 404, 415, 420, 421
selection bias, 226 n.36, 385-386, 408
self-selected pseudosurveys, 407-408
skip pattern, 402, 403, 406, 410, 421
sponsorship disclosure, 372, 374,
410-411
stratified sampling, 225, 299, 380,
381-382, 421
surveyor-respondent privilege, 417
systematic sampling, 380 n.93
target population, 362, 367, 371, 376,
377-378, 379, 382, 383, 384, 385,
386, 406, 407, 409, 415, 419, 420,
421
telephone surveys, 363, 371, 384, 396,
401, 402, 403-405, 407, 408, 410,
411, 412, 419
validation of interviews, 412
weight, evidentiary, 362-363, 368,
377-378, 379 n.89, 396 n.166,
399, 408, 413-414, 415
weights/weighting, statistical, 382,
384, 408, 416
T
Tarrance Group, 371
Technical Working Group on
DNA Identification Methods
(TWGDAM), 61-62 n.29, 154
n.46
Thalidomide, 562-563, 653
Thematic Apperception Test, 886
Theory, law vs. science, 51
Threshold Limit Values (TLVs), 529
Tissue plasminogen activator, 732-733
Toolmark evidence
ballistics, 72 n.93, 93 n.241, 96-97,
98, 99, 103 n.300
case law development, 102-103
class characteristics, 96
empirical testing, 61, 65
error rates, 98
exclusion, 27 n.79
identification testimony, 96-97
individual characteristics, 96
proficiency testing, 98
random markings, 94, 99
Toxic Substances Control Act (TSCA),
648 nn.41 & 42, 663, 666
Toxic tort cases, 19, 21, 25-26, 31 n.91,
213, 223, 238, 505, 512, 551 n.2,
635, 636, 637, 638, 639, 645, 649,
663, 665, 667 n.91, 669
Toxicology
absorption, 636, 640, 646-647, 661,
662-663, 666-667, 680, 682
acute toxicity, 641, 668, 671, 680
acute toxicity testing, 641
additive effects, 673, 680
agents of concern, 652, 653-654
animal research, 510, 563-565, 603
n.160, 625, 636, 637, 639, 640-647, 648, 654, 655, 656, 658, 659,
660, 661-662, 663, 664, 669-670,
673, 674, 675 n.111, 677, 680, 682
antagonism, 673, 680
benchmark dose, 642, 670 n.96, 680
bioassay, 644, 648, 664 n.83, 680
bioavailability of compounds, 545, 667
biodistribution of toxic agents, 636,
640, 646-647, 661, 662, 667-668,
681
biological monitoring, 639, 649 n.47,
657, 667, 680
biological plausibility of associations,
644 n.29, 661, 664-665, 680
blood analysis, 508, 509, 518-519,
535-537, 544, 635, 636, 637 n.8,
653, 656, 657, 662, 667, 672
cancer risk, 635, 638 n.12, 642-643,
644-645, 649 n.46, 650, 653, 654,
655, 656, 659, 660 n.74, 665, 668-669, 670, 683
carcinogenicity bioassay, 644, 654-655, 680
carcinogens/carcinogenicity, 643 n.29,
644, 645, 647 nn.37 & 38, 649
n.44, 650 n.49, 651, 655-656, 658
n.70, 659, 660 n.74, 670 n.97, 673
n.105, 680
chemical toxicology, 635, 636, 637-638, 639-640, 644, 645, 646, 647,
649, 654, 663, 673, 677, 681
chronic toxicity, 644, 652, 653, 672,
680
chronic toxicity tests, 644-645
clinical ecologists, 677 n.115, 680-681
clinical studies, 510, 640, 648 n.42,
656 n.64, 658, 659, 661
compounds, 640, 644, 648 n.42, 661-664,
667-669, 672, 674, 681
confounding factors, 657 n.67, 665,
672-673, 681
contact, 534
Daubert and, 638 n.9, 643 n.28, 664
n.84, 669 n.94
dermatotoxicology, 640, 641 n.24,
650 n.48, 653
design of studies, 639-646
differential diagnosis, 512 n.21, 672,
676 n.113, 681
direct-acting toxic agents, 636 n.2,
668, 681
DNA damage, 645, 654-655, 656,
663, 682
dose, dosage, 525, 636-637, 638, 641,
642, 644-645, 646, 647, 648, 651,
657 n.67, 658-659, 660, 661, 664,
665, 667 n.91, 668, 670 n.96, 673
n.107, 674, 677 n.115, 680, 681,
682, 684
dose-response curve, 646, 651, 673
n.107, 681
dose-response relationships, 635, 639,
641, 642-643, 646, 649, 651, 658,
663 n.82, 669, 670, 676, 680, 681
end points, 641, 642, 652, 653-654,
666
epidemiology and, 563-565, 603
n.160, 628, 636, 639, 644 n.29,
645-646 n.33, 647, 650-651, 655,
656, 657-660, 664-665, 674, 681
epigenetics, 643 n.28, 681
etiology, 670-671, 676 n.113, 682
excretion of toxicants, 636, 640, 646,
647, 661, 662, 668, 682
exposure assessment, 510, 533-534,
543-544, 637, 638 n.13, 642, 649,
650, 651, 656-657, 658, 665, 671,
672, 674
exposure evidence, 636, 637-638,
640, 641-642 n.26, 643 n.28, 644
n.28, 645 n.31, 647 n.37, 658,
659, 660-670
and exposure science, 505, 506, 508,
509, 518, 519, 533, 535, 537, 538,
540, 547
extrapolation from animals and cell
research to humans, 636, 641, 645,
646-647, 648, 651, 652, 658, 661-662,
664, 669-670
extrapolation from short exposures to
multiyear estimates, 648
general causation, 637 n.7, 638, 657
n.87, 659, 660-665
good laboratory practice (GLP), 647-648,
682
hazard identification, 637 n.7, 649,
650 n.47, 651, 656 n.64, 682
hydrogeologists, hydrologists, 682
immunotoxicology, 640, 653, 677
n.115, 678 n.116, 680-681, 682
in vitro research, 639, 640, 645-646,
647, 648 n.42, 654, 658, 659, 664,
674, 682, 683
in vivo research, 639, 640-645, 646,
647, 654, 664, 682, 683
indirect-acting chemicals, 668, 682
inhalation toxicology, 640, 650 n.48,
651, 656 n.65, 657 n.66, 662, 667,
668, 670 n.97, 674, 678 n.116,
680, 682
Joiner and, 638 n.9, 661 n.77
laboratory tests, 671 n.102, 672
latency period for disease, 512, 660
n.74, 668-669
legal contexts for, 635, 637-638
lethal dose 50 (LD50), 641, 682
level of exposure and, 638 n.12, 641-642,
658 n.70, 660 n.74, 665, 667
n.91, 669-670, 673 n.107, 682,
683
lifetime bioassay, 648, 680
maximum contaminant level (MCL),
670 n.97
maximum tolerated dose (MTD), 644-645,
682
medical history and, 645 n.31, 670-675, 676 n.113
metabolism, 535-536, 668, 674
molecular toxicology, 640, 645, 654-655, 656, 663, 678 n.116, 682
multiple chemical hypersensitivity, 677
n.115, 680-682
mutagens and mutagenesis, 642, 643
n.28, 645, 654-655, 670, 683
neurotoxicology, 640, 647 n.38, 663-664 n.82, 670 n.95, 678 n.116,
683
no observable effect level (NOEL),
641-642, 669-670
no threshold model, 642-643,
669-670
one-hit theory, 642, 643, 651, 683
pharmacokinetics, 646, 648, 674, 675,
683
potentiation, 673, 683
premarket testing of drugs, 648
qualifications of experts, 646 n.33,
660-661, 675-679
randomized controlled studies, 648
n.42, 658
regulatory context, 635, 636, 637,
638 nn.10 & 12, 639, 640, 642,
644-645, 646, 647, 648 n.40, 649,
650-651, 654, 656 n.54, 657 n.67,
660, 665-666, 669 n.95, 670 n.96,
678
reproductive toxicology, 640, 653,
662, 666
risk assessment, 637, 642-643, 648-651, 657 n.67, 661-662, 663, 673
n.108, 678 n.116, 679, 683
risk characterization, 637, 638 n.12,
645 n.31, 659, 683
safety assessment, 640, 647-649, 683
scientific foundation of studies, 23
specific causation, 23, 637 n.7, 638,
645 n.31, 659 n.72, 665-666, 669-670 n.95
statistical evaluation, 640, 642, 644,
645, 658-659, 670 n.96, 676
structure–activity relationships (SAR),
647 n.37, 648 n.42, 663, 683
susceptibility/sensitivity differences,
527 n.61, 636, 641 n.25, 646 n.35,
650 n.48, 661 n.77, 662-663, 666,
674
symptoms of exposure, 637 n.5, 641-642 n.26, 657 n.66, 662, 667 n.91,
669 n.94, 671-672
synergistic effect, 673, 683
systemic, 534-535
target organ dose, 636, 646
target organ specificity, 651-656,
662-663
temporal relationships, 636, 641 n.26,
664, 665, 668-669
teratogen and teratogenicity, 645 n.33,
684
threshold, 641-642, 643, 647 n.37,
650 n.49, 651, 657 n.66, 669-670,
683, 684
tort litigation, 635, 636-637, 638, 639,
645, 649, 663, 665, 667 n.91, 669
toxic agent defined, 684
Trademark infringement, 224, 308 n.12,
363 n.10, 366, 373, 376, 378, 379,
382 n.101, 387, 396 n.165, 397-398, 399-400, 401, 410, 413-414,
421
Transposition fallacy, 168, 169 n.91, 170
n.92, 173, 209, 250-251 n.100,
258 n.119, 259 n.122
Treatment of mental disorders (see also
Medications for mental disorders)
cognitive behavioral and related
therapies, 859-860
electroconvulsive and other brain
stimulation therapies, 861-862
family and couples therapies, 860
functional impairments, 860-861
group therapies, 860
prediction of response to, 863-864
psychoanalysis, 858-859
psychodynamic psychotherapy, 859,
860, 865-866
psychosurgery, 863
supportive therapy, 860
talking therapies, 860
Trigon Ins. Co. v. United States, 33
Troedel v. Wainwright, 126
Trop v. Dulles, 369
Tsar Nicholas family identification, 152,
177
U
Uniform Commercial Code, 466
United States v. Cordoba, 368
United States v. Cyphers, 123
United States v. Diaz, 101
United States v. Glynn, 27 n.79, 101-102,
124 n.444
United States v. Green, 27 n.79, 101, 122
n.427
United States v. Llera Plaza, 28, 73, 74, 79,
82 n.169
United States v. Mitchell, 74 n.105, 76-77,
82 n.170, 122
United States v. Monteiro, 98 n.265, 101,
122 n.434
United States v. Nacchio, 35
United States v. Orians, 368
United States v. Scheffer, 365 n.22, 368
n.35, 794, 795-796
United States v. Semrau, 803, 805-806
United States v. Starzecpyzel, 63 n.42, 84
n.183, 86 n.190, 89, 122 n.435
United States v. Williams, 63 n.42, 102
United States v. Yazback, 193-194
U.S. Census Bureau, 260, 365 n.18, 383,
484
U.S. Forest Service, Forest Products
Laboratory, 58 n.10
U.S. Geological Survey, 951, 958
U.S. Preventive Services Task Force,
726-727, 735, 738, 739
V
Vaginal
adenocarcinoma, 560, 609 n.178
DNA swabs, 147, 151, 158, 182, 183
Validity/validation
comparative measurements, 228
correlation coefficients, 228
criteria for determining, 484
damages data, 483-485
developmental, 155
DNA methods and procedures, 133,
134, 148, 150, 153, 154, 155, 185,
193, 195
external, 222, 301
forensic evidence, 27-28
internal, 155-156, 228-229
quantitative methods, 485
reliability distinguished from, 71-72
test-retest correlations, 228-229
Vanasen v. Tradewinds, 922
Victor Shirley, Inc. v. Creative Pipe, Inc., 35
Videotaped testimony, 7, 880-881
Vinyl chloride (monochloroethylene),
522, 605-606 n.169, 653, 672
Violence Risk Assessment Guide
(VRAG), 848
Voice stress analyzer, 792
Voiceprint evidence, 3, 62 n.32, 71 n.88,
73, 579 n.85
Volatile chemicals, 514, 520, 521, 531,
650 n.48, 657 n.66, 668
Voting Rights Act, 266-267 nn.131 &
133, 307 n.9
Voting rights cases, 213, 266-268, 307
W
Walker v. Soo Line Railroad Co., 947
Warning issues, 233, 941-942
Wechsler Adult Intelligence Scale (WAIS-III), 836
Weight-of-the-evidence approach, 15,
16, 20
Weisgram v. Marley, 18-19, 22, 63
Wilhoite v. Olin Corp., 366-367, 392
n.144
Wilson v. Corestaff Services, L.P., 803,
806-807
Women’s Health Initiative (WHI),
716-717
World Health Organization (WHO), 655,
678
World Trade Organization, 650
Wrongful death, 238 n.71, 470, 471,
473-474, 475 n.77
Wrongful termination, 470, 471, 475,
491
Z
Zuni Public Schools District No. 89 v.
Department of Education, 2
Zyprexa litigation, 24