
Studies in History and Philosophy of Biological and Biomedical Sciences 42 (2011) 497–507


Is meta-analysis the platinum standard of evidence?


Jacob Stegenga
Department of Philosophy, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0119, USA

Keywords: Meta-analysis; Evidence; Medicine; Randomized controlled trial (RCT); Sir Bradford Hill; Epidemiology

Abstract

An astonishing volume and diversity of evidence is available for many hypotheses in the biomedical and social sciences. Some of this evidence—usually from randomized controlled trials (RCTs)—is amalgamated by meta-analysis. Despite the ongoing debate regarding whether or not RCTs are the 'gold-standard' of evidence, it is usually meta-analysis which is considered the best source of evidence: meta-analysis is thought by many to be the platinum standard of evidence. However, I argue that meta-analysis falls far short of that standard. Different meta-analyses of the same evidence can reach contradictory conclusions. Meta-analysis fails to provide objective grounds for intersubjective assessments of hypotheses because numerous decisions must be made when performing a meta-analysis which allow wide latitude for subjective idiosyncrasies to influence its outcome. I end by suggesting that an older tradition of evidence in medicine—the plurality of reasoning strategies appealed to by the epidemiologist Sir Bradford Hill—is a superior strategy for assessing a large volume and diversity of evidence.

© 2011 Elsevier Ltd. All rights reserved.


1. Introduction

Biomedical and social scientists are faced with a daunting volume of evidence for many hypotheses of interest. For example, by 1985 there had been over 700 studies on the relationship between class size and academic achievement, over 800 studies on the effectiveness of psychotherapy, and 120 studies testing if the phase of the moon affects human behavior.1 The diversity of evidence available for many hypotheses in medicine and the social sciences is also daunting. Standard hypotheses regarding contemporary pharmaceutical interventions, for example, have evidence from computational models of toxicity, cell-based studies, experiments on multiple animal species (murine and canine, and sometimes primate and porcine) investigating multiple organ systems, and multiple kinds of study designs on humans. This avalanche of a large volume and diversity of evidence contributed to the formation of groups dedicated to the systematic review of evidence (such as the Cochrane Collaboration), to journals which publish reviews of existing evidence rather than evidence from original research (e.g. Annual Review of Genetics or Epidemiologic Reviews), and to methods of amalgamating evidence, including social methods, such as consensus conferences, and formal methods, such as meta-analysis. My focus in this paper is on meta-analysis. I describe the purported virtues of meta-analysis and the aims that analysts set out to achieve with this method, critically assess the details of the method, and argue that, contrary to the standard view regarding the epistemic status of meta-analysis, meta-analysis does not have the virtues that many claim for it.

Here is the definition from the U.K. National Health Service:

Meta-analysis: a mathematical technique that combines the results of individual studies to arrive at one overall measure of the effect of a treatment.

A frequent goal of using meta-analysis is to discover causal relationships and to determine the magnitude of an effect for a particular magnitude of a purported cause. To achieve this end when faced with a huge volume and diversity of evidence, many claim that, given its methodological virtues, meta-analysis is an especially good method (§2). I identify these methodological virtues as two general norms for any method of amalgamating evidence: Constraint—the use of meta-analysis should constrain intersubjective assessments of hypotheses—and Objectivity—meta-analysis should be performed

E-mail address: jstegeng@ucsd.edu

1 See, e.g., Smith & Glass (1977), Glass & Smith (1979), and Rotton & Kelly (1985).

1369-8486/$ - see front matter © 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.shpsc.2011.07.003

in a way which limits the influence of subjective biases and idiosyncrasies of particular researchers.

I describe several cases to show that the use of meta-analysis often fails to achieve Constraint (§3). Meta-analysis fails to constrain intersubjective assessments of hypotheses because numerous decisions must be made when performing a meta-analysis which allow wide latitude for subjective idiosyncrasies to influence the results of a meta-analysis. Some of these decisions are required for any method of amalgamating evidence while others are particular to the technical details of meta-analysis. The bulk of my argument involves a close examination of these decisions involved in the methodological details of meta-analysis (§4). Meta-analysis is performed by (i) selecting which primary studies are to be included in the meta-analysis, (ii) calculating the magnitude of the effect due to a purported cause for each study, (iii) assigning a weight to each study, which is often determined by the size and the quality of the study, and then (iv) calculating a weighted average of the effect magnitudes. Although meta-analysis is often used in the biological, human, and social sciences, my focus is on medical research. I draw on the published guidance of the Cochrane Collaboration, a primary institution of the so-called 'evidence-based medicine' movement which commissions a large number of meta-analyses, to help describe the methodology of meta-analysis. Finally, I end by discussing an alternative, older, and arguably better strategy for assessing a large volume and diversity of evidence (§5), associated with the epidemiologist Sir Bradford Hill (1897–1991).

Many arguments have been proposed debating whether or not randomized controlled trials (RCTs) provide the best evidence for causal hypotheses in medicine and the social sciences.2 Cartwright (2007), for instance, asks "Are RCTs the gold standard?" to which she answers 'no'. However, despite the debates surrounding the gold-standard status of RCTs, it is in fact meta-analysis which is at the top of the most prominent evidence hierarchies in medicine and social policy.3 Coining a neologism analogous to the metaphor of the gold-standard, it is widely thought that meta-analysis is the platinum standard of evidence. In what follows I criticize the purported platinum standard status of meta-analysis.

2. Constraint and objectivity

The first comprehensive meta-analysis performed on a single hypothesis with evidence from multiple sources was about extrasensory perception (Rhine, Pratt, Stuart, Smith, & Greenwood, 1940).4 Meta-analysis later became the platinum standard of evidence in medicine and the social sciences for several reasons. The sheer volume of available evidence meant that most users of evidence (e.g. physicians or policy-makers) could not be aware of all relevant evidence; a proposed solution was to produce systematic reviews of the available evidence. By the 1990s, hundreds of meta-analyses were being published every year, and recently the number of published meta-analyses has exceeded two thousand per year (Sutton & Higgins, 2008).

Meta-analysis became a prominent method in part due to the purported rigor of meta-analyses compared with qualitative methods of amalgamating evidence. In contrast with qualitative literature reviews and social methods of amalgamating evidence such as consensus conferences, meta-analyses have both quantitative inputs and outputs. The importance of using systematic methods of amalgamating evidence became apparent by the 1970s, when scientists began to review a plethora of evidence with what some took to be personal idiosyncrasies: "A common method for integrating several studies with inconsistent findings is to carp on the design or analysis deficiencies of all but a few studies—those remaining frequently being one's own work or that of one's students or friends" (Glass, 1976). An example of such a review is (Pauling, 1986), in which the Nobel Laureate cited dozens of his own studies supporting his pet hypothesis that large doses of vitamin C can reduce the risk of catching a cold, and yet he did not cite any studies contradicting this hypothesis, though several had been published (Knipschild, 1994). Similarly, a recent textbook on meta-analysis worries that unsystematic reviews (sometimes called 'narrative reviews') can fail to constrain intersubjective assessments of hypotheses: "there are examples in the literature where two narrative reviews come to opposite conclusions, with one reporting that a treatment is effective while the other reports that it is not" (Borenstein, Hedges, Higgins, & Rothstein, 2009). The solution to this problem, according to the authors of this textbook, is to use meta-analysis, a more formal method which (it is claimed) can constrain intersubjective assessments of hypotheses. Likewise, a recent statistics textbook emphasizes a worry regarding reviewers' personal idiosyncrasies—"the conclusions of one reviewer are often partly subjective, perhaps weighing studies that support the author's preferences more heavily than studies with opposing views." These authors suggest that meta-analysis is superior in this regard, since "it is extremely difficult to balance multiple studies by intuition alone without quantitative tools" (Whitlock & Schluter, 2009). The quantitative tool most often used to achieve such a 'balance' of multiple studies in medicine (and the social sciences) is meta-analysis.

The best account of the scientific value of meta-analysis is rather simpler than one might suppose. One might think that an aim of meta-analysis is to satisfy a principle stipulating the consideration of all available evidence for a hypothesis (such as Carnap's "Principle of Total Evidence"). However, as I argue below, meta-analyses violate such a principle because they normally include only a small fraction of available evidence. Alternatively, one might think that an aim of meta-analysis is to satisfy a principle of robustness: hypotheses are often said to be more likely to be true if they are supported by evidence from multiple independent sources.5 However, because meta-analyses usually include only evidence from a narrow range of methodological diversity (such as RCTs), such evidence typically fails to be methodologically independent, which is often said to be a requirement of robustness arguments. One proposal to amalgamate diverse evidence is to use the evidence to build causal models, or models of a network of interconnected causal relations (Cartwright & Stegenga, 2011; Danks, 2005). Accordingly, one might think that an aim of meta-analysis is to construct causal models. But meta-analyses amalgamate evidence on a single causal relation, not on a network of interconnected causal relations.

Instead, the best justification or explanation of the value of meta-analysis is statistical: many purported causes in medicine and the social sciences have a small observable effect, and so when analyzing data from a single study on an intervention with a small effect, there might be no statistically significant difference between the

2 See, e.g., Worrall (2002, 2007), Borgenson (2008), Banerjee & Duflo (manuscript), Duflo & Kremer (manuscript), Deaton (2008), and Cartwright (2007, 2010).
3 Meta-analysis is at the top of the evidence hierarchies in the evidence ranking schemes of the Oxford Centre for Evidence-Based Medicine, the Scottish Intercollegiate Guidelines Network, and the Australian National Health and Medical Research Council. As I discuss below, however, those meta-analyses which are usually considered to be the best are those which include only RCTs.
4 This is a nice historical accident, because Hacking (1988) showed that the practice of randomizing subjects into different groups also began in psychical research—thus both our gold standard of evidence and our platinum standard of evidence come from research in paranormal psychology.
5 See, e.g., Wimsatt (1981), Trout (1995), Thagard (1998), Douglas (2004), and Stegenga (2009).
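As an illustrative aside on the four-step procedure described above (selecting studies, computing an effect magnitude per study, weighting each study, and averaging), the core computation can be sketched in a few lines. The study numbers below are invented, and the inverse-variance weighting is one common choice of weighting scheme, not an example from this paper.

```python
# Minimal sketch of steps (i)-(iv): a fixed-effect, inverse-variance
# weighted average of per-study effect estimates. The study values are
# invented for illustration; real meta-analyses involve many more choices.
import math

# (i) selected primary studies, as (effect estimate, standard error) pairs
studies = [(0.30, 0.15), (0.10, 0.20), (0.25, 0.10)]

# (ii)-(iii) per-study effect magnitudes with inverse-variance weights
# (larger, more precise studies receive more weight)
weights = [1.0 / se**2 for _, se in studies]

# (iv) weighted average of the effect magnitudes, with its standard error
pooled = sum(w * e for (e, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))

print(round(pooled, 3), round(pooled_se, 3))
```

Note how the third (most precise) study dominates the pooled estimate; this sensitivity to how weights are assigned is exactly the kind of decision point discussed in §4.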

experimental group and the control group. But by pooling data from multiple studies the sample size of the analysis increases, which tends to decrease the width of confidence intervals, thereby potentially rendering estimates of the magnitude of an intervention effect more precise, and perhaps statistically significant. One aim of meta-analysis, then, is quantitative precision. Such quantitative precision is perhaps best construed as a means to the end of constraint on intersubjective assessments of hypotheses.

In short, meta-analysis is a method to assess and amalgamate evidence from multiple studies. Relative to other methods of amalgamating evidence, such as informal literature reviews or social methods like consensus conferences, meta-analysis is said to have the virtues of constraining intersubjective assessments of hypotheses and doing so in a way which is not infused with the subjective idiosyncrasies of the analysts. The purported rigor, transparency, quantitative precision, and freedom from personal bias can be summarized by these two general norms for any method of amalgamating evidence:

Constraint: An evidence amalgamation method should constrain intersubjective assessment of hypotheses.

Objectivity: An evidence amalgamation method should not be sensitive to idiosyncratic or personal biases.

A straightforward way of construing the relation between these two norms is that Objectivity is in the service of Constraint: an evidence amalgamation method can constrain intersubjective assessments of hypotheses only if it is not sensitive to analysts' idiosyncratic or personal biases. It is beyond the scope of this paper to provide a full explication and assessment of these two norms.6 Nevertheless, they are, prima facie, worthwhile norms for any method of amalgamating evidence. The important point for my present purpose is that statisticians, institutions of evidence-based medicine, and other defenders of meta-analysis claim that, compared with other methods of assessing and amalgamating a large volume of evidence, meta-analysis best satisfies these norms. This is the basis of the purported platinum standard status of meta-analysis.

However, in the following section I argue that meta-analysis, unfortunately, often fails to satisfy these norms (§3). In §4 I argue that the details of the methodology of a meta-analysis require many decisions at multiple stages which allow wide latitude for an analyst's idiosyncrasies to affect its outcome.

3. Failure of constraint

Epidemiologists have recently noted that multiple meta-analyses on the same hypotheses, performed by different analysts, can reach contradictory conclusions. For example, there have been numerous inconsistent studies on the benefits and harms of a newer synthetic dialysis membrane versus an older cellulose membrane for patients with acute renal failure: one recent meta-analysis of these studies found greater survival of such patients using the newer synthetic membrane compared with those using the older cellulose membranes (Subramanian, Venkataraman, & Kellum, 2002), while another meta-analysis reached the opposite conclusion (Jaber et al., 2002). Here is another example. Two meta-analyses published in the same issue of the British Medical Journal came to contradictory conclusions regarding whether or not an association exists between the use of selective serotonin reuptake inhibitors (SSRI, a common class of antidepressant) and suicide attempts. In the meta-analysis reported by Gunnell, Saperia, and Ashby (2005), there was no association between SSRI use and suicide attempts, and only a weak association between SSRI use and risk of self harm. In contrast, in the meta-analysis reported by Fergusson et al. (2005), there was a relatively strong association between SSRI use and suicide attempts. Similarly, contradictory conclusions have been reached from meta-analyses on the benefits of acupuncture and homeopathy, mammography for women under fifty, and the use of antibiotics to treat otitis (see e.g. Linde & Willich, 2003).

There is good reason to think that differential outcomes between contradictory meta-analyses are associated with the analysts' professional or financial affiliations. Several meta-analyses have recently been published which amalgamate evidence testing if formaldehyde exposure causes leukemia. Bachand, Mundt, Mundt, and Montgomery (2010) and Collins and Lineker (2004) conclude that formaldehyde exposure does not cause leukemia. In contrast, Bosetti, McLaughlin, Tarone, Pira, and La Vecchia (2008) found a modest elevation of risk of developing leukemia in professionals who work with formaldehyde, such as pathologists and embalmers. Zhang, Steinmaus, Eastmond, Xin, and Smith (2009) found an even higher risk of developing leukemia among professionals who work with formaldehyde. The meta-analyses which concluded that formaldehyde exposure is not associated with leukemia were performed by employees of private consulting companies.7 In contrast, the authors of the two meta-analyses that found some evidence for a causal link between formaldehyde exposure and leukemia worked in academic and government institutions.8 Lest readers think this is a crude ad hominem anecdote regarding an isolated example, consider the following similar cases.

Barnes and Bero (1998) performed a quantitative assessment of multiple meta-analyses which reached contradictory conclusions regarding the same hypothesis, and found a correlation between the outcomes of the meta-analyses and the analysts' relationships to industry. They analyzed 106 review papers on the health effects of passive smoking: thirty-nine of these reviews concluded that passive smoking is not harmful to health, and the remaining 67 concluded that there is at least some adverse health effect associated with passive smoking. Of the variables investigated, the only significant difference between the analyses that showed adverse health effects versus those that did not was the analysts' relationship to the tobacco industry: analysts who had received funding from the tobacco industry were 88 times more likely to conclude that passive smoking has no adverse health effects compared with analysts who had not received tobacco funding.

Here is yet another example. Antihypertensive drugs have been tested by hundreds of studies, and as of 2007 there had been 124 meta-analyses on such drugs. Meta-analyses of these drugs were five times more likely to reach positive conclusions regarding the newer drugs if the reviewer had financial ties to a drug company (Yank, Rennie, & Bero, 2007). Or consider the meta-meta review of meta-analyses of studies on spinal manipulation as a treatment for lower back pain: some meta-analyses of this intervention have reached positive conclusions regarding the intervention while other meta-analyses have reached negative conclusions, and a factor associated with positive meta-analyses was the presence of a spinal manipulator on the review team (Assendelft, Koes, Knipschild, & Bouter, 1995).

Such examples could easily be multiplied. I have made no attempt to comprehensively document the cases in which multiple meta-analyses on the same hypothesis reach contradictory conclusions. These examples are merely meant to show that multiple

6 Recent excellent scholarship has investigated the notion of objectivity, both from a historical perspective (e.g., Daston & Galison, 2007) and from a philosophical perspective (e.g., Douglas, 2004).
7 In the case of Collins & Lineker (2004) one of the authors was an employee of The Dow Chemical Company. An organization representing the chemical industry estimates that formaldehyde exists in products which account for more than 5% of the U.S. gross national product (cited in Zhang et al., 2009).
8 I am grateful to Heather Douglas for bringing this example to my attention. She should not, of course, be held responsible for my interpretation of the case.
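The quantitative-precision rationale sketched at the start of §2's statistical justification, namely that pooling data from multiple studies narrows confidence intervals and can render a small effect statistically significant, can be illustrated with a toy calculation. All numbers here are invented; the only assumption used is that the standard error of a difference in group means falls roughly as one over the square root of the sample size.

```python
# Toy illustration (invented numbers): the standard error of a
# difference in group means falls as 1/sqrt(n), so an effect too small
# to detect in one study can become detectable in pooled data.
import math

sigma = 1.0    # assumed within-group standard deviation
effect = 0.1   # assumed small true effect, in the outcome's units

def z_statistic(n_per_group):
    se = sigma * math.sqrt(2.0 / n_per_group)  # SE of the mean difference
    return effect / se

# a single small study vs. progressively larger pooled samples;
# z above roughly 1.96 corresponds to significance at the 0.05 level
for n in (50, 500, 5000):
    print(n, round(z_statistic(n), 2))
```

Only the largest (pooled) sample pushes the test statistic past the conventional significance threshold, which is the whole statistical appeal of amalgamating studies.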

meta-analyses of the same primary set of evidence can reach contradictory conclusions, not that they must, or even often do, reach contradictory conclusions. The examples suggest that idiosyncratic features of analysts influence the results of meta-analyses. Moreover, the features of meta-analysis which explain its occasional failure to attain Constraint are shared by all meta-analyses. That is, the conditions under which multiple meta-analyses of the same primary evidence can reach contradictory conclusions are inherent features of the methodology common to all meta-analyses. I now turn to a detailed examination of the methodology of meta-analysis.

4. Inherent subjectivity

The failure of Constraint in the above cases is at least partially a consequence of the failure of Objectivity: constraint on intersubjective assessments of hypotheses is not met by the meta-analyses in §3 because the meta-analyses were not sufficiently objective. Subjectivity is infused at many levels of a meta-analysis: when designing and performing a meta-analysis, decisions must be made—based on judgment, expertise, and personal preferences—at each step of a meta-analysis, which most importantly include the:

(i) Choice of primary evidence
(ii) Choice of effect measure
(iii) Choice of quality assessment scale
(iv) Choice of averaging technique

Some of these choices are not specific to meta-analysis (i and perhaps iii), but are nevertheless relevant to explaining the shortcomings of meta-analysis, while others are particular to the technicalities of meta-analysis (ii and perhaps iv). The general principles of meta-analysis are simple and are not unique to the biomedical or social sciences. For example, a common method of combining multiple expert probability forecasts (say, for sunshine in three days, or for a stock price increase in the next fiscal quarter, or for a victory for a presidential candidate) is to calculate a statistical average: when multiple experts give probability forecasts, a standard way to combine these multiple forecasts into a single forecast is to simply calculate an average of the probabilities. However straightforward a weighted average may seem, the subtleties of meta-analysis are complex. In what follows I consider each class of choices required in the steps of a meta-analysis.

4.1. Choice of primary evidence

Multiple decisions must be made regarding what primary evidence to include in a meta-analysis. I survey some of these decisions, and critically evaluate arguments for particular strategies to these decisions.

4.1.1. Methodological quality

The dominant view in evidence-based medicine is to include only evidence from RCTs in a meta-analysis; according to a statement of leaders in evidence-based medicine, in a meta-analysis "researchers should consider including only controlled trials with proper randomisation" (Egger, Smith, & Phillips, 1997). Such a view excludes other common kinds of statistical evidence, including that from cohort studies and case-control studies, as well as non-statistical evidence which is not in the domain of usual technical meta-analyses, such as pathophysiological evidence, and evidence from animal experiments, mathematical models, and clinical expertise.

In contrast, others argue that an evidence amalgamation method should use all available evidence. Glass (1976), for instance, claims that an effect size of 2.0x from 3 RCTs testing a purported causal relation should have a different impact on one's assessment of the causal hypothesis when considered in the light of (i) 50 matched case-control studies, purportedly testing the same causal relation as the RCTs, that show an effect size of 2.2x, versus (ii) 50 matched case-control studies, purportedly testing the same causal relation as the RCTs, that show an effect size of 0.8x. A standard argument supports Glass's contention: if one's assessment of the causal hypothesis were not different in the two scenarios, one would effectively be committing the base-rate fallacy: one's assessment of a hypothesis after observing new evidence should also be guided by all of one's previous evidence, and if it is not then one is liable to make an ill-formed judgment of the probability that the hypothesis is true in light of the new evidence.

Here is another argument to support Glass's contention. In (i) there is concordance between the new evidence (from RCTs) and the previous evidence (from case-control studies), which might suggest that the two kinds of studies are converging on a true effect (but such concordance can occur for other reasons). In (ii) there is discordance between the new evidence (from RCTs) and the previous evidence (from case-control studies), which might suggest (a) that there is a systematic problem with the case-control studies, given the known potential biases with case-control studies compared with RCTs (this is a typical response in the evidence-based medicine community when faced with discordance between RCTs and case-control studies), (b) that there is a systematic problem with the RCTs, given the low number of them compared with the large number of case-control studies, (c) that the two kinds of studies were not similar enough in all important parameters, including the causal structure of the study populations, (d) that the purported cause is spurious, or (e) that a highly unlikely series of events has occurred. In other words, in (ii) there is no general reason to assume (a) as an explanation of the discordance, and if one blindly does assume (a) as an explanation then one is liable to be wrong.

Another way to put this consideration is that even if RCTs are justifiably the gold standard of evidence, that would not mean that evidence from non-randomized studies is negligible. Indeed, some of our most believable causal hypotheses were first supported by evidence from non-randomized studies, and for many hypotheses we only have evidence from non-randomized studies. A joke in such discussions is that there has never been a carefully performed RCT which has tested the causal efficacy of parachutes (e.g. Smith & Pell, 2003).

The exclusive use of a narrow range of evidence is purportedly justified on the grounds that the methods of meta-analysis are only valid for homogeneous evidence (I discuss this below), and by the "garbage-in-garbage-out" argument: if low quality evidence is included in a meta-analysis, then the output of the meta-analysis will also be low quality, and so rather than including all available evidence, meta-analyses should only include the 'best evidence' (e.g. Slavin (1995), who argues that meta-analysis should be limited to 'best evidence synthesis'). There are numerous problems with this argument, one of which is outlined above: if we ignore some evidence, even if it comes from a method deemed to be of low quality, we effectively commit the base-rate fallacy.9 Moreover, there is no

9 My appeal to the base-rate fallacy here might suggest that I am relying on Bayesian principles. But the problem with ignoring evidence should be a problem for everyone. Worrall (2002) and Cartwright (2007) have forcefully argued that there is no single 'gold standard' of evidence and thus we ought to take into account evidence of all kinds when available. Moreover, the possibility of 'defeating' evidence provides further reason why one ought to consider all available evidence. For example, if Beth, a specialist in ocean geography, tells me that Kiribati is an island nation in the Atlantic, then I have some evidence that Kiribati is indeed an island nation in the Atlantic; but if I later get evidence that Beth is a compulsive liar then I have lost my reason to believe that Kiribati is an island nation in the Atlantic. Attending to some of my evidence (Beth's claim) and ignoring other evidence (about Beth's honesty) leads me to believe something false.
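The expert-forecast analogy in §4 above can be made concrete: a simple linear opinion pool averages the experts' probabilities, and a weighted pool lets an analyst discount sources deemed less reliable rather than excluding them outright. The forecasts and reliability weights below are invented for illustration.

```python
# Sketch of combining expert probability forecasts (invented numbers).
# A simple pool averages the forecasts; a weighted pool down-weights
# less-trusted sources instead of discarding them.
forecasts = [0.7, 0.6, 0.9]   # three experts' probabilities for some event

# unweighted linear opinion pool
simple_pool = sum(forecasts) / len(forecasts)

# weighted pool, e.g. by each expert's hypothetical track record (sum to 1)
weights = [0.5, 0.3, 0.2]
weighted_pool = sum(w * p for w, p in zip(weights, forecasts))

print(round(simple_pool, 3), round(weighted_pool, 3))
```

The choice of weights shifts the combined estimate, which is the same latitude for judgment that the paper identifies in meta-analytic weighting.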

reason why an analyst cannot assess lower-quality evidence appropriately, simply by assigning a lower weight to such evidence when calculating the weighted average. Finally, the veiled premise of the garbage-in-garbage-out argument—that all and only non-randomized studies require problematic background assumptions in order for evidence from such studies to be truth-conducive—is false. All methods presuppose background assumptions that must be met for the evidence from such methods to be considered truth-conducive, and such assumptions may or may not be problematic, but this depends on specific features of the study design, both in the abstract and in relation to one's hypothesis of interest. In short, although all evidence is inductively risky, there are good reasons for including as much evidence as possible in a meta-analysis. Regardless, when performing a meta-analysis one must make a decision regarding the breadth of methodological quality to include, and this decision might be made differently by different analysts.

4.1.2. Methodological diversity

Another justification for only including evidence from select methods is the possibility of variable treatment effects among different subjects or different experimental circumstances. Consider the following guidance from the Cochrane Collaboration:

you have to be confident that clinical and methodological diversity is not so great that we should not be combining studies at all. This is a judgement, based on evidence, about how we think the treatment effect might vary in different circumstances.10

For the Cochrane Collaboration, the standard for what counts as methodological diversity is low; these meta-analyses only include a narrow range of study designs in any given review. Some limitation to the diversity of primary evidence which gets included in a meta-analysis is justifiable. The Cochrane group gives the following proviso: "Meta-analysis should only be considered when a group of studies is sufficiently homogeneous in terms of participants, interventions and outcomes" (Cochrane Handbook 9.5.1). Including only studies with homogenous outcomes is fine if by 'outcome' they mean kind of outcome; for example, if one study tests the effect of a drug on lowering blood pressure, and another study tests the

effect in a narrow range of subject diversity. Thus, there can be good reasons for limiting the diversity of participants, interventions, and kinds of outcomes to be included in a meta-analysis. Nevertheless, such parameters of meta-analyses are decision points which can influence the outcomes of a meta-analysis.

Other limitations to the primary evidence included in a meta-analysis are more troublesome. Consider the following Cochrane guidance: "we strongly recommend that review authors should not make any attempt to combine evidence from randomized trials and NRS [non-randomized studies]" (13.2.1.1). No justification is provided for this limitation; not only is evidence from non-randomized studies not to be amalgamated with evidence from RCTs, but neither is evidence from pathophysiological knowledge, background considerations of underlying mechanisms, animal experiments, and results from mathematical models. Such a practice could limit the external validity of a meta-analysis, since RCTs on humans are typically performed with relatively narrow study parameters while other kinds of evidence—including evidence from non-randomized human studies, studies on animals, and experiments designed to elucidate causal mechanisms which are often performed on tissue and cell cultures—can have diverse study parameters at lower cost. Moreover, as discussed above, this practice violates a principle of total evidence, which comes with possibly significant epistemic risk: neglecting other kinds of evidence risks making an uninformed judgment (or, the base-rate fallacy) on a hypothesis.

Methods of amalgamating evidence from multiple studies, but which systematically exclude all evidence but that from a single kind of study, are not limited to medicine. A non-medical example is in 'driving under the influence' (DUI) cases. In most jurisdictions in the United States there are at least three kinds of evidence that can be used to detect intoxication of drivers: (1) a police officer's subjective assessment of the driver12; (2) the driver's blood alcohol concentration as extrapolated from a portable breath test machine in the officer's car; (3) the driver's blood alcohol concentration as extrapolated from a more reliable breath test machine in a police station (Mnookin, 2008). The use of breath test machines is meant to mitigate officers' subjective assessments; to use a term of Daston and Galison (2007), the
no shared outcome on which to calculate an average. More gener- ‘mechanical objectivity’ of breath test machines are thought to
ally, a meta-analysis is only meaningful if the data from multiple counter the subjectivity of officers. In many jurisdictions, evidence
studies is generated from a single kind of causal relation. But even from (3) trumps evidence from (2) or (1): if a driver is suspected
when multiple studies are purported to measure the same causal of being intoxicated according to (1), and fails the breath test in
relation, the only evidence that analysts have to assess this (besides (2), but gets to the station and then passes the breath test in
the substantive features of the study designs) is by the statistical (3), the driver is released with no charges. In short, in such cases
variability between the data from the studies. As the Cochrane a single kind of evidence trumps other available kinds of evidence.
group rightly states, this is a ‘judgement’ regarding whether or Thus medicine is not the only domain in which one kind of evi-
not a meta-analysis is even meaningful in the first place. dence trumps all other kinds of evidence. However, to the extent
Homogeneity of participants and interventions might be that one is committed to the principle of total evidence, one will
similarly justifiable. If we are interested in the effect of a given find such practices dissatisfying.
intervention, we must be consistent with what that intervention The obvious worry about the plurality of unconstrained deci-
is—although a narrow range of intervention diversity (say, using sions regarding the methodological diversity to be included in a
a single dose of an experimental drug) will narrow the range of meta-analysis is that such choices can vary between analysts,
conclusions one can draw about the intervention. Likewise for and if so, such differences might affect the outcome of a meta-
the use of a narrow range of participants—before we can know if analysis.
an intervention works in a broad demographic, it is reasonable to
try to determine if it works in a narrower demographic.11 (But of 4.1.3. Discordance
course, if we already have evidence from a broader population of Another choice that must be made regarding which primary
subjects, including non-human subjects, then we should not ignore evidence to include in a meta-analysis is the degree of discor-
such evidence.) Moreover, some interventions only have a specific dance—that is, the degree to which evidence from different

10. Cochrane website http://www.cochrane-net.org/openlearning/html/mod13-4.htm (accessed 20.10.2009).
11. I give short shrift to a growing debate: Epstein (2007) argues that our knowledge of the safety and efficacy of many biomedical interventions is limited because for too many years these interventions were tested on a narrow demographic range of subjects.
12. This subjective assessment is itself comprised of various kinds of evidence, including the driver's ability to perform behavioral tasks, the driver's conversational capability, and the driver's outward appearance and smell.
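As an aside, the 'statistical variability between the data from the studies' mentioned in §4.1.2 is commonly quantified with Cochran's Q and the I² statistic. A minimal sketch follows; the effect sizes and standard errors are invented for illustration and come from neither the paper nor the Cochrane Handbook:

```python
# Hedged sketch: Cochran's Q and I^2, the usual statistics behind the
# 'judgement' about whether studies are too heterogeneous to combine.
# All effect sizes and standard errors below are invented.

def cochran_q(effects, std_errors):
    """Weighted sum of squared deviations of study effects from the pooled mean."""
    weights = [1 / se ** 2 for se in std_errors]  # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))

def i_squared(q, n_studies):
    """Proportion of variation across studies attributable to heterogeneity."""
    df = n_studies - 1
    return max(0.0, (q - df) / q) if q > 0 else 0.0

effects = [0.30, 0.25, 0.90]     # two concordant studies and one outlier
std_errors = [0.10, 0.12, 0.15]

q = cochran_q(effects, std_errors)
i2 = i_squared(q, len(effects))
print(f"Q = {q:.2f}, I^2 = {i2:.0%}")
```

On these invented numbers the outlying third study drives I² above 80%, the sort of figure an analyst would weigh when judging whether a meta-analysis is 'even meaningful in the first place'.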
502 J. Stegenga / Studies in History and Philosophy of Biological and Biomedical Sciences 42 (2011) 497–507

primary studies disagree or contradict each other—that the analyst is willing to accept amongst the primary set of evidence.

The Cochrane Collaboration Handbook has a section which discusses strategies for dealing with discordant primary evidence (9.5.3). An examination of these strategies is revealing. One strategy is to "explore" the discordance: discordance might be due to systematic differences between studies, and so a post-hoc meta-study can be done to determine if systematic differences between studies are related to systematic differences in outcomes. Another strategy is to exclude studies from the meta-analysis: the Handbook claims that discordance might be a result of several outlying studies, and if some factor can be found that might explain the discordance between these outlying studies and the remainder of the studies, then those outliers can be excluded. The Handbook notes, however, that "Since usually at least one characteristic can be found for any study in any meta-analysis which makes it different from the others, this criterion is unreliable because it is all too easy to fulfill." Indeed, a study can be similar or dissimilar to another study on an infinite number of features, and so if one had sufficient data and resources, one could always find a potential difference-maker about a study that would purportedly justify its exclusion. Finally, when faced with discordant primary evidence, the Cochrane group suggests that a meta-analysis may not be meaningful—"If you have clinical, methodological or statistical heterogeneity it may be better to present your review as a systematic review using a more qualitative approach to combining results…"13 This is because, as discussed above, the primary evidence might be discordant not because of random variations of measures from a single causal relation, but rather because the multiple primary studies were measuring multiple causal relations.

Each of these strategies for dealing with discordance can be pursued in a multitude of ways, with varying amounts of time and energy devoted to the particular strategies. There is no reason to think that different analysts will follow these strategies in the same way. Differing approaches to discordance have a direct effect on the outcomes of meta-analyses.

4.1.4. Data access

Decisions regarding what primary evidence to include in a meta-analysis are constrained by what primary evidence is available. The internet has improved access to primary evidence. Nevertheless, a well-known problem in medical research is publication bias: papers which show statistically significant positive findings are more likely to be published than papers that have null or negative findings (especially when the research is funded by private companies—see Brown (2008)). An illustrative example is provided by Whittington et al. (2004), who showed that the risk-benefit profile of some SSRIs for the treatment of childhood depression is positive when considering only published studies and negative when both published and unpublished studies are evaluated. A corollary of publication bias has its own name: the File Drawer Problem.14 In short, reviewers performing a meta-analysis often have less access to null or negative evidence (because it is sitting in file drawers or on hard drives) than they do to published positive evidence, and this is likely to influence the results of a meta-analysis (often, it seems, such influence is in the favor of the medical intervention under study).

A related problem is faced by analysts who want to do a meta-analysis with patient-level outcomes (which has several advantages over published study-level outcomes which I do not discuss here): often patient-level data is confidential or is protected by corporate interests. Other practical problems regarding access to primary evidence include studies published in languages foreign to the analyst, and evidence available only in the 'gray literature' of conference proceedings and dissertations; evidence from 'gray literature' tends to have lower estimates of medical interventions than does evidence published in mainstream literature (McAuley, Pham, Tugwell, & Moher, 2000). How intensely an analyst grapples with these practical problems of data access can influence the results of a meta-analysis.

4.1.5. Summary

A number of decisions must be made regarding which studies to include in a meta-analysis, including the acceptable range of methodological quality of studies, the acceptable range of study parameter diversity, whether or not to exclude studies with outlying data, how hard to look through the gray literature, if the File Drawer Problem is severe or not, and whether or not a meta-analysis is even feasible in the first place. In the words of a critic of meta-analysis: "It is precisely in those areas where there is most disagreement that these methods [meta-analysis] are least applicable" (Eysenck, 1984). In terms of the norms described in §2, the plurality of required decisions regarding which studies to include in a meta-analysis threatens Objectivity, and thereby Constraint. Regardless of how justified the decisions regarding choice of primary evidence are for any particular meta-analysis, they must be based on expertise and judgment, thereby inviting idiosyncrasy, and allowing a degree of latitude in the possible results of a meta-analysis.15

4.2. Choice of effect measure

Data from primary studies must be summarized quantitatively by a standardized measure, usually referred to as an 'effect measure', before being amalgamated into a weighted average. An effect measure (sometimes also called an outcome measure) is used to summarize data into an 'effect size', which is an estimate of the magnitude of the purported strength of the cause-effect relationship under investigation. Multiple effect measures can be used for this—frequent choices include the odds ratio, the risk difference, and the correlation coefficient (I give examples of these below). The choice of effect measure can influence the degree to which the primary evidence appears concordant or discordant, and so ultimately the choice of effect measure influences the results of meta-analysis, and can even influence whether or not an analyst thinks a meta-analysis is worth doing in the first place. The guidance from the Cochrane group will again help me to explain this.

As discussed above, the Cochrane group gives several strategies for dealing with discordant primary evidence. One of these strategies is to "change the effect measure", because discordance "may be an artificial consequence of an inappropriate choice of effect measure." The Cochrane Handbook is correct to claim that "when control group risks vary, homogeneous odds ratios or risk ratios will necessarily lead to heterogeneous risk differences, and vice versa." This is simply due to the mathematical relationship between ratios and differences. However, although it may be true

13. http://www.cochrane-net.org/openlearning/html/mod13-4.htm (accessed 04.08.2011).
14. That this is not called the Hard Drive Problem suggests that it has been with us for some time.
15. The issue of which primary studies to include in a meta-analysis is often appealed to by analysts when explaining contradictory outcomes between their own meta-analysis and previous meta-analyses. For instance, in the report by Bachand et al. (2010)—one of the meta-analyses testing if formaldehyde exposure causes leukemia, discussed in §3—the authors claimed that the apparently contradictory outcome of their meta-analysis with the outcome of an earlier meta-analysis was due to a difference in selection of primary studies: "Zhang et al. (2009) identified all relevant epidemiological studies published on formaldehyde and lymphohematopoietic cancer, but due to lack of case-control studies meeting their inclusion criteria, restricted their analysis to cohort and PMR studies."
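The File Drawer worry of §4.1.4 can be made concrete with a toy calculation: a fixed-effect, inverse-variance pooled estimate computed over 'published' studies only, then recomputed with hypothetical unpublished null or negative studies added. Every number here is invented for illustration (this is not Whittington et al.'s data):

```python
# Toy illustration of the File Drawer Problem: the pooled estimate from
# published studies alone overstates the effect once hypothetical
# unpublished (null/negative) studies are included. Numbers invented.

def pooled_effect(studies):
    """Fixed-effect, inverse-variance weighted average of (effect, SE) pairs."""
    weights = [1 / se ** 2 for _, se in studies]
    return sum(w * e for w, (e, _) in zip(weights, studies)) / sum(weights)

published = [(0.5, 0.2), (0.4, 0.25), (0.6, 0.3)]       # positive findings
file_drawer = [(-0.1, 0.2), (0.0, 0.25), (-0.2, 0.3)]   # null/negative findings

pub_only = pooled_effect(published)
with_drawer = pooled_effect(published + file_drawer)
print(f"published only: {pub_only:+.2f}; all studies: {with_drawer:+.2f}")
```

On these invented numbers the pooled effect is more than halved once the file-drawer studies are included, in the direction the text describes: the published-only estimate favors the intervention.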
that evidence from multiple studies appears discordant only because one effect measure is used rather than another, it might not be true: heterogeneity might simply be due to a lack of systematic effect by the intervention. A hypothetical case will help me illustrate the trouble with choosing between effect measures based on discordance between primary studies.

Consider two studies (1 and 2), each with two experimental groups (E and C), and each with a binary outcome (Y and N). The table below indicates the possible outcomes for each study, where the letters (a–d) are the numbers for each outcome in each group:

    Group    Outcome Y    Outcome N
    E        a            b
    C        c            d

The risk ratio (RR) is defined as:

    RR = [a/(a + b)] / [c/(c + d)]

The risk difference (RD) is defined as:

    RD = a/(a + b) − c/(c + d)

Now, suppose for Study 1 the numbers for the two outcomes in each group are a = 1, b = 1, c = 1, d = 3 and for Study 2 they are a = 6, b = 2, c = 3, d = 5. This would give the following effect sizes for the two studies:

    RR1 = 2,  RR2 = 2;  RD1 = 0.25,  RD2 = 0.375

Thus a meta-analysis on just these two studies, using risk difference as the effect measure, would have discordant primary effect sizes to amalgamate (0.25 and 0.375); but by switching the effect measure to risk ratios the meta-analysis would have concordant primary results to amalgamate (2 and 2). Although the Cochrane Collaboration advises changing the effect measure if the primary studies have discordant results, choosing between effect measures on the basis of trying to avoid discordance is ad hoc. More to the point, the choice of effect measure is another decision in which personal judgment is required, and the fact that there are multiple effect measures allows a range of outputs for any meta-analysis. Again, this threatens Objectivity, since some analysts might choose to change their effect measure when the primary evidence appears discordant using the originally chosen effect measure, while other analysts might resist such switching given that such switching seems ad hoc. Regardless of one's view of whether or not such switching is ad hoc, one's choice of effect measure has a direct influence on the outcome of a meta-analysis, and thus differing choices of effect measures directly threaten what I have been calling Constraint.

4.3. Choice of quality assessment scale

Analysts often attempt to account for differences in the size and methodological quality of primary studies included in a meta-analysis by weighing the primary studies with a formalized quality assessment scale. The conclusion of a meta-analysis depends on how the primary evidence is weighed, because the weights are used as a multiplier when the primary effect sizes are averaged. There are many features of evidence that should influence how primary evidence is weighed, including multiple features that influence the internal validity of a study (e.g. freedom from numerous potential biases) and the external validity of a study (i.e. the relevance of the evidence to one's general hypothesis of interest). Scientists lack principles to precisely determine how these numerous features should be weighed relative to each other. The trouble is that different weighing schemes can give contradictory results when evidence is amalgamated. An empirical demonstration of this was given by Jüni, Witschi, Bloch, and Egger (1999). They amalgamated data from 17 trials testing a particular medical intervention, using 25 different scales to assess study quality (thereby effectively performing 25 meta-analyses).16 These quality assessment scales varied in the number of assessed study attributes, from a low of three attributes to a high of 34, and varied in the weight given to the various study attributes; however, Jüni and his colleagues note that "most of these scoring systems lack a focused theoretical basis." Their results were troubling: the amalgamated effect sizes between these 25 meta-analyses differed by up to 117%—using exactly the same primary evidence. The authors concluded that "the type of scale used to assess trial quality can dramatically influence the interpretation of meta-analytic studies."

Not only does the choice of quality assessment scale dramatically influence the results of meta-analysis, but so does the choice of analyst. A quality assessment scale known as the 'risk of bias tool' was devised by the Cochrane group to assess the degree to which the results of a study "should be believed." Alberta researchers distributed 163 manuscripts of RCTs among five reviewers, who assessed the RCTs with this tool, and they found the inter-rater agreement of the quality assessments to be very low (Hartling et al., 2009). In other words, even when given a single quality assessment tool, and training on how to use it, and a narrow range of methodological diversity, there was a wide variability in assessments of study quality.

Much evidence suggests that personal differences in the assessment of the quality of scientific studies are deeply rooted. Kunda (1990) presents psychological research on what she calls "motivated reasoning", in which subjects assess evidence differentially depending on subjective idiosyncrasies.17 For example, after reading a scientific article which concludes that consuming caffeine is risky for females, female caffeine consumers were less convinced by the article than were females who do not consume caffeine. In another study, subjects were presented with mixed evidence about the efficacy of capital punishment, and both supporters and opponents of capital punishment subsequently became more polarized in their respective views, which is perhaps best explained by a differential assessment of the mixed evidence.18

In short, when performing a meta-analysis, analysts must choose a quality assessment scale and apply the scale to the assessment of particular primary-level studies. The choice of a quality assessment scale, and variations in the assessments of quality by different analysts, violates what I have been calling Objectivity, and the above examples show that such a violation of Objectivity straightforwardly threatens Constraint: differing decisions regarding one's quality assessment scale lead to contradictory outcomes of a meta-analysis.

4.4. Choice of averaging technique

Once effect measures are calculated for each primary study, two common ways to determine the average effect measure are possible: sub-group averages and pooled averages. In a pooled average, all subjects from the included studies are merged in the analysis as if they were part of one large study with no distinct demographic

16. These quality assessment scales were summarized and described in Moher et al. (1995).
17. I am grateful to Boaz Miller for bringing these findings to my attention, and to the discussion of them in Miller (2010).
18. Although these examples suggest that differential assessments of the quality of scientific studies are influenced by non-epistemic features of the subjects involved in the assessment, such differential assessment of the quality of scientific studies can also arise by subjects variably weighing relevant epistemic considerations.
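The arithmetic of §4.2's hypothetical two-study case can be checked with a short script; the cell counts (a, b, c, d) below are the ones given in the text:

```python
# The two hypothetical studies from the text: groups E and C, binary
# outcome Y/N, with cell counts (a, b, c, d) as in the 2x2 table.

def risk_ratio(a, b, c, d):
    """RR = [a/(a + b)] / [c/(c + d)]"""
    return (a / (a + b)) / (c / (c + d))

def risk_difference(a, b, c, d):
    """RD = a/(a + b) - c/(c + d)"""
    return a / (a + b) - c / (c + d)

study1 = (1, 1, 1, 3)
study2 = (6, 2, 3, 5)

# Concordant under RR, discordant under RD:
print(risk_ratio(*study1), risk_ratio(*study2))            # 2.0 2.0
print(risk_difference(*study1), risk_difference(*study2))  # 0.25 0.375
```

Switching from risk differences to risk ratios makes the very same data look concordant, which is exactly the latitude in the choice of effect measure that the text worries about.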
sub-groups. One problem with the pooled average approach is Simpson's paradox: the comparative success rate of two groups can be reversed in all of their respective sub-groups, so if a meta-analysis simply pooled all participants into an analysis of overall groups then the calculated effect of the intervention could be the opposite of what one would find in every sub-group. Another problem with the pooled average approach is that different demographic groups might respond differently to an intervention. For example, a drug might, on average, have a large benefit to males and a small harm to females, and if data from these groups were combined in a pooled average we would erroneously conclude that the drug has, on average, a small benefit to all people, including females.

Maintaining distinct sub-groups in a meta-analysis, which the Cochrane group rightly advises, is an attempt to avoid such problems. However, maintaining sub-groups does not avoid Simpson's paradox unless there is a principled way to demarcate sub-groups such that the 'true' result one is interested in is relative to those sub-groups and these exact sub-groups were used in the primary analyses. Moreover, to determine a sub-group average, either the sub-groups must be consistently demarcated amongst primary studies, or the patient-level data necessary to demarcate sub-groups, such as age and gender, must be available to the analyst. The former is often not the case and the latter is often not available. However, if patient-level demographic data is available to the analyst, then the analyst can re-group individual sub-groups any way she wishes until she finds something interesting, but of course such retrospective data-dredging is liable to support spurious findings. More to the point, once again: the choice of average type—pooled or sub-group (and if the latter, the choice of appropriate sub-groups)—is another decision point in the methodology of meta-analysis which threatens Objectivity and Constraint.

4.5. Summary

Let me recap. I am not the first to note difficulties with meta-analysis. Others have claimed that formal methods of amalgamating evidence "bury under a series of assumptions many value judgments" (Lomas, Fulop, Gagnon, & Allen, 2003). I have attempted to identify those specific aspects of meta-analysis in which such "value judgments" have an influence on the results of a meta-analysis.

5. The Hill strategy

A long-time critic of meta-analysis has argued that subjective knowledge is necessary to properly assess a large volume and diversity of evidence:

    A good review is based on intimate personal knowledge of the field, the participants, the problems that arise, the reputation of different laboratories, the likely trustworthiness of individual scientists, and other partly subjective but extremely relevant considerations. Meta-analysis rules out any such subjective factors. (Eysenck, 1994)

While I concur that meta-analysis has a primary aim of ruling out subjective factors when amalgamating evidence (which is another way of stating the Objectivity norm), if my arguments in §4 are correct, then meta-analysis is not successful at reaching this aim. Others have urged that in situations in which there is a large volume of primary-level evidence which is discordant, we do not have (and likely will not find) a satisfactory "formula or set of principles designed to provide decision-making rules" (Klein and Williams, 2000). Such pessimism is perhaps most acutely justified when the discordant primary evidence comes from very different kinds of experiments. Nevertheless, there is, at least at first glance, a tension between the purported objectivity and quantificational simplicity of meta-analyses and the subjectivity and qualitative complexity required to assess and interpret the relevant aspects of all available evidence.

A consideration of an older tradition of evidence in medicine, associated with the epidemiologist Sir Bradford Hill (1897–1991), might go some way toward resolving this tension. Hill was one of the leading epidemiologists involved in the first large-scale case-control studies during the 1950s which showed a correlation between smoking and lung cancer (Doll & Hill, 1950, 1954). Hill's statistician nemesis Ronald Fisher (1890–1962) noted the absence of controlled experimental evidence required to prove that the smoking-cancer association was indeed causal. Fisher's now infamous criticism was that the smoking-cancer correlation could be explained by a confounding variable, or common cause of the smoking and cancer. Fisher postulated a genetic predisposition which could be a common cause of both smoking and cancer, and so the observed association between smoking and cancer could be spurious. The only way to determine a true causal relation, according to Fisher, was to perform a controlled experiment; of course, for ethical reasons no such experiment could be performed. Hill, at the time an epidemiologist at the London School of Hygiene and Tropical Medicine, responded by appealing to a plurality of reasoning strategies which, he claimed, when taken together made a compelling case that the observed association was truly a causal relation (Hill, 1965).

These reasoning strategies were as follows:

1. Strength of associations between variables: strong associations between variables are more likely to be causal than weak associations.
2. Consistency of results between studies: an association between variables which is observed in multiple studies is more likely to be causal.19
3. Specificity of variables: a single specific cause has a single specific effect; correlations between coarse-grained or non-specific variables are less-compelling evidence for a true causal relation.
4. Temporality: a cause must precede its effect.
5. Biological gradient: a dose–response pattern of associations between variables suggests a true causal relation.
6. Plausibility: a plausible biological mechanism which can explain a correlation suggests that the association is a true causal relation.20
7. Coherence: a causal interpretation of an association should not conflict with other relevant knowledge, and epidemiological evidence should cohere with evidence from laboratory experiments.
8. Experimental evidence: despite criticisms from Fisher, Hill of course recognized the value, when available, of evidence from controlled experiments.
9. Analogy: analogies with other known causal relations can aid in causal inference; that is, if the purported cause and purported effect are similar in important respects to a known cause and its effect, then there is at least some reason to think that the purported causal relation is real.

Although some have erroneously called these considerations 'causal criteria', Hill considered them only as guidelines rather than necessary or sufficient conditions or 'criteria' (except perhaps for

19. It is worth noting that meta-analysis can be thought of as a formal technique to assess the 'consistency' criterion. Framing meta-analysis this way shows just how much meta-analysis neglects, but also shows that it can be a useful technique nevertheless.
20. For an interesting study of an eighteenth century case in which the search for a causal mechanism led the researchers astray, see De Vreese (2008).
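Returning briefly to §4.4: the Simpson's-paradox worry about pooled averages can be illustrated with hypothetical counts (patterned after the well-known kidney-stone example), in which a treatment outperforms the control in every sub-group yet underperforms once the sub-groups are pooled:

```python
# Hypothetical success counts illustrating Simpson's paradox: the
# treatment wins within each sub-group but loses in the pooled analysis.

def success_rate(successes, total):
    return successes / total

subgroups = {
    "mild":   {"treatment": (81, 87),   "control": (234, 270)},
    "severe": {"treatment": (192, 263), "control": (55, 80)},
}

for name, arms in subgroups.items():
    t = success_rate(*arms["treatment"])
    c = success_rate(*arms["control"])
    print(f"{name}: treatment {t:.2f} vs control {c:.2f}")  # treatment wins both

# Pooling all participants reverses the comparison:
t_pool = success_rate(81 + 192, 87 + 263)   # 273/350
c_pool = success_rate(234 + 55, 270 + 80)   # 289/350
print(f"pooled: treatment {t_pool:.2f} vs control {c_pool:.2f}")  # control wins
```

The reversal arises because the treatment was given disproportionately to the severe cases; a meta-analysis that simply pooled all participants would report the misleading overall comparison.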
temporality, which is plausibly a necessary condition for a causal relation). Since Hill seems to have intended these as epistemic desiderata for discovering causal relations, I will simply call them 'desiderata'.21 Although Hill granted that no single desideratum was necessary or sufficient to demonstrate causality, he claimed that jointly the desiderata could make for a good argument for the presence of a causal relation (Doll, 2003).22 Each particular desideratum could use philosophical critique, but the important point for the purpose of contrast with meta-analysis is the plurality of reasons and sources of evidence that Hill appealed to.23

The desiderata appealed to by Hill depend on diverse kinds of evidence, which lack a shared quantitative measure—like that of evidence solely from RCTs—such that the evidence can be combined by a simple weighted average. The four specific problems I raised for meta-analysis—the choice of primary evidence to include, the choice of a metric or effect size to quantify the evidence, the choice of a quality assessment scale to assess or weigh the evidence, and the choice of averaging technique—are even more troublesome for the Hill strategy. Thus one might think: meta-analysis has the virtue of amalgamating evidence with objectivity and quantitative simplicity, yet has the vice of amalgamating only a narrow range of evidence, while the Hill strategy has the virtue of considering all available evidence, yet has the vice of qualitative subjectivity. But given my arguments in §3 and §4, the purported virtues of meta-analysis—objectivity and constraint—are less apparent than many have thought.

Since Hill's desiderata are not individually necessary (with the exception, noted above, of the temporality desideratum) for inferring causal relations, one can have evidence which satisfies only some of the desiderata while still having ample justification for causal inference. There is, then, some malleability in the Hill strategy. Defenders of formal methods of amalgamating evidence, such as meta-analysis, might object to such malleability. Such an objection could appeal to the Objectivity and Constraint norms: if the Hill strategy is so malleable, then different analysts could apply the Hill strategy in a variety of ways which reach contradictory conclusions. This objection would misfire twice over. First, I have already shown that meta-analysis is also highly malleable. This is not a mere tu quoque. The complexity of assessing and amalgamating a large volume and diversity of evidence might inevitably require malleable techniques, in which case malleability per se could hardly be a criticism of such a technique. Second, when properly applied the desiderata are constraining. If a meta-analysis supports a hypothesis while most of Hill's desiderata provide reasons against belief in the hypothesis, this ought to sustain serious reservation in this hypothesis. For example, Hodge (2007) reports a meta-analysis which concludes that intercessory prayer (praying

this is suggestive that the hypothesis is roughly correct. For instance, in §3 I discussed meta-analyses which tested whether formaldehyde exposure causes leukemia. One of these (Zhang et al., 2009) concluded that formaldehyde exposure is indeed associated with leukemia, and in addition to this conclusion the authors proposed possible causal mechanisms meant to undergird the outcome of their meta-analysis, thereby appealing to the coherence and plausibility desiderata.25

Some epidemiologists now argue that desiderata such as those used by Hill should be employed more often (Weed, 1997), whereas others argue that such criteria should not be used to assess causal relations (Charlton, 1996). At the very least, the Hill strategy of dealing with a huge volume and diversity of evidence might, given the problems with meta-analysis discussed in §3 and §4, be more virtuous than meta-analysis.

6. Conclusion

I have argued that meta-analyses fail to adequately constrain intersubjective assessments of hypotheses. This is because the numerous decisions that must be made when designing and performing a meta-analysis require personal judgment and expertise, and allow personal biases and idiosyncrasies of reviewers to influence the outcome of the meta-analysis. The failure of Objectivity at least partly explains the failure of Constraint: that is, the subjectivity required for meta-analysis explains how multiple meta-analyses of the same primary evidence can reach contradictory conclusions regarding the same hypothesis.

Defenders of meta-analysis have noted that although my critique shows that there are better and worse ways to perform a meta-analysis, it does not follow that we ought to discard the technique altogether. I agree. Although I have used the published guidance from the Cochrane group as a foil to frame my criticisms, the Cochrane group has been active in working to improve the quality of meta-analyses. There have been multiple attempts at formulating the features that a report of a meta-analysis should include, prominently including that of the Quality of Reporting of Meta-analyses (QUORUM) group (Moher et al., 1999). This response from defenders of meta-analysis does not, however, directly address my central argument, namely that the epistemic prominence given to meta-analysis is unjustified, since meta-analysis allows idiosyncratic biases to influence its results, which in turn explains why the results of meta-analyses are unconstrained. The upshot to this critique, one might claim, is merely to urge the improvement of the quality of meta-analyses in ways similar to that already proposed by the QUORUM and Cochrane group, in order to achieve some higher degree of constraint.26 However, my discussion of the many
on behalf of others) has a small but significant effect on the well- particular decisions that must be made when performing a meta-
being of those prayed for. Such a claim, of course, fares poorly on analysis suggests that such improvements can only go so far. For
at least several of Hill’s desiderata.24 Endorsing the Hill strategy, at least some of these decisions, the choice between available op-
then, does not mean endorsing a more tolerant or relaxed attitude tions is entirely arbitrary; the various proposals to enhance the
toward amalgamating evidence compared with purportedly rigorous transparency of reporting of meta-analyses are unable, in principle,
and quantitative methods of amalgamating evidence. Conversely, if to referee between these arbitrary choices. More generally, this
most of the desiderata coherently support a particular hypothesis, rejoinder from the defenders of meta-analysis—that we ought not

21
I am grateful to an anonymous referee for this suggestion.
22
The Hill strategy could perhaps be understood as part of a shift in epidemiological concepts of cause and disease from a monocausal to a multifactorial model; for a discussion
of concepts of cause and disease in epidemiology, see Broadbent (2009).
23
See Howick, Glasziou, & Aronson (2009) for a recent analysis and restructuring of Hill’s desiderata, and Rothman & Greenland (2005) for a brief discussion of each of the
desiderata. Woodward (2010) provides a careful analysis of the specificity desideratum.
24
Moreover, this is another example in which multiple meta-analyses reach contradictory conclusions. Masters & Spielmans (2007) and Roberts, Ahmed, Hall, & Davison (2009)
both report meta-analyses which conclude that intercessory prayer has no effect.
25
However, it should be clear that nothing very general can be said regarding when the satisfaction of the desiderata are sufficient to infer causality.
26
For an illustration of the variable quality of meta-analyses, consider this: meta-analyses which were not performed by Cochrane collaborators were twice as likely to have
positive conclusion statements compared with meta-analyses performed by Cochrane collaborators (Tricco, Tetzlaff, Pham, Brehaut, & Moher 2009). Assuming that Cochrane
meta-analyses were higher quality than non-Cochrane meta-analyses (surely a safe assumption), it follows that better meta-analyses are less likely to have a positive conclusion
regarding a medical intervention.
506 J. Stegenga / Studies in History and Philosophy of Biological and Biomedical Sciences 42 (2011) 497–507

altogether discard the technique—over-states the strength of the Cartwright, N., & Stegenga, J. (2011). A theory of evidence for evidence-based policy.
In Dawid, Twining, & Vasilaki (Eds.), Evidence, inference and enquiry. Oxford
conclusion I have argued for, which is not that meta-analysis is en-
University Press.
tirely a bad method of amalgamating evidence, but rather is that Charlton, B. G. (1996). Attribution of causation in epidemiology: Chain or mosaic?
meta-analysis ought not be considered the best kind of evidence Journal of Clinical Epidemiology, 49, 105–107.
for assessing causal hypotheses in medicine and the social sciences. Cochrane Handbook. <http://www.cochrane.org/resources/handbook> (available
online).
I have not argued that meta-analysis cannot provide any compelling Collins, J. J., & Lineker, G. A. (2004). A review and meta-analysis of formaldehyde
evidence, but rather, contrary to the standard view, I have argued exposure and leukemia. Regulatory Toxicology and Pharmacology, 40, 81–91.
that meta-analysis is not the platinum standard of evidence. Danks, D. (2005). Scientific coherence and the fusion of experimental results. The
British Journal for the Philosophy of Science, 56, 791–807.
One of the primary criticisms I raised against meta-analysis is Daston, L., & Galison, P. (2007). Objectivity. Cambridge: Zone Books.
its reliance on a narrow range of evidential diversity. An older tra- De Vreese, L. (2008). Causal (mis)understanding and the search for scientific
dition of evidence in medicine, associated with the epidemiologist explanations: A case study from the history of medicine. Studies in History and
Philosophy of Biological and Biomedical Sciences, 39, 14–24.
Sir Bradford Hill, is in this respect superior. Moreover, the Hill Deaton, A. (2008). Instruments of development: Randomisation in the tropics, and
strategy can accommodate the response from defenders of meta- the search for the elusive keys to economic development. Proceedings of the
analysis considered immediately above: the ‘consistency’ desider- British Academy, 162, 123–160.
Doll, R. (2003). Fisher and Bradford Hill: Their personal impact. International Journal
atum can be tested by meta-analysis, and so even if one were to of Epidemiology, 32, 929–931.
use the Hill strategy, one could still use meta-analysis as part of Doll, R., & Hill, A. B. (1950). Smoking and carcinoma of the lung: Preliminary report.
one’s assessment of a hypothesis of interest. Meta-analysis, then, British Medical Journal, 2(4682), 739–748.
Doll, R., & Hill, A. B. (1954). The mortality of doctors in relation to their smoking
would be one of many kinds of evidence appealed to when amal-
habits. British Medical Journal, 1(4877), 1451–1455.
gamating available evidence for some hypothesis. However, there Douglas, H. (2004). The irreducible complexity of objectivity. Synthese, 138(3),
is no formal method for assessing, quantifying, and amalgamating 453–473.
the very disparate kinds of evidence that Hill considered. Thus the Duflo, E., & Kremer, M. (2003). The use of randomization in the evaluation of development
effectiveness. World Bank manuscript. <http://www.povertyactionlab.org/
Hill strategy lacks the apparent objectivity and quantificational methodology> (accessed on 15.03.2011).
simplicity of meta-analysis. But given the central argument of this Egger, M., Smith, G. D., & Phillips, A. N. (1997). Meta-analysis: Principles and
paper, the fact that the Hill strategy lacks a simple method of procedures. British Medical Journal, 315, 1533–1537.
Epstein, S. (2007). Inclusion: The politics of difference in medical research. Chicago:
objectively amalgamating diverse evidence is not a strike against Chicago University Press.
it relative to meta-analysis, since I have argued that the quantita- Eysenck, H. (1984). Meta-analysis: An abuse of research integration. Journal of
tive simplicity and objectivity of the latter is a chimera. Despite the Special Education, 18(1), 41–59.
Eysenck, H. (1994). Systematic reviews: Meta-analysis and its problems. British
ubiquitous view that meta-analysis is the platinum standard of Medical Journal, 309, 789–792.
evidence in medicine, meta-analysis is not, in the end, very shiny. Fergusson, D., Doucette, S., Glass, K. C., Shapiro, S., Healy, D., Hebert, P., et al. (2005).
Association between suicide attempts and selective serotonin reuptake
inhibitors: Systematic review of randomised controlled trials. British Medical
Acknowledgments Journal, 330, 396–399.
Glass, G. V. (1976). Primary, secondary and meta-analysis of research. Educational
Nancy Cartwright, Heather Douglas, Miriam Solomon, Eric Researcher, 10, 3–8.
Glass, G. V., & Smith, M. L. (1979). Meta-analysis of research on class size and
Martin, and Boaz Miller gave detailed feedback on earlier drafts
achievement. Educational Evaluation and Policy Analysis, 1(1), 2–16.
of this paper, and I am grateful for discussions with audiences at Gunnell, D., Saperia, J., & Ashby, D. (2005). Selective serotonin reuptake inhibitors
University of Toronto, Michigan State University, University of (SSRIs) and suicide in adults: Meta-analysis of drug company data from placebo
Western Ontario, the American Association for the Advancement controlled, randomised controlled trials submitted to the MHRA’s safety review.
British Medical Journal, 330, 385–388.
of Science, participants in my seminar at Virginia Tech, and Hacking, I. (1988). Telepathy: Origins of randomization in experimental design. Isis,
members of the UCSD Philosophy of Science Reading Group. Three 79(3), 427–451.
anonymous reviewers suggested many valuable improvements. Hartling, L., Ospina, M., Liang, Y., Dryden, D., Hooten, N., Seida, J., et al. (2009). Risk
of bias versus quality assessment of randomised controlled trials: Cross
sectional study. British Medical Journal, 339, b4012.
Hill, B. (1965). The environment and disease: Association or causation? Proceedings
References of the Royal Society of Medicine, 58, 295–300.
Hodge, D. R. (2007). A systematic review of the empirical literature on intercessory
Assendelft, W. J., Koes, B. W., Knipschild, P. G., & Bouter, L. M. (1995). The prayer. Research on Social Work Practice, 17, 174–187.
relationship between methodological quality and conclusions in reviews of Howick, J., Glasziou, P., & Aronson, J. K. (2009). The evolution of evidence
spinal manipulation. The Journal of the American Medical Association, 274, hierarchies: What can Bradford Hill’s ‘guidelines for causation’ contribute?
1942–1948. Journal of the Royal Society of Medicine, 102, 186–194.
Bachand, A. M., Mundt, K. A., Mundt, D. J., & Montgomery, R. R. (2010). Jaber, B. L., Lau, J., Schmid, C. H., Karsou, S. A., Levey, A. S., & Pereira, B. J. (2002).
Epidemiological studies of formaldehyde exposure and risk of leukemia and Effect of biocompatibility of hemodialysis membranes on mortality in acute
nasopharyngeal cancer: A meta-analysis. Critical Reviews in Toxicology, 40(2), renal failure: A meta-analysis. Clinical Nephrology, 57(4), 274–282.
85–100. Jüni, P., Witschi, A., Bloch, R., & Egger, M. (1999). The hazards of scoring the quality
Banerjee, A., & Duflo, E. (Unpublished manuscript). The experimental approach to of clinical trials for meta-analysis. The Journal of the American Medical
development economics. MIT JPAL manuscript. <http://www.povertyactionlab. Association, 282(11), 1054–1060.
org/methodology> (accessed on 15.03.2011). Klein, R., & Williams, A. (2000). Setting priorities: what is holding us back—
Barnes, D., & Bero, L. (1998). Why review articles on the health effects of passive inadequate information or inadequate institutions? In Coulter & Ham (Eds.), The
smoking reach different conclusions. The Journal of the American Medical global challenge of health care rationing. Open University Press.
Association, 279(19), 1566–1570. Knipschild, P. (1994). Systematic reviews: Some examples. British Medical Journal,
Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction 309, 719–721.
to meta-analysis. Chichester: John Wiley and Sons. Kunda, Z. (1990). The case for motivated reasoning. Psychological Bulletin, 108(3),
Borgenson, K. (2008). Valuing and evaluating evidence in medicine. PhD diss., 480–498.
University of Toronto. Linde, K., & Willich, S. (2003). How objective are systematic reviews? Differences
Bosetti, C., McLaughlin, J. K., Tarone, R. E., Pira, E., & La Vecchia, C. (2008). between reviews on complementary medicine. Journal of the Royal Society of
Formaldehyde and cancer risk: A quantitative review of cohort studies through Medicine, 96, 17–22.
2006. Annals of Oncology, 19, 29–43. Lomas, J., Fulop, N., Gagnon, D., & Allen, P. (2003). On being a good listener: Setting
Broadbent, A. (2009). Causation and models of disease in epidemiology. Studies in priorities for applied health services research. Milbank Quarterly, 81(3),
History and Philosophy of Biological and Biomedical Sciences, 40, 302–311. 363–388.
Brown, J. (2008). The community of science. In Carrier, Howard, & Kourany (Eds.), Masters, K. S., & Spielmans, G. I. (2007). Prayer and health: Review, meta-analysis,
The challenge of the social and the pressure of practice: Science and values revisited. and research agenda. Journal of Behavioral Medicine, 30(4), 329–338.
Pittsburgh: University of Pittsburgh Press. McAuley, L., Pham, B., Tugwell, P., & Moher, D. (2000). Does the inclusion of grey
Cartwright, N. (2007). Are RCTs the gold standard? Biosocieties, 2, 11–20. literature influence estimates of intervention effectiveness reported in meta-
Cartwright, N. (2010). The long road from ‘it works somewhere’ to ‘it will work for us’. analyses? Lancet, 356(9237), 1228–1231.
Philosophy of Science Association, Presidential Address. Miller, B. (2010). A social theory of knowledge. PhD diss., University of Toronto.
Mnookin, J. (2008). Under the influence of technology: DUI and the legal production of objectivity. UCSD Science Studies Colloquium (21.04.2008).
Moher, D., Cook, D. J., Eastwood, S., Olkin, I., Rennie, D., Stroup, D., et al. (1999). Improving the quality of reports of meta-analyses of randomised controlled trials: The QUOROM statement. Lancet, 354, 1896–1900.
Moher, D., Jadad, A. R., Nichol, G., Penman, M., Tugwell, P., & Walsh, S. (1995). Assessing the quality of randomized controlled trials: An annotated bibliography of scales and checklists. Controlled Clinical Trials, 16, 62–73.
Pauling, L. (1986). How to live longer and feel better. New York: W.H. Freeman.
Rhine, J. B., Pratt, J. G., Stuart, C. E., Smith, B. M., & Greenwood, J. A. (1940). Extrasensory perception after sixty years. New York: Holt.
Roberts, L., Ahmed, I., Hall, S., & Davison, A. (2009). Intercessory prayer for the alleviation of ill health. Cochrane Database of Systematic Reviews, 15(2), CD000368.
Rothman, K. J., & Greenland, S. (2005). Causation and causal inference in epidemiology. American Journal of Public Health, 95, S144–S150.
Rotton, J., & Kelly, I. W. (1985). Much ado about the full moon: A meta-analysis of lunar-lunacy research. Psychological Bulletin, 97, 286–306.
Slavin, R. (1995). Best evidence synthesis: An intelligent alternative to meta-analysis. Journal of Clinical Epidemiology, 48(1), 9–18.
Smith, M. L., & Glass, G. V. (1977). Meta-analysis of psychotherapy outcome studies. American Psychologist, 32, 752–760.
Smith, G. C., & Pell, J. P. (2003). Parachute use to prevent death and major trauma related to gravitational challenge: Systematic review of randomised controlled trials. British Medical Journal, 327(7429), 1459–1461.
Stegenga, J. (2009). Robustness, discordance, and relevance. Philosophy of Science, 76, 650–661.
Subramanian, S., Venkataraman, R., & Kellum, J. A. (2002). Influence of dialysis membranes on outcomes in acute renal failure: A meta-analysis. Kidney International, 62, 1819–1823.
Sutton, A. J., & Higgins, J. P. T. (2008). Recent developments in meta-analysis. Statistics in Medicine, 27, 625–650.
Thagard, P. (1998). Ulcers and bacteria I: Discovery and acceptance. Studies in History and Philosophy of Biological and Biomedical Sciences, 29, 107–136.
Tricco, A. C., Tetzlaff, J., Pham, B., Brehaut, J., & Moher, D. (2009). Non-Cochrane vs. Cochrane reviews were twice as likely to have positive conclusion statements: Cross-sectional study. Journal of Clinical Epidemiology, 62(4), 380–386.
Trout, J. D. (1995). Diverse tests on an independent world. Studies in History and Philosophy of Science, 26(3), 407–429.
Weed, D. (1997). On the use of causal criteria. International Journal of Epidemiology, 26, 1137–1141.
Whitlock, M., & Schluter, D. (2009). The analysis of biological data. Greenwood Village: Roberts and Company Publishers.
Whittington, C. J., Kendall, T., Fonagy, P., Cottrell, D., Cotgrove, A., & Boddington, E. (2004). Selective serotonin reuptake inhibitors in childhood depression: Systematic review of published versus unpublished data. Lancet, 363(9418), 1341–1345.
Wimsatt, W. (1981). Robustness, reliability, and overdetermination. In Brewer & Collins (Eds.), Scientific inquiry and the social sciences. San Francisco: Jossey-Bass.
Woodward, J. (2010). Causation in biology: Stability, specificity, and the choice of levels of explanation. Biology and Philosophy, 25, 287–318.
Worrall, J. (2002). What evidence in evidence-based medicine? Philosophy of Science, 69, S316–S330.
Worrall, J. (2007). Why there's no cause to randomize. The British Journal for the Philosophy of Science, 58, 451–488.
Yank, V., Rennie, D., & Bero, L. A. (2007). Financial ties and concordance between results and conclusions in meta-analyses: A retrospective cohort study. British Medical Journal, 335, 1202–1205.
Zhang, L., Steinmaus, C., Eastmond, D. A., Xin, X. K., & Smith, M. T. (2009). Formaldehyde exposure and leukemia: A new meta-analysis and potential mechanisms. Mutation Research, 681, 150–168.