Justify Your Alpha

In Press, Nature Human Behaviour

Daniel Lakens*¹, Federico G. Adolfi², Casper J. Albers³, Farid Anvari⁴, Matthew A. J. Apps⁵, Shlomo E. Argamon⁶, Thom Baguley⁷, Raymond B. Becker⁸, Stephen D. Benning⁹, Daniel E. Bradford¹⁰, Erin M. Buchanan¹¹, Aaron R. Caldwell¹², Ben van Calster¹³, Rickard Carlsson¹⁴, Sau-Chin Chen¹⁵, Bryan Chung¹⁶, Lincoln J Colling¹⁷, Gary S. Collins¹⁸, Zander Crook¹⁹, Emily S. Cross²⁰, Sameera Daniels²¹, Henrik Danielsson²², Lisa DeBruine²³, Daniel J. Dunleavy²⁴, Brian D. Earp²⁵, Michele I. Feist²⁶, Jason D. Ferrell²⁷, James G. Field²⁸, Nicholas W. Fox²⁹, Amanda Friesen³⁰, Caio Gomes³¹, Monica Gonzalez-Marquez³², James A. Grange³³, Andrew P. Grieve³⁴, Robert Guggenberger³⁵, James Grist³⁶, Anne-Laura van Harmelen³⁷, Fred Hasselman³⁸, Kevin D. Hochard³⁹, Mark R. Hoffarth⁴⁰, Nicholas P. Holmes⁴¹, Michael Ingre⁴², Peder M. Isager⁴³, Hanna K. Isotalus⁴⁴, Christer Johansson⁴⁵, Konrad Juszczyk⁴⁶, David A. Kenny⁴⁷, Ahmed A. Khalil⁴⁸, Barbara Konat⁴⁹, Junpeng Lao⁵⁰, Erik Gahner Larsen⁵¹, Gerine M. A. Lodder⁵², Jiří Lukavský⁵³, Christopher R. Madan⁵⁴, David Manheim⁵⁵, Stephen R. Martin⁵⁶, Andrea E. Martin⁵⁷, Deborah G. Mayo⁵⁸, Randy J. McCarthy⁵⁹, Kevin McConway⁶⁰, Colin McFarland⁶¹, Amanda Q. X. Nio⁶², Gustav Nilsonne⁶³, Cilene Lino de Oliveira⁶⁴, Jean-Jacques Orban de Xivry⁶⁵, Sam Parsons⁶⁶, Gerit Pfuhl⁶⁷, Kimberly A. Quinn⁶⁸, John J. Sakon⁶⁹, S. Adil Saribay⁷⁰, Iris K. Schneider⁷¹, Manojkumar Selvaraju⁷², Zsuzsika Sjoerds⁷³, Samuel G. Smith⁷⁴, Tim Smits⁷⁵, Jeffrey R. Spies⁷⁶, Vishnu Sreekumar⁷⁷, Crystal N. Steltenpohl⁷⁸, Neil Stenhouse⁷⁹, Wojciech Świątkowski⁸⁰, Miguel A. Vadillo⁸¹, Marcel A. L. M. Van Assen⁸², Matt N. Williams⁸³, Samantha E. Williams⁸⁴, Donald R. Williams⁸⁵, Tal Yarkoni⁸⁶, Ignazio Ziano⁸⁷, Rolf A. Zwaan⁸⁸

Affiliations

*¹Human-Technology Interaction, Eindhoven University of Technology, Den Dolech, 5600MB, Eindhoven, The Netherlands
²Laboratory of Experimental Psychology and Neuroscience (LPEN), Institute of Cognitive and Translational Neuroscience (INCYT), INECO Foundation, Favaloro University, Pacheco de Melo 1860, Buenos Aires, Argentina
²National Scientific and Technical Research Council (CONICET), Godoy Cruz 2290, Buenos Aires, Argentina
³Heymans Institute for Psychological Research, University of Groningen, Grote Kruisstraat 2/1, 9712TS Groningen, The Netherlands
⁴College of Education, Psychology & Social Work, Flinders University, Adelaide, GPO Box 2100, Adelaide, SA, 5001, Australia
⁵Department of Experimental Psychology, University of Oxford, New Radcliffe House, Oxford, OX2 6GG, UK
⁶Department of Computer Science, Illinois Institute of Technology, Chicago, IL, 10 W. 31st Street, Chicago, IL 60645, USA
⁷Department of Psychology, Nottingham Trent University, Nottingham, 50 Shakespeare Street, Nottingham, NG1 4FQ, UK
⁸Faculty of Linguistics and Literature, Bielefeld University, Bielefeld, Universitätsstraße 25, 33615 Bielefeld, Germany
⁹Psychology, University of Nevada, Las Vegas, Las Vegas, 4505 S. Maryland Pkwy., Box 455030, Las Vegas, NV 89154-5030, USA
¹⁰Psychology, University of Wisconsin-Madison, Madison, 1202 West Johnson St. Madison WI. 53706, USA
¹¹Psychology, Missouri State University, 901 S. National Ave, Springfield, MO, 65897, USA
¹²Health, Human Performance, and Recreation, University of Arkansas, Fayetteville, 155 Stadium Drive, HPER 321, Fayetteville, AR, 72701, USA
¹³Department of Development and Regeneration, KU Leuven, Leuven, Herestraat 49 box 805, 3000 Leuven, Belgium
¹³Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Postbus 9600, 2300 RC, Leiden, The Netherlands
¹⁴Department of Psychology, Linnaeus University, Kalmar, Stagneliusgatan 14, 392 34, Kalmar, Sweden
¹⁵Department of Human Development and Psychology, Tzu-Chi University, No. 67, Jieren St., Hualien City, Hualien County, 97074, Taiwan
¹⁶Department of Surgery, University of British Columbia, Victoria, #301 - 1625 Oak Bay Ave, Victoria BC, V8R 1B1, Canada
¹⁷Department of Psychology, University of Cambridge, Cambridge CB2 3EB, UK
¹⁸Centre for Statistics in Medicine, University of Oxford, Windmill Road, Oxford, OX3 7LD, UK
¹⁹Department of Psychology, The University of Edinburgh, 7 George Square, Edinburgh, EH8 9JZ, UK
²⁰School of Psychology, Bangor University, Bangor, Adeilad Brigantia, Bangor, Gwynedd, LL57 2AS, UK
²¹Ramsey Decision Theoretics, 4849 Connecticut Ave. NW #132, Washington, DC 20008, USA
²²Department of Behavioural Sciences and Learning, Linköping University, SE-581 83, Linköping, Sweden
²³Institute of Neuroscience and Psychology, University of Glasgow, Glasgow, 58 Hillhead Street, UK
²⁴College of Social Work, Florida State University, 296 Champions Way, University Center C, Tallahassee, FL, 32304, USA
²⁵Departments of Psychology and Philosophy, Yale University, 2 Hillhouse Ave, New Haven CT 06511, USA
²⁶Department of English, University of Louisiana at Lafayette, P. O. Box 43719, Lafayette LA 70504, USA
²⁷Department of Psychology, St. Edward's University, 3001 S. Congress, Austin, TX 78704, USA
²⁷Department of Psychology, University of Texas at Austin, 108 E. Dean Keeton Stop A8000, Austin, TX 78712-1043, USA
²⁸Department of Management, West Virginia University, 1602 University Avenue, Morgantown, WV 26506, USA
²⁹Department of Psychology, Rutgers University, New Brunswick, 53 Avenue E, Piscataway NJ 08854, USA
³⁰Department of Political Science, Indiana University Purdue University, Indianapolis, Indianapolis, 425 University Blvd CA417, Indianapolis, IN 46202, USA
³¹Booking.com, Herengracht 597, 1017 CE Amsterdam, The Netherlands
³²Department of English, American and Romance Studies, RWTH - Aachen University, Aachen, Kármánstraße 17/19, 52062 Aachen, Germany
³³School of Psychology, Keele University, Keele, Staffordshire, ST5 5BG, UK
³⁴Centre of Excellence for Statistical Innovation, UCB Celltech, 208 Bath Road, Slough, Berkshire SL1 3WE, UK
³⁵Translational Neurosurgery, Eberhard Karls University Tübingen, Tübingen, Germany
³⁵University Tübingen, International Centre for Ethics in Sciences and Humanities, Germany
³⁶Department of Radiology, University of Cambridge, Box 218, Cambridge Biomedical Campus, CB2 0QQ, UK
³⁷Department of Psychiatry, University of Cambridge, Cambridge, 18b Trumpington Road, CB2 8AH, UK
³⁸Behavioural Science Institute, Radboud University Nijmegen, Montessorilaan 3, 6525 HR, Nijmegen, The Netherlands
³⁹Department of Psychology, University of Chester, Chester, CH1 4BJ, UK
⁴⁰Department of Psychology, New York University, 4 Washington Place, New York, NY 10003, USA
⁴¹School of Psychology, University of Nottingham, Nottingham, University Park, NG7 2RD, UK
⁴²None, Independent, Stockholm, Skåpvägen 5, 12245 ENSKEDE, Sweden
⁴³Department of Clinical and Experimental Medicine, University of Linköping, 581 83 Linköping, Sweden
⁴⁴School of Clinical Sciences, University of Bristol, Bristol, Level 2 academic offices, L&R Building, Southmead Hospital, BS10 5NB, UK
⁴⁵Occupational Orthopaedics and Research, Sahlgrenska University Hospital, 413 45 Gothenburg, Sweden
⁴⁶The Faculty of Modern Languages and Literatures, Institute of Linguistics, Psycholinguistics Department, Adam Mickiewicz University, Al. Niepodległości 4, 61-874, Poznań, Poland
⁴⁷Department of Psychological Sciences, University of Connecticut, Storrs, CT, Department of Psychological Sciences, U-1020, Storrs, CT 06269-1020, USA
⁴⁸Center for Stroke Research Berlin, Charité - Universitätsmedizin Berlin, Hindenburgdamm 30, 12200 Berlin, Germany
⁴⁸Max Planck Institute for Human Cognitive and Brain Sciences, Stephanstraße 1a, 04103 Leipzig, Germany
⁴⁸Berlin School of Mind and Brain, Humboldt-Universität zu Berlin, Luisenstraße 56, 10115 Berlin, Germany
⁴⁹Social Sciences, Adam Mickiewicz University, Poznań, Szamarzewskiego 89, 60-568 Poznań, Poland
⁵⁰Department of Psychology, University of Fribourg, Faucigny 2, 1700 Fribourg, Switzerland
⁵¹School of Politics and International Relations, University of Kent, Canterbury CT2 7NX, UK
⁵²Department of Sociology / ICS, University of Groningen, Grote Rozenstraat 31, 9712 TG Groningen, The Netherlands
⁵³Institute of Psychology, Czech Academy of Sciences, Hybernská 8, 11000 Prague, Czech Republic
⁵⁴School of Psychology, University of Nottingham, Nottingham, NG7 2RD, UK
⁵⁵Pardee RAND Graduate School, RAND Corporation, 1200 S Hayes St, Arlington, VA 22202, USA
⁵⁶Psychology and Neuroscience, Baylor University, Waco, One Bear Place 97310, Waco TX, USA
⁵⁷Psychology of Language Department, Max Planck Institute for Psycholinguistics, Nijmegen, Wundtlaan 1, 6525XD, The Netherlands
⁵⁷Department of Psychology, School of Philosophy, Psychology, and Language Sciences, University of Edinburgh, 7 George Square, EH8 9JZ Edinburgh, UK
⁵⁸Dept of Philosophy, Major Williams Hall, Virginia Tech, Blacksburg, VA, US
⁵⁹Center for the Study of Family Violence and Sexual Assault, Northern Illinois University, DeKalb, IL, 125 President's BLVD., DeKalb, IL 60115, USA
⁶⁰School of Mathematics and Statistics, The Open University, Milton Keynes, Walton Hall, Milton Keynes MK7 6AA, UK
⁶¹Skyscanner, 15 Laurison Place, Edinburgh, EH3 9EN, UK
⁶²School of Biomedical Engineering and Imaging Sciences, King's College London, London, UK
⁶³Stress Research Institute, Stockholm University, Stockholm, Frescati Hagväg 16A, SE-10691 Stockholm, Sweden
⁶³Department of Clinical Neuroscience, Karolinska Institutet, Nobels väg 9, SE-17177 Stockholm, Sweden
⁶³Department of Psychology, Stanford University, 450 Serra Mall, Stanford, CA 94305, USA
⁶⁴Laboratory of Behavioral Neurobiology, Department of Physiological Sciences, Federal University of Santa Catarina, Florianópolis, Campus Universitário Trindade, 88040900, Brazil
⁶⁵Department of Kinesiology, KU Leuven, Leuven, Tervuursevest 101 box 1501, B-3001 Leuven, Belgium
⁶⁶Department of Experimental Psychology, University of Oxford, Oxford, UK
⁶⁷Department of Psychology, UiT The Arctic University of Norway, Tromsø, Norway
⁶⁸Department of Psychology, DePaul University, Chicago, 2219 N Kenmore Ave, Chicago, IL 60657, USA
⁶⁹Center for Neural Science, New York University, 4 Washington Pl Room 809 New York, NY 10003, USA
⁷⁰Department of Psychology, Boğaziçi University, Bebek, 34342, Istanbul, Turkey
⁷¹Psychology, University of Cologne, Cologne, Herbert-Lewin-St. 2, 50931, Cologne, Germany
⁷²Saudi Human Genome Program, King Abdulaziz City for Science and Technology (KACST); Integrated Gulf Biosystems, Riyadh, Saudi Arabia
⁷³Cognitive Psychology Unit, Institute of Psychology, Leiden University, Wassenaarseweg 52, 2333 AK Leiden, The Netherlands
⁷³Leiden Institute for Brain and Cognition, Leiden University, Leiden, The Netherlands
⁷⁴Leeds Institute of Health Sciences, University of Leeds, Leeds, LS2 9NL, UK
⁷⁵Institute for Media Studies, KU Leuven, Leuven, Belgium
⁷⁶Center for Open Science, 210 Ridge McIntire Rd Suite 500, Charlottesville, VA 22903, USA
⁷⁶Department of Engineering and Society, University of Virginia, Thornton Hall, P.O. Box 400259, Charlottesville, VA 22904, USA
⁷⁷Surgical Neurology Branch, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD 20892, USA
⁷⁸Department of Psychology, University of Southern Indiana, 8600 University Boulevard, Evansville, Indiana, USA
⁷⁹Life Sciences Communication, University of Wisconsin-Madison, Madison, Wisconsin, 1545 Observatory Drive, Madison, WI 53706, USA
⁸⁰Department of Social Psychology, Institute of Psychology, University of Lausanne, Quartier UNIL-Mouline, Bâtiment Géopolis, CH-1015 Lausanne, Switzerland
⁸¹Departamento de Psicología Básica, Universidad Autónoma de Madrid, c/ Ivan Pavlov 6, 28049 Madrid, Spain
⁸²Department of Methodology and Statistics, Tilburg University, Warandelaan 2, 5000 LE Tilburg, The Netherlands
⁸²Department of Sociology, Utrecht University, Padualaan 14, 3584 CH, Utrecht, The Netherlands
⁸³School of Psychology, Massey University, Auckland, Private Bag 102904, North Shore, Auckland, 0745, New Zealand
⁸⁴Psychology, Saint Louis University, St. Louis, MO, 3700 Lindell Blvd, St. Louis, MO 63108, USA
⁸⁵Psychology, University of California, Davis, Davis, One Shields Ave, Davis, CA 95616, USA
⁸⁶Department of Psychology, University of Texas at Austin, 108 E. Dean Keeton Stop A8000, Austin, TX 78712-1043, USA
⁸⁷Marketing Department, Ghent University, Tweekerkenstraat 2, 9000 Ghent, Belgium
⁸⁸Department of Psychology, Education, and Child Studies, Erasmus University Rotterdam, Rotterdam, Burgemeester Oudlaan 50, 3000 DR, Rotterdam, The Netherlands

Author Contributions. Daniel Lakens, Nicholas W. Fox, Monica Gonzalez-Marquez, James A. Grange, Nicholas P. Holmes, Ahmed A. Khalil, Stephen R. Martin, Vishnu Sreekumar, and Crystal N. Steltenpohl participated in brainstorming, drafting the commentary, and data analysis. Casper J. Albers, Shlomo E. Argamon, Thom Baguley, Erin M. Buchanan, Ben van Calster, Zander Crook, Sameera Daniels, Daniel J. Dunleavy, Brian D. Earp, Jason D. Ferrell, James G. Field, Anne-Laura van Harmelen, Michael Ingre, Peder M. Isager, Hanna K. Isotalus, Junpeng Lao, Gerine M. A. Lodder, David Manheim, Andrea E. Martin, Kevin McConway, Amanda Q. X. Nio, Gustav Nilsonne, Cilene Lino de Oliveira, Jean-Jacques Orban de Xivry, Gerit Pfuhl, Kimberly A. Quinn, Iris K. Schneider, Zsuzsika Sjoerds, Samuel G. Smith, Jeffrey R. Spies, Marcel A. L. M. Van Assen, Matt N. Williams, Donald R. Williams, Tal Yarkoni, and Rolf A. Zwaan participated in brainstorming and drafting the commentary. Federico G. Adolfi, Raymond B. Becker, Michele I. Feist, and Sam Parsons participated in drafting the commentary and data analysis. Matthew A. J. Apps, Stephen D. Benning, Daniel E. Bradford, Sau-Chin Chen, Bryan Chung, Lincoln J Colling, Henrik Danielsson, Lisa DeBruine, Mark R. Hoffarth, Erik Gahner Larsen, Randy J. McCarthy, John J. Sakon, S. Adil Saribay, Tim Smits, Neil Stenhouse, Wojciech Świątkowski, and Miguel A. Vadillo participated in brainstorming. Farid Anvari, Aaron R. Caldwell, Rickard Carlsson, Emily S. Cross, Amanda Friesen, Caio Gomes, Andrew P. Grieve, Robert Guggenberger, James Grist, Kevin D. Hochard, Christer Johansson, Konrad Juszczyk, David A. Kenny, Barbara Konat, Jiří Lukavský, Christopher R. Madan, Deborah G. Mayo, Colin McFarland, Manojkumar Selvaraju, Samantha E. Williams, and Ignazio Ziano did not participate in drafting the commentary because the points that they would have raised had already been incorporated into the commentary, or endorse a sufficiently large part of the contents as if participation had occurred. Except for the first author, authorship order is alphabetical.

Acknowledgements: We’d like to thank Dale Barr, Felix Cheung, David Colquhoun, Hans IJzerman, Harvey Motulsky, and Richard Morey for helpful discussions while drafting this commentary. Daniel Lakens was supported by NWO VIDI 452-17-013. Federico G. Adolfi was supported by CONICET. Matthew Apps was funded by a Biotechnology and Biological Sciences Research Council AFL Fellowship (BB/M013596/1). Gary Collins was supported by the NIHR Biomedical Research Centre, Oxford. Zander Crook was supported by the Economic and Social Research Council [grant number C106891X]. Emily S. Cross was supported by the European Research Council (ERC-2015-StG-677270). Lisa DeBruine is supported by the European Research Council (ERC-2014-CoG-647910 KINSHIP). Anne-Laura van Harmelen is funded by a Royal Society Dorothy Hodgkin Fellowship (DH150176). Mark R. Hoffarth was supported by the National Science Foundation under grant SBE SPRF-FR 1714446. Junpeng Lao was supported by the SNSF grant 100014_156490/1. Cilene Lino de Oliveira was supported by AvH, Capes, CNPq. Andrea E. Martin was supported by the Economic and Social Research Council of the United Kingdom [grant number ES/K009095/1]. Jean-Jacques Orban de Xivry is supported by an internal grant from the KU Leuven (STG/14/054) and by the Fonds voor Wetenschappelijk Onderzoek (1519916N). Sam Parsons was supported by the European Research Council (FP7/2007–2013; ERC grant agreement no. 324176). Gerine Lodder was funded by NWO VICI 453-14-016. Samuel Smith is supported by a Cancer Research UK Fellowship (C42785/A17965). Vishnu Sreekumar was supported by the NINDS Intramural Research Program (IRP). Miguel A. Vadillo was supported by Grant 2016-T1/SOC-1395 from Comunidad de Madrid. Tal Yarkoni was supported by NIH award R01MH109682.

Competing Interests: The authors declare no competing interests.

Abstract: In response to recommendations to redefine statistical significance to p ≤ .005, we propose that researchers should transparently report and justify all choices they make when designing a study, including the alpha level.

Benjamin et al.¹ proposed changing the conventional “statistical significance” threshold (i.e., the alpha level) from p ≤ .05 to p ≤ .005 for all novel claims with relatively low prior odds. They provided two arguments for why lowering the significance threshold would “immediately improve the reproducibility of scientific research.” First, a p-value near .05 provides weak evidence for the alternative hypothesis. Second, under certain assumptions, an alpha of .05 leads to high false positive report probabilities (FPRP²; the probability that a significant finding is a false positive).

We share their concerns regarding the apparent non-replicability of many scientific studies, and agree that a universal alpha of .05 is undesirable. However, redefining “statistical significance” to a lower, but equally arbitrary threshold, is inadvisable for three reasons: (1) there is insufficient evidence that the current standard is a “leading cause of non-reproducibility”¹; (2) the arguments in favor of a blanket default of p ≤ .005 do not warrant the immediate and widespread implementation of such a policy; and (3) a lower significance threshold will likely have negative consequences not discussed by Benjamin and colleagues. We conclude that the term “statistically significant” should no longer be used and suggest that researchers employing null hypothesis significance testing justify their choice for an alpha level before collecting the data, instead of adopting a new uniform standard.

Lack of evidence that p ≤ .005 improves replicability

Benjamin et al.¹ claimed that the expected proportion of replicable studies should be considerably higher for studies observing p ≤ .005 than for studies observing .005 < p ≤ .05, due to a lower FPRP. Theoretically, replicability is related to the FPRP, and lower alpha levels will reduce false positive results in the literature. However, in practice, the impact of lowering alpha levels depends on several unknowns, such as the prior odds that the examined hypotheses are true, the statistical power of studies, and the (change in) behavior of researchers in response to any modified standards.

An analysis of the results of the Reproducibility Project: Psychology³ showed that 49% (23/47) of the original findings with p-values below .005 yielded p ≤ .05 in the replication study, whereas only 24% (11/45) of the original studies with .005 < p ≤ .05 yielded p ≤ .05 (χ²(1) = 5.92, p = .015, BF₁₀ = 6.84). Benjamin and colleagues presented this as evidence of “potential gains in reproducibility that would accrue from the new threshold.” According to their own proposal, however, this evidence is only “suggestive” of such a conclusion, and there is considerable variation in replication rates across p-values (see Figure 1). Importantly, lower replication rates for p-values just below .05 are likely confounded by p-hacking (the practice of flexibly analyzing data until the p-value passes the “significance” threshold). Thus, the differences in replication rates between studies with .005 < p ≤ .05 compared to those with p ≤ .005 may not be entirely due to the level of evidence. Further analyses are needed to explain the low (49%) replication rate of studies with p ≤ .005, before this alpha level is recommended as a new significance threshold for novel discoveries across scientific disciplines.

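As a rough check on the comparison of replication rates reported above, the test statistic can be recomputed from the two counts (23/47 and 11/45). The sketch below assumes a Pearson chi-square test on the 2×2 table without continuity correction, which is our reading of the reported statistic rather than a documented detail of the analysis; the function name `chi_square_2x2` is hypothetical. Under that assumption the values should agree with the reported χ²(1) = 5.92 and p = .015 to rounding. The Bayes factor is not recomputed here.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper-tail probability equals erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Rows: original p <= .005, then .005 < p <= .05.
# Columns: replication reached p <= .05, replication did not.
chi2, p_value = chi_square_2x2(23, 47 - 23, 11, 45 - 11)
print(f"chi2(1) = {chi2:.2f}, p = {p_value:.3f}")
```
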
Weak justifications for the α = .005 threshold

We agree with Benjamin et al. that single p-values close to .05 never provide strong “evidence” against the null hypothesis. Nonetheless, the argument that p-values provide weak evidence based on Bayes factors has been questioned⁴. Given that the marginal likelihood is sensitive to different choices for the models being compared, redefining alpha levels as a function of the Bayes factor is undesirable. For instance, Benjamin and colleagues stated that p-values of .005 imply Bayes factors between 14 and 26. However, these upper bounds only hold for a Bayes factor based on a point null model and when the p-value is calculated for a two-sided test, whereas one-sided tests or Bayes factors for non-point null models would imply different alpha thresholds. When a test yields BF = 25 the data are interpreted as strong relative evidence for a specific alternative (e.g., μ = 2.81), while a p ≤ .005 only warrants the more modest rejection of a null effect without allowing one to reject even small positive effects with a reasonable error rate⁵. Benjamin et al. provided no rationale for why the new p-value threshold should align with equally arbitrary Bayes factor thresholds. We question the idea that the alpha level at which an error rate is controlled should be based on the amount of relative evidence indicated by Bayes factors.

The second argument for α = .005 is that the FPRP can be high with α = .05. Calculating the FPRP requires a definition of the alpha level, the power of the tests examining true effects, and the ratio of true to false hypotheses tested (the prior odds). Figure 2 in Benjamin et al. displays FPRPs for scenarios where most hypotheses are false, with prior odds of 1:5, 1:10, and 1:40. The recommended p ≤ .005 threshold reduces the minimum FPRP to less than 5%, assuming 1:10 prior odds (the true FPRP might still be substantially higher in studies with very low power). This prior odds estimate is based on data from the Reproducibility Project: Psychology³ using an analysis modelling publication bias for 73 studies⁶. Without stating the reference class for the “base-rate of true nulls” (e.g., does this refer to all hypotheses in science, in a discipline, or by a single researcher?), the concept of “prior odds that H1 is true” has little meaning. Furthermore, there is insufficient representative data to accurately estimate the prior odds that researchers examine a true hypothesis, and thus, there is currently no strong argument based on FPRP to redefine statistical significance.

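To make the dependence of the FPRP on its three inputs concrete, the sketch below uses the standard expression FPRP = α / (α + power × prior odds), where the prior odds are the ratio of true to false hypotheses tested. The function name `fprp` and the choice to treat power = 1 as the best case are illustrative assumptions. With 1:10 prior odds and perfect power, α = .005 gives a minimum FPRP just under 5%, whereas lower power pushes the FPRP above that figure.

```python
def fprp(alpha, power, prior_odds):
    """False positive report probability.

    prior_odds is the ratio of true to false hypotheses tested
    (for example, 1 / 10 for prior odds of 1:10).
    """
    false_positives = alpha              # expected per false hypothesis tested
    true_positives = power * prior_odds  # expected per false hypothesis tested
    return false_positives / (false_positives + true_positives)


for alpha in (0.05, 0.005):
    for odds_label, odds in (("1:5", 1 / 5), ("1:10", 1 / 10), ("1:40", 1 / 40)):
        minimum = fprp(alpha, power=1.0, prior_odds=odds)  # best case: power = 1
        low_power = fprp(alpha, power=0.2, prior_odds=odds)
        print(f"alpha={alpha}, prior odds {odds_label}: "
              f"minimum FPRP={minimum:.3f}, FPRP at 20% power={low_power:.3f}")
```
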
How a threshold of p ≤ .005 might harm scientific practice

Benjamin et al. acknowledged that their proposal has strengths as well as weaknesses, but believed that its “efficacy gains would far outweigh losses.” We are not convinced and see at least three likely negative consequences of adopting a lowered threshold.

Risk of fewer replication studies. All else being equal, lowering the alpha level requires larger sample sizes and creates an even greater strain on already limited resources. Achieving 80% power with α = .005, compared to α = .05, requires a 70% larger sample size for between-subjects designs with two-sided tests (88% for one-sided tests). While Benjamin et al. propose α = .005 exclusively for “new effects” (and not replications), designing larger original studies would leave fewer resources (i.e., time, money, participants) for replication studies, assuming fixed resources overall. At a time when replications are already relatively rare and unrewarded, lowering alpha to .005 might therefore reduce resources spent on replicating the work of others. More generally, recommendations for evidence thresholds need to carefully balance statistical and non-statistical considerations (e.g., the value of evidence for a novel claim vs. the value of independent replications).

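The sample size figures above can be approximated with a short calculation. The sketch below assumes a normal approximation, under which the required sample size is proportional to the squared sum of the critical value and the quantile for the desired power. The function name `relative_sample_size` is hypothetical. Under this approximation the increase comes out at roughly 70% for two-sided tests and just under 90% for one-sided tests; small differences from the figures in the text presumably reflect exact calculations based on the t distribution.

```python
from statistics import NormalDist


def relative_sample_size(alpha_new, alpha_old, power=0.80, two_sided=True):
    """Ratio of required sample sizes at two alpha levels (normal approximation)."""
    z = NormalDist().inv_cdf
    tails = 2 if two_sided else 1

    def required(alpha):
        # n is proportional to (z_crit + z_power)^2; effect size and variance cancel.
        return (z(1 - alpha / tails) + z(power)) ** 2

    return required(alpha_new) / required(alpha_old)


two_sided = relative_sample_size(0.005, 0.05, two_sided=True)
one_sided = relative_sample_size(0.005, 0.05, two_sided=False)
print(f"two-sided: {100 * (two_sided - 1):.0f}% larger sample")
print(f"one-sided: {100 * (one_sided - 1):.0f}% larger sample")
```
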
Risk of reduced generalisability and breadth. Requiring larger sample sizes across scientific disciplines may exacerbate over-reliance on convenience samples (e.g., undergraduate students, online samples). Specifically, without (1) increased funding, (2) a reward system that values large-scale collaboration, and (3) clear recommendations for how to evaluate research with sample size constraints, lowering the significance threshold could adversely affect the breadth of research questions examined. Compared to studies that use convenience samples, studies with unique populations (e.g., people with rare genetic variants, patients with post-traumatic stress disorder) or with time- or resource-intensive data collection (e.g., longitudinal studies) require considerably more research funds and effort to increase the sample size. Thus, researchers may become less motivated to study unique populations or collect difficult-to-obtain data, reducing the generalisability and breadth of findings.

Risk of exaggerating the focus on single p-values. Benjamin et al.’s proposal risks (1) reinforcing the idea that relying on p-values is a sufficient, if imperfect, way to evaluate findings, and (2) discouraging opportunities for more fruitful changes in scientific practice and education. Even though Benjamin et al. do not propose p ≤ .005 as a publication threshold, some bias in favor of significant results will remain, in which case redefining p ≤ .005 as “statistically significant” would result in greater upward bias in effect size estimates. Furthermore, it diverts attention from the cumulative evaluation of findings, such as converging results of multiple (replication) studies.

No one alpha to rule them all

We have two key recommendations. First, we recommend that the label “statistically significant” should no longer be used. Instead, researchers should provide more meaningful interpretations of the theoretical or practical relevance of their results. Second, authors should transparently specify, and justify, their design choices. Depending on their choice of statistical approach, these may include the alpha level, the null and alternative models, assumed prior odds, statistical power for a specified effect size of interest, the sample size, and/or the desired accuracy of estimation. We do not endorse a single value for any design parameter, but instead propose that authors justify their choices before data are collected. Fellow researchers can then evaluate these decisions, ideally also prior to data collection, for example, by reviewing a Registered Report submission⁷. Providing researchers (and reviewers) with accessible information about ways to justify (and evaluate) design choices, tailored to specific research areas, will improve current research practices.

Benjamin et al. noted that some fields, such as genomics and physics, have lowered the “default” alpha level. However, in genomics the overall false positive rate is still controlled at 5%; the lower alpha level is only used to correct for multiple comparisons. In physics, researchers have argued against a blanket rule, and for an alpha level based on factors such as the surprisingness of the predicted result and its practical or theoretical impact⁸. In non-human animal research, minimizing the number of animals used needs to be directly balanced against the probability and cost of false positives. Depending on these and other considerations, the optimal alpha level for a given research question could be higher or lower than the current convention of .05⁹,¹⁰,¹¹.

Benjamin et al. stated that a “critical mass of researchers” endorse the standard of a p ≤ .005 threshold for “statistical significance.” However, the presence of a critical mass can only be identified after a norm has been widely adopted, not before. Even if a p ≤ .005 threshold were widely accepted, this would only reinforce the misconception that a single alpha level is universally applicable. Ideally, the alpha level is determined by comparing costs and benefits against a utility function using decision theory¹². This cost-benefit analysis (and thus the alpha level)¹³ differs when analyzing large existing datasets compared to collecting data from hard-to-obtain samples.

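One simple way to illustrate such a cost-benefit analysis is to choose the alpha level that minimizes the expected cost of errors for a given design. The sketch below is only an illustration under stated assumptions, not a recommended procedure: it assumes a two-sided, two-sample z-test, a grid search over alpha, and hypothetical inputs (`prior_h1`, `cost_fp`, `cost_fn`, and the function names) that a researcher would need to justify for their own setting.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_two_sample_z(alpha, d, n_per_group):
    """Approximate power of a two-sided, two-sample z-test for standardized effect d."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * (n_per_group / 2) ** 0.5  # expected value of the test statistic under H1
    return (1 - _STD_NORMAL.cdf(z_crit - shift)) + _STD_NORMAL.cdf(-z_crit - shift)


def expected_cost(alpha, d, n_per_group, prior_h1, cost_fp, cost_fn):
    """Expected cost of errors: weighted Type I rate plus weighted Type II rate."""
    beta = 1 - power_two_sample_z(alpha, d, n_per_group)
    return (1 - prior_h1) * cost_fp * alpha + prior_h1 * cost_fn * beta


def best_alpha(d, n_per_group, prior_h1=0.5, cost_fp=1.0, cost_fn=1.0):
    """Grid search over alpha in (0, 0.5) for the lowest expected cost."""
    grid = [i / 10000 for i in range(1, 5000)]
    return min(grid, key=lambda a: expected_cost(a, d, n_per_group, prior_h1, cost_fp, cost_fn))


# Hypothetical scenarios: a large existing dataset versus a hard-to-obtain sample.
print("n = 200 per group:", best_alpha(d=0.5, n_per_group=200))
print("n = 30 per group: ", best_alpha(d=0.5, n_per_group=30))
```

In this sketch, the alpha level with the lowest expected cost is smaller when the sample is large and when false positives are weighted more heavily, which is one way to see why a single threshold need not suit every design.
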
Conclusion

Science is diverse, and it is up to scientists to justify the alpha level they decide to use. As Fisher noted¹⁴: “...no scientific worker has a fixed level of significance at which, from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.” Research should be guided by principles of rigorous science¹⁵, not by heuristics and arbitrary blanket thresholds. These principles include not only sound statistical analyses, but also experimental redundancy (e.g., replication, validation, and generalisation), avoidance of logical traps, intellectual honesty, research workflow transparency, and accounting for potential sources of error. Single studies, regardless of their p-value, are never enough to conclude that there is strong evidence for a substantive claim. We need to train researchers to assess cumulative evidence and work towards an unbiased scientific literature. We call for a broader mandate beyond p-value thresholds whereby all justifications of key choices in research design and statistical practice are transparently evaluated, fully accessible, and pre-registered whenever feasible.

References

1. Benjamin, D. J., et al. Nature Human Behaviour 2, 6-10 https://doi.org/10.1038/s41562-017-0189-z (2017).
2. Wacholder, S., Chanock, S., Garcia-Closas, M., El Ghormli, L., & Rothman, N. Journal of the National Cancer Institute 96, 434-442 https://doi.org/10.1093/jnci/djh075 (2004).
3. Open Science Collaboration. Science 349(6251), 1-8 https://doi.org/10.1126/science.aac4716 (2015).
4. Senn, S. Statistical issues in drug development (2nd ed). (John Wiley & Sons, 2007).
5. Mayo, D. Statistical inference as severe testing: How to get beyond the statistics wars. (Cambridge University Press, 2018).
6. Johnson, V. E., Payne, R. D., Wang, T., Asher, A., & Mandal, S. Journal of the American Statistical Association 112(517), 1–10 https://doi.org/10.1080/01621459.2016.1240079 (2017).
7. Chambers, C. D., Dienes, Z., McIntosh, R. D., Rotshtein, P., & Willmes, K. Cortex 66, A1-2 https://doi.org/10.1016/j.cortex.2015.03.022 (2015).
8. Lyons, L. Discovering the Significance of 5 sigma. Preprint at http://arxiv.org/abs/1310.1284 (2013).
9. Field, S. A., Tyre, A. J., Jonzen, N., Rhodes, J. R., & Possingham, H. P. Ecology Letters 7(8), 669-675 https://doi.org/10.1111/j.1461-0248.2004.00625.x (2004).
10. Grieve, A. P. Pharmaceutical Statistics 14(2), 139–150 https://doi.org/10.1002/pst.1667 (2015).
11. Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. PLOS ONE 7(2), e32734 https://doi.org/10.1371/journal.pone.0032734 (2012).
12. Skipper, J. K., Guenther, A. L., & Nass, G. The American Sociologist 2(1), 16–18 (1967).
13. Neyman, J., & Pearson, E. S. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 231, 694–706 https://doi.org/10.1098/rsta.1933.0009 (1933).
14. Fisher, R. A. Statistical methods and scientific inferences. (Hafner, 1956).
15. Casadevall, A., & Fang, F. C. mBio 7(6), e01902-16 https://doi.org/10.1128/mbio.01902-16 (2016).

Figure Caption

Figure 1. The proportion of studies³ replicated at α = .05 (with a bin width of .005). Window start and end positions are plotted on the horizontal axis. The error bars denote 95% Jeffreys confidence intervals. R code to reproduce Figure 1 is available from https://osf.io/by2kc/.

[Figure 1 image. Vertical axis: Proportion of studies replicated (0.00 to 1.00). Horizontal axis: Original study p-value (0.000 to 0.050). Point size: number of studies (10 to 40).]