

Justify Your Alpha

In Press, Nature Human Behaviour

Daniel Lakens*1, Federico G. Adolfi2, Casper J. Albers3, Farid Anvari4, Matthew A. J. Apps5, Shlomo E. Argamon6, Thom Baguley7, Raymond B. Becker8, Stephen D. Benning9, Daniel E. Bradford10, Erin M. Buchanan11, Aaron R. Caldwell12, Ben van Calster13, Rickard Carlsson14, Sau-Chin Chen15, Bryan Chung16, Lincoln J Colling17, Gary S. Collins18, Zander Crook19, Emily S. Cross20, Sameera Daniels21, Henrik Danielsson22, Lisa DeBruine23, Daniel J. Dunleavy24, Brian D. Earp25, Michele I. Feist26, Jason D. Ferrell27, James G. Field28, Nicholas W. Fox29, Amanda Friesen30, Caio Gomes31, Monica Gonzalez-Marquez32, James A. Grange33, Andrew P. Grieve34, Robert Guggenberger35, James Grist36, Anne-Laura van Harmelen37, Fred Hasselman38, Kevin D. Hochard39, Mark R. Hoffarth40, Nicholas P. Holmes41, Michael Ingre42, Peder M. Isager43, Hanna K. Isotalus44, Christer Johansson45, Konrad Juszczyk46, David A. Kenny47, Ahmed A. Khalil48, Barbara Konat49, Junpeng Lao50, Erik Gahner Larsen51, Gerine M. A. Lodder52, Jiří Lukavský53, Christopher R. Madan54, David Manheim55, Stephen R. Martin56, Andrea E. Martin57, Deborah G. Mayo58, Randy J. McCarthy59, Kevin McConway60, Colin McFarland61, Amanda Q. X. Nio62, Gustav Nilsonne63, Cilene Lino de Oliveira64, Jean-Jacques Orban de Xivry65, Sam Parsons66, Gerit Pfuhl67, Kimberly A. Quinn68, John J. Sakon69, S. Adil Saribay70, Iris K. Schneider71, Manojkumar Selvaraju72, Zsuzsika Sjoerds73, Samuel G. Smith74, Tim Smits75, Jeffrey R. Spies76, Vishnu Sreekumar77, Crystal N. Steltenpohl78, Neil Stenhouse79, Wojciech Świątkowski80, Miguel A. Vadillo81, Marcel A. L. M. Van Assen82, Matt N. Williams83, Samantha E. Williams84, Donald R. Williams85, Tal Yarkoni86, Ignazio Ziano87, Rolf A. Zwaan88

Affiliations

*1Human-Technology Interaction, Eindhoven University of Technology, Den Dolech, 5600MB, Eindhoven, The Netherlands
2Laboratory of Experimental Psychology and Neuroscience (LPEN), Institute of Cognitive and Translational Neuroscience (INCYT), INECO Foundation, Favaloro University, Pacheco de Melo 1860, Buenos Aires, Argentina
2National Scientific and Technical Research Council (CONICET), Godoy Cruz 2290, Buenos Aires, Argentina
3Heymans Institute for Psychological Research, University of Groningen, Grote Kruisstraat 2/1, 9712TS Groningen, The Netherlands
4College of Education, Psychology & Social Work, Flinders University, GPO Box 2100, Adelaide, SA, 5001, Australia
5Department of Experimental Psychology, University of Oxford, New Radcliffe House, Oxford, OX2 6GG, UK
6Department of Computer Science, Illinois Institute of Technology, 10 W. 31st Street, Chicago, IL 60645, USA
7Department of Psychology, Nottingham Trent University, 50 Shakespeare Street, Nottingham, NG1 4FQ, UK
8Faculty of Linguistics and Literature, Bielefeld University, Universitätsstraße 25, 33615 Bielefeld, Germany
9Psychology, University of Nevada, Las Vegas, 4505 S. Maryland Pkwy., Box 455030, Las Vegas, NV 89154-5030, USA
10Psychology, University of Wisconsin-Madison, 1202 West Johnson St., Madison, WI 53706, USA
11Psychology, Missouri State University, 901 S. National Ave, Springfield, MO, 65897, USA
12Health, Human Performance, and Recreation, University of Arkansas, 155 Stadium Drive, HPER 321, Fayetteville, AR, 72701, USA
13Department of Development and Regeneration, KU Leuven, Herestraat 49 box 805, 3000 Leuven, Belgium
13Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Postbus 9600, 2300 RC, Leiden, The Netherlands
14Department of Psychology, Linnaeus University, Stagneliusgatan 14, 392 34 Kalmar, Sweden
15Department of Human Development and Psychology, Tzu-Chi University, No. 67, Jieren St., Hualien City, Hualien County, 97074, Taiwan
16Department of Surgery, University of British Columbia, #301 - 1625 Oak Bay Ave, Victoria, BC, V8R 1B1, Canada
17Department of Psychology, University of Cambridge, Cambridge CB2 3EB, UK
18Centre for Statistics in Medicine, University of Oxford, Windmill Road, Oxford, OX3 7LD, UK
19Department of Psychology, The University of Edinburgh, 7 George Square, Edinburgh, EH8 9JZ, UK
20School of Psychology, Bangor University, Adeilad Brigantia, Bangor, Gwynedd, LL57 2AS, UK
21Ramsey Decision Theoretics, 4849 Connecticut Ave. NW #132, Washington, DC 20008, USA
22Department of Behavioural Sciences and Learning, Linköping University, SE-581 83 Linköping, Sweden
23Institute of Neuroscience and Psychology, University of Glasgow, 58 Hillhead Street, Glasgow, UK
24College of Social Work, Florida State University, 296 Champions Way, University Center C, Tallahassee, FL, 32304, USA
25Departments of Psychology and Philosophy, Yale University, 2 Hillhouse Ave, New Haven, CT 06511, USA
26Department of English, University of Louisiana at Lafayette, P. O. Box 43719, Lafayette, LA 70504, USA
27Department of Psychology, St. Edward's University, 3001 S. Congress, Austin, TX 78704, USA
27Department of Psychology, University of Texas at Austin, 108 E. Dean Keeton Stop A8000, Austin, TX 78712-1043, USA
28Department of Management, West Virginia University, 1602 University Avenue, Morgantown, WV 26506, USA
29Department of Psychology, Rutgers University, New Brunswick, 53 Avenue E, Piscataway, NJ 08854, USA
30Department of Political Science, Indiana University Purdue University Indianapolis, 425 University Blvd CA417, Indianapolis, IN 46202, USA
31Booking.com, Herengracht 597, 1017 CE Amsterdam, The Netherlands
32Department of English, American and Romance Studies, RWTH Aachen University, Kármánstraße 17/19, 52062 Aachen, Germany
33School of Psychology, Keele University, Keele, Staffordshire, ST5 5BG, UK
34Centre of Excellence for Statistical Innovation, UCB Celltech, 208 Bath Road, Slough, Berkshire SL1 3WE, UK
35Translational Neurosurgery, Eberhard Karls University Tübingen, Tübingen, Germany
35University of Tübingen, International Centre for Ethics in Sciences and Humanities, Tübingen, Germany
36Department of Radiology, University of Cambridge, Box 218, Cambridge Biomedical Campus, CB2 0QQ, UK
37Department of Psychiatry, University of Cambridge, 18b Trumpington Road, Cambridge, CB2 8AH, UK
38Behavioural Science Institute, Radboud University Nijmegen, Montessorilaan 3, 6525 HR, Nijmegen, The Netherlands
39Department of Psychology, University of Chester, Chester, CH1 4BJ, UK
40Department of Psychology, New York University, 4 Washington Place, New York, NY 10003, USA
41School of Psychology, University of Nottingham, University Park, Nottingham, NG7 2RD, UK
42Independent researcher, Skåpvägen 5, 12245 Enskede, Stockholm, Sweden
43Department of Clinical and Experimental Medicine, University of Linköping, 581 83 Linköping, Sweden
44School of Clinical Sciences, University of Bristol, Level 2 Academic Offices, L&R Building, Southmead Hospital, Bristol, BS10 5NB, UK
45Occupational Orthopaedics and Research, Sahlgrenska University Hospital, 413 45 Gothenburg, Sweden
46The Faculty of Modern Languages and Literatures, Institute of Linguistics, Psycholinguistics Department, Adam Mickiewicz University, Al. Niepodległości 4, 61-874, Poznań, Poland
47Department of Psychological Sciences, University of Connecticut, U-1020, Storrs, CT 06269-1020, USA
48Center for Stroke Research Berlin, Charité - Universitätsmedizin Berlin, Hindenburgdamm 30, 12200 Berlin, Germany
48Max Planck Institute for Human Cognitive and Brain Sciences, Stephanstraße 1a, 04103 Leipzig, Germany
48Berlin School of Mind and Brain, Humboldt-Universität zu Berlin, Luisenstraße 56, 10115 Berlin, Germany
49Social Sciences, Adam Mickiewicz University, Szamarzewskiego 89, 60-568 Poznań, Poland
50Department of Psychology, University of Fribourg, Faucigny 2, 1700 Fribourg, Switzerland
51School of Politics and International Relations, University of Kent, Canterbury CT2 7NX, UK
52Department of Sociology / ICS, University of Groningen, Grote Rozenstraat 31, 9712 TG Groningen, The Netherlands
53Institute of Psychology, Czech Academy of Sciences, Hybernská 8, 11000 Prague, Czech Republic
54School of Psychology, University of Nottingham, Nottingham, NG7 2RD, UK
55Pardee RAND Graduate School, RAND Corporation, 1200 S Hayes St, Arlington, VA 22202, USA
56Psychology and Neuroscience, Baylor University, One Bear Place 97310, Waco, TX, USA
57Psychology of Language Department, Max Planck Institute for Psycholinguistics, Wundtlaan 1, 6525 XD Nijmegen, The Netherlands
57Department of Psychology, School of Philosophy, Psychology, and Language Sciences, University of Edinburgh, 7 George Square, Edinburgh, EH8 9JZ, UK
58Department of Philosophy, Major Williams Hall, Virginia Tech, Blacksburg, VA, USA
59Center for the Study of Family Violence and Sexual Assault, Northern Illinois University, 125 President's Blvd., DeKalb, IL 60115, USA
60School of Mathematics and Statistics, The Open University, Walton Hall, Milton Keynes, MK7 6AA, UK
61Skyscanner, 15 Laurison Place, Edinburgh, EH3 9EN, UK
62School of Biomedical Engineering and Imaging Sciences, King's College London, London, UK
63Stress Research Institute, Stockholm University, Frescati Hagväg 16A, SE-10691 Stockholm, Sweden
63Department of Clinical Neuroscience, Karolinska Institutet, Nobels väg 9, SE-17177 Stockholm, Sweden
63Department of Psychology, Stanford University, 450 Serra Mall, Stanford, CA 94305, USA
64Laboratory of Behavioral Neurobiology, Department of Physiological Sciences, Federal University of Santa Catarina, Campus Universitário Trindade, 88040900 Florianópolis, Brazil
65Department of Kinesiology, KU Leuven, Tervuursevest 101 box 1501, B-3001 Leuven, Belgium
66Department of Experimental Psychology, University of Oxford, Oxford, UK
67Department of Psychology, UiT The Arctic University of Norway, Tromsø, Norway
68Department of Psychology, DePaul University, 2219 N Kenmore Ave, Chicago, IL 60657, USA
69Center for Neural Science, New York University, 4 Washington Pl, Room 809, New York, NY 10003, USA
70Department of Psychology, Boğaziçi University, Bebek, 34342, Istanbul, Turkey
71Psychology, University of Cologne, Herbert-Lewin-St. 2, 50931 Cologne, Germany
72Saudi Human Genome Program, King Abdulaziz City for Science and Technology (KACST); Integrated Gulf Biosystems, Riyadh, Saudi Arabia
73Cognitive Psychology Unit, Institute of Psychology, Leiden University, Wassenaarseweg 52, 2333 AK Leiden, The Netherlands
73Leiden Institute for Brain and Cognition, Leiden University, Leiden, The Netherlands
74Leeds Institute of Health Sciences, University of Leeds, Leeds, LS2 9NL, UK
75Institute for Media Studies, KU Leuven, Leuven, Belgium
76Center for Open Science, 210 Ridge McIntire Rd, Suite 500, Charlottesville, VA 22903, USA
76Department of Engineering and Society, University of Virginia, Thornton Hall, P.O. Box 400259, Charlottesville, VA 22904, USA
77Surgical Neurology Branch, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD 20892, USA
78Department of Psychology, University of Southern Indiana, 8600 University Boulevard, Evansville, IN, USA
79Life Sciences Communication, University of Wisconsin-Madison, 1545 Observatory Drive, Madison, WI 53706, USA
80Department of Social Psychology, Institute of Psychology, University of Lausanne, Quartier UNIL-Mouline, Bâtiment Géopolis, CH-1015 Lausanne, Switzerland
81Departamento de Psicología Básica, Universidad Autónoma de Madrid, c/ Ivan Pavlov 6, 28049 Madrid, Spain
82Department of Methodology and Statistics, Tilburg University, Warandelaan 2, 5000 LE Tilburg, The Netherlands
82Department of Sociology, Utrecht University, Padualaan 14, 3584 CH, Utrecht, The Netherlands
83School of Psychology, Massey University, Private Bag 102904, North Shore, Auckland, 0745, New Zealand
84Psychology, Saint Louis University, 3700 Lindell Blvd, St. Louis, MO 63108, USA
85Psychology, University of California, Davis, One Shields Ave, Davis, CA 95616, USA
86Department of Psychology, University of Texas at Austin, 108 E. Dean Keeton Stop A8000, Austin, TX 78712-1043, USA
87Marketing Department, Ghent University, Tweekerkenstraat 2, 9000 Ghent, Belgium
88Department of Psychology, Education, and Child Studies, Erasmus University Rotterdam, Burgemeester Oudlaan 50, 3000 DR Rotterdam, The Netherlands

Author Contributions. Daniel Lakens, Nicholas W. Fox, Monica Gonzalez-Marquez, James A. Grange, Nicholas P. Holmes, Ahmed A. Khalil, Stephen R. Martin, Vishnu Sreekumar, and Crystal N. Steltenpohl participated in brainstorming, drafting the commentary, and data-analysis. Casper J. Albers, Shlomo E. Argamon, Thom Baguley, Erin M. Buchanan, Ben van Calster, Zander Crook, Sameera Daniels, Daniel J. Dunleavy, Brian D. Earp, Jason D. Ferrell, James G. Field, Anne-Laura van Harmelen, Michael Ingre, Peder M. Isager, Hanna K. Isotalus, Junpeng Lao, Gerine M. A. Lodder, David Manheim, Andrea E. Martin, Kevin McConway, Amanda Q. X. Nio, Gustav Nilsonne, Cilene Lino de Oliveira, Jean-Jacques Orban de Xivry, Gerit Pfuhl, Kimberly A. Quinn, Iris K. Schneider, Zsuzsika Sjoerds, Samuel G. Smith, Jeffrey R. Spies, Marcel A. L. M. Van Assen, Matt N. Williams, Donald R. Williams, Tal Yarkoni, and Rolf A. Zwaan participated in brainstorming and drafting the commentary. Federico G. Adolfi, Raymond B. Becker, Michele I. Feist, and Sam Parsons participated in drafting the commentary and data-analysis.
Matthew A. J. Apps, Stephen D. Benning, Daniel E. Bradford, Sau-Chin Chen, Bryan Chung, Lincoln J Colling, Henrik Danielsson, Lisa DeBruine, Mark R. Hoffarth, Erik Gahner Larsen, Randy J. McCarthy, John J. Sakon, S. Adil Saribay, Tim Smits, Neil Stenhouse, Wojciech Świątkowski, and Miguel A. Vadillo participated in brainstorming. Farid Anvari, Aaron R. Caldwell, Rickard Carlsson, Emily S. Cross, Amanda Friesen, Caio Gomes, Andrew P. Grieve, Robert Guggenberger, James Grist, Kevin D. Hochard, Christer Johansson, Konrad Juszczyk, David A. Kenny, Barbara Konat, Jiří Lukavský, Christopher R. Madan, Deborah G. Mayo, Colin McFarland, Manojkumar Selvaraju, Samantha E. Williams, and Ignazio Ziano did not participate in drafting the commentary, either because the points they would have raised had already been incorporated into the commentary or because they endorsed a sufficiently large part of its contents as if they had participated in drafting it. Except for the first author, authorship order is alphabetical.

Acknowledgements: We'd like to thank Dale Barr, Felix Cheung, David Colquhoun, Hans IJzerman, Harvey Motulsky, and Richard Morey for helpful discussions while drafting this commentary. Daniel Lakens was supported by NWO VIDI grant 452-17-013. Federico G. Adolfi was supported by CONICET. Matthew Apps was funded by a Biotechnology and Biological Sciences Research Council AFL Fellowship (BB/M013596/1). Gary Collins was supported by the NIHR Biomedical Research Centre, Oxford. Zander Crook was supported by the Economic and Social Research Council [grant number C106891X]. Emily S. Cross was supported by the European Research Council (ERC-2015-StG-677270). Lisa DeBruine is supported by the European Research Council (ERC-2014-CoG-647910 KINSHIP). Anne-Laura van Harmelen is funded by a Royal Society Dorothy Hodgkin Fellowship (DH150176). Mark R. Hoffarth was supported by the National Science Foundation under grant SBE SPRF-FR 1714446. Junpeng Lao was supported by SNSF grant 100014_156490/1. Cilene Lino de Oliveira was supported by AvH, Capes, and CNPq. Andrea E. Martin was supported by the Economic and Social Research Council of the United Kingdom [grant number ES/K009095/1]. Jean-Jacques Orban de Xivry is supported by an internal grant from KU Leuven (STG/14/054) and by the Fonds voor Wetenschappelijk Onderzoek (1519916N). Sam Parsons was supported by the European Research Council (FP7/2007-2013; ERC grant agreement no. 324176). Gerine Lodder was funded by NWO VICI grant 453-14-016. Samuel Smith is supported by a Cancer Research UK Fellowship (C42785/A17965). Vishnu Sreekumar was supported by the NINDS Intramural Research Program (IRP). Miguel A. Vadillo was supported by Grant 2016-T1/SOC-1395 from Comunidad de Madrid. Tal Yarkoni was supported by NIH award R01MH109682.

Competing Interests: The authors declare no competing interests.

Abstract: In response to recommendations to redefine statistical significance to p ≤ .005, we propose that researchers should transparently report and justify all choices they make when designing a study, including the alpha level.

Justify Your Alpha
Benjamin et al.1 proposed changing the conventional "statistical significance" threshold (i.e., the alpha level) from p ≤ .05 to p ≤ .005 for all novel claims with relatively low prior odds. They provided two arguments for why lowering the significance threshold would "immediately improve the reproducibility of scientific research." First, a p-value near .05 provides only weak evidence for the alternative hypothesis. Second, under certain assumptions, an alpha of .05 leads to high false positive report probabilities (FPRP2; the probability that a significant finding is a false positive).

We share their concerns regarding the apparent non-replicability of many scientific studies, and we agree that a universal alpha of .05 is undesirable. However, redefining "statistical significance" to a lower, but equally arbitrary, threshold is inadvisable for three reasons: (1) there is insufficient evidence that the current standard is a "leading cause of non-reproducibility"1; (2) the arguments in favor of a blanket default of p ≤ .005 do not warrant the immediate and widespread implementation of such a policy; and (3) a lower significance threshold is likely to have negative consequences not discussed by Benjamin and colleagues. We conclude that the term "statistically significant" should no longer be used, and we suggest that researchers employing null hypothesis significance testing justify their choice of alpha level before collecting the data, instead of adopting a new uniform standard.

Lack of evidence that p ≤ .005 improves replicability

Benjamin et al.1 claimed that the expected proportion of replicable studies should be considerably higher for studies observing p ≤ .005 than for studies observing .005 < p ≤ .05, due to a lower FPRP. Theoretically, replicability is related to the FPRP, and lower alpha levels will reduce the number of false positive results in the literature. In practice, however, the impact of lowering alpha levels depends on several unknowns, such as the prior odds that the examined hypotheses are true, the statistical power of studies, and the (change in) behavior of researchers in response to any modified standards.

An analysis of the results of the Reproducibility Project: Psychology3 showed that 49% (23/47) of the original findings with p-values below .005 yielded p ≤ .05 in the replication study, whereas only 24% (11/45) of the original studies with .005 < p ≤ .05 did (χ2(1) = 5.92, p = .015, BF10 = 6.84). Benjamin and colleagues presented this as evidence of "potential gains in reproducibility that would accrue from the new threshold." According to their own proposal, however, this evidence is only "suggestive" of such a conclusion, and there is considerable variation in replication rates across p-values (see Figure 1). Importantly, the lower replication rates for p-values just below .05 are likely confounded by p-hacking (the practice of flexibly analyzing data until the p-value passes the "significance" threshold). Thus, the differences in replication rates between studies with .005 < p ≤ .05 and those with p ≤ .005 may not be entirely due to the level of evidence. Further analyses are needed to explain the low (49%) replication rate of studies with p ≤ .005 before this alpha level is recommended as a new significance threshold for novel discoveries across scientific disciplines.
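For readers who want to check the contingency-table comparison reported above, here is a minimal sketch in base R using the counts quoted in the text (the reported Bayes factor is not reproduced, since it requires tooling beyond base R):

    # Replication outcomes from the Reproducibility Project: Psychology, as quoted above:
    # original p <= .005:       23 of 47 replicated at p <= .05
    # .005 < original p <= .05: 11 of 45 replicated at p <= .05
    counts <- matrix(c(23, 24,
                       11, 34),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(original = c("p <= .005", ".005 < p <= .05"),
                                     replication = c("replicated", "not replicated")))
    chisq.test(counts, correct = FALSE)               # X-squared ~ 5.92, df = 1, p ~ .015
    prop.test(c(23, 11), c(47, 45), correct = FALSE)  # replication rates ~ .49 vs. ~ .24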
Weak justifications for the α = .005 threshold

We agree with Benjamin et al. that single p-values close to .05 never provide strong "evidence" against the null hypothesis. Nonetheless, the argument that p-values provide weak evidence based on Bayes factors has been questioned4. Given that the marginal likelihood is sensitive to different choices for the models being compared, redefining alpha levels as a function of the Bayes factor is undesirable. For instance, Benjamin and colleagues stated that p-values of .005 imply Bayes factors between 14 and 26. However, these upper bounds only hold for a Bayes factor based on a point null model and when the p-value is calculated for a two-sided test, whereas one-sided tests or Bayes factors for non-point null models would imply different alpha thresholds. When a test yields BF = 25, the data are interpreted as strong relative evidence for a specific alternative (e.g., μ = 2.81), while p ≤ .005 only warrants the more modest rejection of a null effect, without allowing one to reject even small positive effects with a reasonable error rate5. Benjamin et al. provided no rationale for why the new p-value threshold should align with equally arbitrary Bayes factor thresholds. We question the idea that the alpha level at which an error rate is controlled should be based on the amount of relative evidence indicated by Bayes factors.

The second argument for α = .005 is that the FPRP can be high with α = .05. Calculating the FPRP requires a definition of the alpha level, the power of the tests examining true effects, and the ratio of true to false hypotheses tested (the prior odds). Figure 2 in Benjamin et al. displays FPRPs for scenarios where most hypotheses are false, with prior odds of 1:5, 1:10, and 1:40. The recommended p ≤ .005 threshold reduces the minimum FPRP to less than 5%, assuming 1:10 prior odds (the true FPRP might still be substantially higher in studies with very low power). This prior odds estimate is based on data from the Reproducibility Project: Psychology3, using an analysis modelling publication bias for 73 studies6. Without stating the reference class for the "base-rate of true nulls" (e.g., does this refer to all hypotheses in science, in a discipline, or by a single researcher?), the concept of "prior odds that H1 is true" has little meaning. Furthermore, there is insufficient representative data to accurately estimate the prior odds that researchers examine a true hypothesis, and thus there is currently no strong argument based on the FPRP to redefine statistical significance.
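The FPRP calculation described above can be written out directly. The function below implements the standard expression FPRP = α / (α + power × prior odds), with prior odds defined as P(H1)/P(H0); the specific power values are illustrative assumptions, not estimates:

    # False positive report probability for a given alpha, power, and prior odds of H1
    fprp <- function(alpha, power, prior_odds) {
      alpha / (alpha + power * prior_odds)
    }
    fprp(alpha = .005, power = 1.00, prior_odds = 1/10)  # ~ .048: the minimum FPRP at 1:10 odds
    fprp(alpha = .005, power = 0.20, prior_odds = 1/10)  # ~ .20: much higher when power is low
    fprp(alpha = .050, power = 0.80, prior_odds = 1/10)  # ~ .38: for comparison, at alpha = .05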
How a threshold of p ≤ .005 might harm scientific practice

Benjamin et al. acknowledged that their proposal has strengths as well as weaknesses, but believe that its "efficacy gains would far outweigh losses." We are not convinced and see at least three likely negative consequences of adopting a lowered threshold.

Risk of fewer replication studies. All else being equal, lowering the alpha level requires larger sample sizes and creates an even greater strain on already limited resources. Achieving 80% power with α = .005, compared to α = .05, requires a 70% larger sample size for between-subjects designs with two-sided tests (88% for one-sided tests). While Benjamin et al. propose α = .005 exclusively for "new effects" (and not replications), designing larger original studies would leave fewer resources (i.e., time, money, participants) for replication studies, assuming fixed resources overall. At a time when replications are already relatively rare and unrewarded, lowering alpha to .005 might therefore reduce resources spent on replicating the work of others. More generally, recommendations for evidence thresholds need to carefully balance statistical and non-statistical considerations (e.g., the value of evidence for a novel claim vs. the value of independent replications).
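The sample-size figures quoted above can be checked with a standard power calculation. The sketch below uses base R's power.t.test with an arbitrary standardized effect size (delta = 0.5); the ratio of required sample sizes is what matters here, and it is nearly independent of the effect size chosen:

    # Per-group n for 80% power in a two-sample t-test, alpha = .05 vs. alpha = .005
    n_05  <- power.t.test(delta = 0.5, sd = 1, sig.level = .050, power = .80)$n
    n_005 <- power.t.test(delta = 0.5, sd = 1, sig.level = .005, power = .80)$n
    n_005 / n_05      # ~ 1.7, i.e. roughly 70% more participants per group (two-sided)

    # Same comparison for one-sided tests
    n_05_1  <- power.t.test(delta = 0.5, sig.level = .050, power = .80, alternative = "one.sided")$n
    n_005_1 <- power.t.test(delta = 0.5, sig.level = .005, power = .80, alternative = "one.sided")$n
    n_005_1 / n_05_1  # ~ 1.9, consistent with the ~88% increase quoted above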
Risk of reduced generalisability and breadth. Requiring larger sample sizes across scientific disciplines may exacerbate over-reliance on convenience samples (e.g., undergraduate students, online samples). Specifically, without (1) increased funding, (2) a reward system that values large-scale collaboration, and (3) clear recommendations for how to evaluate research with sample size constraints, lowering the significance threshold could adversely affect the breadth of research questions examined. Compared to studies that use convenience samples, studies with unique populations (e.g., people with rare genetic variants, patients with post-traumatic stress disorder) or with time- or resource-intensive data collection (e.g., longitudinal studies) require considerably more research funds and effort to increase the sample size. Thus, researchers may become less motivated to study unique populations or to collect difficult-to-obtain data, reducing the generalisability and breadth of findings.

Risk of exaggerating the focus on single p-values. Benjamin et al.'s proposal risks (1) reinforcing the idea that relying on p-values is a sufficient, if imperfect, way to evaluate findings, and (2) discouraging opportunities for more fruitful changes in scientific practice and education. Even though Benjamin et al. do not propose p ≤ .005 as a publication threshold, some bias in favor of significant results will remain, in which case redefining p ≤ .005 as "statistically significant" would result in a greater upward bias in effect size estimates. Furthermore, it diverts attention from the cumulative evaluation of findings, such as converging results across multiple (replication) studies.

No one alpha to rule them all

We have two key recommendations. First, we recommend that the label "statistically significant" no longer be used. Instead, researchers should provide more meaningful interpretations of the theoretical or practical relevance of their results. Second, authors should transparently specify, and justify, their design choices. Depending on their choice of statistical approach, these may include the alpha level, the null and alternative models, assumed prior odds, statistical power for a specified effect size of interest, the sample size, and/or the desired accuracy of estimation. We do not endorse a single value for any design parameter, but instead propose that authors justify their choices before data are collected. Fellow researchers can then evaluate these decisions, ideally also prior to data collection, for example by reviewing a Registered Report submission7. Providing researchers (and reviewers) with accessible information about ways to justify (and evaluate) design choices, tailored to specific research areas, will improve current research practices.

Benjamin et al. noted that some fields, such as genomics and physics, have lowered the "default" alpha level. However, in genomics the overall false positive rate is still controlled at 5%; the lower alpha level is only used to correct for multiple comparisons. In physics, researchers have argued against a blanket rule and for an alpha level based on factors such as the surprisingness of the predicted result and its practical or theoretical impact8. In non-human animal research, minimizing the number of animals used needs to be balanced directly against the probability and cost of false positives. Depending on these and other considerations, the optimal alpha level for a given research question could be higher or lower than the current convention of .059,10,11.

Benjamin et al. stated that a "critical mass of researchers" endorse the standard of a p ≤ .005 threshold for "statistical significance." However, the presence of a critical mass can only be identified after a norm has been widely adopted, not before. Even if a p ≤ .005 threshold were widely accepted, this would only reinforce the misconception that a single alpha level is universally applicable. Ideally, the alpha level is determined by comparing costs and benefits against a utility function using decision theory12. This cost-benefit analysis (and thus the alpha level)13 differs when analyzing large existing datasets compared with collecting data from hard-to-obtain samples.
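To make the decision-theoretic idea concrete, here is a deliberately simplified sketch: the expected cost of errors for a two-sided, two-sample test (normal approximation) is minimized over alpha. The sample size, effect size, prior probability of a true effect, and relative error costs are illustrative assumptions only, not recommendations:

    # Expected cost of decision errors as a function of alpha
    expected_cost <- function(alpha, n = 64, d = 0.5, p_h1 = 0.5,
                              cost_fp = 1, cost_fn = 1) {
      # approximate power of a two-sided two-sample z-test with n per group and effect size d
      # (the negligible lower rejection region is ignored)
      power <- pnorm(qnorm(1 - alpha / 2) - d * sqrt(n / 2), lower.tail = FALSE)
      (1 - p_h1) * cost_fp * alpha + p_h1 * cost_fn * (1 - power)
    }
    # False positives judged four times as costly as false negatives:
    optimize(expected_cost, c(1e-4, 0.5), cost_fp = 4, cost_fn = 1)$minimum  # ~ .03
    # False negatives judged four times as costly as false positives:
    optimize(expected_cost, c(1e-4, 0.5), cost_fp = 1, cost_fn = 4)$minimum  # ~ .24

Under the first set of assumptions the cost-minimizing alpha falls below .05, while under the second it falls well above it, illustrating that no single threshold is optimal across research contexts.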
Conclusion

Science is diverse, and it is up to scientists to justify the alpha level they decide to use. As Fisher noted14: "...no scientific worker has a fixed level of significance at which, from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas." Research should be guided by principles of rigorous science15, not by heuristics and arbitrary blanket thresholds. These principles include not only sound statistical analyses, but also experimental redundancy (e.g., replication, validation, and generalisation), avoidance of logical traps, intellectual honesty, research workflow transparency, and accounting for potential sources of error. Single studies, regardless of their p-value, are never enough to conclude that there is strong evidence for a substantive claim. We need to train researchers to assess cumulative evidence and to work towards an unbiased scientific literature. We call for a broader mandate, beyond p-value thresholds, whereby all justifications of key choices in research design and statistical practice are transparently evaluated, fully accessible, and pre-registered whenever feasible.

References

1. Benjamin, D. J., et al. Nature Human Behaviour 2, 6-10, https://doi.org/10.1038/s41562-017-0189-z (2017).
2. Wacholder, S., Chanock, S., Garcia-Closas, M., El Ghormli, L., & Rothman, N. Journal of the National Cancer Institute 96, 434-442, https://doi.org/10.1093/jnci/djh075 (2004).
3. Open Science Collaboration. Science 349(6251), 1-8, https://doi.org/10.1126/science.aac4716 (2015).
4. Senn, S. Statistical Issues in Drug Development (2nd ed.). (John Wiley & Sons, 2007).
5. Mayo, D. Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. (Cambridge University Press, 2018).
6. Johnson, V. E., Payne, R. D., Wang, T., Asher, A., & Mandal, S. Journal of the American Statistical Association 112(517), 1-10, https://doi.org/10.1080/01621459.2016.1240079 (2017).
7. Chambers, C. D., Dienes, Z., McIntosh, R. D., Rotshtein, P., & Willmes, K. Cortex 66, A1-A2, https://doi.org/10.1016/j.cortex.2015.03.022 (2015).
8. Lyons, L. Discovering the Significance of 5 sigma. Preprint at http://arxiv.org/abs/1310.1284 (2013).
9. Field, S. A., Tyre, A. J., Jonzen, N., Rhodes, J. R., & Possingham, H. P. Ecology Letters 7(8), 669-675, https://doi.org/10.1111/j.1461-0248.2004.00625.x (2004).
10. Grieve, A. P. Pharmaceutical Statistics 14(2), 139-150, https://doi.org/10.1002/pst.1667 (2015).
11. Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. PLOS ONE 7(2), e32734, https://doi.org/10.1371/journal.pone.0032734 (2012).
12. Skipper, J. K., Guenther, A. L., & Nass, G. The American Sociologist 2(1), 16-18 (1967).
13. Neyman, J., & Pearson, E. S. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 231, 694-706, https://doi.org/10.1098/rsta.1933.0009 (1933).
14. Fisher, R. A. Statistical Methods and Scientific Inference. (Hafner, 1956).
15. Casadevall, A., & Fang, F. C. mBio 7(6), e01902-16, https://doi.org/10.1128/mbio.01902-16 (2016).

Figure Caption

Figure 1. The proportion of studies3 replicated at α = .05 (with a bin width of .005). Window start and end positions are plotted on the horizontal axis. The error bars denote 95% Jeffreys confidence intervals. R code to reproduce Figure 1 is available from https://osf.io/by2kc/.

[Figure 1: the proportion of studies replicated (vertical axis, 0 to 1) is plotted against the original study p-value (horizontal axis, 0 to .05); point size indicates the number of studies (10 to 40) in each window.]
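The Jeffreys intervals mentioned in the caption are straightforward to compute. The sketch below (base R) is only an illustration applied to the two replication proportions discussed in the text; it is not the script used to produce Figure 1, which is archived at the OSF link above:

    # 95% Jeffreys interval for x successes out of n trials:
    # the central interval of a Beta(x + 1/2, n - x + 1/2) posterior
    jeffreys_ci <- function(x, n, level = 0.95) {
      tail <- (1 - level) / 2
      qbeta(c(tail, 1 - tail), x + 0.5, n - x + 0.5)
    }
    jeffreys_ci(23, 47)  # interval around the 49% replication rate for original p <= .005
    jeffreys_ci(11, 45)  # interval around the 24% replication rate for .005 < p <= .05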