
Statistical Evidence in Experimental Psychology

2011, Perspectives on Psychological Science


Statistical Evidence in Experimental Psychology: An Empirical Comparison Using 855 t Tests

Ruud Wetzels (1), Dora Matzke (1), Michael D. Lee (2), Jeffrey N. Rouder (3), Geoffrey J. Iverson (2), and Eric-Jan Wagenmakers (1)

1 Department of Psychology, University of Amsterdam, Amsterdam, The Netherlands; 2 Department of Cognitive Sciences, University of California, Irvine; 3 Department of Psychological Sciences, University of Missouri-Columbia

Perspectives on Psychological Science, 6(3), 291–298. © The Author(s) 2011. DOI: 10.1177/1745691611406923

Abstract

Statistical inference in psychology has traditionally relied heavily on p-value significance testing. This approach to drawing conclusions from data, however, has been widely criticized, and two types of remedies have been advocated. The first proposal is to supplement p values with complementary measures of evidence, such as effect sizes. The second is to replace inference by p values with Bayesian measures of evidence, such as the Bayes factor. The authors provide a practical comparison of p values, effect sizes, and default Bayes factors as measures of statistical evidence, using 855 recently published t tests in psychology. The comparison yields two main results. First, although p values and default Bayes factors almost always agree about what hypothesis is better supported by the data, the measures often disagree about the strength of this support; for 70% of the data sets for which the p value falls between .01 and .05, the default Bayes factor indicates that the evidence is only anecdotal. Second, effect sizes can provide additional evidence to p values and default Bayes factors. The authors conclude that the Bayesian approach is comparatively prudent, preventing researchers from overestimating the evidence in favor of an effect.

Keywords: hypothesis testing, t test, p value, effect size, Bayes factor

Experimental psychologists use statistical procedures to convince themselves and their peers that the effect of interest is real, reliable, replicable, and hence worthy of academic attention. A representative example comes from Mussweiler (2006), who studied whether particular actions can activate a corresponding stereotype. To test this hypothesis empirically, Mussweiler unobtrusively induced half the participants, the experimental group, to move in a portly manner that is stereotypic for the overweight. The other half, the control group, made no such movements. Next, all participants were given an ambiguous description of a target person and then used a 9-point scale (ranging from 1 = not at all to 9 = very) to rate this person on dimensions that correspond to the overweight stereotype (e.g., "unhealthy," "sluggish," and "insecure"). To assess whether performing the stereotypic motion affected the rating of the ambiguous target person, Mussweiler computed a t statistic, t(18) = 2.1, and found that this value corresponded to a low p value (p < .05).[1] Following conventional protocol, Mussweiler concluded that the low p value should be taken to provide "initial support for the hypothesis that engaging in stereotypic movements activates the corresponding stereotype" (Mussweiler, 2006, p. 28). The use of t tests and corresponding p values in this way constitutes a common and widely accepted practice in the psychological literature. It is, however, not the only possible or reasonable approach to measuring evidence and making statistical and scientific inferences.
Indeed, the use of t tests and p values has been widely criticized (e.g., Cohen, 1994; Cumming, 2008; Dixon, 2003; Howard, Maxwell, & Fleming, 2000; Lee & Wagenmakers, 2005; Loftus, 1996; Nickerson, 2000; Wagenmakers, 2007). There are at least two different criticisms, coming from different perspectives and resulting in different remedies. First, many have argued that null hypothesis tests should be supplemented with other statistical measures, such as confidence intervals and effect sizes. Within psychology, this approach to remediation has sometimes been institutionalized, being required by journal editors or recommended by the American Psychological Association (e.g., American Psychological Association, 2010; Cohen, 1988; Erdfelder, 2010; Wilkinson & the Task Force on Statistical Inference, 1999). A second, more fundamental criticism, which comes from Bayesian statistics, is that there are basic conceptual and practical problems with p values. Although Bayesian criticism of psychological statistical practice dates back to at least Edwards, Lindman, and Savage (1963), it has become especially prominent and increasingly influential in the last decade (e.g., Dienes, 2008; Gallistel, 2009; Kruschke, 2010a, 2010c; Lee, 2008; Myung, Forster, & Browne, 2000; Rouder, Speckman, Sun, Morey, & Iverson, 2009). One standard Bayesian measure for quantifying the amount of evidence the data provide for an experimental effect is the Bayes factor (Gönen, Johnson, Lu, & Westfall, 2005; Rouder et al., 2009; Wetzels, Raaijmakers, Jakab, & Wagenmakers, 2009). The measure takes the form of an odds ratio: It is the probability of the data under one hypothesis relative to that under another (Dienes, 2011; Kass & Raftery, 1995; Lee & Wagenmakers, 2005).

With this background, it seems that psychological statistical practice currently stands at a three-way fork in the road. Staying on the current path means continuing to rely on p values. A modest change is to place greater focus on the additional inferential information provided by effect sizes and confidence intervals. A radical change is to move to Bayesian approaches, such as Bayes factors. The path that psychological science chooses seems likely to matter. It is not just that there are philosophical differences between the three choices. It is also clear that the three measures of evidence can be mutually inconsistent (e.g., Berger & Sellke, 1987; Rouder et al., 2009; Wagenmakers, 2007; Wagenmakers & Grünwald, 2006; Wagenmakers, Lodewyckx, Kuriyal, & Grasman, 2010).

In this article, we assess the practical consequences of choosing among inference by p values, by effect sizes, and by Bayes factors. By practical consequences, we mean the extent to which the conclusions of extant studies change according to the inference measure that is used. To assess these practical consequences, we reanalyzed 855 t tests reported in articles from the 2007 issues of Psychonomic Bulletin & Review (PBR) and Journal of Experimental Psychology: Learning, Memory, and Cognition (JEP:LMC). For each t test, we compute the p value, the effect size, and the Bayes factor and study the extent to which they provide information that is redundant, complementary, or inconsistent.
On the basis of these analyses, we suggest the best direction for measuring statistical evidence from psychological experiments.

Three Measures of Evidence

In this section, we describe how to calculate and interpret the p value, the effect size, and the Bayes factor. For concreteness, we use Mussweiler's (2006) study on the effect of action on stereotypes. The mean score of the control group, Mc, was 5.8 on a weight-stereotype scale (sc = 0.69, nc = 10), and the mean score of the experimental group, Me, was 6.4 (se = 0.66, ne = 10).

The p value

The interpretation of p values is not straightforward, and their use in hypothesis testing is heavily debated (Cohen, 1994; Cortina & Dunlap, 1997; Cumming, 2008; Dixon, 2003; Frick, 1996; Gigerenzer, 1993, 1998; Hagen, 1997; Killeen, 2005, 2006; Kruschke, 2010a, 2010c; Lee & Wagenmakers, 2005; Loftus, 1996; Nickerson, 2000; Schmidt, 1996; Wagenmakers & Grünwald, 2006; Wainer, 1999). The p value is the probability of obtaining a test statistic (in this case, the t statistic) at least as extreme as the one that was observed in the experiment, given that the null hypothesis is true and the sample was generated according to a specific intended procedure, such as a fixed sample size. Fisher (1935) interpreted p values as evidence against the null hypothesis: the smaller the p value, the more evidence against the null. Fisher viewed these values as self-explanatory measures of evidence that did not need further guidance. In practice, however, most researchers (and reviewers) adopt a .05 cutoff: p values less than .05 constitute evidence for an effect, and those greater than .05 do not. More fine-grained categories are possible, and Wasserman (2004, p. 157) proposes the gradations shown in the top of Table 1. Note that the top part of Table 1 lists various categories of evidence against the null hypothesis. A basic limitation of null hypothesis significance testing is that it does not allow a researcher to gather evidence in favor of the null (Dennis, Lee, & Kinnell, 2008; Gallistel, 2009; Rouder et al., 2009; Wetzels et al., 2009).

For the data from Mussweiler (2006), we compute a p value based on the t test, which is designed to test whether a difference between two means is significant. First, we calculate the t statistic:

\[ t = \frac{M_e - M_c}{\sqrt{s^2_{\mathrm{pooled}}\left(\frac{1}{n_e} + \frac{1}{n_c}\right)}} = \frac{6.42 - 5.79}{\sqrt{0.46\left(\frac{1}{10} + \frac{1}{10}\right)}} = 2.09, \]

where Me and Mc are the means of both groups, ne and nc are the sample sizes, and s²_pooled estimates the common population variance:

\[ s^2_{\mathrm{pooled}} = \frac{(n_e - 1)\, s_e^2 + (n_c - 1)\, s_c^2}{n_e + n_c - 2}. \]

Next, the t statistic with ne + nc − 2 = 18 degrees of freedom results in a p value slightly larger than .05 (≈ .051). For our concrete example, Table 1 leads to the conclusion that the p value is on the cusp between "no evidence against H0" and "positive evidence against H0."
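To make this computation concrete, the short Python sketch below (ours, not part of the original article; it assumes scipy is available) reproduces the t statistic and two-sided p value from the reported summary statistics.

```python
# Illustrative sketch: two-sample t statistic and two-sided p value from the
# summary statistics reported for Mussweiler (2006).
import math
from scipy import stats

m_e, s_e, n_e = 6.42, 0.66, 10   # experimental group: mean, SD, sample size
m_c, s_c, n_c = 5.79, 0.69, 10   # control group

# Pooled estimate of the common population variance
s2_pooled = ((n_e - 1) * s_e**2 + (n_c - 1) * s_c**2) / (n_e + n_c - 2)

# Two-sample t statistic and its degrees of freedom
t = (m_e - m_c) / math.sqrt(s2_pooled * (1 / n_e + 1 / n_c))
df = n_e + n_c - 2

# Two-sided p value from the t distribution
p = 2 * stats.t.sf(abs(t), df)

print(f"t({df}) = {t:.2f}, p = {p:.3f}")   # expect roughly t(18) = 2.09, p just above .05
```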
Table 1. Evidence categories for p values (adapted from Wasserman, 2004, p. 157), for effect sizes (as proposed by Cohen, 1988), and for the Bayes factor BF_A0 (Jeffreys, 1961)

p value
  < .001         Decisive evidence against H0
  .001–.01       Substantive evidence against H0
  .01–.05        Positive evidence against H0
  > .05          No evidence against H0

Effect size
  < 0.2          Small effect size
  0.2–0.5        Small to medium effect size
  0.5–0.8        Medium to large effect size
  > 0.8          Large to very large effect size

Bayes factor BF_A0
  > 100          Decisive evidence for HA
  30–100         Very strong evidence for HA
  10–30          Strong evidence for HA
  3–10           Substantial evidence for HA
  1–3            Anecdotal evidence for HA
  1              No evidence
  1/3–1          Anecdotal evidence for H0
  1/10–1/3       Substantial evidence for H0
  1/30–1/10      Strong evidence for H0
  1/100–1/30     Very strong evidence for H0
  < 1/100        Decisive evidence for H0

Note: For the Bayes factor categories, we replaced the label "worth no more than a bare mention" with "anecdotal." Also, in contrast to p values, the Bayes factor can quantify evidence in favor of the null hypothesis.

The effect size

Effect sizes quantify the magnitude of an effect and serve as a measure of how much the results deviate from the null hypothesis (Cohen, 1988; Richard, Bond, & Stokes-Zoota, 2003; Rosenthal, 1990; Rosenthal & Rubin, 1982; Thompson, 2002). For the data from Mussweiler (2006), the effect size, d, is calculated as follows:

\[ d = \frac{M_e - M_c}{s_{\mathrm{pooled}}} = \frac{6.42 - 5.79}{0.68} = 0.93. \]

Note that in contrast to the p value, the effect size is independent of sample size; increasing the sample size does not increase the effect size but instead allows it to be estimated more accurately. Effect sizes are often interpreted in terms of the categories introduced by Cohen (1988), as listed in the middle of Table 1, ranging from "small" to "very large." For our concrete example, d = 0.93, and we conclude that this effect is large to very large. Interestingly, the p value was on the cusp between the categories "no evidence against H0" and "positive evidence against H0," whereas the effect size indicates the effect to be strong.
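As a companion to the p-value sketch above (again an illustration of ours, not code from the article), Cohen's d follows from the same summary statistics:

```python
# Illustrative sketch: Cohen's d for the Mussweiler (2006) summary statistics.
import math

m_e, s_e, n_e = 6.42, 0.66, 10   # experimental group
m_c, s_c, n_c = 5.79, 0.69, 10   # control group

# Pooled standard deviation, then standardized mean difference
s_pooled = math.sqrt(((n_e - 1) * s_e**2 + (n_c - 1) * s_c**2) / (n_e + n_c - 2))
d = (m_e - m_c) / s_pooled

print(f"d = {d:.2f}")   # about 0.93: "large to very large" according to Table 1
```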
The Bayes factor

In Bayesian statistics, uncertainty (or degree of belief) is quantified by probability distributions over parameters. This makes the Bayesian approach fundamentally different from the classical "frequentist" approach, which relies on sampling distributions of data (Berger & Delampady, 1987; Berger & Wolpert, 1988; Jaynes, 2003; Lindley, 1972). Within the Bayesian framework, one may quantify the evidence for one hypothesis relative to another. The Bayes factor is the most commonly used (although certainly not the only possible) Bayesian measure for doing so (Jeffreys, 1961; Kass & Raftery, 1995). The Bayes factor is the probability of the data under one hypothesis relative to the other. When a hypothesis is a simple point, such as the null, the probability of the data under this hypothesis is simply the likelihood evaluated at that point. When a hypothesis consists of a range of points, such as all positive effect sizes, the probability of the data under this hypothesis is the weighted average of the likelihood across that range. This averaging automatically controls for the complexity of different models, as has been emphasized in the Bayesian literature in psychology (e.g., Pitt, Myung, & Zhang, 2002; Rouder et al., 2009). We take as the null hypothesis that a parameter a is restricted to 0 (i.e., H0: a = 0), and we take as the alternative that a is not zero (i.e., HA: a ≠ 0).

In this case, the Bayes factor given data D is simply the ratio

\[ \mathrm{BF}_{A0} = \frac{p(D \mid H_A)}{p(D \mid H_0)} = \frac{\int p(D \mid H_A, a)\, p(a \mid H_A)\, da}{p(D \mid H_0)}, \]

where the integral takes the average evidence over all values of a, weighted by the prior probability of those values, p(a | HA), under the alternative hypothesis. An alternative, but formally equivalent, conceptualization of the Bayes factor is as a measure of the change from prior model odds to posterior model odds, brought about by the observed data. This change is often interpreted as the weight of evidence (Good, 1983, 1985). Before seeing the data D, the two hypotheses H0 and HA are assigned prior probabilities p(H0) and p(HA). The ratio of the two prior probabilities defines the prior odds. When the data D are observed, the prior odds are updated to posterior odds, defined as the ratio of the posterior probabilities p(HA | D) and p(H0 | D):

\[ \frac{p(H_A \mid D)}{p(H_0 \mid D)} = \frac{p(D \mid H_A)}{p(D \mid H_0)} \times \frac{p(H_A)}{p(H_0)}. \qquad (1) \]

Equation 1 shows that the change from prior odds to posterior odds is quantified by p(D | HA)/p(D | H0): the Bayes factor, BF_A0. Under either conceptualization, the Bayes factor has an appealing and direct interpretation as an odds ratio. For example, BF_A0 = 2 implies that the data are twice as likely to have occurred under HA than under H0. Jeffreys (1961) proposed a set of verbal labels to categorize the Bayes factor according to its evidential impact. This set of labels, presented at the bottom of Table 1, facilitates scientific communication but should only be considered an approximate descriptive articulation of different standards of evidence (Kass & Raftery, 1995).

In general, calculating Bayes factors is more difficult than calculating p values and effect sizes. However, psychologists can now turn to easy-to-use Web pages to calculate the Bayes factor for many common experimental situations or use software such as WinBUGS (Lunn, Thomas, Best, & Spiegelhalter, 2000; Wetzels, Lee, & Wagenmakers, 2010; Wetzels et al., 2009).[2] In this article, we use the Bayes factor calculation described in Rouder et al. (2009). Rouder et al.'s development is suitable for one-sample and two-sample designs, and the only necessary input is the t value and the sample size. The Bayes factor that we report in this article is the result of a default Bayesian t test (for details, see Rouder et al., 2009). The test is default because it applies regardless of the phenomenon under study: For every experiment, one uses the same prior distribution on effect size under the alternative hypothesis, the Cauchy(0,1) distribution. This prior distribution has statistical advantages that make it an appropriate default choice (for example, it has excellent theoretical properties in the limits N → ∞ and t → ∞; for details, see Liang, Paulo, Molina, Clyde, & Berger, 2008). The default test is easy to use and avoids the informed specification of prior distributions that other researchers may contest. Conversely, one may argue that the informed specification of priors is the appropriate way to take problem-specific prior knowledge into account. Bayesian statisticians are divided over the relative merits of default versus informed specifications of prior distributions (Press, Chib, Clyde, Woodworth, & Zaslavsky, 2003).
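The default Bayes factor of Rouder et al. (2009) reduces to a one-dimensional numerical integral. The sketch below is ours, not the authors' code: it assumes scipy and follows the published JZS formulation, in which the Cauchy(0,1) prior on effect size is expressed as an inverse-gamma(1/2, 1/2) mixture over a scale parameter g.

```python
# Illustrative sketch of the default (JZS) Bayes factor for a two-sample t test,
# in the spirit of Rouder et al. (2009): Cauchy(0,1) prior on effect size.
import math
from scipy import integrate

def jzs_bf_A0(t, n1, n2):
    """Default Bayes factor BF_A0 (alternative over null) from t and group sizes."""
    nu = n1 + n2 - 2                  # degrees of freedom
    n_eff = n1 * n2 / (n1 + n2)       # effective sample size for a two-sample design

    # Marginal density of the data under H0 (up to a constant shared with HA)
    null_density = (1 + t**2 / nu) ** (-(nu + 1) / 2)

    # Under HA, average the likelihood over g, weighted by its
    # inverse-gamma(1/2, 1/2) prior density
    def integrand(g):
        prior = (2 * math.pi) ** -0.5 * g ** -1.5 * math.exp(-1 / (2 * g))
        like = (1 + n_eff * g) ** -0.5 * \
               (1 + t**2 / ((1 + n_eff * g) * nu)) ** (-(nu + 1) / 2)
        return like * prior

    alt_density, _ = integrate.quad(integrand, 0, math.inf)
    return alt_density / null_density

# Running example: t = 2.09 with 10 observations per group; the result is
# expected to be close to the BF_A0 reported for this example in the text below.
print(round(jzs_bf_A0(2.09, 10, 10), 2))
```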
In our opinion, the default test provides an excellent starting point for analysis, one that may later be supplemented with a detailed problem-specific analysis (see Dienes, 2008, 2011, this issue; Kruschke, 2010a, 2010b, 2011, this issue, for additional discussion of informed priors). In our concrete example, the resulting Bayes factor for t = 2.09 and a sample size of 20 observations is BF_A0 = 1.56. Accordingly, the data are 1.56 times more likely to have occurred under the alternative hypothesis than under the null hypothesis. This Bayes factor falls into the category "anecdotal." In other words, although the alternative hypothesis is slightly favored, the data do not provide sufficiently strong evidence to reject or accept either hypothesis.

Comparing p Values, Effect Sizes, and Bayes Factors

For our concrete example, the three measures of evidence are not in agreement. The p value was on the cusp between the categories "no evidence against H0" and "positive evidence against H0," the effect size indicates a large to very large effect, and the Bayes factor indicates that the data support the null hypothesis almost as much as they support the alternative hypothesis. If this example is not an isolated one, and the measures differ in many psychological applications, then it is important to understand the nature of those differences.

To address this question, we studied all of the empirical results evaluated by a t test in the 2007 volumes of PBR and JEP:LMC. This sample was composed of 855 t tests from 252 articles. These articles covered 2,394 journal pages and addressed many topics that are important in modern experimental psychology. Our sample suggests that, on average, an article published in PBR and JEP:LMC contains about 3.4 t tests, which amounts to one t test for every 2.8 pages. For simplicity, we did not include t tests that resulted from multiple comparisons in analysis of variance designs (for a Bayesian perspective on multiple comparisons, see Scott & Berger, 2006). Even though our t tests are sampled from the field of experimental and cognitive psychology, we expect our findings to generalize to many other subfields of psychology, as long as the studies in these subfields use the same level of statistical significance, approximately the same number of participants, and approximately the same number of trials per participant (Howard et al., 2000). In the next sections, we describe the empirical relation between the three measures of evidence, starting with the relation between effect sizes and p values.

Comparing effect sizes and p values

Fig. 1. The relationship between effect size and p values. Points denote comparisons (855 in total). Points denoted by circles indicate relative consistency between the effect size and p value, whereas those denoted by triangles indicate gross inconsistencies. The scale of the axes is based on the decision categories, as given in Table 1.

The relationship between the obtained p values and effect sizes is shown as a scatter plot in Figure 1. Each point corresponds to one of the 855 comparisons. Different panels are introduced to distinguish the different evidence categories, as given in Table 1. Figure 1 suggests that p values and effect sizes capture roughly the same information in the data. Large effect sizes tend to correspond to low p values, and small effect sizes tend to correspond to large p values. The two measures, however, are far from identical.
For instance, a p value of .01 can correspond to effect sizes ranging from about 0.2 to 1, and an effect size near 0.5 can correspond to p values ranging from about .001 to .05. The triangular points in the top-right panel of Figure 1 highlight gross inconsistencies. These eight studies have a large effect size, above 0.8, but their p values do not indicate evidence against the null hypothesis. A closer examination revealed that these studies had p values very close to .05 and small sample sizes.

Comparing effect sizes and Bayes factors

Fig. 2. The relationship between Bayes factor and effect size. Points denote comparisons (855 in total). The scale of the axes is based on the decision categories, as given in Table 1.

The relationship between the obtained Bayes factors and effect sizes is shown in Figure 2. Much as with the comparison of p values with effect sizes, it seems clear that the default Bayes factor and the effect size generally agree, though not exactly. No striking inconsistencies are apparent: No study with an effect size greater than 0.8 coincides with a Bayes factor below 1/3, nor does a study with a very low effect size, below 0.2, coincide with a Bayes factor above 3. The two measures, however, are not identical. They differ in their assessment of the strength of evidence. Effect sizes above 0.8 range all the way from anecdotal to decisive evidence in terms of the Bayes factor. Also note that small to medium effect sizes (i.e., those between 0.2 and 0.5) can correspond to Bayes factor evidence in favor of either the alternative or the null hypothesis. This last observation supports the premise that Bayes factors may quantify support for the null hypothesis. Figure 2 shows that about one third of all studies produced evidence in favor of the null hypothesis. In about half of these studies favoring the null, the evidence is substantial. Because of the file-drawer problem (i.e., only significant effects tend to get published), this is an underestimate of the true number of null findings and their Bayes factor support.

Comparing p values and Bayes factors

Fig. 3. The relationship between Bayes factor and p value. Points denote comparisons (855 in total). The scale of the axes is based on the decision categories, as given in Table 1.

The relationship between the obtained Bayes factors and p values is shown in Figure 3, again using interpretative panels. It is clear that default Bayes factors and p values largely covary with each other. Low Bayes factors correspond to high p values, and high Bayes factors correspond to low p values, a relationship that is much more exact than for our previous two comparisons. The main difference between default Bayes factors and p values is one of calibration: p values accord more evidence against the null than do Bayes factors. Consider the p values between .01 and .05, values that correspond to "positive evidence" and that usually pass the bar for publishing in academia. According to the default Bayes factor, 70% of these experimental effects convey evidence in favor of the alternative hypothesis that is only "anecdotal." This difference in the assessment of the strength of evidence is dramatic and consequential.
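The following sketch (ours, not from the article; it reuses the JZS Bayes factor function from the earlier sketch and assumes scipy) makes this calibration difference concrete by converting a two-sided p value at a given per-group sample size into the corresponding t value and default Bayes factor.

```python
# Illustrative sketch: default (JZS) Bayes factors implied by "just significant"
# two-sided p values at different per-group sample sizes (two-sample design).
import math
from scipy import stats, integrate

def jzs_bf_A0(t, n1, n2):
    """Default Bayes factor BF_A0 (same formulation as in the earlier sketch)."""
    nu = n1 + n2 - 2
    n_eff = n1 * n2 / (n1 + n2)
    null_density = (1 + t**2 / nu) ** (-(nu + 1) / 2)
    def integrand(g):
        prior = (2 * math.pi) ** -0.5 * g ** -1.5 * math.exp(-1 / (2 * g))
        like = (1 + n_eff * g) ** -0.5 * \
               (1 + t**2 / ((1 + n_eff * g) * nu)) ** (-(nu + 1) / 2)
        return like * prior
    alt_density, _ = integrate.quad(integrand, 0, math.inf)
    return alt_density / null_density

for n in (10, 25, 50, 100, 500):           # participants per group
    for p in (0.05, 0.01):                 # two-sided p value
        t = stats.t.ppf(1 - p / 2, 2 * n - 2)   # t value that yields exactly this p
        print(f"n = {n:4d}, p = {p:.2f}  ->  BF_A0 = {jzs_bf_A0(t, n, n):6.2f}")
```

The pattern this produces is the one discussed under Practical ramifications below: as samples grow, p values near the significance threshold tend to map onto Bayes factors in or near the anecdotal range, so a stricter significance level alone would not remove the problem.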
Conclusion

We compared p values, effect sizes, and default Bayes factors as measures of statistical evidence in empirical psychological research. Our comparison was based on a total of 855 different t statistics from all published articles in two major empirical journals in 2007. In virtually all studies, the three different measures of evidence are broadly consistent: Small p values correspond to large effect sizes and to large Bayes factors in favor of the alternative hypothesis. Although the measures of evidence reach the same conclusion about which hypothesis is best supported by the data, however, they differ with respect to the strength of that support. In particular, we noted that p values between .01 and .05 often correspond to what, in Bayesian terms, is only anecdotal evidence in favor of the alternative hypothesis. The practical ramifications of this are considerable.

Practical ramifications

Our results showed that when the p value falls in the interval from .01 to .05, there is a 70% chance that the default Bayes factor indicates the evidence for the alternative hypothesis to be only anecdotal or "worth no more than a bare mention"; this means that the data are no more than three times more likely under the alternative hypothesis than under the null hypothesis. Hence, for the studies under consideration here, it seems that a p-value criterion more conservative than .05 is appropriate. Alternatively, researchers could avoid computing a p value altogether and instead compute the Bayes factor. Both methods help prevent researchers from overestimating the strength of their findings and help keep the field from incorporating ambiguous findings as if they were real and reliable (Ioannidis, 2005).

As a practical illustration, consider a series of recent experiments on precognition (Bem, 2011). In nine experiments with over 1,000 participants, Bem intended to show that precognition exists, that is, that people can foresee the future. And indeed, eight out of nine experiments yielded a significant result. However, most p values fell in the ambiguous range of .01 to .05, and across all nine experiments, a Bayes factor analysis indicates about as much evidence for the alternative hypothesis as against it (Kruschke, 2011; Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011). We believe that this situation typifies part of what could be improved in psychological research today. It is simply too easy to obtain a p value below .05 and to subsequently publish the result. When researchers publish ambiguous results as if they were real and reliable, this damages the field as a whole: Time, effort, and money will be invested to replicate the phenomenon, and when replication fails, the burden of proof almost always falls on the researcher who, after all, failed to replicate a phenomenon that had been demonstrated to be present (with a p value between .01 and .05).

Thus, our empirical comparison shows that the academic criterion of .05 is too liberal. Note that this problem would not be solved by opting for a stricter significance level, such as .01. It is well known that the p value decreases as the sample size, n, increases. Hence, if psychologists switch to a significance level of .01 but inevitably increase their sample sizes to compensate for the stricter statistical threshold, then the phenomenon of anecdotal evidence will start to plague p values even when those p values are lower than .01. Therefore, we make a case for Bayesian statistics in the next section.
A case for Bayesian statistics

We have compared the conclusions from the different measures of evidence. It is easy to make a case for Bayesian statistical inference in general, based on arguments already well documented in statistics and psychology (e.g., Dienes, 2008; Jaynes, 2003; Kruschke, 2010a, 2010c; Lee & Wagenmakers, 2005; Lindley, 1972; Wagenmakers, 2007). We briefly mention three arguments here.

First, unlike null hypothesis testing, Bayesian inference does not violate basic principles of rational statistical decision making, such as the stopping rule principle or the likelihood principle (Berger & Delampady, 1987; Berger & Wolpert, 1988; Dienes, 2011). This means that the results of Bayesian inference do not depend on the intention with which the data were collected. As stated by Edwards et al. (1963, p. 193), "the rules governing when data collection stops are irrelevant to data interpretation. It is entirely appropriate to collect data until a point has been proven or disproven, or until the data collector runs out of time, money, or patience."

Second, Bayesian inference takes model complexity into account in a rational way. Specifically, the Bayes factor has the attraction of not assigning a special status to the null hypothesis and so makes it theoretically possible to measure evidence in favor of the null (e.g., Dennis et al., 2008; Gallistel, 2009; Kass & Raftery, 1995; Rouder et al., 2009).

Third, we believe that Bayesian inference provides the kind of answers that researchers care about. In our experience, researchers are usually not that interested in the probability of encountering data at least as extreme as those that were observed, given that the null hypothesis is true and the sample was generated according to a specific intended procedure. Instead, most researchers want to know what they have learned from the data about the relative plausibility of the hypotheses under consideration. This is exactly what is quantified by the Bayes factor.

These advantages notwithstanding, the Bayes factor is not a measure of the mere size of an effect. Hence, the measure of effect size confers additional information, particularly when small numbers of participants or trials are involved. So, especially for these sorts of studies, there is an argument for reporting both a Bayes factor and an effect size. We note that, from a Bayesian perspective, the effect size can naturally be conceived of as (a summary statistic of) the posterior distribution of a parameter representing the effect, under an uninformative prior distribution. In this sense, a standard Bayesian combination of parameter estimation and model selection could encompass all of the useful measures of evidence we observed (for an example of how Bayes factor estimation can be incorporated in a Bayesian estimation framework, see, for instance, Kruschke, 2011).

Our final thought is that the reasons for adopting a Bayesian approach now are amplified by the promise of using an extended Bayesian approach in the future. In particular, we think the hierarchical Bayesian approach, which is standard in statistics (e.g., Gelman & Hill, 2007) and is becoming more common in psychology (e.g., Kruschke, 2010b, 2010c; Lee, in press; Rouder & Lu, 2005), could fundamentally change how psychologists identify effects.
Hierarchical Bayesian analysis can be a valuable tool both for meta-analyses and for the analysis of a single study. In the meta-analytical context, multiple studies can be integrated, so that what is inferred about the existence of effects and their magnitude is informed, in a coherent and quantitative way, by a domain of experiments. In the context of a single experiment, a hierarchical analysis can be used to take variability across participants or items into account.

In sum, our empirical comparison of 855 t tests shows that three often-used measures of evidence (p values, effect sizes, and Bayes factors) almost always agree about what hypothesis is better supported by the data. However, the measures often disagree about the strength of this support: For data sets with p values between .01 and .05, about 70% are associated with a Bayes factor that indicates the evidence to be only anecdotal or "worth no more than a bare mention" (Jeffreys, 1961). This analysis suggests that many results published in the literature are not established as strongly as one would like.

Corresponding Author
Ruud Wetzels, Department of Psychology, University of Amsterdam, Roetersstraat 15, 1018 WB Amsterdam, The Netherlands. E-mail: wetzels.ruud@gmail.com

Declaration of Conflicting Interests
The authors declared that they had no conflicts of interest with respect to their authorship or the publication of this article.

Funding
This research was supported by a Vidi grant from the Netherlands Organization for Scientific Research.

Notes
1. The findings suggest that Mussweiler (2006) conducted a one-sided t test. In the remainder of this article, we conduct two-sided t tests.
2. A Web page for computing a Bayes factor online is http://pcl.missouri.edu/bayesfactor, and a Web page to download a tutorial and a flexible R/WinBUGS function to calculate the Bayes factor can be found at http://www.ruudwetzels.com.

References
American Psychological Association. (2010). Publication manual of the American Psychological Association (6th ed.). Washington, DC: Author.
Bem, D.J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425.
Berger, J.O., & Delampady, M. (1987). Testing precise hypotheses. Statistical Science, 2, 317–352.
Berger, J.O., & Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of p values and evidence. Journal of the American Statistical Association, 82, 112–139.
Berger, J.O., & Wolpert, R.L. (1988). The likelihood principle (2nd ed.). Hayward, CA: Institute of Mathematical Statistics.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Cortina, J.M., & Dunlap, W.P. (1997). On the logic and purpose of significance testing. Psychological Methods, 2, 161–172.
Cumming, G. (2008). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3, 286–300.
Dennis, S., Lee, M., & Kinnell, A. (2008). Bayesian analysis of recognition memory: The case of the list-length effect. Journal of Memory and Language, 59, 361–376.
Dienes, Z. (2008). Understanding psychology as a science: An introduction to scientific and statistical inference. New York: Palgrave Macmillan.
Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on? Perspectives on Psychological Science, 6, 274–290.
Dixon, P. (2003). The p-value fallacy and how to avoid it. Canadian Journal of Experimental Psychology, 57, 189–202.
Edwards, W., Lindman, H., & Savage, L.J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70, 193–242.
Erdfelder, E. (2010). A note on statistical analysis. Experimental Psychology, 57, 1–4.
Fisher, R.A. (1935). The design of experiments. Edinburgh: Oliver and Boyd.
Frick, R.W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1, 379–390.
Gallistel, C. (2009). The importance of proving the null. Psychological Review, 116, 439–453.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge, England: Cambridge University Press.
Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311–339). Hillsdale, NJ: Erlbaum.
Gigerenzer, G. (1998). We need statistical thinking, not statistical rituals. Behavioral and Brain Sciences, 21, 199–200.
Gönen, M., Johnson, W.O., Lu, Y., & Westfall, P.H. (2005). The Bayesian two-sample t test. American Statistician, 59, 252–257.
Good, I.J. (1983). Good thinking: The foundations of probability and its applications. Minneapolis: University of Minnesota Press.
Good, I.J. (1985). Weight of evidence: A brief survey. In J.M. Bernardo, M.H. DeGroot, D.V. Lindley, & A.F.M. Smith (Eds.), Bayesian statistics 2 (pp. 249–269). New York: Elsevier.
Hagen, R.L. (1997). In praise of the null hypothesis statistical test. American Psychologist, 52, 15–24.
Howard, G., Maxwell, S., & Fleming, K. (2000). The proof of the pudding: An illustration of the relative strengths of null hypothesis, meta-analysis, and Bayesian analysis. Psychological Methods, 5, 315–332.
Ioannidis, J.P.A. (2005). Why most published research findings are false. PLoS Medicine, 2, 696–701.
Jaynes, E.T. (2003). Probability theory: The logic of science. Cambridge, UK: Cambridge University Press.
Jeffreys, H. (1961). Theory of probability. Oxford, UK: Oxford University Press.
Kass, R.E., & Raftery, A.E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.
Killeen, P.R. (2005). An alternative to null-hypothesis significance tests. Psychological Science, 16, 345–353.
Killeen, P.R. (2006). Beyond statistical inference: A decision theory for science. Psychonomic Bulletin & Review, 13, 549–562.
Kruschke, J.K. (2010a). Bayesian data analysis. Wiley Interdisciplinary Reviews: Cognitive Science, 1, 658–676.
Kruschke, J.K. (2010b). Doing Bayesian data analysis: A tutorial introduction with R and BUGS. Burlington, MA: Academic Press.
Kruschke, J.K. (2010c). What to believe: Bayesian methods for data analysis. Trends in Cognitive Sciences, 14, 293–300.
Kruschke, J.K. (2011). Bayesian assessment of null values via parameter estimation and model comparison. Perspectives on Psychological Science, 6, 299–312.
Lee, M.D. (2008). Three case studies in the Bayesian analysis of cognitive models. Psychonomic Bulletin & Review, 15, 1–15.
Lee, M.D. (in press). How cognitive modeling can benefit from hierarchical Bayesian models. Journal of Mathematical Psychology.
Lee, M.D., & Wagenmakers, E.-J. (2005). Bayesian statistical inference in psychology: Comment on Trafimow (2003). Psychological Review, 112, 662–668.
Liang, F., Paulo, R., Molina, G., Clyde, M., & Berger, J. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103, 410–423.
Lindley, D.V. (1972). Bayesian statistics: A review. Philadelphia: Society for Industrial and Applied Mathematics.
Loftus, G.R. (1996). Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychological Science, 5, 161–171.
Lunn, D.J., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS, a Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10, 325–337.
Mussweiler, T. (2006). Doing is for thinking! Psychological Science, 17, 17–21.
Myung, I.J., Forster, M.R., & Browne, M.W. (2000). A special issue on model selection. Journal of Mathematical Psychology, 44.
Nickerson, R.S. (2000). Null hypothesis statistical testing: A review of an old and continuing controversy. Psychological Methods, 5, 241–301.
Pitt, M.A., Myung, I.J., & Zhang, S. (2002). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472–491.
Press, S., Chib, S., Clyde, M., Woodworth, G., & Zaslavsky, A. (2003). Subjective and objective Bayesian statistics: Principles, models, and applications. Hoboken, NJ: Wiley-Interscience.
Richard, F.D., Bond, C.F.J., & Stokes-Zoota, J.J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7, 331–363.
Rosenthal, R. (1990). How are we doing in soft psychology? American Psychologist, 45, 775–777.
Rosenthal, R., & Rubin, D. (1982). A simple, general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74, 166–169.
Rouder, J.N., & Lu, J. (2005). An introduction to Bayesian hierarchical models with an application in the theory of signal detection. Psychonomic Bulletin & Review, 12, 573–604.
Rouder, J.N., Speckman, P.L., Sun, D., Morey, R.D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16, 225–237.
Schmidt, F.L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115–129.
Scott, J., & Berger, J. (2006). An exploration of aspects of Bayesian multiple testing. Journal of Statistical Planning and Inference, 136, 2144–2162.
Thompson, B. (2002). What future quantitative social science research could look like: Confidence intervals for effect sizes. Educational Researcher, 31, 25–32.
Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779–804.
Wagenmakers, E.-J., & Grünwald, P. (2006). A Bayesian perspective on hypothesis testing. Psychological Science, 17, 641–642.
Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., & Grasman, R. (2010). Bayesian hypothesis testing for psychologists: A tutorial on the Savage-Dickey method. Cognitive Psychology, 60, 158–189.
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H.L.J. (2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100, 426–432.
Wainer, H. (1999). One cheer for null hypothesis significance testing. Psychological Methods, 4, 212–213.
Wasserman, L. (2004). All of statistics: A concise course in statistical inference. New York: Springer.
Wetzels, R., Lee, M., & Wagenmakers, E.-J. (2010). Bayesian inference using WBDev: A tutorial for social scientists. Behavior Research Methods, 42, 884–897.
Wetzels, R., Raaijmakers, J., Jakab, E., & Wagenmakers, E.-J. (2009). How to quantify support for and against the null hypothesis: A flexible WinBUGS implementation of a default Bayesian t test. Psychonomic Bulletin & Review, 16, 752–760.
Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594–604.