PERSPECTIVES ON PSYCHOLOGICAL SCIENCE

Correlations in Social Neuroscience Aren't Voodoo: Commentary on Vul et al. (2009)

Matthew D. Lieberman,1 Elliot T. Berkman,1 and Tor D. Wager2
1University of California, Los Angeles; 2Columbia University

ABSTRACT—Vul, Harris, Winkielman, and Pashler (2009, this issue) claim that many brain–personality correlations in fMRI studies are "likely . . . spurious" (p. xx) and "should not be believed" (p. xx). Several of their conclusions are incorrect. First, they incorrectly claim that whole-brain regressions use an invalid and "nonindependent" two-step inferential procedure, a determination based on a survey sent to researchers that included only nondiagnostic questions about the descriptive process of plotting one's data. We explain how whole-brain regressions are a valid single-step method of identifying brain regions that have reliable correlations with individual difference measures. Second, they claim that large correlations from whole-brain regression analyses may be the result of noise alone. We provide a simulation to demonstrate that typical fMRI sample sizes will only rarely produce large correlations in the absence of any true effect. Third, they claim that the reported correlations are inflated to the point of being "implausibly high." Though biased post hoc correlation estimates are a well-known consequence of conducting multiple tests, Vul et al. make inaccurate assumptions when estimating the theoretical ceiling of such correlations. Moreover, their own "meta-analysis" suggests that the magnitude of the bias is approximately .12—a rather modest bias.

In an article in this issue, Vul, Harris, Winkielman, and Pashler (2009) claim that brain–personality correlations in many social neuroscience studies and those in related fields are "implausibly high" (p. xx), "likely . . . spurious" (p. xx), and "should not be believed" (p. xx). The article was originally titled "Voodoo Correlations in Social Neuroscience" and was circulated widely in the scientific community, on the Internet, and in the popular press prior to publication. The word voodoo, as applied to science, carries a strong and specific connotation of fraudulence, as popularized by Robert Park's (2000) book, Voodoo Science: The Road From Foolishness to Fraud. Though the title was subsequently changed to remove the word voodoo, the substance of the article and its connotations are unchanged: It is a pointed attack on social neuroscience. Much of the article's prepublication impact was due to its aggressive tone, which is nearly unprecedented in the scientific literature and made it easy for the article to spread virally in the news. Thus, we felt it important to respond both to the tone and to the substantive arguments. The trouble with the Vul et al. article is that it rests on a fundamental misconception about how statistical procedures are used in neuroimaging studies.

Address correspondence to Matthew Lieberman, Department of Psychology, University of California, Los Angeles, Los Angeles, CA 90095-1563; e-mail: lieber@ucla.edu.
They point out that post hoc correlation estimates from whole-brain hypothesis-testing procedures will tend to be greater than the true correlation value (this has been widely known but also widely underappreciated). However, they imply that post hoc reporting of correlations constitutes an invalid inferential procedure, when in fact it is a descriptive procedure that is entirely valid. In addition, the quantitative claims that give their arguments the appearance of statistical rigor are based on problematic assumptions. Thus, it is ironic that Vul et al.'s article—which critiques social neuroscience as having achieved popularity in prominent journals and the press due to shaky statistical reasoning—itself achieved popularity based on problematic claims about the process of statistical inference. Our goal in this reply is to clarify the inferential procedures in question to set the record straight and to take a closer look at how conducting whole-brain correlation analyses might quantitatively impact correlation estimates.

DO WHOLE-BRAIN CORRELATIONS USE A "NONINDEPENDENT" TWO-STEP INFERENCE PROCEDURE?

Vul et al. (p. xx) contend that correlations resulting from a search across multiple brain regions (or brain voxels), the dominant method in neuroimaging research, are based on a two-step procedure in which the method used to select voxels to test (correlation) and the test performed on the resulting regions (correlation) are not independent. The clearest account of this comes from another paper by Vul and Kanwisher (in press), in which they describe the analogous situation in whole-brain contrast analyses and suggest that, "If one selects only voxels in which condition A produces a greater signal change than condition B, and then evaluates whether the signal change for conditions A and B differ in those voxels using the same data, the second analysis is not independent of the selection criteria" (p. 2). This statement is clearly pointing to the existence of two steps, each involving an inferential procedure, with the second inference guaranteed to produce significant results because of its nonindependence from the first inference. The problem is that we know of no researchers who conduct their analyses this way. We were able to contact authors from 23 of the 28 "nonindependent" articles reviewed by Vul et al. Each of the contacted authors reported that they used a single-step inferential procedure rather than the two-step procedure described by Vul et al. Several authors expressed frustration that the multiple-choice questions asked by Vul et al. did not allow the authors to indicate whether they used one or two inferential steps, contributing to Vul et al.'s misrepresentation of how these studies were conducted.

So what do these researchers actually do? When a whole-brain regression analysis is conducted, the goal is typically to identify regions of the brain whose activity shows a reliable nonzero correlation with an individual difference measure.
A likelihood estimate that this correlation was produced in the absence of any true effect (e.g., a p value) is computed for every voxel in the brain, without any selection of voxels to test. This is the only inferential step in the procedure, and standard corrections for multiple tests are implemented to avoid false positive results. Subsequently, descriptive statistics (e.g., effect sizes) are reported for a subset of voxels or clusters. Reporting these descriptive statistics does not constitute an additional inferential step, so there is no second inferential step. For any particular sample size, the t and r values are merely redescriptions of the p values obtained in the one inferential step and provide no additional inferential information of their own.

The fact that Vul et al.'s questionnaire (see their Appendix A) asks only about the plotting of correlations to determine whether a second inferential step has occurred is one of the primary sources of the misunderstanding that has emerged from their article. Vul et al. interpret plotting of data as a second inferential step, but this is incorrect: Plotting the correlation is a purely descriptive process, not an inferential one. Nevertheless, Vul and Kanwisher clearly characterize plotting as an example of the nonindependence error: "The most common, most simple, and most innocuous instance of nonindependence occurs when researchers simply plot (rather than test) the signal change in a set of voxels that were selected based on that same signal change" (Vul & Kanwisher, in press, p. 5). This statement implies that if a behavioral researcher correlated an outcome measure with extraversion, neuroticism, and psychopathy and found a significant relationship only with extraversion, then it would be an error to plot just the extraversion correlation. Although Vul et al. constructed the survey sent to authors with the intention of assessing which analyses used a second, nonindependent inferential step, the questionnaire did not ask a single question about a second inferential step; it asked only about data plotting, which is nondiagnostic with respect to inferential methods.

If the reporting of correlation values and scatterplots is merely descriptive, then why do it? Vul et al. imply that its purpose is to "sell" correlations that appear to be very strong. In fact, scatterplots provide an implicit check on underlying assumptions that must be met if any standard inferential procedure is used. A correlation of r = .7 in a sample of 30 participants could, for example, be driven entirely by one or two outliers (constituting a violation of the normality assumption), and readers viewing the scatterplot would quickly see this and question the result. Thus, although correlation scatterplots often look very compelling when r values are high and should not be taken as unbiased estimates of the population correlation coefficient, they should be reported nonetheless.

In sum, despite Vul et al.'s characterization of whole-brain regressions as "seriously defective" (p. xx), they provide a valid test, in a single inferential step, of which regions show a reliable linear relation with an individual difference measure. What reported correlations from whole-brain regressions really show is evidence for a nonzero effect, which is what they were designed to test. It is also true that the reported effect sizes (r, t, Z) from whole-brain analyses will be inflated (i.e., overestimated relative to the population effect size) on average. However, as we detail below, the magnitude of the inflation may be far less than Vul et al. would have readers believe.
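To make the redescription point concrete: for a fixed sample size, r, t, and p are deterministic functions of one another, so reporting r conveys no inferential information beyond the p value from the single inferential step. The following sketch is ours, not anything from the original article, and assumes Python with NumPy and SciPy.

```python
# Sketch: for fixed N, a correlation r and its p value are interconvertible.
import numpy as np
from scipy import stats

def r_to_p(r, n):
    """Two-tailed p value for a Pearson correlation r with n subjects."""
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
    return 2 * stats.t.sf(abs(t), df=n - 2)

def p_to_r(p, n):
    """The |r| corresponding to a given two-tailed p value with n subjects."""
    t = stats.t.isf(p / 2, df=n - 2)
    return t / np.sqrt(t ** 2 + n - 2)

p = r_to_p(0.7, 30)   # a very small p, on the order of 1e-05
print(p)
print(p_to_r(p, 30))  # recovers 0.7 exactly
```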
HOW OFTEN DO LARGE CORRELATIONS OCCUR WITHOUT ANY TRUE EFFECT?

Vul et al. imply that the correlations in at least a sizeable subset of social neuroscience studies are not based on any true underlying relationship between psychological and neural variables (hence the terms voodoo and spurious). For all statistical tests, there is some likelihood that the observed result is spurious and the true population effect size is zero. This likelihood is what p values estimate. A p value of .05 in any research domain suggests that the observed effect would have occurred by chance in 5% of experimental samples. Because a typical whole-brain analysis involves thousands of tests, the likelihood of false positives is much greater, and thus correction for multiple comparisons is essential.

Although spurious correlations will occur (see Figure 4 of Vul et al. for a simulation assuming N = 10), one critical question in the context of correlational analyses in fMRI is how often large correlations such as those targeted by Vul et al. will occur in the absence of any true effect—and, when prior anatomical hypotheses are available, how often they will occur in the expected anatomical locations. To assess how frequently spurious correlations might occur in a typical whole-brain regression analysis, we conducted a simulation (see Fig. 1; a code sketch of the procedure follows the figure legend). We examined how often correlations ≥ .80 are expected to be observed anywhere in the brain in the absence of any true signal (this depends on the sample size and the number of effective independent comparisons; see the Figure 1 legend for details). With 18 subjects (the average N was 18.25 in the studies reviewed by Vul et al.), 76% of the simulated studies reported no correlation of r ≥ .80 by chance anywhere in the (simulated) brain. Only 2% reported two or more false positive correlations. This suggests that in actual studies with similar properties and multiple comparison procedures, the great majority of reported effects of this magnitude reflect a true underlying relationship.

Fig. 1. A simulation of the number of high false positive correlations (correlations above .80) that might reasonably occur in a typical whole-brain regression analysis. We conducted 1,000 simulated whole-brain regression analyses in which brain and covariate values were independent Gaussian random variables. The left panel shows a histogram of the number of simulated studies (y axis) that yielded a given number of tests with r > .80 anywhere in the brain map (x axis), for N = 10, 15, 18, and 20. Studies with 10 subjects, as in Vul et al.'s simulation, yielded high numbers of false positive tests (typically 15 to 25). Studies with 18 subjects (the mean of the criticized studies) yielded very few false positive results. The right panel shows details of the histogram between 0 and 10 false positive results. With 18 participants, 76% of studies showed no false positive results at r > .80, 21% showed a single false positive test, and 2% showed exactly two false positive tests. The likelihood that particular numbers of false positive tests will occur (at a threshold of r > .8), by sample size:

Number of tests with r > .8    N = 15    N = 18    N = 20
0                               26.3%     76.2%     90.5%
1                               39.1%     21.4%      9.1%
2                               21.2%      2.3%      0.4%
3                                9.7%      0.1%      0.0%
4                                2.9%      0.0%      0.0%
5                                0.4%      0.0%      0.0%
6 or more                        0.4%      0.0%      0.0%

These results are illustrative rather than exact; the actual false positive rate depends on details of the noise structure in the data and can be estimated using nonparametric methods on the full data set. The results presented here depend principally on the sample size (N), the number of effective independent tests (NEIT) performed in the whole-brain analysis, and standard assumptions of independence and normally distributed data.
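For readers who want to reproduce the gist of Figure 1, here is a minimal sketch of such a simulation. It is our illustration, not the authors' actual code: the NEIT of 7,768 and the r > .80 threshold come from the figure legend, but drawing each null correlation directly from its sampling distribution (via the t distribution with N − 2 degrees of freedom) is our shortcut, equivalent in distribution to generating voxelwise Gaussian data under the stated assumptions.

```python
# Sketch: how often does at least one null test exceed r = .80 in a
# whole-brain map with ~7,768 effective independent tests?
import numpy as np

rng = np.random.default_rng(0)

def simulate_null_studies(n_subjects, n_studies=1000, n_eff_tests=7768,
                          r_thresh=0.8):
    """Per simulated study, count null tests whose r exceeds r_thresh."""
    df = n_subjects - 2
    counts = np.empty(n_studies, dtype=int)
    for i in range(n_studies):
        # Under the null, r = t / sqrt(t^2 + df), where t ~ t(df).
        t_vals = rng.standard_t(df, size=n_eff_tests)
        r_vals = t_vals / np.sqrt(t_vals ** 2 + df)
        counts[i] = np.count_nonzero(r_vals > r_thresh)
    return counts

for n in (10, 15, 18, 20):
    counts = simulate_null_studies(n)
    print(f"N={n:2d}: {np.mean(counts == 0):5.1%} of studies had no r > .8 "
          f"(median false positives = {int(np.median(counts))})")
# With N = 18, roughly 76% of simulated studies show no null test with
# r > .80, matching the table in the Figure 1 legend.
```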
To estimate the NEIT, we used the p value thresholds for 11 independent whole-brain analyses reported in Nichols and Hayasaka (2003) that yield p < .05 with family-wise error-rate correction for multiple comparisons as assessed by Statistical Nonparametric Mapping software. We then equated this p value threshold to a Bonferroni correction based on an unknown number of independent comparisons and solved for the unknown NEIT for each study. Averaging over the 11 contrast maps yielded an average of 7,768 independent comparisons. Individual studies may vary substantially from this average. Dividing the number of voxels in each map by the NEIT for each study and averaging yielded a mean of 25.3 voxels per test; thus, each false positive result can be thought of as a significant region encompassing 25 voxels.

In addition, false positive activations are likely to be randomly and uniformly distributed throughout the brain. If each of the social neuroscience studies in question had reported no more than one or two significant correlations in regions uniformly distributed over the brain across studies, there would be reason to question whether they were meaningful as a set. However, many studies report multiple correlated regions in the same approximate brain areas, which is consistent with the notion of distributed networks underlying social and affective phenomena. For example, among the articles critiqued by Vul et al. are studies examining fear of pain (Ochsner et al., 2006), empathy for pain (Singer et al., 2004, 2006), and social pain (Eisenberger, Lieberman, & Williams, 2003). In each of these pain-related studies, significant correlations were reported between individual difference measures and activity in the dorsal anterior cingulate cortex, a region central to the experience of pain (Price, 2000). The results of these studies are clearly not distributed uniformly over the brain, as would be expected if these correlations were spurious.
The same point is made by meta-analyses of the neuroimaging literature on emotion, which clearly show "hot spots" of consistently replicated activity across laboratories and task variants (Kober et al., 2008; Wager et al., 2008). It is important to note that our meta-analyses suggest that, to a first order of approximation, results from studies of social and emotional processes are no more randomly distributed across the brain than are results from studies in other areas of cognitive neuroscience, such as working memory (Wager & Smith, 2003), controlled response selection (Nee, Wager, & Jonides, 2007), and long-term memory (van Snellenberg & Wager, in press).

In sum, even without considering any prior anatomical hypotheses, most, but not all, of the large correlations that Vul et al. target are likely to represent real relationships between brain activity and psychological variables. Furthermore, the use of prior anatomical hypotheses that limit false positive findings is the rule, rather than the exception. It is difficult to reasonably claim that the correlations, as a set, are "voodoo."

HOW INFLATED ARE NONINDEPENDENT CORRELATIONS?

It is a statistical property of any analysis in which multiple tests are conducted that observed effect sizes in significant tests will be inflated (i.e., larger than would be expected in a repeated sample; Tukey, 1977). Vul et al. suggest that so-called nonindependent correlations (descriptive correlation results from significant regions in voxel-wise searches) resulting from whole-brain analyses are "inflated to the point of being completely untrustworthy" (p. xx) and "should not be believed" (p. xx). Given that there is real inflation in such correlations (though not because of any invalid inferential procedure), it would be useful to know just how inflated these correlations are in the social neuroscience findings they criticize. Although it is impossible to know for sure, the "meta-analysis"[1] by Vul et al. provides some measure of this inflation within the social neuroscience literature.

[1] Although Vul et al. characterize their review as a meta-analysis, their selection of studies for inclusion appears biased and nonreproducible. The selection of studies includes articles with large correlations that Vul et al. were likely aware of prior to sampling the literature (i.e., those papers that brought the issue of large correlations to their attention). If Vul et al. knew the magnitude of the correlations in these articles and then chose search terms guaranteed to include these in the meta-analysis, this would seem to be the kind of sampling bias that Vul et al. accuse others of. In addition, the selection of studies in their review is not reproducible. Vul et al. indicate that they searched for "social terms (e.g., jealousy, altruism, personality, grief)" (p. x), which is obviously an incomplete description. However, just to take one example, we searched for altruism and found several other fMRI papers on empathy from the time period covered by the Vul et al. review that were omitted from the meta-analysis for no discernible reason. Given that a number of these studies replicate the Singer et al. (2004) findings, it again raises questions about the selective inclusion of studies in their review.

In their Figure 5, Vul et al. plot the strength of correlations using what they deem to be acceptable independent procedures in green and so-called nonindependent (biased) correlations in red. The obvious conclusion to draw is that the nonindependent correlations have higher values than the gold-standard independent correlations, and thus they are systematically inflated. To assess the average magnitude of the independent and nonindependent correlations, we collected all the articles cited in Vul et al.'s meta-analysis and extracted all of the correlations that met the inclusion criteria they describe. In doing so, we were surprised to find several anomalies between the set of correlations included in the Vul et al. meta-analysis and the set of correlations actually in the articles. We identified 54 correlations in the articles used in their meta-analysis that met their inclusion criteria but were omitted from the meta-analysis without explanation. We also found three "correlations" in the meta-analysis that were really effect sizes associated with main effects rather than correlations (see the Appendix for a breakdown). Among the nonindependent correlations, almost 25% of the correlations reported in the original articles were not included in Vul et al.'s meta-analysis.
The vast majority of the omitted correlations (50 of 54) and mistakenly included effects (3 of 3), if properly included or excluded, would work against Vul et al.'s hypothesis of inflated correlations due to nonindependent correlation reporting (see Figure 2). In other words, the omitted correlations were not randomly distributed with respect to the group means, as would be expected from clerical errors. Of the 41 omitted nonindependent correlations, 38 had values lower than the mean of the included nonindependent correlations. The mean of the omitted nonindependent correlations (.61) was significantly lower than the mean of the included nonindependent correlations (.69), t(173) = 4.06, p < .001. Of the 13 omitted independent correlations, 12 had values higher than the mean of the included independent correlations.

Fig. 2. Distribution of correlations in papers surveyed by Vul et al. but omitted from their meta-analysis. A: Independent correlations that were omitted from the Vul et al. meta-analysis. The dotted line indicates the mean of independent correlations (.57) that were included in their meta-analysis. Twelve of the 13 omitted independent correlations were higher than this mean. B: Nonindependent correlations that were omitted from the Vul et al. meta-analysis. The dotted line indicates the mean of nonindependent correlations (.69) that were included in their meta-analysis. Thirty-eight of the 41 omitted nonindependent correlations were lower than this mean.
The mean of the omitted independent correlations (.63) was significantly higher than the mean of the included independent correlations (.57), t(129) = 2.74, p < .01. All three of the included nonindependent correlations that should have been omitted had values higher than the mean of the included nonindependent correlations.

Based solely on the correlations that Vul et al. included in their meta-analysis, the mean of the nonindependent correlations (average r = .69) is higher than the mean of the independent correlations (average r = .57), t(254) = 5.31, p < .001 (see Figure 3a). This would suggest an average inflation of .12, which is not insignificant, but hardly worthy of the attacks made by Vul et al. However, there are reasons to believe that the estimate of the inflation within this sample of correlations may itself be inflated.

Fig. 3. Distribution of independent and nonindependent correlations, uncorrected and corrected for restriction of range, based on papers included in the meta-analysis by Vul et al. A: A reconstruction of the correlations plotted in Figure 5 of Vul et al. Correlations are plotted as a percentage of total correlations of each type. In this display, nonindependent correlations (average r = .69) are inflated relative to the independent correlations (average r = .57) by an average of .12. B: A reanalysis of the data from the studies included in the meta-analysis by Vul et al. Independent correlations using a procedure likely to result in restricted range issues were corrected; 52 correlations in the relevant papers that were omitted by Vul et al. were included, and 3 "correlations" that were not actually correlations were removed. In the reanalysis, the nonindependent correlations (average r = .69) are no longer observed to be inflated relative to independent correlations (average r = .70).

One reason why independent correlations from region-of-interest (ROI) analyses will tend to be smaller on average than nonindependent correlations from whole-brain analyses has nothing to do with the validity of either method. The minimum reportable r value in a study depends on the p value threshold, which will typically differ between the ROI analyses (used to generate the independent correlations) and whole-brain analyses (used to generate the nonindependent correlations). If an ROI analysis examines effects in two regions in a sample of 18 subjects, then the p value threshold is .025 for a corrected p value of .05, and thus the minimum reportable correlation would be an r of .51. In a whole-brain analysis of 18 subjects using a p value threshold of .005, the minimum reportable correlation is an r of .62, and at a p value threshold of .001, the minimum reportable correlation is an r of .69. Thus, a portion of the difference observed in their meta-analysis is due to these reporting constraints rather than the analytic method per se.
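These minimum reportable r values follow from inverting the t test for a correlation. A small sketch of that inversion (ours; it assumes two-tailed thresholds, which is one plausible reading) reproduces values close to those in the text:

```python
# Sketch: the smallest r that can pass a given p threshold with n subjects.
import numpy as np
from scipy import stats

def min_reportable_r(p_thresh, n):
    """Smallest |r| reaching a two-tailed p threshold with n subjects."""
    df = n - 2
    t_crit = stats.t.isf(p_thresh / 2, df)
    return t_crit / np.sqrt(t_crit ** 2 + df)

for p in (0.025, 0.005, 0.001):
    print(f"p < {p}: minimum reportable r = {min_reportable_r(p, 18):.2f}")
# Yields roughly .53, .63, and .71 -- close to the .51, .62, and .69 in the
# text; exact values depend on tail conventions and rounding.
```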
ARE INDEPENDENT CORRELATIONS UNBIASED ESTIMATES?

Although Vul et al. focus on potential bias in nonindependent correlations, another reason for mean differences between nonindependent and independent correlations is bias in the independent correlations themselves. The accuracy of correlation estimates relative to population values depends on the details of the study procedures in complex ways, and there are several potential sources of bias in the independent correlations that Vul et al. consider to be the gold standard. To illustrate this complexity, we know of at least one statistical effect that causes many of the correlations in the independent analyses to be systematically underestimated. Why would this be the case? Half of the independent correlations were computed on voxels or clusters selected from analyses of group-average contrast effects (e.g., voxels that were more active in Task A than in Task B, without regard for the individual difference variable). Because low variability is one of two factors that increase t values, selecting voxels with high t values for subsequent correlation analyses will tend to select voxels with low variability across subjects. This selection procedure restricts the range of the brain data and works against finding correlations with other variables.[2]

[2] When a subsample has systematically lower variance than the full sample (i.e., restriction of range), correlations between the subsample and individual difference measures will produce correlation values that are smaller than the true correlation in the population (Thorndike, 1949). To give a simple analogy, imagine a correlation of .65 exists between age and spelling ability in 5- to 18-year-olds. If we sample only 9- and 9.5-year-olds, the observed correlation between age and spelling will be lower because we will have sampled from a restricted range of the age variable. Fortunately, the restriction of range effect can be corrected using the following formula from Cohen, Cohen, West, and Aiken (2003, p. 58), if the variances of the restricted sample and full sample are known:

$$\hat{r}_{YX} = \frac{r_{YX_c}\,(sd_X / sd_{X_c})}{\sqrt{1 + r_{YX_c}^2\left(sd_X^2 / sd_{X_c}^2 - 1\right)}},$$

where the subscript c denotes the restricted sample and sd_X is the full-sample standard deviation.

We reanalyzed the correlations in Vul et al.'s meta-analysis by (a) applying a correction for restricted range to the 58 independent correlations obtained using the procedure likely to result in restricted range, (b) including the previously omitted correlations, and (c) removing the three noncorrelations that were mistakenly included in the original meta-analysis. Independent correlations based on anatomically defined regions of interest do not have restricted range and thus were not corrected.
Because we do not have access to the raw fMRI data from each of the surveyed studies, we estimated the full and restricted sample variances needed for the correction formula from one of our data sets and applied these variances to all of the independent correlations in the meta-analysis.[3] In our reanalysis, there was no longer any difference between the independent (average r = .70) and the nonindependent (average r = .69) correlation distributions, t(304) = −0.57, p > .10 (see Figure 3b).[4] Thus, when adjusted for restriction of range, the independent and nonindependent samples of correlations do not support Vul et al.'s assertion of massive inflation.

[3] For the full sample variance, we extracted data from a set of voxels distributed throughout the brain selected without consideration of t-test values. For the restricted sample variance, we extracted data from voxels with a significant group effect, as was typical of the independent studies. As expected, the average standard deviations in the full (2.82) and restricted (1.33) samples were significantly different from one another, t(48) = 4.63, p < .001.

[4] Of several formulas considered for restricted range correction, the Cohen et al. (2003) formula that we used was the most conservative. Using Thorndike's (1949) formula, the independent correlations actually become significantly higher than the nonindependent correlations. Also, if we use only the correlations that Vul et al. included in the correction for restricted range analysis, the results are the same—there is no longer a significant difference between the samples.

This should be seen as an exercise rather than a complete analysis, because we could not compute the variance for the full and restricted samples in each study and because we did not attempt to take all other possible sources of bias into account. Indeed, calculating the bias in effect size would be at least as complex as determining a valid multiple comparisons correction threshold, which requires detailed information about the data covariance structure in each study. Nevertheless, it does suggest that whatever inflation does exist may be far more modest and less troubling than Vul et al.'s characterization suggests.
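For concreteness, a sketch of the correction is shown below. The function implements the Cohen et al. (2003) formula from footnote [2]; the standard deviations are the ones estimated in footnote [3], while the input correlation is hypothetical.

```python
# Sketch: restriction-of-range correction (Cohen, Cohen, West, & Aiken, 2003).
import numpy as np

def correct_for_restriction(r_restricted, sd_full, sd_restricted):
    """Estimate the full-range correlation from a range-restricted one."""
    k = sd_full / sd_restricted  # ratio of full to restricted SDs
    return r_restricted * k / np.sqrt(1 + r_restricted ** 2 * (k ** 2 - 1))

# Using the SDs estimated in footnote [3] (full = 2.82, restricted = 1.33);
# the observed correlation of .50 here is a hypothetical illustration.
print(correct_for_restriction(0.50, 2.82, 1.33))  # ~0.77
```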
ARE SUCH LARGE CORRELATIONS THEORETICALLY POSSIBLE?

The upper limit on the observed correlation between two measures is constrained by the square root of the product of the reliabilities of the two measures as measured in a particular sample. Vul et al. suggest that many nonindependent correlations violate this upper limit on what should be observable. On the basis of a handful of studies that examined the reliability of fMRI data, Vul et al. provide estimates of what they believe a likely average reliability is for fMRI data (approximately .70). Similarly, they suggest that personality measures are likely to have reliabilities in the .70–.80 range. Applying the product-of-reliabilities formula, they conclude that the maximum upper bound for observable correlations is .74.

It is troubling that Vul et al. would make the bold claim that observed correlations from social neuroscience above .74 are "impossibly high" (p. xx) and above the "theoretical upper bound" (p. xx) of what can legitimately be observed. This claim is based on a rough estimate of reliability that is then generalized across a range of measures. If we estimated that grocery store items cost, on average, about $3, would it then be theoretically impossible to find a $12 item? Vul et al. make this claim despite the facts that (a) fMRI reliability has never been assessed for social neuroscience tasks; (b) if one is generalizing from previously measured reliabilities to measures with unknown reliability, it is the highest known reliabilities, not the average, that might best describe the theoretical maximum correlation observable; and (c) they acknowledge in Footnote 19 that some independent correlations are above .74 due to sampling fluctuations of observed correlations, an acknowledgment that should also extend to the nonindependent correlations.[5]

[5] After correcting for restricted range, 46% of the independent correlations are above .74 and thus also violate Vul et al.'s theoretical upper bound.

If we assume that brain regions in fMRI studies can have reliabilities above .90, as multiple studies have demonstrated (Aron, Gluck, & Poldrack, 2006; Fernández et al., 2003), then the reliability of the individual difference measures actually used becomes critical. Consider, for example, the correlation (r = .88) between a social distress measure and activation in the dorsal anterior cingulate cortex during a social pain manipulation (Eisenberger et al., 2003) that is singled out by Vul et al. from the first page of their article. If one generically assumes that individual difference measures will all have reliabilities of .70–.80, then one would falsely conclude that the observed correlation in that study is not theoretically possible. However, multiple studies have reported reliabilities for this social distress measure between .92 and .98 (Oaten, Williams, Jones, & Zadro, 2008; Van Beest & Williams, 2006), a fact that Vul et al. were aware of.[6] Applying reliabilities of .90 for fMRI and .95 for the social distress measure yields a theoretical upper limit on observable correlations of .92. Thus, by Vul et al.'s own criteria, a .88 correlation is theoretically possible in this case. This is just one example, but it points to the more general mistake of making claims about the theoretical upper bound of correlations based on approximate guesses of the measures' reliability.

[6] One of the authors of the Vul et al. article emailed one of the authors of the Eisenberger et al. (2003) article about reliabilities for this social distress measure prior to the submission of their manuscript and further inquired specifically about one of the .92 reliabilities (K.D. Williams, personal communication, January 17, 2009). Consequently, it is disappointing that Vul et al. did not indicate that this .88 correlation was not violating the theoretical upper limit for this study.
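The arithmetic of the reliability ceiling is simple enough to state as code (our sketch; the reliability values are the ones discussed above):

```python
# Sketch: the classical attenuation ceiling on an observed correlation.
import math

def max_observable_r(reliability_x, reliability_y):
    """Observed r cannot exceed sqrt(rel_x * rel_y)."""
    return math.sqrt(reliability_x * reliability_y)

print(max_observable_r(0.70, 0.80))  # ~0.75, near Vul et al.'s .74 estimate
print(max_observable_r(0.90, 0.95))  # ~0.92, the ceiling relevant to the
                                     # Eisenberger et al. (2003) r = .88
```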
CONCLUSIONS

Our reply has focused on several misconceptions in the Vul et al. article that unfortunately have been sensationalized by the authors and by the media prior to publication. Because social neuroscience has garnered a lot of attention in a short period of time, singling it out for criticism may make for better headlines. As this article makes clear, however, Vul et al.'s criticisms rest on shaky ground at best. Vul et al. describe a two-step inferential procedure that would be bad science if anyone did it, but as far as we know, nobody does.[7] They used a survey to assess which authors use this method, but they did not include any questions that would actually assess whether the nonindependence error had occurred. As long as standard procedures for addressing the issue of multiple comparisons are applied in a reasonable sample size, large correlations will occur by chance only rarely, and most observed effects will reflect true underlying relationships. Vul et al.'s own meta-analysis suggests that the nonindependent correlations are only modestly inflated, calling into question the use of labels such as "spurious" and "untrustworthy." Finally, Vul et al. make incorrect assumptions when attempting to use average expected reliabilities to bound the theoretically possible observed correlations.

[7] An important general lesson from this discussion is that post hoc correlations will tend to be inflated—a statistical phenomenon understood since the 1800s—and should not be taken at face value as estimates of the correlation magnitude. As with any behavioral study of correlations, one should use cross-validation to quantify the exact magnitude of the predictive relationship of one variable on a second variable, as Vul et al. suggest. However, this valid point should not be taken as support for Vul et al.'s argument that the hypothesis-testing framework used to analyze brain–behavior correlations is flawed. This is not the case.

Ultimately, we should all be mindful that the effect sizes from whole-brain analyses are likely to be inflated, but confident in the knowledge that such correlations reflect meaningful relationships between psychological and neural variables to the extent that valid multiple comparisons procedures are used. There are various ways to balance the concerns of false positive results and sensitivity to true effects, and social neuroscience correlations use widely accepted practices from cognitive neuroscience. These practices will no doubt continue to evolve. In the meantime, we'll keep doing the science of exploring how the brain interacts with the social and emotional worlds we live in.

Acknowledgments—We would like to thank the following individuals (in alphabetical order) for feedback on drafts of this paper and relevant discussions: Arthur Aron, Mahzarin Banaji, Peter Bentler, Sarah Blakemore, Colin Camerer, Turhan Canli, Jessica Cohen, William Cunningham, Ray Dolan, Mark D'Esposito, Naomi Eisenberger, Emily Falk, Susan Fiske, Karl Friston, Chris Frith, Rita Goldstein, Didier Grandjean, Amanda Guyer, Christine Hooker, Christian Keysers, William Killgore, Ethan Kross, Claus Lamm, Martin Lindquist, Jason Mitchell, Dean Mobbs, Keely Muscatell, Thomas Nichols, Kevin Ochsner, John O'Doherty, Stephanie Ortigue, Jennifer Pfeifer, Daniel Pine, Russ Poldrack, Joshua Poore, Lian Rameson, Antonio Rangel, Steve Reise, James Rilling, David Sander, Ajay Satpute, Sophie Schwartz, Tania Singer, Thomas Straube, Hidehiko Takahashi, Shelley Taylor, Alex Todorov, Patrik Vuilleumier, Paul Whalen, and Kip Williams.

REFERENCES

Aron, A.R., Gluck, M.A., & Poldrack, R.A. (2006).
Long-term test-retest reliability of functional MRI in a classification learning task. NeuroImage, 29, 1000–1006.

Cohen, J., Cohen, P., West, S.G., & Aiken, L.S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. Mahwah, NJ: Erlbaum.

Eisenberger, N.I., Lieberman, M.D., & Williams, K.D. (2003). Does rejection hurt? An fMRI study of social exclusion. Science, 302, 290–292.

Fernández, G., Specht, K., Weis, S., Tendolkar, I., Reuber, M., Fell, J., et al. (2003). Intrasubject reproducibility of presurgical language lateralization and mapping using fMRI. Neurology, 60, 969–975.

Hooker, C.I., Verosky, S.C., Miyakawa, A., Knight, R.T., & D'Esposito, M. (2008). The influence of personality on neural mechanisms of observational fear and reward learning. Neuropsychologia, 46, 2709–2724.

Kober, H., Barrett, L.F., Joseph, J., Bliss-Moreau, E., Lindquist, K., & Wager, T.D. (2008). Functional grouping and cortical–subcortical interactions in emotion: A meta-analysis of neuroimaging studies. NeuroImage, 42, 998–1031.

Kross, E., Egner, T., Ochsner, K., Hirsch, J., & Downey, G. (2007). Neural dynamics of rejection sensitivity. Journal of Cognitive Neuroscience, 19, 945–956.

Leland, D., Arce, E., Feinstein, J., & Paulus, M. (2006). Young adult stimulant users' increased striatal activation during uncertainty is related to impulsivity. NeuroImage, 33, 725–731.

Mobbs, D., Hagan, C.C., Azim, E., Menon, V., & Reiss, A.L. (2005). Personality predicts activity in reward and emotional regions associated with humor. Proceedings of the National Academy of Sciences, USA, 102, 16502–16506.

Nee, D.E., Wager, T.D., & Jonides, J. (2007). Interference resolution: Insights from a meta-analysis of neuroimaging tasks. Cognitive, Affective, & Behavioral Neuroscience, 7, 1–17.

Nichols, T., & Hayasaka, S. (2003). Controlling the familywise error rate in functional neuroimaging: A comparative review. Statistical Methods in Medical Research, 12, 419–446.

Oaten, M., Williams, K.D., Jones, A., & Zadro, L. (2008). The effects of ostracism on self-regulation in the socially anxious. Journal of Social and Clinical Psychology, 27, 471–504.

Ochsner, K.N., Ludlow, D.H., Knierim, K., Hanelin, J., Ramachandran, T., Glover, G.C., & Mackey, S.C. (2006). Neural correlates of individual differences in pain-related fear and anxiety. Pain, 120, 69–77.

Park, R.L. (2000). Voodoo science: The road from foolishness to fraud. New York: Oxford University Press.

Posse, S., Fitzgerald, D., Gao, K., Habel, U., Rosenberg, D., Moore, G.J., & Schneider, F. (2003). Real-time fMRI of temporolimbic regions detects amygdala activation during single-trial self-induced sadness. NeuroImage, 18, 760–768.

Price, D.D. (2000). Psychological and neural mechanisms of the affective dimension of pain. Science, 288, 1769–1772.

Rilling, J.K., Glenn, A.L., Jairam, M.R., Pagnoni, G., Goldsmith, D.R., Elfenbein, H.A., & Lilienfeld, S.O. (2007). Neural correlates of social cooperation and non-cooperation as a function of psychopathy. Biological Psychiatry, 61, 1260–1271.

Singer, T., Seymour, B., O'Doherty, J., Kaube, H., Dolan, R., & Frith, C.D. (2004). Empathy for pain involves the affective but not sensory components of pain. Science, 303, 1157–1162.

Singer, T., Seymour, B., O'Doherty, J.P., Stephan, K.E., Dolan, R.J., & Frith, C.D. (2006). Empathic neural responses are modulated by the perceived fairness of others. Nature, 439, 466–469.

Thorndike, R.L. (1949).
Personnel selection. New York: Wiley.

Tukey, J.W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.

Van Beest, I., & Williams, K.D. (2006). When inclusion costs and ostracism pays, ostracism still hurts. Journal of Personality and Social Psychology, 91, 918–928.

van Snellenberg, J.X., & Wager, T.D. (in press). Cognitive and motivational functions of the human prefrontal cortex. In E. Goldberg & D. Bougakov (Eds.), Luria's legacy in the 21st century. Oxford, United Kingdom: Oxford University Press.

Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Voodoo correlations in social neuroscience. Perspectives on Psychological Science, 4, xx–xx.

Vul, E., & Kanwisher, N. (in press). Begging the question: The nonindependence error in fMRI data analysis. In S. Hanson & M. Bunzl (Eds.), Foundations and philosophy for neuroimaging. Cambridge, MA: MIT Press.

Wager, T.D., Barrett, L.F., Bliss-Moreau, E., Lindquist, K., Duncan, S., & Kober, H. (2008). The neuroimaging of emotion. In M. Lewis, J.M. Haviland-Jones, & L.F. Barrett (Eds.), Handbook of emotions (3rd ed., pp. 249–271). New York: Guilford Press.

Wager, T.D., & Smith, E.E. (2003). Neuroimaging studies of working memory: A meta-analysis. Cognitive, Affective, & Behavioral Neuroscience, 3, 255–274.

APPENDIX: SAMPLING ERRORS IN THE VUL ET AL. (2009) META-ANALYSIS

1. In Study 4 (Ochsner et al., 2006), one nonindependent correlation was not included in the analysis.

2. In Study 6 (Eisenberger et al., 2003), Vul et al. included three "correlations" that were not in fact correlations. For three of the main effect analyses comparing exclusion to inclusion, the authors reported an effect size r statistic, along with t and p. No individual difference variable was involved in these analyses.

3. In Study 7 (Hooker, Verosky, Miyakawa, Knight, & D'Esposito, 2008), three independent correlations were not included in the analysis.

4. In Study 21 (Rilling et al., 2007), 35 nonindependent correlations from Table 8 were not included, and one other correlation from the manuscript was also not included. Although these correlations are listed as a table of r values, it is conceivable that they were left out of the analysis because p values were not presented. A simple calculation would have confirmed that, with 22 subjects, nearly all of these correlations are significant at p < .005 (and most at p < .001) and thus met the sampling criteria.

5. In Study 22 (Mobbs, Hagan, Azim, Menon, & Reiss, 2005), five nonindependent correlations were included in Figure 5. However, these correlations were calculated from ROIs obtained in a contrast analysis comparing two conditions, and they should therefore have been classified as independent correlations.

6. In Study 31 (Singer et al., 2006), four nonindependent correlations that are described in the text were not included, though they were listed numerically in the supplementary materials (as indicated in the main text).

7. In Study 39 (Posse et al., 2003), one independent correlation was not included in the analysis.

8. In Study 45 (Leland et al., 2006), one independent correlation was not included in the analysis.
9. In Study 53 (Kross, Egner, Ochsner, Hirsch, & Downey, 2007), three independent correlations were not included in the analysis.