Tuesday, March 6, 2012

Skeletons in the closet

As someone who has thrown lots of stones in recent years, I find it easy to forget that anyone who publishes enough will end up with some skeletons in their closet.  I was reminded of that fact today, when Dorothy Bishop posted a detailed analysis of a 2003 paper on which I am a coauthor.

This paper studied a set of children diagnosed with dyslexia who were scanned before and after treatment with the Fast ForWord training program.  The results showed improved language and reading function, which were associated with changes in brain activation. 

Dorothy notes four major problems with the study:
  • There was no dyslexic control group; thus, we don't know whether any improvements over time were specific to the treatment, or would have occurred with a control treatment or even without any treatment.
  • The brain imaging data were thresholded using an uncorrected threshold.
  • One of the main conclusions (the "normalization" of activation following training) is not supported by the necessary interaction statistic, but rather by a visual comparison of maps (see the sketch just after this list).
  • The correlation between changes in language scores and activation was reported for only one of the many measures, and it appeared to have been driven by outliers.
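To make the third point concrete, here is a rough sketch in Python with entirely made-up numbers of the kind of test that the normalization claim actually requires: a group-by-time interaction, which in a simple two-group, two-session design boils down to comparing the change scores between groups. The group sizes, effect sizes, and variable names here are all hypothetical and are not meant to reproduce the actual study design or data.

```python
# Sketch with simulated numbers: the "normalization" claim is a claim about a
# group-by-time interaction.  With two groups each scanned twice, that
# interaction reduces to a two-sample test on the pre-to-post change scores --
# which is very different from eyeballing two thresholded maps.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 20  # subjects per group (hypothetical)

# Simulated mean ROI activation for each subject at each session
dys_pre  = rng.normal(0.2, 1.0, n)   # treated (dyslexic) group, before training
dys_post = rng.normal(0.8, 1.0, n)   # treated group, after training
con_pre  = rng.normal(1.0, 1.0, n)   # comparison group, first scan
con_post = rng.normal(1.0, 1.0, n)   # comparison group, second scan

t, p = stats.ttest_ind(dys_post - dys_pre, con_post - con_pre)
print(f"group x time interaction: t = {t:.2f}, p = {p:.3f}")
```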
Looking back at the paper, I see that Dorothy is absolutely right on each of these points.  In defense of my coauthors, I would note that points 2-4 were basically standard practice in fMRI analysis 10 years ago (and still crop up fairly often today).  Ironically, I raised two of these issues in my recent paper for the special issue of Neuroimage celebrating the 20th anniversary of fMRI, in talking about the need for increased methodological rigor:

Foremost, I hope that in the next 20 years the field of cognitive neuroscience will increase the rigor with which it applies neuroimaging methods. The recent debates about circularity and “voodoo correlations” (Kriegeskorte et al., 2009; Vul et al., 2009) have highlighted the need for increased care regarding analytic methods. Consideration of similar debates in genetics and clinical trials led Ioannidis (2005) to outline a number of factors that may contribute to increased levels of spurious results in any scientific field, and the degree to which many of these apply to fMRI research is rather sobering:
• small sample sizes
• small effect sizes
• large number of tested effects
• flexibility in designs, definitions, outcomes, and analysis methods
• being a “hot” scientific field
Some simple methodological improvements could make a big difference. First, the field needs to agree that inference based on uncorrected statistical results is not acceptable (cf. Bennett et al., 2009). Many researchers have digested this important fact, but it is still common to see results presented at thresholds such as uncorrected p < .005. Because such uncorrected thresholds do not adapt to the data (e.g., the number of voxels tested or their spatial smoothness), they are certain to be invalid in almost every situation (potentially being either overly liberal or overly conservative). As an example, I took the fMRI data from Tom et al. (2007), and created a random “individual difference” variable. Thus, there should be no correlations observed other than Type I errors. However, thresholding at uncorrected p < .001 and a minimum cluster size of 25 voxels (a common heuristic threshold) showed a significant region near the amygdala; Fig. 1 shows this region along with a plot of the “beautiful” (but artifactual) correlation between activation and the random behavioral variable. This activation was not present when using a corrected statistic. A similar point was made in a more humorous way by Bennett et al. (2010), who scanned a dead salmon being presented with a social cognition task and found activation when using an uncorrected threshold. There are now a number of well-established methods for multiple comparisons correction (Poldrack et al., 2011), such that there is absolutely no excuse to present results at uncorrected thresholds. The most common reason for failing to use rigorous corrections for multiple tests is that with smaller samples these methods are highly conservative, and thus result in a high rate of false negatives. This is certainly a problem, but I don't think that the answer is to present uncorrected results; rather, the answer is to ensure that one's sample is large enough to provide sufficient statistical power to find the effects of interest.
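For readers who want to see just how easy it is to get a "beautiful" result out of pure noise, here is a small simulation sketch along the same lines (synthetic data only, not the Tom et al. dataset; the subject and voxel counts are arbitrary assumptions): correlate a random "behavioral" score with thousands of noise voxels and compare an uncorrected p < .001 threshold with a Bonferroni-corrected one.

```python
# Synthetic-data sketch: correlating a random "behavioral" score against
# thousands of independent noise voxels.  With an uncorrected p < .001
# threshold, spurious "activations" turn up in essentially every simulated
# experiment; a Bonferroni-corrected threshold keeps the family-wise rate near 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_subjects, n_voxels, n_sims = 16, 20000, 100   # arbitrary, hypothetical sizes

hits_uncorrected, hits_corrected = 0, 0
for _ in range(n_sims):
    behavior = rng.normal(size=n_subjects)            # random "individual difference"
    voxels = rng.normal(size=(n_subjects, n_voxels))  # pure noise "activation"
    # per-voxel Pearson correlation with the random behavioral score
    r = (stats.zscore(voxels, axis=0) * stats.zscore(behavior)[:, None]).mean(axis=0)
    t = r * np.sqrt((n_subjects - 2) / (1 - r**2))
    p = 2 * stats.t.sf(np.abs(t), df=n_subjects - 2)
    hits_uncorrected += np.any(p < 0.001)             # uncorrected threshold
    hits_corrected += np.any(p < 0.05 / n_voxels)     # Bonferroni-corrected

print(f"experiments with a 'significant' voxel, uncorrected p < .001: {hits_uncorrected}/{n_sims}")
print(f"experiments with a 'significant' voxel, Bonferroni corrected: {hits_corrected}/{n_sims}")
```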
Second, I have become increasingly concerned about the use of “small volume corrections” to address the multiple testing problem. The use of a priori masks to constrain statistical testing is perfectly legitimate, but one often gets the feeling that the masks used for small volume correction were chosen after seeing the initial results (perhaps after a whole-brain corrected analysis was not significant). In such a case, any inferences based on these corrections are circular and the statistics are useless. Researchers who plan to use small volume corrections in their analysis should formulate a specific analysis plan prior to any analyses, and only use small volume corrections that were explicitly planned a priori. This sounds like a remedial lesson in basic statistics, but unfortunately it seems to be regularly forgotten by researchers in the field.
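Here is a similarly minimal simulation of why a mask chosen after peeking at the results defeats the purpose of the correction (again purely synthetic, noise-only data; the voxel count and mask size are arbitrary assumptions):

```python
# Synthetic-data sketch: "small volume correction" with a mask chosen after
# looking at the results.  All voxels are pure noise, so any "significant"
# finding is a false positive.  A whole-brain correction controls the error
# rate; a post hoc mask around the best-looking voxels does not.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_subjects, n_voxels, n_sims, mask_size = 16, 20000, 200, 50  # hypothetical sizes

fp_wholebrain, fp_posthoc_svc = 0, 0
for _ in range(n_sims):
    data = rng.normal(size=(n_subjects, n_voxels))   # noise-only "contrast" images
    t, p = stats.ttest_1samp(data, 0.0, axis=0)      # one-sample group test per voxel
    # whole-brain Bonferroni correction
    fp_wholebrain += np.any(p < 0.05 / n_voxels)
    # circular "small volume": mask picked around the smallest p-values, then
    # correction applied only within that mask
    mask = np.argsort(p)[:mask_size]
    fp_posthoc_svc += np.any(p[mask] < 0.05 / mask_size)

print(f"false-positive experiments, whole-brain corrected: {fp_wholebrain}/{n_sims}")
print(f"false-positive experiments, post hoc SVC:          {fp_posthoc_svc}/{n_sims}")
```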
Third, the field needs to move toward the use of more robust methods for statistical inference (e.g., Huber, 2004). In particular, analyses of correlations between activation and behavior across subjects are highly susceptible to the influence of outlier subjects, especially with small sample sizes. Robust statistical methods can ensure that the results are not overly influenced by these outliers, either by reducing the effect of outlier datapoints (e.g., robust regression using iteratively reweighted least squares) or by separately modeling data points that fall too far outside of the rest of the sample (e.g., mixture modeling). Robust tools for fMRI group analysis are increasingly available, both as part of standard software packages (such as the “outlier detection” technique implemented in FSL: Woolrich, 2008) and as add-on toolboxes (Wager et al., 2005). Given the frequency with which outliers are observed in group fMRI data, these methods should become standard in the field. However, it's also important to remember that they are not a panacea, and that it remains important to apply sufficient quality control to statistical results, in order to understand the degree to which one's results reflect generalizable patterns versus statistical figments.
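As a toy illustration of the outlier problem (with made-up numbers, not data from any study discussed here), the sketch below compares an ordinary across-subject correlation with a leave-one-out check and a robust regression fit via iteratively reweighted least squares, using statsmodels' RLM. All variable names and values are hypothetical.

```python
# Sketch with made-up numbers: a cross-subject brain-behavior correlation that
# hinges on a single extreme subject, plus a robust regression fit
# (iteratively reweighted least squares with Huber weights).
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 14
behavior = rng.normal(size=n)            # e.g., change in a language score
activation = rng.normal(size=n)          # e.g., change in ROI activation
behavior[-1], activation[-1] = 6.0, 6.0  # one extreme subject (hypothetical)

# Ordinary Pearson correlation, with and without the extreme subject
r_all, p_all = stats.pearsonr(behavior, activation)
r_loo, p_loo = stats.pearsonr(behavior[:-1], activation[:-1])
print(f"r with outlier:    {r_all:.2f} (p = {p_all:.3f})")
print(f"r without outlier: {r_loo:.2f} (p = {p_loo:.3f})")

# Robust regression: IRLS downweights subjects with large residuals, so the
# slope depends less on any single datapoint (inspect the fitted weights)
X = sm.add_constant(behavior)
fit = sm.RLM(activation, X, M=sm.robust.norms.HuberT()).fit()
print("robust slope:", round(float(fit.params[1]), 2))
print("IRLS weight for the extreme subject:", round(float(fit.weights[-1]), 2))
```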
It should be clear from these comments that my faith in the results of any study that uses such problematic methods (as the Temple et al. study did) is relatively weak.  I personally have learned my lesson, and our lab now does its best to adhere to these more rigorous standards, even when that means a study sometimes ends up being unpublishable.  I can only hope that others will join me.