Tuesday, March 6, 2012

Skeletons in the closet

As someone who has thrown lots of stones in recent years, I find it easy to forget that anyone who publishes enough will end up with some skeletons in their closet. I was reminded of that fact today, when Dorothy Bishop posted a detailed analysis of a paper published in 2003 on which I am a coauthor.

This paper studied a set of children diagnosed with dyslexia who were scanned before and after treatment with the Fast ForWord training program. The results showed improvements in language and reading function that were associated with changes in brain activation.

Dorothy notes four major problems with the study:
  • There was no dyslexic control group; thus, we don't know whether any improvements over time were specific to the treatment, or would have occurred with a control treatment or even without any treatment.
  • The brain imaging data were thresholded using an uncorrected threshold.
  • One of the main conclusions (the "normalization" of activation following training) is not supported by the necessary interaction statistic, but rather by a visual comparison of maps (see the sketch after this list).
  • The correlation between changes in language scores and activation was reported for only one of the many measures, and it appeared to have been driven by outliers.
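As an aside on the third point: the underlying statistical issue is that "not significant before, significant after" is not itself a test of change; the claim requires the group-by-time interaction. Here is a minimal Python sketch of that point, using entirely made-up numbers rather than anything from the study:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 20
# hypothetical regional activation for a treated group and a comparison group, pre and post
treated_pre = rng.normal(0.15, 0.5, n)
treated_post = rng.normal(0.35, 0.5, n)
comparison_pre = rng.normal(0.30, 0.5, n)
comparison_post = rng.normal(0.40, 0.5, n)

# comparing thresholded maps amounts to this: "not significant before, significant after"
print(stats.ttest_1samp(treated_pre, 0.0))   # may fail to reach p < .05
print(stats.ttest_1samp(treated_post, 0.0))  # may exceed the threshold

# the inference about training-related change requires the interaction:
# does the treated group change more than the comparison group?
interaction = stats.ttest_ind(treated_post - treated_pre,
                              comparison_post - comparison_pre)
print(interaction)  # can easily be far from significant even when the maps "look" different

(For simplicity the sketch treats all measurements as independent samples; the logic is the same in a proper repeated-measures model.)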
Looking back at the paper, I see that Dorothy is absolutely right on each of these points. In defense of my coauthors, I would note that points 2-4 were basically standard practice in fMRI analysis 10 years ago (and still crop up fairly often today). Ironically, I raised two of these issues in my recent paper for the special issue of Neuroimage celebrating the 20th anniversary of fMRI, in talking about the need for increased methodological rigor:

Foremost, I hope that in the next 20 years the field of cognitive neuroscience will increase the rigor with which it applies neuroimaging methods. The recent debates about circularity and “voodoo correlations” (Kriegeskorte et al., 2009; Vul et al., 2009) have highlighted the need for increased care regarding analytic methods. Consideration of similar debates in genetics and clinical trials led Ioannidis (2005) to outline a number of factors that may contribute to increased levels of spurious results in any scientific field, and the degree to which many of these apply to fMRI research is rather sobering:
• small sample sizes
• small effect sizes
• large number of tested effects
• flexibility in designs, definitions, outcomes, and analysis methods
• being a “hot” scientific field
Some simple methodological improvements could make a big difference. First, the field needs to agree that inference based on uncorrected statistical results is not acceptable (cf. Bennett et al., 2009). Many researchers have digested this important fact, but it is still common to see results presented at thresholds such as uncorrected p < .005. Because such uncorrected thresholds do not adapt to the data (e.g., the number of voxels tested or their spatial smoothness), they are certain to be invalid in almost every situation (potentially being either overly liberal or overly conservative). As an example, I took the fMRI data from Tom et al. (2007), and created a random “individual difference” variable. Thus, there should be no correlations observed other than Type I errors. However, thresholding at uncorrected p < .001 and a minimum cluster size of 25 voxels (a common heuristic threshold) showed a significant region near the amygdala; Fig. 1 shows this region along with a plot of the “beautiful” (but artifactual) correlation between activation and the random behavioral variable. This activation was not present when using a corrected statistic. A similar point was made in a more humorous way by Bennett et al. (2010), who scanned a dead salmon being presented with a social cognition task and found activation when using an uncorrected threshold. There are now a number of well-established methods for multiple comparisons correction (Poldrack et al., 2011), such that there is absolutely no excuse to present results at uncorrected thresholds. The most common reason for failing to use rigorous corrections for multiple tests is that with smaller samples these methods are highly conservative, and thus result in a high rate of false negatives. This is certainly a problem, but I don't think that the answer is to present uncorrected results; rather, the answer is to ensure that one's sample is large enough to provide sufficient statistical power to find the effects of interest.
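The random "individual difference" demonstration in the paragraph above is easy to reproduce in spirit with simulated data. The Python sketch below uses smoothed pure-noise maps and a random covariate (not the Tom et al. data), thresholds the voxelwise correlation map at uncorrected p < .001 with a 25-voxel cluster heuristic, and counts the clusters that survive:

import numpy as np
from scipy import ndimage, stats

rng = np.random.default_rng(42)
n_subj, shape = 16, (40, 48, 40)

# pure-noise "activation" maps, spatially smoothed to mimic fMRI data
maps = np.stack([ndimage.gaussian_filter(rng.standard_normal(shape), sigma=2)
                 for _ in range(n_subj)])
behav = rng.standard_normal(n_subj)  # a random "individual difference" score

# voxelwise correlation between the noise maps and the random covariate
bz = (behav - behav.mean()) / behav.std()
mz = (maps - maps.mean(0)) / maps.std(0)
r = (mz * bz[:, None, None, None]).mean(0)
t = r * np.sqrt((n_subj - 2) / (1 - r ** 2))
p = 2 * stats.t.sf(np.abs(t), df=n_subj - 2)

# uncorrected p < .001 plus the common 25-voxel cluster-extent heuristic
labels, n_clusters = ndimage.label(p < 0.001)
cluster_sizes = np.bincount(labels.ravel())[1:]
print("suprathreshold clusters of >= 25 voxels in pure noise:",
      int((cluster_sizes >= 25).sum()))

Whether a particular run yields a surviving cluster depends on the seed and the smoothness, but repeating the simulation shows that this heuristic lets false positives through far more often than a properly corrected threshold would.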
Second, I have become increasingly concerned about the use of “small volume corrections” to address the multiple testing problem. The use of a priori masks to constrain statistical testing is perfectly legitimate, but one often gets the feeling that the masks used for small volume correction were chosen after seeing the initial results (perhaps after a whole-brain corrected analysis was not significant). In such a case, any inferences based on these corrections are circular and the statistics are useless. Researchers who plan to use small volume corrections in their analysis should formulate a specific analysis plan prior to any analyses, and only use small volume corrections that were explicitly planned a priori. This sounds like a remedial lesson in basic statistics, but unfortunately it seems to be regularly forgotten by researchers in the field.
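To see why the mask must be fixed in advance, here is a toy simulation (hypothetical voxel counts, independent null voxels rather than real fMRI data, with Bonferroni standing in for the small-volume correction). Applying the small-volume threshold to a mask drawn around the strongest voxels of a null map inflates the family-wise error rate to near 1, while the same threshold applied to a mask fixed before seeing the data controls it at the nominal level:

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, n_vox, mask_size, n_sims = 0.05, 60000, 300, 2000
# two-sided Bonferroni threshold for a 300-voxel small-volume correction
svc_z = stats.norm.isf(alpha / (2 * mask_size))

fp_posthoc = fp_apriori = 0
apriori_mask = np.arange(mask_size)                    # mask fixed before any data are seen
for _ in range(n_sims):
    z = rng.standard_normal(n_vox)                     # a null map: no true effects anywhere
    posthoc_mask = np.argsort(np.abs(z))[-mask_size:]  # "mask" drawn around the observed peaks
    fp_posthoc += np.abs(z[posthoc_mask]).max() > svc_z
    fp_apriori += np.abs(z[apriori_mask]).max() > svc_z

print("FWE, post-hoc mask:", fp_posthoc / n_sims)      # close to 1.0
print("FWE, a priori mask:", fp_apriori / n_sims)      # close to the nominal 0.05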
Third, the field needs to move toward the use of more robust methods for statistical inference (e.g., Huber, 2004). In particular, analyses of correlations between activation and behavior across subjects are highly susceptible to the influence of outlier subjects, especially with small sample sizes. Robust statistical methods can ensure that the results are not overly influenced by these outliers, either by reducing the effect of outlier datapoints (e.g., robust regression using iteratively reweighted least squares) or by separately modeling data points that fall too far outside of the rest of the sample (e.g., mixture modeling). Robust tools for fMRI group analysis are increasingly available, both as part of standard software packages (such as the “outlier detection” technique implemented in FSL: Woolrich, 2008) and as add-on toolboxes (Wager et al., 2005). Given the frequency with which outliers are observed in group fMRI data, these methods should become standard in the field. However, it's also important to remember that they are not a panacea, and that it remains important to apply sufficient quality control to statistical results, in order to understand the degree to which one's results reflect generalizable patterns versus statistical figments.
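As an illustration of the outlier problem and the robust alternative, here is a minimal sketch with simulated data using statsmodels (not one of the specific toolboxes named above, though the iteratively reweighted least squares idea is the same): ordinary least squares can report a strong brain-behavior correlation that is carried entirely by one extreme subject, while the robust fit downweights that subject.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 16
behavior = rng.standard_normal(n)
activation = rng.standard_normal(n)          # no true relationship between the two
behavior[-1], activation[-1] = 4.0, 4.0      # one extreme subject

X = sm.add_constant(behavior)
ols = sm.OLS(activation, X).fit()
rlm = sm.RLM(activation, X, M=sm.robust.norms.HuberT()).fit()

print("OLS slope, p:", ols.params[1], ols.pvalues[1])  # often spuriously "significant"
print("robust slope:", rlm.params[1])                  # shrinks toward zero
print("weight given to the outlier:", rlm.weights[-1]) # well below 1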
It should be clear from these comments that my faith in the results of any study that uses such problematic methods (as the Temple et al. study did) is relatively weak. I personally have learned my lesson, and our lab now does its best to adhere to these more rigorous standards, even when that means a study sometimes ends up being unpublishable. I can only hope that others will join me.

      11 comments:

      1. If only all scientists were as willing to admit previous mistakes.

        I think, however, your second-to-last sentence reveals the real problem. If you perform a study badly and get "interesting" results, somebody somewhere will publish it (PNAS if you know the right people). Doing the study properly can make it unpublishable.

      2. Even if activity maps are corrected for multiple testing, the results can be invalid:

        http://www.sciencedirect.com/science/article/pii/S1053811912003825?v=s5

      3. As a reviewer, what is your practice for detecting improper use of small volume correction?

        Replies
        1. It's really tough to give a general rule. I feel better about it if it follows directly from previous work by the same group, such that the same area is corrected across multiple papers and there is a stronger reason to think that they set out with that specific hypothesis in mind.

        2. I agree. Cases like FFA for face processing, the amygdala for emotion, and the hippocampus for memory are widely accepted.

          But there are other regions that seem to turn up in every fMRI study, such as the insular cortex.
          In addition, as we know, researchers can always find several published papers that support whatever they happen to find when it comes time to interpret a result. That is, they can always collect some papers to justify their choice of regions for small volume correction.

          So I wonder whether neuroscientists could provide a list of cognition-region pairs (e.g., face recognition-FFA) that are appropriate for small volume correction.

        3. I think a better solution would be to give people a chance to pre-register their hypotheses and ROIs prior to data collection and analysis. That would allow more flexibility for new hypotheses but would still help prevent post-hoc SVCs.

      4. That is to say, researchers should publish their papers based mostly on self-discipline. :)

      5. In another opinion paper regarding fMRI analyses and reporting (http://www.ncbi.nlm.nih.gov/pubmed/21856431), Dr. Poldrack states “Some simple methodological improvements could make a big difference. First, the field needs to agree that inference based on uncorrected statistical results is not acceptable (cf. Bennett et al., 2009). Many researchers have digested this important fact, but it is still common to see results presented at thresholds such as uncorrected p<.005. Because such uncorrected thresholds do not adapt to the data (e.g., the number of voxels tested or their spatial smoothness), they are certain to be invalid in almost every situation (potentially being either overly liberal or overly conservative).” This is a good point, but given the fact that Dr. Poldrack has published papers in high impact journals that rely heavily on inferences from data using uncorrected thresholds (e.g. http://www.ncbi.nlm.nih.gov/pubmed/16157284), and does not appear to have issued any statements to the journals regarding their validity, one wonders whether Dr. Poldrack wants to have his cake and eat it too, so to speak. A similar point can be made regarding Dr. Poldrack’s attitude regarding the use of small volume correction. In this paper, he states “Second, I have become increasingly concerned about the use of “small volume corrections” to address the multiple testing problem. The use of a priori masks to constrain statistical testing is perfectly legitimate, but one often gets the feeling that the masks used for small volume correction were chosen after seeing the initial results (perhaps after a whole-brain corrected analysis was not significant). In such a case, any inferences based on these corrections are circular and the statistics are useless”. While this is also true, one wonders whether Dr. Poldrack only trusts his group to use this tool correctly, since it is frequently employed in his papers.
        In a third opinion paper (http://www.ncbi.nlm.nih.gov/pubmed/20571517), Dr. Poldrack discusses the problem of circularity in fMRI analyses. While this is also an important topic, Dr. Poldrack’s group has also published papers using circular analyses (e.g. http://www.jneurosci.org/content/27/14/3743.full.pdf, http://www.jneurosci.org/content/26/9/2424, http://www.ncbi.nlm.nih.gov/pubmed/17255512).

      6. (Final)

        I would like to note that the reason for this comment is not to malign Dr. Poldrack or his research, but rather to attempt to clarify Dr. Poldrack’s opinion of how others should view his previous research when it fails to meet the rigorous standards that he persistently endorses. I am very much in agreement with Dr. Poldrack that rigorous methodology and transparency are important foundations for building a strong science. As a graduate student, it is frustrating to see high-profile scientists such as Dr. Poldrack call for increased methodological rigor by new researchers (typically while, rightfully, labeling work that does not meet methodological standards as unreliable) when they (1) have benefited (and arguably continue to benefit) from the relatively lower barriers to entry that come from having entered a research field before the emergence of a rigid methodological framework (i.e., having Neuron/PNAS/Science papers on their CV that would not be published in a low-tier journal today due to their methodological problems), and (2) do not apply the same level of criticism or skepticism to their own previous work as they do to emerging work when it does not meet current standards of rigor or transparency. I would like to know what Dr. Poldrack’s opinions are on these issues. I greatly appreciate any time and/or effort spent reading and/or replying to this comment.



      7. Thanks for your comments. I will address them in an upcoming blog post.

      8. Here is my response; thanks again for your comments: http://www.russpoldrack.org/2016/07/having-my-cake-and-eating-it-too.html
