Friday, July 22, 2016

Having my cake and eating it too?

Several years ago I blogged about some of the challenges around doing science in a field with emerging methodological standards. Today, a person going by the handle "Student" posted a set of pointed questions to that post, which I am choosing to respond to here as a new post rather than bury in the comments on the previous one. Here are the comments:

Dr. Poldrack has been at the forefront of advocating for increased rigor and reproducibility in neuroimaging and cognitive neuroscience. This paper provides many useful pieces of advice concerning the reporting of fMRI studies, and my comments are related to this paper and to other papers published by Dr. Poldrack. One of the sections in this paper deals specifically with the reporting of methods and associated parameters related to the control of type I error across multiple tests. In this section, Dr. Poldrack and colleagues write that "When cluster-based inference is used, this should be clearly noted and both the threshold used to create the clusters and the threshold for cluster size should be reported". I strongly agree with this sentiment, but find it frustrating that in later papers, Dr. Poldrack seemingly disregards his own advice with regard to the reporting of extent thresholds, opting to report only that data were cluster-corrected at P<0.05 (e.g. http://cercor.oxfordjournals.org/content/20/3/524.long, http://cercor.oxfordjournals.org/cgi/content/abstract/18/8/1923, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2876211/). In another paper (http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/19915091/), the methods report that "Z (Gaussianised T) statistic images were thresholded using cluster-corrected statistics with a height threshold of Z > 2.3 (unless otherwise noted) and a cluster probability threshold of P < 0.05, whole-brain corrected using the theory of Gaussian random fields", although every figure presented in the paper notes that the statistical maps shown were thresholded at Z>1.96, P<0.05, corrected. This last instance is particularly confusing, and borders on being misleading. While these are arguably minor omissions, I find it odd that I am thus far unable to find a paper where Dr. Poldrack actually follows his own advice here.
In another opinion paper regarding fMRI analyses and reporting (http://www.ncbi.nlm.nih.gov/pubmed/21856431), Dr. Poldrack states “Some simple methodological improvements could make a big difference. First, the field needs to agree that inference based on uncorrected statistical results is not acceptable (cf. Bennett et al., 2009). Many researchers have digested this important fact, but it is still common to see results presented at thresholds such as uncorrected p<.005. Because such uncorrected thresholds do not adapt to the data (e.g., the number of voxels tested or their spatial smoothness), they are certain to be invalid in almost every situation (potentially being either overly liberal or overly conservative).” This is a good point, but given the fact that Dr. Poldrack has published papers in high impact journals that rely heavily on inferences from data using uncorrected thresholds (e.g. http://www.ncbi.nlm.nih.gov/pubmed/16157284), and does not appear to have issued any statements to the journals regarding their validity, one wonders whether Dr. Poldrack wants to have his cake and eat it too, so to speak. A similar point can be made regarding Dr. Poldrack’s attitude regarding the use of small volume correction. In this paper, he states “Second, I have become increasingly concerned about the use of “small volume corrections” to address the multiple testing problem. The use of a priori masks to constrain statistical testing is perfectly legitimate, but one often gets the feeling that the masks used for small volume correction were chosen after seeing the initial results (perhaps after a whole-brain corrected analysis was not significant). In such a case, any inferences based on these corrections are circular and the statistics are useless”. While this is also true, one wonders whether Dr. Poldrack only trusts his group to use this tool correctly, since it is frequently employed in his papers.
In a third opinion paper (http://www.ncbi.nlm.nih.gov/pubmed/20571517), Dr. Poldrack discusses the problem of circularity in fMRI analyses. While this is also an important topic, Dr. Poldrack’s group has also published papers using circular analyses (e.g. http://www.jneurosci.org/content/27/14/3743.full.pdf, http://www.jneurosci.org/content/26/9/2424, http://www.ncbi.nlm.nih.gov/pubmed/17255512). 
I would like to note that the reason for this comment is not to malign Dr. Poldrack or his research, but rather to attempt to clarify Dr. Poldrack’s opinion of how others should view his previous research when it fails to meet the rigorous standards that he persistently endorses. I am very much in agreement with Dr. Poldrack that rigorous methodology and transparency are important foundations for building a strong science. As a graduate student, it is frustrating to see high-profile scientists such as Dr. Poldrack call for increased methodological rigor by new researchers (typically while, rightfully, labeling work that does not meet methodological standards as being unreliable) when they (1) have benefited (and arguably continue to benefit) from the relatively lower barriers to entry that come from having entered a research field before the emergence of a rigid methodological framework (i.e. in having Neuron/PNAS/Science papers on their CV that would not be published in a low-tier journal today due to their methodological problems), and (2) do not apply the same level of criticism or skepticism to their own previous work as they do to emerging work when it does not meet current standards of rigor or transparency. I would like to know what Dr. Poldrack’s opinions are on these issues. I greatly appreciate any time and/or effort spent reading and/or replying to this comment.

I appreciate these comments; in fact, I have been struggling with exactly these same issues myself, and my realizations about the shortcomings of our past approaches to fMRI analysis have shaken me deeply. Student is exactly right that I have been a coauthor on papers using methods or reporting standards that I now publicly claim to be inappropriate. S/he is also right that my career has benefited substantially from papers published in high-profile journals using these methods that I now claim to be inappropriate.  I'm not going to either defend or denounce the specific papers that the commentator mentions.  I am in agreement that some of my papers in the past used methods or standards that we would now find problematic, but I am actually heartened by that: If we were still satisfied with the same methods that we had been using 15 years ago, then that would suggest that our science had not progressed very far.  Some of those results have been replicated (at least conceptually), which is also heartening, but that's not really a defense.

I also appreciate Student's frustration with the fact that someone like myself can become prominent doing studies that are seemingly lacking according to today's standards, but then criticize the field for doing the same thing.  But at the same time I would ask: Is there a better alternative?  Would you rather that I defended those older techniques just because they were the basis for my career?  Should I lose my position in the field because I followed what we thought were best practices at the time but which turned out to be flawed? Alternatively, should I spend my entire career re-analyzing my old datasets to make sure that my previous claims withstand every new methodological development?  My answer to these questions has been to try to use the best methods I can, and to be as open and transparent as possible.  Here I'd like to outline a few of the ways in which we have tried to do better.

First, I would note that if someone wishes to look back at the data from our previous studies and reanalyze them, almost all of them are available openly through openfmri.org, and in fact some of them have been the basis for previous analyses of reproducibility.  My lab and I have also spent a good deal of time and effort advocating for and supporting data sharing by other labs, because we think that ultimately this is one of the best ways to address questions about reproducibility (as I discussed in the recent piece by Greg Miller in Science).

Second, we have done our best to weed out questionable research practices and p-hacking.  I have become increasingly convinced of the utility of pre-registration, and I am now committed to pre-registering every new study that our lab does (starting with our first registration committed this week).  We are also moving towards the standard use of discovery and validation samples for all of our future studies, to ensure that any results we report are replicable. This is challenging due to the cost of fMRI studies, and it means that we will probably do less science, but that's part of the bargain.
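
To make the discovery/validation idea concrete, here is a minimal sketch, assuming we already have one summary effect estimate per subject (e.g., an ROI contrast value); the numbers, the 50/50 split, and the use of SciPy's one-sample t-test are purely illustrative and not our actual pipeline:

```python
# Minimal sketch of a discovery/validation design (illustrative only).
# Assumes one summary effect estimate per subject, e.g. an ROI contrast value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-subject effect estimates (made up for illustration)
effects = rng.normal(loc=0.3, scale=1.0, size=60)

# Fix the split *before* looking at any results, so the validation half
# provides an independent, confirmatory test of whatever the discovery
# half turns up.
order = rng.permutation(effects.size)
discovery = effects[order[:30]]
validation = effects[order[30:]]

# Exploratory test in the discovery sample
t_disc, p_disc = stats.ttest_1samp(discovery, 0.0)

if p_disc < 0.05:
    # Confirmatory, directional test in the held-out validation sample
    direction = 'greater' if discovery.mean() > 0 else 'less'
    t_val, p_val = stats.ttest_1samp(validation, 0.0, alternative=direction)
    print(f"discovery p={p_disc:.3f}; validation one-sided p={p_val:.3f}")
else:
    print(f"no effect in discovery sample (p={p_disc:.3f}); nothing to validate")
```

The price, of course, is that each test runs on only half the subjects, which is exactly the trade-off mentioned above: less science, but more trustworthy science.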

Third, we have done our best to share everything.  For example, in the MyConnectome study, we shared the entire raw dataset, as well as putting an immense amount of work into sharing a reproducible analysis workflow.  Similarly, we now put all of our analysis code online upon publication, if not earlier.
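
As a purely illustrative example of what "reproducible analysis workflow" means in practice, a minimal entry-point script might record the software environment and random seed and then run every analysis step in a fixed order; the step names and file name below are hypothetical placeholders, not the actual MyConnectome code:

```python
#!/usr/bin/env python
"""Minimal sketch of a reproducible analysis entry point (illustrative only)."""
import json
import platform
import random
import sys

import numpy as np

SEED = 20160722  # fix all randomness so reruns give identical results


def log_environment(path="environment.json"):
    """Record the interpreter, platform, and key package versions with the results."""
    env = {
        "python": sys.version,
        "platform": platform.platform(),
        "numpy": np.__version__,
        "seed": SEED,
    }
    with open(path, "w") as fh:
        json.dump(env, fh, indent=2)


def preprocess():
    """Placeholder for preprocessing of the shared raw data."""
    print("preprocessing...")


def run_models():
    """Placeholder for model fitting / statistical analysis."""
    print("fitting models...")


def make_figures():
    """Placeholder for figure generation."""
    print("making figures...")


if __name__ == "__main__":
    random.seed(SEED)
    np.random.seed(SEED)
    log_environment()
    # Run every step in a fixed, documented order so anyone can regenerate
    # the results from the shared raw data.
    for step in (preprocess, run_models, make_figures):
        step()
```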

None of this is a guarantee, and I'm almost certain that in 20 years, either a very gray (and probably much more crotchety) version of myself or someone else will come along and tell us why the analyses we were doing in 2016 were wrong in some way that seems completely obvious in hindsight.  That's not something that I will get defensive about, because it means that we are progressing as a science.  But it also doesn't mean that we aren't justified in what we are doing now: trying to follow the best practices we know of.


13 comments:

  1. I think the student has some excellent observations, particularly with regard to the fact that science has gotten harder (which is great; it should be hard), but the goalposts haven't moved in terms of getting a job (which is both terrible and hypocritical). Those of us on search committees need to put the new normal into practice by lowering our expectations for the number of publications, and also increasing our level of scrutiny of papers on the shortlist so that papers with less rigorous methods and unreplicated results are downweighted, regardless of impact factor. The students need to feel confident that we have their back. If they are willing to take the high road and come out of a PhD with only 2-3 pubs that are solid exemplars of how science should be done, they need to know that we'll consider that a highly competitive job application relative to someone with 8 pubs that are a bit sloppier.

    We also need to remind ourselves that there is a cost to doing science better. Doing science correctly IS harder, which means more hours in the lab and/or fewer publications. Transparent practices can help offset the cost of better science by improving research efficiency, and we should embrace those improvements. However, they won't completely offset those costs. Let's all be realistic about what we're asking for and expecting.

    ReplyDelete
  2. I think that there's still too much of a divide between methods/opinion papers and papers based on data collection and research. In the latter case there are still a lot of good hypotheses and nice data, but low-standard methods. These studies, by leading researchers and institutions, still contribute to building a suboptimal status quo and set the wrong example.
    I saw a lot of that at the latest OHBM.

    Daniele Marinazzo

    ReplyDelete
  3. I think the Student's letter brings up good points. In many ways it teaches that "dogmas" in science can be just as bad as dogmas in religion. In other words, bad vs. good science is decided not by dogmatic adherence to a statistical approach but rather by what stands the test of time and what fails. I am not at all convinced that the "dogmas of good science" will improve the science. Instead, an evolutionary approach should decide what withstands the test of time and what fails. In many ways, what this field needs is more attention given to reports of negative findings, where sufficiently powered samples were used to show that one or more previous findings were not replicable, thus setting the precedent for striking them out.

    ReplyDelete
  4. Dr. Poldrack, thank you for taking the time to address my letter. I understand that science is in a constant state of progress, and so it is not reasonable to expect that any study will be "perfect" or 100% compliant with contemporary standards in terms of methodology/design/analyses/etc. in part because standards change as the field progresses and in part because doing science is an imperfect art. I think that even poorly done/methodologically deficient studies can provide important insights and sometimes uncover robust and important phenomena, but also agree that they are more prone to biases and artifacts that may lead to incorrect conclusions. I am at the early stages of my career, and I am sure that I will also one day look back on studies that I have done and wish I would have done things differently or "more optimally" (to be honest, I feel this way already sometimes). I understand that we (as scientists) are all playing a game where not only do the methodological/analytical rules change over time, but the evaluative rules and scoring work against what, in my opinion, should be our ultimate goal, which is to develop robust theories that robustly explain and predict natural (in our case, neural and by extension behavioral) phenomena. So, I understand that there is no clear answer regarding "what to do" about older studies that may not be as reliable as they were thought to be when they were published. However, I know that this is something that many people wonder about in lab meetings and private conversations, and think that public discussion about the difficulties of dealing with such things is helpful to moving the field forward.

    Regarding some of the comments, I do think that one major obstacle facing early career researchers is navigating the perception that many things are "facts" and not worth "re-studying" because the results (or just a specific interpretation of the results) of a single small study have been cited so frequently throughout the literature that they are not questioned (as a note, there are several such "citation trails" that I have personally followed only to discover that no one ever actually found the thing that the citations are provided to support). I think that this, in concert with (1) the moving of career/grant goalposts in a way that ultimately makes reaching them more difficult without cutting corners, overselling one's results, or focusing on being the first to do something using a "hot" new technique, (2) a focus on developing a literature of novel findings and compelling narratives, and (3) a reluctance to dedicate resources to scrutinizing/verifying long-held dogmas are at least as much of a threat to developing a deep understanding of neurobiology and behavior as deficient methods.

    I am grateful to Dr. Poldrack and other commenters for entertaining my ideas and observations and responding to them in an understanding and respectful way. I was worried after I wrote my initial letter that it would be perceived as disrespectful or accusatory, and I am happy that it does not seem to have been the case.

    ReplyDelete
    Replies
    1. All very good points. I think it's particularly good to point out that even a study with horrible methods is not necessarily false. Our 1999 NeuroImage paper on semantic and phonological processing in the inferior frontal cortex (also my first fMRI study) used methods that we would now find abysmal (eight subjects, fixed effects analysis, weak ad hoc correction for multiple comparisons, manual registration of slices to the Talairach atlas), but the results inspired a meta-analysis that confirmed the result, and subsequent studies have also confirmed it. Then again, even a broken clock is right twice a day, and it's impossible to know whether we just got lucky on this one - mostly this just highlights that we really should not be concluding much at all on the basis of any single paper. Unfortunately, there is not enough replication in our field for us to really know how many findings are replicable, so the main way we can achieve cumulative knowledge is through tools like Neurosynth and Neurovault.

      Delete
    2. This is very true. I don't think that a study with horrible methods should be the basis for strong conclusions (and I think everyone agrees with that), but I am aware of many such studies that nonetheless uncovered reliable and important effects. I would also note that the attention paid to describing the how and why of the relevant psychological/behavioral manipulations in papers such as that one is noticeably greater than in many similarly oriented contemporary studies.

      Delete
  5. Also, I would like to thank Dr. Poldrack for his efforts to increase transparency and data-sharing regarding his own research. The dedication of time and resources to things such as these, despite their lack of what might be called career-boosting potential, is something I greatly respect and appreciate.

    ReplyDelete
    Replies
    1. many thanks! I'm glad we have been able to have such a productive discussion over this topic.

      Delete
  6. I am also guilty of having published many papers in the past which were not up to the standards we now expect in terms of the number of subjects, anatomical methods and statistical rigour. But that doesn't mean that the results were false, only that they need replication. And indeed we repeated some of the important studies to check if they could be replicated. For example, we did a motor sequence learning study using PET in 1994, repeated it with PET a few years later, then repeated it again with fMRI. The last study used 3 subjects (!!!) because at the time our computers could only handle continuous acquisition for 45 minutes with 3 subjects. We then did a related study using PET, but with rhythm learning. Then Floyer-Lea and Matthews did a very similar motor sequence study (a sequence of forces) and, lo and behold, found much the same results.

    This is not an argument for inadequate studies, only an argument that if you believe in your results (on inadequate data), it is in your own hands to check them yourself with repeated studies.

    ReplyDelete
  7. I got into neuroimaging around the same time many of the publications linked in Student's comments came out, and I also look back on the "state of the art" at that time with a mix of bemusement and horror. I expect to do the same in another 10 years with the methods I use now. In an odd way this is a sign of immense progress. When you compare the statistical standards of fields like genomics to neuroimaging, we have a long way to go - but we have also made incredible iterative progress since the early 90s. I think this is true of nearly any "new" (<30 year old) field.

    I think the biggest adjustment that remains is, as others have pointed out, for search committees to appropriately weight quality vs. quantity in publications for neuroimagers. That might include crediting "rigor and reproducibility" (as NIH has recently begun to emphasize), for example work that takes longer and produces fewer papers in order to include the collection and analysis of replication samples.

    ReplyDelete
  8. I am all for advocating rigour in science. But don't let's forget that in the end what matters is how important the result is. We should be focussing on creativity at the same time as experimental care, and creativity involves exploration. So don't let's be so hung up on rigour that we cease to explore.

    ReplyDelete
    Replies
    1. I think that's a very important point, Dick. John Ioannidis made the point strikingly in http://onlinelibrary.wiley.com/doi/10.1111/add.12720/full: "I would loathe seeing a ‘perfect’ scientific literature where everything is pre-registered, all checklist items are checked and papers are written by robotic automata before the research is conducted, but no real progress is made. We need to find ways to improve science without destroying it." One problem is that the current publication system forces people to present their results as if they were hypothesis-driven when really they were exploratory (i.e. HARKing).

      Delete
    2. Indeed. It is so easy to get caught up in "the game" that we forget our purpose. My colleague and I have been rather fortunate to have made some striking observations while analyzing standard rsfMRI data because we followed our noses, so to speak. The fact that you can discover something new in an experiment conducted numerous times, if you look, does bring a smile to my face...

      which disappears quickly enough when thinking about the publication process ...

      Delete