Saturday, May 4, 2013

Crowdsourcing my next fMRI task

For those of you who have been following my self-tracking study, I am happy to announce that we are once again collecting data as of April 30.  I had a followup audiometry exam this week which showed that the poor reading in early March appears to have been a fluke, and that my hearing does not seem to have changed since the beginning of the study.

We have decided to make a number of changes to the MRI protocol based on our analyses of the first 6 months of data. In particular, we are going to reduce the frequency of the structural and DTI acquisitions (because, as expected, we don't see much in the way of changes over time).  In addition, after acquisition of 14 sessions on a working memory task, I have decided to reduce its frequency and add another task fMRI paradigm that will be collected once a week.

My question for you is: What should I do next?  I have several things in mind, but I want to see if the community can do even better.  If you have an idea for a task fMRI paradigm where a large number of sessions from a single individual would be useful, please email me (poldrack at utexas dot edu) with a brief proposal that outlines specifically why it would be interesting to perform the task repeatedly on a single person and how many sessions it would require.  I will work with the winner to implement the task, and that person will of course be offered coauthorship on any papers that include those data.  There are two requirements:

  • Each session of the task must be accomplished within a single 10-minute imaging run.
  • The task cannot require any special response or stimulation devices (for stimulation we have video projection and headphones; for response, we have several button boxes available as well as eye tracking, noise-cancelling microphone, and physiological monitoring).
I look forward to hearing your ideas! 



Friday, May 3, 2013

The dimensional approach to studying mental illness

NIMH Director Tom Insel turned a lot of heads recently with his announcement that NIMH-funded research will no longer focus on DSM-based diagnostic categories, and will instead focus on understanding the dimensions of mental function that underlie mental health disorders and often cut across multiple diagnoses.  This new focus will be centered on the Research Domain Criteria (RDoC), a set of psychological constructs thought to represent the dimensions most relevant to mental health disorders. This is great news, as it has become clear (particularly from recent genetic research) that diagnostic categories, while statistically reliable, do not have a great deal of biological reality. Dr. Insel lays out the problem nicely:

Unlike our definitions of ischemic heart disease, lymphoma, or AIDS, the DSM diagnoses are based on a consensus about clusters of clinical symptoms, not any objective laboratory measure. In the rest of medicine, this would be equivalent to creating diagnostic systems based on the nature of chest pain or the quality of fever. Indeed, symptom-based diagnosis, once common in other areas of medicine, has been largely replaced in the past half century as we have understood that symptoms alone rarely indicate the best choice of treatment.

We have long been interested in this dimensional approach; in particular, it was the basis for the Consortium for Neuropsychiatric Phenomics developed by my colleague Robert Bilder at UCLA several years ago, in which we are examining two specific domains of mental function (memory and executive function) across both healthy individuals and people with several different psychiatric diagnoses (schizophrenia, bipolar disorder, and ADHD).  This work is still ongoing but should start to bear fruit in the next year as the neuroimaging and genetic data analyses are completed.  It's great news that more work like this will be funded in the future.

Mining for dimensions

Genetic analyses have been able to take advantage of large numbers of subjects to identify the genetic overlap between different psychiatric diagnostic groups.  Unfortunately we don't have those kinds of datasets (yet) for neuroimaging studies.  In a paper published in PLOS Computational Biology last year, we asked whether we could use the Neurosynth meta-analytic database as a proxy for such data, in order to see whether we could identify how different psychiatric disorders cluster together in terms of brain function.  We first used topic modeling to identify groupings of terms related to psychiatric and neurological disorders in the full text of a set of about 4,400 published papers.  In some cases these were specific to single disorders, but in many cases they reflected overlapping disorders (such as alcoholism and antisocial personality disorder, or schizophrenia and bipolar disorder).  We then created "topic maps" for each of these by looking at which regions in the brain showed activity that was correlated with the presence of each topic in each paper (using the activation coordinates automatically extracted by Neurosynth).  Here are some examples of these topic maps:



We were then able to ask the fundamental question that is raised by the dimensional approach: How are different diagnoses related in terms of their neural underpinnings?  It's important to keep in mind that we are not directly examining data from different diagnostic groups; instead, we are examining the neuroimaging data that are associated with the presence of those diagnostic labels across papers.  Nonetheless, the data have the potential to give us insight into how different disorders relate with regard to neural activity patterns observed in the literature.  Here is what the clustering looked like across disorders:


(the abbreviations are: APH: aphasia, DLX: dyslexia, SLI: specific language impairment, DA: drug abuse, AD: Alzheimer's disease, DEP: depressive disorder, MDD: major depressive disorder, ANX: anxiety disorder, PAN: panic disorder, BPD: bipolar disorder, CD: conduct disorder, GAM: gambling, MD: mood disorder, PD: Parkinson's disease, OCD: obsessive compulsive disorder, PHO: phobia, EAT: eating disorder, SZ: schizophrenia, OBE: obesity, COC: cocaine related disorder, PSY: psychotic disorder, PAR: paranoid disorder, SZTY: schizotypal personality disorder, TIC: tic disorder, ALC: alcoholism, ALX: alexia, ADD: attention deficit disorder, AMN: amnesia, AUT: autism, ASP: Asperger syndrome.)

This clustering is very interesting in that it shows that there are four major branches of mental health disorders that can be identified on the basis of brain activity coordinates from published papers: language disorders (green), mood/anxiety disorders and drug abuse (orange), psychotic and externalizing disorders (yellow), and autism and memory disorders (purple).  The fact that this worked with meta-analytic data gives us great hope that once this approach is applied to subject-level data, it will provide a great deal of power to identify the dimensions along which psychiatric disorders vary.
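To give a concrete sense of the final step, here is a minimal sketch of hierarchically clustering a set of disorder topic maps.  The data below are random stand-ins, the variable names are invented, and the distance metric and linkage choice are illustrative rather than a description of the exact method used in the paper.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# stand-in data: 30 disorder topics x 1000 voxels of random values; in the real
# analysis each row would be a Neurosynth-derived topic map
topic_names = ['topic%02d' % i for i in range(30)]
topic_maps = np.random.randn(30, 1000)

# distance between topics = 1 - spatial correlation of their maps
dist = pdist(topic_maps, metric='correlation')

# average-linkage hierarchical clustering, plotted as a dendrogram analogous in
# spirit to the disorder clustering shown above
Z = linkage(dist, method='average')
dendrogram(Z, labels=topic_names)
plt.show()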

How reliable are the dimensional descriptions?

The move towards RDoC as a basis for mental health research also raises the question of reliability.  The strength of the DSM was that its diagnoses were reliable, in the sense that two different psychiatrists, shown the same symptoms, would be highly likely to make the same diagnosis.  How will we ensure that the mappings from behavior to psychological dimensions are equally reliable?  This is a problem that we have long struggled with, and it was a particular driver for the development of our Cognitive Atlas project.  This project aims to provide a more formal basis for the many psychological constructs that are used by researchers but rarely defined in an explicit way.  It makes a clear distinction between mental concepts (i.e., the unobservable psychological constructs that researchers want to measure) and tasks (i.e., the behavioral tests used to measure those constructs), and most importantly, it provides a way to specify the relations between those two levels.  For example, here is the page representing the concept of "response inhibition":



This page provides a specific definition of the concept along with links to many different tasks that are thought to measure the concept, such as the stop signal task:
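To make the concept/task distinction concrete, here is a toy sketch of how a concept, a task, and the relation between them might be represented (made-up field names and a deliberately simplified structure, not the actual Cognitive Atlas schema or API):

# hypothetical, simplified representation of Cognitive Atlas-style entries
response_inhibition = {
    'type': 'concept',
    'name': 'response inhibition',
    'definition': '...',  # the Atlas stores an explicit definition here
}

stop_signal_task = {
    'type': 'task',
    'name': 'stop signal task',
}

# the key piece: an explicit assertion relating the two levels
relations = [
    {'task': 'stop signal task',
     'relation': 'is thought to measure',
     'concept': 'response inhibition'},
]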


Recently we added a new function to the Cognitive Atlas, called Collections.  A collection is a set of concepts and relations that together form a larger theoretical framework; it can also be used to identify sets of tasks that are included together in task batteries.  We have started to implement the various RDoC frameworks as Collections; for example, the Working Memory Matrix:



There is much more work to do, and we hope that researchers interested in RDoC will help by adding their favorite tasks and concepts to the appropriate matrices.  In the long run we are hopeful that formal frameworks like the Cognitive Atlas will come to play the same role for dimensionally-driven research that the DSM has played for diagnosis-driven research, providing a basis for reliable specification of the relations between psychological dimensions and behavioral tasks.



Friday, April 19, 2013

On "healthy choices"


This week, I received an email describing new nutritional recommendations at one of the campus eateries:
Please come by O's Campus Cafe over the next few weeks to see how they are making healthy eating easier!
O's Campus Cafe has analyzed their menu and labeled their foods in order to make it easier for you to make the healthiest choices.  All foods have been given a red, yellow, or green rating based on how nutrient dense they are. Foods that have received a "Red" rating have very little positive nutrient content, are higher in unhealthy nutrients, and should be consumed minimally.  Foods that have received a "Yellow" rating have some positive nutrient content, some unhealthy nutrients, and should be consumed less often.  Foods that have received a "Green" rating are high in positive nutrient content, have little to no unhealthy nutrients, and can be consumed frequently.  The healthy nutrients evaluated in developing these ratings are protein, fiber, unsaturated fat, vitamins A, C, E, B-12, thiamin, riboflavin, and folate, and minerals calcium, magnesium, iron, and potassium.  The nutrients considered unhealthy are saturated fat, cholesterol, and sodium.
I would hope that guidelines regarding "healthy" choices at a major research university would be backed up with evidence, but it turns out that they were simply based on an off-the-shelf rating system (the Nutrient Rich Food Index). To see whether they are supported by actual evidence, I used PubMed to look for relevant results from randomized controlled trials; I avoided epidemiological studies given all of the difficulties in making solid conclusions from nutritional epidemiology research (e.g. http://www.nytimes.com/2007/09/16/magazine/16epidemiology-t.html). Here is what I found regarding the supposedly "unhealthy" nutrients:
  • Reduction of saturated fat intake was associated with reduced cardiovascular events but not with any changes in all-cause or cardiovascular mortality (Hooper et al., 2012). A nice review of lipid metabolism by Lands (2008) highlights how the search for mechanisms by which saturated fats might be harmful has failed: "Fifty years later [after starting his research in the 1960's], I still cannot cite a definite mechanism or mediator by which saturated fat is shown to kill people."

  • A review of the effects of dietary sodium reduction in randomized controlled trials showed "no strong evidence of any effect of salt reduction" on all-cause mortality (Taylor et al., 2011).
  • I searched deeply on PubMed but was unable to find any systematic reviews of the effects of dietary cholesterol reduction on mortality (please let me know if you know of any); there are lots of studies looking at cholesterol reduction using statins, but that doesn't really inform the question of dietary cholesterol reduction.  However, given that serum cholesterol levels bear very little relation to dietary cholesterol intake (Gertler et al., 1950), the labeling of dietary cholesterol as "unhealthy" seems unsupported by evidence. 

The first two reviews listed above come from the Cochrane Collaboration, which is generally viewed as a highly objective source for systematic reviews in evidence-based medicine.

What about the supposed "healthy" choices?

I also looked for data relevant to the health benefits of the supposedly healthy choices:
  • A recent report from a randomized controlled trial showed that replacement of saturated fats with omega-6 linoleic acid (an unsaturated, and thus "healthy," fat) was associated with increased, rather than decreased, all-cause mortality (Ramsden et al., 2013). This makes sense given the association of omega-6 fatty acids with inflammation (e.g. Lands, 2012) and the powerful role that inflammation plays in many diseases.

  • Increased fiber intake in the DART (Diet and Reinfarction Trial) of individuals who had already suffered a heart attack had no effect on all-cause mortality after 10 years (Ness et al., 2002).  An earlier Cochrane review also showed no effects of fiber intake on incidence or recurrence of colon cancer (Asano & McLeod, 2002).

I don't claim to be an expert in this domain, but the best evidence available seems to point towards the lack of any scientific validity of these "health" labels.  

Isn't it better than nothing?

One might ask whether it's still better to have some guidance that pushes people to try to eat more healthily, even if it's not supported by science.  If the recommendations were made with full disclosure of the lack of scientific evidence (à la the disclaimer on dietary supplements), then I would be OK with that. Instead, the recommendations are made in the context of a "nutritionism" that pretends that we will be healthy if we just get the right mix of healthy nutrients and avoid the bad ones, and also pretends to know which are which.  I personally would rather have no advice than unsupported pseudoadvice.

At the same time, it's clear from the obesity epidemic that people need help in making better food choices, so what should we do instead?  I would propose implementing a simple rating system based on the first of Michael Pollan's food rules: "Eat food".  Anything that has been freshly picked, harvested, or slaughtered would get a "healthy" rating and anything that comes from a box, can, or bag would get an "unhealthy" rating. It would be much easier than computing micronutrient values, and I challenge anyone to show that it would be any less objectively healthy than the proposed "healthy choices," which would label a fresh grass-fed steak as less healthy than a pile of French fries fried in corn oil.

Saturday, April 6, 2013

How well can we predict future criminal acts from fMRI data?


A paper recently published in PNAS by Aharoni et al., entitled "Neuroprediction of future rearrest," claims to demonstrate that future criminal acts can be predicted using fMRI data.  In the study, the group performed fMRI on 96 individuals who had previously been incarcerated, using a go/no-go task.  They then followed up the individuals (up to four years after release) and recorded whether they had been rearrested.  A survival model was used to model the likelihood of rearrest, which showed that activation in the dorsal anterior cingulate cortex (dACC) during the go/no-go task was associated with rearrest, such that individuals with higher levels of dACC activity during the task were less likely to be rearrested.  This fits with the idea that the dACC is involved in cognitive control, and that cognitive control is important for controlling impulses that might land one back in jail.  For example, using a median split of dACC activity, they found that the upper half had a rearrest rate of 46% while the lower half had a rearrest rate of 60%.  Survival models also showed that dACC was the only variable amongst a number tested that had a significant relation to rearrest.

This is a very impressive study, made even more so by the fact that the authors released the data for the tested variables (in spreadsheet form) with the paper.  However, there is one critical shortcoming to the analyses reported in the paper, which is that they do not examine out-of-sample predictive accuracy.  As I have pointed out recently, statistical relationships within a sample generally provide an overly optimistic estimate of the ability to generalize to new samples.  In order to be able to claim that one can "predict" in a real-world sense, one has to validate the predictive accuracy of the technique on out-of-sample data.

With the help of Jeanette Mumford (my local statistical guru), I took the data from the Aharoni paper and examined the ability to predict rearrest on out-of-sample data using crossvalidation; the code and data for this analysis are available at https://github.com/poldrack/criminalprediction.  The proper way to model the data is using a survival model that can deal with censored observations (since subjects differed in how long they were followed).  We did this in R using the Cox regression model from the R rms library.  We replicated the reported finding of a significant effect of dACC activation on rearrest in the Cox model, with parameter estimates matching those reported in the paper, suggesting to me that we had correctly replicated their analysis.  

We examined predictive accuracy using the pec library for R, which generates out-of-sample prediction error curves for survival models.  We used 10-fold crossvalidation to estimate the prediction error, and ran this 100 times to assess the variability of the prediction error estimates. The figure below shows the prediction error as a function of time for the reference model (which simply estimates a single survival curve for the whole group) in black, and the model including dACC activation as a predictor in green; the thick lines represent the mean prediction error across the 100 crossvalidation runs, and the light lines represent the curve for each individual run.  



This analysis shows that there is a slight benefit to out-of-sample prediction of future rearrest using dACC activation, particularly in the period from 20 to 48 months after release.  However, this added prediction ability is exceedingly small; if we take the integrated Brier score across the period of 0-48 months, which is a metric for assessment of probabilistic predictions (taking the value of 0 for perfect predictions and 1 for completely inaccurate predictions), we see that the score for the reference model is 0.214 and the score for the model with dACC as a predictor is 0.207. We found slightly improved prediction (integrated Brier score of 0.203) if we also added Age alongside dACC as a predictor.  
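For readers who want to experiment with this kind of analysis in Python rather than R, here is a rough sketch using the lifelines package on simulated data.  This is not the code from our repo (which uses the rms and pec packages in R), the data below are simulated rather than taken from the released spreadsheet, and the crossvalidation helper scores held-out folds rather than producing prediction error curves, so treat it as illustrative only.

import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import k_fold_cross_validation

# simulate a dataset loosely shaped like the real one: 96 subjects, a dACC
# measure, and time to rearrest censored at 48 months (higher dACC -> longer
# time to rearrest, matching the direction of the reported effect)
rng = np.random.RandomState(0)
n = 96
dacc = rng.randn(n)
months = rng.exponential(scale=40 * np.exp(0.3 * dacc))
rearrest = (months < 48).astype(int)
months = np.minimum(months, 48)
df = pd.DataFrame({'months': months, 'rearrest': rearrest, 'dACC': dacc})

# within-sample Cox proportional hazards fit
cph = CoxPHFitter()
cph.fit(df, duration_col='months', event_col='rearrest')
cph.print_summary()

# out-of-sample evaluation: 10-fold crossvalidation, one score per held-out fold
scores = k_fold_cross_validation(CoxPHFitter(), df, duration_col='months',
                                 event_col='rearrest', k=10)
print(scores)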

The take-away message from this analysis is that fMRI can indeed provide information relevant to whether an individual will be rearrested for a crime.  However, this added predictability is exceedingly small, and we don't know whether there are other (unmeasured) demographic or behavioral measures that might provide similar predictive power.  In addition, these analyses highlight the importance of using out-of-sample prediction analyses whenever one makes a claim about the predictive ability of neuroimaging data for any outcome.  We are currently preparing a manuscript that will address the issue of "neuroprediction" in greater detail.  

Wednesday, March 13, 2013

My adventures in self-quantification


In September 2012 I began a project to characterize how my own brain function and metabolism fluctuate over the course of an entire year.  This has involved MRI scans three times a week along with blood draws once a week and daily tracking of a large set of potentially interesting variables.  This post is the first installment in my story about the experience.

First, the motivation.  In the last couple of years I have become very interested in understanding the dynamics of brain function over a days-to-months timescale and how they relate to cognitive function and bodily metabolism.  This has been spurred by a number of influences, including my growing interest in nutrition and its relation to brain function, as well as my ongoing desire to better understand psychiatric disorders.  Once I started thinking about the issue, it became very clear that there were basically no data in existence that provided any insight into how the function of an individual's brain fluctuates over such a relatively long time course. This is probably not surprising, because doing studies with volunteers that require repeated testing over a long period of time is very challenging.

At some point in 2011 it dawned on me that I should try to bootstrap such a study by collecting data from myself.  There were several inspirations for this idea.  First was Michael Snyder's "integrated personal omics" study, published in Cell in 2012, in which he repeatedly collected blood from himself and performed a broad set of "omics" measures on his samples, which provided some interesting insights into the temporal dynamics of metabolic function.  Second was my interaction with Laurie Frick, who is the artist-in-residence at the UT Imaging Research Center.  Laurie's work is based on patterns that she finds in data obtained by self-tracking, and she is deeply enmeshed in the Quantified Self movement (see her excellent TEDx talk).  Talking to her got me increasingly interested in tracking a broader set of data about myself. The very fun book Smoking Ears and Screaming Teeth also convinced me that self-experimentation is not (completely) crazy, and actually has long been an important tool for scientific discovery.

In early 2012 I began hatching a plan to collect a broad set of data on myself.  It was essential that all aspects of data collection be as consistent as possible in order to minimize extraneous variability in the data (such as time-of-day effects).  I ended up settling on a schedule of three MRI scanning sessions a week, at consistent times of day and days of the week (one afternoon and two mornings every week).  Each of the MRI scanning sessions includes a resting state fMRI scan, which will allow us to assess how functional connectivity between brain regions fluctuates over time.  In addition, once a week I perform other scans, including structural MRI (T1- and T2-weighted), diffusion tensor MRI (to assess white matter connectivity), and task fMRI (using a working memory task with faces, scenes, and Chinese characters).

I also wanted to collect biological samples in order to measure the relation between bodily metabolism and brain function.  Working with some molecular biologists here at UT (along with helpful input from the Snyder lab at Stanford), we developed a protocol in which I have 20 ml of blood drawn once a week (while fasting, immediately after one of the morning MRI scans). This sample is then processed to extract RNA, white blood cells, and plasma, all of which are frozen for later analysis.  This will let us examine many different aspects of metabolism, including gene expression (via RNA sequencing), metabolomic and proteomic analyses, and other potential analyses to relate metabolism to brain function.

Finally, I realized that the dataset would be most useful if I also collected as much data as possible about my daily life activities.  Working with Zack Simpson, we developed a self-tracking app using the Appsoma framework, which allows me to easily complete surveys every morning and evening and after every MRI scan.  These data are automatically fed into a web database which is the central repository for all of the self-tracking data in the study other than MRI and biological analyses.  Some of the things that I track daily include:
- blood pressure and weight (using a FitBit Aria wireless scale)
- foods eaten, alcohol intake, and supplements/medicines taken
- exercise, time spent outdoors, and physical soreness
- a free-text log of daily events
- sleep quality (assessed both by subjective report and using a ZEO sleep monitor)
After every scan I also complete a mood questionnaire and provide a structured report of what I was thinking about during the resting state fMRI scan.  I should note that other than the addition of all of these tracking activities, I have done my best to keep my life as consistent as possible and have avoided any other major lifestyle changes.

With this plan in place, we began data collection on September 25, 2012.  We treated the first month as a pilot period, and made some changes to the imaging protocol to optimize data collection, beginning the production period on October 22, 2012.  In total so far we have collected 20 blood samples and 55 MRI scanning sessions.  Members of the research team have started analyzing the data, though I have made every effort not to expose myself to the results of any analyses that examine changes over time, because I don't want the results to feed back and change my behavior.  

When I describe this study, many people ask if I am worried about being exposed to MRI scanning so often.  My answer has been "no", at least not with regard to the magnetic fields involved in MRI; there is no evidence of lasting effects of MRI exposure (though of course we can't ever prove that something is safe).  However, soon after the study began it became clear that there was a side effect that I had to worry about, which is the intense noise of the MRI scanner.  I have long suffered from tinnitus (which I attribute to too many loud rock shows as a youngster without ear plugs), and within the first two weeks of scanning I noticed that my tinnitus was increasing.  For this reason, I went to the UT Speech and Hearing Center and had my hearing tested.  I had never had my hearing tested as an adult, but I was not terribly shocked to find out that I had quite significant high frequency hearing loss.  Because I don't want to damage my hearing any further, I have continued to get tested each month.  The results had been fairly stable until early March, when they showed a slight worsening at 6000 Hz (consistent with a subjective increase in tinnitus around the same time).  For this reason, I am taking the month of March off from scanning, and will have my hearing re-tested at the beginning of April before resuming the scans.

We will also make some changes to the MRI protocol to reduce scanner noise.  We have an OptoAcoustics noise canceling headphone system in place at our imaging center that works quite well to reduce the noise of the functional MRI scans, so those can continue without much danger.  However, we will likely discontinue some of the other scans (such as gradient field maps, which are useful but not necessary) and greatly reduce the frequency of others that we don't expect to change much over time (including the anatomical and diffusion scans), because those scans are not compatible with the noise cancellation system.  I am hopeful that with these changes I can continue scanning without danger of further hearing damage, while still collecting a very useful dataset.  

Assuming that I am able to continue scanning, I plan to collect 50 weeks' worth of usable data, which should provide sufficient power for an initial set of analyses.  This will likely take through the end of 2013 due to travel and other events that will interfere with data collection during some weeks.  Once we have completed our initial set of analyses, nearly all of the data will be made available to other researchers, which I hope will help spur new analyses.  I'll keep you all posted as the study moves along.

Wednesday, February 20, 2013

Anatomy of a coding error

A few days ago, one of the students I collaborate with found a very serious mistake in some code that I had written.  The code (which is openly available through my github repo) performed a classification analysis using the data from a number of studies from the openfmri project, and the results are included in a paper that is currently under review.  None of us likes to admit mistakes, but it's clear that they happen often, and the only way to learn from them is to talk about them. This is why I strongly encourage my students to tell me about their mistakes and discuss them in our lab meeting.  This particular mistake highlights several important points:
  1. Sharing code is good, but only if someone else actually looks at it very closely.
  2. You can't rely on tools to fail when you make a mistake.
  3. Classifiers are very good at finding information, even if it's not the information you had in mind.

The code in question is 4_classify_wholebrain.py, which reads in the processed data (saved in a numpy file) and classifies each of the 400 observations (each with about 184K voxel features) into one of 23 different classes (representing different tasks). The code was made publicly available before submitting the paper; while I have no way of knowing whether the reviewers have examined it, it's fair to say that even if they did, they would most likely not have caught this particular bug unless they were very eagle-eyed.  As it happens, a student here was trying to reproduce my analyses independently, and was finding much lower classification accuracies than the ones I had reported.  As he dug into my code, it became clear that this difference was driven by a (lazy, in hindsight) coding mistake on my part.

The original code can be viewed here - the snippet in question (cleaned up a bit) is:

import numpy as N
from sklearn.cross_validation import StratifiedKFold  # sklearn's older API, as used at the time
from sklearn.svm import LinearSVC

# set up 8-fold stratified crossvalidation over the task labels
skf=StratifiedKFold(labels,8)

if trainsvm:
    pred=N.zeros(len(labels))
    for train,test in skf:
        # fit a linear SVM on the training folds and predict the held-out fold
        clf=LinearSVC()
        clf.fit(data[train],labels[train])
        pred[test]=clf.predict(data[test])


Pretty simple - it creates a crossvalidation object using sklearn, then loops through, fitting to the train folds and computing the predicted class for the test fold.  Running this, I got about 93% test accuracy on the multiclass problem; had I gotten 100% accuracy I would have been sure that there was a problem, but given that we have previously gotten around 80% for similar problems, I was not terribly shocked by the high accuracy. Here is the problem:

In [9]: data.shape
Out[9]: (182609, 400)

When I put the data into the numpy object, I had voxels as the first dimension, whereas for classification analysis one would usually put the observations in rows rather than columns.  Now, numpy is smart enough that when I give it the train list as an array index, it uses it as an index on the first dimension.  However, because of the transposition of the dimensions in the data, the effect was to classify voxels, rather than subjects:

In [10]: data[train].shape
Out[10]: (350, 400)

In [11]: data[test].shape
Out[11]: (50, 400)

When I fix this by using the proper data reference (as in the current revision of the code on the repo), then it looks as it should (i.e. all voxels included for the subjects in the train or test folds):

In [12]: data[:,train].T.shape
Out[12]: (350, 182609)


In [14]: data[:,test].T.shape
Out[14]: (50, 182609)
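For completeness, the corrected crossvalidation loop looks roughly like this (a sketch of the fix; the actual updated code lives in the current revision on the repo and may differ in details):

if trainsvm:
    pred=N.zeros(len(labels))
    for train,test in skf:
        clf=LinearSVC()
        # select observations (columns) rather than voxels (rows), then transpose
        # so the classifier sees an (observations x voxels) matrix
        clf.fit(data[:,train].T,labels[train])
        pred[test]=clf.predict(data[:,test].T)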

When I run this with the fixed code I get about 53% accuracy; still well above chance (remember that it's a 23-class problem), but much less than the 93% we had gotten previously.

It's worth noting that randomization tests with the flawed code showed the expected null distribution.  The source of the information being used by the classifier is a bit of a mystery, but it likely reflects the fact that the distance between voxels in the data matrix is related to their distance in space in the brain, and that the labels were grouped together sequentially in the label file; the labels were thus correlated with physical location in the brain, which provided information that could drive the classification.

This is clearly a worst-case scenario for anyone who codes up their own analyses; the paper has already been submitted and you find an error that greatly changes the results. Fortunately, the exact level of classification accuracy is not central to the paper in question, but it's worrisome nonetheless.

What are the lessons to be learned here?  Most concretely, it's important to check the size of data structures whenever you are slicing arrays.  I was lazy in my coding of the crossvalidation loop, and I should have checked that the size of the dataset being fed into the classifier was what I expected it to be (the difference between 400 and 182609 would be pretty obvious).  It might have added an extra 30 seconds to my initial coding time but would have saved me from a huge headache and hours of time needed to rerun all of the analyses.

Second, sharing code is necessary but not sufficient for finding problems.  Someone could have grabbed my code and gotten exactly the same results that I got; only if they looked at the shape of the sliced arrays would they have noticed a problem.  I am becoming increasingly convinced that if you really want to believe a computational result, the strongest way to do that is to have an independent person try to replicate it without using your shared code.  Failing that, one really wants a validation dataset that can be fed into the program for which the correct output is known exactly; randomizing the labels is one way of doing this (i.e., the outcome should be at chance), but you also want to do it with real signal.  Unfortunately this is not trivial for the kinds of analyses that we do, but perhaps some better data simulators would help make it easier.

Finally, there is a meta-point about talking openly about these kinds of errors. We know that they happen all the time, yet few people ever talk openly about their errors.  I hope that others will take my lead in talking openly about errors they have made so that people can learn from them and be more motivated to spend the extra time to write robust code.



Wednesday, January 16, 2013

Is reverse inference a fallacy? A comment on Hutzler

A new paper by Florian Hutzler, published online in NeuroImage, claims to show that reverse inference is not as problematic as I have claimed in my previous publications (TICS, 2006; Neuron, 2010).  I had previously reviewed this paper for another journal (I signed my review, so this is not a surprise), and I'm happy to see that some of my concerns about the paper were addressed in the version that was published in NeuroImage.  However, I still have one major concern about the general framing of the paper.

I would first like to be clear about what I said about reverse inference in my 2006 paper:

"It is crucial to note that this kind of ‘reverse inference’ is not deductively valid, but rather reflects the logical fallacy of affirming the consequent...However, cognitive neuroscience is generally interested in a mechanistic understanding of the neural processes that support cognition rather than the formulation of deductive laws. To this end, reverse inference might be useful in the discovery of interesting new facts about the underlying mechanisms. Indeed, philosophers have argued that this kind of reasoning (termed ‘abductive inference’ by Pierce [8]), is an essential tool for scientific discovery [9]."
Thus, while I did point out the degree to which reverse inference reflects a fallacy under deductive logic, I also pointed out that it could be potentially useful under other forms of reasoning; it's a bit of a stretch to go from this statement to using the term "reverse inference fallacy," which has started to pervade peer reviews.  This is unfortunate in my view, if only because authors must often think that I am the culprit!  (I assure you all that I would never use this phrase in a review.) The potential utility of reverse inference has been further cashed out in the Neurosynth project (Yarkoni et al., 2011).  I say all of this just to highlight the fact that I have never painted reverse inference as wholly fallacious, but rather have tried to highlight ways in which its limited utility can be quantified (e.g. through meta-analysis) or its power improved (e.g., through the use of machine learning methods).
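For readers who haven't seen that Bayesian framing, here is a toy version of the reverse inference calculation along the lines laid out in the 2006 paper, with made-up numbers purely for illustration:

# toy numbers, purely illustrative: how strongly does activation in a region
# support the presence of a particular cognitive process?
p_act_given_proc = 0.80      # P(activation | process engaged)
p_act_given_noproc = 0.30    # P(activation | process not engaged)
p_proc = 0.50                # prior probability that the process is engaged

# Bayes' rule: P(process engaged | activation)
p_act = p_act_given_proc * p_proc + p_act_given_noproc * (1 - p_proc)
p_proc_given_act = p_act_given_proc * p_proc / p_act
print(p_proc_given_act)  # ~0.73 here: evidence for the process, but far from certainty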

The Hutzler paper applies reverse inference in a much more restrictive sense than it has usually been discussed, which he calls "task-specific functional specificity."  The idea is that given some task, one can compute (e.g., using meta-analysis) the reverse inference conditional on that task (which I had noted but not further explored in my 2006 paper).  I have no quibbles with the paper's analysis, and I think it nicely shows how reverse inference can be useful within a limited domain (in fact, Anthony Wagner and I made this point in 2004 in regard to left prefrontal function). My general concern is that the situation described in the Hutzler paper is fairly different from the one in which most reverse inference is performed.  Here is what I said in my initial review of his paper, which still holds for the published version:
If it is true that reverse inference is helpful within the context of a specific task, then that’s perfectly fine, except that in the wild reverse inference is rarely used within the same task.  In fact, it’s almost always used in task domains where one doesn’t know what to expect!  See my recent Neuron paper for examples of these kinds of reverse inferences; rarely does one see a reverse inference based on prior data from very similar tasks.  Thus, the paper basically makes my point for me by showing that the procedure is only effective in very specific cases which are outside of the standard way it is used.

In summary, while I agree with the analysis presented by Hutzler, I hope that readers will go beyond the title (which I think oversells the result) to see that it really shows the success of reverse inference in a very limited domain.