Sunday, December 16, 2012

The perils of leave-one-out crossvalidation for individual difference analyses

There is a common tendency of researchers in the neuroimaging field to use the term "prediction" to describe observed correlations.  This is problematic because the strength of a correlation does not necessarily imply that one can accurately predict the outcome of new (out-of-sample) observations.  Even if the underlying distributions are normally distributed, the observed correlation will generally overestimate the accuracy of predictions on out-of-sample observations due to overfitting (i.e., fitting the noise in addition to the signal).  In degenerate cases (e.g., when the observed correlation is driven by a single outlier), it is possible to observe a very strong in-sample correlation with almost zero predictive accuracy for out-of-sample observations.
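To make that degenerate case concrete, here is a minimal simulation sketch (hypothetical data generated with numpy, predictions via scikit-learn; this is not taken from any of the papers or code discussed below) in which a single extreme point produces a sizeable in-sample correlation while the crossvalidated predictive accuracy is much weaker:

```python
# Sketch: a single outlier can create a strong in-sample correlation between
# two otherwise unrelated variables, while out-of-sample prediction is poor.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
n = 30
x = rng.normal(size=n)
y = rng.normal(size=n)          # no true relation between x and y
x[0], y[0] = 6.0, 6.0           # one extreme point drives the correlation

print("in-sample r:", pearsonr(x, y)[0])

# crossvalidated predictions from a simple linear regression
pred = np.empty(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(x):
    model = LinearRegression().fit(x[train].reshape(-1, 1), y[train])
    pred[test] = model.predict(x[test].reshape(-1, 1))

# the out-of-sample estimate is typically far weaker (and unstable across
# random seeds) than the in-sample correlation above
print("out-of-sample r:", pearsonr(pred, y)[0])
```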

The concept of crossvalidation provides a way out of this mess; by fitting the model to subsets of the data and examining predictive accuracy on the held-out samples, it's possible to directly assess the predictive accuracy of a particular statistical model.  This approach has become increasingly popular in the neuroimaging literature, which should be a welcome development.  However, crossvalidation in the context of regression analyses turns out to be very tricky, and some of the methods that are being used in the literature appear to be problematic.

One of the most common forms of crossvalidation is "leave-one-out" (LOO), in which the model is repeatedly refit leaving out a single observation and then used to derive a prediction for the left-out observation.  Within the machine learning literature, it is widely appreciated that LOO is a suboptimal method for crossvalidation, as it gives estimates of the prediction error that are more variable than those from other forms of crossvalidation such as K-fold (in which training/testing is performed after breaking the data into K groups, usually 5 to 10) or the bootstrap; see Chapter 7 of Hastie et al. for a thorough discussion of this issue.

There is another problem with LOO that is specific to its use in regression.  We discovered this several years ago when we started trying to predict quantitative variables (such as individual learning rates) from fMRI data.  One thing that we always do when running any predictive analysis is to perform a randomization test to determine the distribution of performance when the relation between the data and the outcome is broken (e.g., by randomly shuffling the outcomes). In a regression analysis, what one expects to see in this randomized-label case is a zero correlation between the predicted and actual values.  However, what we actually saw was a very substantial negative bias in the correlation between predicted and actual values.  After lots of head-scratching this began to make sense as an effect of overfitting.  Imagine that you have a two-dimensional dataset where there is no true relation between X and Y, and you first fit a regression line to all of the data; the slope of this line should on average be zero, but will likely deviate from zero on each sample due to noise in the data (an effect that is bigger for smaller sample sizes).  Then, you drop out one of the datapoints and fit the line again; let's say you drop out one of the observations at the extreme end of the X range.  On average, this is going to have the effect of bringing the estimated regression line closer to zero than the full estimate (unless your left-out point was right at the mean of the Y distribution).  If you do this for all points, you can then see how this would result in a negative correlation between predicted and actual values when the true slope is zero, since this procedure will tend to pull the line towards zero for the extreme points.
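Here is a minimal sketch of that effect (a simplified stand-in for the notebook code linked below, not the notebook itself): with X and Y drawn independently, the correlation between the LOO predictions and the actual values comes out reliably negative rather than zero.

```python
# Sketch: leave-one-out regression on data with no true X-Y relation yields a
# negatively biased correlation between predicted and actual values.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
n_obs, n_sims = 32, 1000
loo_corrs = []

for _ in range(n_sims):
    x = rng.normal(size=(n_obs, 1))
    y = rng.normal(size=n_obs)               # no true relation to x
    pred = np.empty(n_obs)
    for train, test in LeaveOneOut().split(x):
        model = LinearRegression().fit(x[train], y[train])
        pred[test] = model.predict(x[test])
    loo_corrs.append(pearsonr(pred, y)[0])

# Under the null this should average ~0, but the mean is clearly negative.
print("mean r(pred, actual) under the null:", np.mean(loo_corrs))
```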

To see an example of this fleshed out in an ipython notebook, visit http://nbviewer.ipython.org/4221361/ - the full code is also available at https://github.com/poldrack/regressioncv.  This code creates random X and Y values and then tests several different crossvalidation schemes to examine their effects on the resulting correlations between predicted and true values.  I ran this for a number of different sample sizes, and the results are shown in the figure below (NB: figure legend colors fixed from original post).

[Figure: correlation between predicted and actual values as a function of sample size, for each crossvalidation scheme]
The best (i.e. least biased) performance is shown by the split half method.  Note that this is a special instance of split half crossvalidation with perfectly matched X distributions, because it has exactly the same values on the X variable for both halves.  The worst performance is seen for leave-one-out, which is highly biased for small N's but shows substantial bias even for very large N's.  Intermediate performance is seen when a balanced 8-fold crossvalidation scheme is used; the "hi" and "lo" versions of this are for two different balancing thresholds, where "hi" ensures fairly close matching of both X and Y distributions across folds whereas "lo" does not.  We have previously used balanced crossvalidation schemes on real fMRI data (Cohen et al., 2010) and found them to do a fairly good job of removing bias in the null distributions, but it's clear from these simulations that bias can still remain.
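For readers who want a concrete sense of what "balancing" might involve, here is a minimal sketch of one possible scheme, assuming that balance means keeping each fold's mean X and mean Y close to the overall means; the `tol` and `max_tries` parameters are invented for this illustration, and this is not the scheme used in the repository code (where the "hi" and "lo" variants correspond to tighter and looser balancing thresholds).

```python
# Sketch of one way to build "balanced" folds: randomly reassign folds until
# every fold's mean X and mean Y are within `tol` standard deviations of the
# overall means (or return the best assignment found).
import numpy as np

def balanced_kfold(x, y, n_folds=8, tol=0.5, max_tries=5000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    # roughly equal-sized folds, then shuffle the assignment
    base = np.tile(np.arange(n_folds), int(np.ceil(n / n_folds)))[:n]
    best_folds, best_dev = base, np.inf
    for _ in range(max_tries):
        folds = rng.permutation(base)
        # worst-case deviation of any fold mean from the overall mean (in SD units)
        dev = max(
            max(abs(x[folds == k].mean() - x.mean()) for k in range(n_folds)) / x.std(),
            max(abs(y[folds == k].mean() - y.mean()) for k in range(n_folds)) / y.std(),
        )
        if dev < best_dev:
            best_folds, best_dev = folds, dev
        if dev < tol:
            break
    return best_folds
```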

As an example using real data, I took the data from a paper entitled "Individual differences in nucleus accumbens activity to food and sexual images predict weight gain and sexual behavior" by Demos et al.  The title makes a strong predictive claim, but what the paper actually found was an observed correlation of r=0.37 between neural response and future weight gain.  I ripped the data points from their scatterplot using PlotDigitizer and performed a crossvalidation analysis using leave-one-out and 4-fold crossvalidation, with or without balancing of the X and Y distributions across folds (see code and data in the github repo).  The table below shows the predictive correlations obtained with each of these methods (the empirical null was obtained by resampling the data 500 times with random labels; the 95th percentile of this distribution is used as the significance cutoff):

CV method          r(pred,actual)    r(pred,actual), random labels    95th %ile of null
LOO                 0.176395         -0.276977                         0.154975
4-fold              0.183655         -0.148019                         0.160695
Balanced 4-fold     0.194456         -0.066261                         0.212925

This analysis shows that, rather than the 14% of variance implied by the observed correlation, one can at best predict about 4% of the variance in weight gain from brain activity (the balanced 4-fold result), and that estimate is not significant by the resampling test; the LOO and 4-fold results, while numerically weaker, do exceed their respective empirical null cutoffs. (Note that a small amount of noise was likely introduced by the data-grabbing method, so the results with the real data might be a bit better.)
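For concreteness, here is a minimal sketch of how such an empirical null can be computed (assuming x and y are 1-D numpy arrays holding the digitized data; this is not the repository code): shuffle the labels, rerun the entire crossvalidation procedure on each shuffle, and take the 95th percentile of the resulting correlations as the cutoff.

```python
# Sketch: crossvalidated prediction plus an empirical null from label shuffling.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def cv_correlation(x, y, n_folds=4, seed=0):
    """Correlation between crossvalidated predictions and actual values."""
    pred = np.empty_like(y, dtype=float)
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train, test in kf.split(x):
        model = LinearRegression().fit(x[train].reshape(-1, 1), y[train])
        pred[test] = model.predict(x[test].reshape(-1, 1))
    return pearsonr(pred, y)[0]

def empirical_null(x, y, n_perm=500, seed=0):
    """Rerun the whole CV procedure on shuffled labels; return the null
    correlations and their 95th percentile (the significance cutoff)."""
    rng = np.random.default_rng(seed)
    null_r = np.array([cv_correlation(x, rng.permutation(y))
                       for _ in range(n_perm)])
    return null_r, np.percentile(null_r, 95)
```

An observed cv_correlation(x, y) is then called significant only if it exceeds the cutoff returned by empirical_null(x, y).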

UPDATE: As noted in the comments, the negative bias can be largely overcome by fixing the intercept to zero in the linear regression used for prediction.  Here are the results obtained using the zero-intercept model on the Demos et al. data:

CV method          r(pred,actual)    r(pred,actual), random labels    95th %ile of null
LOO                 0.258            -0.067                            0.189
4-fold              0.263            -0.055                            0.192
Balanced 4-fold     0.256            -0.037                            0.213



This gets us up to about 7% of the variance accounted for by the prediction model, or about half of that implied by the in-sample correlation, and now the correlation is significant by all CV methods.
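For reference, the zero-intercept fix amounts to a single argument change in the prediction model; here is a minimal sketch (again assuming 1-D numpy arrays, and not the repository code):

```python
# Sketch: leave-one-out prediction with the intercept optionally fixed at zero.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

def loo_correlation(x, y, fit_intercept=True):
    pred = np.empty_like(y, dtype=float)
    for train, test in LeaveOneOut().split(x):
        model = LinearRegression(fit_intercept=fit_intercept)
        model.fit(x[train].reshape(-1, 1), y[train])
        pred[test] = model.predict(x[test].reshape(-1, 1))
    return pearsonr(pred, y)[0]

# fit_intercept=False forces the regression through the origin, which removes
# most of the negative bias; how sensible this is depends on how the data are
# centered and scaled.
# r_zero_intercept = loo_correlation(x, y, fit_intercept=False)
```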

Take-home messages:
  • An observed correlation is generally larger than the predictive accuracy for out-of-sample observations, so the term "predict" should not be used to describe in-sample correlations.
  • Cross-validation for predicting individual differences in fMRI analysis is tricky.
  • Leave-one-out should probably be avoided in favor of balanced k-fold schemes.
  • One should always run simulations of any predictive analysis stream using randomized labels in order to assess its potential bias.  This means running the entire data processing stream on each random draw, since bias can arise at multiple points in the stream.


PS: One point raised in discussing this result with some statisticians is that it may reflect the fact that correlation is not the best measure of the match between predicted and actual outcomes.  If someone has a chance to take my code and play with some alternative measures, please post the results to the comments section here, as I don't have time to try it out right now.

PPS: I would be very interested to see how this extends to high-dimensional data like those generally used in fMRI.  I know that the bias effect occurs in that context given that this is how we discovered it, but I have not had a chance to simulate its effects.




Tuesday, May 15, 2012

Obesity, health, and Gary Taubes

I recently posted a link on Facebook to Gary Taubes' article about why the campaign to stop America's obesity crisis keeps failing, and my friend Scott raised the following issue:

Taubes may or may not be on to something. But, he comes off like an Intelligent Design guy, "The experts are wrong, and they can't handle the Truth that I'm bringing."

I agree that Taubes' writings sometimes have the feel of a crazy outsider fighting against the establishment.  However, my personal experience (as well as that of a number of friends) and my (non-expert) reading of the literature both suggest that Taubes is largely right in his critique of the standard dogma regarding weight loss, food, and health.

First, the testimonial.  As I noted in a previous post, it was Taubes' writings that were largely responsible for pushing me towards the low-carb way of eating that I have followed for more than a year now.  After cutting carbs way down (except for my daily dose of dark chocolate, which is non-negotiable), I lost 20 pounds of fat and have kept it off, without any sense that I am being deprived; I basically eat whatever I want whenever I want, as long as it's real food and paleo-friendly (i.e., avoiding refined sugar, grains, and seed oils).  Most important, I feel great eating this way; in particular, whereas I used to get serious hunger pangs and energy dips 3-4 hours after eating, I can now easily fast for 24 hours without feeling particularly hungry.  My wife Jen has also had an interesting experience on this diet.  She has been able to maintain her weight or lose weight while never feeling hungry, whereas on our old carb-heavy vegetarian diet she was only able to lose weight through radical caloric restriction that left her constantly famished.  Similarly, a number of friends and family members have found that they were able to lose a substantial amount of weight after reducing carbs, while still feeling like they were able to eat to satiety.  I have *never* heard anyone say that they went on a low-carb diet and gained weight; more often, I have heard from people that they went on a low-carb diet and lost weight but then were afraid that all of the saturated fat was going to cause them to have a heart attack any day.

This brings us to one of the central points of Taubes' writings, which is that the standard story about what comprises a healthy diet, namely the link between heart disease, cholesterol, and saturated fat, is just plain wrong.  If you want a good overview of his general narrative without reading the books, I would suggest three NY Times pieces: one on the relation between dietary fat and disease from 2002, a blistering critique of epidemiological research from 2007, and his piece "Is Sugar Toxic" from 2011.

There are a lot of claims in the Taubes books, and I have not looked into all of them.  However, to the degree that I have looked into the claims that I found most important and relevant to my own diet, I have found them all to have fairly compelling scientific bases. The most important regards the relation between heart disease, cholesterol, and saturated fat.  It is amazing how the supposed unhealthiness of dietary cholesterol and saturated fat has become a "fact" that is repeated almost reflexively (e.g., most recently I encountered it in Tyler Cowen's "An Economist Gets Lunch").  I think that in part it is due to the visual similarity of saturated fat in meat and the plaques that are seen in atherosclerosis; it's just too easy to believe that the saturated fat that we eat is "clogging our arteries."  The data appear to say otherwise.  First, it has been known since 1950 that serum cholesterol bears little relation to dietary cholesterol; the mechanisms behind this are laid out nicely in Peter Attia's recent series on cholesterol.  Second, a large recent meta-analysis (including data from more than 347,000 individuals across 21 published studies) found no relation between saturated fat intake and heart disease or stroke.  Similarly, a recent Cochrane Collaboration meta-analysis of intervention studies showed that there was no significant reduction of total or cardiovascular mortality due to changes in dietary fat.  Although I think Taubes is correct in his arguments that epidemiological studies are hugely problematic (which I will discuss some other time), I trust these large meta-analyses much more than I trust any individual study (e.g., the China Study), especially when they show no effect (given all of the biases towards publishing positive effects).  I have also put my money where my mouth is: I now eat a high-fat diet including full-fat yogurt, eggs, and bacon almost every morning. It took me several months to stop craving sugar, but I'm now perfectly happy to finish a dinner without dessert, and in fact I no longer have a taste for foods that are extremely sugary.

Another of Taubes' main assertions is that obesity is caused primarily by carbohydrates (fleshed out in his book Why We Get Fat).  My feeling here is that obesity is an incredibly complex problem that involves both the body and the brain, and any story that tries to simplify it to a single component of our environment is bound to be wrong.  That said, it's clear to me from the personal experience described above that the "calories in, calories out" story is wrong, and that it does matter a lot what one eats, not just how much.  There has recently been a big argument in the blogosphere between Stephan Guyenet and Taubes over the relative importance of peripheral factors (e.g., insulin's effects on fat storage) versus neural factors (e.g., the role of satiety hormones and reward pathways); both of them have staked out strong positions, and I think that the truth is likely to fall somewhere in the middle (as usual).  As a neuroscientist I clearly think that the brain plays an important role; I won't talk more about that here, maybe some time soon.  I have not dug very deeply into the science of feeding trials in animals, in part because my reading of several summaries of this work suggests that the details can be very tricky but very important. In particular, without a great deal of control over exactly what kinds of nutrients are in the food, it can be very difficult to draw any conclusions from the data. There are, however, some individual studies in humans that do provide support for Taubes' claims.  For example, the A to Z Weight Loss Study showed that subjects assigned to the Atkins diet lost more weight and had better metabolic outcomes than people assigned to low-fat/high-carb diets like the Ornish Diet.  It's just one study, and it would be good to see more, but this along with my personal experience is enough to convince me.

Later in the discussion that I mentioned above, Scott made the following additional observation:

But the question remains, why doesn't the science on the cellular level ever reach the public health scientific community? I'm usually very skeptical when some sort of conspiracy is trotted out to explain the lack of uptake.

There are a lot of answers to this, which Taubes goes into great detail to discuss in his books.  But the general problem is that science itself can be a slow-moving ship; when it gets drawn well off course (as it appears to have been by the anti-fat arguments of Keys and others), it can take a long time to get back, and even longer for that new scientific knowledge to get translated into medical education and practice.  What is most striking to me is how studies whose data seem clearly inconsistent with the standard view are often presented in a way that suggests that they support the view, often by picking and choosing specific conditions.  Taubes gives numerous examples of this, as does Denise Minger's detailed analysis of the China Study.  I've also noticed it on a number of occasions when reading papers in this literature.  Thus, unless one is reading the papers closely (or following others who do this), it can be easy to continue to think that the standard model remains valid; you just can't take the abstract (or even the results section) at face value.  However, the fact that the rise in obesity has occurred alongside declining fat intake (coupled with increasing carb intake) over the last 40 years makes it pretty clear to me that the standard theory is just plain wrong, and that the carb theory is a viable alternative that needs to be studied more intently.  Unfortunately, many of the thought leaders in this area continue to expound solutions based on the "calories-in/calories-out" and low-fat ideas that got us here in the first place.

Wednesday, April 25, 2012

Things I like to do in Beijing

Several friends have asked for suggestions about things to do while in Beijing for the OHBM meeting this June.  I don't have any particular wisdom, but having visited several times I thought that I would share a few of my favorite things (along with photos from some of our past trips).  I'm not including many of the obvious attractions (Forbidden City, Summer Palace, Olympic Park) because I figure those will be in every guide book.



798 Art Complex: an amazing art district created from an old factory complex.  If you like modern art you can easily spend more than half a day there. There is a great noodle shop tucked away in the middle of the complex, and also at least one really nice cafe.

Hutongs: These are the old neighborhoods in the center of the city.  Some of them are very touristy, but if you walk a few blocks off the main street you can find streets that feel far removed from the tourist crowds. I would particularly recommend the area around Heizhima Hutong, where these photos were taken.


Drum Tower: This was built in 1272 and served as the official timepiece of the Chinese government until 1924.  If you show up at the right time, you can see an awesome drumming show. 





Great Wall:  We visited the Great Wall at Badaling in 2005, which is apparently the most touristy place to go but also relatively close to Beijing.  Go early in the day, to avoid both crowds and heat.  Perhaps the best part of visiting at Badaling is that there is a roller coaster that can take you down from the top.  


Eating

Roasted duck hearts at Quanjude
We have had a lot of wonderful meals in Beijing, both as fish-etarians on our first two visits and as omnivores on our most recent visit. Prepare to eat well, but also be prepared to have your sensibilities challenged.  A few highlights are:

Roast duck:  As recently reformed vegetarians, we spent our first two visits to China without trying "Peking Duck" (or, as they call it in Beijing, "roast duck").  On our last visit we had it at Quanjude and it was pretty awesome.  We had the full on roast duck experience, including "duck breast" (which is basically just fat and skin) and roasted duck hearts.  A must-have. (NB: If you order the duck hearts, they come with a bowl of flaming liquid.  Apparently you are not supposed to actually dip the heart into the liquid, as I did.)
Spicy snails at Spicy Grandma restaurant

Sichuan food: Our friends in China are largely from Sichuan province, and thus we often end up eating at Sichuan restaurants.  The Sichuan peppercorn has an amazing numbing quality.  Also be sure to try the Sichuan hot pot, which is like a very spicy version of shabu shabu.  I would suggest bringing a significant ration of Pepto Bismol, as the western gut starts to ache after a few days of this kind of spicy food.  But it is so worth the burn. 






Yunnan food:  One of the most amazing meals we had was at the Rainbow Restaurant in the Beijing Sun Palace Hotel. The greeters are dressed in traditional Yunnan clothing, and the food is absolutely amazing, with a heavy focus on mushrooms.
A dish that contained "smelly tofu" - actually really tasty


Grilled matsutake mushrooms at Rainbow




Tuesday, March 6, 2012

Skeletons in the closet

As someone who has thrown lots of stones in recent years, I find it easy to forget that anyone who publishes enough will end up with some skeletons in their closet.  I was reminded of that fact today, when Dorothy Bishop posted a detailed analysis of a paper that was published in 2003 on which I am a coauthor.

This paper studied a set of children diagnosed with dyslexia who were scanned before and after treatment with the Fast ForWord training program.  The results showed improved language and reading function, which were associated with changes in brain activation. 

Dorothy notes four major problems with the study:
  • There was no dyslexic control group; thus, we don't know whether any improvements over time were specific to the treatment, or would have occurred with a control treatment or even without any treatment.
  • The brain imaging data were thresholded using an uncorrected threshold.
  • One of the main conclusions (the "normalization" of activation following training) is not supported by the necessary interaction statistic, but rather by a visual comparison of maps.
  • The correlation between changes in language scores and activation was reported for only one of the many measures, and it appeared to have been driven by outliers.
Looking back at the paper, I see that Dorothy is absolutely right on each of these points.  In defense of my coauthors, I would note that points 2-4 were basically standard practice in fMRI analysis 10 years ago (and still crop up fairly often today).  Ironically, I raised two of these issues in my recent paper for the special issue of Neuroimage celebrating the 20th anniversary of fMRI, in talking about the need for increased methodological rigor:

Foremost, I hope that in the next 20 years the field of cognitive neuroscience will increase the rigor with which it applies neuroimaging methods. The recent debates about circularity and “voodoo correlations” (Kriegeskorte et al., 2009; Vul et al., 2009) have highlighted the need for increased care regarding analytic methods. Consideration of similar debates in genetics and clinical trials led Ioannidis (2005) to outline a number of factors that may contribute to increased levels of spurious results in any scientific field, and the degree to which many of these apply to fMRI research is rather sobering:
• small sample sizes
• small effect sizes
• large number of tested effects
• flexibility in designs, definitions, outcomes, and analysis methods
• being a “hot” scientific field
Some simple methodological improvements could make a big difference. First, the field needs to agree that inference based on uncorrected statistical results is not acceptable (cf. Bennett et al., 2009). Many researchers have digested this important fact, but it is still common to see results presented at thresholds such as uncorrected p < .005. Because such uncorrected thresholds do not adapt to the data (e.g., the number of voxels tested or their spatial smoothness), they are certain to be invalid in almost every situation (potentially being either overly liberal or overly conservative). As an example, I took the fMRI data from Tom et al. (2007), and created a random “individual difference” variable. Thus, there should be no correlations observed other than Type I errors. However, thresholding at uncorrected p < .001 and a minimum cluster size of 25 voxels (a common heuristic threshold) showed a significant region near the amygdala; Fig. 1 shows this region along with a plot of the “beautiful” (but artifactual) correlation between activation and the random behavioral variable. This activation was not present when using a corrected statistic. A similar point was made in a more humorous way by Bennett et al. (2010), who scanned a dead salmon being presented with a social cognition task and found activation when using an uncorrected threshold. There are now a number of well-established methods for multiple comparisons correction (Poldrack et al., 2011), such that there is absolutely no excuse to present results at uncorrected thresholds. The most common reason for failing to use rigorous corrections for multiple tests is that with smaller samples these methods are highly conservative, and thus result in a high rate of false negatives. This is certainly a problem, but I don't think that the answer is to present uncorrected results; rather, the answer is to ensure that one's sample is large enough to provide sufficient statistical power to find the effects of interest.
Second, I have become increasingly concerned about the use of “small volume corrections” to address the multiple testing problem. The use of a priori masks to constrain statistical testing is perfectly legitimate, but one often gets the feeling that the masks used for small volume correction were chosen after seeing the initial results (perhaps after a whole-brain corrected analysis was not significant). In such a case, any inferences based on these corrections are circular and the statistics are useless. Researchers who plan to use small volume corrections in their analysis should formulate a specific analysis plan prior to any analyses, and only use small volume corrections that were explicitly planned a priori. This sounds like a remedial lesson in basic statistics, but unfortunately it seems to be regularly forgotten by researchers in the field.
Third, the field needs to move toward the use of more robust methods for statistical inference (e.g., Huber, 2004). In particular, analyses of correlations between activation and behavior across subjects are highly susceptible to the influence of outlier subjects, especially with small sample sizes. Robust statistical methods can ensure that the results are not overly influenced by these outliers, either by reducing the effect of outlier datapoints (e.g., robust regression using iteratively reweighted least squares) or by separately modeling data points that fall too far outside of the rest of the sample (e.g., mixture modeling). Robust tools for fMRI group analysis are increasingly available, both as part of standard software packages (such as the “outlier detection” technique implemented in FSL: Woolrich, 2008) and as add-on toolboxes (Wager et al., 2005). Given the frequency with which outliers are observed in group fMRI data, these methods should become standard in the field. However, it's also important to remember that they are not a panacea, and that it remains important to apply sufficient quality control to statistical results, in order to understand the degree to which one's results reflect generalizable patterns versus statistical figments.
It should be clear from these comments that my faith in the results of any study that uses such problematic methods (as the Temple et al. study did) is relatively weak.  I personally have learned my lesson and our lab now does its best to adhere to these more rigorous standards, even when they mean that a study sometimes ends up being unpublishable.   I can only hope that others will join me.
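As an aside, since the quoted passage mentions robust regression via iteratively reweighted least squares, here is a minimal, self-contained sketch of what that looks like in practice, using simulated data and statsmodels' RLM with a Huber weighting function; it is an illustration of the general technique, not an analysis from any of the papers discussed above.

```python
# Sketch: an outlier subject inflates the OLS brain-behavior slope, while a
# robust (IRLS) fit is much less affected.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 20
activation = rng.normal(size=n)
behavior = rng.normal(size=n)          # no true relation
activation[-1] += 6                    # one extreme subject...
behavior[-1] += 6                      # ...creates a spurious correlation

X = sm.add_constant(activation)
ols_fit = sm.OLS(behavior, X).fit()
robust_fit = sm.RLM(behavior, X, M=sm.robust.norms.HuberT()).fit()

print("OLS slope:   ", ols_fit.params[1])     # inflated by the outlier
print("robust slope:", robust_fit.params[1])  # typically much less influenced
```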

      Thursday, February 9, 2012

      Quitting cable

      I was inspired by Nathan's post at Flowing Data to say a bit about how our experiment with giving up cable TV is going. Back in September, we turned off our U-Verse subscription, sent back the DVR, and started getting our TV solely from the computer (we use a Mac Mini as our media center PC).  Here is our experience so far:

      We watch a lot less TV.  Our TV routine has now morphed from watching 2-3 hours per night into watching a single show every night (recent favorites are The Layover and Top Chef, with Colbert Report as our fallback).  In its place we are reading a lot more; in fact, much of the money that we are saving on cable is probably flowing to the Kindle store at Amazon.  However, we have also recently started using the Austin Public Library's ebook lending service which is a great way to save on ebooks.  I've also been playing the guitar more often.

      Sometimes you really want live TV.  The one problem with getting everything from the web is that it's often hard to find a good live stream; we had this problem on New Year's Eve.  To solve this, I recently set up a way to watch live broadcast TV on the computer, using an Elgato EyeTV One Computer TV Tuner with a Mohu Leaf HDTV antenna.  With this slick combination we are able to get 13 channels of over-the-air HDTV for free.  The EyeTV software is really nice; it has good DVR functionality and an integrated TV Guide.  It's very much like having cable with 13 channels, except that the DVR functions are much better than any set-top DVR we ever had.

      Hulu Plus  > Netflix.  We have found that Hulu Plus meets our TV viewing needs quite well.  Sure we have to watch some commercials, but we are usually able to get new shows the next day after they air, and the selection is pretty good.  I tried a free trial of Netflix online, but we have not found that it has much to offer us, except for an occasional movie.  However, we watch movies pretty rarely, and so it probably makes more sense for us to just buy them from iTunes.  For shows that are not available on the web or via Hulu (e.g., The Layover), we buy them from iTunes as well.  It's not cheap but we still come out ahead in the long run.

      Media center software sucks.  We tried using both Plex and Boxee on the mac mini, but gave up on both after too many things just didn't work; in particular, the Hulu integration on Plex was really frustrating, as it seems like it should work but then it never quite does.  Now we just watch Hulu content through a web browser, live/recorded TV through EyeTV, and iTunes content through iTunes.  The main drawback of this setup is that we can't get remote functionality that works seamlessly across all these different interfaces, but that's not been a problem.

      Overall I would rate this experiment as a success and would definitely recommend giving up cable.