Sunday, December 16, 2012

The perils of leave-one-out crossvalidation for individual difference analyses

There is a common tendency among researchers in the neuroimaging field to use the term "prediction" to describe observed correlations.  This is problematic because the strength of a correlation does not necessarily imply that one can accurately predict the outcome for new (out-of-sample) observations.  Even when the underlying variables are normally distributed, the observed correlation will generally overestimate the accuracy of predictions for out-of-sample observations due to overfitting (i.e., fitting the noise in addition to the signal).  In degenerate cases (e.g., when the observed correlation is driven by a single outlier), it is possible to observe a very strong in-sample correlation with almost zero predictive accuracy for out-of-sample observations.
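To make that degenerate case concrete, here is a quick toy example (not taken from the code discussed below; the specific values are arbitrary) in which a single shared outlier produces a strong observed correlation between two otherwise unrelated variables:

```python
# Toy illustration: one extreme point drives a strong in-sample correlation
# between variables that are otherwise unrelated.
import numpy as np

rng = np.random.RandomState(0)
n = 30
x = rng.randn(n)
y = rng.randn(n)           # no true relation between x and y
x[0], y[0] = 10.0, 10.0    # a single outlier shared by both variables

print('r with the outlier:    %.2f' % np.corrcoef(x, y)[0, 1])
print('r without the outlier: %.2f' % np.corrcoef(x[1:], y[1:])[0, 1])
```

A model fit to data like these would tell you essentially nothing about a new observation drawn from the bulk of the distribution.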

The concept of crossvalidation provides a way out of this mess; by fitting the model to subsets of the data and examining predictive accuracy on the held-out samples, it's possible to directly assess the predictive accuracy of a particular statistical model.  This approach has become increasingly popular in the neuroimaging literature, which should be a welcome development.  However, crossvalidation in the context of regression analyses turns out to be very tricky, and some of the methods that are being used in the literature appear to be problematic.

One of the most common forms of crossvalidation is "leave-one-out" (LOO), in which the model is repeatedly refit leaving out a single observation and then used to derive a prediction for the left-out observation.  Within the machine learning literature, it is widely appreciated that LOO is a suboptimal method for crossvalidation, as it gives estimates of the prediction error that are more variable than those from other forms of crossvalidation such as K-fold (in which training/testing is performed after breaking the data into K groups, usually 5 to 10) or the bootstrap (see Chapter 7 of Hastie et al. for a thorough discussion of this issue).
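One way to get a feel for this in a simple regression setting is to simulate many datasets and look at how much the LOO and K-fold error estimates bounce around from dataset to dataset.  The sketch below is illustrative only (it is not from the notebook discussed later, and the sample size, effect size, and number of simulations are arbitrary choices):

```python
# Illustrative simulation: how variable are the LOO and 5-fold estimates of
# prediction error across repeated datasets drawn from the same model?
import numpy as np

def cv_mse(x, y, fold_labels):
    """Cross-validated mean squared error for a simple linear regression."""
    errs = []
    for k in np.unique(fold_labels):
        train, test = fold_labels != k, fold_labels == k
        slope, intercept = np.polyfit(x[train], y[train], 1)
        errs.append(np.mean((y[test] - (slope * x[test] + intercept)) ** 2))
    return np.mean(errs)

rng = np.random.RandomState(1)
n, nsims = 40, 200
loo_folds = np.arange(n)                        # one observation per fold
kfold_folds = np.repeat(np.arange(5), n // 5)   # five equal folds
loo_est, kfold_est = np.zeros(nsims), np.zeros(nsims)
for i in range(nsims):
    x = rng.randn(n)
    y = 0.5 * x + rng.randn(n)
    loo_est[i] = cv_mse(x, y, loo_folds)
    kfold_est[i] = cv_mse(x, y, rng.permutation(kfold_folds))

print('LOO:    mean MSE %.3f, SD across datasets %.3f' % (loo_est.mean(), loo_est.std()))
print('5-fold: mean MSE %.3f, SD across datasets %.3f' % (kfold_est.mean(), kfold_est.std()))
```

How the two schemes compare will depend on the details of the setting; Hastie et al. lay out the general bias/variance arguments.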

There is another problem with LOO that is specific to its use in regression.  We discovered this several years ago when we started trying to predict quantitative variables (such as individual learning rates) from fMRI data.  One thing that we always do when running any predictive analysis is to perform a randomization test to determine the distribution of performance when the relation between the data and the outcome is broken (e.g., by randomly shuffling the outcomes).  In a regression analysis, what one expects to see in this randomized-label case is a zero correlation between the predicted and actual values.  What we actually saw, however, was a very substantial negative bias in the correlation between predicted and actual values.

After lots of head-scratching, this began to make sense as an effect of overfitting.  Imagine that you have a two-dimensional dataset with no true relation between X and Y, and you first fit a regression line to all of the data; the slope of this line should be zero on average, but will deviate from zero in any particular sample due to noise in the data (an effect that is bigger for smaller sample sizes).  Now drop one of the datapoints and fit the line again; say you drop one of the observations at the extreme end of the X range.  On average, this will pull the estimated regression line closer to zero than the full-sample estimate (unless the left-out point happened to sit right at the mean of the Y distribution).  If you do this for every point, you can see how it would result in a negative correlation between predicted and actual values when the true slope is zero, since the procedure will tend to pull the line towards zero for the extreme points.
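Here is a minimal simulation of that effect, a stripped-down version of what the notebook linked below does (the sample size and number of simulations are arbitrary choices):

```python
# With no true X-Y relation, leave-one-out predictions from ordinary least
# squares regression are negatively correlated with the actual values.
import numpy as np

rng = np.random.RandomState(42)
nsims, n = 1000, 16
corrs = np.zeros(nsims)

for i in range(nsims):
    x = rng.randn(n)
    y = rng.randn(n)                  # null: no relation between x and y
    pred = np.zeros(n)
    for j in range(n):                # leave-one-out loop
        train = np.arange(n) != j
        slope, intercept = np.polyfit(x[train], y[train], 1)
        pred[j] = slope * x[j] + intercept
    corrs[i] = np.corrcoef(pred, y)[0, 1]

print('mean r(predicted, actual) under the null: %.3f' % corrs.mean())
```

On runs like this the mean correlation comes out clearly negative rather than near zero, which is the bias shown in the figure below.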

To see an example of this fleshed out in an ipython notebook, visit http://nbviewer.ipython.org/4221361/ (the full code is also available at https://github.com/poldrack/regressioncv).  This code creates random X and Y values and then tests several different crossvalidation schemes to examine their effects on the resulting correlations between predicted and true values.  I ran this for a number of different sample sizes, and the results are shown in the figure below (NB: figure legend colors fixed from original post).



The best (i.e., least biased) performance is shown by the split-half method.  Note that this is a special instance of split-half crossvalidation with perfectly matched X distributions, because both halves have exactly the same values on the X variable.  The worst performance is seen for leave-one-out, which is highly biased for small N and still shows substantial bias even for very large N.  Intermediate performance is seen for the balanced 8-fold crossvalidation scheme; the "hi" and "lo" versions correspond to two different balancing thresholds, where "hi" ensures fairly close matching of both the X and Y distributions across folds whereas "lo" does not.  We have previously used balanced crossvalidation schemes on real fMRI data (Cohen et al., 2010) and found them to do a fairly good job of removing bias in the null distributions, but it is clear from these simulations that some bias can still remain.
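For readers who want to try something similar, here is one simple way to build a balanced fold assignment in the spirit described above: reshuffle the fold labels until fold membership is (nearly) unrelated to both X and Y.  The details here (a one-way ANOVA F statistic and the particular cutoff) are illustrative choices of mine, not necessarily those used in the linked code:

```python
# Sketch of a balanced fold assignment: reshuffle fold labels until the X and Y
# distributions are roughly matched across folds, as indexed by a one-way ANOVA
# F statistic. The threshold plays the role of the "hi"/"lo" settings above.
# Assumes n >= nfolds so that every fold gets at least one observation.
import numpy as np
from scipy import stats

def balanced_folds(x, y, nfolds=8, f_thresh=0.5, max_iter=5000, seed=0):
    rng = np.random.RandomState(seed)
    n = len(x)
    base = np.tile(np.arange(nfolds), int(np.ceil(n / float(nfolds))))[:n]
    folds = rng.permutation(base)
    for _ in range(max_iter):
        folds = rng.permutation(base)
        f_x = stats.f_oneway(*[x[folds == k] for k in range(nfolds)])[0]
        f_y = stats.f_oneway(*[y[folds == k] for k in range(nfolds)])[0]
        if f_x < f_thresh and f_y < f_thresh:
            break
    return folds
```

A smaller threshold enforces tighter matching (at the cost of more reshuffling), which is the trade-off between the "hi" and "lo" variants in the figure.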

As an example using real data, I took the data from a paper entitled "Individual differences in nucleus accumbens activity to food and sexual images predict weight gain and sexual behavior" by Demos et al.  The title makes a strong predictive claim, but what the paper actually reported was an observed correlation of r=0.37 between neural response and future weight gain.  I ripped the data points from their scatterplot using PlotDigitizer and performed a crossvalidation analysis using leave-one-out and 4-fold crossvalidation, either with or without balancing of the X and Y distributions across folds (see code and data in the github repo).  The table below shows the predictive correlations obtained using each of these methods (the empirical null was obtained by resampling the data 500 times with random labels; the 95th percentile of this null distribution is used as the significance cutoff):

CV method           r(pred, actual)   r(pred, actual), random labels   95th %ile of null
LOO                 0.176395          -0.276977                        0.154975
4-fold              0.183655          -0.148019                        0.160695
Balanced 4-fold     0.194456          -0.066261                        0.212925

This analysis shows that, rather than the 14% of variance implied by the observed correlation, one can at best predict about 4% of the variance in weight gain from brain activity.  The balanced 4-fold result does not exceed its significance cutoff from the resampling test, though the LOO and 4-fold results, while numerically weaker, do exceed their respective empirical null cutoffs.  (Note that a small amount of noise was likely introduced by the data-grabbing method, so the results with the real data might be a bit better.)
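For reference, here is a sketch of the empirical-null procedure used for these cutoffs, written around a hypothetical cv_predict(x, y, folds) helper that returns cross-validated predictions for whatever CV scheme is being evaluated (that helper name is mine, not from the repo):

```python
# Empirical null for the cross-validated prediction-actual correlation:
# shuffle the outcomes, rerun the full CV procedure, and take the 95th
# percentile of the resulting correlations as the significance cutoff.
import numpy as np

def empirical_null_cutoff(x, y, folds, cv_predict, nperm=500, seed=0):
    rng = np.random.RandomState(seed)
    null_r = np.zeros(nperm)
    for i in range(nperm):
        y_shuf = rng.permutation(y)            # break the X-Y relation
        pred = cv_predict(x, y_shuf, folds)    # rerun the whole CV pipeline
        null_r[i] = np.corrcoef(pred, y_shuf)[0, 1]
    return np.percentile(null_r, 95), null_r
```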

UPDATE: As noted in the comments, the negative bias can be largely overcome by fixing the intercept to zero in the linear regression used for prediction.  Here are the results obtained using the zero-intercept model on the Demos et al. data:

CV method           r(pred, actual)   r(pred, actual), random labels   95th %ile of null
LOO                 0.258             -0.067                           0.189
4-fold              0.263             -0.055                           0.192
Balanced 4-fold     0.256             -0.037                           0.213



This gets us up to about 7% of the variance accounted for by the predictive model, or about half of that implied by the in-sample correlation, and now the correlation is significant by all CV methods.
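For concreteness, here is a minimal sketch of what "fixing the intercept to zero" means in the LOO prediction loop (the function and variable names are mine; whether a zero intercept is sensible will depend on how the variables have been centered and scaled):

```python
# Leave-one-out prediction with a zero-intercept (through-the-origin) model:
# fit y ~ b*x on the training observations and predict the held-out value.
import numpy as np

def loo_predict_zero_intercept(x, y):
    n = len(x)
    pred = np.zeros(n)
    for j in range(n):
        train = np.arange(n) != j
        # least-squares slope through the origin: b = sum(x*y) / sum(x^2)
        b = np.dot(x[train], y[train]) / np.dot(x[train], x[train])
        pred[j] = b * x[j]
    return pred
```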

Take-home messages:
  • Observed correlation is generally larger than predictive accuracy for out-of-sample observations, such that one should not use the term "predict" in the context of correlations.
  • Cross-validation for predicting individual differences in fMRI analysis is tricky.  
  • Leave-one-out should probably be avoided in favor of balanced K-fold schemes.
  • One should always run simulations of any classifier analysis stream using randomized labels in order to assess the potential bias of the classifier.  This means running the entire data processing stream on each random draw, since bias can occur at various points in the stream.


PS: One point raised in discussing this result with some statisticians is that it may reflect the fact that correlation is not the best measure of the match between predicted and actual outcomes.  If someone has a chance to take my code and play with some alternative measures, please post the results to the comments section here, as I don't have time to try it out right now.
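In case it helps anyone who wants to take that up, here are a couple of obvious candidates one might compute alongside (or instead of) the correlation; these are illustrative suggestions, not results:

```python
# Alternative measures of the match between predicted and actual values.
import numpy as np

def prediction_metrics(pred, actual):
    pred, actual = np.asarray(pred), np.asarray(actual)
    err = actual - pred
    mse = np.mean(err ** 2)              # mean squared error
    mae = np.mean(np.abs(err))           # mean absolute error
    # "prediction R^2": 1 - SSE / SS around the mean of the actual values.
    # Unlike a correlation, this can go negative when the predictions are
    # worse than simply predicting the mean.
    r2 = 1.0 - np.sum(err ** 2) / np.sum((actual - actual.mean()) ** 2)
    return {'mse': mse, 'mae': mae, 'r2': r2}
```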

PPS: I would be very interested to see how this extends to high-dimensional data like those generally used in fMRI.  I know that the bias effect occurs in that context given that this is how we discovered it, but I have not had a chance to simulate its effects.