- Sharing code is good, but only if someone else actually looks at it very closely.
- You can't rely on tools to fail when you make a mistake.
- Classifiers are very good at finding information, even if it's not the information you had in mind.
The code in question is 4_classify_wholebrain.py, which reads in the processed data (saved in a numpy file) and classifies each of the 400 observations (each with about 184K features) into one of 23 classes (representing different tasks). The code was made publicly available before the paper was submitted; while I have no way of knowing whether the reviewers examined it, it's fair to say that even if they did, they would most likely not have caught this particular bug unless they were very eagle-eyed. As it happens, a student here was trying to reproduce my analyses independently and was finding much lower classification accuracies than the ones I had reported. As he dug into my code, it became clear that the difference was driven by a (lazy, in hindsight) coding mistake on my part.
The original code can be viewed here - the snippet in question (cleaned up a bit) is:
# imports needed by this snippet (N is numpy)
import numpy as N
from sklearn.cross_validation import StratifiedKFold
from sklearn.svm import LinearSVC

skf = StratifiedKFold(labels, 8)
if trainsvm:
    pred = N.zeros(len(labels))
    for train, test in skf:
        clf = LinearSVC()
        clf.fit(data[train], labels[train])
        pred[test] = clf.predict(data[test])
Pretty simple - it creates a cross-validation object using sklearn, then loops through the folds, fitting on the training folds and computing the predicted class for the test fold. Running this, I got about 93% test accuracy on the multiclass problem; had I gotten 100% accuracy I would have been sure there was a problem, but given that we had previously gotten around 80% for similar problems, I was not terribly shocked by the high accuracy. Here is the problem:
In [9]: data.shape
Out[9]: (182609, 400)
When I put the data into the numpy array, I had voxels as the first dimension, whereas for a classification analysis one would usually put the observations in rows rather than columns. Now, numpy is happy to take the train list as an array index and apply it to the first dimension. However, because the dimensions of the data were transposed, the effect was to classify voxels rather than subjects:
In [10]: data[train].shape
Out[10]: (350, 400)
In [11]: data[test].shape
Out[11]: (50, 400)
When I fix this by using the proper data reference (as in the current revision of the code in the repo), it looks as it should (i.e., all voxels are included for the subjects in the train or test folds):
In [12]: data[:,train].T.shape
Out[12]: (350, 182609)
In [14]: data[:,test].T.shape
Out[14]: (50, 182609)
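For concreteness, here is a minimal sketch of what the corrected loop looks like (same variable names as the snippet above, and not necessarily the exact code now in the repo; the only change is in how data is sliced before being handed to the classifier):
import numpy as N
from sklearn.cross_validation import StratifiedKFold
from sklearn.svm import LinearSVC

skf = StratifiedKFold(labels, 8)
if trainsvm:
    pred = N.zeros(len(labels))
    for train, test in skf:
        clf = LinearSVC()
        # slice observations along the second dimension, then transpose
        # so that rows are observations and columns are voxels
        clf.fit(data[:, train].T, labels[train])
        pred[test] = clf.predict(data[:, test].T)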
When I run this with the fixed code I get about 53% accuracy; still well above chance (remember that it's a 23-class problem, so chance is roughly 4%), but much less than the 93% we had gotten previously.
It's worth noting that randomization tests with the flawed code showed the expected null distribution. The source of the information being used by the classifier is a bit of a mystery, but it likely reflects the fact that the distance between voxels in the data matrix is related to their physical distance in the brain, and that the labels were grouped sequentially in the label file; the labels were thus correlated with physical location in the brain, which provided information that could drive the classification.
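For reference, a label-randomization test of this sort follows the usual recipe: shuffle the labels, rerun the whole pipeline, and build a null distribution of accuracies. A rough sketch (run_cv_accuracy and observed_accuracy are hypothetical stand-ins for a wrapper around the cross-validation loop above and the accuracy it returns on the real labels, not names from the actual script):
import numpy as N

n_perm = 1000
null_acc = N.zeros(n_perm)
for i in range(n_perm):
    # shuffle the labels and rerun the full cross-validation
    shuffled_labels = N.random.permutation(labels)
    null_acc[i] = run_cv_accuracy(data, shuffled_labels)
# p-value: how often the null matches or beats the observed accuracy
p = N.mean(null_acc >= observed_accuracy)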
This is clearly a worst-case scenario for anyone who codes up their own analyses; the paper has already been submitted and you find an error that greatly changes the results. Fortunately, the exact level of classification accuracy is not central to the paper in question, but it's worrisome nonetheless.
What are the lessons to be learned here? Most concretely, it's important to check the size of data structures whenever you are slicing arrays. I was lazy in my coding of the cross-validation loop, and I should have checked that the size of the dataset being fed into the classifier was what I expected it to be (the difference between 400 and 182609 would be pretty obvious). It might have added an extra 30 seconds to my initial coding time, but it would have saved me a huge headache and the hours needed to rerun all of the analyses.
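In this case, even a single assertion on the shape of the array being handed to the classifier would have caught it; a sketch using the shapes from above:
n_voxels = 182609  # expected number of features per observation
for train, test in skf:
    X_train = data[train]  # the flawed slice
    # with the bug, X_train.shape is (350, 400) rather than (350, 182609),
    # so this assertion fails immediately
    assert X_train.shape == (len(train), n_voxels), X_train.shape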
Second, sharing code is necessary but not sufficient for finding problems. Someone could have grabbed my code and gotten exactly the same results that I got; only if they looked at the shape of the sliced arrays would they have noticed a problem. I am becoming increasingly convinced that if you really want to believe a computational result, the strongest approach is to have an independent person try to replicate it without using your shared code. Failing that, one really wants a validation dataset that can be fed into the program, for which you know exactly what the output should be; randomizing the labels is one way of doing this (the outcome should then be at chance), but you also want to do it with real signal. Unfortunately this is not trivial for the kinds of analyses that we do, but perhaps better data simulators would help make it easier.
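As a rough illustration of validation with real signal, one could plant a known signal in synthetic data laid out the same way as the real data and check that the pipeline recovers it; chance-level accuracy on such a dataset would flag a bug. Here is a toy sketch (with far fewer voxels than the real data, and nothing like a realistic fMRI simulator):
import numpy as N

n_voxels, n_obs, n_classes = 1000, 400, 23  # toy dimensions
sim_labels = N.arange(n_obs) % n_classes
# Gaussian noise plus a class-dependent offset in the first 50 "voxels"
sim_data = N.random.randn(n_voxels, n_obs)
sim_data[:50, :] += sim_labels[N.newaxis, :]
# feeding sim_data and sim_labels through the classification code should
# give accuracy well above chance; chance performance would signal a bug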
Finally, there is a meta-point about talking openly about these kinds of errors. We know that they happen all the time, yet few people ever discuss them publicly. I hope that others will follow my lead in talking openly about the errors they have made, so that people can learn from them and be more motivated to spend the extra time needed to write robust code.
I have a draft blog post gathering e-dust that bears the working title, "Whose job is method validation, anyway?" It was motivated by the acquisition end of the fMRI pipeline (my own concern), but the issues seem to pervade the entire operation.
One of my points is (er, will be) that the party responsible for creating a widget isn't necessarily the one who should be responsible for validating it. Indeed, one can make a strong case for separating validation from production because of conflicts of interest, over-familiarity, etc.
In my post I'll be pointing a lot of fingers. My contention is that we all bear a part of the responsibility, not least the person who takes the widget you've made and uses it without first determining what (independent) validation has been performed.
So, I congratulate you on your mea culpa and I hope that it motivates serious (re)consideration of how we use the tools we have at our disposal in neuroimaging. Because no lives are at stake we can be rather lax when it comes to insisting on, or checking out, the validation of our methods. Bad, bad, bad!
Thanks Ben - I'll look forward to your post!
Hi Russ,
Thanks for the important post - I sometimes feel like errors like this might be a pretty serious source of false positives in the literature.
I too have a draft blog post gathering dust (titled "Biased Debugging"), which considers how the debugging process is non-random, biased in favour of positive and sensible-looking results. If an error throws up a crazy (or non-significant) result, we are more likely to track it down than if it fits nicely with our hopes and expectations.
Given the increasing complexity of data analyses, it is naive to think that the published data are free of coding errors. If the errors are random, then the additional noise is probably harmless enough. But if they are not random, and are biased in favour of positive findings, then perhaps biased debugging is a serious contributor to statistical bias that deserves more attention.
As you say, sharing code is not always sufficient (though certainly better for eventual detection than not sharing). Only independent verification using fresh code is 100% safe. Incidentally, just yesterday I asked my graduate student to code up his own version of the Mahalanobis distance for MEG rather than use mine, just to make sure that our results converged...
Your other point is also crucial - we need to be prepared to admit our mistakes. This is perhaps especially important for graduate students. If students do not feel able to admit their mistakes to their supervisors, then there is little chance for redress. Errors like these should not be seen as a sign of failure; in fact, more likely the opposite. We should be encouraged to try new things, to write custom scripts and implement new analytic procedures, and innovation is always going to be error-prone. Perfect coding is as unrealistic as perfect prose - just think of the grammatical errors that sneak through endless drafts seen by all co-authors. Then think: who ever goes through thousands of lines of Matlab code as carefully as the manuscript being prepared for submission!?
Mark - very good point regarding the biases that can emerge from debugging. Hope to see your full post soon!
As promised, I managed to dust off the old draft at the Brain Box: http://the-brain-box.blogspot.co.uk/2013/02/biased-debugging.html
Hi Russ,
This is great!
I completely agree about code replication. It's amazing how easy it is to catch errors that go in the wrong direction (that decrease your results) and how hard it is to catch those that go in the right direction. Double-entry (recoding from scratch) really seems like the gold standard.
Do you have thoughts about practices that help you avoid errors? It's really important to produce reliable code but at the same time it seems cripplingly slow to implement unit tests for every analysis you do. So I have struggled with what concrete recommendations to make to students and how to implement error checking in my own coding practice...
best,
Mike Frank
Hey Mike - thanks for your comments. I agree that it's really hard to come up with general recommendations without going down the slippery slope towards full-blown unit testing. I'll think a bit more and maybe put together another post with some of the major places where I think problems are most likely.
This comment has been removed by the author.
hi Michael - I think you are right that refactoring into a more general library would be one useful way to address this issue. I am getting better at that, but old habits still die hard.
By the way, it just occurred to me to point out that the GitHub pull request interface is probably the best solution for collaborative code review. Here's an example from the scikit-learn repo: https://github.com/scikit-learn/scikit-learn/pull/1628 that I think exemplifies pull requests at their best. You can comment at varying levels of granularity, down to specific lines; code is updated live as changes are made; and it provides a public record that's wrapped up together with the code itself.
mw
Thanks for sharing this. Validation and testing are two of the least thought-about areas in the brain imaging community, and that needs to change. Michael Hanke is leading a project called testkraut, an effort to address testing concerns for workflows and software.
Also, our current publication life cycle, peer review, and culture need to address reproducibility as a core principle. I would rather see half the number of publications from a lab as long as each publication was independently verified on a different dataset. This would not only address the possibility of code error but would also provide an additional stamp on the veracity of the result.
The smarter the programming language, the more cautious one should be. All those implicit conversions can be extremely dangerous (in C++ as well). Matlab, I think, is better in this respect.
The striking thing that I think is completely missing in academic coding, compared to the software industry, is code peer review. In many companies you simply cannot commit code to the source-control system without an additional pair of eyes looking at it. It is amazing how many bugs one can catch just by explaining to someone what one has just programmed. So, basically, in many cases it can even be a doll rather than a peer :)
That's a great point about code review - we have started doing this in my lab (we have the luxury of a full-time software developer on staff). Perhaps I should go out and buy a bunch of dolls for the lab... :-)
Here is another recent example where what looks like a very small error (using an int instead of a float for a specific variable) resulted in unreasonably high accuracy on the Netflix prediction problem - http://nuit-blanche.blogspot.fr/2013/02/this-weeks-guardians-of-science-zeno.html