Wednesday, December 9, 2015

Reproducible analysis in the MyConnectome project

Today our paper describing the MyConnectome project was published in Nature Communications.  This paper is unlike any that I have ever worked on before (and probably ever will again), as it reflects analyses of data collected on myself over the course of 18 months from 2012-2014.  A lot has been said already about what the results might or might not mean.  What I want to discuss here is the journey that ultimately led me to develop a reproducible shared analysis platform for the study.

Data collection was completed in April 2014, shortly before I moved to the Bay Area, and much of that summer was spent analyzing the data.  As I got deeper into the analyses, it became clear that we needed a way to efficiently and automatically reproduce the entire set of analyses.  For example, there were a couple of times during the data analysis process when my colleagues at Wash U updated their preprocessing strategy, which meant that I had to rerun all of the statistical analyses that relied upon those preprocessed data. This ultimately led me to develop a python package (https://github.com/poldrack/myconnectome) that implements all of the statistical analyses (which use a mixture of python, R, and **cough** MATLAB) and provides a set of wrapper scripts to run them.  This package made it fairly easy for me to rerun the entire set of statistical analyses on my machine by executing a single script, and provided me with confidence that I could reproduce any of the results that went into the paper.  
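To give a flavor of what that single entry point looks like, here is a minimal sketch of the general idea; the script and function names below are hypothetical illustrations, not the actual myconnectome package API:

```python
"""Hypothetical driver script illustrating the single-entry-point idea;
the step names below are made up for illustration and are not the actual
myconnectome package API."""

import subprocess
import sys

# Each analysis step is a separate script (python, R, or MATLAB wrapped in
# a shell call); running them in a fixed order makes the full set of
# statistical analyses repeatable with one command.
ANALYSIS_STEPS = [
    [sys.executable, "prep_behavioral_data.py"],   # hypothetical step
    ["Rscript", "timeseries_analyses.R"],          # hypothetical step
    [sys.executable, "make_html_reports.py"],      # hypothetical step
]

def run_all():
    for cmd in ANALYSIS_STEPS:
        print("Running:", " ".join(cmd))
        # Fail loudly if any step breaks, so a partial rerun is obvious.
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_all()
```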

The next question was: Can anyone else (including myself at some later date) reproduce the results?  I had performed the analyses on my Mac laptop using a fairly complex software stack involving many different R and python packages, applied to a fairly complex set of imaging, genomic, metabolomic, and behavioral data.  (The imaging and -omics data had been preprocessed on large clusters at the Texas Advanced Computing Center (TACC) and Washington University; I didn’t attempt to generalize this part of the workflow.)  I started by trying to replicate the analyses on a Linux system; identifying all of the necessary dependencies was an exercise in patience, as the workflow would break at successively later points in the process.  Once I had the workflow running, the first analyses showed very different results between the platforms; after the panic subsided (fortunately this happened before the paper was submitted!), I tracked the problem down to a difference in the behavior of the R forecast package between Linux and Mac (code to replicate the issue is available here).  It turned out that the auto.arima() function (which is the workhorse of our time series analyses) returned substantially different results on the Linux and Mac platforms if the Y variable was not scaled (apparently due to a bug on the Linux side), but very close results when the Y variable was scaled. Fortunately, the latest version of the forecast package (6.2) gives identical results across Linux and Mac regardless of scaling, but the experience showed just how fragile our results can be when we rely upon complex black-box analysis software, and how we shouldn't take cross-platform reproducibility for granted (see here for more on this issue in the context of MRI analysis).
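That kind of discrepancy only surfaced because I explicitly compared the numbers produced on the two machines. As a rough sketch of that sort of check (not the actual code we used, and assuming hypothetically that each platform writes its test statistics to a CSV file), something like this is enough to flag a platform difference:

```python
"""Rough sketch of a cross-platform results check (not the code used for
the paper): compare the same set of statistics saved on two machines."""

import numpy as np
import pandas as pd

# Hypothetical file names; in practice these would be the same results
# file generated by the same workflow on each platform.
mac_results = pd.read_csv("results_mac.csv", index_col=0)
linux_results = pd.read_csv("results_linux.csv", index_col=0)

# Align on the shared set of tests before comparing values
# (this sketch assumes the two files share the same columns).
shared = mac_results.index.intersection(linux_results.index)
diff = (mac_results.loc[shared] - linux_results.loc[shared]).abs()

print("maximum absolute difference:", float(diff.to_numpy().max()))
if not np.allclose(mac_results.loc[shared].to_numpy(),
                   linux_results.loc[shared].to_numpy(),
                   atol=1e-6):
    print("WARNING: results differ between platforms beyond tolerance")
```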

Having generalized the analyses to a second platform, the next logical step was to generalize them to any machine.  After discussing the options with a number of people in the open science community, the two most popular candidates were provisioning a virtual machine (VM) using Vagrant or creating a Docker container.  I ultimately chose to go with the Vagrant solution, primarily because it was substantially easier; in principle you simply set up a Vagrantfile that describes all of the dependencies and type “vagrant up”.  Of course, this “easy” solution took many hours to actually implement successfully, because it required reconstructing all of the dependencies that I had taken for granted on the other systems; but once it was done, we had a system, available at https://github.com/poldrack/myconnectome-vm, that allows anyone to recreate the full set of statistical analyses exactly on their own machine.

A final step was to provide a straightforward way for people to view the complex set of results.  Our visualization guru, Vanessa Sochat, developed a flask application (https://github.com/vsoch/myconnectome-explore) that provides a front end to all of the HTML reports generated by the various analyses, as well as a results browser that allows one to browse the 38,363 statistical tests that were computed for the project.  This browser is available locally if one installs and runs the VM, and is also accessible publicly at http://results.myconnectome.org.
[Screenshot: Dashboard for analyses]

[Screenshot: Browser for timeseries analysis results]
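To give a sense of what such a front end involves, here is a toy sketch of serving pre-generated HTML reports with flask; this is purely illustrative and is not the code in myconnectome-explore:

```python
"""Toy sketch of a flask front end for pre-generated HTML reports;
illustrative only, not the code in myconnectome-explore."""

import os
from flask import Flask, render_template_string, send_from_directory

app = Flask(__name__)
REPORT_DIR = "analysis_reports"  # hypothetical directory of HTML output

INDEX_TEMPLATE = """
<h1>Analysis dashboard</h1>
<ul>
  {% for name in reports %}
  <li><a href="/reports/{{ name }}">{{ name }}</a></li>
  {% endfor %}
</ul>
"""

@app.route("/")
def index():
    # List whatever HTML reports the analysis pipeline has produced.
    reports = sorted(f for f in os.listdir(REPORT_DIR) if f.endswith(".html"))
    return render_template_string(INDEX_TEMPLATE, reports=reports)

@app.route("/reports/<path:name>")
def show_report(name):
    # Serve a single pre-generated report file.
    return send_from_directory(REPORT_DIR, name)

if __name__ == "__main__":
    app.run(debug=True)
```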

We have released code and data with papers in the past, but this is the first paper I have ever published that attempts to include a fully reproducible snapshot of the statistical analyses.  I learned a number of lessons in the process of doing this:
  1. The development of a reproducible workflow saved me from publishing a paper with demonstrably irreproducible results, due to the OS-specific software bug mentioned above.  This in itself makes the entire process worthwhile from my standpoint.
  2. Converting a standard workflow to a fully reproducible workflow is difficult. It took many hours of work beyond the standard analyses to develop a working VM in which all of the analyses run automatically; that doesn’t even count the time that went into developing the browser. Had I started the work within a virtual machine from the beginning, it would have been much easier, but it still would have required extra work beyond that needed for the basic analyses.
  3. Ensuring longevity of a working pipeline is even harder.  The week before the paper was set to be published, I tried a fresh install of the VM to make sure it was still working.  It wasn’t.  The problem was simple (miniconda had changed the name of its installation directory), but it highlighted a significant flaw in our strategy: we had not pinned software versions in our VM provisioning (a sketch of that kind of version check appears below).  I hope that we can add that in the future, but for now, we have to keep our eyes out for the disruptive effects of software updates.
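As a sketch of what that version check might look like (an illustration of the idea, not something that is currently part of the VM provisioning, and with hypothetical pinned versions), one could pin the expected package versions and verify them before the analyses start:

```python
"""Sketch of a version-pinning check (an illustration of the idea, not
something that currently exists in the myconnectome VM provisioning)."""

from importlib.metadata import PackageNotFoundError, version

# Hypothetical pinned versions; in practice these would be the exact
# versions the analyses were originally run against.
PINNED_VERSIONS = {
    "numpy": "1.9.2",
    "pandas": "0.16.2",
    "flask": "0.10.1",
}

def check_pins(pins):
    ok = True
    for package, expected in pins.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            print(f"MISSING: {package} (expected {expected})")
            ok = False
            continue
        if installed != expected:
            print(f"MISMATCH: {package} {installed} != pinned {expected}")
            ok = False
    return ok

if __name__ == "__main__":
    # Refuse to run the analyses in an environment that drifted from the pins.
    if not check_pins(PINNED_VERSIONS):
        raise SystemExit("environment does not match pinned versions")
```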
I look forward to your comments and suggestions about how to better implement reproducible workflows in the future, as this is one of the major interests of our Center for Reproducible Neuroscience.