Virtual machines considered harmful for reproducibility

In his paper, Reproducible Research and Cloud Computing, Bill Howe asks:

"What happens if you do all your work on a virtual machine hosted in the cloud? When it came time to publish, you might make a snapshot of the VM, make it public, and cite it in your paper. Those who wish to reproduce your experiments would launch the virtual machine (on their dime) and have access to your entire experimental environment --- the code, the data, the environment, log files, notes, etc. There would be no need to install a network of complex, version-sensitive inter-dependent prerequisites."

Indeed, what a good idea! But not, I think, sufficient.

This idea -- that posting a VM is sufficient for reproducibility -- has hit my Twitter feed a couple of times now, and each time I feel compelled to make the point that this isn't useful reproducibility. Mick Watson put it best when he said you can't install an image for every pipeline you want.

To put it another way, it is certainly true that posting a virtual machine is a way to make your research reproducible. It's just not a very useful way, in the sense that it effectively blocks remixing or mashing up the code. In my post on the diginorm paper I made this point in response to some pooh-poohing of replicability:

"Fifth, and probably most significant from a practical perspective, Graham misses the point of reuse. In bioinformatics, it behooves us to reuse proven (aka published) tools -- at least we know they worked for someone, at least once, which is not usually the case for newly written software."

In essence, providing a gigantic black box of custom installed code that was installed, set up, and executed by experts just isn't very useful to many people.

I think the ENCODE effort did it about 3/4 right --

As part of the supplementary material for this paper, we have established a virtual machine instance of the software, using the code bundles from, where each analysis program has been tested and run.

I could have wished that each bit of code was in a separate git or hg repository, for example, or that there were small test data sets; this would have maximized my ability to dive into the code and play with it. But this is a giant step forward all on its own, compared to what pretty much everyone else does! And it's really fantastic to see it being done by a massive genomics consortium.

Bill Howe commented that I'm conflating the publication of "one off" experiments (which require reproducibility) with the dissemination of reusable software, and that we should enable the first via whatever mechanisms we can, given the poor status quo. I disagree, mainly because it's not capacity building: releasing shoddy VMs is easy to do, but it doesn't help you learn how to do a better job of reproducibility along the way. Releasing software pipelines, however crappy, is on the path towards better reproducibility.

A related topic that also comes up occasionally is distributing software via VM. Scott Cain gave a great talk at BOSC 2012 on Tripal, a Web interface for Chado, and mentioned that there's a VM available. During the Q&A I apparently confused him by recommending that instead of, or in addition to, a VM, he provide a source code repository along with an install script pinned to a particular Linux install -- something that's really easy to do these days, what with the clouds and open sourciness. His response was "why would you need anything more than the VM?" Again, the reason comes back to Mick's observation above: you can't (or at least shouldn't need to) install a VM for every software package or pipeline you need to execute!

If you think about the dependency chain here, it's easy to build a VM if you automate the install process, and providing that install script for even one OS can demystify the install process for others; conversely, just because you provide a VM doesn't mean that anyone other than you can install your software. So why not make life easy for everyone?
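To make the "install script pinned to a particular Linux install" idea concrete, here is a minimal sketch in Python. Everything specific in it is a made-up placeholder for illustration: the pinned OS release, the package list, the repository URL and tag, and the assumption that the repository has a Makefile with a `test` target. It only shows the shape of the thing: an explicit, readable list of steps that someone else can inspect, adapt to their own OS, or feed into a VM build.

```python
"""Hypothetical install-script sketch: pinned OS, pinned source tag, dry run by default."""
import shlex
import subprocess

# All values below are illustrative assumptions, not taken from any real pipeline.
PINNED_OS = ("ubuntu", "22.04")
APT_PACKAGES = ["git", "build-essential", "zlib1g-dev"]
SOURCE_REPO = "https://example.org/mylab/pipeline.git"
SOURCE_TAG = "v1.0"


def build_steps():
    """Return the ordered list of commands that install the pipeline."""
    return [
        ["apt-get", "update"],
        ["apt-get", "install", "-y"] + APT_PACKAGES,
        ["git", "clone", "--branch", SOURCE_TAG, SOURCE_REPO, "pipeline"],
        # Assumes the (hypothetical) repository ships a Makefile with a 'test' target.
        ["make", "-C", "pipeline", "test"],
    ]


def parse_os_release(text):
    """Parse KEY=VALUE lines in the style of /etc/os-release into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"')
    return fields


def matches_pinned_os(text):
    """Return True if the os-release text describes the pinned OS release."""
    fields = parse_os_release(text)
    return (fields.get("ID"), fields.get("VERSION_ID")) == PINNED_OS


def run(steps, dry_run=True):
    """Print each step as a shell-quoted line; execute it only if dry_run is False."""
    lines = []
    for step in steps:
        line = " ".join(shlex.quote(arg) for arg in step)
        lines.append(line)
        print(line)
        if not dry_run:
            subprocess.run(step, check=True)
    return lines


if __name__ == "__main__":
    # Dry run: show what would be done without touching the system.
    run(build_steps(), dry_run=True)
```

The dry run is the point here as much as the install: the printed steps double as documentation of the install process, which is exactly what a VM image alone does not give you.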

There is a deeper principle at work here: the distinction between a user and a maker. A user merely wants to take your software and run with it; a maker wants to probe, remix, and mash up your software. To maximize the benefit of our scientific software, we should be enabling makers, not users. To do anything else limits the use of our software to our own imagination, rather than enabling serendipity. And wouldn't that be a shame?

