About 10 days ago, I gave a talk in Manchester to Carole Goble's
group, hosted by Aleksandra Pawlik. The talk title was "Six ways to
Sunday: Approaches to computational reproducibility in non-model
sequence analysis." I've posted the slides (here).
For the talk, I put together a list of five things that I felt were "myths" of
reproducible computational research: attitudes that wrongly
discouraged people from actually doing computational research
reproducibly. I thought it was worth reproducing them below with a bit
of discussion.
Myth 1: Partial reproducibility is hard
Some people seem wary of the mountain of work that any true attempt at
reproducible computational science must entail, and so they don't want
any part of it.
Bah!
Start small; it's easy. Let me tell you: anything you do --
providing the raw data, posting even small scripts, detailing the
versions of the programs you used together with their parameters -- will
be tremendously welcome to anyone trying to validate or build on your
paper. That includes you, yourself, two years from now.
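To make "start small" concrete, here's a minimal sketch of the kind of
provenance logging I mean: record each program's version and exact
command line next to its output. (The tool name "assembler" and its
arguments below are hypothetical placeholders; swap in your own.)

```python
#!/usr/bin/env python
# Minimal provenance-logging sketch. The tool ("assembler") and its
# arguments are hypothetical; substitute your own pipeline step.
import subprocess

def run_and_log(cmd, logfile="provenance.log"):
    """Record a tool's version and exact command line, then run it."""
    # Most tools report a version via --version; adjust if yours differs.
    version = subprocess.check_output(
        [cmd[0], "--version"], stderr=subprocess.STDOUT).decode().strip()
    with open(logfile, "a") as log:
        log.write("version: %s\n" % version)
        log.write("command: %s\n" % " ".join(cmd))
    subprocess.check_call(cmd)

run_and_log(["assembler", "-k", "21", "-o", "contigs.fa", "reads.fq"])
```

Ten lines of script, and anyone reading provenance.log knows exactly
what you ran and how.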
Myth 2: Incomplete reproducibility is useless
Reproducibility of 20% or 50% or 80% is still better than 0%. Sure,
100% is better than 80%, but getting to 100% may take a lot more work
than getting to 80%, so why not start small and target something that
seems achievable first? (It's OK, I give you permission!)
In practice, anything you do will be useful. (See #1.)
Myth 4: Virtual machines solve the reproducibility problem
Maybe they do, but not usefully so.
Look, VMs are just giant black boxes. Your VM could be a giant lookup
table that doesn't implement any actual methods but just regurgitates
already calculated results. No one would be able to tell -- you're
just providing an interface to something, right? But is that really useful
in helping people understand what you did, replicate it on their
own, or reuse and remix it? No, not at all.
And yes, I've made this argument before. Bill
even says that he sorta agrees with me now.
Myth 5: GUIs are the way to go, because scientists might actually use easy-to-use software.
Bioinformaticians like the idea of Galaxy (just to pick on one that I
actually kinda like ;) because they think that it will make it easy for
biologists to do computation. And hey, look, you get reproducibility for
free!
Unfortunately, at least in the areas of bioinformatics I work in, things
change so rapidly that the GUIs don't keep up. By the time a method
makes it into a GUI, it's rarely still cutting edge, which doesn't fit
well with most of the research that I see being done.
But there's a bigger problem: almost all data analysis steps take
place in a larger pipeline. The GUI has to wrap the entire pipeline,
or else be scriptable in order to fit in; otherwise, it actually
presents an obstacle to reproducibility, because you have to capture
its parameters somewhere else.
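One way out (a sketch, not any particular GUI's mechanism; the tool
names and parameters below are made up) is to drive the whole pipeline
from a single parameters file, so that sharing that file plus the
script captures the entire analysis in one place:

```python
#!/usr/bin/env python
# Sketch: all pipeline parameters live in one JSON file, so the file
# plus this script is a complete record of the run. The tools
# ("trim-reads", "assemble") and parameters are hypothetical.
import json
import subprocess

with open("params.json") as f:
    params = json.load(f)  # e.g. {"min_length": 200, "kmer": 21}

subprocess.check_call(["trim-reads", "--min-length",
                       str(params["min_length"]),
                       "-o", "trimmed.fq", "reads.fq"])
subprocess.check_call(["assemble", "-k", str(params["kmer"]),
                       "-o", "contigs.fa", "trimmed.fq"])
```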
Galaxy is a decent compromise: it is a GUI that provides access to
cloud resources; it wraps command lines, so you can pretty quickly
integrate the latest thing; and reproducibility comes "for free".
But (like all the other GUIs I've seen) it's pretty limiting in terms
of what you can wrap with it.
Less talk, more work
At the end of the day, my lab's experience is this: you don't need
much in the way of "magic sauce" to get started doing things
reproducibly. You need a little bit of experience or training (cue
Software Carpentry), and a little
bit of elbow grease the first time around, but it's just not that hard
to get started and make your research life better.
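For what "elbow grease the first time around" looks like in practice,
here's a sketch of the simplest possible top-level driver (the stage
scripts named below are hypothetical): every stage is an ordinary
script, and re-running the entire analysis -- say, for a paper
revision -- is one command.

```python
#!/usr/bin/env python
# Sketch of a "run everything" driver: each stage is a plain script,
# run in order. The stage names below are hypothetical placeholders.
import subprocess

STAGES = [
    "01-download-data.py",
    "02-assemble.py",
    "03-annotate.py",
    "04-make-figures.py",
]

for stage in STAGES:
    print("running %s" % stage)
    subprocess.check_call(["python", stage])
```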
Fundamentally, my lab now does this stuff 'cause it makes everything
easier. Automation reduces our maintenance burden; explicit workflows
built for extensibility make paper revisions much easier; explicit
instructions are good for training students. At this point I feel
like we're moving clearly into the "virtuous cycle" stage, where the
positive feedback of doing everything like this turns us into
unstoppable Borg-like computational scientists. We'll let you know
how that goes.
--titus