Replication, reproduction, and remixing in research software

I just spent a really fun and exciting two hours installing a piece of software that I needed to run to do a paper review. The software itself downloaded, but failed routinely on their own test data; after delving through four layers of Perl and Python, I discovered that the problems lay in my having installed the wrong version of two pieces of underlying software. One, a common microbiology framework whose name rhymes with 'rhyme', had been updated since the release of the software under review, and the new version of the framework is incompatible with the software; since the software under review didn't specify the versions of any of its dependencies, I had to hunt around to figure out what it was looking for, find the version containing that library by surfing github, and install it. That was easy. The other problem was caused by a hidden dependency (four layers deep!) that failed silently but resulted in a more visible failure a few lines of code later; this was written in (I think) Python that called Perl that called Python that called a binary executable, and so I had to grub through Perl a bit.

2 hours. Whee!

This resulted in a not-very-disguised rant on research software quality, which was at least partly in response to Mick Watson's post on bioinformatics police. In this post -- which you should go read, it has a surprise ending, and Mick apparently needs the traffic for his Google ads -- Mick says:

First of all, I want to state quite clearly that I am not a code Nazi. I don't care about your coding practices. Good architecture, an elegant object model, a stable API, version control, efficient code reuse, efficient code etc etc. I don't care. Write all the unit tests you like, because if they fail, I’ll just force the install anyway. I don't care if you used extreme programming, whether or not you involved Tibetan monks and had your github repository blessed by the Dalai Lama. Maybe you ensured the planets were aligned before you released version 0.1, or made sure all of your code monkeys had perfect Feng Shui in their bedsits. I don't care. That don't impress me much.

However, I do care that your code goddamn works.

I think, as a scientist, if I take some published code, that it should work. Not much too ask is it? Sure, a readme.txt or a manual.pdf would be nice too, but first and foremost, it has to just do the eff-ing job it's supposed to.

I think this is really confusing, to the point of being wrong. Now, Mick has been wrong before, but this time I think he's so wrong that it's usefully wrong.

First, Mick and others noted that "surely you shouldn't be complaining since you got the car for free". Since I wrote my little rant in a hurry (between child care obligations), I used a somewhat clumsy analogy and failed to properly point out that while some research software may be "free" in the sense that you don't pay for downloading and using, someone has -- generally you, the taxpayer.

But that's not really the most important point.

I started thinking about this back when I wrote my most depressing blog post ever -- depressing because people were actually arguing in the comments that it's OK to write bad code. I followed that post up with another post on code quality and testing. At this point realized that there were really several different things being conflated in the discussion about research software, and I ended up nailing this down in my own head when I wrote about making science better -- a fine example of essay writing generating surprises.

Here is my conclusion:

The three uses of research software

Replication -- if you used software to do something important, and publish it, and we can't replicate it by using the same software, FAIL. For all intents and purposes, the software can be a big black box -- all we need to do to replicate your results are run it on your data, or someone else's data. This is where things like RunMyCode come in, by making it easy to distribute runtime environments.

Reproduction -- considered by some to be more important than replication, reproducibility (often confused with replication, by me and others) is the question of whether or not your answers can be reached via different means. This is considered a holy grail of the science process: if other people can get the same result without using an identical process, then the result is more likely to be correct. (I think this is over-emphasized because systematic bias can be very reproducible, but it's still important.)

Reuse -- in bioinformatics, specifically -- and scientific software development more generally -- reuse and remixing is very important. I think this is the key point that many just don't think about. Science isn't just about discovering facts; it's about making progress in what we know. This can be accelerated by reusable, remixable tools. Any one individual end goal may be knowing some fact or set of facts about something, but the process by which we reach that goal will often better enable others to reach their end goal faster, better, cheaper, and more accurately. This is the point I failed to make well in my post Virtual machines are harmful to reproducibility; somewhat ironically, Mick agrees with me on that one :). Science can be most easily accelerated if you make your source code available so that others can riff off it.

I have read many arguments against this: that publishing a theoretical description of an algorithm is enough; or that it's actually harmful to others to provide the source, because lacking source forces people to reproduce your work rather than merely replicate; or that publishing code obligates you to support; or that publishing bad code is a bad idea, and you need to clean it up to publish it. Bushwah. These specific objections are easily answered ((a) efficient and correct implementation matters, and the algorithmic description often masks important implementation details; plus, it's hard!; (b) as Victoria Stodden points out, what do you do when two implementations disagree? Write a third? No, you compare the implementations, for which you need the source; (c) No, it doesn't; (d) the main reason people avoid publishing code and data is because they're afraid it's wrong (and for good reason, apparently)), which indicts the whole field). None of these arguments hold up, IMO.

I personally hate anecdotal science tremendously, and I keep on coming back to that SUPER awesome paper with a data mining approach we wanted to try... but with a script that had a syntax error in line 2. Grr. Reuse, blocked; I didn't trust any of their work after that. (A good guess on my part -- the entire approach turned out to be too fragile and parameter dependent to use, and frankly the paper should not have been published.)

My inability to use your software aside, though, I think the main point is this:

Bad code is often wrong code

Sure, you don't need (and I certainly don't have ;) many of the things that Mick argues are irrelevant: good architecture, an elegant object model, a stable API, efficient code, etc. Most of these are about explicit code reuse, and odds are high that no one is ever going to look at or reuse your code -- it just needs to be possible to do so, for all three of the reasons above.

But, Mick? I'll fight you to the death on version control. Why?

Writing correct code is hard, and a vast array of effort has been brought to bear on code correctness over the years; it is simply stupid to ignore this experience. This is the point we try to make in our Best Practices for Scientific Computing paper -- you don't have to use version control, but you have a great chance of introducing regressions if you don't. You don't need to write tests of any kind, but this goes against the experience of virtually every modern software professional you talk to. Et cetera.

If you buy a car and it doesn't work in obvious ways you should be very skeptical about the engineers who designed it. For example, you might not want to cross the bridge that they designed, or fly in an airplane. Why would I treat scientific software any differently?

But you don't need to listen to me on this -- no less of an expert than Van Halen makes the same point: paying attention to the details is an indicator of general competence.

The bottom line is this: if the code looks badly written and ignores essentially all major tenets of modern software design, it's probably seriously wrong in places. Not because the authors aren't good scientists, not because of some lack of Dalai Lama blessing, but because software engineering is hard hard hard, and if you can't be bothered to learn how to use version control, you shouldn't be trusted to write important software.

This is true in much the same way that using basic lab practices are both importand indicative. If you wander into someone's lab and you see someone using TA buffer with lots of solid precipitate to pour a gel shift gel under the advisor's eyes, might you not wonder about the reliability of said lab's results? If the lab's PI says "don't worry about those negative PCR controls, they're always negative and it's a waste of reagents to run them" -- run screaming, amiright?

Every now and then some slick shyster comes my way (usually Randy Olson or someone else from Chris Adami's lab) and explains how honest-to-gosh, they have found that unit testing isn't as important as, say, functional testing in their simulations. Great! You have a reason based on experience -- I respect your right to have an opinion! It's the people who blithely dismiss Practice X (version control, usually) because "it's not that important, and I never learned it anyway" that drive me nuts and turn me stabby.

Punting on software remixability

A few final words, courtesy of my late night experience with software installs.

If you say "this software works best when we install it for you and give you a virtual machine", you are essentially punting on the idea that anyone will ever combine your software with anyone else's.

If you provide no documentation anywhere, and no README, then I am pretty sure you're not serious about anyone else ever using it. (How hard is this, really?)

If you rely on other packages but never specify a version number or test for "correct" output of packages you depend on, the odds are that your software will bitrot to unusability quite quickly. Please don't do that. Your software looks useful and I'd like to try it out in 6 months, after you've moved on to something else.

It's still all about the incentives

I don't actually harbor much anger towards the software that expended so much of my time -- the software seems to work now, and it's not that badly written; I intend to submit patches or bug reports to further improve it. Mick is right that software needs to enable good biology, above all else, and that's what I'm trying to evaluate in the review. Sure, my life would be easier if the software had been written with more of eye towards bitrot, and I'm loathe to recommend it to newbies, but...

...I recognize that the explicit incentives for writing good, reusable software are lacking. I'm going to keep on trucking, though, because it seems to be working. And I'll see *you* from the other side of an anonymous review sheet :).

One final thought for y'all. As Data of Unusual Size continues to make inroads into science, more and more software will be written, and more and more of the conversation needs to be about good software capacity building, aka software cyberinfrastructure. Big Data is sufficiently inconvenient that hastily or badly written software infrastructure will doom you to irrelevance. Worth a think.

--titus

p.s. Need training and exposure to good scientific computing practice? Know Python, will travel. Drop us a line.

p.p.s. Stop hosting code on your lab Web site.

Comments !