Archivability is a disaster in the software world
In The talk I didn't give at Caltech, I pointed out that our current software stack
is connected, brittle, and non-repeatable. This affects our ability to
store and recover science from archives.
Basically, in our lab, we find that our executable papers routinely
break over time because of minor changes to dependent packages or
Yes, the software stack is constantly changing! Why?!
Let me back up --
Our analysis routines usually depend on an extensive hierarchy of
packages. We may be writing bespoke scripts on top of our own library,
but those scripts and that library sit on top of other libraries, which
in turn use the Python language, the GNU ecosystem, Linux, and a bunch
of firmware. All of this rests on a not-always-that-sane hardware
implementation that occasionally throws up errors because x was compiled
on y processor but is running on z processor.
We've had every part of this stack cause problems for us.
many current repeatability stacks are starting to rely on Docker.
But Docker changes routinely, and it's not at all clear that the
images you save today will work tomorrow. Dockerfiles (which
provide the instructions for building images) should be more robust,
but there is a tendency to have Dockerfiles run complex shell
scripts that may themselves break due to software stack changes.
But the bigger problem is that Docker just isn't that robust.
Don't believe me? For more, read this and weep: Docker in
Production: A history of Failure.
software stacks are astoundingly complex in ways that are mostly
hidden when things are working (i.e. in the moment) but that block
any kind of longitudinal robustness. Perhaps the best illustration
"left-pad" pulled it from the packaging system, throwing the
practically, we can already see the problem - go sample from
A gallery of interesting Jupyter Notebooks. Pick five. Try to
run them. Try to install the stuff needed to run them. Weep in despair.
(This is also true of mybinder repos, just in case you're wondering;
many of my older ones simply don't work, for a myriad of reasons.)
These are big, real issues that affect any scientific software that relies
on any code written outside their project (which is everyone - see "Linux
operating system" and/or "firmware" above.)
My conclusion is that, on a decadal time scale, we cannot rely on software
to run repeatably.
This connects to two other important issues.
First, since data implies software,
we're rapidly moving into a space where the long tail of data is going
to become useless because the software needed to interpret it is
vanishing. (We're already seeing this with 454 sequence data, which
is less than 10 years old; very few modern bioinformatics tools will
ingest it, but we have an awful lot of it in the archives.)
Second, it's not clear to me that we'll actually know if the
software is running robustly, which is far worse than simply having it
break. (The situation above with Jupyter Notebooks is hence less
problematic than the subtle changes in behavior that will come from
e.g. Python 5.0 fixing behavioral bugs that our code relied on in
I expect that in situations where language specs have
changed, or library bugs have been fixed, there will simply be silent
changes in output. Detecting this behavior is hard. (In our own
khmer project, we've started including tests that compare the md5sum
of running the software on small data sets to stored md5sums, which
gets us part of the way there, but is by no means sufficient.)
If archivability is a problem, what's the solution?
So I think we're heading towards a future where even perfectly
repeatable research will not have any particular longevity, unless
it's constantly maintained and used (which is unrealistic for most
research software - heck, we can't even maintain the stuff we're using
right now this very instant.)
Are there any solutions?
First, some things that definitely aren't solutions:
Saving ALL THE SOFTWARE is not a solution; you simply can't, because
of the reliance on software/firmware/hardware interactions.
Blobbing it all up in a gigantic virtual machine image simply pushes
the turtle one stack frame down: now you've got to worry about keeping
VM images running consistently. I suppose it's possible but I don't
expect to see people converge on this solution anytime soon.
More importantly, VMs and docker images may let you reach bitwise
reproducibility, but they're not scientifically useful because
they're big black boxes that don't really let you reuse or remix the
contents; see Virtual machines considered harmful for
The post-apocalyptic world of binary containers.
Not using or relying on other software isn't a practical solution:
first, good luck with that ;). Second, see "firmware", above.
And, third, while there is definitely a crowd of folk who like to
reimplement everything themselves, there is every likelihood that
their code is wronger and/or buggier than widely used community
software; Gael Varoquaux makes this point very well in his blog
post, Software for reproducible science.
I don't think trading archivability for incorrectness is a good trade :).
The two solutions that I do see are these:
run everything all the time.
This is essentially what the software world does with continuous integration.
They run all their tests and pipelines all the time, just to check that
it's all working. (See "Continuous integration at Google Scale".)
Recently, my #MooreData colleagues Brett Beaulieau and Casey Greene
proposed exactly this for scientific papers, in their preprint
"Reproducible Computational Workflows with Continuous Analysis".
While this is a potential solution, it's rather heavyweight to set
up, and (more importantly) it gets kind of expensive -- Google runs
many compute-years of code each day -- and I worry that the cost to
utility ratio is not in science's favor. This is especially true
when you consider that most research ends up being a dead end -
unread, uncited, and unimportant - but of course you don't know which until
acknowledge that exact repeatability has a half life of utility, and that
this is OK.
I've only just started thinking about this in detail, but it is at
least plausible to argue that we don't really care about our ability
to exactly re-run a decade old computational analysis. What we do
care about is our ability to figure out what was run and what the
important decisions were -- something that Yolanda Gil refers to as
"inspectability." But exact repeatability has a short shelf-life.
This has a couple of interesting implications that I'm just starting to
- maybe repeatability for science's sake can be thought of as a
short-term aid in peer review, to make sure that the methods are
suitably explicit and not obviously incorrect. (Another use for
exact repeatability is enabling reuse and remixing, of
course, which is important for scientific progress.)
- as we already knew, closed source software is useless crap because
it satisfies neither repeatability nor inspectability. But maybe
it's not that important (for inspectability) to allow software
reuse with a F/OSS license? (That license is critical for reuse
and remixing, though.)
- maybe we could and should think of articulating "half lives" for
research products, and acknowledge explicitly that most research
won't pass the test of time.
- but perhaps this last point is a disaster for the kind of
serendipitous reuse of old data that Christie Bahlai and Amanda
Whitmire have convinced me is important.
Huge (sincere) thanks to Gail for arguing both sides of this,
including saying that (a) archive longevity is super important
because everything has to be saved or else it's a disaster for
humanity, and (b) maybe we don't care about saving everything
because after all we can still read Latin even if we don't actually
get the full cultural context and don't know how to pronounce the
words, and (c) hey maybe the full cultural context is important and
we should endeavor to save it all after all.
Lots for me to think on.