How long does it take to produce scientific software?

Over here at UC Davis, the Lab for Data Intensive Biology has been on extended walkabout developing software for, well, doing data intensive biology.

Over the past two to three years or so, various lab members have been working on the following new pieces of software -

dammit, de novo transcriptome annotation pipeline (Camille Scott);

kevlar, reference free variant discovery in large eukaryotic genomes (Daniel Standage), in collaboration with Fereydoun Hormozdiari;

sourmash, a MinHash-based sequence analysis framework (Luiz Irber, Phillip Brooks, Taylor Reiter, and several others);

spacegraphcats, a compact De Bruijn graph search system (this is a collaboration with Blair Sullivan and her group including Mike O'Brien and Felix Reidl, as well as Dominik Moritz of Jeff Heer's group);

boink, a De Bruijn graph processing framework (Camille Scott).

I should say that all of these except for kevlar have been explicitly supported by my Moore Foundation funding from the Data Driven Discovery Initiative.

With the possible exception of dammit, every single one of these pieces of software was developed entirely since the move to UC Davis (so, since 2015 or later). And almost all of them are now approaching some reasonable level of maturity, defined as "yeah, not only does this work, but it might be something that other people can use." (Both dammit and sourmash are being used by other people already; kevlar, spacegraphcats, and boink are being written up now.)

All of these coming together at the same time seems like quite a coincidence to me, and I would like to make the following proposition:

It takes a minimum of two to three years for a piece of scientific software to become mature enough to publicize.

This fits with my previous experiences with khmer and the FamilyRelations/Cartwheel set of software as well - each took about two years to get to the point where anyone outside the lab could use it.

I can think of quite a few reasons why some level of aging could be necessary -

often in science you have no real idea of what you're doing at the beginning of a project, and that just takes time to figure out;

code just takes time to get reasonably robust when interfacing with real world data;

there are lots of details that need to be worked out for installation and distribution of code, and that also just takes time;

but I'm somewhat mystified by the 2-3 year arc. It could be tied to the funding timeline (the Moore grant ends in about a year) or career horizons (the grad students want to graduate, the postdocs want to move on).

My best guess, tho, is that there is some complex tradeoff between scope and effort that breaks the overall software development work into multiple stages - something like,

figure out the problem

implement a partial solution

make an actual solution

expand solution cautiously to apply to some other nearby problems.

I'm curious as to whether or not this pattern fits with other people's experiences!

I do expect these projects to continue maturing as time and opportunity permit, much like khmer. boink, spacegraphcats, and sourmash should all result in multiple papers from my lab; kevlar will probably move with Daniel to his next job, but may be something we also extend in our lab; etc.

Another very real question in my mind is: which software do we choose to maintain and extend? It's clearly dependent on funding, but also on the existence of interesting problems that the software can still address, and on who I have in my lab... right now a lot of our planning is pretty helter-skelter, but it would be good to articulate a list of guiding considerations for when I do see pots of money on the horizon.

Finally: I think this 2-3 year timeline has some interesting implications for the question of whether or not we should require people to release usable software. I think it's a major drain on people to expect them to not only come up with some cool new idea and implement it in software they can use, but then also make software that is more generally usable. Both sides of this take special skills - some people are good at methods & algorithms development, some people are good at software development, but very few people are good at both. And we should value both, but not require that people be good at both.

--titus

Living in an Ivory Basement Stochastic thoughts on science, testing, and programming.
