This is a response to (parts of) Dr. Lior Pachter's post, "The myths
of bioinformatics software". (You can also see my post on bioinformatics software licensing for at least some of the background arguments.)
I agree with a lot of what Lior says: most bioinformatics software is
not very good quality (#1), most bioinformatics software is not built
by a team (#2), licensing is at best a minor component of what makes
software widely used (#3), software should have an expiration date
(#5), most URLs are unstable (#6), software should not be "idiot
proof" (#7), and it shouldn't matter whether you use a specific
programming language (#8).
I strongly disagree with Lior's point #4, in almost every way. I try
to make my software free for everyone, including companies, for both
philosophical reasons and for simplicity; I explained my reasoning in
my blog post.
(Anyone who doesn't think linking against GPL software is reasonably
complicated and nuanced should read through the tweets and comments on that
post!) From my few involvements with working on non-free software, I
would also add that selling software is a tough business, and not one
that automatically leads to any profits; there's a long tail, just as
with everything else, and I long ago decided that my time is worth
more to me than the expected income from selling software would be.
(I would be thrilled if a student wanted to try to make money off of
our work, but my academic work would remain open source.)
Regardless, Lior's opinion isn't obviously wrong, and I appreciate the
discussion.
What surprises me most about Lior's post, though, is that he's
describing the present situation rather accurately, but he's not angry
about it. I'm angry, frustrated, and upset by it, and I really
want to see a better future -- I'm in science, and biology, partly
because I think it can have a real impact on society and health.
Software is a key part of that.
Biology and genomics are changing. Large-scale data analysis is
becoming more and more important to the biomedical sciences, and
software packages like kallisto and khmer are almost certainly going
to be used in the clinic at some point. (I believe some of Broad's
variant calling software is already used in diagnosis and treatment
for cancer, for example, although I don't know the details.) Our
software is certainly being used by people doing basic biomedical
research, although it may not be directly clinical yet - and I think
the quality of computation in basic research matters too.
And this means bioinformatics should grow up a bit. If
bioinformatics is a core component of the future of biology (which I
think is obvious), then the quality of bioinformatics software
matters.
To quote Lior, "Who wants to read junk software, let alone try to edit
it or build on it?" Certainly not me - but then why are we producing
it? Are we settling for this kind of software in biomedical research?
Are we just giving up on producing decent quality software altogether,
because, uh, it's hard? How is this different from doing bad math, or
publishing bad biology - topics that Lior and others get really mad
about?
Lior also quotes a Computational Biology interview with James Taylor,
who says,
"A lot of traditional software engineering is about how to build
software effectively with large teams, whereas the way most
scientific software is developed is (and should be)
different. Scientific software is often developed by one or a
handful of people."
That was true a decade ago, and it may have been a reasonable
reason to avoid using decent software engineering techniques then, but
the landscape has changed significantly in the last decade, with a
wide variety of rapid prototyping, test-driven development, and
lean/agile methodologies being put into practice in startups and large
companies. So I think James is mistaken here.
I wager that the reason a lot of scientists do bad software
engineering is that they can get away with it, not that there
are no techniques they could profitably use. Heck, if they wanted to
learn something about it, Software Carpentry will come teach them workshops
on this very topic, and I'd be happy to offer both Lior and James a
workshop to bring them up to speed. (Note: I don't think either of
them needs my advice, which is actually kind of my point.)
(As for languages, Lior's point #8, there is a persistent expansion of
the Python and R toolchains around bioinformatics and a convergence on
them as the daily workhorses of bioinformatics data analysis. So even
that's changing.)
Fundamentally, the blithe acceptance of badly engineered software in
science baffles me. I can understand (and even endorse)
not requiring good software engineering for algorithmic proofs of
concept, but clearly we want to have good, robust libraries for
serious work.
To claim otherwise would seem to lead to the conclusion that much of
bioinformatics and genomics should seek to be incorrect and
irrelevant.
I want there to be a robust community of computational scientists
and software developers in biology. I want people to be able to
build a new variant caller without having to reimplement a FASTQ
or SAM parser. I think we need people to file bug reports,
catch weird off-by-one problems, and otherwise spot-check all the
software they are using. And I don't think it's impossible or even
terribly difficult to achieve this.
The open source community has been developing software with
distributed teams, with no single employer, and with uncertain funding
for decades. It's not easy, but it's not impossible. And in the end I
do think that the open source community has a lot of the solutions the
computational science community needs, and in fact is simply a much
better exemplar for how to work reproducibly and with high technical
quality. Why we continue to ignore this mystifies me, although I would
guess it has to do with how credit is assigned in academic software
development.
If we went back to the 80s and 90s we'd see that many of the same
arguments that Lior is making were applied to open source software in
contrast to commercial software. We know how that ended - open source
software now runs most of the Internet infrastructure. And open
source has had other benefits, too; to quote Bill Janeway, "open
source and the cloud have dramatically decreased the friction of
innovating", and the scientific community has certainly benefited from
the tremendous innovation in software and high tech. I would love to
see that same innovation enabled in genomics and bioinformatics. And
that's why we try to practice good software development in my lab;
that's why we release our software under the BSD license; and that's
why we encourage people to do whatever they want with our software.
Ultimately I think we should develop our software (and choose our
licenses) for the future we want to see, not the present that we're
stuck with.
--titus