If a piece of bioinformatics software is not fully open source, my lab and I will generally seek
out alternatives to it for research, teaching and training. This holds
whether or not the software is free for academic use.
If a piece of bioinformatics software is only available under the GNU
Public License or another
then my lab and I will absolutely avoid integrating any part of it
into our own source code, which is mostly BSD.
Why avoid non-open-source software?
We avoid non-open-source software because it saves us future
Contrary to some assumptions, this is
not because I'm anti-company or against making money from software,
although I have certainly chosen to forego that in my own life. It's
almost entirely because such software is an evolutionary dead end, and
hence time spent working with it is ultimately wasted.
More specifically, here are some situations that I want to avoid:
- we invest a lot of time in building training materials for a piece
of software, only to find out that some of our trainees can't make
use of the software in their day job.
- we need to talk to lawyers about whether or not we can use a piece
of software or include some of its functionality in a workflow we're
- we find a small but critical bug in a piece of bioinformatics software,
and can't reach any of the original authors to OK a new release.
This is the reason why I won't be investing much time or effort in
in my courses and my research: it's definitely not open source.
Why avoid copyleft?
The typical decision tree for an open source license is between a "permissive"
BSD-style license vs a copyleft license like the GPL; see Jake
Vanderplas's excellent post on licensing scientific code for specifics.
There is an asymmetry in these licenses.
Our software, khmer, is
available under the BSD license. Any open source project (indeed,
any project) is welcome to take any part of our source code and
include it in theirs.
However, we cannot use GPL software in our code base at all. We can't
call GPL library functions, we can't include GPL code in our codebase,
and I'm not 100% sure we can even look closely at GPL code. This is
because if we do so, we must license our own software under the GPL.
This is the reason that I will be avoiding bloomtree code, and in fact we
will probably be following through on our reimplementation --
bloomtree relies on both Jellyfish and sdsl-lite, which are GPL.
Why did we choose BSD and not GPL for our own code?
Two reasons: first, I'm an academic, funded by government grants;
second, I want to maximize the utility of my work, which means
choosing a license that encourages the most participation in the
project, and encourages the most reuse of my code in other projects.
Jake covers the second line of reasoning really nicely in his blog
so I will simply extract his summary of John Hunter's reasoning:
To summarize Hunter's reasoning: the most important two predictors
of success for a software project are the number of users and the
number of contributors. Because of the restrictions and subtle
legal issues involved with GPL licenses, many for-profit companies
will not touch GPL-licensed code, even if they are happy to
contribute their changes back to the community. A BSD license, on
the other hand, removes these restrictions: Hunter mentions several
specific examples of vital industry partnership in the case of
matplotlib. He argues that in general, a good BSD-licensed project
will, by virtue of opening itself to the contribution of private
companies, greatly grow its two greatest assets: its user-base and
I also think maximizing remixability is a
basic scientific goal, and this is something that the GPL
The first line of reasoning is a little more philosophical, but
basically it comes down to a wholesale rejection of the logic in the
Bayh-Dole act, which tries
to encourage innovation and commercialization of federally funded
research by assigning intellectual property to the university. I
think this approach is bollocks. While I am not an economic expert, I
think it's clear that most innovation in the university is probably
not worth that much and should be made maximally open. From talking to
Dr. Bill Janeway, I he
agrees that pre-Bayh-Dole was a time of more openness, although I'm
not sure of the evidence for more innovation during this period.
Regardless, to me it's intuitively obvious that the prospect of
commercialization causes more researchers to keep their research
closed, and I think this is obviously bad for science. (The Idea
talks a lot about how Bell Labs spurred immense amounts of innovation
because so much of their research was open for use. Talent Wants to
be Free is a
pop-sci book that outlines research supporting openness leading to
So, basically, I think my job as an academic is to come up with cool
stuff and make it as open as possible, because that encourages innovation
and progress. And the BSD fits that bill. If a company wants to make
use of my code, that's great! Please don't talk to us - just grab it and
I should say that I'm very aware of the many good reasons why GPL
promotes a better long-term future, and until I became a grad student
I was 100% on board. Once I got more involved in scientific
programming, though, I switched to a more selfish rationale, which is
that my reputation is going to be enhanced by more people using my
code, and the way to do that is with the BSD. If you have good
arguments about why I'm wrong and everyone should use the GPL, please
do post them (or links to good pieces) in the comments; I'm happy to
promote that line of reasoning, but for now I've settled on BSD for my
One important note: universities like releasing things under the GPL,
because they know that it virtually guarantees no company will use it
in a commercial product without paying the university to relicense it
specifically for the company. While this may be in the best
short-term interests of the university, I think it says all you need
to know about the practical outcome of the GPL on scientific
Why am I OK with the output of commercial equipment?
Lior Pachter drew a contrast between my refusal to teach non-free
software and my presumed teaching on sequencing output from commercial
Illumina machines. I think there's
at least four arguments to be made in favor of continuing to use Illumina
while avoiding the use of Kallisto.
- pragmatically, Illumina is the only game in town for most of my
students, while there are plenty of RNA-seq analysis programs. So
unless I settled on kallisto being the super-end-all-be-all of
RNAseq analysis, I can indulge my leanings towards freedom by
ignoring kallisto and teaching something else that's free-er.
- Illumina has a clear pricing model and their sequencing is essentially
a commodity that needs little to no engagement from me. This is not
generally how bioinformatics software works :)
- There's no danger of Illumina claiming dibs on any of my results or
extensions - we're all clear that I pays my money, and I gets my
sequence. I'm honestly not sure what would happen if I modified
kallisto or built on it to do something cool, and then wanted to let
a company use it. (I bet it would involve talking to a lot of
lawyers, which I'm not interested in doing.)
- James Taylor made the excellent points that
limited training and development time is best spent on tools that
are maximally available, and that don't involve licenses that they
So that's my reasoning. I don't want to pour fuel on any licensing
fire, but I wanted to explain my reasoning to people. I also think
that people should fight hard to make their bioinformatics software
available under a permissive license, because it will benefit everyone
I should say that Manoj Samanta has been following this line of
for much longer than me, and has written several blog posts on this
topic (see also this,
There are comments.