I'm starting to notice that a lot of bioinformatics is anecdotal.
People publish software that "works for them." But it's not clear
what "works" means -- all to often either the exact parameters or the
specific evaluation procedure is not provided (and yes, there's
a double standard here
where experimental methods are considered more important than
computational methods).
This means that their result is not an example of computational
science. It's an anecdote.
Worse, it's not very useful. I used to think biological validation
of bioinformatics results was a far more important end goal than validation of
the bioinformatics itself -- that is, if a paper did some hand-waving in the
bioinformatics section but then showed a solid biological result, it
was fine. But now I'm switching firmly over to the opinion that I was
full of it then, and am now more enlightened. Read on.
Let's take the Iverson et al. paper (published in
Science) as an
example -- I'm picking on it because someone more important than me
already pointed out some of its problems,
but I'm happy to point out many, many more papers in private. I bet
90% of the people reading Iverson et al. are not that interested in
Euryarchaeota. Most of them want to know how to take metagenomes --
simple or complex -- and assemble genomes out of them. But they will
be stymied, because in the initial publication, the full source code
and parameters are not provided.
The lack of computational details causes us lots of headaches (which
is one reason I am so bullish on practical considerations of
reproducible science). The Iverson et al. paper did a great thing:
they developed an approach to scaffolding contigs from metagenome
assembly that (from reading the paper) looks great and makes sense.
Since we happen to have some awesome metagenome contig generation
technology in our lab, we'd love to use their approach. But, at least
at the moment, we'd have to reimplement the interesting part of it
from scratch, which will take both a solid reimplementation effort and
some guesswork to figure out the parameters and resolve unclear
algorithmic choices. If we do reimplement it from scratch, we'll
probably find that it works really well (in which case Iverson et
al. get to claim that they invented the technique and we're
derivative) or we'll find that it works badly (in which case Iverson
et al. can claim that we implemented it badly). It's hard to see this
working out well for us, and it's hard to see it working out poorly
for Iverson et al.
(It kind of helps my morale that no one I've talked to can figure out
how the Iverson paper got accepted without the source code being
available. This seems like a failure on the part of the reviewers and
the Science editors rather than something accepted in the field of
metagenomics. Good thing Science is a high-impact journal, otherwise
I'd be worried that the paper might be wrong!)
All too often, biologists and bioinformaticians spend time hunting for
the magic combination of parameters that gives them a good result,
where "good result" is defined as "a result that matches expectations,
but with unknown robustness to changes in parameters and data." (I
blame the hypothesis-driven fascista for the attitude that a
result matching expectations is a good thing.) I hardly need to
explain why parameter search is a problem, I hope; read this
fascinating @simplystats blog post
for some interesting ideas on how to deal with the search for
parameters that lead to a "good result". But often the result you
achieve are only a small part of the content of a paper -- methods,
computational and otherwise, are also important. This is in part
because people need to be able to (in theory) reproduce your paper,
and also because in larger part progress in biology is driven by new
techniques and technology. If the methods aren't published in detail,
you're short-changing the future. As noted above, this may be an
excellent strategy for any given lab, but it's hardly conducive to
advancing science. After all, if the methods and technology are both
robust and applicable to more than your system, other people will use
them -- often in ways you never thought of.
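To make the robustness worry above concrete, here's a minimal, hypothetical sketch of a parameter sweep. The `analyze` and `evaluate` callables and the parameter names are stand-ins for whatever pipeline and metric you actually use; the point is just that it tells you how much your headline number moves when the parameters do.

```python
import itertools
import statistics

def robustness_sweep(analyze, evaluate, param_grid):
    """Run `analyze` for every combination of parameters in `param_grid`
    (a dict mapping parameter name -> list of values), score each result
    with `evaluate`, and report how much the metric moves across the grid."""
    names = sorted(param_grid)
    scores = []
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(analyze(**params))
        scores.append((params, score))
        print(params, "score=%.3f" % score)
    metric = [s for _, s in scores]
    print("median=%.3f spread=%.3f"
          % (statistics.median(metric), max(metric) - min(metric)))
    return scores

# Toy usage: a made-up "analysis" whose score depends on its parameters,
# standing in for a real assemble-then-evaluate pipeline.
if __name__ == "__main__":
    robustness_sweep(
        analyze=lambda k, min_coverage: k / (k + min_coverage),
        evaluate=lambda result: result,
        param_grid={"k": [21, 25, 31], "min_coverage": [2, 5, 10]},
    )
```

It's not a substitute for a real evaluation, but it turns "we picked parameters that worked" into something you can actually report.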
Closer to home, I think I can attribute some of my collaborators'
impatience with me to this attitude of mine. I want to do good,
solid, robust computational science, as well as relevant biology; my
schtick is, at least at the moment, computational methods. Since
my collaborators tend not to be computationally focused, they don't
always get the point of all the computational work. Some of them are
either more patient or more relaxed about the whole thing -- if you're
wondering why Jim Tiedje is co-authoring papers on probabilistic de
Bruijn graphs, well, that's why :). Some of them are less patient,
and it's why I would never recommend a bioinformatics analysis position
to anyone -- it leads to computational science driven by biologists,
which is often something we call "bad science".
What's the bottom line? Publish your methods, which include your
source code and your parameters, and discuss your controls and
evaluation in detail. Otherwise, you're doing anecdotal science.
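If writing all of that up sounds like extra work, even something as small as the following helps -- a hedged sketch, with made-up parameter names and output file, of dumping the exact parameters and environment next to every result so they can go straight into the supplement:

```python
import json
import sys
import platform
from datetime import datetime, timezone

def record_run(outfile, params):
    """Write this run's exact parameters and environment to a JSON file
    that can sit next to the results (and go in the supplement)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "command_line": sys.argv,
        "python_version": platform.python_version(),
        "parameters": params,
    }
    with open(outfile, "w") as fp:
        json.dump(record, fp, indent=2)

if __name__ == "__main__":
    # Hypothetical parameters -- whatever your pipeline actually takes.
    record_run("run-metadata.json", {"k": 31, "min_coverage": 5})
```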
--titus