It's not (just) the tools, folks

The latest hot shit idea for making a protein-protein interaction database leaves me lukewarm.

A few months ago I met with a genomics group, and we had a back-and-forth about genome annotation. The conversation went something like this:

them: "We have to improve the tools for annotating un-annotated genes!"

me: "Uhh, don't those tools have a ridiculously high rate of error?"

them: "Exactly!  More work is needed!"

me: "Didn't most of the original information come from annoying, slow,
expensive manual experiments, though?"

them: "Yes!  But it's too slow that way!  There are too many un-annotated
genes out there!  We need to do something automated and/or computational!
And manual annotations suck, because people don't properly use ontologies,
so we can't make use of the manual annotations in our own high-throughput

me: "But don't we want to make the information useful for biologists
first and foremost?  They're the ones doing the experiments, right?"

them: "No! It needs to be ontologized first and foremost!"

This sort of conversation showcases a fairly common tension between genomics folks (who are often computationally minded and want to sequence, predict, and otherwise do high-throughput science) and classically-minded molecular biologists and geneticists (who seem to firmly believe that individual experiments are The Way).

This tension has extended to all of science as Moore's Law, the Internet, and technology in general have expanded the data sets available to us and the tools we have to analyze them.

All in all, I agree that the tools for entering biological data, annotating it with ontologies, and otherwise communicating single-scientist knowledge kinda suck. (I'm still aghast at the annotation process that the WormBase folk go through -- it involves ASCII text editors, for chrissakes!) But while tools can be helpful, the fundamental problem is much more, well, fundamental: science is hard. Connecting the dots is hard. Thinking clearly about the problem and separating the wheat from the chaff, so to speak, is hard. I worry that for the majority of biologists, new tools are going to be more distracting than helpful. We need to build simpler, easier-to-use tools, not more complicated tools; we need to keep our focus on the goal (solving biological problems) and not just on intermediate stages like improving databases and building better prediction tools.

To paraphrase Paul Sternberg, it's more often the sharpshooter on the hill than the marching army that makes the difference in science.

My gold standard for a tool or a database is simple: do I believe the database or tool sufficiently to put a few weeks into an experiment to test a hypothesis derived from it? If not, then it is not a tool intended for bench biologists and it shouldn't be billed as such. You can take a similar perspective on big initiatives like the Protein Structure Initiative, although I think you'd have been mostly wrong in the case of the Human Genome Project.


p.s. Oddly enough, my 10 years in developmental molecular biology have completely trumped my ~18 years in computational science here: I'm a pretty firm believer in hypothesis-driven science. That having been said, the true answer and the optimal approach lie somewhere in the middle: figuring out how to combine large-scale approaches with hypothesis-driven science.
