A good way to publish -- arXiv FTW

(Or, "A better way to publish bioinformatics.")

We just got word that our paper, "Scaling metagenome assembly with probabilistic de Bruijn graphs" [ arXiv ] [ github ] has been accepted for publication in PNAS. (Yay!) I just posted the final version to github, and the arXiv PDF should be updated to the third version on Tuesday or Wednesday (July 3rd or 4th). We're very happy, not least because this is my first senior author paper and my grad student Jason Pell's first first-author paper.

But we have other reasons to be happy, too.

You see, back in December when we submitted the paper (on Dec 28th, 2011), I took the relatively unusual step of posting it to arXiv, a preprint server. This made the paper available before peer review, which isn't something that we tend to do in bioinformatics or biology, for whatever reason.

The paper didn't exactly sink like a rock, but I didn't get much in the way of feedback or publicity from making it available. That's OK, it sounds pretty esoteric -- it involves biology, computer science, and physics, and discusses DNA shotgun sequencing, metagenomics, probabilistic data structures, and percolation theory. Apart from being nearly impossible to review from a position of expertise (my group may actually be the world expert on the intersection of these four topics, solely because we wrote this paper), the paper seemed reasonably impenetrable to pretty much any audience. Oh well.

Then two super neat things happened.

First, about a month ago I received a request to review a paper on quip, a DNA sequence compression package that uses probabilistic data structures. I was asked to review it because they cited our arXiv paper (yes, you can cite arXiv papers -- it's archival!) I did review the quip paper (it's a great paper) and I hope it appears soon. (I signed the review, note.)

Second, about two weeks ago I heard about Minia, an assembler that used an extension of our basic idea to do really low-memory assembly of the human genome. What was ultra cool about this is that Rayan Chikhi, one of the authors, told me directly that they built off of our paper:

Thanks for sharing your pre-prints on arxiv, our article is a direct
consequence of that action!

So, not only does our newly-accepted PNAS paper have two citations, both from before it was accepted, but another group has already extended our approach in a new direction (which looks pretty significant to me, BTW).

This just seems to be the right way to do science, to me: minimum time from paper to exposure to (possible) extension.

To be clear, I don't want to claim that our paper is super consequential (that'll wait for another blog post ;) or even that the Minia assembler is an amazing breakthrough that totally built off of our giant contribution. I just think it demonstrates a way to accelerate bioinformatics research -- which we should be doing, yeh?

Incidentally, Cameron Neylon, the new advocacy director for PLoS, told me that in physics (where they've been using arXiv for 20 years or more) the citation count for papers peaks right around the point where they're accepted for publication -- because by then they're old hat: everyone has read them on arXiv already. Wow, physics and bioinformatics publication culture sure is different :).


p.s. Note that the paper is under embargo until it's formally published. I'm not sure what that means in the age of blogs, exactly, but if you're a NY Times reporter just itchin' to write a front-page story about our paper... please wait, okay?

Comments !