We just posted a new preprint (well, ok, a few weeks back)! The preprint title is
"Crossing the streams: a framework for streaming analysis of short DNA
sequencing reads", by Qingpeng Zhang, Sherine Awad, and myself. Note
that like our other recent papers, this paper is 100% reproducible,
with all source code in github. we haven't finished
writing up the instructions yet, but happy to share (see AWS.md for
notes).
This paper's major point is that you can use the massive redundancy of
deep short-read sequencing to analyze data as it comes in, and that
this could easily be integrated into existing k-mer based error
correctors and variant callers. Conveniently the algorithm doesn't
care what you're sequencing - no assumptions are made about uniformity
of coverage, so you can apply the same ol' spectral approaches you use
for genomes to transcriptomes, metagenomes, and probably single-cell
data. Which is, and let's be frank here, super awesome.
The paper also provides two useful tools, both implemented as part of
khmer: one is an efficient
approach to k-mer-abundance-based error trimming of short reads, and
the other is a streaming approach to looking at per-position error
profiles of short-read sequencing experiments.
A colleague, Erich Schwarz, suggested that I more strongly make the
point that is really behind this work: we're in for more data. Scads
and scads of it. Coming it with ways of efficiently dealing with it
at an algorithmic level is important. (We didn't strengthen this point
in the posted preprint - the feedback came too late -- but we will hopefully
get a chance to stick it in in the revisions.)
The really surprising thing for me is that the general semi-streaming
approach has virtually no drawbacks - the results are very similar to
the full two-pass offline approaches used currently for error
correction etc. Without implementing a huge amount of stuff we had to
make this argument transitively, but I think it's solid.
For entertainment, take a look at the error profile in Figure 6. That's
from a real data set, published in Nature something or other...
And, finally, dear prospective reviewers: the biggest flaws that I see
are these:
- we chose to make most of our arguments by analyzing real data, and
we didn't spend any time developing theory. This is a choice that
our lab frequently makes -- to implement effective methods rather
than developing the underlying theory -- but it leaves us open for
a certain type of criticism.
- to extend the previous point, the algorithmic performance depends
critically on details of the data set. We didn't know how to
discuss this and so the discussion is maybe a bit weak. We'd love
reviewers to ask pointed questions that we can address in order to
shore it up.
- Repeats! How does all this stuff work with repeats!? I did a lot of
work simulating repetitive sequences and couldn't find any place
where repeats actually caused problems. My intuition now tells me
that repeats are not actually a problem for methods that interrogate
De Bruijn graphs using entire reads as an index into the graph, but
I'd welcome someone telling me I'm wrong and either telling me
where to look, or asking concrete questions that illuminate better
directions to take it.
--titus
There are comments.