Note: at the Lab for Data Intensive Biology, We're trying out a new journal club
format where we summarize our thoughts on the paper in a blog post.
For this blog post, Camille wrote the majority of the text and the
rest of us added questions and comments.
Inanç Birol, Justin Chu, Hamid Mohamadi, et al., “Spaced Seed Data
Structures for De Novo Assembly,” International Journal of Genomics,
Article ID 196591, in press.
The paper:
Some relevant background:
Summary
The authors describe several data structures for integrating a special
case of spaced seeds into de Bruijn Graphs. Traditionally, a spaced
seed is a mask that's applied to an indexing sequence, increasing the
sensitivity of seeding approaches. For example, we could use 110110110
to mask out the 3rd base in each codon of our seed, thus allowing for
a wider range of matches given the high variability of that position
in protein coding regions. The special case here has the mask follow
the form 1..1..0..0..0..1..1, where the left and right pass regions
are k-mers, and the mask in the middle is of length Δ; or in other
words, given a region of length 2k+Δ, the seed comprises the prefix
and suffix k-mers.
Three data structures are introduced. The first is a naive hash table formulated in the same manner as the current ABySS implementation, which stores:
- The concatenated seeds as a bit string, represented with the notation [ k : k ]
- The extensions of both k-mers in the seed pair, stored as a 16-bit string
- Observation frequencies for the two strands of the seeds
- Bookkeeping junk
The second is a “spaced seeds bloom filter,” which as one might expect, stores the seeds in a bloom filter. The novelty here is the way they hash the seeds. Given a string of length 2k representing the concatenated k-mers of the seed, they hash:
- The left k-mer (i.e., the first k bases)
- The right k-mer (i.e., the last k bases)
- The string formed from pulling out every other base starting at position 0 (i.e., the “odd” bases)
- The string formed from pulling out every other base starting at position 1 (i.e., the “even” bases)
- For each of the four described values, they actually hash the XOR of the forward and revcomp’d 2-bit encodings
One immediately useful application of this method is in its ability to
infer the existence of single-base errors. For example, if we check a
seed and find that our filter already contains the “left” and “odd”
k-mers, but not the “right” and “even” ones, there’s a good chance
that our query seed just has a single base error in one of the “right”
and “even” positions (see Table 3 for complete listing).
Finally, they describe a “spaced seeds counting bloom filter” which,
you guessed it, stores seeds in a counting bloom filter. This is a
particularly nifty implementation, as it uses the minifloat standard
to exactly count up to 15, and probabilistically count values from
15-~128,000. They use the bloom filter to first count existence, and
then fall over to the counting filter when a seed is observed multiple
times. The usefulness of a better counting bloom filter should be
obvious to our group.
Broadly, we care about this because:
- The seeding methodology allows for dBG’s to be scaled up to longer,
error prone reads - a very important advancement to make if we want to
dBG’s to continue to be relevant. The question remains as to whether
we ought to be piling more and more duct tape on to dBG's to keep them
in use.
- The seeding also allows more accurate resolution of complex graph
regions by retaining longer range correlations from within reads.
- The aforementioned error identification. The hashing method allows
one to quickly restrict the set of possible/likely erroneous k-mers in
a read, which should speed up spectral error correction.
- Generally, spaced seeds have better fault tolerance and uniqueness
than long k-mers. Fig. 1 shows that spaced seeds of length 16 with
Δ>100 have better uniqueness than k-mers of length 64, and are
obviously less prone to single base errors because they are composed
of less sequence.
Other notes
Interesting, they routinely use a k size of 120 when assembling 150 bp
reads.
The staged Bloom filter reminds us of the BFCounter implementation,
which uses exact counting for k-mers seen two or more times.
For Illumina reads (100 - 150 bp long), the middle part of each read
are ignored if delta is big like 100 bp. So, were the y axis in figure
1 changed to absolute number, unique 2k-mers should be much higher in
number than unique spaced seeds.
A generally question we have is which one is the most memory efficient
data structure for DBG, sDBG or bloom filters?
Spaced seeds let you take advantage of longer reads; the alternative,
using longer k, would reduce coverage dramatically, be sensitive to
errors as well, and consume more memory.
Confusions
The authors claim that this is better than the method used in the
PathSet paper because spaced seeds “allow” for fixed distance
between seed pairs. This is confusing to us, because the variable
distance used in PathSet seems to be described as a feature — yet
these authors posit that variable-distance seeds are sensitive to
“read coverage fluctuations” for reasons. No justification was given
for this statement.
We don't see how spaced seeds are useful with the higher error rate of
uncorrected long 'pore reads; granted the error correction for both
nanopore and pacbio has gotten a lot better lately.
More specifically, this seems targeted at long, erroneous
reads. What effect do indels etc have from pac bio? Do you need error
corrected reads? If you have error corrected long reads aren't you
already mostly done assembling and no longer need to use DBG? And
what's the effect of indels vs effect of high substitution, especially
given the spaced seeds with fixed spacing?
There are comments.