Essentially, a nice, fast data structure for querying k-mers and
retrieving their colors (the data sets they appear in). I guess this is
for pangenomes, among other things.
They use compressed nodes in the trie to efficiently store shared
prefixes for large sections of the tree.
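To build some intuition, here is a minimal toy sketch of that idea in
Python. It is not the authors' implementation (no Bloom filters, no
bit-level compression); it is just a nested-dict trie keyed on
fixed-size prefix chunks, with a set of colors stored at each leaf. The
chunk size and function names are made up for illustration.

```python
CHUNK = 4  # prefix chunk length; arbitrary choice for this sketch


def insert(trie, kmer, color):
    """Add a k-mer with a color (data set ID) to a nested-dict trie."""
    node = trie
    for i in range(0, len(kmer), CHUNK):
        node = node.setdefault(kmer[i:i + CHUNK], {})
    node.setdefault("colors", set()).add(color)


def query(trie, kmer):
    """Return the set of colors for a k-mer, or an empty set if absent."""
    node = trie
    for i in range(0, len(kmer), CHUNK):
        chunk = kmer[i:i + CHUNK]
        if chunk not in node:      # abort as soon as a chunk is missing
            return set()
        node = node[chunk]
    return node.get("colors", set())


trie = {}
insert(trie, "ACGTACGTACGT", color=0)  # same k-mer in two data sets
insert(trie, "ACGTACGTACGT", color=1)
insert(trie, "ACGTACGTTTTT", color=1)  # shares the first 8 bp of prefix
print(query(trie, "ACGTACGTACGT"))     # {0, 1}
print(query(trie, "ACGTACGTAAAA"))     # set()
```

The point of the sketch is just that shared prefixes are stored once,
which is where the space savings for highly similar genomes come from.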
We worry about the peak memory usage diagram. It seems like a fair
amount of memory is used during construction. How does this compare to
the SBT? Do they compare peak memory usage or merely compressed
memory usage?
It seems like one advantage the SBT has is that with the BFT you cannot
store and query individual data sets as separate indices. So, for
example, if you wanted to build indices for data sets spread across
many different machines, you would have to gather all of the data sets
in one place.
Both the SBT and the BFT get their compression mainly from Bloom
filters. The authors did not discuss why there is a difference in
compression ratio. Bloom filter size? The false positive rate of the
Bloom filters used in the SBT was mentioned as 7.2%, but the FP rate of
the Bloom filters in the BFT was not mentioned in the paper.
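For reference, the standard approximation for a Bloom filter's false
positive rate is p ≈ (1 − e^(−hn/m))^h, with h hash functions, n
inserted elements, and m bits. A quick sketch (the numbers below are
made up, not taken from either paper) shows how strongly the
bits-per-element choice drives the FP rate, and hence the compression
trade-off:

```python
from math import exp


def bloom_fp_rate(n_items, m_bits, n_hashes):
    """Standard approximation of a Bloom filter's false positive rate."""
    return (1.0 - exp(-n_hashes * n_items / m_bits)) ** n_hashes


# Made-up numbers, just to show how FP rate scales with filter size.
n = 1_000_000                       # distinct k-mers inserted
for bits_per_elem in (4, 8, 12):
    m = n * bits_per_elem
    p = bloom_fp_rate(n, m, n_hashes=1)  # one hash function, for simplicity
    print(bits_per_elem, "bits/element ->", round(p, 4))
```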
Another catch in the evaluation: 1) the difference in loading CPU time
between the SBT and the BFT in Table 1 may come from the k-mer counting
step (Jellyfish vs. KMC2); 2) when comparing unique k-mer query times,
the unique k-mers were divided into subsets due to memory limits. Not
sure whether this was a fair comparison.
How does the false positive rate of all the Bloom filters (across all
nodes) affect the overall error rate? E.g., if the BFT is converted
back into k-mers, how many erroneous k-mers are there? (None, we think.)
PanCake (alignment-based) and RCSI (reference-based) were mentioned
but not included in the evaluation, which gave us the impression that
they are not as efficient. Do they have any advantages?
BFT or SBT vs. khmer? (mentioned in intro but not discussed)
Pangenome (and transcriptome, proteome!) storage is super
cool. (This might not be a relevant question here, but I am wondering:)
How are genomes defined as "highly similar", which is what the authors
restricted their test data sets to? At what point do species diverge
too far to be analyzed in this manner, i.e. how close is close, and
what is too far?
(CTB answer: it has something to do with how many k-mers they share,
but I don't know that this has been really quantified. Kostas
Konstantinidis et al.'s latest work on species definitions might be
good reading
(http://nar.oxfordjournals.org/content/early/2015/07/06/nar.gkv657.full)
as well as his Average Nucleotide Identity metric.)
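One rough way to put a number on "how close is close" is the fraction
of k-mers two genomes share, e.g. the Jaccard similarity of their k-mer
sets. A toy sketch (our own illustration, not something from the paper)
shows how quickly k-mer sharing drops off with sequence divergence at
k=21:

```python
import random

random.seed(0)


def kmers(seq, k=21):
    """Return the set of k-mers in a sequence (no canonicalization)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)


def mutate(seq, rate):
    """Introduce random substitutions at (roughly) the given per-base rate."""
    return "".join(random.choice("ACGT") if random.random() < rate else c
                   for c in seq)


genome = "".join(random.choice("ACGT") for _ in range(100_000))
for rate in (0.001, 0.01, 0.05):
    sim = jaccard(kmers(genome), kmers(mutate(genome, rate)))
    print(f"{rate:.3f} substitutions/bp -> k-mer Jaccard {sim:.3f}")
```

At k=21, even a few percent divergence destroys most shared k-mers,
which is roughly why these indices are pitched at highly similar
genomes.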
Wondering how the BFT might scale? The authors only tested prokaryotic
sequences: 473 clinical isolates of Pseudomonas aeruginosa from 34
patients, totaling 844.37 GB, plus simulated data of 6 million reads of
100 bp length, for 31 GB. In comparison, MMETSP is transcriptomic data
from 678 cultured samples of 306 marine eukaryotic species
representing more than 40 phyla (see Figure 2, Keeling et al. 2014).
Not sure how large the entire MMETSP data set is, but probably on the
order of TB?
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001889
http://www.ncbi.nlm.nih.gov/nuccore?term=231566%5BBioProject%5D
Although they discussed the SBT as an existing data structure, along
with graphalign in khmer, it wasn't clear until the end that one of the
main goals of the paper, besides describing the BFT, was to compare the
BFT to the SBT (Solomon and Kingsford 2015,
http://biorxiv.org/content/biorxiv/early/2015/03/26/017087.full.pdf).
I feel this should have been noted in the abstract.
Speed can partly come from being able to abort searches for k-mers
partway through.
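A minimal sketch of what that early abort looks like, again using a toy
nested-dict trie rather than the BFT's compressed, Bloom-filter-annotated
nodes: the search returns as soon as a base has no matching branch, so
mismatching k-mers are rejected without examining all k characters.

```python
def contains(trie, kmer):
    """Walk a character-level trie, bailing out at the first base
    with no matching branch instead of examining the full k-mer."""
    node = trie
    for base in kmer:
        node = node.get(base)
        if node is None:
            return False   # search aborted partway through the k-mer
    return True


trie = {"A": {"C": {"G": {}}}}
print(contains(trie, "ACG"))  # True
print(contains(trie, "TAC"))  # False, rejected after a single character
```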
The BFT is really specialized for the pangenome situation, where many
k-mers are shared. Will the cluster approach break down if the genomes
aren't mostly the same?
We would have liked a more visual representation of the data structure
to help build intuition.