Essentially, a nice, fast data structure for querying k-mers and
retrieving their colors (the data sets they appear in). I guess this is
for pangenomes, among other things.
They use compressed nodes in the trie to efficiently store shared
prefixes for large sections of the tree.
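To build some intuition, here is a minimal toy sketch of that idea in
Python. It is not the authors' implementation (no Bloom filters, no
bit-level compression); it is just a nested-dict trie keyed on
fixed-size prefix chunks, with a set of colors stored at each leaf. The
chunk size and function names are made up for illustration.

```python
CHUNK = 4  # prefix chunk length; arbitrary choice for this sketch


def insert(trie, kmer, color):
    """Add a k-mer with a color (data set ID) to a nested-dict trie."""
    node = trie
    for i in range(0, len(kmer), CHUNK):
        node = node.setdefault(kmer[i:i + CHUNK], {})
    node.setdefault("colors", set()).add(color)


def query(trie, kmer):
    """Return the set of colors for a k-mer, or an empty set if absent."""
    node = trie
    for i in range(0, len(kmer), CHUNK):
        chunk = kmer[i:i + CHUNK]
        if chunk not in node:      # abort as soon as a chunk is missing
            return set()
        node = node[chunk]
    return node.get("colors", set())


trie = {}
insert(trie, "ACGTACGTACGT", color=0)  # same k-mer in two data sets
insert(trie, "ACGTACGTACGT", color=1)
insert(trie, "ACGTACGTTTTT", color=1)  # shares the first 8 bp of prefix
print(query(trie, "ACGTACGTACGT"))     # {0, 1}
print(query(trie, "ACGTACGTAAAA"))     # set()
```

The point of the sketch is just that shared prefixes are stored once,
which is where the space savings for highly similar genomes come from.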
We worry about the peak memory usage diagram. It seems like a fair
amount of memory is used during construction. How does this compare to
the SBT? Do they compare peak memory usage or merely compressed
memory usage?
It seems like one advantage the SBT has is that with the BFT you cannot
store and query individual data sets as separate indices. So, for
example, if you wanted to build indices for data sets spread across
many different machines, you would have to gather all of the data sets
in one place.
Both the SBT and the BFT get their compression mainly from Bloom
filters. The authors did not discuss why there is a difference in
compression ratio. Bloom filter size? The false positive rate of the
Bloom filters used in the SBT was mentioned as 7.2%, but the FP rate of
the Bloom filters in the BFT was not mentioned in the paper.
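For reference, the standard approximation for a Bloom filter's false
positive rate is p ≈ (1 − e^(−hn/m))^h, with h hash functions, n
inserted elements, and m bits. A quick sketch (the numbers below are
made up, not taken from either paper) shows how strongly the
bits-per-element choice drives the FP rate, and hence the compression
trade-off:

```python
from math import exp


def bloom_fp_rate(n_items, m_bits, n_hashes):
    """Standard approximation of a Bloom filter's false positive rate."""
    return (1.0 - exp(-n_hashes * n_items / m_bits)) ** n_hashes


# Made-up numbers, just to show how FP rate scales with filter size.
n = 1_000_000                       # distinct k-mers inserted
for bits_per_elem in (4, 8, 12):
    m = n * bits_per_elem
    p = bloom_fp_rate(n, m, n_hashes=1)  # one hash function, for simplicity
    print(bits_per_elem, "bits/element ->", round(p, 4))
```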
Another catch in the evaluation: 1) the difference in loading CPU time
between the SBT and the BFT in Table 1 may come from the k-mer counting
step (Jellyfish vs. KMC2); 2) when comparing unique k-mer query times,
the unique k-mers were divided into subsets due to memory limits. Not
sure whether this was a fair comparison.
How does the false positive rate of all the Bloom filters (across all
nodes) affect the overall error rate? E.g., if the BFT is converted
back into k-mers, how many erroneous k-mers are there? (None, we think.)
PanCake (alignment-based) and RCSI (reference-based) were mentioned
but not included in the evaluation, which gave us the impression that
they are not as efficient. Do they have any advantages?
BFT or SBT vs. khmer? (mentioned in intro but not discussed)
Pangenome (and transcriptome, proteome!) storage is super
cool. (This might not be a relevant question here, but I am wondering:)
How are genomes defined as "highly similar", which is what the authors
restricted their test data sets to? At what point do species diverge
too far to be analyzed in this manner, i.e. how close is close, and
what is too far?
(CTB answer: it has something to do with how many k-mers they share,
but I don't know that this has been really quantified. Kostas
Konstantinidis et al.'s latest work on species definitions might be
good reading
(http://nar.oxfordjournals.org/content/early/2015/07/06/nar.gkv657.full)
as well as his Average Nucleotide Identity metric.)
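One rough way to put a number on "how close is close" is the fraction
of k-mers two genomes share, e.g. the Jaccard similarity of their k-mer
sets. A toy sketch (our own illustration, not something from the paper)
shows how quickly k-mer sharing drops off with sequence divergence at
k=21:

```python
import random

random.seed(0)


def kmers(seq, k=21):
    """Return the set of k-mers in a sequence (no canonicalization)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)


def mutate(seq, rate):
    """Introduce random substitutions at (roughly) the given per-base rate."""
    return "".join(random.choice("ACGT") if random.random() < rate else c
                   for c in seq)


genome = "".join(random.choice("ACGT") for _ in range(100_000))
for rate in (0.001, 0.01, 0.05):
    sim = jaccard(kmers(genome), kmers(mutate(genome, rate)))
    print(f"{rate:.3f} substitutions/bp -> k-mer Jaccard {sim:.3f}")
```

At k=21, even a few percent divergence destroys most shared k-mers,
which is roughly why these indices are pitched at highly similar
genomes.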
Wondering how the BFT might scale? The authors only tested prokaryotic
sequences: 473 clinical isolates of Pseudomonas aeruginosa from 34
patients, totaling 844.37 GB, plus simulated data of 6 million reads of
100 bp length, for 31 GB. In comparison, MMETSP is transcriptomic data
from 678 cultured samples of 306 marine eukaryotic species
representing more than 40 phyla (see Figure 2, Keeling et al. 2014).
Not sure how large the entire MMETSP data set is, but probably on the
order of TB?
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001889
http://www.ncbi.nlm.nih.gov/nuccore?term=231566%5BBioProject%5D
Although they discussed the SBT as an existing data structure, along
with graphalign in khmer, it wasn't clear until the end that one of the
main goals of the paper, besides describing the BFT, was to compare the
BFT to the SBT (Solomon and Kingsford 2015,
http://biorxiv.org/content/biorxiv/early/2015/03/26/017087.full.pdf).
I feel this should have been noted in the abstract.
Speed can partly come from being able to abort searches for k-mers
partway through.
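A minimal sketch of what that early abort looks like, again using a toy
nested-dict trie rather than the BFT's compressed, Bloom-filter-annotated
nodes: the search returns as soon as a base has no matching branch, so
mismatching k-mers are rejected without examining all k characters.

```python
def contains(trie, kmer):
    """Walk a character-level trie, bailing out at the first base
    with no matching branch instead of examining the full k-mer."""
    node = trie
    for base in kmer:
        node = node.get(base)
        if node is None:
            return False   # search aborted partway through the k-mer
    return True


trie = {"A": {"C": {"G": {}}}}
print(contains(trie, "ACG"))  # True
print(contains(trie, "TAC"))  # False, rejected after a single character
```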
The BFT is really specialized for the pangenome situation, where many
k-mers are shared. Will the cluster approach break down if the genomes
aren't mostly the same?
We would have liked a more visual representation of the data structure
to help build intuition.