Note: this is a blog post from the DIB Lab journal club.
Questions and Comments:.
"Bloom Filter Trie: a data structure for pan-genome storage."
by Guillaume Holley, Roland Wittler, and Jens Stoye.
Pan Genome: Represents genes in a clade that comprises
hundreds/thousands of strains that share large sequence parts but
differ by individual mutations from one another.
Colored De Bruijn Graph (C-DBG) - A directed graph where each vertex
represents a kmer and is associated with a color that represents the
genome where the kmer occurs. An edge between vertex x and vertex y
exists if x[2..k] = y[1..k-1]
Existing pan-genome Data structures
Burrows-Wheeler Transform (BWT) that rearranges the input data to
enable better compression by aggregating characters with similar
FM-Index count and locate the occurrence of substring in BWT.
Bloom Filter BF - A bit array of n elements initialized with zeros,
and it uses a set of hash functions for look-up and insertion in a
Sequence Bloom Filter (SBT) is a binary tree with BFs as vertices.
A query sequence is broken into a set of kmers then checked if they
exist in the vertex bloom filter. If yes, search is repeated on
children if not existing the proceeds on other vertices.
Paper Scope: New pan-genome Data Structure
Proposes a new data structure for indexing and compressing a
pan-genome as a C-DBG, the Bloom Filter Trie (BFT). The trie stores
kmers and their colors. The new representation allows for compressing
and indexing shared substrings. It also allows quick checking for the
existence of a given substring. It is incremental, alignment and
reference free, and allows for any format.
How BFT Works?
Colors Representation Before Compression
Colors are represented by a bit array initialized with 0s. Each color
has an index i_color: color_x [i_color] =1 means kmer x has color
Colors Representation After Compression
If we can put each color set in a specific position in an array such
that this position encodes into a number less than the color set size,
then we can store the color set in the BFT using less space. Color
set size is the number of kmers sharing a color multiplied by its
size. So basically, they sort colors in a list based on the color set
size in a decreasing order then they add each color set to an external
array incrementally (if : integer encoding the position of the color in
the array < size of the color). Finally, each color is replaced in the
BFT by its position in the external array
A set of tuples of capacity
c <s, color_ps> where:
x = ps and x is the path from root to v.
When number of suffixes exceeds c the container is burst.
In bursting process, each suffix s is split into a suffix s_suff and prefix s_pref
Prefixes are stored in a new compressed container.
Suffixes and their colors are stored in a new uncompressed containers.
Compressed Container and Bursting Procedure
An uncompressed container is replaced by a compressed one that stores
q prefixes of suffix with links to children containing the suffix.
quer is a BF represented as m bit array and has f hash functions to
record and filter the presence of q prefixes of suffix.
The q prefixes of suffix are stored in a bloom filter BF. q
suffixes are stored in an array suf in lexicographic order of the
prefixes of those suffixes.
pref[𝛂] =1 if a prefix a exists with 𝛂 as its binary representation.
clust is an array of q bits one per suffix of array suf. A cluster is a list of consecutive suffixes in array suf that has the same prefix.
To check if a Spref ab with 𝛂𝜷 representation exist in a compressed
container cc, the BF quer is checked and the prefix a existence is
verified in the array pref. Remember that suffix prefixes are stored
in quer during bursting process.
if a exist, then its hamming weight is computed which is the index of
the cluster in which suffix b is likely located where i is the start
position of the cluster. Remember that a cluster is a list of
consecutive suffixes that has the same prefix, so b is compared to
suffixes in that cluster.
To check of a kmer x exists in a BFT t, the look-up process traverses
t and for each vertex v it checks its containers one after
one. Remember that suffix and their colors are stored in uncompressed
container during burst process, hence a vertex now either represent a
suffix from an uncompressed container or a suffix rooted from its
For the first case, and as the uncompressed container has no childs,
a match indicates the presence of the kmer.
For the second case, the quer is checked for the x_suff[1..l]. If it
is found, traverse is continued recursively to the corresponding
children. The absence of of x_suff[1..l] means the absence of the kmer
as it can’t be located in another container of vertex v. Remember
that k is a multiple of l so kmer =k/l equal substrings.
If the kmer already exist, its set of colors is only
modified. Otherwise, a lookup process is continued till:
The prefix of the searched suffix does not exist
The kmer suffix does not exist
Then the kmer is inserted. Insertion is simple if the container is
uncompressed. If the container is compressed, the insertion of of
s_pref =ab is pretty complicated:
Remember in the look up process, the ‘a’ prefix existence is verified
by checking pref array. If it does not exist: it is a FP, and we can
insert now by setting pref[𝛂] to 1. So, in next look up, the
verification will lead to a TP index and start position of cluster pos
are recomputed and updated. How? if it does exist: Then the suffix b
is to be inserted into suf[pos]
Experiments presented in the paper show that BFT is faster than SBT
while utilizing less memory.
KmerGenie was used to get optimal k size and mininal kmer count
Data insertion (loading) and kmer query was compared between SBT and
BFT. Traversal time is also evaluated on BFT.
BFT was multiple times faster than the SBT on the building time
while using about 1.5 times less memory. The BFT was about 30 times
faster than the SBT for querying a k-mer.
Points for clarification or discussion:
c is defined as capacity, but this is not well-described. What is capacity?
BFT to khmer graphalign comparison?
Proudly powered by
pelican, which uses python.
The theme is subtlely modified from one by
Smashing Magazine, thanks!
For more about this blog's author, see
the main site or the lab site
While the author is employed by the University of California, Davis, his opinions are his own and almost certainly bear no resemblance to what UC Davis's official opinion would be, had they any.