There's an interesting and intuitive connection between error
correction and
variant calling - if you can do one well, it lets you do (parts of)
the other well. In the previous blog post on
some new features in khmer, we introduced our new "graphalign"
functionality, which lets us align short sequences to De Bruijn graphs,
and we discussed how we use it for error correction. Now, let's try
it out for some simple variant calling!
Graphalign can potentially be used for variant calling in a few
different ways - by mapping reads to the reference graph and then
using a pileup approach, or by error correcting reads against the
graph with a tunable threshold for errors and then looking to see
where all the reads disagree - but I've become enamored of an approach
based on the concept of reference-guided assembly.
The essential idea is to build a graph that contains the information
in the reads, and then "assemble" a path through the graph using a
reference sequence as a guide. This has the advantage of looking at
the reads only once (to build a DBG, which can be done in a single
pass), and also potentially being amenable to a variety of heuristics.
(Like almost all variant calling, it is limited by the quality of
the reference, although we think there are probably some ways around
that.)
Basic graph-based variant calling
Implementing this took a little bit of extra effort beyond the basic
read aligner, because we want to align past gaps in the graph. The
way we implemented this was to break the reference up into a series of
local alignments, each performed independently and then stitched together.
Again, we tried to keep the API simple. After creating a ReadAligner object,
aligner = khmer.ReadAligner(graph, trusted_cutoff, bits_theta)
there's a single function that takes in the graph and the sequence (potentially
genome/chr sized) to align:
score, alignment = align_long(graph, aligner, sequence)
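To make this concrete, here's a minimal end-to-end sketch. The
Countgraph constructor and consume call follow khmer's Python API
(exact names vary a bit between khmer versions), and the align_long
import is an assumption -- in practice it lives in a sandbox script
rather than an installed module:

import khmer
import screed                        # khmer's companion FASTA/FASTQ parser
from graphalign import align_long    # hypothetical import; align_long
                                     # actually lives in a sandbox script

K = 21                                     # k-mer size for the graph
graph = khmer.Countgraph(K, int(1e8), 4)   # khmer 2.x-style constructor
graph.consume_seqfile('reads.fq.gz')       # single pass over the reads

trusted_cutoff = 2      # k-mers with counts above this are "trusted"
bits_theta = 1.0        # aligner scoring parameter
aligner = khmer.ReadAligner(graph, trusted_cutoff, bits_theta)

# align the (potentially chromosome-sized) reference to the read graph
reference = next(iter(screed.open('ref.fa'))).sequence
score, alignment = align_long(graph, aligner, reference)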
What is returned is a score and an alignment object that gives us access
to the raw alignment, some basic stats, and "variant calling" functionality -
essentially, reporting of where the alignments are not identical. This is
pretty simple to implement:
def report_variants(graph_alignment, read_alignment):
    # walk the two gapped alignment strings in parallel, reporting
    # positions where they disagree
    for n, (a, b) in enumerate(zip(graph_alignment, read_alignment)):
        if a != b:
            yield n, a, b
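Hypothetical usage, with two gapped alignment strings in hand
(report_variants is just my name for the wrapper above, not khmer's
API):

graph_aln = 'ATTTTG-AAGTG'
ref_aln   = 'ATTTTGCAAGTG'
for pos, graph_base, ref_base in report_variants(graph_aln, ref_aln):
    print(pos, graph_base, ref_base)   # -> 6 - C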
The current implementation of the variant caller does nothing beyond
reporting where an aligned sequence differs from the graph; this is
kind of like guided assembly. In the future, the plan is to extend it
with reference-free assembly.
To see this in action for a simulated data set, look at the file
sim.align.out
-- we get alignments like this, highlighting mismatches:
ATTTTGTAAGTGCTCTATCCGTTGTAGGAAGTGAAAGATGACGTTGCGGCCGTCGCTGTT
|||||||||||||||||||| |||||||||||||||||||||||||||||||||||||||
ATTTTGTAAGTGCTCTATCCCTTGTAGGAAGTGAAAGATGACGTTGCGGCCGTCGCTGTT
(Note that the full alignment shows there's a bug in the read aligner
at the ends of graphs. :)
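Producing that display from the two gapped alignment strings is
straightforward; here's a sketch (print_alignment is my name, and the
60-column width matches the output above):

def print_alignment(graph_aln, ref_aln, width=60):
    # render two gapped alignment strings in blocks, with '|' marking
    # positions where they agree
    for start in range(0, len(graph_aln), width):
        top = graph_aln[start:start + width]
        bot = ref_aln[start:start + width]
        mid = ''.join('|' if a == b else ' ' for a, b in zip(top, bot))
        print(top)
        print(mid)
        print(bot)
        print()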
It works OK for whole-genome bacterial stuff, too. If we take an
E. coli data set (the same one we used in the semi-streaming paper) and just run the reads against
the known reference genome, we'll get 74 differences between the graph
and the reference genome, out of 4,639,680 positions -- an identity of
99.998% (variants-ecoli.txt).
On the one hand, this is not that great (74 differences in 4.6 Mbp is
about one every 63 kbp; at that rate, something the size of the 3.1 Gbp
human genome would yield roughly 50,000 false positives!); on the other
hand, as with error correction,
the whole analysis stack is surprisingly simple, and we haven't spent
any time tuning it yet.
Simulated variants, and targeted variant calling
With simulated variants in the E. coli genome, it does pretty well.
Here, rather than changing the genome and generating synthetic reads,
we kept the same real reads as before and instead changed the reference
genome that we align against the read graph. This was done with
the patch-ecoli.py script,
which changes an A to a C at position 500,000, removes two bases at
position 2,000,000, and adds two bases at position 3,000,000.
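I haven't reproduced patch-ecoli.py here, but a sketch of what it
presumably does looks like this -- the file names and the inserted
bases are illustrative, and the edits are applied from highest
coordinate down so earlier edits don't shift later positions:

import screed

genome = next(iter(screed.open('ecoli-genome.fa'))).sequence

genome = genome[:3000000] + 'AA' + genome[3000000:]  # add 2 bp at 3,000,000
genome = genome[:2000000] + genome[2000002:]         # remove 2 bp at 2,000,000
genome = genome[:500000] + 'C' + genome[500001:]     # change A to C at 500,000

with open('ecoli-patched.fa', 'w') as fp:
    fp.write('>ecoli-patched\n{}\n'.format(genome))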
When we align the "patched" E. coli genome against the read graph, we
indeed recover all three variants (see variants-patched.txt),
against the background of the same false positives we saw with the
unaltered genome. So that's kind of nice.
What's even neater is that we can do targeted variant calling
directly against the graph -- suppose, for example, that we're
interested in just a few regions of the reference. With the normal
mapping-based variant calling, you need to map all the reads first
before querying for variants by location, because mapping requires the
use of the entire reference. Here, you are already looking at all the
reads in the graph form, so you can query just the regions you're
interested in.
So, for example, here you can align just the patched regions (in
ecoli-patched-segments.fa) against the read graph and get the same
answer you got when aligning the entire reference (see
ecoli-patched-segments.align.out).
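In code, targeted calling is just the same align_long call run over the
regions of interest, with graph and aligner built as in the sketch
above; alignment.variants() here is a hypothetical stand-in for the
reporting functionality described earlier, not a documented khmer
method:

import screed

for record in screed.open('ecoli-patched-segments.fa'):
    score, alignment = align_long(graph, aligner, record.sequence)
    for pos, graph_base, ref_base in alignment.variants():  # hypothetical API
        print(record.name, pos, graph_base, ref_base)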
This works in part because we're stitching together local alignments,
so there are some caveats in cases where different overlapping query
sequences might lead to different optimal alignments - further
research needed.
Speed considerations
Once you've created the graph (which takes time linear in the number of
reads), things are pretty fast. For the E. coli data set, a full
reference-to-graph alignment takes about 25 seconds on my Mac laptop.
Much of the code is still written in Python, so we hope to get this
under 5 seconds.
In the future, we expect this to get much faster. Since the alignment is
guided and piecewise, it should be capable of aligning through highly
repetitive regions and is also massively parallelizable. We think that
the main bottleneck is going to be loading in the reads. We're
working on optimizing the loading separately, but we're hoping to get
down to about 8 hours for a full ~50x human genome variant calling
with this method on a single CPU.
Memory considerations
The memory is dominated by graph size, which in turn is dominated by
the errors in short-read Illumina data. We have efficient ways of
trimming some of these errors and/or compressing the data even when we
don't correct them outright; the right approach will depend on details
of the data (haploid? diploid? polyploid?) and will have to be studied.
For E. coli, we do the above variant calling in under 400 MB of RAM.
We should be able to get that down to under 100 MB of RAM easily
enough, but we will have to look into exactly what happens as we
compress our graph down.
From the Minia paper, we can place
some expectations on the memory usage for diploid human genome
assembly. (We don't use cascading Bloom filters, but our approaches
are approximately equivalent.) We believe we can get down to under 10
GB of RAM here.
Additional thoughts
As with most of our methods, this approach should work directly for
variant calling on RNAseq and metagenomic data with little alteration.
We have a variety of graph preparation methods (straight-up graph
loading as well as digital normalization and abundance slicing)
that can be applied to align to everything while favoring
high-coverage reads, or only to high coverage, or to error-trimmed
reads, or...
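As one example of such a preparation, here's a rough sketch of
abundance-based error trimming before the final graph is built, using
khmer's trim_on_abundance as I understand it; the cutoff and file names
are illustrative:

import khmer
import screed

CUTOFF = 2                                  # illustrative abundance cutoff
counts = khmer.Countgraph(21, int(1e8), 4)
for record in screed.open('reads.fq.gz'):
    counts.consume(record.sequence)

with open('reads.trimmed.fa', 'w') as fp:
    for record in screed.open('reads.fq.gz'):
        trimmed, _ = counts.trim_on_abundance(record.sequence, CUTOFF)
        if len(trimmed) >= 21:              # keep reads with >= one k-mer
            fp.write('>{}\n{}\n'.format(record.name, trimmed))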
In effect, what we're doing is (rather boring) reference-guided
assembly. Wouldn't it be nice if we extended it to longer indels, as
in Holtgrewe et al., 2015? Yes, it would. Then
we could ask for an assembly to be done between two points... This
would enable the kinds of approaches that (e.g.) Rimmer et al., 2014
describe.
One big problem with this approach is that we only report positions in
the reference where the graph contains no agreeing path - this will
cause problems when querying diploid data sets against a single
reference, where we really want to know all variants, including
heterozygous ones where the reference carries one of the two alleles. We can
think of several approaches to resolving this, but haven't implemented
them yet.
A related drawback is that we have so far presented no way of
representing multiple data sets in the same graph;
this means that you can't align against many different data sets all at
once. You also can't take advantage of things like the contiguity
granted by long reads in many useful ways, nor can you do haplotyping
with long reads. Stay tuned...
References and previous work
A number of people have done previous work on graph-based variant calling --
Zam Iqbal and Mario Caccamo's Cortex was my first introduction to this
area. Since then, Zam's work, as well as some of the work that Jared
Simpson is doing on FM indices, has been a source of inspiration.
(See especially Zam's very nice comment
on our error correction post!)
Heng Li's FermiKit does
something very similar to what we're proposing to do, although it
seems like he effectively does an assembly before calling variants.
This has some positives and some negatives that we'll have to
explore.
Kimura and Koike (2015)
do variant calling on a Burrows-Wheeler transform of short-read
data, which is very similar to what we're doing.
Using k-mers to find variation is nothing new. Two articles that
caught my eye -- BreaKmer (Abo et al., 2015) and kSNP3 (Gardner
et al., 2015)
both do this to great effect.
The GA4GH is working on graph-based variant calling, primarily for
humans. So far it seems like they are planning to rely on well-curated
genomes and variants; I'm going to be working with (much) poorer
quality genomes, which may account for some differences in how we're
thinking about things.