One of our long-term interests has been in figuring out what the
!$!$!#!#%! assemblers actually do to real data, given all their
heuristics. A continuing challenge in this space is that short-read
assemblers deal with really large amounts of noisy data, and it can
be extremely hard to look at assembly results without running into
this noise head-on. It turns out that being able to label De Bruijn
graphs efficiently and align
reads to graphs can
help us explore assemblies in a variety of ways.
The two basic challenges are noisy data and lots of data. When
(for example) looking at what fraction of reads has been incorporated
into an assembly, noise causes problems because a read may have been
corrected during assembly. This is where graph alignment comes in
handy, because we can use it to align reads to the full graph and get
rid of much of this noise. Lots of data complicates things because
it's very hard to look at reads individually - you need to treat them
in aggregate, and it's much easier just to look at the reads that
match to your assembly than to investigate the oddball reads that
don't assemble. And this is where the combination of graph alignment
and labeling helps, because it's easy to count and extract reads based
on overlaps with labels, as well as to summarize those overlaps.
The main question we will be asking below is: can we measure overlaps
and disjoint components in graph extents, that is, in unique
portions of assembly graphs? We will be doing this using our sparse
graph instead
of counting nodes or k-mers, for two reasons: first, the covering is
largely independent of coverage, and second, the number of sparse
nodes is a lot smaller than the total number of k-mers.
The underlying approach is straightforward:
- load contigs or reads from A into the graph, tagging sparse nodes as we go;
- load contigs or reads from B into the graph, tagging sparse nodes as we go;
- count the number of tagged nodes that are unique to A, unique to B, and
in the overlap;
- optionally do graph alignment as you load in reads, to ignore errors.
Some basics
Let's start with simulations, as usual. We'll set up two randomly
generated chromosomes, a and b, of equal size, both in genomes.fa,
and look at genome-a extent within the context of both (target
'fake_a' in Makefile):
./compare-graphs.py genomes.fa genome-b.fa
all tags: 52
n tags in A: 52
n tags in B: 26
tags in A but not in B 26
tags in B but not in A 0
So far so good -- there's a 50% overlap between one of the chromosomes
and the total.
If we now generate reads from genome-b.fa and do the graph comparison
with the reads, we get silly results (target 'fake_b' in Makefile):
./compare-graphs.py genomes.fa reads-b.fa
all tags: 135
n tags in A: 109
n tags in B: 107
tags in A but not in B 28
tags in B but not in A 26
Despite knowing by construction that all of the reads came from
genome-b, we're getting results that there's a lot of tags in the
reads that aren't in the genome. This is because of errors in the
reads, which introduce many spurious branches in the graph.
This is now where the read aligner comes in; we can do the same
comparison, but this time we can ask that the reads be aligned to the
genome, thus eliminating most of the errors in the comparison:
./compare-graphs.py genomes.fa reads-b.fa --align-b
all tags: 99
n tags in A: 99
n tags in B: 72
tags in A but not in B 27
tags in B but not in A 0
At this point we can go in and look at the original tags in A that aren't
covered in B (there are 52) and note that B is missing approximately half
of the graph extent in A.
Trying it out on some real data
Let's try evaluating a reference against some low-coverage reads.
Using the same mouse reference transcriptome & subset of reads that
we've been using in previous blog posts, we can ask
"how many sparse nodes are unaccounted for in the mouse transcriptome
when we look at the reads?" (Note, the mouse transcriptome was not
generated from this data set; this is the reference transcriptome.)
The answer (target rna-compare-noalign.txt
in the Makefile)
is:
all tags: 1959121
n tags in A: 1878475
n tags in B: 644963
tags in A but not in B 1314158
tags in B but not in A 80646
About 12.5% of the reads in (B; 80646 / 644963) don't pick up tags in
the official reference transcriptome (A).
Interestingly, the results with alignment are essentially the same
(target rna-compare-align.txt):
all tags: 1958219
n tags in A: 1877685
n tags in B: 643655
tags in A but not in B 1314564
tags in B but not in A 80534
suggesting that, by and large, these reads are disjoint from the
existing assembly, and not mere sequencing errors. (This may be
because we require that the entire read be mappable to the graph in
order to count it, though.)
Evaluating trimming
One of the interesting questions that's somewhat hard to investigate
in terms of transcriptome assembly is, how beneficial is read
trimming to the assembly? The
intuition here (that I agree with) is that generally sequence trimming
lowers the effective coverage for assembly, and hence loses you
assembled sequence. Typically this is measured by running an
assembler against the reads, which is slightly problematic because the
assembler could have all sorts of strange interactions with the
trimming.
So, can we look at the effect of trimming in terms of sparse nodes?
Sure!
Suppose we do a stringent round of trimming on our RNAseq (Trimmomatic
SLIDINGWINDOW:4:30) - what do we lose?
On this low coverage data set, where A is the graph formed from the
trimmed reads and B is the graph from the raw reads, we see (target
rseq-hardtrim-ba-noalign.txt):
all tags: 588615
n tags in A: 518980
n tags in B: 588615
tags in A but not in B 0
tags in B but not in A 69635
we see about 12% of the sparse nodes missing from the trimmed data.
If we run the read aligner with a low coverage cutoff (target
rseq-hardtrim-ba-align1.txt),
we see:
all tags: 569280
n tags in A: 519396
n tags in B: 561757
tags in A but not in B 7523
tags in B but not in A 49884
Basically, we recover about 20,000 tags in B (69,635 - 49,884) with
alignment vs exact matches, so a few percent; but we also lose about
half that (7,500) for reasons that we don't entirely understand
(wiggle in the graph aligner?)
We have no firm conclusions here, except to say that this should be a
way to evaluate the effect of different trimming on graph extent, which
should be more reliable than looking at the effect on assemblies.
Notes and miscellany
- There is no inherent coverage model embedded here, so as long as we can
correct for the density of tags, we can apply these approaches to
genomes, metagenomes, and transcriptomes.
- It's actually very easy to extract the reads that do or don't match,
but our current scripts don't let us do so based on labels.
- We aren't really using the labeling here, just the tagging - but
labeling can enable n-way comparisons between e.g. different
assemblies and different treatments, because it lets us examine
which tags show up in different combinations of data sets.
There are comments.