Mon, 30 Aug 2010

Evolution of chordate features -- looking at the Molgula


(with Billie Swalla)

I've spent the last two weeks out at the Roscoff Statione Biologique in Roscoff, France. This little port is on the northern coast of the French region of Bretagne, or Brittany. I'm here with Billie Swalla, a professor at UW Seattle, and Elijah Lowe, a Computer Science PhD student from Michigan State.

Why are we here? Therein lies a tale, about tails.

Roscoff is the only place in the world where you can find two particular species of ascidians (that is, members of the Tunicata (Urochordata), the phylum currently thought to be most closely related to the vertebrates). These two species are named Molgula occulta and Molgula oculata. In their adult form, they look very similar: they are little sessile blobs sitting on the ocean floor, covered in sand, happily filter feeding for a living. The only apparent difference is that M. oculata has a white spot between its two siphons, is a little bit bigger, and likes to cover itself in lighter sand and bigger rocks.

http://ivory.idyll.org/permanent/roscoff/molgula-in-tank.jpg

Caption: This picture is of M. oculata; you can see the big 'uns on the right, with two clear siphons. You can't see the white spot between them, however.

When you look at ascidian embryos and larvae, however, you get a very different picture. While the adults superficially look nothing like chordates, the embryos have a characteristic chordate development (including the rolling up of the neural tube) and the larvae generally have a notochord and tail. This is one of the reasons that ascidians are so interesting to evolutionary biologists: by their embryos they are clearly chordates, and share relatively recent ancestry (~550-800m years) with vertebrates, yet they have a lifestyle -- active juvenile and sessile adult -- shared only by tenured professors.

So these Molgula are ascidians, with tadpole larvae. And some of them have tails. But many do not. That is, while almost all ascidian larvae have tails, and the last common ancestor of the Molgula almost certainly did as well, many of the Molgula species do not have a tail -- and that's why we're in Roscoff.

M. oculata and M. occulta larvae are, respectively, tailed and tailless. Otherwise very similar in early development and after metamorphosis, the oculata larvae (hence "tailed") have tails, while the occulta larvae (hence "tailless") do not. M. occulta (tailless) are also missing an otolith, a gravity-sensing organ present in M. oculata (tailed) that one would presume is useful during their motile phase.

http://ivory.idyll.org/permanent/roscoff/molgula-larvae.jpg

Caption: M. occulta (left) and M. oculata (right) larvae. Note the lack of a tail and an otolith (dark spot) on the left, tailless larva.

So here we have two species, quite closely related, living in similar if not overlapping environments, developing along similar timelines, with quite distinct morphologies in their tadpole larva phase. But that's not all.

The icing on the cake is this: you can interbreed the tailed and tailless species. That is, M. oculata sperm can fertilize M. occulta eggs, and vice versa. (The details are a bit tricky because ascidians are hermaphrodites, M. oculata can self-fertilize, and M. occulta cannot. So you have to be really careful during the fertilization stage But never mind.) And what do you get when you hybridize them? Little tadpole larvae that have a half-size tail, with 20 notochord cells instead of 40; and an otolith.

http://ivory.idyll.org/permanent/roscoff/molgula-hybrids.jpg

Caption: A: M. oculata (tailed); B: occulta x oculata hybrid; C: M. occulta (tailless).

Let me stress that we had no a priori reason to believe that by crossing tailed and tailless species, we would get a half-tail. That would be like a child with brown hair having a blonde parent and a black-haired parent -- traits don't always average. In fact, while you could guess that the hybrids might have a tail, it would be equally likely (given no other information) that they would have no tail, or a full-length tail (blonde or black hair, rather than brown).

The fact that hybrids -- from sperm from tailed ascidian and eggs from a tailless ascidians -- have a tail at all is pretty surprising, and suggests (Occam's Razor and all that) that the molecular mechanism underlying tail-less-ness is some sort of loss in the genome. This is a decent guess mainly because sperm tend not to have too much in the way of RNAs or proteins in them: their main contribution to the embryo is the daddy genome. If, in contributing their genome, they induce a trait that was lost in the mommy genome, it is pretty plausible that they are doing so by re-adding in a genomic locus that was mutated or lost in the mommy genome. But we really don't know; understanding things at that level of molecular detail is a pretty difficult thing to do.

So these two Molgula are interesting because they have a really distinct set of features, are closely related enough to hybridize, and produce interesting hybrids. They're particularly interesting because they are chordates, and the features that under investigation are some of the defining features of chordates.

All of this is old hat, in fact, with the hybrids most recently published on by Swalla and Jeffery in the late 1990s. (Yes, that is the same Swalla, of course.)

So, what are we doing? And why I am I involved?

I'll give you a hint -- it involves large-scale sequencing.

A bientot!

--titus

posted at: 03:38 | path: /aug-10 | 0 comments

Tags: , ,


Sun, 29 Aug 2010

Assembly is hard because it's not decomposable


(with Adina Howe, Jason Pell, Rosangela Canino-Koning, and Arend Hintze).

Introduction

A few weeks ago I blogged a bit about a k-mer filtering system, khmer, that we were using to reduce metagenomic data to a more tractable size by throwing out error-prone reads (see A memory efficient way to remote low-abundance k-mers from large DNA data sets). No sooner had we tried that, than did we realize that we were probably primarily throwing away good, if low-abundance data (see Illumina reads and their features). No matter: we couldn't assemble the original data sets anyway, so we had to get rid of some of it, right?

The subject of this blog post is not on how to best throw away data. (I'll address that in a few weeks.) Instead, it's on why we have to throw away data in the first place. More precisely,

Why is assembly hard?

First, some background. Imagine you have some long-ish strings (1mn - 200 mn in length), composed of only the letters A, C, G, and T, and you want to know what the sequence of the strings is. You can't actually read the sequences directly; they're too physically small. But you can randomly retrieve short subsequences ~100-1000 letters in length from the original long sequences. You don't know where they're from on the original sequence, or even which of the original sequences they're from. And the process of retrieval is error-prone, so you can't even trust the exact sequence you get. But you do know that, by and large, the short sequences are mostly correct; and (the most important bit) that you can get as many of these short sequences as you want, within $$ limitations.

From this kind of information you want to reconstruct the original sequences.

This is a basic description of the process of shotgun sequencing, in which you take DNA, shred it, and then sequence from it randomly -- many, many times. And it lays out the basic problem of assembly, too: you want to figure out how to reconstruct the original sequences from the little subsequences that you actually have.

If you are a computer scientist, you can probably already think of some basic ways to proceed. For example, you could do an all-by-all comparison of the short sequences, lay out which ones overlap and how, build a map of the overlaps, and try to build a tiling path that maximizes the connectivity of your map. Voila! Some approximation of the original sequences results! This approach is known as the overlap-layout-consensus approach, where at the end you produce a consensus view of the original sequence based on all the reads you have.

If you are a computer scientist or someone who programs for a living, you will also immediately recognize this as a rilly rilly hard problem! Forget biological peccadilloes; just doing this efficiently for large collections of sequences is computationally quite difficult. In particular, the all-by-all comparison is brutal: the number of comparisons scales as N**2 with the number of sequences N, so even if it's relatively efficient to compare two sequences, the problem behaves poorly as your data set grows. Plus, building a map of the overlaps is another hard problem: holding all that information in memory requires (yep!) O(N**2) memory, which is not cheap.

Is there any easy way to break down the problem? After all, big computers aren't cheap, but small computers are; so if you could split the problem into many smaller chunks, you could imagine using a grid or Beowulf approach, and just buying lots and lots of cheap hardware to scale.

Alas, the problem isn't easy to subdivide. It's easy to see why, if you think about the nature of the original sequences. Here's a little diagram; suppose, for example, that you have four subsequences all derived from one original sequence:

(orig) atggaccagatgagagcatgagccatggacggatcatggaaaacggttaaaaggggcatgg

(1)    atggaccagatgagagca
(2)                 gagcatgagccatggacggatc
(3)                                  ggatcatggaaaacggttaaaa
(4)                                                  ttaaaaggggcatgg

If the layout above is the only way that subsequences 1-4 overlap and can assemble, then to decompose the overlap problem across multiple computers would involve sending (1) and (2) to one computer, and (3) and (4) to another, assembling them there, and then taking the results and composing them on a shared node. Unfortunately, to do this efficiently currently requires that you know that 1 and 2 overlap, and that 3 and 4 overlap -- which is basically the problem that you already need to solve!

As I understand it -- I'm not a computer scientist unless you look at my letterhead -- there is simply no efficient way to decompose the overlap-layout-consensus assembly algorithm without either assuming something about the structure of the data, and/or introducing errors. (If you disagree, I'd appreciate either a reference or an implementation; thanks ;)

The second, or possibly third, generation of assemblers

OK, but computer scientists and computational biologists aren't dumb, and they like to tackle hard problems, and frankly this is an incredibly important problem to solve (for all sorts of reasons that you'll have to trust me on for now). Moreover, N^2 scaling is simply unacceptable!

Newer assemblers use a de Bruijn graph approach. Essentially, this involves breaking the subsequences down into fixed-length words of length k, and constructing an overlap graph. For example, taking the sequences above,:

(orig) atggaccagatgagagcatgagccatggacggatcatggaaaacggttaaaaggggcatgg

(1)    atggaccagatgagagca
(2)                 gagcatgagccatggacggatc
(3)                                  ggatcatggaaaacggttaaaa
(4)                                                  ttaaaaggggcatgg

you would break the original sequences down into words of length (say) 5, yielding:

atgga   gatga   catga   atgga   atcat   aaacg    aaagg
 tggac   atgag   atgag   tggac   tcatg   aacgg    aaggg
  ggacc   tgaga   tgagc   ggacg   catgg   acggt    agggg
   gacca   gagag   gagcc   gacgg   atgga   cggtt    ggggc
    accag   agagc   agcca   acgga   tggaa   ggtta    gggca
     ccaga   gagca   gccat   cggat   ggaaa   gttaa    ggcat
      cagat   agcat   ccatg   ggatc   gaaaa   ttaaa    gcatg
       agatg   gcatg   catgg   gatca   aaaac   taaaa    catgg
                                                aaaag

The overlaps between k-mers now implicitly give you a graph connecting each k-mer to all overlapping k-mers; and if you can find a path that traverses every node in this graph once, you will have your original contig.

Note that this actually works, although of course k must be much bigger than 5 in practice, and there are all sorts of cute tricks you must play to do a good job of disentangling complicated graphs.

Why is this an advantage over the overlap/layout/consensus approach that we looked at first? I'm not sure I've identified all the reasons, but there are at least two very important ones.

First, memory usage. While your memory usage for finding overlaps grows > O(N) with the overlap approach (with sparse matrices it should be N log N, I think?), the de Bruijn graph approach consumes only as much memory as you need to represent each new k-mer (so, with the number of novel k-mers) as well as the connections between them (which can be implicitly represented if you have efficient k-mer lookup). For large, deeply sequenced data sets this is going to be a huge savings: there are only three billion bases in the human genome, and probably only two billion unique k-mers of length 32 -- so if you can store k-mers efficiently (hint: you can) then the de Bruijn graph approach is really great.

Second, k-mers and k-mer overlaps can be stored and queried efficiently -- you just use a hash table or a trie structure. For example, you can store all 4**17 k-mers of length 17 as 34-bit offsets in a hash table (2 bits per DNA base), or you can use a branching trie structure to store arbitrarily long k-mers (see tallymer). Hash tables are be efficient (if big) representations for densely occupied k-mer spaces, while tries will be efficient for sparsely occupied k-mer spaces. Arbitrary length sequences are comparatively difficult to store and query.

The de Bruijn graph approach is what Velvet, ABySS, and SOAPdenovo use, and it seems to work well.

So what's the problem?

Scaling. Scaling is the problem.

Well, that and the sequencing companies and the biologists.

Let me explain. Sequencing companies are producing newer and bigger and better machines, that produce more and more sequence, every week. The Illumina GA2 produces 10-100 Gb of sequence per run now. The HiSeq 2000 is going to produce even more enormous amounts of sequence as soon as we get one. And more, lots more, is on the way.

This wouldn't be a problem if biologists would just stick to the exciting old problems, like resequencing humans and doing transcriptomes etc. But noooo, biologists see these juicy new sequencers and think -- hey! I could sequence populations of organisms! Or, like, 30 new organisms at once! Or 30 transcriptomes at once! And it will be cheap! (And we'll have someone else do the bioinformatics, which is easy, right? Right?)

So the sequencing companies are producing newer and cheaper and faster sequencing machines, and the biologists are using them to tackle ever more exciting and novel and challenging biological questions, and ... guess what? Our existing tools and approaches don't scale very well.

For one very specific example, the de Bruijn graph approach breaks down completely if you are sequencing endlessly diverse populations, as we seem to be doing in metagenomics. If you have some high abundance organisms, and a lot more low abundance organisms, and you sequence the organism soup to some arbitrary level, the novel k-mers will swamp your assembler, and to no end -- because those k-mers are never going to assemble to anything big without more sequencing. In which case you've compounded your swamping problem in an attempt to solve your earlier problem.

Similar things happen with wild population sequencing, where you get new and diverse sequences every time you look at a new animal; humans, even with their relatively low diversity, are one fine example.

OK, so this is the problem to solve, and I think it's a really big problem. It's not decomposable so it can't be made to scale well, and we're already at the limit of our existing compute infrastructure for the data we already have. (See Terabase metagenomics -- the computational side and grim future for sequencing centers.) And as we try to inch the boundaries along, the sequencing companies are producing new and bigger machines to give us new and bigger amounts of data.

Are there any solutions? No really good ones, unfortunately. The solution du jour (see MetaHIT methods and my earlier blog posts) is to throw away low abundance data that you figure won't assemble, and/or subdivide the sequences by abundance, in the expectation that similar abundance sequences will come from the same original genome. These are basically approximation heuristics, hoping to reduce the data in such a way that the assembler can deal with it. The hope is that the assembler can do a not-terribly-bad-job of assembly based on the known structure of the population.

Moreover, the throwing-away-data solution won't scale very well; soon enough you'll be throwing away not just 90% of the data, but 99% of the data, just to get a tractable data set.

We are doomed, doomed I say! Clearly we should give up.

Anyway, this concludes part one of a series of blog posts on assembly. In part two, I plan to talk a bit about paired-end sequencing and repeat sequences.

--titus

p.s. An excellent assembly algorithm reference: Miller, Koren, and Sutton, Genomics, 2010.

posted at: 15:07 | path: /aug-10 | 2 comments

Tags: , ,


Thu, 26 Aug 2010

Galileo, Open Science, and History


I'm a big believer in open science -- see this great polemic over at Mendeley for a good read -- but it's always interesting to think about how such things as "data release" can be perverted by clever scientists. I'm currently in France working on some ascidians with Billie Swalla -- more on that later -- and we've been talking about what data we plan to release, and how. During these talks (leisurely conducted over cafe au lait and chausson pomme, of course!) Billie brought up an interesting historical parallel.

The story, as I understand it, is this: when Galileo Galilei first looked through a good quality telescope and discovered Jupiter's moons, nobody believed him. Since he was the only person able to make such good telescopes, he actually made and distributed them to other scientists -- not just as a profitable sideline, but so that the other scientists could confirm his observations!

One could see this a first step towards "open science": in order to reproduce Galileo's observations, astronomers had to have a telescope that only Galileo could make. So Galileo had to make telescopes and send them out, thus allowing others to both reproduce his observations and build upon them.

The story takes on a different aura, however, when you realize that Galileo could have just given out the actual manufacturing instructions for the telescopes, but didn't. Two possible reasons are money (he made money selling the telescopes to others) and scientific miserliness: he didn't want others to get credit for building on his results. As long as he withheld the details necessary to reproduce his instruments, he ensured that no one could build on his results, and that he would have preeminence in astronomy. (The parables between this and source code are uncanny, no?)

It was quite a balancing act. To quote from Dr. Biagioli's "Replication or Monopoly" (pdf here),

"His primary worry was not that some people might reject his claims, but rather that those able to replicate them could too easily proceed to make further discoveries on their own and deprive him of future credit (Galilei 1989, 17). Consequently, he tried to slow down potential replicators to prevent them from becoming competitors. He did so by not providing other practitioners access to high-power telescopes and by withholding detailed information about how to build them.

But as important as it was for Galileo to keep his fellow astronomers in the dark, such negative tactics alone would not have allowed him to gain credit from his discoveries and move from his post at the university of Padua to a position at the Medici court in Florence as mathematician and philosopher of the grand duke - goals clearly on his mind in 1610.He needed proactive tactics as well. First, he did his best to make sure the grand duke saw the satellites of Jupiter (which Galileo had named "Medicean Stars") by sending detailed instructions to Florence on how to conduct these observations, and then by going to court himself at Easter time (Galilei 1890- 1909, X:281, 304). Second, through the prompt publication of the Sidereus nuncius in March of 1610 he tried to establish priority and international visibility - resources he needed to impress his prospective patron, not just the republic of letters.

The Nuncius was carefully crafted to maximize the credit Galileo could expect from readers while minimizing the information given out to potential competitors."

Here you can see calculation as fine as any modern professor, trying to decide if they should release all their data, or only some of it; all of their source code, or only a crippled version.

Billie also observes that one potential irony in this story is that Galileo, by so strongly taking sole credit for his discoveries, made himself a clear target for the Catholic Church...

An even more pernicious approach, seeking priority while avoiding embarrassment by publishing hashes (well, anagrams ;) of formulae or observations, was common in the 17th century. In The Newton Handbook, by Derek Gjertsen, Gjertsen writes:

"It was not uncommon for seventeenth-century scientists to record their more valued results in the form of anagrams. Thus, Galileo published his discovery in 1610 of the phases of Venus in a thirty-five letter anagram, Huygens announced his 1656 observation that Saturn was surrounded by a ring in a sixty-three letter anagram, while, in England, Robert Hooke and Christopher Wren resorted to similar stratagems. The advantages of the ploy are obious. Priority was established, yet nothing was given away to potential rivals. If, by chance, the work failed to stand up to further analysis it could be quietly forgotten without the embarrassment public failures tended to incur."

One can only wonder how many one-shot awesome Science and Nature papers, using software that was and remains unavailable, are entirely unreplicable or otherwise uninteresting -- for example, I like to pick on one of Eran Segal's publications, because it's so neat and yet very very difficult to replicate without source code. (A colleague is trying.)

Compare this to the recent discussion of the (leaked) P != NP proof, now shown to be erroneous - see, e.g., Greg Baker's blog post, P != NP. Now this is the way science is supposed to work! Quick, thoughtful commentary by experts, highlighting potential problems with your work -- and allowing or enabling others to build off of it.

It's clear to see that by withholding the manufacturing instructions, Galileo may well have held back astronomy as a whole. And by publishing their equations in anagram form, it's likely that Newton and the others did damage to science as a whole.

Today, intellectual reputations like that are in some ways less important (at least in my bottom-feeding scientific world). Publications and citations are more important, since they're measurable by Promotion & Tenure committees. I (and probably many other scientists) are continually worrying about the line between publishing good stuff that enables citations, and giving away all of our future research directions. It takes a real act of faith to throw yourself off the cliff and offer up your latest & greatest source code and data to the world, in the hopes that somehow the resulting "usefulness" will provide lift to your career. We'll see how that goes: road kill? Or tenure?

Back to Galileo -- I think the Galileo example is why, as wonderful as the Panton Principles are for data, for truly open science it's critical to provide not only the raw data, but the source code used to do the analysis. And not only the source code, but useful source code: documented and tested source code [1]. To do anything else would be the equivalent of selling telescopes while withholding the manufacturing instructions that would let others build on your own ideas.

Interesting stuff to think about! Now, back to science...

--titus

[1] Yeah, I realize that most scientific source code probably isn't documented or tested. Draw your own conclusions there ;).

posted at: 14:07 | path: /aug-10 | 3 comments

Tags: ,


Tue, 24 Aug 2010

CSE 891, Computational Science for Evolutionary Biologists


Course page at: http://ged.msu.edu/courses/2010-fall-cse-891/:

This course will introduce biologists to computational thinking, practical computational techniques, and research topics in computational evolution. The course will consist of three intensive hands-on 5-week modules: computational competence in UNIX; data mining and hypothesis generation using the Avida digital life platform and computational analysis of large-scale resequencing data from the Lenski Lab E. coli long-term evolution experiment.

This course is one of the two required entry courses for BEACON, an NSF-funded center for the study of Evolution in Action (see www.beacon.msu.edu).

The course will have 15-20 students from Michigan State (MSU), 2-3 from UW Seattle, 1-2 from U Idaho, and 2-3 from UT Austin (all partners in BEACON).

Briefly, we will be using EC2 and S3 to

  1. demonstrate remote computer use and remote data analysis
  2. compile, install, and run the Avida Digital Life system (see http://devolab.msu.edu and http://en.wikipedia.org/wiki/Avida)
  3. do resequencing analysis of the Lenski Lab's E. coli long-term evolution experiment (http://en.wikipedia.org/wiki/E._coli_long-term_evolution_experiment)

All of our materials will be made freely available under a Creative Commons license, as with my last EC2-using course (see: http://ivory.idyll.org/blog/jun-10/ngs-course-postmortem.html and http://ged.msu.edu/angus/).

We also hope to "export" this course outside of BEACON in the future.

--

I am open to a few people from outside the BEACON consortium attending the course via teleconf, although we may not grade your homework :). Let me know if this is of interest to you: ctb@msu.edu.

--titus

(Sigh, of 6 total semesters in which I've taught at MSU, I've taught 4 new courses. What am I thinking??)

posted at: 08:38 | path: /aug-10 | 2 comments

Tags: , , ,


Wed, 04 Aug 2010

Illumina reads and their features


(This project is a collaboration with Jason Pell and Adina Howe)

A few weeks ago I posted about a k-mer filtering approach that we were using to remove low-abundance k-mers from metagenomic data sets, prior to assembly. This technique is working well, and we've managed to do some assembly of soil data sets. Our next challenge there is figuring out how to evaluate the quality of the assembly!

At the time of this previous post, I said

... any sequence arising as the result of a sequencing error is going to be extremely rare ...

and the mental model we've been using for filtering was that since Illumina reads are error-prone -- especially on the 3' side -- removal of sequences containing low abundance k-mers would help screen out sequences containing errors. The logic here is that if you break a sequence, like

atggAgtac

down into (say) 4-mers,

atgg
 tggA
  ggAg
   gAgt
    Agta
     gtac

and there was a single sequencing error (the 'A') in the original sequence, then all four k-mers containing the 'A' would be at low abundance in the data set. Set k=32 and you see the potential for lots and lots of low-abundance k-mers.

So I figured that the dominant effect of k-mer filtering would be removal of erroneous sequence.

It turns out I was dead wrong, and that we're actually just removing genuinely low-abundance sequences, presumably from genuinely low-abundance DNA present in the sample we're sequencing. But we took a circuitous route to get there. This is that story.

The k-mer abundance profile problem

We're interested in two things: first, eliminating reads with early errors in them; and second, trimming reads to remove the error-prone 3' end. For read trimming, we didn't know where to start trimming, so we figured that we could develop an empirical threshold by looking at the k-mer abundance profile across the read, and then choosing a point in the reads that initiated a run of a lot of low-abundance k-mers. This would be the place where the Illumina sequencing was breaking down.

So we plotted the summed abundance of 32-mers across the entire data set against their position in reads using our khmer magic, and got the green line on the following plot:

http://ivory.idyll.org/permanent/illumina-read-phenomenology/average-abundance-by-pos.png

OK, the green line doesn't show any drop in average abundance, so maybe we won't be able to trim using low-abundance k-mers as a signal for dropping quality. Minorly surprising but not stunning.

Unfortunately, Jason also plotted the average abundance of 17-mers -- red line, above plot.

It rises continuously throughout the read, suggesting that Illumina reads trend towards high-abundance 17-mers at their 3' end. WTF??

...this is not at all what you'd expect if errors were causing novel k-mers to appear at the end of reads.

Looking between the two graphs, we immediately suspected that homopolymer runs were popping up. That is, at the end of reads, erroneous runs of AAAAAA.... or CCCCC... etc. were read off by the Illumina sequencer or base calling system, resulting in a lot of low-complexity subsequences. For small k (=17), this would result in lots and lots of identical 17-mers; for larger k (=32), the homopolymer runs would be connected to more unique sequence earlier in the read and the 32-mers wouldn't all be unique.

To track this down, we first decided to look at the distribution of low (abundance = 1) and high (abundance >= 255) abundance k-mers across reads.

High- and low-abundance kmers, by position

When we looked at where unique k-mers tended to lie in the read, we found that there was a nearly uniform distribution of them (first graph, green line). Sure, there's a small uptick at the end, but if you look at the left axis, you can see that it's only about 10% -- not at all what you'd expect if the reads were very error-prone at the 3' end. The real signal is in the high-abundance k-mers, which have a huge predisposition to being at the 3' end of the read (second graph, green line). This suggested that our homopolymer mechanism for explaining the rise of the summed 17-mer abundances towards the 3' end of the reads was correct -- and, indeed, when we went in and looked at what exact sequences were present in high abundance, we saw a bunch of homopolymer runs.

http://ivory.idyll.org/permanent/illumina-read-phenomenology/abund=1.compare.png http://ivory.idyll.org/permanent/illumina-read-phenomenology/abund=255.compare.png

We verified that these high-abundance k-mers were predominantly homopolymer runs by using our Mark 1 Eyeball Detection System; here's the top of the list of abundance >= 255 k-mers:

ACAATGGATCAACAACGACAATGGATCAACAA
TGAGGCGGGGGTCACCCTGAGGCGGGGGTCAC
AATCTCATGCCTCAGCCAATCTCATGCCTCAG
CCCCCCCCCCCCCCCCCCCCCCCCTCCCCCCC
AAACCAAAAAAAAAAAAAAACAAAAAAAAAAA
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
GGGGGGGGGGGGGAGGGGGGGGGGGGGGGGAC
AAAAAAACAAAAAAAAACAAAAAACAAAAAAA
GGGGGGGGGCGGGGGGGGGGGGGGGGGGGGGG
CACCGTAGGCACCTTGGCACCGTAGGCACCTT
GGGGGGGGGGGGGGGGGGGGAGGGGGGGGGGG
AGAGTGTAGATCTCGGTGGTCGCCGTATCATT
GGGGGGGGGGGGGGGGGGGGGGGGGGGGTGGG
TGTAGGGAAAGAGTGTAGATCTCGGTGGTCGC
CTCCCCCCCCCCCCCCCCGCCCCCCCCCCCCC
GCGGGGGGGGGGGGGGGGCAGGGGGGGGGGGG
CCACCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCGCCCCCCCCCCCCCCCCCCCC
AAAAAAAAAAACCAAAAAAAAAAAAAAAACAA
AAAAAAACAAAAACAAAAAAAAAAAAAAAACA
GGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGG
CGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGA
CCCCCCACCCCCCCCACCCCCCCACCCCCCCC
GGGGGGGGGGGGGGCGGGGGGCGGGGGGGGGC

So, in the raw reads from the sequencing facility, we see a lot of homopolymer runs on the 3' end, and not much in the way of unique k-mers (indicating that there's very little single-base error.)

Trimming reads by error scores

At this point, our work on k-mers converged with some work that Adina was doing. She was looking at trimming by quality score; the more recent base calling system used by Illumina puts in phred-score=2 markers at positions within the read where the sequence quality has deteriorated to the point where it shouldn't be trusted any more. We decided to see if trimming all sequences at such positions would lead to "better" (more uniform) k-mer distributions. We therefore looked at the low- and high-abundance k-mer distributions, as above, but for quality-trimmed sequence. We got the red lines in the plots above.

For the high-abundance k-mers, we see a near-perfect leveling of the k-mer abundance distribution in the quality-trimmed reads. This indicates that the Illumina phred-score=2 prediction is working nearly perfectly for predicting the start of poor sequence stretches, as indicated by homopolymer runs. This distribution is more or less what you'd expect of good-quality sequence.

The distribution of low abundance k-mers, however, is much more puzzling. Unique k-mers are dropping in number as reads get longer. This shouldn't happen unless the 5' end of the read is much more error-prone than the 3' end -- generally not what you'd expect! The explanation for this phenomenon may be that Illumina is over-trimming the reads: that is, in an effort to get rid of low-quality sequence, they're putting in a bias against good quality scores at the 3' end of their reads. This results in a systematic over-trimming of reads by our program, and my guess is that we're probably throwing away some good sequence here. (A casual conversation with an Illumina representative at a recent conference suggests that this is in fact what's going on.)

K-mer abundance histogram

The suspicious among you might note the decline in the red lines (trimmed data) on the right side of the low- and high-abundance k-mer graphs, and period-5 bounciness in the same red lines. Are we just getting rid of long reads in the trimming process? And whence the bounciness? Well, first, most of the reads remain untrimmed, so we can't account for declining signal; and the periodicity is actually caused by a periodicity in the length distribution of the trimmed reads:

http://ivory.idyll.org/permanent/illumina-read-phenomenology/length-dist-filtered.png

(This graph is actually by # of k-mers at each position, not by read length; add 31 to the bottom axis for technical correctness.)

It's interesting to look at the periodicity in the trimmed reads. It gives us some insight into the Illumina quality score calculation code: there must be some Illumina error calculation system that is dependent on blocks of five bases, although why they suggest trimming at positions that are exact multiples of five is an interesting question.

The homopolymer issue

So, it seems that Illumina reads are actually more prone to homopolymer runs than anything else, AND the per-base error rate is low enough to not result in lots and lots of unique k-mers.

This has some interesting implications for assemblers. In particular, you definitely want to trim 3' ends of Illumina reads before assembly, because otherwise assemblers might choke on the homopolymer runs. You certainly don't want to be paying attention to the homopolymer sequence!

I'd really like to take a look at k-mer abundances across 454 reads. Soon, perhaps.

Conclusions

My main conclusion from all of this is: in our data set, unique k-mers are not predominantly caused by errors. Rather, they are simply low-abundance reads from a complex population. There's therefore no scientific point to removing reads containing low-abundance k-mers, or trimming reads at the first unique k-mer. It's simply a way to remove data so as to favor high-abundance organisms, which might be more assemblable. (Which is fine, but it should be explicitly understood!)

This conclusion is almost certainly not true of lower-complexity samples, like mRNAseq, ChIP-seq, or genome resequencing.

A secondary conclusion is that k-mer abundance analysis is an excellent way to look for patterns within shotgun reads. Since patterns shouldn't generally be there, anything you see is suspicious and should be investigated further.

--titus

Appendix: the scripts

A pointer to some data, and all the scripts and code needed to generate the above graphs, is here:

http://github.com/ctb/khmer/blob/master/doc/blog-posts.txt

posted at: 10:43 | path: /jul-10 | 7 comments

Tags: ,