<?xml version="1.0" encoding="iso-8859-1"?>
<!-- name="generator" content="pyblosxom/1.3.2 2/13/2006" -->
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN" "http://my.netscape.com/publish/formats/rss-0.91.dtd">

<rss version="0.91">
<channel>
<title>Daily Life in an Ivory Basement   </title>
<link>http://ivory.idyll.org/blog</link>
<description>This Space Intentionally Left Empty</description>
<language>en</language>
<item>
  <title>Terabase metagenomics -- the computational side</title>
  <link>http://ivory.idyll.org/blog/jul-10/computation-for-terabase-metagenomics.html</link>
  <description><![CDATA[
<div class="document">
<p>The <a class="reference" href="http://ivory.idyll.org/blog/jul-10/terabase-metagenomics">Terabase Metagenomics meeting</a> was good
fun, but I most valued the computational component (because that's
what I do).  Rachel Mackelprang and Rob Knight and I wrote down a list
of the computational issues involved in a petabase metagenomics
project, and that list will help direct my future research.  I'll try
to write up the list sooner rather than later, but in the meantime, I
wanted to write down some general thoughts on the subject of
processing large amounts of data for metagenomics.</p>
<p>In any petabase metagenomics sequencing project focused on probing the
unknown, we are going to have to face two basic computational issues:
assembly and sequence comparison.  Both of these issues require an
all-by-all analysis, or at least the approximation of one.  That is,
as our sampling grows, we're going to want to assemble contiguous DNA
sequences from larger and larger pools of short sequences; this
requires some form of comparison that will scale as the amount of
genomic novelty increases.  We're also going to want to take the
products of sequencing and assembly (raw reads and contigs) and
compare them or derivative products to all known genes.  By and large,
this problem boils down to systematic comparisons of points in a
sparsely but maybe thoroughly occupied n-dimensional space -- an
N-squared problem that has no exact parallelizable solution.</p>
<p>Currently, neither assembly nor gene comparison scales well.
Metagenomic assembly is currently the most disastrous: existing
short-read assemblers use <a class="reference" href="http://www.ncbi.nlm.nih.gov/pubmed/20211242">De Bruijn graphs</a> and require ginormous
amounts of memory (half-terabyte amounts or more) to run.  Gene
comparison is largely done using custom tools (for e.g. rRNA
comparison) or the more general BLAST and HMMER/profile tools; these
tools scale reasonably well for comparisons, but for post-processing
into high-quality phylogenies and gene families, there is no scalable
solution that I know of.</p>
<p>I firmly believe that the intractable-looking problem of assembly is
going to be among the first to fall.  This is because, at least in
theory, metagenomic assembly should be massively parallelizable:
metagenomes contain many <em>distinct</em> genomes that don't connect to each
other, so if you can filter or bin the raw sequences into assemblable
subsets, then you will be able to run the individual assemblies on as
many computers as you have original contigs.  This is a logical and
simple approach that simply awaits appropriate heuristic techniques
for division of the reads -- and we'll figure that out sooner rather
than later.  Plus, people much smarter than me are actively working on
better ways of parallelizing the basic De Bruijn graph approach, and
that will inevitably bear fruit.</p>
<p>The marker-gene comparison problem may already be solved.  For any
given gene family (like 16s) you can already do linear-time
comparisons with profile methods like HMMER. Using existing breakdowns
of phase space as a scaffold for new marker data as it arrives should
be possible.  At least, I can think of several ways that it will work,
and Rob Knight is quite optimistic.  (This is his area of expertise,
and since he's a Python programmer, I'm inclined to give him the
benefit of the doubt. ;)</p>
<p>The big problem facing in the future, I think, is the clustering and
discovery of gene families.  As we sequence deeper into the unknown,
we're going to have to do ever larger, more systematic, and more
sensitive comparisons of new data against all known genes, in all
types of ways -- DNA and protein sequence, RNA secondary structure,
and protein structure.  The goal will be to elucidate gene families
(which is an ill-defined concept), their phylogenetic structure, and
as much as possible about their function, across all of the domains of
life.  This is hard because we are explicitly looking for trace
connections -- distant homologies and similarities -- that we haven't
seen yet.  Unlike the marker-gene approach, we don't have good
heuristics for reducing the problem to one of interpolation, and this
means that the problem remains (at first glance) an O(N^2) problem.
Even if you use a prefiltering approach (like BLAST or HMMER 3) you
are still supra-linear, which is deadly when you're talking about
billions or trillions of sequences.</p>
<p>Memory usage is particularly problematic for gene comparison, with
most clustering techniques requiring scads of memory (N^2, in the
extreme) for comparison problems.  This may be something we can
reduce with heuristic approaches, but biology has a way of messing
with us so we can't rely on heuristics up front - <em>especially</em> in
large discovery-based projects.</p>
<p>There's an additional complication, though: we really want to reduce
the problem to one of online processing.  As we gather petascale data,
we don't want to do all-by-all comparisons of the new data against the
old data; rather, we want to fit as much of the new data into an
existing structure of classifications, and then sideline the remaining
data for more complex exploration.  This results in an O(N)
classification for the &quot;easy&quot; or known stuff, while dealing with the
hard stuff remains Bad (supra-linear, and perhaps exponential).  For
example, with respect to assembly, we expect to have assemblies of
most of the more abundant soil genomes within the next 2-3 years; we
should be able to pre-screen new data by seeing if it matches to an
existing assembly, and if it does, simply count it (or discard it).
Only the data that matches to no prior assembly would be kept.  A
similar approach can be used for 16s or marker-gene data.  I can think
of ways to approach it for protein-coding genes and non-coding RNA as
well, but I'm very afraid of missing some or many of the unknown
unknowns in our data sets by being too liberal with classification.
On the flip side, being too conservative will inevitably lead to
scaling problems, so there's no way to win.</p>
<p>As with all Big Data problems, the sheer scale of these issues causes
all sorts of annoying data processing and storage problems.  It's
become a ruling principle of my tool development that algorithms must
be constant memory: hence things like our <a class="reference" href="http://ivory.idyll.org/blog/jul-10/kmer-filtering">k-mer filtering approach</a>, which trades
accuracy guarantees for pragmatic issues like the ability to run the
software in the first place.</p>
<p>It's worth noting that Moore's Law will not save us.  Neither memory
nor CPU power is growing as fast as sequencing capacity -- some
estimate the exponent on sequencing capacity as being &gt; 5.  (Thank
goodness, disk space will probably keep up!)  We're going to get a
huge increase in the number of cores, but that doesn't really help
with Big Data unless you have robust pleasantly parallel algorithms
that don't require much memory -- which we don't, and in general
probably won't, have.</p>
<p>So, my (our) outlook for the next few years will be: build tools that
scale way better than our existing tools, in all directions (time,
memory, parallelizability); yield guaranteed behavior with respect to
memory consumption in particular; and are optimally sensitive and
specific for the large class of unknowns (new proteins, ncRNA, and
structures) that await us in the world of sequence.</p>
<p>Sounds like fun!</p>
<p>By the way, for those of you who like to think about problem equivalence,
the protein comparison problem ultimately boils down to comparing protein
structures.  We, umm, can't solve that one, either...</p>
<p>--titus</p>
</div>

]]></description>
</item>

<item>
  <title>Terabase metagenomics - meeting and outlook</title>
  <link>http://ivory.idyll.org/blog/jul-10/terabase-metagenomics.html</link>
  <description><![CDATA[
<div class="document">
<p>I'm on my way back from the <a class="reference" href="http://icis.anl.gov/programs/summer-1">Terabase Metagenomics</a> meeting in Snowbird, UT,
and I'm buzzing with ideas about how to move forward in metagenomics
and bioinformatics research.  Metagenomics, the use of genomics
approaches to study microbial communities, has been opening up as
sequencing drops in price.  With sequencing becoming ever more
ridiculously deep and inexpensive -- we're looking at a power of 2-5
increase in depth every year for the next 5 years -- the questions
have increasingly become intertwined with (surprise!) computational
science problems and specifically bioinformatics.  As we move towards
multiple individual terabase (and collectively <em>petabase</em>) sequencing
data sets in the next two years, life is going to become more
complicated for metagenomicists and the bioinformaticians who aspire
to help them.  (That's me.)</p>
<p>The meeting was focused on the promise of terabase metagenomics for a
&quot;whole earth&quot; microbial ecology sampling projects, and we spent quite
a lot of time trying to figure out the size of unknown unknowns in
metagenomics.  There was a great mix of people at the meeting -- both
&quot;pure&quot; microbial ecologists who regarded computers with suspicion and
computer scientists who didn't understand how DNA became protein, as
well as the entire range between those two poles. We talked about
questions such as: &quot;how deep <em>do</em> we need to sequence in order to get
a real picture of microbial diversity?&quot; (a: very) and &quot;how do we
reconcile genomic and transcriptomic data sets?&quot; (a: unknown) and &quot;how
can we scale up 1000-fold when our existing algorithms are all
N-squared?&quot; (a: with difficulty.) The discussion was at a reasonably
high level of sophistication: pretty much everyone in the room had
&quot;seen the elephant&quot;, and representatives of the major analysis
pipelines were also present.</p>
<p>We didn't come to any firm conclusions about the biology, but we did
sketch the outlines of a global sampling project to look at global
microbial diversity, together with the deeper sequencing of specific
locales to plumb the depths of a few communities.  Coming up with a
computational pipeline to deal with data and metadata on this scale
-- storage, analysis, integration, presentation, and querying -- is
truly challenging and should be fun.</p>
<p>For me, one recurring theme in the meeting was entanglement with
management and standardization issues in big projects.  <em>Everyone</em> was
bitching about these on their existing projects, and I'm trying to
avoid them. I have little or no interest in telling other people how
to do their job; <em>absolutely</em> no interest in being told how to do <em>my</em>
job; and a general lack of patience with standardization issues,
however important and critical they may be.  (In some ways, the
Benevolent Dictator For Life approach to decision making used by the
Python community seems optimal for medium-sized projects, although of
course it takes a special person to pull it off.  I acknowledge that
decisions need to be made, but have no real interest in doing anything
more myself than exploring the technical issues that could inform the
decisions.  Hmm, I wonder if the PEP process might be a good way to do
a formal standards process in metagenomics?)  I don't if there's any
way for me to avoid this stuff in the future, but it's the main reason
that I'm wary of joining big projects.</p>
<p>A few interesting references came up during the meeting.  For example,
I hadn't seen <a class="reference" href="http://en.wikipedia.org/wiki/George_H._Heilmeier#Heilmeier.27s_Catechism">the Heilmeier Catechism</a>
before; I think it's a good idea for both scientific and open source
projects to look at this before they do too much work!  I was also
entertained by <a class="reference" href="http://phylogenomics.blogspot.com/2010/07/what-is-not-getting-any-love-at-this.html">Jonathan Eisen's</a>
report on things that the workshop participants disliked: for example,
&quot;Bureaucracy (and how it impedes science).&quot;</p>
<p>All in all, I think this meeting was fantastic: very small, very
focused, and with plenty of time for discussion and socializing.  I'm
happy the organizers invited me, so thanks to them!</p>
<p>--titus</p>
</div>

]]></description>
</item>

<item>
  <title>Casual mentions</title>
  <link>http://ivory.idyll.org/blog/jul-10/casual-mentions.html</link>
  <description><![CDATA[
<div class="document">
<p>(Everyone needs an &quot;I'm wonderful&quot; wall, right?  Here's mine!)</p>
<p>My rather <a class="reference" href="http://ivory.idyll.org/blog/may-10/data-management.html">snarky post on data management</a> made it
into <a class="reference" href="http://www.c2cbio.com/2010/07/07/episode-34-technology-gremlins/">Episode 34 of the Coast to Coast Bio Podcast</a>.</p>
<p>Turns out I'm now a <a class="reference" href="http://www.xconomy.com/national/2010/07/06/amazon-with-rented-server-space-in-the-cloud-sees-opportunity-in-genomic-data-overload/">leading evangelist for cloud computing in genomic
research</a>, which I
guess is true if evangelism is relative!  My evangelism, plus 2 bucks, may
get me a cup of coffee <em>and</em> a grant...</p>
<p>The <a class="reference" href="http://ivory.idyll.org/blog/jul-10/kmer-filtering">long and detailed k-mer filtering post</a> elicited a <a class="reference" href="http://twitter.com/pathogenomenick/statuses/17957407197">great comment</a> from Nick Loman:</p>
<blockquote>
with blogposts like this, who needs peer-review journals?</blockquote>
<p>Unfortunately my blog posts don't count for promotion &amp; tenure, so
that would be me needing peer-review journals ;).  I also am sadly
amused that one of the most content-ful blog posts I've written
recently got very little notice.  It's frustrating that people seem to
prefer to kibitz about open access/open source more than actually
blogging in detail about what they're doing at a technical level.
We need more of that.</p>
<p>Greg Wilson just posted an <a class="reference" href="http://software-carpentry.org/blog/2010/07/interview-with-michigan-states-titus-brown/">interview</a>
with me about why Michigan State, GEDD, and I are all interested in
Sotware Carpentry.  I told him to stop bothering me and just read my
several blog posts on the subject, but he out-persisted me.  Grumble.</p>
<p>And finally, next week I'm going to start a blog-off with <a class="reference" href="http://phylogenomics.blogspot.com/">Jonathan
Eisen</a> as we'll both be at a
<a class="reference" href="http://icis.anl.gov/programs/summer-1">Terabase Metagenomics</a>
meeting.  (He doesn't know about the blog-off yet, so don't tell him,
ok?)  Points will be awarded on --</p>
<blockquote>
<ul class="simple">
<li>first blogpost on the worst -omics word comign from the meeting (we
re-invented &quot;predictomics&quot; in my lab last Friday!)</li>
<li>snarkiest comment</li>
<li>quickest tweet off the mark (&quot;just 10 seconds ago the speaker said this!&quot;)</li>
<li>least scientifically relevant tweet</li>
</ul>
</blockquote>
<p>I'm hoping to gin up a solid lead before Jonathan realizes we're competing.
Shhh.</p>
<p>--titus</p>
</div>

]]></description>
</item>

<item>
  <title>A memory efficient way to remove low-abundance k-mers from large (metagenomic?) DNA data sets</title>
  <link>http://ivory.idyll.org/blog/jul-10/kmer-filtering.html</link>
  <description><![CDATA[
<div class="document">
<p>I've spent the last few weeks working on a simple solution to a
challenging problem in DNA sequence assembly, and I think we've got a
nice simple theoretical solution with an actual implementation.  I'd
be interested in comments!</p>
<div class="section">
<h1><a id="introduction" name="introduction">Introduction</a></h1>
<p>Briefly, the algorithmic challenge is this:</p>
<p>We have a bunch of DNA sequences in (let's say) FASTA format,</p>
<pre class="literal-block">
&gt;850:2:1:1145:4509/1
CCGAGTCGTTTCGGAGGGACCCCGCCATCATACTCGGGGAATTCATCTGAAAGCATGATCATAGAGTCACCGAGCA
&gt;850:2:1:1145:4509/2
AGCCAAGAGCACCCCAGCTATTCCTCCCGGACTTCATAACGTAACGGCCTACCTCGCCATTAAGACTGCGGCGGAG
&gt;850:2:1:1145:14575/1
GACGCAAAAGTAATCGTTTTTTGGGAACATGTTTTCATCCTGATCATGTTCCTGCCGATTCTGATCTCGCGACTGG
&gt;850:2:1:1145:14575/2
TAACGGGCTGAATGTTCAGGACCACGTTTACTACCGTCGGGTTGCCATACTTCAACGCCAGCGTGAAAAAGAACGT
&gt;850:2:1:1145:2219/1
GAAGACAGAGTGGTCGAAAGCTATCAGACGCGATGCCTAACGGCATTTTGTAGGGCGTCTGCGTCAGACGCCAACC
&gt;850:2:1:1145:2219/2
GAAGGCGGGTAATGTCCGACAAACATTTGACGTCAAAGCCGGCTTTTTTGTAGTGGGTTTGACTCTTTCGCTTCCG
&gt;850:2:1:1145:5660/1
GATGGCGTCGTCCGGGTGCCCTGGTGGAAGTTGCGGGGATGCGGATTCATCCGGGACGCGCAGACGCAGGCGGTGG
</pre>
<p>and we want to pre-filter these sequences so that only sequences
containing high-abundance DNA words of length k (&quot;k-mers&quot;),
remain. For example, given a set of DNA sequences, we might want to
remove any sequence that does not contain a k-mer present at least 5
times in the entire data set.</p>
<p>This is very straightforward to do for small numbers of sequences, or
for small k.  Unfortunately, we are increasingly confronted by data sets
containing 250 million sequences (or more), and we would like to be
able to do this for large k (k &gt; 20, at least).</p>
<p>You can break the problem down into two basic steps: first,
counting k-mers; and second, filtering sequences based on those k-mer
counts.  It's not immediately obvious how you would parallelize this
task: the counting should be very quick (basically, it's I/O
bound) while the filtering is going to rely on wide-reaching lookups
that aren't localized to any subset of k-mer space.</p>
<p>tl; dr? we've developed a way to do this for arbitrary k, in linear
time and constant memory, efficiently utilizing as many computers as
you have available.  It's open source and works today, but, uhh, could
use some documentation...</p>
</div>
<div class="section">
<h1><a id="digression-the-bioinformatics-motivation" name="digression-the-bioinformatics-motivation">Digression: the bioinformatics motivation</a></h1>
<p>(You can skip this if you're not interested in the biological motivation ;)</p>
<p>What we <em>really</em> want to do with this kind of data is assemble it,
using a <a class="reference" href="http://en.wikipedia.org/wiki/De_Bruijn_graph">De Bruijn graph approach</a> a la <a class="reference" href="http://genome.cshlp.org/content/18/5/821.full">Velvet</a>, <a class="reference" href="http://www.bcgsc.ca/platform/bioinfo/software/abyss">ABySS</a>, or
<a class="reference" href="http://soap.genomics.org.cn/soapdenovo.html">SOAPdenovo</a>.
However, De Bruijn graphs all rely on building a graph of overlapping
k-mers in memory, which means that their memory usage scales as a
function of the number of unique k-mers.  This is good in general, but
Bad in certain circumstances -- in particular, whenever the data set
you're trying to assemble has a lot of genomic novelty.  (See <a class="reference" href="http://www.ncbi.nlm.nih.gov/pubmed/20211242">this
fantastic review</a> and
my <a class="reference" href="http://ged.msu.edu/angus/files/lecture5-assembly.pdf">assembly lecture</a> for some
background here.)</p>
<p>One particular circumstance in which De Bruijn graph-based assemblers
flail is in <a class="reference" href="http://en.wikipedia.org/wiki/Metagenomics">metagenomics</a>, the isolation and
sequencing of genetic material from &quot;the wild&quot;, e.g.  soil or the
human gut.  This is largely because the diversity of bacteria present
in soil is so huge (although the relatively high error rate of
next-gen platforms doesn't help).</p>
<p>Prefiltering can help reduce this memory usage by removing erroneous
sequences as well as not-so-useful sequences.  First, any sequence
arising as the result of a sequencing error is going to be extremely
rare, and abundance filtering will remove those.  Second, genuinely
&quot;rare&quot; (shallowly-sequenced) sequences will generally not contribute
much to the assembly, and so removing them is a good heuristic for
reducing memory usage.</p>
<p>We are currently playing with dozens (probably hundreds, soon) of gigabytes
of metagenomic data, and we really need to do this prefiltering in order
to have a chance at getting a useful assembly out of it.</p>
<p>It's worth noting that this is in no way an original thought: in
particular, the Beijing Genome Institute (BGI) did this kind of
prefiltering in their landmark Human Microbiome paper (<a class="reference" href="http://www.nature.com/nature/journal/v464/n7285/full/nature08821.html">ref</a>).
We want to do it, too, and the BGI wasn't obliging enough to give
us source code (booooooo hisssssssssssssss).</p>
</div>
<div class="section">
<h1><a id="existing-approaches" name="existing-approaches">Existing approaches</a></h1>
<p>Existing approaches are inadequate to our needs, as far as we can tell
from a literature survey and private conversations.  Everyone seems
to rely on big-RAM machines, which are nice if you have them, but shouldn't
be necessary.</p>
<p>There are two basic approaches.</p>
<p>First, you can build a large table in memory, and then map k-mers into
it.  This involves writing a simple hash function that converts DNA
words into numbers.  However, this approach scales poorly with k: for
example, there are 4**k unique DNA sequences of length k (or roughly
(4**k) / 2 + (4**(k/2))/2, considering reverse complements).  So the table
for k=17 needs 4**17 entries -- that's 17 gb at 1 byte per counter,
which stretches the memory of cheaply available commodity hardware.
And we want to use bigger k than 17 -- most assemblers will be more
effective for longer k, because it's more specific.  (We've been using
k values between 30 and 70 for our assemblies, and we'd like to filter
on the same k.)</p>
<p>Second, you can observe that k-mer space (for sufficiently large k) is
likely to be quite sparsely occupied -- after all, there's only so
many actual 30-mers present in a 100gb data set! So, you can do
something clever like use a tree representation of k-mers and then
attach counters to the end nodes of the tree (see, for example,
<a class="reference" href="http://www.ncbi.nlm.nih.gov/pubmed/18976482">tallymer</a>.  The
problem here that you need to use pointers to connect nodes in the
tree, which means that while the tree size is going to scale with the
amount of novel k-mers -- ok! -- it's going to have a big constant in
front of it -- bad!.  In our experience this has been prohibitive for
real metagenomic data filtering.</p>
<p>These seem to be the two dominant approaches, although if you don't
need to actually <em>count</em> the k-mers but only assess presence or
absence, you can use something like a <a class="reference" href="http://en.wikipedia.org/wiki/Bloom_filter">Bloom filter</a> -- for example, see
<a class="reference" href="http://bioinformatics.oxfordjournals.org/cgi/content/full/26/13/1595">Classification of DNA sequences using a Bloom filter</a>,
which uses Bloom filters to look for novel sequences (the exact
opposite of what we want to do here!).  References to other approaches
welcome...</p>
<p>Note that you really, really, really want to avoid disk access by
keeping everything in memory.  These are ginormous data sets and we
would like to be able to quickly explore different parameters and
methods of sequence selection.  So we want to come up with a good counting
scheme for k-mers that involves relatively small amounts of memory and
as little disk access as possible.</p>
<p>I think this is a really fun kind of problem, actually.  The more
clever you are, the more likely you are to come up with a completely
inappropriate data structure, given the amount of data and the basic
algorithmic requirements.  It demands KISS!  Read on...</p>
</div>
<div class="section">
<h1><a id="an-approximate-approach-to-counting" name="an-approximate-approach-to-counting">An approximate approach to counting</a></h1>
<p>So, we've come up with a solution that scales with the amount of
genomic novelty, and efficiently uses your available memory.  It can
also make effective use of multiple computers (although not multiple
processors).  What is this magic approach?</p>
<p><a class="reference" href="http://en.wikipedia.org/wiki/Hash_table">Hash tables</a>.  Yep.  Map
k-mers into a fixed-size table (presumably one about as big as your
available memory), and have the table entries be counters for k-mer
abundance.</p>
<p>This is an obvious solution, except for one problem: collisions.  The
big problem with hash tables is that you're going to have collisions,
wherein multiple k-mers are mapped into a single counting bin.  Oh
noes!  The traditional way to deal with this is to keep track of each
k-mer individually -- e.g. when there's a collision, use some sort of
data structure (like a linked list) to track the actual k-mers that
mapped to a particular bin.  But now you're stuck with using gobs of
memory to keep these structures around, because each collision
requires a new pointer of some sort.  It may be possible to get around
this efficiently, but I'm not smart enough to figure out how.</p>
<p>Instead of becoming smarter, I reconfigured my brain to look at the problem
differently.  (Think Different?)</p>
<p>The big realization here is that collisions <strong>may not matter</strong> all
that much.  Consider the situation where you're filtering on a maximum
abundance of 5 -- that is, you want at least one k-mer in each
sequence to be present five or more times across the data set.  Well,
if you look at the hash bin for a specific k-mer and see the number
<strong>4</strong>, you immediately know that whether or not there are any
collisions, that particular k-mer isn't present five or more times,
and can be discarded!  That is, the count for a particular hash bin is
the sum of the (non-negative) individual counts for k-mers mapping to
that bin, and hence that sum is an upper bound on any individual
k-mer's abundance.</p>
<img alt="http://ivory.idyll.org/permanent/kmer-hashtable.png" src="http://ivory.idyll.org/permanent/kmer-hashtable.png" style="width: 20%;" />
<p>The tradeoff is false positives: as k-mer space fills up, the hash
table is going to be hit by more and more collisions.  In turn, more
of the k-mers are going to be falsely reported as being over the
minimum abundance, and more of the sequences will be kept.  You can
deal with this partly by doing iterative filtering with different
prime hash table sizes, but that will be of limited utility if you
have a really saturated hash table.</p>
<p>Note that the counting and filtering is still O(N), while the false
positives grow as a function of k-mer space occupancy -- which is to
say that the more diversity you have in your data, the more trouble
you're in.  That's going to be a problem no matter the approach, however.</p>
<p>You can see a simple example of approximate and iterative filtering
here, for k=15 (a k-mer space of approximately 1 billion) and hash
table sizes ranging from 50m to 100m.  The curves all approach the
correct final number of reads (which we can calculate exactly, for
this data set) but some take longer than others.  (The code for this
is <a class="reference" href="http://github.com/ctb/khmer/blob/master/scripts/ctb-iterative-bench-2.py">here</a>.)</p>
<img alt="http://ivory.idyll.org/permanent/kmer-filtering-iterative.png" src="http://ivory.idyll.org/permanent/kmer-filtering-iterative.png" style="width: 50%;" />
</div>
<div class="section">
<h1><a id="making-use-of-multiple-computers" name="making-use-of-multiple-computers">Making use of multiple computers</a></h1>
<p>While I was working out the details of the (really simple) approximate
counting approach, I was bugged by my inability to make effective use
of all the juicy computers to which I have access.  I don't have many
<em>big</em> computers, but I do have lots of medium sized computers with
memory in the ~10-20 gb range.  How can I use them?</p>
<p>This is not a pleasantly parallel problem, for at least two reasons.
First, it's I/O bound -- reading the DNA sequences in takes more time
than breaking them down into k-mers and counting them.  And since it's
really memory bound -- you want to use the biggest hash table possible
to minimize collisions
-- it doesn't seem like using multiple processors on a single machine
will help.  Second, the hash table is going to be too big to
effectively share between computers: 10-20 gb of data per computer is
too much to send over the network.  So what do we do?</p>
<p>I was busy explaining to <a class="reference" href="http://en.wikipedia.org/wiki/Charles_Ofria">a colleague</a> that this was
impossible -- always a useful way for me to solve problems ;) -- when
it hit me that you could use <em>multiple chassis</em> (RAM + CPU + disk) to
decrease the false positive rate with only a small amount of
communication overhead.  Basically, my approach is to partition k-mer
space into Z subsets (one for each chassis) and have each computer count
only the k-mers that fall into their partition.  Then, after the
counting stage, each chassis records a max k-mer abundance per
partition per sequence, and communicates <em>that</em> to a central
node.  This central node in turn calculates the max k-mer abundance
across all partitions.</p>
<p>The partitioning trick is a more general form of the specific 'prefix'
approach -- that is, separately count and get max abundances/sequence
for all k-mers starting with 'A', then 'C', then 'G', and then 'T'.
For each sequence you will then have four values (the max
abundance/sequence for k-mers start with 'A', 'C', 'G', and 'T'),
which can be cheaply stored on disk or in memory.  Now you can do a
single-pass integration and figure out what sequences to keep.</p>
<p>This approach effectively multiplies your working
memory by a factor of Z, decreasing your false positive rate
equivalently - no mean feat!</p>
<img alt="http://ivory.idyll.org/permanent/kmer-hashtable-par.png" src="http://ivory.idyll.org/permanent/kmer-hashtable-par.png" style="width: 20%;" />
<img alt="http://ivory.idyll.org/permanent/kmer-filter-process-2.png" src="http://ivory.idyll.org/permanent/kmer-filter-process-2.png" style="width: 40%;" />
<p>The communication load is significant but not prohibitive: replicate a
read-only sequence data set across Z computers, and then communicate
max values (1 byte for each sequence) back -- 50-500 mb of data per
filtering round.  The result of each filtering round can be returned
to each node as a readmask against the already-replicated initial
sequence set, with one bit per sequence for ignore or process.  You can
even do it on a single computer, with a local hard drive, in multiple
iterations.</p>
<p>You can see a simple in-memory implementation of this approach <a class="reference" href="http://github.com/ctb/khmer/blob/master/python/khmer/split.py">here</a>,
and some tests <a class="reference" href="http://github.com/ctb/khmer/blob/master/python/test_split.py">here</a>.
I've implemented this using readmask tables and min/max tables (serializable
data structures) more generally, too; see &quot;the actual code&quot;, below.</p>
</div>
<div class="section">
<h1><a id="similar-approaches" name="similar-approaches">Similar approaches</a></h1>
<p>By allowing for false positives, I've effectively turned the hash
table into a probabilistic data structure.  The closest analog I've
seen is the <a class="reference" href="http://en.wikipedia.org/wiki/Bloom_filter">Bloom filter</a> which records
presence/absence information using multiple hash functions for
arbitrary k.  The hash approach outlined above devolves into a
maximally efficient Bloom filter for fixed k when only
presence/absence information is recorded.</p>
</div>
<div class="section">
<h1><a id="the-actual-code" name="the-actual-code">The actual code</a></h1>
<p>Theory and practice are the same, in theory.  In practice, of course,
they differ.  A whole host of minor interface and implementation
design decisions needed to be made.  The result can be seen at the
'khmer' project, here: <a class="reference" href="http://github.com/ctb/khmer">http://github.com/ctb/khmer</a>.  It's slim on
documentation, but there are some example scripts and a reasonable
amount of tests.  It requires nothing but Python 2.6 and a compiler;
nose if you want to run the tests.</p>
<p>The implementation is in C++ with a Python wrapper, much like the
paircomp and motility software packages.</p>
<p>It will filter 1m 70-mer sequences in about 45 seconds, and 80 million sequences
in less than an hour, on a 3 GHz Xeon with 16 gbs of RAM.</p>
<p>Right now it's limited to k &lt;= 32, because we encode each DNA k-mer in
a 64-bit 'long long'.</p>
<p>You can see an exact filtering script here:
<a class="reference" href="http://github.com/ctb/khmer/blob/master/scripts/filter-exact.py">http://github.com/ctb/khmer/blob/master/scripts/filter-exact.py</a> .  By
varying the hash table size (second arg to new_hashtable) you can turn
it into an <em>inexact</em> filtering script quite easily.</p>
<p>Drop me a note if you want help using the code, or a specific example.
We're planning to write documentation, doctests, examples, robust
command line scripts, etc. prior to publication, but for now we're
primarily trying to use it.</p>
</div>
<div class="section">
<h1><a id="other-uses" name="other-uses">Other uses</a></h1>
<p>It has not escaped our notice that you can use this approach for a bunch of
other k-mer based problems, such as repeat discovery and calculating abundance
distributions... but right now we're focused on actually using it for
filtering metagenomic data sets prior to assembly.</p>
</div>
<div class="section">
<h1><a id="acks" name="acks">Acks</a></h1>
<p>I talked a fair bit with Prof. Rich Enbody about this approach, and he
did a wonderful job of double-checking my intuition.  Jason Pell and
Qingpeng Zhang are co-authors on this project; in particular, Jason
helped write the C++ code, and Qingpeng has been working with k-mers
in general and tallymer in specific on some other projects, and
provided a lot of background help.</p>
</div>
<div class="section">
<h1><a id="conclusions" name="conclusions">Conclusions</a></h1>
<p>We've taken a previously hard problem and made it practically
solvable: we can filter ~88m sequences in a few hours on a
single-processor computer with only 16gb of RAM.  This seems useful.</p>
<p>Our main challenge now is to come up with a better hashing function.
Our current hash function is not uniform, even for a
uniform distribution of k-mers, because of the way it handles reverse
complements.</p>
<p>The approach scales reasonably nicely.  Doubling the amount of data
doubles the compute time.  However, if you have double the novelty,
you'll need to do double the partitions to keep the same false
positive rate, in which case you quadruple the compute time.  So it's
O(N^2) for the worst case (unending novelty) but should be something
better for real-world cases.  That's what we'll be looking at over
the next few months.</p>
<p>I haven't done enough background reading to figure out if our approach
is particularly novel, although in the space of bioinformatics it seems
to be reasonably so.  That's less important than actually solving our
problem, but it would be nice to punch the &quot;publication&quot; ticket if possible.
We're thinking of writing it up and sending it to BMC Bioinformatics,
although suggestions are welcome.</p>
<p>It would be particularly ironic if the first publication from my lab
was this computer science-y, given that I have no degrees in CS and
am in the CS department by kind of a fluke of the hiring process ;).</p>
</div>
</div>

]]></description>
</item>

<item>
  <title>Teaching scientists how to use computers - hub &amp; spokes</title>
  <link>http://ivory.idyll.org/blog/jul-10/swc-hub-spokes.html</link>
  <description><![CDATA[
<div class="document">
<p>After my recent <a class="reference" href="http://ivory.idyll.org/blog/jun-10/ngs-course-postmortem">next-gen sequencing course</a>, which
was supposed to tie into the whole <a class="reference" href="http://software-carpentry.org/">software carpentry (SWC) effort</a> but didn't really succeed in doing
so the first time through, I started thinking about the Right Way to
tie in the SWC material.  In particular, how do you both motivate
scientists to look at the SWC material, and (re)direct people to the
appropriate places?</p>
<p>It's not clear that a Plan is in place.  Greg Wilson seems to assume
that scientists will find at least some of the material immediately
obviously usable, but I think he's targetted at a more sophisticated
population of users -- physicists and the like.  My experience with
bioinformaticians, however, is that they either come from straight
biology backgrounds (with little or no computational background and
rather limited on-the-job training), straight computation backgrounds
(with very little biology), or physics (gonzo programming skills, but
no biology).  The latter fit neatly into the SWC fold, but they (we ;)
are rare in biology.  I think computer scientists and biologists are
going to need guidance to dive into SWC at an early enough time for it
to be the most rewarding.</p>
<p>So, what's a good model for SWC to guide scientists from multiple
disciplines into the appropriate material?  It's obviously not going
to be possible to have Greg et al. tailor the SWC material to individual
subgroups -- he doesn't know much (any ;) biology, for example.  I don't
have the time, patience, or skillset to integrate my next-gen notes
into his SWC material, either.  So, instead, I propose the hub &amp; spokes
model!</p>
<img alt="http://ivory.idyll.org/permanent/hub-spokes.png" src="http://ivory.idyll.org/permanent/hub-spokes.png" />
<p>Here, the &quot;hub&quot; is the SWC material, and the spokes are all of the
individual disciplines.</p>
<p>Basically, the idea is that individual sites (like my own ANGUS site
on next-gen sequencing, <a class="reference" href="http://ged.msu.edu/angus/">http://ged.msu.edu/angus/</a>) will develop their
own field-specific content, and then link from that content into the
SWC notes.  This way the experts with feet in both fields can link
appropriately, and Greg only has to worry about making the central
content general -- which he's already doing quite well, I think.  Yes,
It's more work than asking Greg to do it, but frankly I'm going to be
happy with a kick-ass central SWC site to which I can link -- right
now it's dismayingly challenging to teach students why this stuff
matters and how to learn it.</p>
<p>From the psychosocial perspective, it's a great fit.  Students can get
hands on tutorials on how to do X, Y, and Z in their own field -- and
then connect into the SWC material to learn the background, or
additional computational techniques in support of it.  Motivation first!</p>
<p>What do we need SWC to do to support this?  Not much -- basically, the
central SWC notes need to be stable enough (with permalinks) that I
can link into them from my own site(s) and not have to worry about the
links becoming broken or (worse) silently migrating in topic.  There
are other solutions (wholesale incorporation of SWC into my own notes,
for example) but I think the permalink idea is the most
straightforward.  Oh, and we should have a Greg-gets-hit-by-a-bus plan,
too; at some point he's going to move on from SWC (perhaps when his
lovely wife decide she's had enough and he needs to stop obsessing over
it, or perhaps under more dire circumstances ;( and it would be good to
know who holds the domain and site keys.</p>
<p>Thoughts?  Comments?</p>
<p>--titus</p>
</div>

]]></description>
</item>

</channel>
</rss>
