Wed, 20 Jul 2011

How much compute power do you need for next-gen sequencing?


During our next-gen course, a "student" (really a professor from Australia ;) asked me if I could provide some guidance on what computational infrastructure was necessary to handle next-gen sequencing data. While we used Amazon Web Services during the course, she was interested in finding out if they could use their local HPC, or some other dedicated compute center, to process their data.

So here are my estimates for what you would need if you were planning to buy an Illumina HiSeq machine and needed to do all the processing downstream of "I have sequence", i.e. from the basic quality-passed FASTQ sequences that most sequencing centers give you.

This is actually a pretty important question to nail down. Most data centers have no idea about the biology, and most biologists have no idea about the data centers. So biologists may end up asking vague questions and acting helpless, while the HPC folk buy lots of CPUs for them (i.e. the wrong thing; see below). Like many of the little frustrations in bioinformatics, it boils down to a communication problem!

The estimates below are based on my personal experience over the last year or two, and should be regarded as the minimum for effective functioning. Corrections or other opinions welcome in the comments, or write your own dang blog post and I'll link to it... There is really a lot of hand-waving on my part so I won't be offended!

Hard disk capacity

Just to archive the sequences from a single run, you'll need on the order of 100 GB (1-2 copies of the basic gzipped FASTQ data). Temporary disk space for working with the data will average on the order of 500 GB per data set (uncompressing, copying, moving, temporary data files, index data files, etc.). That space can be re-used once the basic analysis has been done. Assuming you generate 4-8 data sets each month with a single Illumina machine, I would guesstimate that about 1 TB a month of permanent archival space, and 4 TB of working disk space, would be good.

So, 100 GB disk per data set, permanent, and 500 GB working disk, per data set per month, for 8 data sets, + fudge factor: 1 TB of disk space a month, permanent, and 4 TB of working disk space. Cost? Negligible, $200/mo plus $2000 a year.
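For anyone who wants to plug in their own numbers, the disk arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the function name `disk_estimate` and the 1000 GB-per-TB convention are assumptions made for the example, and the per-data-set sizes are just the rough estimates from this section.

```python
GB_PER_TB = 1000  # assumption: decimal terabytes, good enough for guesstimates

def disk_estimate(data_sets_per_month, archive_gb=100, working_gb=500):
    """Return (archival TB added per month, working TB needed), before any fudge factor."""
    archive_tb = data_sets_per_month * archive_gb / GB_PER_TB
    working_tb = data_sets_per_month * working_gb / GB_PER_TB
    return archive_tb, working_tb

# 8 data sets a month is the upper end of the 4-8 range quoted above.
archive_tb, working_tb = disk_estimate(8)
print(f"archival: {archive_tb:.1f} TB/month, working: {working_tb:.1f} TB")
```

With 8 data sets a month this gives 0.8 TB of new archival data per month, which rounds up to the 1 TB figure once you add a fudge factor, and 4 TB of working space.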

Compute capacity

Oddly enough, CPU is rarely a huge concern (in my experience). Unless you're doing things like really, really large BLASTs against unassembled short reads (which is inadvisable on pretty much any planet, not just ours), you will probably have enough CPU on the medium-sized computers your data center already has. 4 to 8 cores, for 1 month per data set, are probably enough to do the basic mapping or assembly analyses, although of course more is better. Mapping to a reference genome/transcriptome is particularly parallelizable, so you can take advantage of as many cores as you have. Bottom line, if you have one reasonably sized dedicated computer per data set, you should be OK. I would suggest 8-16 GB of RAM minimum (but see next section) on a 2 to 4 CPU machine, with each CPU having 2 to 4 cores. You can easily buy this kind of thing for way less than $5000 -- it's what a lot of kids have at home for gaming these days, I think.

So, 1 medium sized computer (2-4 multicore CPUs, 8 GB of RAM) for 1 month, per data set, for 8 data sets: 8 computers. Cost: let's say $40,000/year.
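The same kind of sketch works for the compute estimate. The function name `compute_nodes` is made up for illustration, and the $5000 price is used as a deliberately pessimistic ceiling, since the text above says such machines cost "way less" than that.

```python
def compute_nodes(data_sets_per_month, machine_months_per_data_set=1):
    """Machines busy at any one time, if each data set ties up one machine."""
    # Data sets arriving per month times months of machine time each
    # gives the number of machines in use in steady state.
    return data_sets_per_month * machine_months_per_data_set

PRICE_CEILING = 5000  # assumed upper bound per machine, in dollars

nodes = compute_nodes(8)
cost = nodes * PRICE_CEILING
print(f"{nodes} medium computers, at most ${cost:,}")
```

If your analyses take longer than a month per data set, raise `machine_months_per_data_set` and the machine count scales with it.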

Memory/RAM

Memory is sort of the big bugaboo for me. I've been focusing on de novo assembly, which is a memory hog; I've just put in an order for a 500 GB machine, and I'm writing a 1 TB machine into my next grant. Mapping is much less memory intensive, requiring at most a few GB (although performance can always be improved by buying more memory, of course).

Many de novo assemblers scale with the number of unique k-mers in the data set, which means that for big, deeply sequenced data sets with lots of sequencing errors, you are going to need lots of memory. For bacterial genomes, you only need a few GB. For anything more challenging, you will need 100s of GBs. I would recommend a 512 GB machine, and strongly suggest a 1 TB machine (because who really wants to run only one analysis at a time, anyway?)
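To see why sequencing errors matter so much here, a toy simulation helps. The sketch below is not how any particular assembler works; it just counts distinct k-mers in simulated reads with and without substitution errors. The genome size, read length, coverage, error rate, and k are all arbitrary values picked for illustration.

```python
import random

def kmers(seq, k):
    """Yield every k-mer (substring of length k) in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def add_errors(read, rate, rng):
    """Return a copy of read where each base is substituted with probability rate."""
    out = []
    for base in read:
        if rng.random() < rate:
            # Always pick a different base, so every "error" really changes the read.
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)

def count_unique_kmers(reads, k):
    """Number of distinct k-mers across all reads."""
    seen = set()
    for read in reads:
        seen.update(kmers(read, k))
    return len(seen)

rng = random.Random(42)  # fixed seed so the run is repeatable
K = 21
READ_LEN = 100
genome = "".join(rng.choice("ACGT") for _ in range(2000))
starts = [rng.randrange(len(genome) - READ_LEN + 1) for _ in range(200)]
clean_reads = [genome[s:s + READ_LEN] for s in starts]
noisy_reads = [add_errors(r, 0.02, rng) for r in clean_reads]

clean = count_unique_kmers(clean_reads, K)
noisy = count_unique_kmers(noisy_reads, K)
print(f"unique {K}-mers: {clean} error-free, {noisy} with 2% errors")
```

Error-free reads can never contain more distinct k-mers than the genome itself does, no matter how deeply you sequence. Each substitution error, on the other hand, can create up to k brand-new k-mers, so the count (and the memory needed to hold them) keeps growing with sequencing depth.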

The only published machine estimate I've seen for assembly, BTW, is from the Broad Institute, in the ALLPATHS-LG paper, where they estimate that they can assemble a human genome de novo in about two weeks with under 512 GB of RAM.

If Amazon Web Services wants to be really, really friendly to me, they can start providing 512 GB RAM machines for rent... and then give them to me for free, hint hint.

Note that I haven't said much about CPU power. That's because by the time you get a machine that has 512 GB of RAM, it probably has enough CPU power to run the assembly just fine. Some assemblers can make use of multiple CPUs: ABySS does, and Velvet recently released an update supporting it. I assume ALLPATHS-LG, SOAPdenovo, and others are keeping pace.

But the overall problem is it only takes ~1 week to generate a data set that can require 2-4 weeks to assemble in 512 GB of RAM. And these machines are expensive: figure $20-40k for something robust, with decent CPU and memory performance. And you need one of these babies per de novo assembly project, dedicated for 1-3 months (because de novo assembly is slow and data intensive).

If you're an HPC admin sitting there, sweating, you might think you don't need to worry, because biologists will tell you that they're going to be resequencing lots of genomes, and doing lots of transcriptomes, etc., so de novo assembly isn't going to be required much. They'll tell you it'll mostly be mapping.

Unfortunately I think they're wrong. De novo assembly is going to be a big challenge going forward, as we sequence more and more odd genomes. I think humanity is going to sequence between 10**3 and 10**6 more novel genomes in the next 5 years than we have to date, and many of these genomes will have no reasonably close reference. (Don't believe me? Check out the Tree of Life from Norm Pace's Web site. Humans and corn are the two little teeny branches over on the upper left of the Eucarya branch; I believe we have fewer than 20 draft genomes from the non-plant/animal/fungi segments of the Eucarya branch, i.e. it's completely unsampled!)

In sum, at least 1 bigmem computer (512 GB RAM) available for dedicated use by biologists doing assembly, preferably more. Cost? $50k/year for one.

---

In summary:

Hard disk: $5000/yr (not counting RAID, NAS, permanent backups, etc.)

Compute: approx. 8 medium computers, $40,000/yr (not counting air conditioning)

Memory: 1x bigmem computer minimum, $50,000/yr (not counting air conditioning)

---

The numbers above feel OK to me, but may be a little bit light. If I were doing nothing other than running such a center, I'd want about double that (so figure $100-200k/year, hardware costs), which would result in a fair amount of overcapacity that others could use. But if I had to give the minimum budget, $100k/year for preliminary sequence analysis sounds about right.
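As a sanity check, here are the three summary line items added up in Python. The dictionary keys are just labels for this sketch; the dollar figures are the ones from the summary above.

```python
annual_costs = {
    "hard disk": 5_000,
    "compute (8 medium computers)": 40_000,
    "memory (1 bigmem computer)": 50_000,
}

minimum = sum(annual_costs.values())
comfortable = 2 * minimum  # "about double that"

print(f"minimum:     ${minimum:,}/yr")
print(f"comfortable: ${comfortable:,}/yr")
```

That comes to $95,000 a year as the floor, which is where the roughly $100k figure comes from, and $190,000 when doubled, at the top of the $100-200k range.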

So why am I wrong? :) Inquiring minds want to know!

--titus

posted at: 07:32 | path: /jul-11 | 5 comments



Fri, 08 Jul 2011

The Molgula again -- attending the Int'l Tunicate Meeting


I just flew back from Montreal, where I gave a talk at the International Tunicate Meeting on the Molgula project. This is a project wherein we are doing quantitative mRNA sequencing on two species of ascidians, or sea squirts -- specifically, M. oculata (tailed) and M. occulta (tailless) -- and their hybrids. We just got some initial results from the sequencing, and so we took the opportunity to go up to Montreal and present them.

In addition to my talk, I presented the same content in poster form, and Elijah Lowe presented some initial results on notochord genes, which are subtly wrong (my fault, long story) but still quite interesting.

Since the Molgulids are representatives of a not-very-well-studied branch of the ascidians, the Stolidobranchs, we are also making some of our sequence available. Specifically, you can go to our "data" page, here, and grab about 40,000 transcripts assembled from our M. oculata mRNA sequencing efforts. Enjoy!

--titus

posted at: 14:49 | path: /jul-11 | 0 comments



My lab is awesome (and what it is I actually do)


For anyone who actually wants to know what it is I do, I've updated my lab Web site, http://ged.msu.edu/, to be a bit more representative of what it is we're doing these days. (I wrote it over three years ago, so it had become increasingly dated.) In particular,

  • I put up my reappointment essay, in which I explain to Those Who Judge what my career goals are and why they should keep me around (short version: because I'm awesome! longer version: read the PDF);
  • I've posted a couple of grant proposals on the interests page;
  • our list of interests has been updated;

etc. Probably should update the list of students/lab members a bit, too... hrm.

I also added something special to the "joining the lab" bit so that I can delete the sillier "I want you to hire me" spam that I get from wannabe students.

--titus

posted at: 12:04 | path: /jul-11 | 0 comments
