Fri, 11 Mar 2011

My new data analysis pipeline code


First, I write a recipe file, 'metagenome.recipe', laying out my job description for, say, sequence trimming and assembly with Velvet:

fasta_file soil-data.fa

qc_filter min_length=50 remove_Ns=true

graph_filter min_length=400

velvet_assemble k=33 min_length=1000 scaffolding=True
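A minimal sketch of how a recipe like this might be read, assuming one step per non-blank line, with the step name first, then positional arguments, then `key=value` options. The `Step` record and `parse_recipe` function are hypothetical names invented for illustration; they are not part of any existing tool.

```python
from typing import NamedTuple


class Step(NamedTuple):
    name: str      # step name, e.g. "qc_filter"
    args: list     # positional arguments, e.g. ["soil-data.fa"]
    options: dict  # key=value options, kept as uninterpreted strings


def parse_recipe(text):
    """Parse recipe text into a list of Step records, one per non-blank line."""
    steps = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip the blank lines between steps
        name, rest = tokens[0], tokens[1:]
        args = [t for t in rest if "=" not in t]
        options = dict(t.split("=", 1) for t in rest if "=" in t)
        steps.append(Step(name, args, options))
    return steps


# The recipe from the post, inlined so the sketch is self-contained.
RECIPE = """\
fasta_file soil-data.fa

qc_filter min_length=50 remove_Ns=true

graph_filter min_length=400

velvet_assemble k=33 min_length=1000 scaffolding=True
"""

steps = parse_recipe(RECIPE)
for step in steps:
    print(step.name, step.args, step.options)
```

Option values are left as strings here, so deciding what `true` versus `True` means would be up to whichever component consumes the step.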

Then I specify machine parameters, e.g. 'bigmem.conf':

[defaults]
n_threads=8

[graph_filtering]
use_mem=32gb

[velvet]
needs_mem=64gb
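Since this file is plain INI syntax, Python's standard `configparser` can already read it. The sketch below assumes that values under `[defaults]` apply to every other section unless overridden; that merging rule, and the `load_settings` helper, are assumptions made for illustration rather than anything the post specifies.

```python
import configparser

# The machine configuration from the post, inlined for a self-contained example.
CONF = """\
[defaults]
n_threads=8

[graph_filtering]
use_mem=32gb

[velvet]
needs_mem=64gb
"""


def load_settings(text):
    """Return {section: settings}, with [defaults] values merged into each section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # "[defaults]" is an ordinary section here; configparser only treats
    # a section named "DEFAULT" specially, so the merge is done by hand.
    defaults = dict(parser["defaults"]) if parser.has_section("defaults") else {}
    settings = {}
    for section in parser.sections():
        if section == "defaults":
            continue
        merged = dict(defaults)
        merged.update(parser[section])  # section-specific values win
        settings[section] = merged
    return settings


settings = load_settings(CONF)
for section, values in settings.items():
    print(section, values)
```

All values come back as strings, so something like `64gb` would still need to be converted into a number of bytes before a scheduler could act on it.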

And finally, I run the pipeline:

% ak-run metagenome.recipe -c bigmem.conf

If I have cloud access (and who doesn't?) I can tell the pipeline to spin up and down nodes as needed:

% ak-aws-run metagenome.recipe -c bigmem.conf

(Bear in mind most of these tasks are multi-hour, if not multi-day, operations, so I'm not too worried about optimizing machine use and re-use.)

Hadoop jobs could be spawned underneath, depending on how each recipe component is actually implemented.

As for testing reproducibility of pipeline results, which is the short-term motivation here, I can store results for regression testing with later versions:

% ak-run metagenome.recipe -c bigmem.conf --save-endpoint=/some/path

and then compare:

% ak-run --check-endpoint=/some/path
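One way the save-and-compare step could work is to record a checksum for every output file and later report any file whose checksum differs. This is only a sketch under stated assumptions: the pipeline output must be byte-for-byte deterministic (which may not hold for every tool), and `digest_outputs`, `save_endpoint`, `check_endpoint`, and the JSON endpoint file are all hypothetical names and choices, not existing `ak-run` behavior.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def digest_outputs(output_dir):
    """Map each file under output_dir (by relative path) to its SHA-256 hex digest."""
    root = Path(output_dir)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def save_endpoint(output_dir, endpoint_file):
    """Record digests of the current outputs for later comparison."""
    Path(endpoint_file).write_text(json.dumps(digest_outputs(output_dir), indent=2))


def check_endpoint(output_dir, endpoint_file):
    """Return sorted names of files that were added, removed, or changed."""
    saved = json.loads(Path(endpoint_file).read_text())
    current = digest_outputs(output_dir)
    return sorted(
        name
        for name in saved.keys() | current.keys()
        if saved.get(name) != current.get(name)
    )


# Demonstration with a throwaway directory standing in for pipeline output.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "out"
    out.mkdir()
    (out / "contigs.fa").write_text(">c1\nACGT\n")
    endpoint = Path(tmp) / "endpoint.json"

    save_endpoint(out, endpoint)
    unchanged = check_endpoint(out, endpoint)  # nothing differs yet

    (out / "contigs.fa").write_text(">c1\nACGA\n")
    changed = check_endpoint(out, endpoint)    # the edited file is reported

print(unchanged, changed)
```

Exact checksums are the strictest possible comparison; results that legitimately vary between runs would need a looser, format-aware comparison instead.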

---

Now, does anyone know of a package (or packages) that actually does this, so that I/we don't have to write it?

See texttest and ruffus for some of my inspiration/interest.

--titus

posted at: 06:56 | path: /mar-11 | 3 comments



Comments:

Posted by Nick Loman at Fri Mar 11 07:35:59 2011:
I've been playing with EC2 the last few days and want something similar.

Packages that look potentially interesting include:

Fabric (Python lib for controlling bunches of servers either local or remote)

MIT StarCluster - roll a proper MPI-capable cluster on EC2

Kokki (config management) http://samuelks.com/kokki/

Posted by Greg Wilson at Fri Mar 11 10:01:54 2011:
Why not ruffus itself?

Posted by Titus Brown at Fri Mar 11 11:01:12 2011:
ruffus doesn't seem to handle any of the configuration, setup, or data management; isn't it primarily for dependency management? Dependency management isn't a big problem for me; we know the recipes and they don't change.

But maybe I'm missing something?
