I've been putting together a streaming API for khmer that would let us use generators to do sequence analysis, and I'd be interested in thoughts on how to do it in a good Pythonic way.
Some background: a while back, Alex Jironkin asked us for high level APIs, which turned into an issue on GitHub. More recently, we posted a paper on streaming and semi-streaming approaches to spectral sequence analysis, and so for my VanBUG talk I was inspired to put together a little API using generators.
The code looks like this, currently (see versioned permalink to khmer_api.py for full implementation):
graph = khmer.new_counting_hash(20, 1e7, 4) out_fp = open(os.path.basename(filename) + '.abundtrim', 'w') ## khmer scripts/trim-low-abund.py -V, using generators input_iter = screed.open(filename) input_iter = broken_paired_reader(input_iter) input_iter = clean_reads(input_iter) input_iter = streamtrim(input_iter, graph, 20, 3) output_reads(input_iter, out_fp)
Briefly, what this code does is
- create the core data structure we use
- open an output file
- open an input file
- create a generator that takes in a stream of data and groups records;
- create a generator that takes in records, does necessary preprocessing, and outputs them;
- create a generator that does our semi-streaming error trimming (from the semi-streaming preprint);
- outputs the reads to the given output fp.
The key bit is that this uses generators, so all of this is happening with full-on streaming. The one exception to this is the 'streamtrim' which has to cache a subset of the reads for processing more than once.
Interestingly, this is an implementation of the core functionality for the 'trim-low-abund.py' script that we will be releasing with the next version of khmer (the release after khmer 1.3 - not sure if it's 1.3.1 or 1.4 yet).
You can also replace the 'streamtrim' line with:
input_iter = diginorm(input_iter, graph, 20)
if you want to do digital normalization. That turns this into an implementation of the core functionality for the 'normalize-by-median.py' script that has been in khmer for a while.
Obviously these generator implementations are not yet production-ready, although they do give identical results to the current command line scripts.
The question I have, though, is what should the actual API look like?
The two basic options I've come up with are method chaining and UNIX-style pipes.
Method chaining might look like this:
read(in_fp). \ clean_reads(). \ streamtrim(graph, 20, 3). \ output(out_fp)
and piping would be
read(in_fp) | \ clean_reads() | \ streamtrim(graph, 20, 3) | \ output(out_fp)
...so I guess that's really pretty minor. I don't know which is more Pythonic, though, or what would permit more flexibility in terms of an underlying flexibility. Thoughts?
There are some other decisions to be made -- configuration and parameters.
For configuration, ideally we would be able to specify multiple input and output files to be paired with them, and/or set default parameters for multiple steps. Runtime reporting, error handling, and benchmarking should all be put into the mix. Should there be a simple object with hooks to handle all of this, or what? For example,
s = Streamer(...) # configure inputs and outputs s | clean_reads() | streamtrim(graph, 20, 3) | output()
where 's' could help by holding metadata or the like. I'm not sure - I've never done this before ;).
As for parameters, I personally like named parameters, so streamtrim above would better be streamtrim(graph, coverage=20, cutoff=3). But perhaps passing in a dictionary of parameters would be more flexible - should we support both? (Yes, I'm aware that in Python you can use ** - is there a preference?)
I'd love some examples of more mature APIs that have had the rough edges rubbed off 'em; this is really a UX question and I'm just not that well equipped to do this kind of design.
Thoughts? Help? Examples?
cheers, --titus
p.s. Down the road we're going to add correct_errors and eventually assemble to the steps, and perhaps also adapter trimming, quality trimming, and mapping. Won't that be fun? Imagine:
assembly = stream_input(filename1).trim().assemble()
to do your assembly... betcha we can stream it all, too.
Comments !