Request for suggestions on a generator API for khmer

I've been putting together a streaming API for khmer that would let us use generators to do sequence analysis, and I'd be interested in thoughts on how to do it in a good Pythonic way.

Some background: a while back, Alex Jironkin asked us for high level APIs, which turned into an issue on GitHub. More recently, we posted a paper on streaming and semi-streaming approaches to spectral sequence analysis, and so for my VanBUG talk I was inspired to put together a little API using generators.

The code looks like this, currently (see versioned permalink to for full implementation):

graph = khmer.new_counting_hash(20, 1e7, 4)
out_fp = open(os.path.basename(filename) + '.abundtrim', 'w')

## khmer scripts/ -V, using generators
input_iter =
input_iter = broken_paired_reader(input_iter)
input_iter = clean_reads(input_iter)
input_iter = streamtrim(input_iter, graph, 20, 3)
output_reads(input_iter, out_fp)

Briefly, what this code does is

  1. create the core data structure we use
  2. open an output file
  3. open an input file
  4. create a generator that takes in a stream of data and groups records;
  5. create a generator that takes in records, does necessary preprocessing, and outputs them;
  6. create a generator that does our semi-streaming error trimming (from the semi-streaming preprint);
  7. outputs the reads to the given output fp.

The key bit is that this uses generators, so all of this is happening with full-on streaming. The one exception to this is the 'streamtrim' which has to cache a subset of the reads for processing more than once.

Interestingly, this is an implementation of the core functionality for the '' script that we will be releasing with the next version of khmer (the release after khmer 1.3 - not sure if it's 1.3.1 or 1.4 yet).

You can also replace the 'streamtrim' line with:

input_iter = diginorm(input_iter, graph, 20)

if you want to do digital normalization. That turns this into an implementation of the core functionality for the '' script that has been in khmer for a while.

Obviously these generator implementations are not yet production-ready, although they do give identical results to the current command line scripts.

The question I have, though, is what should the actual API look like?

The two basic options I've come up with are method chaining and UNIX-style pipes.

Method chaining might look like this:

read(in_fp). \
    clean_reads(). \
    streamtrim(graph, 20, 3). \

and piping would be

read(in_fp) | \
     clean_reads() | \
     streamtrim(graph, 20, 3) | \
     output(out_fp) I guess that's really pretty minor. I don't know which is more Pythonic, though, or what would permit more flexibility in terms of an underlying flexibility. Thoughts?

There are some other decisions to be made -- configuration and parameters.

For configuration, ideally we would be able to specify multiple input and output files to be paired with them, and/or set default parameters for multiple steps. Runtime reporting, error handling, and benchmarking should all be put into the mix. Should there be a simple object with hooks to handle all of this, or what? For example,

s = Streamer(...)  # configure inputs and outputs

s | clean_reads() | streamtrim(graph, 20, 3) | output()

where 's' could help by holding metadata or the like. I'm not sure - I've never done this before ;).

As for parameters, I personally like named parameters, so streamtrim above would better be streamtrim(graph, coverage=20, cutoff=3). But perhaps passing in a dictionary of parameters would be more flexible - should we support both? (Yes, I'm aware that in Python you can use ** - is there a preference?)

I'd love some examples of more mature APIs that have had the rough edges rubbed off 'em; this is really a UX question and I'm just not that well equipped to do this kind of design.

Thoughts? Help? Examples?

cheers, --titus

p.s. Down the road we're going to add correct_errors and eventually assemble to the steps, and perhaps also adapter trimming, quality trimming, and mapping. Won't that be fun? Imagine:

assembly = stream_input(filename1).trim().assemble()

to do your assembly... betcha we can stream it all, too.

Comments !

(Please check out the comments policy before commenting.)