Living in an Ivory Basement
Stochastic thoughts on science, testing, and programming.

PyCon sprints - playing with named pipes and streaming and khmer

I'm at the PyCon 2015 sprints (day 2), and I took the opportunity to play around with named pipes.

Vince Buffalo reminded me of named pipes in this great blog post. We at the khmer project are very interested in streaming, and named pipes fit well with a streaming perspective, so I thought I'd check them out as a way to pass sequences between different khmer scripts.

First, I tried using named pipes to tie digital normalization together with splitting reads out into two output files, left and right:

mkfifo aa
mkfifo bb

# set up diginorm, reading from 'aa' and writing to 'bb'
normalize-by-median.py aa -o bb &

# split reads into left and right, reading from 'bb' and outputting to
# output.1 / output.2
split-paired-reads.py bb -1 output.1 -2 output.2 &

# feed in your sequences!
cat sequence.fa > aa

The setup here is a bit weird, because you have to start all the commands that will operate on the data (in the background, hence the '&') before feeding the data in; each command then waits on its pipe until the other end is opened. But it all works!
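If you want to try the plumbing without khmer installed, here is a minimal sketch of the same pattern using only standard tools. Note that 'tr' and 'awk' are stand-ins I've picked for illustration (they are not what the khmer scripts do), and the four input lines are made up:

```shell
# Create the two named pipes (FIFOs), as in the khmer example.
mkfifo aa bb

# Stage 1 (stand-in for diginorm): read 'aa', upper-case it, write to 'bb'.
# Runs in the background and blocks until something writes to 'aa'.
tr 'a-z' 'A-Z' < aa > bb &

# Stage 2 (stand-in for read splitting): read 'bb', send odd-numbered
# lines to output.1 and even-numbered lines to output.2.
awk 'NR % 2 == 1 { print > "output.1" } NR % 2 == 0 { print > "output.2" }' bb &

# Feed in the data; this is what unblocks the two stages above.
printf 'acgt\nttga\nccat\nggta\n' > aa

# Wait for both background stages to finish, then remove the FIFOs.
wait
rm aa bb
```

The 'wait' and 'rm' at the end are my additions: named pipes stick around in the filesystem until you delete them, so it's worth cleaning up once the background jobs are done.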

Next, I tried process substitution. This does essentially the same thing as above, but is much nicer and more succinct:

normalize-by-median.py sequence.fa -o >(split-paired-reads.py /dev/stdin -1 output.1 -2 output.2)
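To see what the shell is doing there, here is a small sketch with standard tools. The 'upcase' function is a hypothetical stand-in for a script that takes an '-o FILE' option, and the file names are made up. Process substitution is a bash/zsh/ksh feature, so this assumes bash rather than plain POSIX sh:

```shell
#!/usr/bin/env bash

# Made-up input data for the sketch.
printf 'acgt\nttga\nccat\nggta\n' > sequence.txt

# Hypothetical stand-in for a script with an '-o FILE' option.
# Usage: upcase INPUT -o OUTPUT
upcase() {
    tr 'a-z' 'A-Z' < "$1" > "$3"
}

# >(...) is replaced by a path (e.g. /dev/fd/63 on Linux) that is connected
# to the stdin of the command inside, so 'upcase' believes it is writing to
# an ordinary output file while awk reads that output as a stream.
#
# The trailing '| cat' is only there to make the script wait: the >(...)
# process runs asynchronously, and cat sees end-of-file only once every
# process holding the pipe, including awk, has exited.
upcase sequence.txt -o >(awk 'NR % 2 == 1 { print > "left.txt" } NR % 2 == 0 { print > "right.txt" }') | cat
```

Without something like the '| cat', the outer command can return before the command inside >(...) has finished writing its files, which matters if the next line of a script reads them.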

Finally, I tried to use functionality from our new semi-streaming paper to run 3-pass digital normalization in a semi-streaming fashion:

trim-low-abund.py -Z 20 -C 2 sequence.fa -o >(normalize-by-median.py /dev/stdin -o result.fa)

but this fell apart because 'trim-low-abund.py' doesn't support -o ;). This led to a few issues being filed in our issue tracker...

Very neat stuff. It has certainly given me a strong reason to make sure all of our scripts support streaming input and output!

--titus
