Notes on the "week of khmer"

Last week we wrote five blog posts about some previously un-publicized features in the khmer software - most specifically, read-to-graph alignment and sparse graph labeling -- and what they enabled. We covered some half-baked ideas on graph-based error correction, variant calling, abundance counting, graph labeling, and assembly evaluation.

It was, to be frank, an immense writing and coding effort and one from which I'm still recovering!

Some details on khmer and replicating results

For anyone interested in following up on implementation details or any other details of the analyses, all of the results we wrote up last week can be replicated from scratch using khmer and publicly available data & scripts. You can also use a Docker container to run everything. To try this all out, use the links at the bottom of each blog post and follow the instructions there.

khmer itself is licensed under the BSD 3-Clause License, and hence fully available for reuse and remixing, including by commercial entities. (Please contact me if you have any questions about this, but it's really that simple.)

The majority of the khmer codebase is C++ with a CPython wrapping that provides a Python interface to the data structures and algorithms. Some people are already using it primarily via the C++ interface, while our own group mainly uses the Python interface.

So what was the point?

I had many reasons for investing effort in the the blog posts, but, as with many decisions I make, the reasoning became less clear as I pushed forward. Here are some things I wrote down while thinking about the topic and writing things up --

we've had a lot of this basic functionality implemented for a while, but had never really applied it to anything. This was an attempt to drive a vertical spike through some problems and see how things worked out.
taking existing ideas and bridging them to practice is always a good way to understand those ideas better.
from writing this up, I developed more mature use cases, found broken aspects of the implementation, provided minimal documentation for a bunch of features in khmer, and hopefully sharpened our focus a bit.
not enough people realize how fundamental a concept graphs (in general) are, and (more specifically) how powerful De Bruijn graphs are! It was fun to write that up in a bit more detail.
I've found it virtually impossible to think concretely about publishing any of this. Very little of it is particularly novel and I'm not so interested in micro-optimizing the code for specific use cases so that we can publish a "10% better" paper." So writing them up as blog posts seemed like a good way to go, even had that not been my native inclination.
Providing low-memory and scalable implementations seems like a good idea, especially when it's as simple as ours.

So far I'm quite happy with the results of the blogging (quiet interest, more references, some real improvements in the code base, etc. etc.). For now, I don't have anything more to say than that I'd like to try more technical blogging as a way to release potentially interesting computational bits and bobs to the community, and discuss them openly. It seems like a good way to advance science.

--titus

Living in an Ivory Basement Stochastic thoughts on science, testing, and programming.

Notes on the "week of khmer"

Some details on khmer and replicating results

More reading and references

So what was the point?

Comments !

Some details on khmer and replicating results

More reading and references

So what was the point?

Comments !

social