Last week we wrote five blog posts about some previously un-publicized features in the khmer software - most specifically, read-to-graph alignment and sparse graph labeling -- and what they enabled. We covered some half-baked ideas on graph-based error correction, variant calling, abundance counting, graph labeling, and assembly evaluation.
It was, to be frank, an immense writing and coding effort and one from which I'm still recovering!
Some details on khmer and replicating results
For anyone interested in following up on implementation details or any other details of the analyses, all of the results we wrote up last week can be replicated from scratch using khmer and publicly available data & scripts. You can also use a Docker container to run everything. To try this all out, use the links at the bottom of each blog post and follow the instructions there.
khmer itself is licensed under the BSD 3-Clause License, and hence fully available for reuse and remixing, including by commercial entities. (Please contact me if you have any questions about this, but it's really that simple.)
The majority of the khmer codebase is C++ with a CPython wrapping that provides a Python interface to the data structures and algorithms. Some people are already using it primarily via the C++ interface, while our own group mainly uses the Python interface.
More reading and references
One wonderful outcome of the blog posts was a bunch of things to read! A few I was already aware of, others were new to me, and I was thoroughly reminded of my lack of knowledge in this area.
In no particular order,
Lex Nederbragt has a wonderful blog post introducing the concept of graph-based genomics, On graph-based representations of a (set of) genomes. The references at the bottom are good for people that want to dive into this more.
Heng Li wrote a nicely technical blog post with a bunch more references.
Zam Iqbal left a nice comment on my first post that largely reiterated the references from Lex and Heng's blog posts (which I should have put in there in the first place, sorry).
Several people pointed me at BGREAT, Read Mapping on de Bruijn graph. I need to read it thoroughly.
Rob Patro pointed me at several papers, including Compression of high throughput sequencing data with probabilistic de Bruijn graph and Reference-based compression of short-read sequences using path encoding. More to read.
Erik Garrison pointed me at 'vg', tools for working with variant graphs. To quote, "It includes SIMD-based "banded" string to graph alignment. Can read and write GFA." See the github repo.
So what was the point?
I had many reasons for investing effort in the the blog posts, but, as with many decisions I make, the reasoning became less clear as I pushed forward. Here are some things I wrote down while thinking about the topic and writing things up --
- we've had a lot of this basic functionality implemented for a while, but had never really applied it to anything. This was an attempt to drive a vertical spike through some problems and see how things worked out.
- taking existing ideas and bridging them to practice is always a good way to understand those ideas better.
- from writing this up, I developed more mature use cases, found broken aspects of the implementation, provided minimal documentation for a bunch of features in khmer, and hopefully sharpened our focus a bit.
- not enough people realize how fundamental a concept graphs (in general) are, and (more specifically) how powerful De Bruijn graphs are! It was fun to write that up in a bit more detail.
- I've found it virtually impossible to think concretely about publishing any of this. Very little of it is particularly novel and I'm not so interested in micro-optimizing the code for specific use cases so that we can publish a "10% better" paper." So writing them up as blog posts seemed like a good way to go, even had that not been my native inclination.
- Providing low-memory and scalable implementations seems like a good idea, especially when it's as simple as ours.
So far I'm quite happy with the results of the blogging (quiet interest, more references, some real improvements in the code base, etc. etc.). For now, I don't have anything more to say than that I'd like to try more technical blogging as a way to release potentially interesting computational bits and bobs to the community, and discuss them openly. It seems like a good way to advance science.