Hi all,
I've spent the last few weeks working on a DNA sample screening Web site, named "chill-filter". The goal is to support rapid, lightweight compositional analysis of shotgun sequencing DNA data sets - basically, "what's in my sample?!"
You can play with it here: chill-filter.sourmash.bio. It's free, with no login required, and there are a number of examples. Here's one to start with - a human WGS data set with some likely plant and microbial contamination.
chill-filter is built on top of our sourmash software (so, Rust and Python underneath). chill-filter extends and refines sourmash functionality in a few ways.
First, and most important, it has a user interface - it's a Web app. "Just write a damn user interface for once, Titus." OK, fine.
Second, it's rapid and lightweight - nearly realtime. The searches take about 5 seconds for each sample, and they search a database containing 8 human and animal genomes, all 1700 plant genomes, and 600,000 microbial genomes (GTDB RS220).
Third, we sketch the sample on the browser side, which reduces the upload bandwidth required by, like, an awful lot. So you're not uploading your 20 GB data set, but rather a ~2 MB sketch of it (roughly a 10,000-fold reduction).
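For a rough feel of what "sketching" means here: sourmash uses FracMinHash sketches, which keep only the k-mers whose hash falls below a fixed fraction of the hash space (the "scaled" parameter). Below is a toy pure-Python version of that idea. It's a sketch under assumptions, not the code chill-filter runs: the function names are made up, blake2b stands in for the hash function sourmash really uses, and the ksize/scaled values are illustrative rather than chill-filter's actual settings.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space


def reverse_complement(seq):
    """Return the reverse complement of an ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def hash_kmer(kmer):
    """Hash a k-mer to a 64-bit integer (blake2b is a stand-in, an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def fracminhash_sketch(seq, ksize=31, scaled=1000):
    """Keep hashes of canonical k-mers that land in the lowest 1/scaled of hash space."""
    threshold = MAX_HASH // scaled
    sketch = set()
    for i in range(len(seq) - ksize + 1):
        kmer = seq[i:i + ksize]
        # Canonical form: strand-independent choice between a k-mer and its reverse complement.
        canonical = min(kmer, reverse_complement(kmer))
        h = hash_kmer(canonical)
        if h < threshold:
            sketch.add(h)
    return sketch


# Toy input: a random 200 kb "genome" (hypothetical data, for illustration only).
random.seed(1)
genome = "".join(random.choice("ACGT") for _ in range(200_000))
sketch = fracminhash_sketch(genome, ksize=31, scaled=1000)
print(f"{len(genome) - 31 + 1} k-mers -> {len(sketch)} hashes kept")
```

With scaled=1000 you keep about 1 in 1000 distinct k-mers, and the kept set is the same no matter which strand you read. That subsampling is where the size reduction comes from.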
Fourth, there's a near complete absence of configurability, which I think is good, in this case. Just pick your DNA to upload, and boom, you're done. We've chosen parameters that optimize for specificity by minimizing false positives (and sensitivity is really not a problem for sourmash, within the limits of query size). We'll miss stuff that's not in our reference database, of course, but otherwise... it should work well?
Fifth, it's meant to be comprehensive. In the not so distant future it should detect essentially any known genome (barring viruses and other really small fry), and it will scale to all available reference genomes without much trouble. This means that you don't/won't need to pick a database. We'll just include everything.
And last but by no means least: the approach is both theoretically well understood and practically well implemented. Admittedly it has taken us 8 years to get to the point where we can implement this functionality in under a month and in a few hundred lines of code, but that's scientific software engineering, amirite?
Use cases.
The basic use case here is figuring out what part of your sample matches something, anything in a known genome.
On the genome sequencing side, this could help detect and remove contamination.
On the metagenome side, it could help with high level sample profiling and detection of host contamination. And, realistically, this is something that could just be applied to any and every sample as a kind of simple report - it's fast enough to apply comprehensively.
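To make "what part of your sample matches something" a bit more concrete, here's a toy greedy decomposition over sets of hashes, in the spirit of sourmash's gather (a greedy min-set-cover). Everything here is hypothetical and for illustration: the function name, the made-up integer "hashes", and the min_overlap threshold, which stands in for the kind of cutoff that gives up a little sensitivity for fewer false positives. It is not necessarily what chill-filter does under the hood.

```python
def greedy_decompose(sample, references, min_overlap=3):
    """Greedily assign sample hashes to the reference that explains the most of them.

    sample: set of hashes from the sample sketch
    references: dict mapping a name to a set of hashes
    min_overlap: hypothetical cutoff; weaker matches are ignored
    Returns (list of (name, fraction of sample), unassigned fraction).
    """
    if not sample:
        return [], 0.0

    remaining = set(sample)
    results = []
    while remaining:
        best_name, best_overlap = None, 0
        for name in sorted(references):  # sorted for deterministic tie-breaking
            overlap = len(remaining & references[name])
            if overlap > best_overlap:
                best_name, best_overlap = name, overlap
        if best_name is None or best_overlap < min_overlap:
            break  # nothing left that clears the threshold
        results.append((best_name, best_overlap / len(sample)))
        # Each hash is claimed once, so shared content is not double-counted.
        remaining -= references[best_name]

    return results, len(remaining) / len(sample)


# Made-up example: integers stand in for k-mer hashes.
sample = set(range(0, 100))
references = {
    "human": set(range(0, 70)) | set(range(200, 300)),
    "plant": set(range(60, 85)),
    "microbe": set(range(98, 100)),  # only 2 shared hashes: below the cutoff
}
results, unassigned = greedy_decompose(sample, references)
for name, fraction in results:
    print(f"{name}: {fraction:.0%} of sample")
print(f"unassigned: {unassigned:.0%}")
```

The leftover "unassigned" fraction is the part of the sample that matches nothing in the reference set, which is exactly the stuff a database will miss if it isn't in there.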
The whole thing would fit on a reasonably powerful phone (with no network traffic needed), which is pretty cool! (< 1 GB of disk, < 2 GB of RAM, and a single CPU.)
But... what other use cases are there? We'd love to explore!
Note that it is open source (AGPL), locally installable, and supports custom databases.
What's next?
The site is pretty easy to maintain, he says with uncertainty in his voice. The code is simple, and I wrote a reasonable number of tests. (It's written in Flask, which was rather pleasant.)
It's pretty cheap, too, which is good because it's unfunded. I'm paying about $20/month for it at the moment; it's hosted at Digital Ocean. I have a hard time imagining it becoming popular enough that I would need to scale it up...?
I have all sorts of vague development goals, but I won't have too much time over the next few months to do ambitious things.
The main things I plan to prioritize are:
- add more reference genomes to the database. There's already been a request to add fungi and viruses. I think once we add fungi to the current database, and maybe a few more animal genomes, we'll be done for the short term - and it will be a reasonably comprehensive sample screening Web site.
- add some enhanced sample analysis. For example, we can estimate how much of your sample should assemble (i.e. is high abundance).
- refine and validate eukaryotic genome predictions. We already know that sourmash works quite well on bacteria and archaea, and we have some pretty strong indications that it works well on vertebrate genomes, but we should do some more careful work there.
- build out the command-line side of things. For once, we started with a user interface! But there are reasons to support a CLI.
Let us know!
Questions? Thoughts? Spam? Drop us an e-mail, or create an issue, or bump me on bluesky, or send a check to me c/o UC Davis!
And Happy Holidays!
--titus