<h1>Living in an Ivory Basement</h1>
<p><em>Stochastic thoughts on science, testing, and programming.</em></p>
<h1>Speeding sourmash the heck up (2024-02-20, C. Titus Brown)</h1>
<p>Faster things are always nice, right?</p>
<p>sourmash is our tool for genome and metagenome investigation. Using and developing it has been a major focus of our lab for over 7 years, and maintaining and extending it is my main passion project. sourmash is a k-mer multitool that enables all sorts of really neat bulk metagenome analyses!</p>
<p>I'm proud to say that last week we released <a href="https://github.com/sourmash-bio/sourmash/releases/tag/v4.8.6">a new version of sourmash, v4.8.6</a>, that continues to improve functionality, increase documentation, and decrease computational requirements. But, you know, we release new versions of sourmash pretty regularly, so that's only moderately exciting :).</p>
<p>A bit more exciting - we are hopefully closing in on an updated Journal of Open Source Software publication via our <a href="https://github.com/pyOpenSci/software-submission/issues/129">pyopensci review</a>. I wanted to highlight something very nice one of our reviewers said:</p>
<blockquote>
<p>Outstanding work with sourmash! Your commitment to creating a package that's both easily maintainable and well-documented truly shines through. The code is impressively organized, accompanied by clear comments explaining each section, making it easy to comprehend the purpose of each file and function.</p>
</blockquote>
<p>It's so nice to have your multiple years of effort be appreciated!</p>
<p>The most exciting news is that we've released a significant update to our <a href="https://github.com/sourmash-bio/sourmash_plugin_branchwater">branchwater plugin for sourmash</a>. This plugin supplies fast, low-memory, and multithreaded versions of common sourmash functions. <a href="https://github.com/sourmash-bio/sourmash_plugin_branchwater/releases/tag/v0.9.0">Version 0.9.0 of sourmash_plugin_branchwater</a> dramatically improves the convenience of using the plugin while also speeding up a common use case and, perhaps most importantly to us maintainers, making significant moves towards convergence with the core sourmash code base.</p>
<p>What's that, you say?? <strong>Fast, low-memory, and multithreaded</strong> sourmash functionality?</p>
<p>Yep. Using our test metagenome, the SRR606249 mock community, you can search all 400,000 genomes in the GTDB rs214 release in around 2 minutes, using under 2 GB of RAM and 64 cores. That's roughly 7-fold lower memory than regular ol' sourmash, and approximately 20x faster. Even cooler, if you index GTDB first, you can do it in 600 MB of RAM!</p>
<table>
<thead>
<tr>
<th>software/version</th>
<th>command</th>
<th>details</th>
<th>time</th>
<th>max RAM</th>
</tr>
</thead>
<tbody>
<tr>
<td>sourmash v4.8.6</td>
<td><code>gather</code></td>
<td>the OG</td>
<td>42m 26s</td>
<td>14.5 GB</td>
</tr>
<tr>
<td>branchwater v0.9.0</td>
<td><code>fastgather</code></td>
<td>against zip</td>
<td><span style="color:green"><strong>2m 5s</strong></span></td>
<td><span style="color:red"><strong>14.1 GB^</strong></span></td>
</tr>
<tr>
<td>branchwater v0.9.0</td>
<td><code>fastgather</code></td>
<td>against pathlist</td>
<td><span style="color:green"><strong>2m 26s</strong></span></td>
<td><span style="color:green"><strong>1.8 GB</strong></span></td>
</tr>
<tr>
<td>branchwater v0.9.0</td>
<td><code>fastgather</code></td>
<td>against manifest</td>
<td><span style="color:green"><strong>2m 19s</strong></span></td>
<td><span style="color:green"><strong>1.9 GB</strong></span></td>
</tr>
<tr>
<td>branchwater v0.9.0</td>
<td><code>fastmultigather</code></td>
<td>against rocksdb</td>
<td><span style="color:green"><strong>2m 8s</strong></span></td>
<td><span style="color:green"><strong>600 MB</strong></span></td>
</tr>
<tr>
<td>branchwater v0.8.6</td>
<td><code>fastgather</code></td>
<td>against pathlist</td>
<td>2m 24s</td>
<td>1.6 GB</td>
</tr>
<tr>
<td>branchwater v0.8.6</td>
<td><code>fastgather</code></td>
<td>against zip</td>
<td><span style="color:purple"><strong>28m 34s</strong></span></td>
<td>1.7 GB</td>
</tr>
</tbody>
</table>
<p>^ This benchmark number isn't really real, despite it being reported under Max RSS. The measurement is high because the zip library we're using in Rust uses <code>memmap</code> - actual heap consumption is in the 2 GB range, matching the other approaches. See <a href="https://github.com/sourmash-bio/sourmash/issues/2340">sourmash#2340</a> for more info. </p>
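<p>For orientation, here's roughly what the benchmarked <code>fastgather</code> run looks like on the command line. Treat this as a hedged sketch rather than exact syntax: the <code>sourmash sketch</code> line follows the usage shown elsewhere on this blog, but the database filename and options here are illustrative - check the <a href="https://github.com/sourmash-bio/sourmash_plugin_branchwater">branchwater plugin README</a> for the current invocation.</p>
<div class="highlight"><pre><span></span><code># sketch the query metagenome, then gather against a zipfile database
# (filenames are placeholders; see the plugin docs for all options)
sourmash sketch dna -p k=31 SRR606249.fastq.gz -o SRR606249.sig
sourmash scripts fastgather SRR606249.sig gtdb-rs214.zip -o results.csv
</code></pre></div>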
<p>Anyhoo. sourmash v4.8.6 and sourmash_plugin_branchwater v0.9.0 are both available via conda & conda-forge. Enjoy!</p>
<p>--titus</p>
<h1>The history of the "Tragedy of the Commons" (2024-01-13, C. Titus Brown)</h1>
<p>No, just no.</p>
<p>I've been really interested in applying lessons from
<a href="https://en.wikipedia.org/wiki/Common-pool_resource">common pool resource theory</a>
to my own work and interests in open source and open science
<a href="http://ivory.idyll.org/blog/tag/cpr.html">(see my various posts)</a>.
The
<a href="https://en.wikipedia.org/wiki/Common-pool_resource#Common_property_protocols">framework around this created by Dr. Elinor Ostrom</a>,
for which she received the Nobel Prize in Economics, is awe-inspiring
and incredibly motivational! I've also thoroughly enjoyed the <a href="https://podcasts.apple.com/us/podcast/frontiers-of-commoning-with-david-bollier/id1501085005">Frontiers of Commoning podcast</a> that David Bollier runs, which showcases many ongoing communities and efforts in these areas.</p>
<p>All of this is strongly coupled (negatively) to the well-known concept
of the Tragedy of the Commons, published in 1968 by
<a href="https://en.wikipedia.org/wiki/Garrett_Hardin">Dr. Garret Hardin</a>, a
professor at UCSB. It turns out that Hardin was not only very wrong
(see above links on CPR!) but also
<a href="https://blogs.scientificamerican.com/voices/the-tragedy-of-the-tragedy-of-the-commons/">a terrible person</a>,
and if you care to read the Tragedy of the Commons article, it's,
well, very bad (ibid). (If you prefer a podcast to reading,
<a href="https://srslywrong.com/podcast/235-the-imaginary-tragedy-of-the-hypothetical-commons/">here's one that looks good</a>,
from srsly wrong.)</p>
<p>Anyway, I find CPR theory tremendously inspiring, and it provides
wonderful counterexamples to the beliefs that only strong hierarchy,
authoritarian governance, and/or corporate enclosure can work to
manage resources. Highly recommended. Always happy to chat, although I'd
suggest just reading widely instead, since I'm by no means an expert on any
of this!</p>
<p>salud!</p>
<p>--titus</p>
<h1>Sourmash and branchwater licensing: thoughts on extractive engagement with projects (2024-01-07, C. Titus Brown)</h1>
<p>What licenses should be used, for what purpose?</p>
<p>I am helping maintain some petabase-scale genomic search
infrastructure as part of the
<a href="https://sourmash.readthedocs.io/">sourmash</a> and
<a href="https://branchwater.sourmash.bio/">branchwater</a> projects. One of the
questions that's frequently in the back of my mind is how to
incentivize
<a href="http://ivory.idyll.org/blog/2018-how-open-is-too-open.html">commons-style engagement rather than extractive engagement</a>,
and a key tool for this purpose is licensing.</p>
<p>Sourmash is BSD-licensed, which, in essence, means that anyone can do
whatever they want with the code - including incorporating it
unchanged into a commercial closed-source product, rebranding it as a
new product, and/or changing it in incompatible ways (and then
rebranding it as a new and better product). This is typically
something that companies will do, although it also happens with open
source forks. (See: <a href="https://www.infoq.com/news/2021/04/amazon-opensearch/">Elasticsearch to OpenSearch</a>; and <a href="https://matrix.org/blog/2023/11/06/future-of-synapse-dendrite/">Matrix</a>).</p>
<p>Branchwater, our internal code-name for the collection of
sourmash-based functionality that enables petabase-scale search, is
<a href="https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/60">licensed under AGPL</a>. This
means that anyone can use it however they want, as long as they
release any modifications they make to the source code. In particular,
this also applies to people providing a service based on the
branchwater code:</p>
<blockquote>
<p>Let’s say you create a software program. Another developer takes and
modifies it, and then provides access to that modification to paying
customers through a software-as-a-service model. Under the GPL v3,
that modification would essentially become proprietary because it
wasn’t technically distributed. Under AGPL, however, that developer
would need to make their modified source code available for
download. <a href="https://github.com/sourmash-bio/sourmash_plugin_branchwater/issues/60">(link)</a></p>
</blockquote>
<p>IIRC, there are a couple of reasons that Dr. Luiz Irber (the initial
author of the branchwater code, and the originator of most of the
branchwater code and supporting infrastructure) chose AGPL. One of the
main ones (again, IIRC) is to discourage incompatible forks of the
source code. But it also discourages many kinds of extractive
behavior: a company could not, for example, take this code, modify it
in sekret ways, and provide services based upon that sekrecy, without
providing the modified code openly under the AGPL license.</p>
<p>You could argue that the AGPL license decreases certain kinds of
uptake. Perhaps so, and I chose the BSD license for sourmash (with
Luiz's OK, albeit in a situation where I was his supervisor...)
specifically to encourage uptake, reuse, modification, and
experimentation. I don't know how to evaluate the success of this
choice, really, other than to say that I still don't see a blindingly
obvious downside to it (as of Jan 5, 2024 :).</p>
<p>At the end of the day, my thoughts trend towards seeing the value in
sourmash as less algorithmic innovation and more infrastructure
innovation. We are maintaining and sustaining a very functional and
useful piece of software, with good documentation and an
ever-expanding range of use cases. And it remains very useful to me
and my lab, specifically. Not only do I not care if companies extract
value from it - there are many ways to skin this particular cat - but
I am happy and excited that my labor as an academic is actually useful
to someone else.</p>
<p>On the flip side, branchwater is both more niche and more
difficult. There aren't many ways to do petabase-scale search, and
there is a lot more infrastructure maintenance involved. I would be
sad to see someone take our (collective) investment in this
functionality and build upon it without returning something to the
community of developers.</p>
<p>I'm not sure what and where the dividing line between these two
situations is for me. But I think sketching out the current line is a
good start :).</p>
<p>--titus</p>
<h1>snakemake for doing bioinformatics - inputs and outputs and more! (2023-04-07, C. Titus Brown)</h1>
<p>Slithering your way into bioinformatics with snakemake - inputs and outputs and more!</p>
<h1><code>input:</code> and <code>output:</code> blocks</h1>
<p>As we saw <a href="http://ivory.idyll.org/blog/2023-snakemake-slithering-section-1.html">before</a>, snakemake will automatically
"chain" rules by connecting inputs to outputs. That is, snakemake
will figure out <em>what to run</em> in order to produce the desired output,
even if it takes many steps.</p>
<p>We also saw that snakemake will fill
in <code>{input}</code> and <code>{output}</code> in the shell command based on the contents
of the <code>input:</code> and <code>output:</code> blocks. This becomes even more useful
when using wildcards to generalize rules, where wildcard values are properly
substituted into the <code>{input}</code> and <code>{output}</code> values.</p>
<p>Input and output blocks are key components of snakemake workflows.
Below, we will discuss the use of input and output blocks
a bit more comprehensively.</p>
<h2>Providing inputs and outputs</h2>
<p>As we saw previously, snakemake will happily take multiple input and
output values via comma-separated lists and substitute them into strings
in shell blocks.</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">example</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"file1.txt"</span><span class="p">,</span>
<span class="s2">"file2.txt"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"output file1.txt"</span><span class="p">,</span>
<span class="s2">"output file2.txt"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> echo {input:q}</span>
<span class="s2"> echo {output:q}</span>
<span class="s2"> touch {output:q}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>When these are substituted into shell commands with <code>{input}</code> and
<code>{output}</code> they will be turned into space-separated ordered lists:
e.g. the above shell command will print out first <code>file1.txt
file2.txt</code> and then <code>output file1.txt output file2.txt</code> before using <code>touch</code> to
create the empty output files.</p>
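<p>Concretely, running this rule should print something like the following before creating the two (empty) output files:</p>
<div class="highlight"><pre><span></span><code>file1.txt file2.txt
output file1.txt output file2.txt
</code></pre></div>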
<p>In this example we are also asking snakemake to quote filenames for
the shell command using <code>:q</code> - this means that if there are spaces,
characters like single or double quotation marks, or other characters
with special meaning they will be properly escaped using
<a href="https://docs.python.org/3/library/shlex.html#shlex.quote">Python's shlex.quote function</a>.
For example, here both output files contain a space, and so <code>touch
{output}</code> would create three files -- <code>output</code>, <code>file1.txt</code>, and
<code>file2.txt</code> -- rather than the correct two files, <code>output file1.txt</code>
and <code>output file2.txt</code>.</p>
<p><strong>Quoting filenames with <code>{...:q}</code> should always be used for anything
executed in a shell block</strong> - it does no harm and it can prevent
serious bugs!</p>
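<p>Under the hood this is plain Python string quoting. A minimal sketch of what <code>:q</code> does to each filename, using the <code>shlex.quote</code> function mentioned above:</p>
<div class="highlight"><pre><span></span><code>import shlex

# safe filenames pass through unchanged...
print(shlex.quote("file1.txt"))         # file1.txt

# ...while filenames with spaces get quoted for the shell
print(shlex.quote("output file1.txt"))  # 'output file1.txt'
</code></pre></div>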
<h3>Digression: Where can we (and should we) put commas?</h3>
<p>In the above code example, you will notice that <code>"file2.txt"</code> and
<code>"output file2.txt"</code> have commas after them:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">example</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"file1.txt"</span><span class="p">,</span>
<span class="s2">"file2.txt"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"output file1.txt"</span><span class="p">,</span>
<span class="s2">"output file2.txt"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> echo {input:q}</span>
<span class="s2"> echo {output:q}</span>
<span class="s2"> touch {output:q}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Are these required? <strong>No.</strong> The above code is equivalent to:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">example</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"file1.txt"</span><span class="p">,</span>
<span class="s2">"file2.txt"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"output file1.txt"</span><span class="p">,</span>
<span class="s2">"output file2.txt"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> echo {input:q}</span>
<span class="s2"> echo {output:q}</span>
<span class="s2"> touch {output:q}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>where there are no commas after the last line in input and output.</p>
<p>The general rule is this: you need internal commas to separate items
in the list, because otherwise strings will be concatenated to each
other - i.e. <code>"file1.txt" "file2.txt"</code> will become <code>"file1.txtfile2.txt"</code>,
even if there's a newline between them! But a comma trailing after the
last filename is optional (and ignored).</p>
<p>Why!? These are <em>Python tuples</em> and you can add a trailing comma if
you like: <code>a, b, c,</code> is equivalent to <code>a, b, c</code>.</p>
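<p>You can check both behaviors in plain Python - adjacent string literals silently concatenate, while commas (trailing or not) build a tuple:</p>
<div class="highlight"><pre><span></span><code># no comma: the two literals become ONE string
files = ("file1.txt"
         "file2.txt")
print(files)    # file1.txtfile2.txt

# commas, including an optional trailing one: a tuple of two strings
files = ("file1.txt",
         "file2.txt",)
print(files)    # ('file1.txt', 'file2.txt')
</code></pre></div>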
<p>So why do we add a trailing comma?! I suggest using trailing commas
because it makes it easy to add a new input or output without
forgetting to add a comma, and this is a mistake I make frequently!
This is a (small and simple but still useful) example of <em>defensive
programming</em>, where we can use optional syntax rules to head off common
mistakes.</p>
<h2>Inputs and outputs are <em>ordered lists</em></h2>
<p>We can also refer to individual input and output entries by using
square brackets to index them as lists, starting with position 0:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">example</span><span class="p">:</span>
<span class="o">...</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> echo first input is {input[0]:q}</span>
<span class="s2"> echo second input is {input[1]:q}</span>
<span class="s2"> echo first output is {output[0]:q}</span>
<span class="s2"> echo second output is {output[1]:q}</span>
<span class="s2"> touch </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>However, <strong>we don't recommend this</strong> because it's fragile. If you
change the order of the inputs and outputs, or add new inputs, you
have to go through and adjust the indices to match. Relying on the
number and position of indices in a list is error prone and will make
your Snakefile harder to change later on!</p>
<h2>Using keywords for input and output files</h2>
<p>You can also name specific inputs and outputs using the <em>keyword</em>
syntax, and then refer to those using <code>input.</code> and <code>output.</code> prefixes.
The following Snakefile rule does this:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">example</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">a</span><span class="o">=</span><span class="s2">"file1.txt"</span><span class="p">,</span>
<span class="n">b</span><span class="o">=</span><span class="s2">"file2.txt"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="n">a</span><span class="o">=</span><span class="s2">"output file1.txt"</span><span class="p">,</span>
<span class="n">c</span><span class="o">=</span><span class="s2">"output file2.txt"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> echo first input is {input.a:q}</span>
<span class="s2"> echo second input is {input.b:q}</span>
<span class="s2"> echo first output is {output.a:q}</span>
<span class="s2"> echo second output is {output.c:q}</span>
<span class="s2"> touch {output:q}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Here, <code>a</code> and <code>b</code> in the input block, and <code>a</code> and <code>c</code> in the output block,
are keyword names for the input and output files; in the shell command,
they can be referred to with <code>{input.a}</code>, <code>{input.b}</code>, <code>{output.a}</code>, and
<code>{output.c}</code> respectively. Any valid variable name can be used, and the
same name can be used in the input and output blocks without collision,
as with <code>input.a</code> and <code>output.a</code>, above, which are distinct values.</p>
<p><strong>This is our recommended way of referring to specific input and
output files.</strong> It is clearer to read, robust to rearrangements or
additions, and (perhaps most importantly) can help guide the reader
(including "future you") to the <em>purpose</em> of each input and output.</p>
<p>If you use the wrong keyword names in your shell code, you'll get an
error message. For example, this code:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">example</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">a</span><span class="o">=</span><span class="s2">"file1.txt"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="n">a</span><span class="o">=</span><span class="s2">"output file1.txt"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> echo first input is {input.z:q}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>gives this error message:</p>
<div class="highlight"><pre><span></span><code>AttributeError: 'InputFiles' object has no attribute 'z', when formatting the following:
echo first input is {input.z:q}
</code></pre></div>
<h2>Example: writing a flexible command line</h2>
<p>One example where it's particularly useful to be able to refer to
specific inputs is when running programs on files where the input
filenames need to be specified as optional arguments. One such
program is the <code>megahit</code> assembler when it runs on paired-end input
reads. Consider the following Snakefile:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"assembly_out"</span>
<span class="n">rule</span> <span class="n">assemble</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">R1</span><span class="o">=</span><span class="s2">"sample_R1.fastq.gz"</span><span class="p">,</span>
<span class="n">R2</span><span class="o">=</span><span class="s2">"sample_R2.fastq.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="n">directory</span><span class="p">(</span><span class="s2">"assembly_out"</span><span class="p">)</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> megahit -1 </span><span class="si">{input.R1}</span><span class="s2"> -2 </span><span class="si">{input.R2}</span><span class="s2"> -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>In the shell command here, we need to supply the input reads as two
separate files, with <code>-1</code> before one and <code>-2</code> before the second. As a
bonus the resulting shell command is very readable!</p>
<h2>Input functions and more advanced features</h2>
<p>There are a number of more advanced uses of input and output that rely
on Python programming - for example, one can define a Python function
that is called to <em>generate</em> a value dynamically, as below -</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">multiply_by_5</span><span class="p">(</span><span class="n">w</span><span class="p">):</span>
<span class="k">return</span> <span class="sa">f</span><span class="s2">"file</span><span class="si">{</span><span class="nb">int</span><span class="p">(</span><span class="n">w</span><span class="o">.</span><span class="n">val</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="mi">5</span><span class="si">}</span><span class="s2">.txt"</span>
<span class="n">rule</span> <span class="n">make_file</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="c1"># look for input file{val*5}.txt if asked to create output{val}.txt</span>
<span class="n">filename</span><span class="o">=</span><span class="n">multiply_by_5</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"output</span><span class="si">{val}</span><span class="s2">.txt"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> cp </span><span class="si">{input}</span><span class="s2"> {output:q}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>When asked to create <code>output5.txt</code>, this rule will look for
<code>file25.txt</code> as an input.</p>
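<p>To try this out, you could ask snakemake for a matching target by name - something like the below, which will fail unless a <code>file25.txt</code> actually exists to be copied:</p>
<div class="highlight"><pre><span></span><code>snakemake -j 1 output5.txt
</code></pre></div>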
<p>Since this functionality relies on knowledge of
<a href="http://ivory.idyll.org/blog/2023-snakemake-slithering-wildcards.html">wildcards</a> as well as some knowledge of Python, it's too advanced
to talk about here. More on that later!</p>
<h2>References and Links</h2>
<ul>
<li><a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-and-rules">Snakemake manual section on rules</a></li>
</ul>
<h1><code>params:</code> blocks and <code>{params}</code></h1>
<p>As we saw above, input and output blocks are key to the way snakemake works: they let
snakemake automatically connect rules based on the inputs necessary
to create the desired output. However, input and output blocks are
limited in certain ways: most specifically, every entry in both input
and output blocks <em>must</em> be a filename. And, because of the way
snakemake works, the filenames specified in the input and output
blocks must exist in order for the workflow to proceed past that
rule.</p>
<p>Frequently, shell commands need to take parameters other than
filenames, and these parameters may be values that can or should be
calculated by snakemake. Therefore, snakemake also supports a
<code>params:</code> block that can be used to provide parameter strings that are <em>not</em>
filenames in the shell block. As
you'll see below, these can be used for a variety of purposes,
including user-configurable parameters as well as parameters that can
be calculated automatically by Python code.</p>
<h2>A simple example of a params block</h2>
<p>Consider:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">use_params</span><span class="p">:</span>
<span class="n">params</span><span class="p">:</span>
<span class="n">val</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"output.txt"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> echo </span><span class="si">{params.val}</span><span class="s2"> > </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Here, the value <code>5</code> is assigned to the name <code>val</code> in the <code>params:</code> block,
and is then available under the name <code>{params.val}</code> in the <code>shell:</code> block.
This is analogous to using keywords in input and output blocks, but unlike in
input and output blocks, keywords <em>must</em> be used in params blocks.</p>
<p>In this example, there's no gain in functionality, but there is some
gain in readability: the syntax makes it clear that <code>val</code> is a tunable
parameter that can be modified without understanding the details of
the shell block.</p>
<h2>Params blocks have access to wildcards</h2>
<p>Just like the <code>input:</code> and <code>output:</code> blocks, wildcard values are
directly available in <code>params:</code> blocks without using the <code>wildcards</code>
prefix; for example, this means that you can use them in strings with
the standard <a href="https://docs.python.org/3/library/string.html#formatspec">string formatting operations</a>.</p>
<p>This is useful when a shell command needs to use something other than
the filename - for example, the <code>bowtie</code> read alignment software takes
the <em>prefix</em> of the output SAM file via <code>-S</code>, which means you cannot
name the file correctly with <code>bowtie ... -S {output}</code>. Instead, you
could use <code>{params.prefix}</code> like so:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"reads.sam"</span>
<span class="n">rule</span> <span class="n">use_params</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"</span><span class="si">{prefix}</span><span class="s2">.fq"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{prefix}</span><span class="s2">.sam"</span><span class="p">,</span>
<span class="n">params</span><span class="p">:</span>
<span class="n">prefix</span> <span class="o">=</span> <span class="s2">"</span><span class="si">{prefix}</span><span class="s2">"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> bowtie index -U </span><span class="si">{input}</span><span class="s2"> -S </span><span class="si">{params.prefix}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>If you were to use <code>-S {output}</code> here, you would end up producing a file
<code>reads.sam.sam</code>!</p>
<h2>Links and references:</h2>
<ul>
<li>Snakemake docs: <a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#non-file-parameters-for-rules">non-file parameters for rules</a></li>
</ul>
<h1>Using <code>expand</code> to generate filenames</h1>
<p><a href="http://ivory.idyll.org/blog/2023-snakemake-slithering-wildcards.html">Snakemake wildcards</a> make it easy to apply rules to
many files, but also create a new challenge: how do you generate all the
filenames you want?</p>
<p>As an example of this challenge, consider the list of genomes needed
for rule <code>compare_genomes</code> from <a href="http://ivory.idyll.org/blog/2023-snakemake-slithering-section-2.html">before</a> -</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_008423265.1.fna.gz.sig"</span><span class="p">,</span>
</code></pre></div>
<p>This list is critical because it specifies the sketches to be created
by the wildcard rule. However, writing this list out is annoying and
error prone, because parts of every filename are identical and
repeated.</p>
<p>Even worse, if you need to use this list in multiple places, or
produce slightly different filenames with the same accessions, things
get even more error prone: you are likely to want to add, remove, or edit
elements of the list, and you will need to change it in multiple
places.</p>
<p><a href="http://ivory.idyll.org/blog/2023-snakemake-slithering-section-2.html">Previously</a>, we showed how to change this to a list of the
accessions at the top of the Snakefile and then used a function called
<code>expand</code> to generate the list:</p>
<div class="highlight"><pre><span></span><code><span class="n">ACCESSIONS</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"GCF_000017325.1"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1"</span><span class="p">,</span>
<span class="s2">"GCF_008423265.1"</span><span class="p">]</span>
<span class="c1">#...</span>
<span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"</span><span class="si">{acc}</span><span class="s2">.fna.gz.sig"</span><span class="p">,</span> <span class="n">acc</span><span class="o">=</span><span class="n">ACCESSIONS</span><span class="p">),</span>
</code></pre></div>
<p>Using <code>expand</code> to generate lists of filenames is a common pattern in
Snakefiles, and we'll explore it more below!</p>
<h2>Using <code>expand</code> with a single pattern and one list of values</h2>
<p>In the example above, we provide a single pattern, <code>{acc}.fna.gz.sig</code>,
and ask <code>expand</code> to resolve it into many filenames by filling in values for
the field name <code>acc</code> from each element in <code>ACCESSIONS</code>. (You may recognize
the keyword syntax for specifying values, <code>acc=ACCESSIONS</code>, from
input and output blocks, above!)</p>
<p>The result of <code>expand('{acc}.fna.gz.sig', acc=...)</code> here is
<em>identical</em> to writing out the four filenames in long form:</p>
<div class="highlight"><pre><span></span><code>"GCF_000017325.1.fna.gz.sig",
"GCF_000020225.1.fna.gz.sig",
"GCF_000021665.1.fna.gz.sig",
"GCF_008423265.1.fna.gz.sig"
</code></pre></div>
<p>That is, <code>expand</code> doesn't do any special wildcard matching or pattern
inference - it just fills in the values and returns the resulting list.</p>
<p>Here, <code>ACCESSIONS</code> can be any Python <em>iterable</em> - for example a list, a tuple,
or a dictionary.</p>
<h2>Using <code>expand</code> with multiple lists of values</h2>
<p>You can also use <code>expand</code> with multiple field names. Consider:</p>
<div class="highlight"><pre><span></span><code>expand('{acc}.fna.{extension}`, acc=ACCESSIONS, extension=['.gz.sig', .gz'])
</code></pre></div>
<p>This will produce the following eight filenames:</p>
<div class="highlight"><pre><span></span><code>"GCF_000017325.1.fna.gz.sig",
"GCF_000017325.1.fna.gz",
"GCF_000020225.1.fna.gz.sig",
"GCF_000020225.1.fna.gz",
"GCF_000021665.1.fna.gz.sig",
"GCF_000021665.1.fna.gz",
"GCF_008423265.1.fna.gz.sig",
"GCF_008423265.1.fna.gz"
</code></pre></div>
<p>by substituting <em>all possible</em> combinations of <code>acc</code> and <code>extension</code> into
the provided pattern.</p>
<h2>Generating <em>all</em> combinations vs <em>pairwise</em> combinations</h2>
<p>As we saw above, with multiple lists of values, <code>expand</code> will generate all
possible combinations: that is,</p>
<div class="highlight"><pre><span></span><code><span class="n">X</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]</span>
<span class="n">Y</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'a'</span><span class="p">,</span> <span class="s1">'b'</span><span class="p">,</span> <span class="s1">'c'</span><span class="p">]</span>
<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s1">'</span><span class="si">{x}</span><span class="s1">.by.</span><span class="si">{y}</span><span class="s1">'</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">Y</span><span class="p">)</span>
</code></pre></div>
<p>will generate 9 filenames: <code>1.by.a</code>, <code>1.by.b</code>, <code>1.by.c</code>, <code>2.by.a</code>, etc.
And if you added a third field to the <code>expand</code> pattern, <code>expand</code> would
also add that into the combinations!</p>
<p>So what's going on here?</p>
<p>By default, expand does an all-by-all expansion containing all
possible combinations. (This is sometimes
called a Cartesian product, a cross-product, or an outer join.)</p>
<p>But you don't always want that. How can we change this behavior?</p>
<p>The <code>expand</code> function takes an optional second argument, the
combinator, which tells <code>expand</code> how to combine the lists of values
that come after. By default <code>expand</code> uses a Python function called
<code>itertools.product</code>, which creates all possible combinations, but you
can give it other functions.</p>
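<p>You can see the difference between the default combinator and <code>zip</code> in plain Python:</p>
<div class="highlight"><pre><span></span><code>import itertools

X = [1, 2, 3]
Y = ['a', 'b', 'c']

# all-by-all, as expand uses by default: 9 combinations
print(list(itertools.product(X, Y)))
# [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), ..., (3, 'c')]

# pairwise: 3 combinations
print(list(zip(X, Y)))
# [(1, 'a'), (2, 'b'), (3, 'c')]
</code></pre></div>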
<p>In particular, you can tell <code>expand</code> to create pairwise combinations
by using <code>zip</code> instead - something we did in one of the
<a href="http://ivory.idyll.org/blog/2023-snakemake-slithering-wildcards.html">wildcard examples</a>.</p>
<p>Here's an example:</p>
<div class="highlight"><pre><span></span><code><span class="n">X</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]</span>
<span class="n">Y</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'a'</span><span class="p">,</span> <span class="s1">'b'</span><span class="p">,</span> <span class="s1">'c'</span><span class="p">]</span>
<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s1">'</span><span class="si">{x}</span><span class="s1">.by.</span><span class="si">{y}</span><span class="s1">'</span><span class="p">,</span> <span class="nb">zip</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">Y</span><span class="p">)</span>
</code></pre></div>
<p>which will now generate only three filenames: <code>1.by.a</code>, <code>2.by.b</code>, and <code>3.by.c</code>.</p>
<p>The big caveat here is that <code>zip</code> will create an output list the length
of the shortest input list - so if you give it one list of three elements,
and one list of two elements, it will only use two elements from the first
list.</p>
<p>For example, in the <code>expand</code> in this <code>Snakefile</code>,</p>
<div class="highlight"><pre><span></span><code><span class="n">X</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]</span>
<span class="n">Y</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'a'</span><span class="p">,</span> <span class="s1">'b'</span><span class="p">]</span>
<span class="n">rule</span> <span class="n">all_zip_short</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s1">'</span><span class="si">{x}</span><span class="s1">.by.</span><span class="si">{y}</span><span class="s1">'</span><span class="p">,</span> <span class="nb">zip</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="n">Y</span><span class="p">)</span>
</code></pre></div>
<p>only <code>1.by.a</code> and <code>2.by.b</code> will be generated, as there is no partner
for <code>3</code> in the second list.</p>
<p>For more information see the <a href="https://snakemake.readthedocs.io/en/stable/project_info/faq.html#i-don-t-want-expand-to-use-the-product-of-every-wildcard-what-can-i-do">snakemake documentation on using zip instead of product</a>.</p>
<h2>Getting a list of identifiers to use in <code>expand</code></h2>
<p>The <code>expand</code> function provides an effective solution when you have
lists of identifiers that you use multiple times in a workflow - a common
pattern in bioinformatics! Writing these lists out in a Snakefile
(as we do in the above examples) is not always practical, however;
you may have dozens to hundreds of identifiers!</p>
<p>Lists of identifiers can be loaded from <em>other</em> files in a variety of
ways, and they can also be generated from the set of actual files in
a directory using <code>glob_wildcards</code>.</p>
<h2>Examples of loading lists of accessions from files or directories</h2>
<h3>Loading a list of accessions from a text file</h3>
<p>If you have a simple list of accessions in a text file
<code>accessions.txt</code>, like so:</p>
<p>File <code>accessions.txt</code>:</p>
<div class="highlight"><pre><span></span><code>GCF_000017325.1
GCF_000020225.1
GCF_000021665.1
GCF_008423265.1
</code></pre></div>
<p>then the following code will load each line of the text file as a separate ID.</p>
<p>Snakefile to load <code>accessions.txt</code>:</p>
<div class="highlight"><pre><span></span><code><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'accessions.txt'</span><span class="p">,</span> <span class="s1">'rt'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fp</span><span class="p">:</span>
<span class="n">ACCESSIONS</span> <span class="o">=</span> <span class="n">fp</span><span class="o">.</span><span class="n">readlines</span><span class="p">()</span>
<span class="n">ACCESSIONS</span> <span class="o">=</span> <span class="p">[</span> <span class="n">line</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">ACCESSIONS</span> <span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'ACCESSIONS is a Python list of length </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">ACCESSIONS</span><span class="p">)</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">ACCESSIONS</span><span class="p">)</span>
<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"</span><span class="si">{acc}</span><span class="s2">.sig"</span><span class="p">,</span> <span class="n">acc</span><span class="o">=</span><span class="n">ACCESSIONS</span><span class="p">)</span>
<span class="n">rule</span> <span class="n">sketch_genome</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/</span><span class="si">{accession}</span><span class="s2">.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{accession}</span><span class="s2">.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> --name-from-first -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>The <code>sketch_genome</code> rule then builds a sourmash signature for each accession.</p>
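<p>One small refinement, in case your <code>accessions.txt</code> ever contains blank lines or stray whitespace: a slightly more defensive version of the loading code (same result for a clean file) skips empty lines entirely.</p>
<div class="highlight"><pre><span></span><code>with open('accessions.txt', 'rt') as fp:
    # strip whitespace and drop any blank lines
    ACCESSIONS = [ line.strip() for line in fp if line.strip() ]
</code></pre></div>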
<h3>Loading a specific column from a CSV file</h3>
<p>If instead of a text file you have a CSV file with multiple columns,
and the IDs to load are all in one column, you can use the Python
<a href="https://pandas.pydata.org/">pandas library</a> to read in the CSV. In
the code below, <code>pandas.read_csv</code> loads the CSV into a pandas
DataFrame object, and then we select the <code>accession</code> column and use
that as an iterable.</p>
<p>File <code>accessions.csv</code>:</p>
<div class="highlight"><pre><span></span><code>accession,information
GCF_000017325.1,genome 1
GCF_000020225.1,genome 2
GCF_000021665.1,genome 3
GCF_008423265.1,genome 4
</code></pre></div>
<p>Snakefile to load <code>accessions.csv</code>:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span>
<span class="n">CSV_DATAFRAME</span> <span class="o">=</span> <span class="n">pandas</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'accessions.csv'</span><span class="p">)</span>
<span class="n">ACCESSIONS</span> <span class="o">=</span> <span class="n">CSV_DATAFRAME</span><span class="p">[</span><span class="s1">'accession'</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'ACCESSIONS is a pandas Series of length </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">ACCESSIONS</span><span class="p">)</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">ACCESSIONS</span><span class="p">)</span>
<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"</span><span class="si">{acc}</span><span class="s2">.sig"</span><span class="p">,</span> <span class="n">acc</span><span class="o">=</span><span class="n">ACCESSIONS</span><span class="p">)</span>
<span class="n">rule</span> <span class="n">sketch_genome</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/</span><span class="si">{accession}</span><span class="s2">.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{accession}</span><span class="s2">.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> --name-from-first -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<h3>Loading from the config file</h3>
<p>Snakemake also supports the use of configuration files, where the
snakefile supplies the name of a default config file (which can in
turn be overridden on the command line).</p>
<p>A config file can also be a good place to put accessions. Consider:</p>
<div class="highlight"><pre><span></span><code><span class="nt">accessions</span><span class="p">:</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">GCF_000017325.1</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">GCF_000020225.1</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">GCF_000021665.1</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">GCF_008423265.1</span>
</code></pre></div>
<p>which is used by the following Snakefile.</p>
<p>Snakefile to load accessions from <code>config.yml</code>:</p>
<div class="highlight"><pre><span></span><code><span class="n">configfile</span><span class="p">:</span> <span class="s2">"config.yml"</span>
<span class="n">ACCESSIONS</span> <span class="o">=</span> <span class="n">config</span><span class="p">[</span><span class="s1">'accessions'</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'ACCESSIONS is a Python list of length </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">ACCESSIONS</span><span class="p">)</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">ACCESSIONS</span><span class="p">)</span>
<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"</span><span class="si">{acc}</span><span class="s2">.sig"</span><span class="p">,</span> <span class="n">acc</span><span class="o">=</span><span class="n">ACCESSIONS</span><span class="p">)</span>
<span class="n">rule</span> <span class="n">sketch_genome</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/</span><span class="si">{accession}</span><span class="s2">.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{accession}</span><span class="s2">.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> --name-from-first -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Here, <code>config.yml</code> is a <a href="https://en.wikipedia.org/wiki/YAML">YAML file</a>,
which is a human-readable format that can also be read by computers.
We will talk about config files later!</p>
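<p>As a preview, the config file named by <code>configfile:</code> can typically be swapped out at run time with snakemake's <code>--configfile</code> option - for example:</p>
<div class="highlight"><pre><span></span><code>snakemake -j 1 --configfile other-config.yml
</code></pre></div>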
<h3>Using <code>glob_wildcards</code> to load IDs or accessions from a set of files</h3>
<p>We introduced the <code>glob_wildcards</code> command briefly in the
<a href="https://ivory.idyll.org/blog/2023-snakemake-slithering-wildcards.html">post on wildcards</a>: <code>glob_wildcards</code> does pattern matching on
files <em>actually present in the directory</em>. </p>
<p>Here's a Snakefile that uses <code>glob_wildcards</code> to get the four accessions
from the actual filenames:</p>
<div class="highlight"><pre><span></span><code><span class="n">GLOB_RESULTS</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s2">"genomes/</span><span class="si">{acc}</span><span class="s2">.fna.gz"</span><span class="p">)</span>
<span class="n">ACCESSIONS</span> <span class="o">=</span> <span class="n">GLOB_RESULTS</span><span class="o">.</span><span class="n">acc</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'ACCESSIONS is a Python list of length </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">ACCESSIONS</span><span class="p">)</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">ACCESSIONS</span><span class="p">)</span>
<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"</span><span class="si">{acc}</span><span class="s2">.sig"</span><span class="p">,</span> <span class="n">acc</span><span class="o">=</span><span class="n">ACCESSIONS</span><span class="p">)</span>
<span class="n">rule</span> <span class="n">sketch_genome</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/</span><span class="si">{accession}</span><span class="s2">.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{accession}</span><span class="s2">.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> --name-from-first -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>This is a particularly convenient way to get a list of accessions,
but it can also be dangerous: it is easy to accidentally delete a file
and then not notice that a sample is missing! For that reason, in many
situations we suggest providing an independent list of files to load.</p>
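<p>If you do use <code>glob_wildcards</code>, one lightweight guard (our suggestion, not required by snakemake) is to fail loudly when the glob finds nothing at all:</p>
<div class="highlight"><pre><span></span><code>GLOB_RESULTS = glob_wildcards("genomes/{acc}.fna.gz")
ACCESSIONS = GLOB_RESULTS.acc

# fail early if the glob matched nothing - e.g. wrong working directory
assert len(ACCESSIONS) > 0, "no genomes found in genomes/ - check your path!"
</code></pre></div>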
<h2>Wildcards and <code>expand</code> - some closing thoughts</h2>
<p>Combined with wildcards, <code>expand</code> is extremely powerful and useful.
Just like wildcards, however, this power comes with some complexity.
Here is a brief rundown of how these features combine.</p>
<p>The <code>expand</code> function makes a <em>list of files to create</em> from a pattern and
a list of values to fill in.</p>
<p>Wildcards in rules provide <em>recipes</em> to create files whose names match a
pattern.</p>
<p>Typically in Snakefiles we use <code>expand</code> to generate a list of files that
match a certain pattern, and then write a rule that uses wildcards to
generate those actual files.</p>
<p>The list of values to use with <code>expand</code> can come from many places, including
text files, CSV files, and config files. It can <em>also</em> come from
<code>glob_wildcards</code>, which uses a pattern to <em>extract</em> the list of values from
files that are actually present.</p>
<h2>Links and references</h2>
<ul>
<li><a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#the-expand-function">snakemake reference documentation for expand</a></li>
<li>The <a href="https://docs.python.org/3/library/itertools.html">Python <code>itertools</code></a> documentation.</li>
</ul>
<h1>snakemake for doing bioinformatics - using wildcards to generalize your rules (2023-03-03, C. Titus Brown)</h1>
<p>Slithering your way into bioinformatics with snakemake, wildcard version</p>
<p>As we showed <a href="http://ivory.idyll.org/blog/2023-snakemake-slithering-section-2.html">in a previous blog post</a>,
when you have repeated
substrings between input and output, you can extract them into
wildcards - going from a rule that makes specific outputs:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes_1</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/GCF_000017325.1.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
</code></pre></div>
<p>to a rule that makes any output that fits a pattern:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes_1</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/</span><span class="si">{accession}</span><span class="s2">.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{accession}</span><span class="s2">.fna.gz.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Here, <code>{accession}</code> is a wildcard that "fills in" as needed for any filename
that is under the <code>genomes/</code> directory and ends with <code>.fna.gz</code>.</p>
<p>Snakemake uses simple <em>pattern matching</em> to determine the value of
<code>{accession}</code> - if asked for a filename ending in <code>.fna.gz.sig</code>, snakemake
takes the prefix, and then looks for the matching input file
<code>genomes/{accession}.fna.gz</code>, and fills in <code>{input}</code> accordingly.</p>
<p>This is incredibly useful and means that in many cases you can write
a single rule that can generate hundreds or thousands of files!</p>
<p>However, there are a few subtleties to consider. In this
chapter, we're going to cover the most important of those subtleties, and
provide links where you can learn more.</p>
<h2>Rules for wildcards</h2>
<p>First, let's go through some basic rules for wildcards.</p>
<h3>Wildcards are determined by the desired output</h3>
<p>The first and most important rule of wildcards is this: snakemake
fills in wildcard values based on the filename it is asked to produce.</p>
<p>Consider the following rule:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">a</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{prefix}</span><span class="s2">.a.out"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"touch </span><span class="si">{output}</span><span class="s2">"</span>
</code></pre></div>
<p>The wildcard in the output block will match <em>any</em> file that ends with
<code>.a.out</code>, and the associated shell command will create it! This is both
powerful and constraining: you can create any file with the suffix
<code>.a.out</code> - but you also need to <em>ask</em> for the file to be created.</p>
<p>This means that in order to make use of this rule, there needs to be
another rule that has a file that ends in <code>.a.out</code> as a required input.
(You can also explicitly ask for such a file on the command line.)
There's otherwise no way for snakemake to determine the
value of the wildcard: snakemake follows the dictum that explicit is
better than implicit, and it will not guess at what files you want created.</p>
<p>For example, the above rule could be paired with another rule that asks
for one or more filenames ending in <code>.a.out</code>:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">make_me_a_file</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"result1.a.out"</span><span class="p">,</span>
<span class="s2">"result2.a.out"</span><span class="p">,</span>
</code></pre></div>
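<p>As mentioned above, you can also ask for such a file explicitly on the
command line - a minimal example, where the target filename is illustrative:</p>
<div class="highlight"><pre><span></span><code>snakemake -j 1 result1.a.out
</code></pre></div>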
<p>This also means that once you put a wildcard in a
rule, you can no longer run that rule by the rule name - you have to
ask for a filename, instead. If you try to run a rule that contains a
wildcard but don't tell it what filename you want to create, you'll get:</p>
<div class="highlight"><pre><span></span><code>Target rules may not contain wildcards.
</code></pre></div>
<p>One common way to work with wildcard rules is to have another rule that
uses <code>expand</code> to construct a list of desired files; this is often paired
with <code>glob_wildcards</code> to load a list of wildcard values. See the recipe for
renaming files by prefix, below.</p>
<h3>All wildcards used in a rule must appear in the <code>output:</code> block</h3>
<p>snakemake uses the wildcards in the <code>output:</code> block to fill in the wildcards
elsewhere in the rule, so you can only use wildcards mentioned in <code>output:</code>.</p>
<p>So, for example, every wildcard in the <code>input:</code> block needs to be used
in <code>output:</code>. Consider the following example, where the input block
contains a wildcard <code>analysis</code> that is not used in the output block:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># this does not work:</span>
<span class="n">rule</span> <span class="n">analyze_sample</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"</span><span class="si">{sample}</span><span class="s2">.x.</span><span class="si">{analysis}</span><span class="s2">.in"</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{sample}</span><span class="s2">.out"</span>
</code></pre></div>
<p>This doesn't work because snakemake doesn't know how to fill in the
<code>analysis</code> wildcard in the <em>input</em> block.</p>
<p>Think about it this way: if this worked, there would be multiple
different input files for the same output, and snakemake would
have no way to choose which input file to use.</p>
<p>There are situations where wildcards in the <code>output:</code> block do <em>not</em>
need to be in the <code>input:</code> block, however - see "Using wildcards to
determine parameters to use in the shell block", below.</p>
<h3>Wildcards are local to each rule</h3>
<p>Wildcard names only need to match <em>within</em> a rule block. You can use the same
wildcard names in multiple rules for consistency and readability, but
snakemake will treat them as independent wildcards, and wildcard values
will not be shared.</p>
<p>So, for example, these two rules use the same wildcard <code>a</code> in both rules -</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">analyze_this</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"</span><span class="si">{a}</span><span class="s2">.first.txt"</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{a}</span><span class="s2">.second.txt"</span>
<span class="n">rule</span> <span class="n">analyze_that</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"</span><span class="si">{a}</span><span class="s2">.second.txt"</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{a}</span><span class="s2">.third.txt"</span>
</code></pre></div>
<p>but this is equivalent to these next two rules, which use <em>different</em>
wildcards <code>a</code> and <code>b</code> in the separate rules:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">analyze_this</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"</span><span class="si">{a}</span><span class="s2">.first.txt"</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{a}</span><span class="s2">.second.txt"</span>
<span class="n">rule</span> <span class="n">analyze_that</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"</span><span class="si">{b}</span><span class="s2">.second.txt"</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{b}</span><span class="s2">.third.txt"</span>
</code></pre></div>
<p>There is an exception to the rule that wildcards are independent:
when you use global wildcard constraints to
constrain wildcard matching by wildcard name, the constraints
apply across all uses of that wildcard name in the Snakefile.
However, the <em>values</em> of the wildcards remain independent - it's just
the constraint that is shared.</p>
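<p>For example, a global <code>wildcard_constraints:</code> block like the following
(a minimal sketch - the regular expression here is illustrative) constrains
every use of <code>{sample}</code>, in every rule:</p>
<div class="highlight"><pre><span></span><code># constrain {sample} across all rules in this Snakefile
wildcard_constraints:
    sample = "[A-Za-z0-9_]+"
</code></pre></div>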
<!-- CTB: fix link to point directly to global wildcard constraints. -->
<p>While wildcards are independent in values, it is a good convention to
choose wildcards to have the same semantic meaning across the
Snakefile - e.g. always use <code>sample</code> consistently to refer to a
sample. This makes reading the Snakefile easier!</p>
<p>One interesting addendum: because wildcards are local to each rule, you
are free to match different parts of patterns in different rules!
See "Mixing and matching wildcards", below.</p>
<h3>The wildcard namespace is implicitly available in <code>input:</code> and <code>output:</code> blocks, but not in other blocks</h3>
<p>Within the <code>input:</code> and <code>output:</code> blocks in a rule, you can refer to
wildcards directly by name. If you want to use wildcards in other
parts of a rule you need to use the <code>wildcards.</code> prefix. Here,
<code>wildcards</code> is a <em>namespace</em>, which we will talk about more later.</p>
<p>Consider this Snakefile:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># this does not work:</span>
<span class="n">rule</span> <span class="n">analyze_this</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"</span><span class="si">{a}</span><span class="s2">.first.txt"</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{a}</span><span class="s2">.second.txt"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"analyze </span><span class="si">{input}</span><span class="s2"> -o </span><span class="si">{output}</span><span class="s2"> --title </span><span class="si">{a}</span><span class="s2">"</span>
</code></pre></div>
<p>Here you will get an error,</p>
<div class="highlight"><pre><span></span><code><span class="n">NameError</span><span class="o">:</span><span class="w"> </span><span class="n">The</span><span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="s1">'a'</span><span class="w"> </span><span class="k">is</span><span class="w"> </span><span class="n">unknown</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="k">this</span><span class="w"> </span><span class="n">context</span><span class="o">.</span><span class="w"> </span><span class="n">Did</span><span class="w"> </span><span class="n">you</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="s1">'wildcards.a'</span><span class="o">?</span>
</code></pre></div>
<p>As the error suggests, you need to use <code>wildcards.a</code> in
the shell block instead:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">analyze_this</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"</span><span class="si">{a}</span><span class="s2">.first.txt"</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{a}</span><span class="s2">.second.txt"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"analyze </span><span class="si">{input}</span><span class="s2"> -o </span><span class="si">{output}</span><span class="s2"> --title </span><span class="si">{wildcards.a}</span><span class="s2">"</span>
</code></pre></div>
<h3>Wildcards match greedily, unless constrained</h3>
<p>Wildcard pattern matching chooses the <em>longest possible</em> match to
<em>any</em> characters, which can result in slightly confusing
behavior. Consider:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"x.y.z.gz"</span>
<span class="n">rule</span> <span class="n">something</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"</span><span class="si">{prefix}</span><span class="s2">.</span><span class="si">{suffix}</span><span class="s2">.txt"</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{prefix}</span><span class="s2">.</span><span class="si">{suffix}</span><span class="s2">.gz"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"gzip -c </span><span class="si">{input}</span><span class="s2"> > </span><span class="si">{output}</span><span class="s2">"</span>
</code></pre></div>
<p>In the <code>something</code> rule, for the desired output file <code>x.y.z.gz</code>,
<code>{prefix}</code> will be <code>x.y</code> and <code>{suffix}</code> will be <code>z</code>, because
<code>{prefix}</code> matches as much as it can. But it would be equally valid
for <code>{prefix}</code> to be <code>x</code> and <code>{suffix}</code> to be <code>y.z</code>.</p>
<p>A more extreme example shows the greedy matching even more clearly:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"longer_filename.gz"</span>
<span class="n">rule</span> <span class="n">something</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"</span><span class="si">{prefix}{suffix}</span><span class="s2">.txt"</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"</span><span class="si">{prefix}{suffix}</span><span class="s2">.gz"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"gzip -c </span><span class="si">{input}</span><span class="s2"> > </span><span class="si">{output}</span><span class="s2">"</span>
</code></pre></div>
<p>Here, <code>{suffix}</code> is reduced down to a single character, <code>e</code>, while
<code>{prefix}</code> matches everything else: <code>longer_filenam</code>!</p>
<p>Two simple rules for wildcard matching are:</p>
<ul>
<li>all wildcards must match at least one character.</li>
<li>after that, wildcards will match greedily: each wildcard will match everything it can before the next wildcard is considered.</li>
</ul>
<p>Therefore, it's good practice to use
wildcard constraints to limit
wildcard matching. See "Constraining wildcards to avoid
subdirectories and/or periods", below, for some examples.</p>
<h2>Some examples of wildcards</h2>
<h3>Running one rule on many files</h3>
<p>Wildcards can be used to run the same simple rule on many files - this is
one of the simplest and most powerful uses for snakemake!</p>
<p>Consider this Snakefile for compressing many files:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"compressed/F3D141_S207_L001_R1_001.fastq.gz"</span><span class="p">,</span>
<span class="s2">"compressed/F3D141_S207_L001_R2_001.fastq.gz"</span><span class="p">,</span>
<span class="s2">"compressed/F3D142_S208_L001_R1_001.fastq.gz"</span><span class="p">,</span>
<span class="s2">"compressed/F3D142_S208_L001_R2_001.fastq.gz"</span>
<span class="n">rule</span> <span class="n">gzip_file</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"original/</span><span class="si">{filename}</span><span class="s2">"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compressed/</span><span class="si">{filename}</span><span class="s2">.gz"</span>
<span class="n">shell</span><span class="p">:</span>
<span class="s2">"gzip -c </span><span class="si">{input}</span><span class="s2"> > </span><span class="si">{output}</span><span class="s2">"</span>
</code></pre></div>
<p>This Snakefile specifies a list of compressed files that it wants produced,
and relies on wildcards to do the pattern matching required to find the
input files and fill in the shell block.</p>
<p>That having been said, this Snakefile is inconvenient to write and is
somewhat error-prone:</p>
<ul>
<li>writing out the files individually is annoying if you have many of them!</li>
<li>to generate the list of files, you have to hand-rename them, which is
error-prone!</li>
</ul>
<p>Snakemake provides several features that can help with these issues. You
can load the list of files from a text file or spreadsheet, or get the
list directly from the directory using <code>glob_wildcards</code>; and you can
use <code>expand</code> to rename them in bulk. Read on for some examples!</p>
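<p>As a sketch of the first option - assuming a file <code>filenames.txt</code>
with one filename per line - remember that Snakefiles are Python, so you
can load the list directly:</p>
<div class="highlight"><pre><span></span><code># load the list of files to compress, one filename per line
with open('filenames.txt') as fp:
    FILENAMES = [ line.strip() for line in fp ]

rule all:
    input:
        expand("compressed/{filename}.gz", filename=FILENAMES)
</code></pre></div>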
<h4>Why use snakemake here?</h4>
<p>It is possible to accomplish the same task by using <code>gzip -k original/*</code>,
although you'd have to move the files into their final location, too.</p>
<p>How is using <code>gzip -k original/*</code> different from using snakemake? And
is it better?</p>
<p>First, the results aren't different - both approaches compress the
set of input files, which is what you want! But the <code>gzip -k</code> command
runs in <em>serial</em>: gzip compresses one file at a time, one after the
other. The Snakefile will run the rule <code>gzip_file</code> <em>in parallel</em>,
using as many processors as you specify with <code>-j</code>. That means that if
you had many, many such files - a common problem in bioinformatics! -
the snakemake version could run many times faster.</p>
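<p>For example, to run up to eight jobs at once with the Snakefile above:</p>
<div class="highlight"><pre><span></span><code>snakemake -j 8
</code></pre></div>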
<p>Second, specifying many files on the command line with <code>gzip -k
original/*</code> works with <code>gzip</code> but not with every shell command. Some
commands only run on one file at a time; <code>gzip</code> just happens to work
whether you give it one or many files. Many other programs do not work
on multiple input files; e.g. the <code>fastp</code> program for preprocessing
FASTQ files runs on one dataset at a time. (It's also worth
mentioning that snakemake gives you a way to flexibly write custom
command lines; more on that later.)</p>
<p>Third, in the Snakefile we are being explicit about which files we
expect to exist after the rules are run, while if we just ran <code>gzip -k
original/*</code> we are asking the shell to compress every file in
<code>original/</code>. If we accidentally deleted a file in the <code>original</code>
subdirectory, then gzip would not know about it and would not
complain - but snakemake would. This is a theme that will come up
repeatedly - it's often safer to be really explicit about what files
you expect, so that you can be alerted to possible mistakes.</p>
<p>And, fourth, the Snakefile approach will let you rename the output
files in interesting ways - with <code>gzip -k original/*</code>, you're stuck
with the original filenames. This is a feature we will explore in the
next subsection!</p>
<h3>Renaming files by prefix using <code>glob_wildcards</code></h3>
<p>Consider a set of files named like so:</p>
<div class="highlight"><pre><span></span><code>F3D141_S207_L001_R1_001.fastq
F3D141_S207_L001_R2_001.fastq
</code></pre></div>
<p>within the <code>original/</code> subdirectory.</p>
<p>Now suppose you want to rename them all to get rid of the <code>_001</code> suffix
before <code>.fastq</code>. This is very easy with wildcards!</p>
<p>The below Snakefile uses <code>glob_wildcards</code> to load in a list of files from
a directory and then make a copy of them with the new name under the
<code>renamed/</code> subdirectory. Here, <code>glob_wildcards</code> extracts the <code>{sample}</code>
pattern <em>from</em> the set of available files in the directory:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># first, find matches to filenames of this form:</span>
<span class="n">files</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s2">"original/</span><span class="si">{sample}</span><span class="s2">_001.fastq"</span><span class="p">)</span>
<span class="c1"># next, specify the form of the name you want:</span>
<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"renamed/</span><span class="si">{sample}</span><span class="s2">.fastq"</span><span class="p">,</span> <span class="n">sample</span><span class="o">=</span><span class="n">files</span><span class="o">.</span><span class="n">sample</span><span class="p">)</span>
<span class="c1"># finally, give snakemake a recipe for going from inputs to outputs.</span>
<span class="n">rule</span> <span class="n">rename</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"original/</span><span class="si">{sample}</span><span class="s2">_001.fastq"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"renamed/</span><span class="si">{sample}</span><span class="s2">.fastq"</span>
<span class="n">shell</span><span class="p">:</span>
<span class="s2">"cp </span><span class="si">{input}</span><span class="s2"> </span><span class="si">{output}</span><span class="s2">"</span>
</code></pre></div>
<p>This Snakefile also makes use of <code>expand</code> to rewrite the loaded list
into the desired set of filenames. This means that we no
longer have to write out the list of files ourselves - we can let
snakemake do it with <code>expand</code>!</p>
<p>Note that here you could do a <code>mv</code> instead of a <code>cp</code> - but then the
original files would be gone after the first run, and <code>glob_wildcards</code>
would find nothing to match the next time you ran snakemake.</p>
<p>This Snakefile loads the list of files from the directory itself,
which means that if an input file is accidentally deleted, snakemake
won't complain. When renaming files, this is unlikely to cause
problems; however, when running workflows, we recommend loading the
list of samples from a text file or spreadsheet to avoid problems.</p>
<!-- (CTB point to a recipe). -->
<p>Also note that this Snakefile will find and rename all files in
<code>original/</code> as well as any subdirectories! This is because
<code>glob_wildcards</code> by default includes all subdirectories. See
the next section below to see how to use wildcard constraints to
prevent loading from subdirectories.</p>
<h3>Constraining wildcards to avoid subdirectories and/or periods</h3>
<p>Wildcards match to any string, including '/', and so <code>glob_wildcards</code>
will automatically find files in subdirectories and will also "stretch
out" to match common delimiters in filenames such as '.' and '-'. This
is commonly referred to as "greedy matching" and it means that
sometimes your wildcards will match to far more of a filename than you
want! You can limit wildcard matches using wildcard constraints.</p>
<p>Two common wildcard constraints are shown below, separately and in
combination. The first constraint avoids files in subdirectories, and
the second constraint avoids periods.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># match all .txt files - no constraints</span>
<span class="n">all_files</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s2">"</span><span class="si">{filename}</span><span class="s2">.txt"</span><span class="p">)</span><span class="o">.</span><span class="n">filename</span>
<span class="c1"># match all .txt files in this directory only - avoid /</span>
<span class="n">this_dir_files</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s2">"{filename,[^/]+}.txt"</span><span class="p">)</span><span class="o">.</span><span class="n">filename</span>
<span class="c1"># match all files with only a single period in their name - avoid .</span>
<span class="n">prefix_only</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s2">"{filename,[^.]+}.txt"</span><span class="p">)</span><span class="o">.</span><span class="n">filename</span>
<span class="c1"># match all files in this directory with only a single period in their name</span>
<span class="c1"># avoid / and .</span>
<span class="n">prefix_and_dir_only</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s2">"{filename,[^./]+}.txt"</span><span class="p">)</span><span class="o">.</span><span class="n">filename</span>
</code></pre></div>
<p>Check out wildcard constraints for more information and details.</p>
<h2>Advanced wildcard examples</h2>
<h3>Renaming files using multiple wildcards</h3>
<p>The first renaming example above works really well when you want to change just
the suffix of a file and can use a single wildcard, but if you want to
do more complicated renaming you may have to use multiple wildcards.</p>
<p>Consider the situation where you want to rename files from the form of
<code>F3D141_S207_L001_R1_001.fastq</code> to <code>F3D141_S207_R1.fastq</code>. You can't
do that with a single wildcard, unfortunately - but you can use two,
like so:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># first, find matches to filenames of this form:</span>
<span class="n">files</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s2">"original/</span><span class="si">{sample}</span><span class="s2">_L001_</span><span class="si">{r}</span><span class="s2">_001.fastq"</span><span class="p">)</span>
<span class="c1"># next, specify the form of the name you want:</span>
<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"renamed/</span><span class="si">{sample}</span><span class="s2">_</span><span class="si">{r}</span><span class="s2">.fastq"</span><span class="p">,</span> <span class="nb">zip</span><span class="p">,</span>
<span class="n">sample</span><span class="o">=</span><span class="n">files</span><span class="o">.</span><span class="n">sample</span><span class="p">,</span> <span class="n">r</span><span class="o">=</span><span class="n">files</span><span class="o">.</span><span class="n">r</span><span class="p">)</span>
<span class="c1"># finally, give snakemake a recipe for going from inputs to outputs.</span>
<span class="n">rule</span> <span class="n">rename</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"original/</span><span class="si">{sample}</span><span class="s2">_L001_</span><span class="si">{r}</span><span class="s2">_001.fastq"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"renamed/</span><span class="si">{sample}</span><span class="s2">_</span><span class="si">{r}</span><span class="s2">.fastq"</span>
<span class="n">shell</span><span class="p">:</span>
<span class="s2">"cp </span><span class="si">{input}</span><span class="s2"> </span><span class="si">{output}</span><span class="s2">"</span>
</code></pre></div>
<p>We're making use of three new features in this code:</p>
<p>First, <code>glob_wildcards</code> is matching multiple wildcards, and
puts the resulting values into a single result variable (here, <code>files</code>).</p>
<p>Second, the matching values are placed in two ordered lists,
<code>files.sample</code> and <code>files.r</code>, such that values extracted from file names
match in pairs.</p>
<p>Third, when we use <code>expand</code>, we're asking it to "zip" the two lists of
wildcards together, rather than the default, which is to make all
possible combinations with <code>product</code>.</p>
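<p>To make the difference concrete, here is what <code>expand</code> produces with
and without <code>zip</code> (the values are illustrative):</p>
<div class="highlight"><pre><span></span><code># with zip - pair the two lists element by element:
expand("renamed/{sample}_{r}.fastq", zip,
       sample=["F3D141_S207", "F3D142_S208"], r=["R1", "R2"])
# => ["renamed/F3D141_S207_R1.fastq", "renamed/F3D142_S208_R2.fastq"]

# default - all possible combinations (product):
expand("renamed/{sample}_{r}.fastq",
       sample=["F3D141_S207", "F3D142_S208"], r=["R1", "R2"])
# => ["renamed/F3D141_S207_R1.fastq", "renamed/F3D141_S207_R2.fastq",
#     "renamed/F3D142_S208_R1.fastq", "renamed/F3D142_S208_R2.fastq"]
</code></pre></div>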
<p>Also - as with the previous example, this Snakefile will find and
rename all files in <code>original/</code> as well as any subdirectories!</p>
<p>Links:</p>
<ul>
<li><a href="https://snakemake.readthedocs.io/en/stable/project_info/faq.html#i-don-t-want-expand-to-use-the-product-of-every-wildcard-what-can-i-do">snakemake documentation on using zip instead of product</a></li>
</ul>
<h3>Mixing and matching strings</h3>
<p>A somewhat nonintuitive (but also very useful) consequence of wildcards
being local to rules is that you can do clever string matching to mix and
match generic rules with more specific rules.</p>
<p>Consider this Snakefile, in which we are mapping reads from multiple
samples to multiple references (rule <code>map_reads_to_reference</code>) as well
as converting SAM to BAM files:</p>
<!-- CTB: transfer to functional Snakefile? -->
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"sample1.x.ecoli.bam"</span><span class="p">,</span>
<span class="s2">"sample2.x.shewanella.bam"</span><span class="p">,</span>
<span class="s2">"sample1.x.shewanella.bam"</span>
<span class="n">rule</span> <span class="n">map_reads_to_reference</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">reads</span><span class="o">=</span><span class="s2">"</span><span class="si">{sample}</span><span class="s2">.fq"</span><span class="p">,</span>
<span class="n">reference</span><span class="o">=</span><span class="s2">"</span><span class="si">{genome}</span><span class="s2">.fa"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{reads}</span><span class="s2">.x.</span><span class="si">{reference}</span><span class="s2">.sam"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"minimap2 -ax sr </span><span class="si">{input.reference}</span><span class="s2"> </span><span class="si">{input.reads}</span><span class="s2"> > </span><span class="si">{output}</span><span class="s2">"</span>
<span class="n">rule</span> <span class="n">convert_sam_to_bam</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{filename}</span><span class="s2">.sam"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{filename}</span><span class="s2">.bam"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"samtools view -b </span><span class="si">{input}</span><span class="s2"> -o </span><span class="si">{output}</span>
</code></pre></div>
<p>Here, snakemake is happily using different wildcards in each rule, and
matching them to different parts of the pattern! So,</p>
<ul>
<li>
<p>Rule <code>convert_sam_to_bam</code> will generically convert any SAM file to a BAM
file based solely on the <code>.bam</code> and <code>.sam</code> suffixes.</p>
</li>
<li>
<p>However, <code>map_reads_to_reference</code> will only produce mapping files that
match the pattern of <code>{sample}.x.{reference}</code>, which in turn depend on the
existence of <code>{sample}.fq</code> and <code>{reference}.fa</code>.</p>
</li>
</ul>
<p>This works because, ultimately, snakemake is just matching strings
and does not "know" anything about the structure of the strings that
it's matching. And it also doesn't remember wildcards across rules. So
snakemake will happily match one set of wildcards in one rule, and a
different set of wildcards in another rule!</p>
<h3>Using wildcards to determine parameters to use in the shell block</h3>
<p>You can also use wildcards to build rules that produce output files
where the parameters used to <em>generate</em> the contents are based on the
filename; for example, consider this example of generating subsets
of FASTQ files:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"big.subset100.fastq"</span>
<span class="n">rule</span> <span class="n">subset</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"big.fastq"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"big.subset</span><span class="si">{num_lines}</span><span class="s2">.fastq"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> head -</span><span class="si">{wildcards.num_lines}</span><span class="s2"> </span><span class="si">{input}</span><span class="s2"> > </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Here, the wildcard is <em>only</em> in the output filename, not in the
input filename. The wildcard value is used by snakemake to determine
how to fill in the number of lines for <code>head</code> to select from the file!</p>
<p>This can be really useful for generating files with many different
parameters to a particular shell command - "parameter sweeps". More
about this later.</p>
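<p>Here's a brief sketch of what such a parameter sweep might look like,
reusing the <code>subset</code> rule above (the line counts are illustrative):</p>
<div class="highlight"><pre><span></span><code>NUM_LINES = [100, 1000, 10000]

rule all_subsets:
    input:
        expand("big.subset{num_lines}.fastq", num_lines=NUM_LINES)
</code></pre></div>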
<!-- See CTB XXX.
CTB link to:
* params functions, params lambda?
* parameter sweeps with this and expand
-->
<h2>How to think about wildcards</h2>
<p>Wildcards (together with <code>expand</code> and <code>glob_wildcards</code>) are perhaps
the single most powerful feature in snakemake: they permit generic
application of rules to an arbitrary number of files, based entirely
on simple patterns.</p>
<p>However, with that power comes quite a bit of complexity!</p>
<p>Ultimately, wildcards are all about <em>strings</em> and <em>patterns</em>.
Snakemake is using pattern matching to extract patterns from the
desired output files, and then filling those matches in elsewhere in
the rule. Most of the ensuing complexity comes from avoiding ambiguity in
matching and filling in patterns, along with the paired challenge of
constructing all the names of the files you actually want to create.</p>
<h2>Additional references</h2>
<p>See also: the
<a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-wildcards">snakemake docs on wildcards</a>.</p>conda & mamba on shared clusters works better now!2023-02-09T00:00:00+01:002023-02-09T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2023-02-09:/blog/2023-conda-mamba-shared.html<p>conda is great!</p><p>Friends! Countrymen! I bring you good tidings! The <a href="https://github.com/mamba-org/mamba/issues/488#issuecomment-1400575225">bug is dead!</a> Long live conda/mamba on shared clusters!</p>
<p>OK, wait. Let's back up. What's this bug, and why does it matter that it's fixed?</p>
<p>It all starts with teaching...</p>
<h2>conda is, like, the best for teaching bioinformatics!!</h2>
<p>I've been teaching bioinformatics using conda for about 5 years now. Not only do I straight up <a href="https://hackmd.io/VTcCz9dmSf6vclaHRwavlw?view">teach conda/mamba</a> but I also use it extensively in my Intro Bioinformatics hands-on lab for graduate students, where I teach <a href="https://github.com/ngs-docs/2023-ggg-201b-lab/blob/main/lab-1.md">variant calling</a>, de novo assembly, and RNAseq.</p>
<p>Mostly I teach on a shared cluster, the 'farm' HPC, because that's where many of the students will be doing their research.</p>
<p>And I teach conda (and mamba) for a few reasons:</p>
<ul>
<li>it works!</li>
<li>you don't need admin privileges to install specific versions of your software!</li>
<li>most bioinformatics command-line software is available via conda!</li>
<li>many (most?) Python packages <em>and</em> many (most?) R packages are available from conda-forge or bioconda!</li>
<li>and, most recently, one of our admins, Camille Scott, got RStudio Server working so that it loads R and R packages from conda environments!</li>
</ul>
<p>So, basically, conda is a full solution for students to take and use <em>after</em> my class is over.</p>
<h2>My teaching setup for conda</h2>
<p>I teach using a bunch of accounts specifically created for the course. These accounts are set up so that I have ssh access into them, which is really important; and they have specific queue access. It all works really well! Well, mostly.</p>
<p>Things that work out of the box: software installed with conda. Yay!</p>
<p>Things that don't work out of the box: 30 students simultaneously downloading the same packages from conda-forge.</p>
<p>This is because 30 students downloading 500 MB of packages from the same remote Web site is slow ;).</p>
<p>The thing is, it's not really necessary for everyone to download the packages - most of the time, students are only downloading packages all at the same time during class, and they're all downloading the <em>same</em> packages. We should be able to cache them!</p>
<p>So I've set up the accounts with a central cache. Read on...</p>
<h2>Using a central package cache for a bunch of accounts</h2>
<p>It's actually pretty straightforward to set up; there are two components: a <a href="https://github.com/ngs-docs/shared-conda-on-farm/blob/main/condarc">condarc file</a>,</p>
<div class="highlight"><pre><span></span><code><span class="nt">pkgs_dirs</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">~/.conda/pkgs</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/home/ctbrown/remote-computing.cache</span>
</code></pre></div>
<p>that specifies a package cache directory that's shared; and an <a href="https://github.com/ngs-docs/shared-conda-on-farm/blob/main/install-mambaforge.sh">install script</a> that I run in each "child" account that installs and configures conda to use the shared cache:</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span><span class="nb">cd</span><span class="w"> </span>~/
$<span class="w"> </span>mkdir<span class="w"> </span>-p<span class="w"> </span>~/.conda/pkgs
$<span class="w"> </span>cp<span class="w"> </span>~ctbrown/shared-conda-on-farm/condarc<span class="w"> </span>~/.condarc
$<span class="w"> </span>bash<span class="w"> </span>~ctbrown/shared-conda-on-farm/Mambaforge-Linux-x86_64.sh<span class="w"> </span>-b<span class="w"> </span>-p<span class="w"> </span><span class="nv">$HOME</span>/miniforge3
</code></pre></div>
<p>This sets things up so that all the accounts look for packages in one place, and download them to their local account if they're not there.</p>
<p>I run this script in each child account, and then I set up a separate parent account that has write privileges to the cache directory. This parent account must then download all of the desired conda packages, at which point they are then available to all the child accounts to use without download.</p>
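<p>For example, the parent account might seed the cache simply by creating an
environment containing the packages the class will need - a sketch, where the
environment name and package list are illustrative:</p>
<div class="highlight"><pre><span></span><code>mamba create -y -n seed-cache sourmash snakemake-minimal
</code></pre></div>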
<p>This works great, except for one thing: until recently, the child account mamba calls would complain bitterly if permissions were wrong. And sometimes things would work out even less well and there would be crashes. So I had to be very mindful of how I installed packages. Which I wasn't always. Which caused problems.</p>
<p>And that's the bug that was fixed! - the specific <a href="https://github.com/mamba-org/mamba/issues/488#issuecomment-1400575225">conda issue I've been paying attention to</a> references <a href="https://github.com/mamba-org/mamba/pull/2141">this fix</a>, which was actually pointed at <a href="https://github.com/mamba-org/mamba/issues/1123">this issue</a>. </p>
<p>All's well that ends well - I upgraded all of the accounts to mamba 1.3.0 and ran some tests and it all seems to work! We did a stress test on Wednesday with ~30 people running through my snakemake lesson, and other than network glitches, life was good!</p>
<h2>Taking a step back: is conda all that?</h2>
<p>Yes, it's great.</p>
<p>I'm sure it doesn't solve all the packaging problems, and I'm positive it's theoretically inferior to many things, but I've gotta say, it really <strong>just works</strong> for me (and people in my lab) 99% of the time.</p>
<p>Even better, other people are reporting that it's working well for them - including for R software installations.</p>
<h2>Conda and R</h2>
<p>Conda solves a lot of R package installation problems for me.</p>
<p>I'm no R expert, but here is what I've gathered as to why I have a lot of problems:</p>
<p>The challenge with R installation is that many R packages need to be compiled before installation; I gather the R packaging ecosystem typically distributes things as source. This means installing them requires having a particular compiler tool-chain installed. Dependencies also become an issue. Basically, this is a point of fragility.</p>
<p>Conda conveniently does things in a different way: packages are distributed as binaries with no compilation required, and their dependencies include everything required for runtime. When this works, it works really well - you just download and install the compiled package for your system!</p>
<p>Even better, all of the conda magic works - you get to use an isolated environment, with the version of R you wanted to use, with all of the compatible packages installed. And if you need to install something yourself, you can do so <em>in</em> that isolated conda environment without potentially contaminating your other R installs.</p>
<p>So, I now regularly use conda environments that look like this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">channels</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">conda-forge</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">bioconda</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">defaults</span>
<span class="nt">dependencies</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-ggplot2</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-dplyr</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-readr</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-pheatmap</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-knitr</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-rmarkdown</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-rsqlite</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-data.table</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-kableextra</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">bioconductor-tximeta</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">bioconductor-deseq2</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">bioconductor-summarizedexperiment</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-base</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-irkernel=1.1</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">r-devtools</span>
</code></pre></div>
<p>and it works really well for me.</p>
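<p>To use an environment file like this, you can create and activate the
environment like so (the environment and file names are illustrative):</p>
<div class="highlight"><pre><span></span><code>mamba env create -n rnaseq -f rnaseq-env.yml
conda activate rnaseq
</code></pre></div>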
<p>I'll note that the situation has really improved over the last 3 years - I used to have lots of issues, but conda-forge has really stepped up their game and now most of my problems occur elsewhere (problem-specific stuff, basically).</p>
<p>One concern with conda has been the availability of common R packages. Here I'm happy to say that Fredrik Boulund reported that all but one of the 600 R packages they use internally were already available on conda-forge. So that's pretty cool!</p>
<h2>One last thought for you...</h2>
<p>...or maybe two ;).</p>
<p>Packaging for data science software really requires a community. There are so many packages, and so many diverse and disparate needs, that if you want a solution that satisfies > 80% of the needs you need to build off a diverse community. If the community mechanisms include a way to add your own packages of interest (like conda-forge and bioconda do) then that results in magic!</p>
<p>Also, I think software solutions have to incorporate the newbie/learners perspective. If I can't get a class of 30 people to robustly use your solution, then that's a problem.</p>
<p>--titus</p>A brief overview of automation and parallelization options in UNIX/on an HPC2023-01-31T00:00:00+01:002023-01-31T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2023-01-31:/blog/2023-automation-and-parallelization.html<p>Automating things! Parallelizing them!</p><p>What do you do if you have a lot of computing jobs to run, and lots of computing resources to run them?</p>
<p>Let's play with some options! We'll run a simple set of bioinformatics analyses as an example, but all of the approaches below should work for a wide variety of command line needs.</p>
<p>Most of the commands below should work as straight-up copy/paste. Please let me know if they don't!</p>
<h2>Setup and file preparation</h2>
<p>Download some metagenome assemblies from <a href="https://osf.io/vk4fa/?view_only">our metagenome assembly evaluation project</a>. These are all files generated from from <a href="https://pubmed.ncbi.nlm.nih.gov/23387867/">Shakya et al., 2014</a> - specifically, assemblies of SRR606249.</p>
<div class="highlight"><pre><span></span><code><span class="n">mkdir</span><span class="w"> </span><span class="n">queries</span><span class="o">/</span>
<span class="n">cd</span><span class="w"> </span><span class="n">queries</span><span class="o">/</span>
<span class="n">curl</span><span class="w"> </span><span class="o">-</span><span class="n">JLO</span><span class="w"> </span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">osf</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">download</span><span class="o">/</span><span class="n">q8h97</span><span class="o">/</span>
<span class="n">curl</span><span class="w"> </span><span class="o">-</span><span class="n">JLO</span><span class="w"> </span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">osf</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">download</span><span class="o">/</span><span class="mi">7</span><span class="n">bzrc</span><span class="o">/</span>
<span class="n">curl</span><span class="w"> </span><span class="o">-</span><span class="n">JLO</span><span class="w"> </span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">osf</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">download</span><span class="o">/</span><span class="mi">3</span><span class="n">kgvd</span><span class="o">/</span>
<span class="n">cd</span><span class="w"> </span><span class="o">..</span>
<span class="n">mkdir</span><span class="w"> </span><span class="o">-</span><span class="n">p</span><span class="w"> </span><span class="n">database</span><span class="o">/</span>
<span class="n">cd</span><span class="w"> </span><span class="n">database</span><span class="o">/</span>
<span class="n">curl</span><span class="w"> </span><span class="o">-</span><span class="n">JLO</span><span class="w"> </span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">osf</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">download</span><span class="o">/</span><span class="mi">4</span><span class="n">kfv9</span><span class="o">/</span>
<span class="n">cd</span><span class="w"> </span><span class="o">../</span>
</code></pre></div>
<p>Now you should have three files in queries/</p>
<div class="highlight"><pre><span></span><code>ls -1 queries/
</code></pre></div>
<div class="highlight"><pre><span></span><code>>idba.scaffold.fa.gz
>megahit.final.contigs.fa.gz
>spades.scaffolds.fasta.gz
</code></pre></div>
<p>and one file in database/</p>
<div class="highlight"><pre><span></span><code>ls -1 database/
</code></pre></div>
<div class="highlight"><pre><span></span><code>>podar-complete-genomes-17.2.2018.tar.gz
</code></pre></div>
<p>Let's sketch the queries with sourmash:</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span><span class="w"> </span>i<span class="w"> </span><span class="k">in</span><span class="w"> </span>queries/*.gz
<span class="k">do</span>
<span class="w"> </span>sourmash<span class="w"> </span>sketch<span class="w"> </span>dna<span class="w"> </span>-p<span class="w"> </span><span class="nv">k</span><span class="o">=</span><span class="m">31</span>,scaled<span class="o">=</span><span class="m">10000</span><span class="w"> </span><span class="nv">$i</span><span class="w"> </span>-o<span class="w"> </span><span class="nv">$i</span>.sig
<span class="k">done</span>
</code></pre></div>
<p>Next, unpack the database and create <code>database.zip</code>:</p>
<div class="highlight"><pre><span></span><code><span class="nb">cd</span><span class="w"> </span>database/
tar<span class="w"> </span>xzf<span class="w"> </span>podar*.tar.gz
sourmash<span class="w"> </span>sketch<span class="w"> </span>dna<span class="w"> </span>-p<span class="w"> </span><span class="nv">k</span><span class="o">=</span><span class="m">31</span>,scaled<span class="o">=</span><span class="m">10000</span><span class="w"> </span>*.fa<span class="w"> </span>--name-from-first<span class="w"> </span>-o<span class="w"> </span>../database.zip
<span class="nb">cd</span><span class="w"> </span>../
</code></pre></div>
<p>Finally, make all your inputs read-only:</p>
<div class="highlight"><pre><span></span><code>chmod a-w queries/* database.zip database/*
</code></pre></div>
<p>This protects against accidental overwriting of the files.</p>
<h2>Running your basic queries</h2>
<p>We're going to run <a href="https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-gather-find-metagenome-members">sourmash gather</a> for all three assembly files in <code>queries/</code> against the 64 genomes in <code>database.zip</code>. These specific commands will run quickly, but note that they are a proxy for a much bigger analysis against larger databases.</p>
<p>You could do these queries in serial:</p>
<div class="highlight"><pre><span></span><code>sourmash<span class="w"> </span>gather<span class="w"> </span>queries/idba.scaffold.fa.gz.sig<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>database.zip<span class="w"> </span>-o<span class="w"> </span>idba.scaffold.fa.gz.csv
sourmash<span class="w"> </span>gather<span class="w"> </span>queries/megahit.final.contigs.fa.gz.sig<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>database.zip<span class="w"> </span>-o<span class="w"> </span>megahit.final.contigs.fa.gz.csv
sourmash<span class="w"> </span>gather<span class="w"> </span>queries/spades.scaffolds.fasta.gz.sig<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>database.zip<span class="w"> </span>-o<span class="w"> </span>spades.scaffolds.fasta.gz.csv
</code></pre></div>
<p>but then your total compute time would be the sum of the individual compute times. And what if each query is super slow and/or big, and you have dozens or hundreds of them? WHAT THEN?</p>
<p>Read on!</p>
<h2>Automation and parallelization</h2>
<h3>1. Write a shell script.</h3>
<p>Let's start by automating the queries so that you can just run one command and have it do all three (or N) queries.</p>
<p>Create the following shell script:</p>
<p><code>run1.sh</code>:</p>
<div class="highlight"><pre><span></span><code>sourmash<span class="w"> </span>gather<span class="w"> </span>queries/idba.scaffold.fa.gz.sig<span class="w"> </span>database.zip<span class="w"> </span>-o<span class="w"> </span>idba.scaffold.fa.gz.csv
sourmash<span class="w"> </span>gather<span class="w"> </span>queries/megahit.final.contigs.fa.gz.sig<span class="w"> </span>database.zip<span class="w"> </span>-o<span class="w"> </span>megahit.final.contigs.fa.gz.csv
sourmash<span class="w"> </span>gather<span class="w"> </span>queries/spades.scaffolds.fasta.gz.sig<span class="w"> </span>database.zip<span class="w"> </span>-o<span class="w"> </span>spades.scaffolds.fasta.gz.csv
</code></pre></div>
<p>and run it:</p>
<div class="highlight"><pre><span></span><code>bash run1.sh
</code></pre></div>
<p>This automates the commands, but nothing else.</p>
<p>Notes:</p>
<ul>
<li>all your commands will run in serial, one after the other;</li>
<li>the memory usage of the script will be the same as the memory usage of the largest command;</li>
</ul>
<h3>2. Add a for loop to your shell script.</h3>
<p>There's a lot of duplication in the script above. Duplication leads to typos, which lead to fear, anger, hatred, and suffering.</p>
<p>Let's make a script <code>run2.sh</code> that contains a for loop instead.</p>
<p><code>run2.sh</code>:</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span><span class="w"> </span>query<span class="w"> </span><span class="k">in</span><span class="w"> </span>queries/*.sig
<span class="k">do</span>
<span class="nv">output</span><span class="o">=</span><span class="k">$(</span>basename<span class="w"> </span><span class="nv">$query</span><span class="w"> </span>.sig<span class="k">)</span>.csv
sourmash<span class="w"> </span>gather<span class="w"> </span><span class="nv">$query</span><span class="w"> </span>database.zip<span class="w"> </span>-o<span class="w"> </span><span class="nv">$output</span>
<span class="k">done</span>
</code></pre></div>
<p>While this does exactly the same thing <em>computationally</em> as <code>run1.sh</code>, it is a bit nicer because it is less repetitive and lets you run as many queries as you have.</p>
<p>Notes:</p>
<ul>
<li>yes, we carefully structured the filenames so that the <code>for</code> loop would work :)</li>
<li>the <code>output=</code> line uses <code>basename</code> to remove the <code>queries/</code> prefix and <code>.sig</code> suffix from each query filename - see the example just below.</li>
</ul>
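<p>If you haven't seen <code>basename</code> with a suffix argument before, it strips
both the leading directories and the given suffix:</p>
<div class="highlight"><pre><span></span><code>$ basename queries/idba.scaffold.fa.gz.sig .sig
idba.scaffold.fa.gz
</code></pre></div>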
<h3>3. Write a for loop that creates a shell script.</h3>
<p>Sometimes it's nice to <em>generate</em> a script that you can edit to fine tune and customize the commands. Let's try that!</p>
<p>At the shell prompt, run</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span><span class="w"> </span>query<span class="w"> </span><span class="k">in</span><span class="w"> </span>queries/*.sig
<span class="k">do</span>
<span class="nv">output</span><span class="o">=</span><span class="k">$(</span>basename<span class="w"> </span><span class="nv">$query</span><span class="w"> </span>.sig<span class="k">)</span>.csv
<span class="nb">echo</span><span class="w"> </span>sourmash<span class="w"> </span>gather<span class="w"> </span><span class="nv">$query</span><span class="w"> </span>database.zip<span class="w"> </span>-o<span class="w"> </span><span class="nv">$output</span>
<span class="k">done</span><span class="w"> </span>><span class="w"> </span>run3.sh
</code></pre></div>
<p>This creates a file <code>run3.sh</code> that contains the commands to run. Neato! You could now edit this file if you wanted to individually change up the commands. Or, you could adjust the for loop if you wanted to change <em>all</em> the commands.</p>
<p>Notes:</p>
<ul>
<li>same runtime parameters as above: everything runs in serial.</li>
<li>be careful about overwriting <code>run3.sh</code> by accident after you've edited it!</li>
</ul>
<h3>4. Use <code>parallel</code> to run the commands instead.</h3>
<p>Once we have this script file ready, we can actually run the commands in parallel, using
<a href="https://www.gnu.org/software/parallel/">GNU <code>parallel</code></a>:</p>
<div class="highlight"><pre><span></span><code>parallel -j 2 < run3.sh
</code></pre></div>
<p>This runs up to two commands from <code>run3.sh</code> at a time (<code>-j 2</code>). Neat, right?!</p>
<p>Notes:</p>
<ul>
<li>depending on the parameter to <code>-j</code>, this can be much faster - here, twice as fast!</li>
<li>it will also use twice as much memory...!</li>
<li><code>parallel</code> runs each line on its own. So if you have multiple things you want to run in each parallel session, you need to do something different - like write a shell script to do each compute action, and <em>then</em> run those in parallel.</li>
</ul>
<h3>5. Write a second shell script that takes a parameter.</h3>
<p>Let's switch things up - let's write a generic shell script that does the computation. Note that it's the same set of commands as in the for loops above!</p>
<p><code>do-gather.sh</code>:</p>
<div class="highlight"><pre><span></span><code><span class="nv">output</span><span class="o">=</span><span class="k">$(</span>basename<span class="w"> </span><span class="nv">$1</span><span class="w"> </span>.sig<span class="k">)</span>.csv
sourmash<span class="w"> </span>gather<span class="w"> </span><span class="nv">$1</span><span class="w"> </span>database.zip<span class="w"> </span>-o<span class="w"> </span><span class="nv">$output</span>
</code></pre></div>
<p>Now you can run this in a loop like so:</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span><span class="w"> </span>i<span class="w"> </span><span class="k">in</span><span class="w"> </span>queries/*.sig
<span class="k">do</span>
<span class="w"> </span>bash<span class="w"> </span><span class="k">do</span>-gather.sh<span class="w"> </span><span class="nv">$i</span>
<span class="k">done</span>
</code></pre></div>
<p>Notes:</p>
<ul>
<li>here, <code>$1</code> is the first command-line parameter after the shell script name.</li>
<li>this is back to processing in serial, not parallel.</li>
</ul>
<p>It would be easy to make this into something you can run in parallel, by providing a list of <code>do-gather.sh</code> commands as in (4), above.</p>
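<p>Here's a minimal sketch of that, writing the commands into a (hypothetical) <code>run5.sh</code> and then handing it to <code>parallel</code> as in (4):</p>
<div class="highlight"><pre><code># generate one do-gather.sh command per query...
for i in queries/*.sig
do
    echo bash do-gather.sh $i
done > run5.sh

# ...and then run up to two of them at a time
parallel -j 2 < run5.sh
</code></pre></div>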
<h3>6. Change the second shell script to be an sbatch script.</h3>
<p>Suppose you have access to an HPC that has many different computers, and you want to run a bunch of big jobs <em>across</em> those computers. How do we do that?</p>
<p>All (most?) clusters have a queuing system; ours is called slurm. (You can see a tutorial <a href="https://ngs-docs.github.io/2021-august-remote-computing/executing-large-analyses-on-hpc-clusters-with-slurm.html">here</a>.)</p>
<p>To send jobs to many different computers, you can write a shell script that executes a particular job, and then run lots of those.</p>
<p>Change <code>do-gather.sh</code> to look like the following.</p>
<div class="highlight"><pre><span></span><code><span class="c1">#SBATCH -c 1 # cpus per task</span>
<span class="c1">#SBATCH --mem=5Gb # memory needed</span>
<span class="c1">#SBATCH --time=00-00:05:00 # time needed</span>
<span class="c1">#SBATCH -p med2 </span>
<span class="nv">output</span><span class="o">=</span><span class="k">$(</span>basename<span class="w"> </span><span class="nv">$1</span><span class="w"> </span>.sig<span class="k">)</span>.csv
sourmash<span class="w"> </span>gather<span class="w"> </span><span class="nv">$1</span><span class="w"> </span>database.zip<span class="w"> </span>-o<span class="w"> </span><span class="nv">$output</span>
</code></pre></div>
<p>This is now a script you can send to the HPC to run, using <code>sbatch</code>:</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span><span class="w"> </span>i<span class="w"> </span><span class="k">in</span><span class="w"> </span>queries/*.sig
<span class="k">do</span>
<span class="w"> </span>sbatch<span class="w"> </span><span class="k">do</span>-gather.sh<span class="w"> </span><span class="nv">$i</span>
<span class="k">done</span>
</code></pre></div>
<p>The advantage here is these commands can be scheduled by the HPC to run whenever and wherever there is computational "space" to run them. (Here, the <code>#SBATCH</code> lines in the shell script specify how much compute time/memory is needed.)</p>
<p>Notes:</p>
<ul>
<li>this distributes your job across the HPC;</li>
<li>each job still only requests the time/memory it needs individually - but now the jobs are running in parallel on multiple machines!</li>
<li><code>do-gather.sh</code> is actually still a bash script so you can still run it that way, too.</li>
</ul>
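<p>Once you've submitted the jobs, you can check on their status with slurm's <code>squeue</code> command:</p>
<div class="highlight"><pre><code># list your own pending and running jobs
squeue -u $USER
</code></pre></div>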
<h3>7. Write a snakemake file.</h3>
<p>An alternative to all of the above is to have snakemake run things for you. Here's a simple snakefile to run things in parallel:</p>
<p><code>Snakefile</code>:</p>
<div class="highlight"><pre><span></span><code><span class="n">QUERY</span><span class="p">,</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s2">"queries/</span><span class="si">{q}</span><span class="s2">.sig"</span><span class="p">)</span>
<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"</span><span class="si">{q}</span><span class="s2">.csv"</span><span class="p">,</span> <span class="n">q</span><span class="o">=</span><span class="n">QUERY</span><span class="p">)</span>
<span class="n">rule</span> <span class="n">run_query</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">sig</span> <span class="o">=</span> <span class="s2">"queries/</span><span class="si">{q}</span><span class="s2">.sig"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="n">csv</span> <span class="o">=</span> <span class="s2">"</span><span class="si">{q}</span><span class="s2">.csv"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash gather </span><span class="si">{input.sig}</span><span class="s2"> database.zip -o </span><span class="si">{output.csv}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>and run it in parallel:</p>
<div class="highlight"><pre><span></span><code>snakemake -j 2
</code></pre></div>
<p>Notes:</p>
<ul>
<li>this will run things in parallel as in the above example (4).</li>
</ul>
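<p>One handy trick: you can ask snakemake for a "dry run" with <code>-n</code>, which prints the jobs it <em>would</em> run without actually running anything:</p>
<div class="highlight"><pre><code>snakemake -n
</code></pre></div>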
<h2>Strategies for testing and evaluation</h2>
<p>Here are the three strategies I use when trying to scale something up to run in multiple jobs and across multiple computers:</p>
<ol>
<li>Build around an existing example.</li>
<li>Subsample your query data.</li>
<li>Test on a smaller version of your problem.</li>
</ol>
<h2>Appendix: making your shell script(s) nicer</h2>
<p>The above shell scripts are not actually the way I recommend writing shell scripts! Here are a few additional thoughts for you -</p>
<h3>1. Make them runnable without an explicit <code>bash</code></h3>
<p>Put <code>#! /bin/bash</code> at the top of the shell script and run <code>chmod +x &lt;scriptname&gt;</code>, and now you will be able to run it directly:</p>
<div class="highlight"><pre><span></span><code>./run1.sh
</code></pre></div>
<h3>2. Set error exit</h3>
<p>Add <code>set -e</code> to the top of your shell script and it will stop running when there's an error.</p>snakemake for doing bioinformatics - a beginner's guide (part 2)2023-01-23T00:00:00+01:002023-01-23T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2023-01-23:/blog/2023-snakemake-slithering-section-2.html<p>Slithering your way into bioinformatics with snakemake, round 2.</p><p>(The below post contains excerpts from <em>Slithering your way into
bioinformatics with snakemake</em>, Hackmd Press, 2023.)</p>
<p>In
<a href="http://ivory.idyll.org/blog/2023-snakemake-slithering-section-1.html">Section 1</a>,
we introduced snakemake as a system for (efficiently and effectively)
running a series of shell commands.</p>
<p>In Section 2, we'll explore a number of important features of
snakemake. Together with Section 1, this section covers the core set
of snakemake functionality that you need to know in order to effectively
leverage snakemake.</p>
<p>After this section, you'll be well positioned to write a few workflows
of your own, and then you can come back and explore more advanced
features as you need them.</p>
<h2>Chapter 4: running rules in parallel</h2>
<p>Let's take a look at the <code>sketch_genomes</code> rule from the last
<code>Snakefile</code> entry:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/GCF_000017325.1.fna.gz"</span><span class="p">,</span>
<span class="s2">"genomes/GCF_000020225.1.fna.gz"</span><span class="p">,</span>
<span class="s2">"genomes/GCF_000021665.1.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
</code></pre></div>
<p>This command works fine as it is, but it is <em>slightly</em> awkward - because,
bioinformatics being bioinformatics, we are likely to want to add more
genomes into the comparison at some point, and right now each additional
genome is going to have to be added to both input and output. It's not
a lot of work, but it's unnecessary.</p>
<p>Moreover, if we add in a <em>lot</em> of genomes, then this step could
quickly become a bottleneck. <code>sourmash sketch</code> may run quickly on 10
or 20 genomes, but it will slow down if you give it 100 or 1000! (In
fact, <code>sourmash sketch</code> scales linearly with the number of genomes - so it
will take 100 times longer on 100 genomes than on 1.) Is there a
way to speed that up?</p>
<p>Yes - we can write a rule that can be run for each genome, and then
let snakemake run it in parallel for us!</p>
<p>Let's start by breaking this one rule into three <em>separate</em> rules:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes_1</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/GCF_000017325.1.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">sketch_genomes_2</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/GCF_000020225.1.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">sketch_genomes_3</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/GCF_000021665.1.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
<span class="c1"># rest of Snakefile here!</span>
</code></pre></div>
<p>It's wordy, but it will work - run:</p>
<div class="highlight"><pre><span></span><code>snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">1</span><span class="w"> </span>--delete-all<span class="w"> </span>plot_comparison
snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">1</span><span class="w"> </span>plot_comparison
</code></pre></div>
<p>Before we modify the file further, let's enjoy the fruits of our labor:
we can now tell snakemake to run more than one rule at a time!</p>
<p>Try typing this:</p>
<div class="highlight"><pre><span></span><code>snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">1</span><span class="w"> </span>--delete-all<span class="w"> </span>plot_comparison
snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">3</span><span class="w"> </span>plot_comparison
</code></pre></div>
<p>If you look closely, you should see that snakemake is running all three
<code>sourmash sketch dna</code> commands <em>at the same time</em>.</p>
<p>This is pretty cool and is one of the more powerful practical features
of snakemake: once you tell snakemake <em>what you want it to do</em> (by
specifying your desired output(s)) and give snakemake the set of
recipes telling it <em>how to do each step</em>, snakemake will figure out
the fastest way to run all the necessary steps with the resources
you've given it.</p>
<p>In this case, we told snakemake that it could run up to three jobs at
a time, with <code>-j 3</code>. We could also have told it to run more jobs at a
time, but at the moment there are only three rules that can actually
be run at the same time - <code>sketch_genomes_1</code>, <code>sketch_genomes_2</code>, and
<code>sketch_genomes_3</code>. This is because the rule <code>compare_genomes</code> needs the
output of these three rules to run, and likewise <code>plot_comparison</code> needs
the output of <code>compare_genomes</code> to run. So they can't be run at the
same time as any other rules!</p>
<h2>Chapter 5 - visualizing workflows</h2>
<p>Let's visualize what we're doing! Here's the output of <code>snakemake
--dag plot_comparison</code>, visualized with the graphviz package:</p>
<p><img alt="interm2 graph of jobs" src="images/2023-snakemake-slithering-section-2-interm2-dag.png?raw=true"></p>
<p>This diagram shows the relationship between the rules we've put in the
Snakefile: <code>compare_genomes</code> takes the output of the <code>sketch_genomes_*</code>
rules as its own input, and then <code>plot_comparison</code> uses the output of
<code>compare_genomes</code> to build its own plot.</p>
<p>One key aspect of this graph is that it shows you where the various
rules can be run at the same time as each other because they neither
require nor are required for the others - here, the three
<code>sketch_genomes_*</code> rules. That is what let us run all three simultaneously
in the previous chapter!</p>
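<p>If you want to generate this kind of diagram yourself, one way to do it (assuming you have graphviz's <code>dot</code> command installed) is:</p>
<div class="highlight"><pre><code># render the workflow graph to a PNG; dag.png is just an example name
snakemake --dag plot_comparison | dot -Tpng > dag.png
</code></pre></div>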
<p>Note: sometimes you have to have a single rule that deals with all of
the genomes - for example, <code>compare_genomes</code> has to compare <em>all</em> the
genomes, and there's no simple way around that. But with <code>sketch_genomes</code>,
we do have the option of breaking the rule up!</p>
<h2>Chapter 6 - using wildcards to make rules more generic</h2>
<p>Let's take another look at one of those <code>sketch_genomes_</code> rules:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes_1</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/GCF_000017325.1.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
</code></pre></div>
<p>There's some redundancy in there - the accession <code>GCF_000017325.1</code> shows up
twice. Can we do anything about that?</p>
<p>Yes, we can! We can use a snakemake feature called "wildcards", which will
let us give snakemake a blank space to fill in automatically.</p>
<p>With wildcards, you signal to snakemake that a particular part of an
input or output filename is fair game for substitutions using <code>{</code> and <code>}</code>
surrounding the wildcard name. Let's create a wildcard named <code>accession</code>
and put it into the input and output blocks for the rule:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes_1</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/</span><span class="si">{accession}</span><span class="s2">.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{accession}</span><span class="s2">.fna.gz.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> </span><span class="se">\</span>
<span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
</code></pre></div>
<p>What this does is tell snakemake that whenever you want an output file
ending with <code>.fna.gz.sig</code>, it should look in the <code>genomes/</code>
directory for a file with the same prefix (the text before <code>.fna.gz.sig</code>)
and ending in <code>.fna.gz</code>, and, <strong>if that file exists</strong>, use it as the input.</p>
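<p>To make the matching concrete, here's how snakemake fills in the wildcard for one of our genomes:</p>
<div class="highlight"><pre><code># requested file:   GCF_000017325.1.fna.gz.sig
# output pattern:   {accession}.fna.gz.sig
#   => accession is "GCF_000017325.1"
# input pattern:    genomes/{accession}.fna.gz
#   => input becomes genomes/GCF_000017325.1.fna.gz
</code></pre></div>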
<p>(Yes, there can be multiple wildcards in a rule! We'll show you that later!)</p>
<p>If you go through and use the wildcards in <code>sketch_genomes_2</code> and
<code>sketch_genomes_3</code>, you'll notice that the rules end up looking <em>identical</em>.
And, as it turns out, you only need (and in fact can only have) one rule -
you can now collapse the three rules into one <code>sketch_genome</code> rule again.</p>
<p>Here's the full <code>Snakefile</code>:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genome</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/</span><span class="si">{accession}</span><span class="s2">.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{accession}</span><span class="s2">.fna.gz.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash compare </span><span class="si">{input}</span><span class="s2"> -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">plot_comparison</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash plot </span><span class="si">{input}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>It looks a lot like the Snakefile we started with, with the crucial
difference that we are now using wildcards.</p>
<p>Here, unlike the situation we were in at the end of last section where
we had one rule that sketched three genomes, we now have one rule
that sketches one genome at a time, but also can be run in parallel!
So <code>snakemake -j 3</code> will still work! And it will continue to work as
you add more genomes in, and increase the number of jobs you want to
run at the same time.</p>
<p>Before we do that, let's take another look at the workflow now -
you'll notice that it's the same shape, but looks slightly different!
Now, instead of the three rules for sketching genomes having different names,
they all have the same name but have different values for the <code>accession</code> wildcard!</p>
<p><img alt="interm3 graph of jobs" src="images/2023-snakemake-slithering-section-2-interm3-dag.png?raw=true"></p>
<h2>Chapter 7 - giving snakemake filenames instead of rule names</h2>
<p>Let's add a new genome into the mix, and start by generating a sketch
file (ending in <code>.sig</code>) for it.</p>
<p>Download the RefSeq assembly file (the <code>_genomic.fna.gz</code> file) for GCF_008423265.1 from <a href="https://www.ncbi.nlm.nih.gov/assembly/GCF_008423265.1">this NCBI link</a>, and put it in the <code>genomes/</code> subdirectory as <code>GCF_008423265.1.fna.gz</code>. (You can also download a saved copy with the right name from <a href="https://osf.io/7cdxn">this osf.io link</a>.)</p>
<p>Now, we'd like to build a sketch by running <code>sourmash sketch dna</code>
(via snakemake).</p>
<p>Do we need to add anything to the <code>Snakefile</code> to do this? No, no we don't!</p>
<p>To build a sketch for this new genome, you can just ask snakemake to make the
right filename like so:</p>
<div class="highlight"><pre><span></span><code>snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">1</span><span class="w"> </span>GCF_008423265.1.fna.gz.sig
</code></pre></div>
<p>Why does this work? It works because we have a generic wildcard rule for
building <code>.sig</code> files from files in <code>genomes/</code>!</p>
<p>When you ask snakemake to build that filename, it looks through the
output blocks of all of its rules and chooses the rule with a matching output -
importantly, this rule <em>can</em> have wildcards, and if it does, snakemake
extracts the wildcard values from the filename!</p>
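<p>If you're ever unsure which rule snakemake will pick for a given filename, a dry run with <code>-n</code> will show you the matching job (if any) without running it:</p>
<div class="highlight"><pre><code>snakemake -n -j 1 GCF_008423265.1.fna.gz.sig
</code></pre></div>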
<h3>Warning: the <code>sketch_genome</code> rule has now changed!</h3>
<p>As a side note, you can no longer ask snakemake to run the rule by its
name, <code>sketch_genome</code> - this is because the rule needs to fill in the
wildcard, and it can't know what <code>{accession}</code> should be without us
giving it the filename.</p>
<p>If you try running <code>snakemake -j 1 sketch_genome</code>, you'll get the following error:</p>
<blockquote>
<p>WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards at the command line, or have a rule without wildcards at the very top of your workflow (e.g. the typical "rule all" which just collects all results you want to generate in the end).</p>
</blockquote>
<p>This is telling you that snakemake doesn't know how to fill in the wildcard
(and giving you some suggestions as to how you might do that, which we'll
explore below).</p>
<p>In this chapter we didn't need to modify the Snakefile at all to make use
of new functionality!</p>
<h2>Chapter 8 - adding new genomes</h2>
<p>So we've got a new genome, and we can build a sketch for it. Let's
add it into our comparison, so we're building a comparison matrix
for <em>four</em> genomes, and not just three!</p>
<p>To add this new genome into the comparison, all you need to do is add
the sketch into the <code>compare_genomes</code> input, and snakemake will
automatically locate the associated genome file and run
<code>sketch_genome</code> on it (as in the previous chapter), and then run
<code>compare_genomes</code> on it. snakemake will take care of the rest!</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genome</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/</span><span class="si">{accession}</span><span class="s2">.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{accession}</span><span class="s2">.fna.gz.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_008423265.1.fna.gz.sig"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash compare </span><span class="si">{input}</span><span class="s2"> -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">plot_comparison</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash plot </span><span class="si">{input}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Now when you run <code>snakemake -j 3 plot_comparison</code> you will get a
<code>compare.mat.matrix.png</code> file that contains a 4x4 matrix! (See Figure.)</p>
<p><img alt="4x4 matrix comparison of genomes" src="images/2023-snakemake-slithering-section-2-4x4-mat.png"></p>
<p>Note that the workflow diagram has now expanded to include our fourth genome, too!</p>
<p><img alt="interm3 graph of jobs" src="images/2023-snakemake-slithering-section-2-interm4-dag.png?raw=true"></p>
<h2>Chapter 9 - using <code>expand</code> to make filenames</h2>
<p>You might note that the list of files in the <code>compare_genomes</code> rule
all share the same suffix, and they're all built using the same rule.
Can we use that in some way?</p>
<p>Yes! We can use a function called <code>expand(...)</code> and give it a template
filename to build, and a list of values to insert into that filename.</p>
<p>Below, we build a list of accessions named <code>ACCESSIONS</code>, and then use
<code>expand</code> to build the list of input files of the format <code>{acc}.fna.gz.sig</code>
from that list, creating one filename for each value in <code>ACCESSIONS</code>.</p>
<div class="highlight"><pre><span></span><code><span class="n">ACCESSIONS</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"GCF_000017325.1"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1"</span><span class="p">,</span>
<span class="s2">"GCF_008423265.1"</span><span class="p">]</span>
<span class="n">rule</span> <span class="n">sketch_genome</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/</span><span class="si">{accession}</span><span class="s2">.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{accession}</span><span class="s2">.fna.gz.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"</span><span class="si">{acc}</span><span class="s2">.fna.gz.sig"</span><span class="p">,</span> <span class="n">acc</span><span class="o">=</span><span class="n">ACCESSIONS</span><span class="p">),</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash compare </span><span class="si">{input}</span><span class="s2"> -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">plot_comparison</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash plot </span><span class="si">{input}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>While wildcards and <code>expand</code> use the same syntax, they do quite different
things.</p>
<p><code>expand</code> generates a list of filenames, based on a template and a list
of values to insert into the template. It is typically used to make a
list of files that you want snakemake to create for you.</p>
<p>Wildcards in rules provide the recipes by which one or more files
will actually be created. They say, "when you want to make a file
whose name looks like THIS, you can do so from files that look like
THAT, and here's what to run to make that happen."</p>
<p><code>expand</code> tells snakemake WHAT you want to make, wildcard rules tell
snakemake HOW to make those things.</p>
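<p>As a quick sketch of what <code>expand</code> produces - here with just two of our accessions:</p>
<div class="highlight"><pre><code>expand("{acc}.fna.gz.sig", acc=["GCF_000017325.1", "GCF_000020225.1"])
# => ["GCF_000017325.1.fna.gz.sig", "GCF_000020225.1.fna.gz.sig"]
</code></pre></div>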
<h2>Chapter 10 - using default rules</h2>
<p>The last change we'll make to the Snakefile in this section is
to add what's known as a default rule. What is this, and why?</p>
<p>The 'why' is easier. Above, we've been careful to provide specific rule
names or filenames to snakemake, because otherwise it defaults to running
the first rule in the Snakefile. (There's no other way in which the order
of rules in the file matters - but snakemake will try to run the first
rule in the file if you don't give it a rule name or a filename on the
command line.)</p>
<p>This is less than great, because it's one more thing to remember and to
type. In general, it's better to have what's called a "default rule"
that lets you just run <code>snakemake -j 1</code> to generate the file or files you
want.</p>
<p>This is straightforward to do, but it involves a slightly different syntax -
a rule with <em>only</em> an <code>input</code>, and no shell or output blocks. Here's
a default rule for our Snakefile that should be put in the file as
the first rule:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
</code></pre></div>
<p>What this rule says is, "I want the file <code>compare.mat.matrix.png</code>."
It doesn't give any instructions on how to do that - that's what the
rest of the rules in the file are! - and it doesn't <em>run</em> anything,
because it has no shell block, and nor does it <em>create</em> anything,
because it has no output block.</p>
<p>The logic here is simple, if not immediately obvious: this rule succeeds
when that input exists.</p>
<p>If you place that at the top of the Snakefile, then running
<code>snakemake -j 1</code> will produce <code>compare.mat.matrix.png</code>. You no
longer need to provide either a rule name or a filename on the command
line unless you want to do something <em>other</em> than generate that file,
in which case whatever you put on the command line will take
precedence over the <code>rule all:</code>.</p>
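<p>So, with the default rule in place, both of these work - the first uses the default rule, while the second overrides it with an explicit filename:</p>
<div class="highlight"><pre><code>snakemake -j 1               # builds compare.mat.matrix.png via 'rule all'
snakemake -j 1 compare.mat   # builds only compare.mat
</code></pre></div>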
<h2>Chapter 11 - our final Snakefile - review and discussion</h2>
<p>Here's the final Snakefile:</p>
<div class="highlight"><pre><span></span><code><span class="n">ACCESSIONS</span> <span class="o">=</span> <span class="p">[</span><span class="s2">"GCF_000017325.1"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1"</span><span class="p">,</span>
<span class="s2">"GCF_008423265.1"</span><span class="p">]</span>
<span class="n">rule</span> <span class="nb">all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">rule</span> <span class="n">sketch_genome</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"genomes/</span><span class="si">{accession}</span><span class="s2">.fna.gz"</span><span class="p">,</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"</span><span class="si">{accession}</span><span class="s2">.fna.gz.sig"</span><span class="p">,</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 </span><span class="si">{input}</span><span class="s2"> --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"</span><span class="si">{acc}</span><span class="s2">.fna.gz.sig"</span><span class="p">,</span> <span class="n">acc</span><span class="o">=</span><span class="n">ACCESSIONS</span><span class="p">),</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash compare </span><span class="si">{input}</span><span class="s2"> -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">plot_comparison</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash plot </span><span class="si">{input}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>This <code>Snakefile</code> provides some nice features.</p>
<p>First, it's easy to add new genomes into the comparison - we download
the genome, name it for its accession, and add it to <code>ACCESSIONS</code> at the
top. Voila!</p>
<p>Second, we don't have to remember the names of any rules to run the whole
workflow, because the <code>rule all:</code> at the top provides a sensible default.</p>
<p>Third, it is easy to change the sketching or comparison parameters and
then rerun the entire workflow from scratch - thus letting us quickly
explore alternate parameters for sketching and comparisons if we so
choose.</p>
<p>In future sections, we'll revisit this basic Snakefile from the top,
and explore some of the details of rules, wildcards, and other features.</p>snakemake for doing bioinformatics - a beginner's guide (part 1)2023-01-14T00:00:00+01:002023-01-14T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2023-01-14:/blog/2023-snakemake-slithering-section-1.html<p>Slithering your way into bioinformatics with snakemake</p><p>(The below post contains excerpts from <em>Slithering your way into
bioinformatics with snakemake</em>, Hackmd Press, 2023.)</p>
<h2>Installation and setup!</h2>
<p>I suggest working in a new directory.</p>
<p>You'll need to <a href="https://snakemake.readthedocs.io/en/stable/getting_started/installation.html">install snakemake</a> and <a href="https://sourmash.readthedocs.io/en/latest/#installing-sourmash">sourmash</a>. We suggest using <a href="https://github.com/conda-forge/miniforge#mambaforge">mamba, via miniforge/mambaforge</a>, for this.</p>
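<p>For example, one way to install both into a fresh environment - a sketch assuming you've installed mamba, with <code>smk</code> as an arbitrary environment name - is:</p>
<div class="highlight"><pre><code>mamba create -n smk -c conda-forge -c bioconda snakemake sourmash
mamba activate smk
</code></pre></div>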
<h4>Getting the data:</h4>
<p>You'll need to download these three files:</p>
<ul>
<li><a href="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/021/665/GCF_000021665.1_ASM2166v1/GCF_000021665.1_ASM2166v1_genomic.fna.gz">GCF_000021665.1_ASM2166v1_genomic.fna.gz</a></li>
<li><a href="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/017/325/GCF_000017325.1_ASM1732v1/GCF_000017325.1_ASM1732v1_genomic.fna.gz">GCF_000017325.1_ASM1732v1_genomic.fna.gz</a></li>
<li><a href="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/020/225/GCF_000020225.1_ASM2022v1/GCF_000020225.1_ASM2022v1_genomic.fna.gz">GCF_000020225.1_ASM2022v1_genomic.fna.gz</a></li>
</ul>
<p>and rename them so that they are in a subdirectory <code>genomes/</code> with the names:</p>
<div class="highlight"><pre><span></span><code>GCF_000017325.1.fna.gz
GCF_000020225.1.fna.gz
GCF_000021665.1.fna.gz
</code></pre></div>
<p>Note, you can download saved copies of them here, with the right names: <a href="https://osf.io/2g4dm/">osf.io/2g4dm/</a>.</p>
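<p>If you prefer the command line, here's a sketch that downloads and renames all three in one go (assuming <code>curl</code> is installed; URLs as above):</p>
<div class="highlight"><pre><code>mkdir -p genomes
curl -L -o genomes/GCF_000017325.1.fna.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/017/325/GCF_000017325.1_ASM1732v1/GCF_000017325.1_ASM1732v1_genomic.fna.gz
curl -L -o genomes/GCF_000020225.1.fna.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/020/225/GCF_000020225.1_ASM2022v1/GCF_000020225.1_ASM2022v1_genomic.fna.gz
curl -L -o genomes/GCF_000021665.1.fna.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/021/665/GCF_000021665.1_ASM2166v1/GCF_000021665.1_ASM2166v1_genomic.fna.gz
</code></pre></div>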
<h2>Chapter 1 - snakemake runs programs for you!</h2>
<p>Bioinformatics often involves running many different programs to characterize and reduce sequencing data, and I use snakemake to help me do that.</p>
<h3>A first, simple snakemake workflow</h3>
<p>Here's a simple, useful snakemake workflow:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 genomes/*.fna.gz --name-from-first </span>
<span class="s2"> sourmash compare GCF_000021665.1.fna.gz.sig </span><span class="se">\</span>
<span class="s2"> GCF_000017325.1.fna.gz.sig GCF_000020225.1.fna.gz.sig </span><span class="se">\</span>
<span class="s2"> -o compare.mat</span>
<span class="s2"> sourmash plot compare.mat</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Put it in a file called <code>Snakefile</code>, and run it with <code>snakemake -j 1</code>.</p>
<p>This will produce the output file <code>compare.mat.matrix.png</code> which contains a similarity matrix and a dendrogram of the three genomes (see Figure 1).</p>
<p><img alt="similarity matrix and dendrogram" src="images/2023-snakemake-slithering-section-1-mat.png"></p>
<p>This is functionally equivalent to putting these three commands into a file <code>compare-genomes.sh</code> and running it with <code>bash compare-genomes.sh</code> -</p>
<div class="highlight"><pre><span></span><code>sourmash<span class="w"> </span>sketch<span class="w"> </span>dna<span class="w"> </span>-p<span class="w"> </span><span class="nv">k</span><span class="o">=</span><span class="m">31</span><span class="w"> </span>genomes/*.fna.gz<span class="w"> </span>--name-from-first<span class="w"> </span>
sourmash<span class="w"> </span>compare<span class="w"> </span>GCF_000021665.1.fna.gz.sig<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>GCF_000017325.1.fna.gz.sig<span class="w"> </span>GCF_000020225.1.fna.gz.sig<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-o<span class="w"> </span>compare.mat<span class="w"> </span>
sourmash<span class="w"> </span>plot<span class="w"> </span>compare.mat<span class="w"> </span>
</code></pre></div>
<p>The snakemake version is already a little bit nicer because it will
give you encouragement when the commands run successfully (with nice
green text saying "1 of 1 steps (100%) done"!) and if the commands
fail you'll get red text alerting you to that, too.</p>
<p>But! We can further improve the snakemake version over the shell
script version!</p>
<h3>Avoiding unnecessary rerunning of commands: a second snakemake workflow</h3>
<p>The commands will run every time you invoke snakemake with <code>snakemake -j 1</code>. But most of the time you don't need to rerun them because you've already got the output files you wanted!</p>
<p>How do you get snakemake to avoid rerunning rules?</p>
<p>We can do that by telling snakemake what we expect the output to be by adding an <code>output:</code> block in front of the shell block:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 genomes/*.fna.gz --name-from-first</span>
<span class="s2"> sourmash compare GCF_000021665.1.fna.gz.sig </span><span class="se">\</span>
<span class="s2"> GCF_000017325.1.fna.gz.sig GCF_000020225.1.fna.gz.sig </span><span class="se">\</span>
<span class="s2"> -o compare.mat</span>
<span class="s2"> sourmash plot compare.mat</span>
<span class="s2"> """</span>
</code></pre></div>
<p>and now when we run <code>snakemake -j 1</code> once, it will run the commands; but when we run it again, it will say, "Nothing to be done (all requested files are present and up to date)."</p>
<p>This is because the desired output file, <code>compare.mat.matrix.png</code>, already exists. So snakemake knows it doesn't need to do anything!</p>
<p>If you remove <code>compare.mat.matrix.png</code> and run <code>snakemake -j 1</code> again, snakemake will happily make the files again:</p>
<div class="highlight"><pre><span></span><code>rm<span class="w"> </span>compare.mat.matrix.png
snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">1</span>
</code></pre></div>
<p>So snakemake makes it easy to avoid re-running a set of commands if it
has already produced the files you wanted. This is one of the best
reasons to use a workflow system like snakemake for running
bioinformatics workflows; shell scripts don't automatically avoid
re-running commands.</p>
<h3>Running only the commands you need to run</h3>
<p>The last Snakefile above has three commands in it, but if you remove the <code>compare.mat.matrix.png</code> file you only need to run the last command again - the files created by the first two commands already exist and don't need to be recreated. However, snakemake doesn't know that - it treats the entire rule as a single step, and doesn't look into the shell command to work out what it doesn't need to run.</p>
<p>If we want to avoid re-creating the files that already exist, we need to make the Snakefile a little bit more complicated.</p>
<p>First, let's break out the commands into three separate rules.</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes</span><span class="p">:</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 genomes/*.fna.gz --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash compare GCF_000021665.1.fna.gz.sig </span><span class="se">\</span>
<span class="s2"> GCF_000017325.1.fna.gz.sig GCF_000020225.1.fna.gz.sig </span><span class="se">\</span>
<span class="s2"> -o compare.mat</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">plot_comparison</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash plot compare.mat</span>
<span class="s2"> """</span>
</code></pre></div>
<p>We didn't do anything too complicated here - we made two new rule blocks, with their own names, and split the shell commands up so that each shell command has its own rule block.</p>
<p>You can tell snakemake to run all three:</p>
<div class="highlight"><pre><span></span><code>snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">1</span><span class="w"> </span>sketch_genomes<span class="w"> </span>compare_genomes<span class="w"> </span>plot_comparison
</code></pre></div>
<p>and it will successfully run them all!</p>
<p>However, we're back to snakemake running some of the commands every time - it won't run <code>plot_comparison</code> every time, because <code>compare.mat.matrix.png</code> exists, but it will run <code>sketch_genomes</code> and <code>compare_genomes</code> repeatedly.</p>
<p>How do we fix this?</p>
<h3>Adding output blocks to each rule</h3>
<p>If we add output blocks to <em>each</em> rule, then snakemake will only run rules
where the output needs to be updated (e.g. because it doesn't exist).</p>
<p>Let's do that:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 genomes/*.fna.gz --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash compare GCF_000021665.1.fna.gz.sig </span><span class="se">\</span>
<span class="s2"> GCF_000017325.1.fna.gz.sig GCF_000020225.1.fna.gz.sig </span><span class="se">\</span>
<span class="s2"> -o compare.mat</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">plot_comparison</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash plot compare.mat</span>
<span class="s2"> """</span>
</code></pre></div>
<p>and now</p>
<div class="highlight"><pre><span></span><code>snakemake -j 1 sketch_genomes compare_genomes plot_comparison
</code></pre></div>
<p>will run each command only once, as long as the output files are still there. Huzzah!</p>
<p>But we still have to specify the names of all three rules, in the right order, to run this. That's annoying! Let's fix that next.</p>
<h2>Chapter 2: snakemake connects rules for you!</h2>
<h3>Chaining rules with <code>input:</code> blocks</h3>
<p>We can get snakemake to automatically connect rules by providing
information about the <em>input</em> files a rule needs. Then, if you ask
snakemake to run a rule that requires certain inputs, it will
automatically figure out which rules produce those inputs as their
output, and automatically run them.</p>
<p>Let's add input information to the <code>plot_comparison</code> and <code>compare_genomes</code>
rules:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 genomes/*.fna.gz --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash compare GCF_000021665.1.fna.gz.sig </span><span class="se">\</span>
<span class="s2"> GCF_000017325.1.fna.gz.sig GCF_000020225.1.fna.gz.sig </span><span class="se">\</span>
<span class="s2"> -o compare.mat</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">plot_comparison</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash plot compare.mat</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Now you can just ask snakemake to run the last rule:</p>
<div class="highlight"><pre><span></span><code>snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">1</span><span class="w"> </span>plot_comparison
</code></pre></div>
<p>and snakemake will run the other rules only if those input files don't exist and need to be created.</p>
<h3>Taking a step back</h3>
<p>The Snakefile is now a lot longer, but it's not <em>too</em> much more complicated - what we've done is split the shell commands up into separate rules and annotated each rule with information about what file it produces (the output), and what files it requires in order to run (the input).</p>
<p>This has the advantage of making it so you don't need to rerun commands unnecessarily. This is only a small advantage with our current workflow, because sourmash is pretty fast. But if each step takes an hour to run, avoiding unnecessary steps can make your work go much faster!</p>
<p>And, as you'll see later, these rules are reusable building blocks that can be incorporated into workflows that each produce different files. So there are other good reasons to break shell commands out into individual rules!</p>
<h2>Chapter 3: snakemake helps you avoid redundancy!</h2>
<h3>Avoiding repeated filenames by using <code>{input}</code> and <code>{output}</code></h3>
<p>If you look at the previous Snakefile, you'll see a few repeated filenames - in particular, rule <code>compare_genomes</code> has three filenames in the input block and then repeats them in the shell block, and <code>compare.mat</code> is repeated several times in both <code>compare_genomes</code> and <code>plot_comparison</code>.</p>
<p>We can tell snakemake to reuse filenames by using <code>{input}</code> and <code>{output}</code>. The <code>{</code> and <code>}</code> tell snakemake to interpret these not as literal strings but as template variables that should be replaced with the value of <code>input</code> and <code>output</code>.</p>
<p>Let's give it a try!</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 genomes/*.fna.gz --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash compare </span><span class="si">{input}</span><span class="s2"> -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">plot_comparison</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash plot </span><span class="si">{input}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>This approach not only involves less typing in the first place, but also makes it so that you only have to edit filenames in one place. This avoids mistakes caused by adding or changing filenames in one place and not another place - a mistake I've made plenty of times!</p>
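<p>To see what the templating does, here is what the <code>compare_genomes</code> shell command expands to with the input and output above:</p>
<div class="highlight"><pre><code>sourmash compare GCF_000017325.1.fna.gz.sig GCF_000020225.1.fna.gz.sig \
    GCF_000021665.1.fna.gz.sig -o compare.mat
</code></pre></div>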
<h3>snakemake makes it easy to rerun workflows!</h3>
<p>It is common to want to rerun an entire workflow from scratch, to make sure that you're using the latest data files and software. Snakemake makes this easy!</p>
<p>You can ask snakemake to clean up all the files that it knows how to generate - and <em>only</em> those files:</p>
<div class="highlight"><pre><span></span><code>snakemake<span class="w"> </span>-j<span class="w"> </span><span class="m">1</span><span class="w"> </span>plot_comparison<span class="w"> </span>--delete-all-output
</code></pre></div>
<p>which can then be followed by asking snakemake to regenerate the results:</p>
<div class="highlight"><pre><span></span><code>snakemake -j 1 plot_comparison
</code></pre></div>
<h3>snakemake will alert you to missing files if it can't make them!</h3>
<p>Suppose you add a new file that does not exist to <code>compare_genomes</code>:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">sketch_genomes</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash sketch dna -p k=31 genomes/*.fna.gz --name-from-first</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">compare_genomes</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"GCF_000017325.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000020225.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"GCF_000021665.1.fna.gz.sig"</span><span class="p">,</span>
<span class="s2">"does-not-exist.sig"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash compare </span><span class="si">{input}</span><span class="s2"> GCF_000021665.1.sig -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
<span class="n">rule</span> <span class="n">plot_comparison</span><span class="p">:</span>
<span class="n">message</span><span class="p">:</span> <span class="s2">"compare all input genomes using sourmash"</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s2">"compare.mat"</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"compare.mat.matrix.png"</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> sourmash plot </span><span class="si">{input}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Here, <code>does-not-exist.sig</code> doesn't exist, and we haven't given snakemake a rule to make it, either. What will snakemake do??</p>
<p>It will complain, loudly and clearly! And it will do so before running anything.</p>
<p>First, let's force the <code>compare_genomes</code> rule to rerun by removing its output file:</p>
<div class="highlight"><pre><span></span><code>rm<span class="w"> </span>compare.mat
</code></pre></div>
<p>and then run <code>snakemake -j 1</code>. You should see:</p>
<div class="highlight"><pre><span></span><code><span class="nv">Missing</span><span class="w"> </span><span class="nv">input</span><span class="w"> </span><span class="nv">files</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="nv">rule</span><span class="w"> </span><span class="nv">compare_genomes</span>:
<span class="w"> </span><span class="nv">output</span>:<span class="w"> </span><span class="nv">compare</span>.<span class="nv">mat</span>
<span class="w"> </span><span class="nv">affected</span><span class="w"> </span><span class="nv">files</span>:
<span class="w"> </span><span class="nv">does</span><span class="o">-</span><span class="nv">not</span><span class="o">-</span><span class="nv">exist</span>.<span class="nv">sig</span>
</code></pre></div>
<p>This is exactly what you want - a clear indication of what is missing before your workflow runs.</p>
<h2>Next steps</h2>
<p>We've introduced basic snakemake workflows, which give you a simple way to run shell commands in the right order. snakemake already offers a few nice improvements over running the shell commands by yourself or in a shell script -</p>
<ul>
<li>it doesn't run shell commands if you already have all the files you need</li>
<li>it lets you avoid typing the same filenames over and over again</li>
<li>it gives simple, clear errors when something fails</li>
</ul>
<p>While this functionality is nice, there are many more things we can do to improve the efficiency of our bioinformatics!</p>
<p>In the next section, we'll explore </p>
<ul>
<li>writing more generic rules using <em>wildcards</em>;</li>
<li>typing fewer filenames by using more templates;</li>
<li>providing a list of default output files to produce;</li>
<li>running commands in parallel on a single computer;</li>
<li>loading lists of filenames from spreadsheets;</li>
<li>configuring workflows with input files.</li>
</ul>sourmash has a plugin interface!2023-01-08T00:00:00+01:002023-01-08T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2023-01-08:/blog/2023-sourmash-plugins-first-effort.html<p>Enabling plugins in sourmash, for less directed & more incoherent progress!</p><p>Over the holiday break, I took on a "palate cleansing" project - something technically neat, that wasn't critically important to anyone or anything, but could be useful. I decided to implement plugins for sourmash.</p>
<p><a href="sourmash.readthedocs.io/">Sourmash</a> is open-source scientific software for fast, lightweight exploration of sequencing data set comparison, with a focus on metagenomics. It's largely a command-line program written in Python on top of a Rust library. It is maintained by a small group of developers, most of whom are (or were) affiliated in some way with my academic lab at UC Davis.</p>
<p>Python has (what seems to be) robust support for third-party plugins, where a project can provide hooks for <em>other people</em> to customize functionality.</p>
<p>So the question was, can we add Python plugin support to sourmash?</p>
<h2>First - why focus on plugins?</h2>
<p>Plugins serve a lot of purposes for a project, but I think the most interesting justification for supporting them came from Tim Head, who channeled his observations of Simon Willison's <a href="https://datasette.io/">datasette</a> project into a statement that <strong>plugins are an alternate way to direct open source projects</strong>. (You can read the whole Twitter thread <a href="https://twitter.com/betatim/status/1355902709237473281">here</a>.)</p>
<p>Tim's tl;dr was this: </p>
<blockquote>
<p>"first class plugins" is my current best answer to "we need a project roadmap"</p>
</blockquote>
<p>but what does that mean?</p>
<p>The central idea is that the more extensible you make a project with plugins, the easier it is for everyone to "play" with the project,
pursue their own directions, and figure out what to do next.</p>
<p>Or, to rephrase: if you focus your planning and governance efforts on defining how others can extend the core functionality of your software, then you free others up to do so without permission or close engagement. This can enable a lot of experimentation and creativity!</p>
<p>That was a large part of my sociotechnical motivation in looking into plugins, but there were several more reasons:</p>
<ul>
<li>
<p>Maintaining an open source project is a fair bit of work, and I have a lot of interest in keeping the "feature surface" of sourmash small so that there's less to maintain. That battles with the desire to add more functionality to meet research and user needs. Plugins offer a way to segregate efforts to either side of a well-defined interface: either it's a "core" effort (lots of coordination and work!) or an "external" effort (maybe less work, certainly less coordination), and we can allocate our attention appropriately.</p>
</li>
<li>
<p>With a robust core, plugins can combine to expand the feature surface of sourmash combinatorially. That's a fancy way of saying that if there's a neat new visualization plugin written by Tina, and a neat new remote-collection loading mechanism written by Steve, people can use these plugins in <em>combination</em> to apply the viz to remote collections.</p>
</li>
<li>
<p>Right now it's quite hard to add platform-specific features to sourmash - in particular, there are some software packages that we'd like to use that don't compile on Mac OS ARM laptops. Plugins would be one way to support those features on specific platforms.</p>
</li>
<li>
<p>Refactoring internals to support plugins can clean up the internal code! The loading and saving plugins are implemented in exactly the same way as our internal code, and I think the effort to modularize loading/saving over time has ended up with reasonably simple and decent code internally. Plugins reinforce that by standardizing the API.</p>
</li>
</ul>
<h2>And how's that going, Dr. Brown?</h2>
<p>What I can say after putting in a dozen or so hours of work on the plugin framework is that it's been very liberating - it's just <em>so much easier</em> to try out new ideas, and clearly distinguish them from "serious" core code contributions that need more care and thought.</p>
<p>So, ...it's going well!</p>
<h2>What types of plugins does sourmash support?</h2>
<p>As of this morning, the main branch of sourmash supports <code>load_from</code> and <code>save_to</code> plugins. As the names suggest, these plugins provide alternate ways of loading and saving sourmash sketches.</p>
<p>Using these, I've built out an <a href="https://github.com/sourmash-bio/sourmash_plugin_avro">Avro format saving/loading plugin</a> as well as a <a href="https://github.com/sourmash-bio/sourmash_plugin_load_urls">load-sketches-from-URIs plugin based on fsspec</a>.</p>
<p>I'm <a href="https://github.com/sourmash-bio/sourmash/pull/2438">currently working</a> on adding support for new command-line subcommands. The idea is that you would be able to add new commands under <code>sourmash scripts</code> (a provisional name).</p>
<h2>How did we implement plugin support?</h2>
<p>You can see <a href="https://github.com/sourmash-bio/sourmash/pull/2428">the first plugin PR here, in sourmash#2428</a>, but the tl;dr is: we used <a href="https://docs.python.org/3/library/importlib.metadata.html"><code>importlib.metadata</code></a> to support plugins via <a href="https://setuptools.pypa.io/en/latest/userguide/entry_point.html">entry points</a>.</p>
<p>The code to support plugins is pretty minimal, and currently resides in <a href="https://github.com/sourmash-bio/sourmash/blob/latest/src/sourmash/plugins.py">sourmash.plugins</a>. It looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># load entry points.</span>
<span class="n">_plugin_load_from</span> <span class="o">=</span> <span class="n">entry_points</span><span class="p">(</span><span class="n">group</span><span class="o">=</span><span class="s1">'sourmash.load_from'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_load_from_functions</span><span class="p">():</span>
<span class="s2">"Load the 'load_from' plugins and yield tuples (priority, name, fn)."</span>
<span class="c1"># Load each plugin,</span>
<span class="k">for</span> <span class="n">plugin</span> <span class="ow">in</span> <span class="n">_plugin_load_from</span><span class="p">:</span>
<span class="n">loader_fn</span> <span class="o">=</span> <span class="n">plugin</span><span class="o">.</span><span class="n">load</span><span class="p">()</span>
<span class="c1"># get 'priority' if it is available</span>
<span class="n">priority</span> <span class="o">=</span> <span class="nb">getattr</span><span class="p">(</span><span class="n">loader_fn</span><span class="p">,</span> <span class="s1">'priority'</span><span class="p">,</span> <span class="n">DEFAULT_LOAD_FROM_PRIORITY</span><span class="p">)</span>
<span class="c1"># retrieve name (which is specified by plugin?)</span>
<span class="n">name</span> <span class="o">=</span> <span class="n">plugin</span><span class="o">.</span><span class="n">name</span>
<span class="k">yield</span> <span class="n">priority</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">loader_fn</span>
</code></pre></div>
<p>Then, in the <code>pyproject.toml</code> of a Python package, anyone can state that there's a sourmash plugin available like so:</p>
<div class="highlight"><pre><span></span><code><span class="k">[project.entry-points."sourmash.load_from"]</span>
<span class="na">a_reader</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"module_name:load_sketches"</span>
<span class="k">[project.entry-points."sourmash.save_to"]</span>
<span class="na">a_writer</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">"module_name:SaveSignatures_WriteFile"</span>
</code></pre></div>
<p>and this will get automatically loaded and used by sourmash.</p>
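<p>For concreteness, here's roughly what such a plugin module might look like on the other side of the entry point. This is a hedged sketch, not canonical sourmash API: the module and function names echo the <code>pyproject.toml</code> snippet above, and the <code>priority</code> attribute is just the optional hook that <code>get_load_from_functions()</code> retrieves via <code>getattr</code>.</p>
<div class="highlight"><pre><span></span><code># module_name.py -- a hypothetical 'load_from' plugin module.
# Only the general shape is shown; see the sourmash developer docs
# for the exact signature a loader function must implement.

def load_sketches(location, *args, **kwargs):
    # inspect 'location' and return loaded sketches if this loader
    # applies; otherwise signal failure so other loaders get a turn
    # (an assumption about the protocol, not a documented contract)
    ...

# optional attribute read by get_load_from_functions() via getattr()
load_sketches.priority = 75
</code></pre></div>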
<h2>How do plugins fit into the sourmash ecosystem?</h2>
<p>We have an interesting lab-centric / lab-adjacent ecosystem developing around sourmash.</p>
<p>sourmash itself provides a reasonably rich Python and Rust API, for people wanting to do clever things with it. For example, the <a href="https://www.biorxiv.org/content/10.1101/2022.11.02.514947v1">branchwater software</a> is a fairly small script for doing parallel search of many genomes, built on top of the Rust library.</p>
<p>There are workflows that make use of sourmash to do cool things, like characterizing metagenomes (<a href="https://dib-lab.github.io/genome-grist/">genome-grist</a>) and decontaminating databases (<a href="https://github.com/dib-lab/charcoal/">charcoal</a>). These (and other) workflows wrap sourmash in a larger workflow (snakemake, nextflow, CWL, ...?) to do various things.</p>
<p>There's also a nascent <a href="https://github.com/Arcadia-Science/sourmashconsumr">R library, sourmashconsumr</a> being built by Taylor Reiter (and others) at Arcadia Science, for taking the output of sourmash and doing nice things with it.</p>
<p>Taylor is also developing code to load sourmash <code>compare</code> and <code>gather</code> output into MultiQC (see <a href="https://github.com/ewels/MultiQC/issues/1805">issue</a>). This is in effect using sourmash as a plugin for <em>other</em> software.</p>
<p>And now, sourmash plugins add a nice set of opportunities to diversify sourmash internal functionality to this ecosystem. It will be interesting to watch what happens as we build out this functionality!</p>
<h2>How do we support plugin developers?</h2>
<p>An important aspect of plugins is supporting plugin developers - so, <a href="https://sourmash.readthedocs.io/en/latest/dev_plugins.html">we have some nascent documentation</a>, as well as a <a href="https://github.com/sourmash-bio/sourmash_plugin_template">getting-started template repository</a> to eliminate a lot of the boilerplate.</p>
<p>I'm not 100% sure what to add beyond this, but I find dogfooding it to be a good approach - every time I work on a plugin, I will sand down the sharper corners of our documentation a bit more.</p>
<h2>Where next?</h2>
<p>For now, plugins remain experimental and are not subject to semantic versioning considerations. I'm not sure when that will change, but I want to write a few more plugins before committing to the current interfaces!</p>
<p>I think we probably have room for many more <em>types</em> of plugins. We're thinking about how to enable different taxonomy loading functions, for example; and I have specific need for better manifest/picklist manipulation that I think is amenable to being made a plugin. (The plugin design issue is <a href="https://github.com/sourmash-bio/sourmash/issues/1353">sourmash#1353</a> if you're interested.)</p>
<p>I am also starting to think more about the user experience. How do users find, install, use, debug, and remove plugins? This is all relatively easy if you're a Python developer who is familiar with <code>pip</code> and <code>importlib.metadata</code>, but that's not our user base ;). For now, I've started by adding plugin reporting to <code>sourmash info -v</code>, which at least gives us a chance of figuring out what plugins might be around!</p>
<p>I'm also not quite sure how to manage what I expect to be a flood of small plugins from within my lab. Will we want to have a set of recommended plugins that evolves and matures over time? And how do we avoid massively increasing our maintenance surface? (Simon Willison has some <a href="https://simonwillison.net/2022/Nov/26/productivity/">sage advice for the serial project hoarder that applies here</a>.)</p>
<p>I also have this niggling feeling that I should read through datasette's plugin interface to see what I can learn from all of Simon's hard work and experience...</p>
<p>--titus</p>Reading "Orwell's Roses" by Rebecca Solnit2022-12-31T00:00:00+01:002022-12-31T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2022-12-31:/blog/2022-reading-orwells-roses.html<p>This is a good book!</p><p>Happy New Year's Eve!</p>
<p>So, one of my resolutions for 2023 is that I want to do more
non-escapist reading.</p>
<p>Why? And what had I been reading??</p>
<p>For the last three years I've been reading a lot of trashy
books. Unless it was for work (biology/bioinformatics papers) or
random infovore articles that I found online, I've read almost nothing
but mystery novels, romance novels, and LitRPG. (Don't judge, it's
been a weird three years. ;)</p>
<p>Now, LitRPG is all well and good (He Who Fights with Monsters is super
fun!) and I had plenty of reasons to escape, but I was avoiding
anything requiring an attention span, and my stack of Good But Serious
Books was piling up. Every now and then I'd get a chance to read
something more serious and I'd remember how much fun it was to read
something that was well written and meaningful and horizon-expanding,
but soon enough my attention span would lapse and I'd be back to
reading LitRPG at 9pm.</p>
<p>SO.</p>
<p>Sometime in the last year, I picked up
<a href="https://www.penguinrandomhouse.com/books/607057/orwells-roses-by-rebecca-solnit/">Orwell's Roses</a>
at our <a href="https://avidreaderbooks.com/">local bookstore, Avid Reader</a>. I
had been introduced to Rebecca Solnit's writing through her amazing
book
<a href="https://www.amazon.com/Paradise-Built-Hell-Extraordinary-Communities/dp/0143118072">A Paradise Built in Hell: The Extraordinary Communities That Arise in Disaster</a>,
which Tracy Teal had recommended to me. While I hadn't (haven't yet!)
finished that book, the parts that I had read were amazing. Sometime
in 2022 I started following
<a href="https://twitter.com/rebeccasolnit">@RebeccaSolnit</a> on Twitter,
perhaps because I saw her retweeted by
<a href="https://twitter.com/m_older">Malka Older (@m_older)</a>, and I found
Solnit's tweets and articles inspirational. And through her tweets I
found her book
<a href="https://www.haymarketbooks.org/books/791-hope-in-the-dark">Hope In the Dark: Untold Histories, Wild Possibilities</a>. That, in turn, became my end-of-quarter
reading for December (and maybe more on that book later!)</p>
<p>This is all to say that for about a year, Orwell's Roses had been
staring at me from my bookshelf. With its bright red cover, it's a
distinctive book. Moreover, I've been a huge fan of Orwell's writing
ever since reading
<a href="https://en.wikipedia.org/wiki/Down_and_Out_in_Paris_and_London">Down and Out in Paris and London</a>
in my teens, and several scenes from that book remain burned into my
memory. And so I decided that <em>Orwell's Roses</em> would be my next book
to read!</p>
<h2>What's "Orwell's Roses" about?</h2>
<p>It's a lovely, meandering tale about Orwell's life and beliefs, and how
his politics intersected with his love of gardening and farming. Along the
way we are treated to extended (and very relevant!) digressions into
other intersecting stories. It's kind of a biography, but also a partial
history of certain kinds of thinking.</p>
<p>Two particular themes caught me the most. One was the discussion of
"Bread and Roses", a
<a href="https://en.wikipedia.org/wiki/Bread_and_Roses">political slogan linked to women's suffrage</a>. The
"bread" here refers to the basic needs of sustenance - food, water,
housing, and the like. The "roses" refers to something a bit more
indefinite - the freedom to pursue an independent life of the mind,
whether it be art, music, literature, or something else. I can't
possibly do justice to the discussion of this in the book, other than
to say that the theme of "Bread for all, and Roses too" resonates in
this age of COVID, r/antiwork, and union organizing.</p>
<p>The other theme that caught me is that of locality vs uniformity, or
community vs systems, or bottom-up vs top-down. This is maybe a bit
more entwined with my personal interests, and richer in my mind and
hence harder to explain -- but it is also a theme that is guiding a
lot of my reading choices, so I expect to get lots of practice
thinking and maybe writing about it!</p>
<p>In brief, Solnit describes how Orwell took great pleasure in the
particulars of gardening and farming, and grounded himself in the
daily routines of his life. Solnit does a lovely job of connecting
this to Orwell's writing, and talked about how taking joy in both
productive and "non-productive" tasks (such as planting roses!) is an
important and small-scale rebellion against the cult of productivity
and grind that capitalism has instilled in our modern life. There
were strong resonances with the book
<a href="https://en.wikipedia.org/wiki/Seeing_Like_a_State">Seeing Like a State: How Certain Schemes to Improve the Human Condition Have Failed</a>,
as well as the
<a href="https://pluralistic.net/2020/07/14/poesy-the-monster-slayer/#stay-on-target">"chickenization" theme</a>
of gig working introduced to me by Cory Doctorow. (In my imagination,
Solnit and Doctorow have very productive regular chats over coffee.
They live in the same city, I think, so it's a possibility, right? Oh to
be a fly on the wall!)</p>
<p>I don't really have a conclusion here, other than that <em>Orwell's Roses</em>
was a very rewarding read that made me think, and think differently.
And it was a great book to break my fast on!</p>
<h2>What book is next?</h2>
<p>I'm planning to start on a new book today. I'm currently being eyed by
<a href="https://milkweed.org/book/braiding-sweetgrass">Braiding Sweetgrass: Indigenous Wisdom, Scientific Knowledge and the Teachings of Plants</a>,
by Robin Wall Kimmerer, which has been sitting on my shelf for almost
a year... it looks like a thick book, but I'll just take it 30 minutes
at a time and we'll see how it goes!</p>
<p>--titus</p>So! You want to search all the public metagenomes with a genome sequence!2022-08-31T00:00:00+02:002022-08-31T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2022-08-31:/blog/2022-sourmash-mastiff.html<p>Searching all the things - faster!</p><p>Imagine you have a (microbial) genome. Or a contig. And you want to
find similar sequences, either in genomes or in metagenomes.</p>
<p>Looking for it in genomes is possible, if not always easy - you can go
to NCBI and do a BLAST of some sort, but BLAST is intended for more
sensitive and shorter matches. But there are other tools, including
<a href="https://sourmash.readthedocs.io/">sourmash</a>, a tool we've been
developing for a few years, that will happily do it for you.</p>
<p>Looking for something in <em>metagenomes</em> is harder. Metagenomes are
hundreds, thousands, or even millions of times larger than genomes,
and doing <em>anything</em> with them quickly is hard. sourmash supports
doing it one metagenome at a time, but it's slow and memory intensive;
<a href="https://serratus.io/">serratus</a> will do it for you using the power of
the cloud, but it will cost you (at least) a few thousand $$.</p>
<p>If you're interested in how we're doing DNA sequence search, here's an
excerpt from
<a href="http://ivory.idyll.org/blog/2022-storing-ulong-in-sqlite-sourmash.html">a previous blog post about using SQLite to store our data</a> -</p>
<blockquote>
<p>The basic idea is that we take long DNA sequences, extract
sub-sequences of a fixed length (say k=31), hash them, and then sketch
them by retaining only those that fall below a certain threshold
value. Then we search for matches between sketches based on number of
overlapping hashes. This is a proxy for the number of overlapping k=31
subsequences, which is in turn convertible into various sequence
similarity metrics.</p>
</blockquote>
<h2>MAGsearch exists! It works! But it's hard to share.</h2>
<p>For a couple of years now, we've had something called
<a href="http://ivory.idyll.org/blog/2021-MAGsearch.html">MAGsearch</a> working
on our own private infrastructure. MAGsearch is sourmash on steroids:
it uses the same underlying Rust library as sourmash and loads and
searches the metagenomes quickly. And it will do all of this on
commodity hardware that many people have access to - a search of up to
a thousand genomes against the SRA takes under 12 GB of RAM, and under
11 hours, using 32 cores.</p>
<p>MAGsearch does a fairly straightforward thing: it loads all the query
genomes into memory and then iteratively loads each of ~700,000
metagenome sketches, reporting any overlaps. It does so in parallel,
which is why it's so fast - doing this with sourmash would take about
40 times as long, because sourmash isn't parallelized.</p>
<p>One problem with MAGsearch is that it's not real time. 10 hours is
great!!, especially for 1000 genomes, but that's still only about two
genomes a minute. And it's too slow for us to provide MAGsearch as a
service.</p>
<p>Another problem is that the underlying data is about 10 TB at the
moment, and we don't really have a way to share that data.</p>
<p>So we've been using MAGsearch a fair bit over the last two years to do
searches for others, but it's always done in a kind of batch mode
where we run it in between other things we're doing.</p>
<h2>Enter 'mastiff' - using RocksDB to do things faster</h2>
<p>For the
<a href="https://usermeeting.jgi.doe.gov/agenda/">2022 JGI User Meeting</a>
Dr. Luiz Irber was invited to talk about his MAGsearch work, and he
got inspired to try out an alternative solution.</p>
<p>He decided to implement an inverted index using
<a href="http://rocksdb.org/">RocksDB</a>, an embeddable database. I haven't dug
into <a href="https://github.com/sourmash-bio/mastiff">the implementation</a>,
but I believe mastiff uses individual hashes as keys and stores a
vector of dataset IDs as values. A search, then, looks up each query
hash as a key and tallies the dataset IDs stored in the values,
reporting the datasets with sufficient estimated overlap.</p>
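<p>To make the idea concrete - and only the idea; the real thing is Rust on top of RocksDB - here is a toy in-memory inverted index in Python. All names here are mine, not mastiff's:</p>
<div class="highlight"><pre><span></span><code>from collections import defaultdict

# toy inverted index: hashval -> list of dataset IDs containing it
index = defaultdict(list)

def add_dataset(dataset_id, hashes):
    for hashval in hashes:
        index[hashval].append(dataset_id)

def search(query_hashes, min_overlap=1):
    # tally, per dataset, how many of the query hashes it contains
    counts = defaultdict(int)
    for hashval in query_hashes:
        for dataset_id in index.get(hashval, ()):
            counts[dataset_id] += 1
    return {d: c for d, c in counts.items() if c >= min_overlap}
</code></pre></div>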
<p>Luiz reported that it took a bit under three weeks to build a RocksDB
index for 500,000 datasets at k=21, scaled=1000. The resulting
database is about 700 GB. He then wrote a Web server to enable queries
against the database.</p>
<h2>mastiff allows real-time search of SRA-scale data sets!</h2>
<p>So... it's fast. Like, really fast.</p>
<p>It's so fast, you can just go try it out yourself - I've put up a
simple notebook
<a href="https://github.com/sourmash-bio/2022-search-sra-with-mastiff/blob/main/interpret-sra-live.ipynb">here</a>
in
<a href="https://github.com/sourmash-bio/2022-search-sra-with-mastiff">this github repo</a>,
and you can run it directly by clicking on the button below:
<a href="https://mybinder.org/v2/gh/sourmash-bio/2022-search-sra-with-mastiff/stable?labpath=interpret-sra-live.ipynb"><img alt="Binder" src="https://mybinder.org/badge_logo.svg"></a></p>
<p>This notebook does the following:</p>
<ul>
<li>downloads some SRA metadata (once)</li>
<li>loads and sketches a Shewanella genome query into a sourmash signature (~45 KB, for a ~5.3 Mbp genome)</li>
<li>serializes the signature and sends it to the mastiff server to search it against the SRA (sketched in code below)</li>
<li>receives the resulting CSV of dataset + containment estimates</li>
<li>interprets the CSV in light of the SRA metadata</li>
</ul>
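<p>In rough outline, the sketch-and-query part of that flow looks something like the code below. This is an approximation rather than the notebook's exact code - in particular, <code>MASTIFF_URL</code> is a placeholder, not the real endpoint - though the sourmash calls are the standard Python API:</p>
<div class="highlight"><pre><span></span><code>import io
import requests   # assumption: a plain HTTP POST to the server
import screed     # the sequence parser sourmash uses
import sourmash

MASTIFF_URL = "https://example.org/search"   # placeholder endpoint

# sketch the query at k=21, scaled=1000, matching the index
mh = sourmash.MinHash(n=0, ksize=21, scaled=1000)
for record in screed.open("shewanella.fa.gz"):
    mh.add_sequence(record.sequence, force=True)

sig = sourmash.SourmashSignature(mh, name="query")

# serialize the signature to JSON and send it off
buf = io.StringIO()
sourmash.save_signatures([sig], buf)
response = requests.post(MASTIFF_URL, data=buf.getvalue())

# the server responds with a CSV of accessions + containment
print(response.text)
</code></pre></div>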
<p>What you'll see at the bottom of the notebook is that this particular
genome tends to show up in freshwater and wastewater.</p>
<p>The cool thing is that you can run your own queries if you like - just
replace the <code>shewanella.fa.gz</code> file references with your own queries
of interest!</p>
<p>(There's also
<a href="https://snakemake.readthedocs.io/">a snakemake workflow</a> to query
mastiff if you want to run many queries, and a mastiff command-line
program that will sketch and query all in one go.)</p>
<h2>What can mastiff be used for?</h2>
<p>MAGsearch is already being used by people for
<a href="https://twitter.com/phiweger/status/1402506165452513283">outbreak analysis</a>
and biogeography studies, among other things. We have a few different
active research projects in the lab that are exploring its utility for
various questions. So we will soon be able to do those things a lot
faster. Yay!</p>
<p>I personally am looking forward to digging into strain dynamics and
content-based alerts of new metagenomes, among other things.</p>
<p>We can also enable other cool projects, including (perhaps most
importantly) things that we didn't think of.</p>
<p>A rule of thumb that I like is that a technology will be most useful
for researchers when a summer undergrad can casually use it to explore
wild-haired ideas and initiate summer projects based on rapidly
generated exploratory results - and I'm really curious to see what we
can enable others to do with this ;). I can imagine that once people
can casually search the SRA with queries, they'll come up with lots of
ideas and make lots of discoveries. (Of course, lots of follow-up work
would be needed, too - chasing down what detection of a genome in a
metagenome means <em>biologically</em> is tough!)</p>
<p>It has not escaped our notice that this can be used for much smaller
databases, too. So we're looking forward to enabling real-time search
of all the NCBI microbial genomes, as well as ..well, whatever we can
get our hands on :).</p>
<p>mastiff will eventually (see below, "Whither mastiff?") be integrated
into sourmash and/or robustified, and then it will support private
databases, too.</p>
<h2>Well, but wait, you said "real-time"</h2>
<p>Right, I did - it takes between 2 and 10 seconds to do a search, and
IIRC the server can handle up to 200 simultaneous queries at a time.</p>
<p>And I've gotta be honest... at first I missed the point that this was
real-time. And web-enabled.</p>
<p>I was describing it to some collaborators, and while I was describing
it I realized, oh, cool, we can actually do this all in JavaScript via
WebAssembly too, of course.</p>
<p>So, also coming eventually (if not, like, tomorrow), I expect we will
provide a Web site where you can sketch a genome client-side (e.g. in
the browser - see
<a href="https://github.com/sourmash-bio/sourmash/issues/1973">sourmash#1973</a>),
and then receive near-instantaneous reporting on similarities to any
known genome as well as presence within public metagenomes.</p>
<p>And, once various things are worked out &lt;waves hands about
infrastructure and sustainability and cost&gt;, I hope we can provide
this as a generic service for others to use.</p>
<p>So that seems neat, right?</p>
<h2>Cautions, reservations, and limitations</h2>
<p>There are a few things you should know before you get too excited. I
mean, you should totally be excited, but... read on.</p>
<p>First, this is a proof of concept. It shows it can be done, but it is
not (yet) something that anyone other than Luiz can run! Engineering
and testing and releasing needs to happen, and that will take time.</p>
<p>Second, there are reasonably significant limitations to this on the
scientific side. The search will only work out to about
<a href="https://github.com/sourmash-bio/sourmash/issues/1859">90% average nucleotide identity (ANI) - a containment of .01-.05</a>,
which means you can robustly find matches out to the genus level, but
not beyond. That's a limitation of nucleotide k-mers and it's
something we're working on.</p>
<p>Small-ish queries also don't work well - we can robustly find exact
matches to 10kb chunks of sequence, but not shorter.</p>
<p>Third, mastiff is mostly designed around searching for <em>small</em>
queries. Query times should scale approximately linearly with the
query size. Luiz has limited the server to a 5MB query for this
reason.</p>
<p>And last but by no means least, this is <em>not</em> the entire SRA, it's
only about 480,000 records (of about 700,000). We'll update it
eventually, but for now it's a sufficient proof of concept ;).</p>
<h2>Whither mastiff?</h2>
<p>We (mostly Luiz ;) are working to integrate mastiff functionality into
sourmash. There's a pretty wide gap between a proof-of-concept
implementation and mature, robust, end-user-usable software, of
course, but we know how to do it.</p>
<p>There's probably other super cool back-end approaches we could use,
and we'd love to talk to you about them if you're interested in trying
out alternative implementations. At this point we have a fairly good
understanding of the conceptual operations and can even convey them to
you in functioning code snippets :).</p>
<p>I also gotta tell you that we don't know how to support this kind of
work exactly. This developed out of Luiz's thesis work but is now done
on a volunteer basis by him. JGI is supporting the server development
for a year (thanks!!) but we are a bit bottlenecked on UX support and
backend/frontend development. So
<a href="mailto:ctbrown@ucdavis.edu,lcirberjr@ucdavis.edu">drop us a line</a> if
you've got some spare change - we'd be looking for 3-5 years of
support.</p>
<p>(I'd be interested in exploring governance and sustainability issues
around this kind of thing, too.)</p>
<h2>Acknowledgements</h2>
<p>The interpretation and understanding of MAGsearch results has been
tremendously helped by work from Dr. Tessa Pierce-Ward (ANI),
Dr. Adrian Viehweger (pathogen outbreaks), Dr. Jessica Lumian
(biogeography), Dr. Christy Grettenberger (biogeography and more), and
others. Thank you!!</p>Announcing ribbity - a hacky project to build Web sites from GitHub issue trackers2022-05-23T00:00:00+02:002022-05-23T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2022-05-23:/blog/2022-announcing-ribbity-github-issue-munging.html<p>Munging GitHub issue trackers for fun!</p><p>For the last few weeks, I've been hacking on a new passion microproject on the side, code-named <code>ribbity</code>.</p>
<p>ribbity is the software that builds the <a href="https://sourmash-bio.github.io/sourmash-examples/">sourmash-examples Web site</a>, by producing a <a href="https://www.mkdocs.org">mkdocs</a> site from the <a href="https://github.com/sourmash-bio/sourmash-examples/issues/">sourmash-examples issue tracker</a>.</p>
<p>In brief, ribbity takes issue descriptions from GitHub and puts them in Markdown files so you can run mkdocs :).</p>
<p>You can see the install and config documentation for ribbity <a href="https://ribbity-org.github.io/ribbity-docs/">here</a>.</p>
<h2>Why oh why would you do this?</h2>
<p>You might well ask... why not "just build a Markdown site", maybe with pull requests? A few reasons -</p>
<h3>The GitHub issue tracker is awesome</h3>
<p>First, I really like using GitHub issue trackers to organize resources and notes. For example, the <a href="https://github.com/dib-lab/sourmash/issues">sourmash issue tracker</a> is my "external brain" for all things related to sourmash and genome comparison. I also have several private repos that I use to organize link collections.</p>
<p>Most specifically, I really love the "backlinks" feature of github (where when you refer to issue A from issue B, issue A receives a pointer back to issue B) - this was in the original <a href="https://en.wikipedia.org/wiki/Project_Xanadu">Project Xanadu</a> plan for interlinked hypertext documents, but it never really made it into the Web. It's awfully handy.</p>
<p>Here, the ability to see backlinks from private repos into public repos is particularly lovely!</p>
<h3>Flexible organization and commenting</h3>
<p>I also really like the labeling (categorization) and commenting functionality of github.</p>
<p>Moreover, github has very nice Markdown support, along with a usable editor. And, while writing Markdown in a Web browser is not my most favorite of activities, it sure is nice to be able to do it in a pinch. But more importantly I can write Markdown in a <a href="https://hackmd.io/">hackmd page</a> and then copy/paste it into a github issue - this is an <a href="https://github.com/sourmash-bio/sourmash/issues/1968">increasingly common workflow</a> for me!</p>
<h3>Flexible authentication and notifications</h3>
<p>I really like (and use heavily) github's auth and notification systems. You can enable and disable access to repositories, watch specific issues and silence others, lock issues, block people from posting, etc. etc.</p>
<p>I need auth and notifications, but I'm not interested in doing any of
that myself. Building on top of all of that is a nice simplification.</p>
<h3>GitHub as a platform</h3>
<p>More generally, I really like how GitHub is becoming a platform for stuff; you can see an earlier project of mine here, <a href="http://ivory.idyll.org/blog/2019-github-project-reporting.html">Using GitHub for janky project reporting - some code</a>.</p>
<p>Other inspirational projects in this space include <a href="https://utteranc.es/">utteranc.es</a>, which builds a blog commenting platform on top of github; and <a href="https://angeliqueweger.com/blog/2021/love-letter-to-lftm/">Coraline's "low-friction project management"</a> site. And, while I don't specifically use <a href="https://datasette.io/">datasette</a> (yet) in any way, it has been a major conceptual contributor to the idea that hosting things statically is a great idea :).</p>
<p>(If you know of other github-based hacks like this, please drop them in the comments or <a href="https://twitter.com/ctitusbrown/">ping me on Twitter!</a>)</p>
<h3>mkdocs static site hosting is simple, esp via github pages</h3>
<p>mkdocs produces static sites, and static sites are awesome! (inspiration from <a href="https://datasette.io/">datasette</a> here, again.) No complicated databases, or authentication, or nasty JavaScript creeping across my pages. (Side note: I don't know JavaScript.)</p>
<p>Also, github pages sure is easy (and mkdocs natively supports deploying to github pages).</p>
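<p>Deploying is literally a one-liner: <code>mkdocs gh-deploy</code> builds the site and pushes it to the <code>gh-pages</code> branch.</p>
<div class="highlight"><pre><span></span><code>mkdocs gh-deploy
</code></pre></div>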
<p>And of course you can host mkdocs sites in many places. So it's pretty flexible and enabling to build on top of mkdocs.</p>
<h2>But does it, like, enable anything cool?</h2>
<p>One of the prime proximal motivations for building ribbity was the <a href="https://github.com/sourmash-bio/sourmash/issues/2054">increasing complexity of the sourmash documentation</a>, which is in danger of becoming sprawling and labyrinthine.</p>
<p>I really like the idea of a set of documentation that is explicitly intended to be explored and searched in a non-linear way.</p>
<p>That's how I use github issues in practice.</p>
<p>So it seemed natural to try out something new that strips away some of the complexity of the github interface and makes it customizable.</p>
<p>And I'm pretty happy with the resulting <a href="https://sourmash-bio.github.io/sourmash-examples/">sourmash examples</a> Web site!</p>
<p>In particular, it has really lowered the barrier to contribution for me, personally. I don't have to worry about pull requests or integrating new examples into a big, complicated doc site in a good way - I just throw a new example together, slap a few labels on it, and get on with my day.</p>
<p>In some regards, this is a version of <a href="https://felixge.de/2013/03/11/the-pull-request-hack/">the pull request hack</a>, a contribution model that has always intrigued me. Except instead of giving contributors PR access, they just need to be able to add issues - which, by default, anyone can do on any visible GitHub project!</p>
<h2>How is ribbity implemented??</h2>
<p>It's pretty simple underneath -</p>
<ol>
<li>"pull" GitHub issues into a Python pickle dump.</li>
<li>process the pickle dump into Python objects, salted with a few regexps.</li>
<li>run object model through jinja2 templating to build a <code>docs/</code> directory.</li>
<li>feed <code>docs/</code> directory into mkdocs, which builds a <code>site/</code> directory.</li>
</ol>
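<p>As a hedged sketch of step 1, pulling open issues through the GitHub REST API and pickling them might look like the following. The repo and output filename are illustrative - ribbity's actual code lives in the repo linked below:</p>
<div class="highlight"><pre><span></span><code>import pickle
import requests

# illustrative values, not ribbity's actual configuration
repo = "sourmash-bio/sourmash-examples"
url = f"https://api.github.com/repos/{repo}/issues"

# step 1: "pull" issues, paging through the REST API
issues = []
page = 1
while True:
    resp = requests.get(url, params={"state": "open",
                                     "per_page": 100, "page": page})
    resp.raise_for_status()
    batch = resp.json()
    if not batch:
        break
    issues.extend(batch)
    page += 1

with open("issues.pickle", "wb") as fp:
    pickle.dump(issues, fp)
</code></pre></div>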
<p>I've layered on some tests and some Python package stuff and some CLI, but the core code is pleasingly simple - under 400 lines of code, including spaces and comments.</p>
<h2>Whither ribbity?</h2>
<p>A few people have looked at ribbity and gone ...whoa. I want that! So that's nice and validating!</p>
<p>In particular, there's been some enthusiasm amongst colleagues about having a different interface to github issue trackers. One specific motivation is that the responsive search offered by the default mkdocs interface is nice! And I could see an argument for aggregating together multiple issue trackers in a single site, which is a use case some colleagues are interested in.</p>
<p>Basically I see a lot of enthusiasm around specific, customizable hackage of github things.</p>
<p>But... I dunno. There's quite some space between a minimal "this is useful! and limited enough that we can keep it working!" approach, and a janky, badly reimplemented version of everything the github Web site already offers. I'm leaning more towards the former, because I think that's achievable and offers specific utility. But I also have a lot of ideas for how to do ribbity-like things in other directions (Watch This Space!)</p>
<p>If I had to guess, I think my personal interest in ribbity will evolve in the following ways:</p>
<ul>
<li>I'll work to push more of ribbity's text munging functionality into jinja2, and make the github download a more complete (and more standardized!) version of the issue repo.</li>
<li>this will in turn push the core ribbity into being a simple merge of (a) jinja2 templates overlaid on (b) a github object model.</li>
<li>if I can get the primitives right, this would then make it easy to build custom overlays on github issue trackers entirely in jinja2.</li>
</ul>
<p>And that actually seems pretty maintainable to me.</p>
<p>Then the current ribbity functionality would just be a specific set of templates we use to build a particular kind of Web site. And new functionality or different issue tracker overlays could be built entirely in jinja2.</p>
<p>But, who knows? I'm definitely not committing to anything; just playing around for now.</p>
<p>That having been said, I'm thinking about applying ribbity to building a directory of training resources, and throwing it at the newsletter problem, and a colleague is using it for their own examples site. So we'll see!</p>
<h2>What other fun experiences did you want to relate?</h2>
<p>This was my first experience with <a href="https://docs.python.org/3/library/dataclasses.html">Python dataclasses</a>! Super cool! Code <a href="https://github.com/ribbity-org/ribbity/blob/main/ribbity/objects.py">here</a>.</p>
<p>(A colleague in the lab, Tessa Pierce, started using them <a href="https://sourmash.readthedocs.io/">over in sourmash</a>, and that finally motivated me to move on from namedtuples or straight up bare Python objects.)</p>
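<p>For flavor, here's a minimal sketch of the kind of thing dataclasses buy you - the field names are made up, not ribbity's actual object model:</p>
<div class="highlight"><pre><span></span><code>from dataclasses import dataclass, field

@dataclass
class Issue:
    # illustrative fields only; see ribbity/objects.py for the real ones
    number: int
    title: str
    body: str
    labels: list = field(default_factory=list)
</code></pre></div>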
<p>This was also my first parsing experience with <a href="https://toml.io/en/">TOML</a>, which is pretty nice! And I found the <a href="https://github.com/hukkin/tomli">tomli</a> parser to be easy to use, and thought the <a href="https://peps.python.org/pep-0680/">tomllib PEP</a> was really great.</p>
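<p>(The tomli API is pleasingly small; the config filename below is made up, and note that <code>tomli.load</code> wants a file opened in binary mode:)</p>
<div class="highlight"><pre><span></span><code>import tomli

with open("ribbity.toml", "rb") as fp:   # binary mode is required
    config = tomli.load(fp)
</code></pre></div>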
<h2>Concluding thoughts</h2>
<p><a href="https://github.com/ribbity-org/ribbity">ribbity</a> is open source - BSD 3-clause!</p>
<p><a href="https://github.com/ribbity-org/ribbity/issues">Please file issues</a> if you have ideas for how you might want to use ribbity!</p>
<p>Pull requests are welcome, but this is a side project, so unless they're fairly minimal or accompanied by good, clear, obvious tests, I might defer them as "too much brain needed". I encourage forking and experimentation!</p>
<p>Your thoughts welcome!</p>
<p>--titus</p>The second Common Fund Data Ecosystem hackathon - May 9-13, 2022!2022-05-01T00:00:00+02:002022-05-01T00:00:00+02:00Rayna Harris and Jessica Lumiantag:ivory.idyll.org,2022-05-01:/blog/2022-second-cfde-hackathon.html<p>We're running another hackathon!</p><p>We are pleased to announce that the <a href="http://nih-cfde.org/">NIH Common Fund Data Ecosystem</a> will be hosting a hackathon on <a href="https://commonfund.nih.gov/">NIH Common Fund</a> data sets from May 9 - 13! This follows on our first hackathon (<a href="http://ivory.idyll.org/blog/2022-feb-hackathon.html">see recap blog post</a>).</p>
<p>This hackathon has both synchronous and asynchronous work, with concentrated hackathon sessions on specific data sets and co-working sessions on Thursday. Participants can attend whichever hackathon sessions they are interested in. There is no minimum work requirement, all are welcome to participate as much or as little as schedules and interest allow!</p>
<p>See our schedule and find more information about this event here: <a href="https://nih-cfde.github.io/2022-may-hackathon/">https://nih-cfde.github.io/2022-may-hackathon/</a></p>
<p>Register for the hackathon here: <a href="https://www.nih-cfde.org/events/may-2022-hackathon/">https://www.nih-cfde.org/events/may-2022-hackathon/</a></p>
<p>Hackathon Benefits:</p>
<ul>
<li>Gain experience with Common Fund data sets and have access to data set curators!</li>
<li>See an immediate product from a short burst of concentrated effort!</li>
<li>Meet researchers with common interests and potentially spur collaborations or funding efforts!</li>
</ul>
<h2>Common Fund Session Details</h2>
<h3><a href="https://kidsfirstdrc.org/">Gabriella Miller Kids First Pediatric Research Program</a></h3>
<p>The goal of the Gabriella Miller Kids First Pediatric Research Program is to help researchers uncover new insights into the biology of childhood cancer and structural birth defects, including the discovery of shared genetic pathways between these disorders.</p>
<p>Kids First will host a session on accessing and using federated Common Fund Data Ecosystem graph data through the Kids First-Human BioMolecular Atlas Program graph database with an API.</p>
<h3><a href="https://sparc.science/">Stimulating Peripheral Activity to Relieve Conditions</a></h3>
<p>The Common Fund’s Stimulating Peripheral Activity to Relieve Conditions (SPARC) program accelerates development of therapeutic devices that modulate electrical activity in nerves to improve organ function.</p>
<p>SPARC will host a session on providing information on access to SPARC resources via the SPARC portal and associated APIs.</p>
<h3><a href="https://portal.hmpdacc.org/">Human Microbiome Project</a></h3>
<p>The Human Microbiome project has DNA sequencing data to characterize the microbiome in healthy adults and people with specific microbiome-associated diseases. It also contains integrated datasets with multiple biological projects from the microbiome and host over time for specific microbiome associated diseases.</p>
<p>A session on Human Microbiome Project data will involve obtaining this data from the Common Fund Data Ecosystem search portal and working with it using Amazon Web Services.</p>
<h3><a href="https://app.nih-cfde.org/">Common Fund Data Ecosystem Search Portal</a></h3>
<p>The Common Fund Data Ecosystem Coordinating Center supports efforts to make Common Fund data sets more findable, accessible, interoperable, and reusable for the scientific community through collaboration, end-user training, and data set sustainability. </p>
<p>The Common Fund Data Ecosystem Portal Demonstration will be a demonstration session on how to access data in the Portal.</p>
<h3>Introduction to R for RNA-Seq Analysis Workshop</h3>
<p>RNA-Sequencing (RNA-Seq) is a popular method for determining the presence and quantity of RNA in biological samples. In this 3 hour workshop, we will use R to explore publicly-available RNA-Seq data from the Gene Expression Tissue Project (GTEx). Attendees will be introduced to the R syntax, variables, functions, packages, and data structures common to RNA-Seq projects. We will use RStudio to import, tidy, transform, and visualize RNA-Seq count data. Attendees will learn tips and tricks for making the processes of data wrangling and data harmonization more manageable. This workshop will not cover cloud-based workflows for processing RNA-seq reads or statistics and modeling because these topics are covered in our RNA-Seq Concepts and RNA-Seq in the Cloud workshops. Rather, this workshop will focus on general R concepts applied to RNA-Seq data. Familiarity with R is not required but would be useful.</p>
<h2>Participant Skill Level:</h2>
<p>The hackathon is open to the public, and anyone can attend. Despite the name “hackathon”, participants don’t need to be experts in computer science! The most important criterion is interest in the data sets, and some familiarity with the command line and GitHub is helpful but not required.</p>
<p>See our schedule and find more information about this event here: <a href="https://nih-cfde.github.io/2022-may-hackathon/">https://nih-cfde.github.io/2022-may-hackathon/</a></p>
<p>Register for the hackathon here: <a href="https://www.nih-cfde.org/events/may-2022-hackathon/">https://www.nih-cfde.org/events/may-2022-hackathon/</a></p>
<p>Please don’t hesitate to contact <a href="mailto:training@cfde.atlassian.net">training@cfde.atlassian.net</a> with any questions!</p>Storing 64-bit unsigned integers in SQLite databases, for fun and profit2022-04-22T00:00:00+02:002022-04-22T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2022-04-22:/blog/2022-storing-ulong-in-sqlite-sourmash.html<p>Storing unsigned longs in SQLite is possible, and can be fast.</p><h2>The problem: storing <em>and querying</em> lots of 64-bit unsigned integers</h2>
<p>For the past ~6 years, we've been going down quite a rabbit hole with hashing-based sequence search, using a <a href="https://en.wikipedia.org/wiki/MinHash">MinHash</a>-derived approach called FracMinHash. (You can read more about FracMinHash <a href="https://www.biorxiv.org/content/10.1101/2022.01.11.475838">here</a>, but it's essentially a bottom-sketch version of ModHash.) This is all implemented in <a href="https://sourmash.readthedocs.io/">the sourmash software</a>, a Python and Rust-based command-line bioinformatics toolkit.</p>
<p>The basic idea is that we take long DNA sequences, extract sub-sequences of a fixed length (say k=31), hash them, and then sketch them by retaining only those that fall below a certain threshold value. Then we search for matches between sketches based on number of overlapping hashes. This is a proxy for the number of overlapping k=31 subsequences, which is in turn convertible into various sequence similarity metrics.</p>
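<p>Here's a minimal sketch of that idea in Python, under some loudly-stated assumptions: sourmash actually uses MurmurHash3 on canonically-oriented k-mers, while this toy version skips reverse-complement handling and borrows a 64-bit hash from the standard library:</p>
<div class="highlight"><pre><span></span><code>import hashlib

def hash_kmer(kmer):
    # map a k-mer to a 64-bit unsigned integer (a stand-in for
    # the MurmurHash3 function sourmash really uses)
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "little")

def fracminhash(sequence, ksize=31, scaled=1000):
    # keep only hashes below 2**64 / scaled, i.e. roughly a
    # fraction 1/scaled of all k-mers
    max_hash = 2 ** 64 // scaled
    sketch = set()
    for i in range(len(sequence) - ksize + 1):
        hashval = hash_kmer(sequence[i:i + ksize])
        if hashval &lt; max_hash:
            sketch.add(hashval)
    return sketch

# overlap between two sketches estimates overlap between the full
# k-mer sets; containment(A, B) = size(A intersect B) / size(A)
</code></pre></div>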
<p>The scale of the problems we're tackling is pretty big. As one example, we have a database (Genbank bacterial) with 1.15 million buckets of hashes, containing a total of 4.6 billion hashes across these buckets (representing approximately 4.6 trillion original k-mers). So we need to do moderately clever things to store them and search them quickly.</p>
<p>We already have a variety of formats for storing and querying sketch collections, including straight-up zip files that contain JSON-serialized sketches, a custom disk-based <a href="https://www.nature.com/articles/nbt.3442">Sequence Bloom Tree</a> implementation, and an inverted index that lives in memory. The inverted index turns out to be fast once loaded, but serialization is ...not that great, and memory consumption is very high. This is something I wanted to fix!</p>
<p>I've had <a href="http://ivory.idyll.org/blog/storing-and-retrieving-sequences.html">a long-time love of SQLite</a>, the tiny little embedded database engine that is just ridiculously fast, and I decided to figure out how to store <em>and query</em> our sketches in SQLite.</p>
<h2>Using SQLite to store 64-bit unsigned integers: a first attempt</h2>
<p>The challenge I faced here was that our sketches are composed of 64-bit unsigned integers, and SQLite <em>does not store</em> 64-bit unsigned ints natively. But this is exactly what I needed!</p>
<p>Enter type converters! I found two really nice resources on automatically converting 64-bit uints into data types that SQLite could handle: <a href="https://stackoverflow.com/questions/57464671/peewee-python-int-too-large-to-convert-to-sqlite-integer">this stackoverflow post, "Python int too large to convert to SQLite INTEGER"</a>, and this great <a href="https://wellsr.com/python/adapting-and-converting-sqlite-data-types-for-python/">tutorial from wellsr.com, Adapting and Converting SQLite Data Types for Python</a>.</p>
<p>In brief, I swiped code from the stackoverflow answer to do the following:</p>
<ul>
<li>write a function that, for any hash value larger than 2**63-1, converts numbers into a hex string;</li>
<li>write the opposite function that converts hex strings back to numbers;</li>
<li>register these functions as adapters on a SQLite data type to automatically run for every column of that type.</li>
</ul>
<p>This works because SQLite has a really flexible internal typing system where it can store basically anything as a string, no matter the official column type.</p>
<p>The python code looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">MAX_SQLITE_INT</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">**</span> <span class="mi">63</span> <span class="o">-</span> <span class="mi">1</span>
<span class="n">sqlite3</span><span class="o">.</span><span class="n">register_adapter</span><span class="p">(</span>
<span class="nb">int</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">hex</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">if</span> <span class="n">x</span> <span class="o">></span> <span class="n">MAX_SQLITE_INT</span> <span class="k">else</span> <span class="n">x</span><span class="p">)</span>
<span class="n">sqlite3</span><span class="o">.</span><span class="n">register_converter</span><span class="p">(</span>
<span class="s1">'integer'</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">b</span><span class="p">:</span> <span class="nb">int</span><span class="p">(</span><span class="n">b</span><span class="p">,</span> <span class="mi">16</span> <span class="k">if</span> <span class="n">b</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span> <span class="o">==</span> <span class="sa">b</span><span class="s1">'0x'</span> <span class="k">else</span> <span class="mi">10</span><span class="p">))</span>
</code></pre></div>
<p>and when you connect to the database, you can tell SQLite to pay attention to those adapters like so:</p>
<div class="highlight"><pre><span></span><code><span class="n">conn</span> <span class="o">=</span> <span class="n">sqlite3</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="n">dbfile</span><span class="p">,</span>
<span class="n">detect_types</span><span class="o">=</span><span class="n">sqlite3</span><span class="o">.</span><span class="n">PARSE_DECLTYPES</span><span class="p">)</span>
</code></pre></div>
<p>Then you define your tables in SQLite,</p>
<div class="highlight"><pre><span></span><code><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="k">IF</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="n">hashes</span>
<span class="w"> </span><span class="p">(</span><span class="n">hashval</span><span class="w"> </span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span>
<span class="w"> </span><span class="n">sketch_id</span><span class="w"> </span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span>
<span class="w"> </span><span class="k">FOREIGN</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="p">(</span><span class="n">sketch_id</span><span class="p">)</span><span class="w"> </span><span class="k">REFERENCES</span><span class="w"> </span><span class="n">sketches</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="p">))</span>
<span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="k">IF</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="n">sketches</span>
<span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="w"> </span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span>
<span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span>
<span class="w"> </span><span class="p">...)</span>
</code></pre></div>
<p>and you can do all the querying you want, and large integers will be converted into hex strings, and life is good. Right?</p>
<p>This code actually worked fine! Except for one problem.</p>
<p><strong>It was very slow.</strong> One key to making relational databases in general (and SQLite in specific) fast is to use indices, and these INTEGER columns could no longer be indexed as INTEGER columns because they contained hex strings! Which means that once databases got big, well, basically searching and retrieval was too slow to be useful.</p>
<p>This code was perfectly functional and lives on in <a href="https://github.com/sourmash-bio/sourmash/blob/3259fbddf6c33b6093bea2717a4e24642145a32d/src/sourmash/sqlite_index.py">some commits</a>, but it wasn't fast enough to be used for production code.</p>
<p>Unfortunately (or fortunately?), I was now <em>in it</em>. I'd sunk enough time into this problem already, and had enough functioning code and tests, that I decided to keep on going. See: <a href="https://en.wikipedia.org/wiki/Sunk_cost">sunk cost fallacy</a>.</p>
<h2>Storing 64-bit unsigned integers <em>efficiently</em> in SQLite</h2>
<p>I wasn't actually convinced that SQLite could do it efficiently, so <a href="https://twitter.com/ctitusbrown/status/1490695385781661697">I asked on Twitter</a> about alternative approaches. Among a variety of responses, @jgoldschrafe <a href="https://twitter.com/jgoldschrafe/status/1490700497988329485">said something very important that resonated</a>:</p>
<blockquote>
<p>SQLite isn't a performance monster for complex use cases, but should be absolutely fine for this.</p>
</blockquote>
<p>and that gave me the courage to stay the course and work on a SQLite-based resolution.</p>
<p>The next key was an idea that I had toyed with, based on hints <a href="https://sqlite-users.sqlite.narkive.com/hlY9vnL7/sqlite-storing-unsigned-64-bit-values">here</a> and then <a href="https://twitter.com/jgoldschrafe/status/1490701880099590146">confirmed</a> by the still-awesome @jgoldschrafe - I didn't need <em>more</em> than 64 bits, and I just needed to do searching based on equality. So I could convert unsigned 64-bit ints into signed 64-bit numbers, shove them into the database, and do equality testing between a query and the hashvals. As long as I was doing the conversion systematically, it would all work! </p>
<p>I ended up writing two adapter functions that I call in Python code for the relevant values (<em>not</em> using the SQLite type converter registry) -</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">bitstring</span> <span class="kn">import</span> <span class="n">BitArray</span>

<span class="n">MAX_SQLITE_INT</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">**</span> <span class="mi">63</span> <span class="o">-</span> <span class="mi">1</span>
<span class="n">convert_hash_to</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">BitArray</span><span class="p">(</span><span class="n">uint</span><span class="o">=</span><span class="n">x</span><span class="p">,</span> <span class="n">length</span><span class="o">=</span><span class="mi">64</span><span class="p">)</span><span class="o">.</span><span class="n">int</span> <span class="k">if</span> <span class="n">x</span> <span class="o">></span> <span class="n">MAX_SQLITE_INT</span> <span class="k">else</span> <span class="n">x</span>
<span class="n">convert_hash_from</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">BitArray</span><span class="p">(</span><span class="nb">int</span><span class="o">=</span><span class="n">x</span><span class="p">,</span> <span class="n">length</span><span class="o">=</span><span class="mi">64</span><span class="p">)</span><span class="o">.</span><span class="n">uint</span> <span class="k">if</span> <span class="n">x</span> <span class="o"><</span> <span class="mi">0</span> <span class="k">else</span> <span class="n">x</span>
</code></pre></div>
<p>Note here I am using the lovely <a href="https://pypi.org/project/bitstring/">bitstring package</a> so that I don't have to think hard about bit twiddling (although that's a possible optimization now that I have everything locked down with tests).</p>
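<p>(For the curious, here's a minimal sketch of what that bit-twiddling optimization might look like - this is an illustration, not the code in the PR. It relies on plain two's-complement arithmetic: values with the top bit set get shifted down by 2**64 on the way in, and negative values get shifted back up on the way out.)</p>
<div class="highlight"><pre><span></span><code># illustrative sketch only - NOT the code in the PR
MAX_SQLITE_INT = 2 ** 63 - 1

def convert_hash_to(x):
    # reinterpret uint64 as int64: same 64-bit pattern, shifted down by 2**64
    return x - 2 ** 64 if x > MAX_SQLITE_INT else x

def convert_hash_from(x):
    # reinterpret int64 as uint64: shift negative values back up by 2**64
    return x + 2 ** 64 if x &lt; 0 else x
</code></pre></div>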
<p>The SQL schema I am using looks like this:</p>
<div class="highlight"><pre><span></span><code><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="k">IF</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="n">sourmash_sketches</span>
<span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="w"> </span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="p">,</span>
<span class="w"> </span><span class="n">name</span><span class="w"> </span><span class="nb">TEXT</span><span class="p">,</span>
<span class="w">    </span><span class="p">...);</span>
<span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="k">IF</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="n">sourmash_hashes</span><span class="w"> </span><span class="p">(</span>
<span class="w"> </span><span class="n">hashval</span><span class="w"> </span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span>
<span class="w"> </span><span class="n">sketch_id</span><span class="w"> </span><span class="nb">INTEGER</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="p">,</span>
<span class="w"> </span><span class="k">FOREIGN</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="p">(</span><span class="n">sketch_id</span><span class="p">)</span><span class="w"> </span><span class="k">REFERENCES</span><span class="w"> </span><span class="n">sourmash_sketches</span><span class="w"> </span><span class="p">(</span><span class="n">id</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div>
<p>and I also build three indices that correspond to the various kinds of queries I want to do -</p>
<div class="highlight"><pre><span></span><code><span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="k">IF</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="n">sourmash_hashval_idx</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">sourmash_hashes</span><span class="w"> </span><span class="p">(</span>
<span class="w"> </span><span class="n">hashval</span><span class="p">,</span>
<span class="w"> </span><span class="n">sketch_id</span>
<span class="p">);</span>
<span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="k">IF</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="n">sourmash_hashval_idx2</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">sourmash_hashes</span><span class="w"> </span><span class="p">(</span>
<span class="w"> </span><span class="n">hashval</span>
<span class="p">);</span>
<span class="k">CREATE</span><span class="w"> </span><span class="k">INDEX</span><span class="w"> </span><span class="k">IF</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">EXISTS</span><span class="w"> </span><span class="n">sourmash_sketch_idx</span><span class="w"> </span><span class="k">ON</span><span class="w"> </span><span class="n">sourmash_hashes</span><span class="w"> </span><span class="p">(</span>
<span class="w"> </span><span class="n">sketch_id</span>
<span class="p">)</span>
</code></pre></div>
<p>One of the design decisions I made midway through this PR was to allow duplicate hashvals in <code>sourmash_hashes</code> - since different sketches can share hashvals with other sketches, we have to either do things this way, or have another intermediate table that links unique hashvals to potentially multiple sketch_ids. It just seemed simpler to have hashvals be non-unique, and instead build an index for the possible queries. (I might revisit this later, now that I can refactor fearlessly ;).</p>
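<p>For concreteness, here's a rough sketch of what that alternative, normalized design could look like - this is <em>not</em> the schema I used, and the <code>hash_values</code> and <code>hash_to_sketch</code> names are invented for illustration:</p>
<div class="highlight"><pre><span></span><code>import sqlite3

# hypothetical alternative schema - NOT what sourmash uses
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE IF NOT EXISTS sourmash_sketches
   (id INTEGER PRIMARY KEY,
    name TEXT);

-- each distinct hashval is stored exactly once...
CREATE TABLE IF NOT EXISTS hash_values
   (hash_id INTEGER PRIMARY KEY,
    hashval INTEGER NOT NULL UNIQUE);

-- ...and a many-to-many link table connects hashvals to sketches.
CREATE TABLE IF NOT EXISTS hash_to_sketch
   (hash_id INTEGER NOT NULL,
    sketch_id INTEGER NOT NULL,
    FOREIGN KEY (hash_id) REFERENCES hash_values (hash_id),
    FOREIGN KEY (sketch_id) REFERENCES sourmash_sketches (id));
""")
</code></pre></div>
<p>The upside would be a deduplicated hashval table; the downside is an extra join and more complicated inserts, which is the tradeoff I decided against.</p>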
<p>At this point, insertion is now easy:</p>
<div class="highlight"><pre><span></span><code><span class="n">sketch_id</span> <span class="o">=</span> <span class="o">...</span>

<span class="c1"># insert all the hashes</span>
<span class="n">hashes_to_sketch</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">h</span> <span class="ow">in</span> <span class="n">ss</span><span class="o">.</span><span class="n">minhash</span><span class="o">.</span><span class="n">hashes</span><span class="p">:</span>
    <span class="n">hh</span> <span class="o">=</span> <span class="n">convert_hash_to</span><span class="p">(</span><span class="n">h</span><span class="p">)</span>
    <span class="n">hashes_to_sketch</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">hh</span><span class="p">,</span> <span class="n">sketch_id</span><span class="p">))</span>

<span class="n">c</span><span class="o">.</span><span class="n">executemany</span><span class="p">(</span><span class="s2">"INSERT INTO sourmash_hashes (hashval, sketch_id) VALUES (?, ?)"</span><span class="p">,</span>
              <span class="n">hashes_to_sketch</span><span class="p">)</span>
</code></pre></div>
<p>and retrieval is similarly simple:</p>
<div class="highlight"><pre><span></span><code><span class="n">sketch_id</span> <span class="o">=</span> <span class="o">...</span>

<span class="n">c</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">"SELECT hashval FROM sourmash_hashes WHERE sourmash_hashes.sketch_id=?"</span><span class="p">,</span> <span class="p">(</span><span class="n">sketch_id</span><span class="p">,))</span>
<span class="k">for</span> <span class="n">hashval</span><span class="p">,</span> <span class="ow">in</span> <span class="n">c</span><span class="p">:</span>
    <span class="n">hh</span> <span class="o">=</span> <span class="n">convert_hash_from</span><span class="p">(</span><span class="n">hashval</span><span class="p">)</span>
    <span class="n">minhash</span><span class="o">.</span><span class="n">add_hash</span><span class="p">(</span><span class="n">hh</span><span class="p">)</span>
</code></pre></div>
<p>So this was quite effective for storing the sketches in SQLite! I could perfectly reconstruct sketches after a round-trip through SQLite, which was a great first step.</p>
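<p>Here's the kind of round-trip check I mean, written out as a standalone snippet - the specific values are made up, and the converters are the same ones defined above:</p>
<div class="highlight"><pre><span></span><code>import sqlite3
from bitstring import BitArray

MAX_SQLITE_INT = 2 ** 63 - 1
convert_hash_to = lambda x: BitArray(uint=x, length=64).int if x > MAX_SQLITE_INT else x
convert_hash_from = lambda x: BitArray(int=x, length=64).uint if x &lt; 0 else x

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (hashval INTEGER NOT NULL)")

original = [0, 42, 2 ** 63, 2 ** 64 - 1]   # includes values past MAX_SQLITE_INT
db.executemany("INSERT INTO t (hashval) VALUES (?)",
               [(convert_hash_to(h),) for h in original])

recovered = sorted(convert_hash_from(h) for h, in db.execute("SELECT hashval FROM t"))
assert recovered == sorted(original)
</code></pre></div>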
<p>Next question: could I quickly <em>search</em> the hashes as an inverted index? That is, could I find sketches based on querying with hashes, rather than (as above) using <code>sketch_id</code> to retrieve hashes for an already identified sketch?</p>
<h2>Matching on 64-bit unsigned ints in SQLite</h2>
<p>This ended up being pretty simple!</p>
<p>To query with a collection of hashes, I set up a temporary table containing the query hashes, and then do a join on exact value matching. Conveniently, this doesn't care whether the values in the database are signed or not - it just cares if the bit patterns are equal!</p>
<p>The code, for a cursor c:</p>
<div class="highlight"><pre><span></span><code>def _get_matching_sketches(self, c, hashes, max_hash):
    """
    For hashvals in 'hashes', retrieve all matching sketches,
    together with the number of overlapping hashes for each sketch.
    """
    c.execute("DROP TABLE IF EXISTS sourmash_hash_query")
    c.execute("CREATE TEMPORARY TABLE sourmash_hash_query (hashval INTEGER PRIMARY KEY)")

    hashvals = [ (convert_hash_to(h),) for h in hashes ]
    c.executemany("INSERT OR IGNORE INTO sourmash_hash_query (hashval) VALUES (?)",
                  hashvals)

    c.execute("""
       SELECT DISTINCT sourmash_hashes.sketch_id, COUNT(sourmash_hashes.hashval) AS CNT
       FROM sourmash_hashes, sourmash_hash_query
       WHERE sourmash_hashes.hashval=sourmash_hash_query.hashval
       GROUP BY sourmash_hashes.sketch_id ORDER BY CNT DESC
       """)
    return c
</code></pre></div>
<p>As a side benefit, this query orders the results by the size of overlap between sketches, which leads to some pretty nice and efficient thresholding code.</p>
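<p>A sketch of what that thresholding can look like - the function and its name are mine, but the early exit works precisely because of the <code>ORDER BY CNT DESC</code> above:</p>
<div class="highlight"><pre><span></span><code>def matches_above_threshold(cursor, threshold):
    """Yield (sketch_id, cnt) pairs until overlap drops below 'threshold'.

    Relies on the cursor returning rows ordered by CNT, descending.
    """
    for sketch_id, cnt in cursor:
        if cnt &lt; threshold:
            break             # everything after this overlaps even less
        yield sketch_id, cnt
</code></pre></div>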
<h2>Benchmarking!!</h2>
<p>I'll just say that performance is definitely acceptable - the below benchmarks compare sqldb against our other database formats. The database we're searching is a collection of 48,000 sketches with 161 million total hashes - GTDB RS202, if you're curious :).</p>
<p>For 53.9k query hashes, with 19.0k found in the database, the SQLite implementation is nice and fast, albeit with a large disk footprint:</p>
<table>
<thead>
<tr>
<th>db format</th>
<th>db size</th>
<th>time</th>
<th>memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>sqldb</td>
<td>15 GB</td>
<td>28.2s</td>
<td>2.6 GB</td>
</tr>
<tr>
<td>sbt</td>
<td>3.5 GB</td>
<td>2m 43s</td>
<td>2.9 GB</td>
</tr>
<tr>
<td>zip</td>
<td>1.7 GB</td>
<td>5m 16s</td>
<td>1.9 GB</td>
</tr>
</tbody>
</table>
<p>For larger queries, with 374.6k query hashes, where we find 189.1k in the database, performance evens out a bit:</p>
<table>
<thead>
<tr>
<th>db format</th>
<th>db size</th>
<th>time</th>
<th>memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>sqldb</td>
<td>15 GB</td>
<td>3m 58s</td>
<td>9.9 GB</td>
</tr>
<tr>
<td>sbt</td>
<td>3.5 GB</td>
<td>7m 33s</td>
<td>2.6 GB</td>
</tr>
<tr>
<td>zip</td>
<td>1.7 GB</td>
<td>5m 53s</td>
<td>2.0 GB</td>
</tr>
</tbody>
</table>
<p>Note that zip file searches don't use any indexing at all, so the search is linear and it's expected that the time will be more or less the same regardless of the query. And SBTs are not really meant for this use case, but they are the other "fast search" database we have, so I benchmarked them anyway.</p>
<p>(There are lots of nuances to what we're doing here and I think I mostly understand these performance numbers; see <a href="https://github.com/sourmash-bio/sourmash/issues/1958">the benchmarking issue</a> for my thoughts.)</p>
<p>The really nice thing is that for our motivating use case, looking hashes up in a reverse index to correlate with other labels, the performance with SQLite is <em>much</em> better than our current JSON-on-disk/in-memory search format.</p>
<p>For 53.9k query hashes, we get:</p>
<table>
<thead>
<tr>
<th>lca db format</th>
<th>db size</th>
<th>time</th>
<th>memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>SQL</td>
<td>1.6 GB</td>
<td>20s</td>
<td>380 MB</td>
</tr>
<tr>
<td>JSON</td>
<td>175 MB</td>
<td>1m 21s</td>
<td>6.2 GB</td>
</tr>
</tbody>
</table>
<p>which is frankly excellent - for an 8x increase in disk size, we get a 4x faster query and 16x lower memory usage! (The in-memory performance includes loading from disk, which is the main reason it's so terrible.)</p>
<h2>Further performance improvements?</h2>
<p>I'm still pretty exhausted from this coding odyssey (> 250 commits, ending with nearly 3000 lines of code added or changed), so I'm leaving some work for the future. Most specifically, we'd like to benchmark having multiple <em>readers</em> read from the database at once, e.g. for Web server backends. I expect it to work pretty well for that, but we'll need to check.</p>
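<p>(For the record, here's roughly the experiment I have in mind - a hypothetical sketch where each thread opens its own read-only connection using SQLite's URI filename syntax. The database filename and the query are placeholders.)</p>
<div class="highlight"><pre><span></span><code>import sqlite3
import threading

DB_PATH = "sourmash.sqldb"          # placeholder filename

def count_hashes(results, i):
    # one read-only connection per thread
    conn = sqlite3.connect(f"file:{DB_PATH}?mode=ro", uri=True)
    results[i], = conn.execute("SELECT COUNT(*) FROM sourmash_hashes").fetchone()
    conn.close()

results = [None] * 4
threads = [threading.Thread(target=count_hashes, args=(results, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
</code></pre></div>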
<p>I do use the following PRAGMAs for configuration, and I'm wondering if I should spend time trying out different parameters; this is mostly a database built around writing once, and reading many times. Advice welcome :).</p>
<div class="highlight"><pre><span></span><code><span class="n">PRAGMA</span><span class="w"> </span><span class="n">cache_size</span><span class="o">=</span><span class="mi">10000000</span>
<span class="n">PRAGMA</span><span class="w"> </span><span class="n">synchronous</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">OFF</span>
<span class="n">PRAGMA</span><span class="w"> </span><span class="n">journal_mode</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">MEMORY</span>
<span class="n">PRAGMA</span><span class="w"> </span><span class="n">temp_store</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">MEMORY</span>
</code></pre></div>
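<p>(In case it's useful, applying these from Python is just a matter of executing them on each new connection - a minimal sketch, with a placeholder filename:)</p>
<div class="highlight"><pre><span></span><code>import sqlite3

conn = sqlite3.connect("sourmash.sqldb")   # placeholder filename
c = conn.cursor()
c.execute("PRAGMA cache_size=10000000")
c.execute("PRAGMA synchronous = OFF")
c.execute("PRAGMA journal_mode = MEMORY")
c.execute("PRAGMA temp_store = MEMORY")
</code></pre></div>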
<h2>Concluding thoughts</h2>
<p>The second solution above is the code that is in <a href="https://github.com/sourmash-bio/sourmash/pull/1808">my current pull request</a>, and I expect it will eventually be merged into sourmash and released as part of sourmash v4.4.0. It's fully integrated into sourmash (with a much broader range of use cases than I explained above ;), and I'm pretty happy with it. There's actually a whole 'nother story about manifests that motivated some part of the above; you can read about that <a href="https://github.com/sourmash-bio/sourmash/issues/1930">here</a>.</p>
<p>I'm not planning on revisiting reverse indices in sourmash anytime soon, but we are starting to think more seriously about better (read: non-JSON) ways of serializing sketches. Avro looks interesting, and there are some fast columnar formats like Arrow and Parquet; see <a href="https://github.com/sourmash-bio/sourmash/issues/1262">this issue</a> for our notes.</p>
<p>Anyway, so that's my SQLite odyssey. Thoughts welcome!</p>
<p>--titus</p>The First Common Fund Data Ecosystem Hackathon2022-03-05T00:00:00+01:002022-03-05T00:00:00+01:00Rayna Harris and Jessica Lumiantag:ivory.idyll.org,2022-03-05:/blog/2022-feb-hackathon.html<p>We ran a successful pilot hackathon, and we will run a second one soon!</p><p>The week of February 21-25, 2022, we hosted
<a href="https://nih-cfde.github.io/2022-feb-hackathon">the first Common Fund Data Ecosystem (CFDE) Hackathon</a>. The
goals of the hackathon were to increase familiarity with data sets
from <a href="https://commonfund.nih.gov/programs">Common Fund programs</a> and
work towards cross-cutting, integrative analyses.</p>
<p><img alt="" src="images/2022-hackathon-img1.png"></p>
<p>We invited members of the CFDE to propose hackathon sessions to
introduce their Common Fund data sets and provide technical support
while attendees explored the data. Sessions featured data from the
<a href="https://app.nih-cfde.org/">CFDE Portal</a>,
<a href="https://hmpdacc.org/">Human Microbiome Project (HMP)</a>,
<a href="https://commonfund.nih.gov/exrna">Extracellular RNA Communication (exRNA)</a>,
<a href="https://commonfund.nih.gov/metabolomics">Metabolomics Workbench (MW)</a>,
and
<a href="https://commonfund.nih.gov/LINCS">Signature Commons Library of Integrated Network-Based Cellular Signatures (SigCom LINCS)</a>.</p>
<h2>The Hackathon Sessions</h2>
<p><strong>All sessions were recorded and can be viewed on the <a href="https://nih-cfde.github.io/2022-feb-hackathon/about/">Session Details and Recordings</a> page of the hackathon website!</strong></p>
<p>This virtual event began Monday morning with a welcome address by Dr. Titus Brown (UC Davis) followed by presentations from each Common Fund Program to give a brief overview of their data and session goals.</p>
<p>On Monday afternoon, Dr. Amanda Charbonneau (UC Davis) taught
attendees how to use the <a href="https://app.nih-cfde.org/"><strong>CFDE Portal</strong></a>
to find datasets from participating Common Fund
programs. Dr. Charbonneau used <strong>HMP</strong> data as a motivating example,
then helped attendees discover data sets from other programs. These
datasets are quite large, so on Tuesday afternoon, Dr. Charbonneau
taught a second session on how to download and process data from the
CFDE Portal using Amazon Web Services (AWS). Attendees were provided
with AWS accounts that they could use to analyze data discovered
through the portal.</p>
<p>On Tuesday morning, Emily LaPlante and Keyang Yu (Baylor College of
Medicine) provided an overview of the
<strong><a href="https://exrna-atlas.org/">exRNA Atlas</a></strong>, which contains over 7,500
small RNA sequences and qPCR profiles from human and mouse, and
introduced attendees to a variety of software tools for exploring RNA
binding proteins. This session explored two use cases:</p>
<p>1) Finding RNA binding proteins and their associated RNA cargo in a variety of human biofluids and exploring their utility as biomarkers</p>
<p>2) Exploring other sites across the genome by intersecting exRNA Atlas
data with regions of interest using BedGraph files, as well as
applying this approach to other datasets.</p>
<p>On Wednesday morning, Eoin Fahy and Mano Maurya (UCSD) introduced the
<strong><a href="https://www.metabolomicsworkbench.org/">Metabolomics Workbench</a></strong>
database which contains over 164,000 molecular structures covering
100+ species! Attendees learned how to interact with the Metabolomics
Workbench Portal and then viewed a demonstration of
<strong><a href="https://www.biorxiv.org/content/10.1101/2020.11.20.391912v1">MetENP</a></strong>,
an R package that enables detection of significant metabolites from
metabolite information.</p>
<p>The final data-driven hackathon session took place on Thursday afternoon. John
Erol Evangelista (Mt Sinai) introduced the <strong><a href="https://maayanlab.cloud/sigcom-lincs/#/SignatureSearch/UpDown">SigCom LINCS</a></strong>
API, which contains over 1.5 million gene expression signatures from LINCS,
the Genotype-Tissue Expression (GTEx) project, and the Gene Expression Omnibus
(GEO) database. Then, Daniel Clarke (Mt Sinai) gave an introduction to
building <strong><a href="https://appyters.maayanlab.cloud/#/">Appyters</a></strong> and how to
use the SigCom LINCS APIs within Appyters. </p>
<p>On Friday we ran a Wrap Up and Future Directions session for
presenters to recap what happened at their sessions and talk about
future goals for their tools. This allowed everyone to learn about
sessions they might not have attended, and possibly sparked interest in
watching the video recordings of those sessions.</p>
<h2>Reflection</h2>
<p>Overall the sessions were well attended and well received! In a
pre-hackathon survey, we asked participants which hackathon sessions
they were interested in attending. More people attended each session
than we anticipated, which indicated that the introduction session on
Monday was critical for spurring interest.</p>
<p><img alt="" src="images/2022-hackathon-img2.png"></p>
<p>According to our survey, participants walked away with a greater understanding of Common Fund databases and tools, so we achieved our main goal of increasing familiarity with the diverse datasets supported by the Common Fund Data Ecosystem. Additionally, our team of trainers identified new Common Fund datasets that we plan on integrating into our <a href="https://training.nih-cfde.org/">training program</a> in the future. </p>
<p>A common observation was that some sessions felt more like webinars or
workshops than what the name "hackathon" implies. For our next
hackathon, we will work with presenters to define sessions as webinars
(a demonstration of a data tool), workshops (a training event with
live coding), or hackathons (a defined problem that participants work
on). We also received requests for more advanced notice and
information about the content of sessions, which we will incorporate
into our next round of event planning.</p>
<p>This event, along with many other online events, lacked the sense of
community that can be present with in person multi-day events. We
tried using GitHub Issues or Discussions to foster conversations
between participants, but these tools were rarely used. We are
thinking about how to address this for our next event, and we're open
to feedback!</p>
<p>Finally, the hackathon coordination team would like to reiterate our
thanks to all Common Fund groups that ran sessions for this event! We
could not have achieved a diversity of datasets and tools at this
event without your time and efforts.</p>
<h2>Next steps</h2>
<p>We are excited to announce that <strong>the second CFDE Hackathon will take
place April 25-29th!</strong> We're going to fine-tune the event
with the feedback from our February event, and we hope you will join
us!</p>
<p>If you are interested in learning more about attending the April 2022
Hackathon as a participant, please
<a href="https://www.nih-cfde.org/events/april-2022-hackathon/">register here</a>!
We hope to see you there :)</p>
<hr>
<p><em>The Common Fund Data Ecosystem Training Program is funded by the
National Institutes of Health (1OT3OD025459-01).</em></p>On minimum metagenome covers, and calculating them for your own data.2022-01-18T00:00:00+01:002022-01-18T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2022-01-18:/blog/2022-calculating-minimum-metagenome-covers-with-genome-grist.html<p>You, too, can run our software!</p><p>We just posted a preprint, <a href="https://www.biorxiv.org/content/10.1101/2022.01.11.475838v2">Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers</a>, Irber et al., 2022! Some day soon I'd like to write a long blog post about how this is six years in the making, part of a major intellectual endeavor in the lab that I'm incredibly excited about, yada yada yada, but for now let me just say that I think it's got some interesting ideas in it and if you're at all interested in analyzing shotgun metagenome data you should open it in a tab somewhere; a <a href="https://dib-lab.github.io/2020-paper-sourmash-gather/">very readable HTML version is available for just that purpose</a>.</p>
<h2>There is a super cool figure. You should check it out!</h2>
<p>But what I'm really here to say is this: you might see a super cool figure in the paper that looks like this:</p>
<p><img alt="figure comparing mapping to k-mer hash matching" src="images/2022-calculating-gather-vs-mapping.png"></p>
<p>That figure is super cool in part because it tells you what microbial genomes from Genbank are present in your shotgun metagenome.</p>
<p>And it's even <em>super cooler</em> because our software figures out which genomes are present <strong>automatically</strong>, and can use all of Genbank microbial to do so!<sup>1</sup></p>
<p>We're not talking taxonomic information here, BTW, where you then have to go pick a representative genome after doing an analysis that only gives you vague species-level designations. Nope, we're talking cold, hard DNA-sequence-on-the-table, genome-files-in-a-directory, automatically retrieved and analyzed for you. With mapping and everything.<sup>2</sup></p>
<p>(Taxonomy <em>is</em> available, if you're interested in such. You can use GTDB or NCBI taxonomy as you wish. But you can just have the genomes, too!)</p>
<h2>Where can I get this magickal software?</h2>
<p>What, you say? How is this magic possible!?</p>
<p>We wrote some software! And workflows! It's called genome-grist and it's <a href="https://github.com/dib-lab/genome-grist">available</a> NOW NOW NOW for the LOW LOW COST of FREE!</p>
<p>And (I can't stress enough how excited I am about this) <a href="https://dib-lab.github.io/genome-grist/">it's got documentation, too!</a></p>
<p>And, for an <em>unlimited time only</em>, you can even integrate <a href="https://dib-lab.github.io/genome-grist/configuring/#preparing-information-on-local-genomes"><strong>your own private, unpublished, cherished and hoarded genome sequences!</strong></a><sup>3</sup></p>
<p>&lt;ahem&gt;</p>
<p>Anyhoo. Feedback is welcome.</p>
<p>And yes, this is actually the software that was used to calculate the figures in the paper about the sourmash software referenced at the top. Yes, we are writing software so that we can generate figures for papers about other software. No, this will never end.</p>
<p>--titus</p>
<p>1: terms and conditions may apply: right now we can only give you Genbank as of July 2020. Sorry. We're working on it.</p>
<p>2: terms and conditions may apply: right now this really only works well with paired-end Illumina metagenomes.</p>
<p>3: And, like, your own private taxonomic classifications for your genomes, if you're into that kind of thing.</p>A bioinformatics training career panel in the DIB Lab2021-11-08T00:00:00+01:002021-11-08T00:00:00+01:00Saranya Canchitag:ivory.idyll.org,2021-11-08:/blog/2021-training-career-panel.html<p>Careers in training!</p><p><strong>Note:</strong> The below blog post was written by <a href="https://s-canchi.github.io/">Dr. Saranya Canchi</a>.</p>
<hr>
<p>(Thanks to Marisa Lim, Abhijna Parigi, and Titus Brown for reading drafts!)</p>
<p>On August 6th, we held a career panel for the
<a href="http://ivory.idyll.org/lab/">Lab for Data Intensive Biology (DIB Lab)</a>
The panel consisted of Drs. Tracy Teal, Karen Word and Kate Hertweck, all of whom are friends or alumni of the DIB lab and have built successful careers in biology and bioinformatics training. The discussion was attended by graduate students, post docs and alumni of the lab.</p>
<p>Given the non-traditional nature of the careers, we started the session learning about the career journeys of each panelist leading to their current roles. Interestingly, all our panelists had some shared experiences over time.</p>
<p><strong>Kate</strong> learned computational methods and other model systems during postdoctoral training, which led to an assistant professor position at University of East Texas. Kate especially enjoyed teaching while in this position and that inspired her to transition to Fred Hutchinson Cancer Research Center as a bioinformatics training manager, where she developed and taught many Carpentries style "intro to" type lessons. She was also heavily involved in the Carpentries, having served on the executive council. In her current position as an open science specialist at Chan Zuckerberg Initiative, she combines her experience in teaching with her passion for open science methods. </p>
<p><strong>Karen</strong> got her PhD in physiology, but spent a considerable portion of her graduate time as an educator working at science museums and teaching high school curriculum. She contributed to designing curriculums as well as teaching as part of her pre-doctoral experience, which carried over to her postdoctoral work in the DIB lab. Here she started working with the Carpentries training model, which helped pave the path to her current position as the Director of Education for The Carpentries. </p>
<p><strong>Tracy</strong>, like the other two panelists, also had a non-traditional career path, starting with a PhD in computational neuroscience followed by postdoctoral research in microbial ecology and genomics. She decided against a non-tenure-track assistant professorship, as that was non-optimal for planning, in addition to the burden of raising one's own funding. She had worked on Data Carpentry as part of an NSF grant during her postdoctoral period, which led to her obtaining a large-scale grant from the Gordon and Betty Moore Foundation for Data Carpentry. The merger of Data Carpentry with Software Carpentry into the Carpentries allowed her to come on board as the executive director for the overall program. She further expanded her skills as executive director at Dryad, working with data curation and open science tools. In her current role as the Director for Open Source at RStudio, she combines her rich experience in computational genomics, teaching, and open science knowledge to help drive the mission of RStudio.</p>
<p><strong><em>What are the common job titles that are at the intersection of science, training, and community management?</em></strong></p>
<p>All panelists agreed that job titles can be vague with fuzzy descriptions. While that poses challenges in understanding the required skill set and introduces a level of uncertainty, it also allows for flexibility in defining the boundaries and responsibilities of the role. Looking for keywords like training/support/community in the description can be helpful. Some possible titles include community manager, open source manager, science technician, training specialist etc. It can also be helpful to talk to your friends and peers when looking for jobs as they may be able to provide insights into your strengths, as well as point towards positions (including unlisted jobs) that could be a good fit irrespective of the job title. </p>
<p><strong><em>How do you find these open-ended positions? Are most of them full-time positions?</em></strong></p>
<p>The jobs channel on the <a href="https://www.cscce.org">Center for Scientific Collaboration and Community Engagement</a> (CSCCE) Slack space is a good resource for these types of job postings. Informational interviews are a great networking strategy. It is best to seek out people in interesting positions or at interesting companies and ask about their experience and career path. Such sessions have the potential to become a networking opportunity, with the possibility of future job postings being sent your way. Another great networking source is your undergrad or graduate alumni network. UC Berkeley has a <a href="https://career.berkeley.edu/Info/InfoQuestions">neat template for informational interviews</a> which can be helpful in preparing potential questions.</p>
<p>Positions that rely on grant support typically have a finite timeline. Some open-ended positions are yearly and contract-based. If you are unsure, it is best to ask the hiring manager/HR/recruiter in the initial stages of the hiring process.</p>
<p><strong><em>What are some important aspects to consider when evaluating a position with a company?</em></strong></p>
<p>It is important to have lots of latitude to develop the responsibilities of the position as well as for personal growth. The position should allow you to challenge yourself and your team (if applicable) and try new initiatives. There should also be enough scope to read, reflect, and engage in professional development to continually improve yourself. While it may be difficult to gauge prior to starting a position, it is also important to think about the colleagues you may work with every day. Working at a company is not the whole experience but rather depends on which team and people you interact with the most. Staff restructuring can significantly change the overall company experience. It is also important to consider the values, vision, and cultural fit of the company.</p>
<p>Tracy suggested asking these questions during the interview process:</p>
<ul>
<li>What kind of values are there at X?</li>
<li>How do you run your meetings?</li>
<li>What are the expectations around communications?</li>
<li>How do you manage employee time away?</li>
</ul>
<p>She pointed out that consistent answers across employees within a given team/company illustrate shared understanding, which is helpful in evaluating potential fit.</p>
<p><strong><em>What types of other jobs can one do as a graduate student/postdoc to gain experience beyond research?</em></strong></p>
<p>Academic experience can focus specifically on research skills, rather than outreach/training/community building skills. But to understand aspects of a job broadly it is important to gain experience outside of it. Volunteering and engaging in hobbies can be critical to developing skill sets that are helpful in non academic jobs. Volunteering is a safe route to try new roles while figuring out the career direction you would like to pursue. </p>
<p>It can be useful to look at job descriptions and work backwards. If you are interested in teaching and teaching leadership roles, consider volunteering for <a href="https://carpentries.org">The Carpentries</a>, a global organization that focuses on teaching essential data and computational skills! You can also join groups like <a href="https://rladies.org">R-Ladies</a>, <a href="https://pyladies.com">PyLadies</a>, coding meetings or start a group of your own. Offer to host an event or help with documentation within your lab or beyond. Kate also shared a <a href="https://github.com/k8hertweck/professional_assets_data_science">resource that she developed for a workshop on professional assets in data science careers</a>. It is also useful to network and talk to people from varied fields to gain a unique perspective. </p>
<p>It is difficult to land a manager/director level position straight out of graduate school or postdoc, since it requires management experience. While not identical to managing employees, mentoring students can provide substantial people management experience in academic settings. </p>
<p><strong><em>Do you have any regrets about not continuing the scientific research career route?</em></strong></p>
<p>Science comes in many flavors. Working in science-adjacent fields such as training/teaching can still offer the opportunity to make scientific, data-driven decisions while providing rewards such as learner satisfaction, developing open source materials, and engaging with a broader community. There are some aspects of the scientific process that you may not get to do in this line of work. Allowing yourself space to process the disconnect and to feel sadness is important for moving forward. However, you are constantly learning, adapting, and thinking about the impact your current position has on a larger scale of teaching, and that can be very empowering.</p>
<p>With the clock moving forward we had to end the lively and inspiring discussion! </p>Using snakemake to do simple wildcard operations on many, many, many files2021-08-30T00:00:00+02:002021-08-30T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2021-08-30:/blog/2021-snakemake-simple-operations.html<p>snakemake is awesome</p><p>I recently co-taught <a href="https://ngs-docs.github.io/2021-august-remote-computing/automating-your-analyses-with-the-snakemake-workflow-system.html">another snakemake lesson</a> (with Dr. Abhijna Parigi), and was reminded of one of my favorite off-label uses of snakemake: replacing complicated bash <code>for</code> loops with simple and robust snakemake workflows.</p>
<h2>An example</h2>
<p>As a bioinformatics researcher, I frequently need to do simple operations to many files. As part of this, I usually want to change the filename to represent the change in file content.</p>
<p>For example, let's suppose I have a bunch of FASTQ files (say, the ones <a href="https://github.com/ngs-docs/2021-remote-computing-binder/tree/latest/data/MiSeq">here</a>), and I want to subset them to the first 400 lines. The filenames all have the form <code>NAME.fastq</code>, and I want to add <code>.subset.fastq</code> to the end of the subset filenames to distinguish them.
(See <a href="https://ngs-docs.github.io/2021-august-remote-computing/automating-your-analyses-and-executing-long-running-analyses-on-remote-computers.html#subsetting">this shell scripting lesson</a> for more background and motivation for this particular operation.)</p>
<h3>Using <code>bash</code>, round 1</h3>
<p>For many years I did this with bash <code>for</code> loops. The following code works, assuming the original fastq files are in a <code>data/</code> subdirectory:</p>
<div class="highlight"><pre><span></span><code>mkdir subset
for i in data/*.fastq
do
    head -400 $i > subset/$(basename $i).subset.fastq
done
</code></pre></div>
<p>Starting from a bunch of files,</p>
<div class="highlight"><pre><span></span><code>data/F3D0_S188_L001_R1_001.fastq
data/F3D0_S188_L001_R2_001.fastq
...
</code></pre></div>
<p>this loop will produce</p>
<div class="highlight"><pre><span></span><code>subset/F3D0_S188_L001_R1_001.fastq.subset.fastq
subset/F3D0_S188_L001_R2_001.fastq.subset.fastq
</code></pre></div>
<h3>Improving the bash solution</h3>
<p>The output filenames are kind of ugly, because <code>fastq</code> is repeated. That's just because bash makes it so easy to append to filenames - we can fix this by adding <code>.fastq</code> into the <code>$(basename ...)</code> call:</p>
<div class="highlight"><pre><span></span><code>mkdir subset2
for i in data/*.fastq
do
    head -400 $i > subset2/$(basename $i .fastq).subset.fastq
done
</code></pre></div>
<p>So... not difficult to read, and fairly straightforward. Why would I use anything else?</p>
<h2>Using snakemake instead</h2>
<p>tl;dr The bash code above is brittle when I modify it; it's not robust enough for important work.</p>
<p>In my (extensive &lt;sigh&gt;) experience, the above approach fails some reasonable percent of the time. Usually I get it right the first time I write it, and then I modify and tweak it, and chaos ensues because I omit something in the for loop.</p>
<p>So a year or two ago, I decided to try out snakemake for one of these operations.</p>
<p>Here's the contents of a file named <code>Snakefile.subset</code>, which does the same thing as the for loop above -</p>
<div class="highlight"><pre><span></span><code># pull in all files with .fastq on the end in the 'data' directory
FILES = glob_wildcards('data/{name}.fastq')

# extract the {name} values into a list
NAMES = FILES.name

rule all:
    input:
        # use the extracted name values to build new filenames
        expand("subset3/{name}.subset.fastq", name=NAMES)

rule subset:
    input:
        "data/{n}.fastq"
    output:
        "subset3/{n}.subset.fastq"
    shell: """
        head -400 {input} > {output}
    """
</code></pre></div>
<p>and you can run it with <code>snakemake -s Snakefile.subset -j 1</code>.</p>
<p>With this Snakefile, snakemake pulls in all files that match the glob pattern and extracts their names, and then constructs a set of "targets" in rule <code>all</code> that it must create. The <code>subset</code> rule specifies how to build targets of that name.</p>
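<p>(One habit worth mentioning: <code>snakemake -s Snakefile.subset -j 1 -n</code> does a dry run, printing the jobs snakemake plans to execute without running them - a nice way to check that the wildcard matching picked up the files you expected.)</p>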
<h2>Why I like snakemake more than bash for this</h2>
<p>So why do I like snakemake more? A few reasons that I think are intrinsic to snakemake vs bash -</p>
<ul>
<li>snakemake fails if something is wonky about the filenames, <em>before</em> doing anything!</li>
<li>if any of the operations fail, snakemake stops and alerts me by default!</li>
<li>I can do the operations in parallel by specifying e.g. <code>snakemake -j 4</code> to use 4 cores.</li>
<li>the templating language for using <code>{...}</code> is nice, simple, and Python-standard (see <a href="https://realpython.com/python-f-strings/#f-strings-a-new-and-improved-way-to-format-strings-in-python">this blog post on f-strings</a> and also <a href="https://docs.python.org/3/library/string.html#formatspec">the templating minilanguage ref</a>).</li>
<li>as the operations get more complicated, snakemake doesn't need to get more complicated, while the bash solution tends to complexify into illegibility...</li>
<li>I think the snakemake solution is easier to understand and modify!</li>
</ul>
<p>Above all, the overall structure of snakemake is <em>declarative</em> rather than <em>procedural</em>. We declare what we want the result to look like, and snakemake uses the available rules to create the overall set of steps that must be executed and Makes It Happen. This is what makes the error checking and parallelization possible.</p>
<p>Another "feature" of this solution is that there are more comments because I comment Snakefiles more than bash scripts. This is probably a me-problem that is caused by snakemake <em>forcing</em> me to edit a file :).</p>
<p>I haven't reused Snakefiles that much, but I think you can reuse Snakefiles fairly easily - see next section.</p>
<p>Are there any downsides? The main one is that the snakemake solution feels more heavyweight to me - it involves creating a file, getting the spacing/indentation right, etc. etc. So I still don't use it as much as I probably should.</p>
<p>Thoughts welcome!</p>
<p>--titus</p>
<h2>Appendix: A more reusable Snakefile</h2>
<p>Below is a Snakefile that's a bit more reusable for situations where your input and output directories don't match the names I used above - you can override PREFIX and OUTPUT by running <code>snakemake -C prefix=PREFIX output=OUTPUT</code>.</p>
<p>(I don't really like the syntax of using f-strings here, but it's cleaner than anything else I've found. Suggestions welcome.)</p>
<div class="highlight"><pre><span></span><code># pull in all files with .fastq on the end in the 'data' directory.
PREFIX = config.get('prefix', 'data')
print(f"looking for FASTQ files under '{PREFIX}'/")

OUTPUT = config.get('output', 'subset5')
print(f"subset results will go under '{OUTPUT}'/")

FILES = glob_wildcards(f'{PREFIX}/{{name}}.fastq')

# extract the {name} values into a list
NAMES = FILES.name

# request the output files
rule all:
    input:
        # use the extracted 'name' values to build new filenames
        expand("{output}/{name}.subset.fastq", output=OUTPUT, name=NAMES)

# actually do the subsetting
rule subset_wc:
    input:
        f"{PREFIX}/{{n}}.fastq"
    output:
        "{output}/{n}.subset.fastq"
    shell: """
        head -400 {input} > {output}
    """
</code></pre></div>A biotech career panel in the DIB Lab2021-07-20T00:00:00+02:002021-07-20T00:00:00+02:00Marisa Limtag:ivory.idyll.org,2021-07-20:/blog/2021-biotech-career-panel.html<p>Careers outside of universities!</p><p><strong>Note:</strong> The below blog post was written by Dr. Marisa Lim.</p>
<hr>
<p>(Thanks to Titus Brown, Abhijna Parigi, Tessa Pierce, and Saranya Canchi for reading drafts!)</p>
<p>On June 25th, we held a career panel discussion for the DIB lab on bioinformatics and biomedical data science careers. We invited four DIB-lab alumni and affiliates to be our panelists - Shaun Jackman, Lisa Johnson, Phil Brooks, and Olga Botvinnik - and graduate students and post-docs in the lab attended the event.</p>
<p>Each panelist shared their career journey leading to their current roles and then we discussed topics of interest from the audience, which roughly fell into two categories: 1) advice for finding jobs and interviewing and 2) a comparison of academic research vs. biotech industry careers.</p>
<h2>Finding jobs & interviewing</h2>
<p>Everyone agreed that you're more likely to land interviews and jobs when you've got a contact that can refer and recommend you to the hiring manager. This is why it's important to create a professional network, as cold applying (applying to jobs without any prior contact) is a more difficult approach to finding jobs. Here was the advice for online and in-person networking:</p>
<ul>
<li>Have some form of public online presence, whether that be a LinkedIn profile, a Twitter account (which functions as a personal Stack Overflow for asking questions and a place to advertise your own work), and/or a personal website. These serve as public portfolios for your research and work experience.</li>
<li>Industry resumes are typically short, so one tip our panelists recommended was embedding hyperlinks in the text, so recruiters/hiring managers can find the resources listed above for additional information. </li>
<li>In-person networking might occur at conferences or at smaller group events. For example, you can request an informational interview with someone to learn more about their job (most people will be willing to chat with you!). This is a great option for a smaller group discussion, which may be less overwhelming than trying to talk to people at large conferences. Be sure to take notes from informational interviews! One panelist suggested making a new document (e.g., a Google Doc) for each interview and time-stamping the conversation to keep track of the information. An added benefit to doing informational interviews is that they might generate positions or lead to formal interviews. It was mentioned that a large proportion of biotech jobs are actually <em>not</em> publicly announced.</li>
</ul>
<p>Besides networking, our panelists suggested keeping up to date on biotech company news - if a company has recently gotten an infusion of funding, they're likely to be hiring soon!</p>
<ul>
<li>Read biotech news and blogs - e.g., https://www.genomeweb.com/</li>
<li>Get in touch with venture capital (vc) recruiter firms.</li>
</ul>
<p>At the job search/application stage, a big concern is whether to apply for jobs if you don't think you meet all the exact requirements. Our panelists very enthusiastically made the following suggestions:</p>
<ul>
<li>As long as you're interested and show that you're motivated to learn, go for it! Let hiring managers decide!</li>
<li>You <em>can</em> learn new skills on the job (this is something to look out for when assessing whether a job allows you to grow)</li>
</ul>
<p>At the interview stage, our panel had this advice to share:</p>
<ul>
<li>Have your list of references (usually 3 people) ready to go! Don't wait until you need them for a job interview and be sure they're people you trust to support you.</li>
<li>Nobody has all of the skills listed in job descriptions, but make sure you know the <em>purpose</em> of every tool, even if you don't know the exact details. During the interview, you can say you know what the tools are for and if true, that you'd like to learn more about how to use them for your job.</li>
<li>Don't oversell your skills however, because interviewers can tell and it's perfectly ok to say you're keen to learn.</li>
<li>Know what job you're applying for and why you're a good fit for the team. For example, if you're interviewing for a customer support role, it's less pertinent to go into the fine details about your research, unless you can link the story back to something support-related.</li>
<li>Interviewing is a skill too and takes practice. Even if it's not your dream job, if there's a chance you'd take the job, consider going through with the interview process to gain experience.</li>
</ul>
<p>Recognize that interviews are a two-way conversation. In addition to answering interviewer questions, be sure to ask questions to help determine whether the company, role, and team will be a good fit for you as well. Olga shared 3 questions she asks at every interview:</p>
<ul>
<li>What has kept you at company X?</li>
<li>What would you change at company X?</li>
<li>Is there anything else I should have asked?</li>
</ul>
<p>What to do if you notice warning signs during interviews? </p>
<ul>
<li>It's a good idea to ask about company culture during one-on-one interviews. It can really help to have a contact at the company to talk to about potential issues.</li>
<li>If something feels really wrong, it's ok to say you want to stop early on. This will save your energy and time, as well as that of the interviewers.</li>
</ul>
<h2>Academia vs. Industry</h2>
<p>As students and postdocs with training and research experience primarily at universities, one of the most popular topics is comparing academic research and biotech industry careers. While we only scratched the surface of this topic during the panel, here were our main discussion points:</p>
<ul>
<li>The 'balance' part of work-life balance is highly dependent on the biotech company culture, job role, and timing. In general, the workload in industry is not spread evenly over the year. For example, there may be more intense working conditions leading up to product release deadlines. However, our panelists said they generally get to schedule their own time as long as they are meeting their commitments. For companies with a global customer base, the schedule may require some employees to work at night to accommodate time zone differences - however this may actually offer some work-time flexibility depending on your circumstances. It's important to communicate early on with your team to determine work and working condition expectations.</li>
<li>Perhaps one of the more visible differences between academia and industry is that industry jobs tend to be team-oriented and focused on specific aims; communication and interdisciplinary teamwork are consequently very important to meet the responsibilities of your group within the company. Deadlines are determined by business decisions and are often less flexible. In contrast, academic researchers and faculty tend to work more independently within their lab group or department, and wear multiple hats - i.e., apply for funding, mentor students, teach, publish, contribute to department and other service duties, and manage their lab. Project milestones are often less defined and deadlines may be more flexible.</li>
</ul>
<p>Before we knew it, 1 hour had passed and it was time to wrap up the panel! </p>Scaling sourmash to millions of samples2021-07-13T00:00:00+02:002021-07-13T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2021-07-13:/blog/2021-sourmash-scaling-to-millions.html<p>Bigger and better!</p><p>(Many thanks to Dr. Luiz Irber, who sunk the pillars and laid the foundations for a lot of the work below. Dr. Tessa Pierce and Dr. Taylor Reiter drove much of our engineering work by constantly coming up with new! and bigger! use cases that were also quite exciting and motivating ;)</p>
<p><a href="https://sourmash.readthedocs.io/">sourmash</a> is our software for quickly searching large volumes of genomic and metagenomic sequence data using k-mer sketching. We're up to version v4.2.0 now, and looking forward to releasing v4.2.1 sometime in the next month.</p>
<p>One emerging theme for sourmash for the v4 series has been <em>scaling</em>. There are a variety of large-scale data sets that continue to grow in size, and it sure would be nice to be able to work with them easily.</p>
<p>The challenges are big and growing. In no particular order,</p>
<ul>
<li>NCBI has about a million microbial genomes in their GenBank database;</li>
<li>the Sequence Read Archive contains well over a million microbial shotgun sequencing data sets, with about 600,000 of them being large metagenomes;</li>
<li>the <a href="https://gtdb.ecogenomic.org/">GTDB taxonomy group</a> has produced revised taxonomic annotations for 250,000 of the GenBank genomes;</li>
<li>individual research projects can now quickly and easily produce hundreds of genomes and dozens to hundreds of metagenomic samples, so the numbers above are growing rapidly.</li>
</ul>
<p>Thanks to Luiz Irber's work, discussed in his <a href="https://github.com/luizirber/phd/releases">thesis</a>, we have a nice distributed system ('wort') that computes new sourmash sketches as new data enters the system. (Some of that is also described in <a href="http://ivory.idyll.org/blog/2021-MAGsearch.html">my blog post about searching all public metagenomes</a>.) Also thanks to Luiz, over the last two years sourmash was <a href="https://blog.luizirber.org/2018/08/23/sourmash-rust/">refactored to use Rust underneath</a>, and we've been enjoying a number of <a href="https://twitter.com/ctitusbrown/status/1356344041978228736">raw performance gains</a>.</p>
<p>The challenges we're struggling with now stem from all of this. We <em>have</em> a lot of data that we <em>can</em> work with (package, search, etc.), but our processes and infrastructure for working with it haven't scaled to meet the new capabilities of sourmash. Briefly,</p>
<ul>
<li>we have several millions of files sitting in various directories, representing sketches for a lot of public data.</li>
<li>it's prohibitively slow to scan through all of that information repeatedly, and difficult or impossible to fit it all in memory (depending on the collection in question).</li>
<li>most sketches are not really interesting for any given operation, so a lot of our scanning would be redundant anyway.</li>
</ul>
<p>Because of this, a lot of our work since the 4.0 release has been on technical changes to support <em>better processes</em> that will better handle searching, collating, and updating collections of bajillions of files.</p>
<h2>Motivation: building new database releases</h2>
<p>One of the several major uses of sourmash is searching genome collections, with the goal of finding matches to and/or classifying genome or metagenome samples. We variously use the GTDB genomic representatives database (48k genomes), the GTDB complete database (250k genomes), or the NCBI microbial database (~800k genomes). And we want to <a href="https://sourmash.readthedocs.io/en/latest/databases.html">provide these databases</a> for download so that sourmash users don't have to do all the prep work themselves.</p>
<p>To provide these databases,</p>
<ul>
<li>first, we need to sketch all the genomes. This involves downloading each genome, running <code>sourmash sketch</code>, and saving the results somewhere. This is what wort does - it monitors NCBI for new genome entries, calculates the sketches, and makes them available for download.</li>
<li>then we need to select a <em>specific</em> set of sketches based on a catalog (GenBank microbial, or GTDB genomic reps, or whatnot) and a set of parameters (k-mer size, mostly).</li>
<li>next we need to figure out which sketches do not exist in our overall collection, for whatever reason, and find/build those. (e.g. NCBI GenBank is somewhat fluid, and GTDB isn't always synced with its releases; or something just slipped through the wort cracks; or GenBank never actually <em>had</em> the right sequence, so it needs to be calculated)</li>
</ul>
<p>If you have 100, 1000, or even 10,000 sketches, this is all pretty easy. It only starts to get annoying when you have 100,000 and more. We have a million :).</p>
<h2>Investing in scaling - some principles</h2>
<p>There are a number of techniques for working with large volumes of data.</p>
<p>First, <strong>lazy loading</strong>. Better known as <a href="https://en.wikipedia.org/wiki/Lazy_evaluation">Lazy Evaluation</a>, this is a CS concept where you pass around references to objects, and only resolve those references when you decide to actually use the object. Since references are usually (much) cheaper than the full object, you can save on memory. In the case of sourmash, one of our on-disk search structures, the Sequence Bloom Tree (SBT), has relied on lazy loading for years, and we've been expanding this to sketch collections more generally.</p>
<p>Second, <strong>streaming input and output</strong>. Another CS concept, <a href="https://en.wikipedia.org/wiki/Stream_(computing)">streaming</a> means that you perform as many operations as possible on individual items, and don't hold all the items in memory (ever). We've always intended to support large-scale streaming I/O in sourmash, but it hasn't been a priority before this. Luckily Python gives us lots of tools - generators, in particular - for doing streaming!</p>
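<p>To make these first two principles concrete, here's a minimal Python sketch (illustrative only - this is not the actual sourmash API) showing how a cheap reference object and a generator combine lazy loading with streaming:</p>
<pre><code>import json

class LazySketchRef:
    """A cheap reference to an on-disk sketch; nothing is read until load()."""
    def __init__(self, path):
        self.path = path

    def load(self):
        # resolve the reference only when the object is actually needed
        with open(self.path) as fp:
            return json.load(fp)

def stream_sketches(paths):
    """Generator: yield loaded sketches one at a time, never all at once."""
    for path in paths:
        yield LazySketchRef(path).load()

# usage sketch - memory stays flat no matter how many paths there are:
# for sketch in stream_sketches(["a.sig", "b.sig"]):  # hypothetical files
#     process(sketch)                                 # hypothetical operation
</code></pre>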
<p>A related concept to streaming is to <strong>avoid accumulating anything big in memory</strong>. This is easier said than done - for one not-so-random example, if you're searching a big database for matches, it is very easy to just keep matches in memory and then deal with them as a single collection. But what if a large portion of that database matches? You need a place to store the matches!</p>
<p>Fourth, <strong>use metadata to filter as much as possible</strong>. This seems separate from but maybe overlaps with lazy loading... basically, you want to work with catalogs of your data (which are less bulky), rather than your data itself (which is usually much larger).</p>
<p>Fifth, <strong>support flexible filtering</strong>. It's very easy to write custom solutions that get you what you need today, but a more general solution may not be much more work and will save you time later, as your use cases evolve.</p>
<p>Sixth, <strong>use databases</strong>. This may be obvious, but it's always worth remembering that there are literally decades of work on storing and searching structured catalogs of data! We should make use of that software! And if we use sqlite3, we have a superbly engineered and high-performance SQL database that is embedded in many programming languages!</p>
<p>Seventh, <strong>think declaratively</strong> instead of procedurally. Try to describe <em>what</em> you want to do with the data, not <em>how</em> you want it done (and in particular, avoid for loops as much as possible :). Abstracting the operations you want to do into a declarative form permits refactoring and optimization of the underlying implementation.</p>
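<p>A toy contrast (again, not sourmash code) makes the difference visible: the procedural version hard-codes the "how", while the declarative version takes the criteria as data that a smarter backend could later push down into, say, a SQL query:</p>
<pre><code># procedural: the "how" is baked into a for loop
def find_k31_dna(rows):
    out = []
    for row in rows:
        if row["ksize"] == 31 and row["moltype"] == "DNA":
            out.append(row)
    return out

# declarative: describe the "what"; the implementation is free to change
def select(rows, **criteria):
    return (r for r in rows if all(r[k] == v for k, v in criteria.items()))

# select(manifest_rows, ksize=31, moltype="DNA") reads as a specification
# and could be reimplemented as a SQL WHERE clause without touching callers.
</code></pre>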
<p>So how is this all shaking out in sourmash?</p>
<h2>Iterating towards nerdvana: a progress report</h2>
<p>We've invested considerable amounts of effort into engineering over the last year, iterating towards implementations of the above practices.</p>
<h3>Round 0 (sourmash 3.x through sourmash 4.0)</h3>
<p>By the release of <a href="https://github.com/sourmash-bio/sourmash/releases/tag/v4.0.0">sourmash 4.0</a> in March 2021, we had included a lot of good optimizations and refactoring already.</p>
<p>Way back in 3.x sometime, Luiz had moved sketch loading into Rust. This led to a ridiculous speedup in pretty much everything - 100-1000x.</p>
<p>We had slowly but surely made our way to a standard <code>Index</code> class API that let us collect, select, and search on large piles of sketches.</p>
<p>In particular, we'd started to invest in <em>selectors</em>, which let us specify features (like k-mer size, or molecule type) that we wanted our collection limited to.</p>
<h3>Round 1 (sourmash 4.1)</h3>
<p>For <a href="https://github.com/sourmash-bio/sourmash/releases/tag/v4.1.0">sourmash v4.1.0</a>, two months later, we evolved things more.</p>
<p>We had a lazy selection <code>Index</code> class that deferred running the selectors until the actual sketches themselves were requested. Getting the class to work properly and supporting it fully throughout the code base (and testing the bejeezus out of it) forced us to regularize the class API some more, which opened up many more opportunities.</p>
<p>We also added <em>generic</em> support for retrieval of sketches by random access into a collection, through our use of .zip collections and <code>ZipFileLinearIndex</code>. This was an expansion of the lazy loading and on-disk storage that SBTs had enjoyed since v3.2, but without the same overhead cost of the data structures. So, now it was possible to package really large collections of sketches in a compressed format <em>and retrieve individual sketches directly</em>, with minimal overhead. Not so incidentally, this was also our first random-access/on-disk mechanism that could store <em>incompatible</em> sketches - so it was much more flexible than what we'd been doing before.</p>
<p>The internal (and command-line) support for the streaming <code>prefetch</code> functionality was also a watershed moment. Prior to <code>prefetch</code>, all of our database search methods did a search and then sorted all of the results to present a nice summary to the user. While useful, this meant that if you had lots of matches, you had to store them all in memory so you could sort them later. This could be ...prohibitive in terms of memory, and we already had specific examples where we knew it wouldn't work. <code>prefetch</code> was a new feature that was <em>explicitly</em> streaming and was meant to search Databases of Unusual Size: so, it simply output matches as it found them, with no sorting.</p>
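<p>The streaming idea behind <code>prefetch</code> fits in a few lines of Python. This is a schematic with a made-up <code>containment()</code> helper, not the real implementation: emit each match the moment it is found, and never sort or accumulate.</p>
<pre><code>import csv, sys

def streaming_search(query, sketches, threshold=0.05):
    """Yield (name, score) for each match as it is found. No global sort and
    no accumulation, so memory use is independent of the number of matches."""
    for sketch in sketches:
        score = containment(query, sketch)  # hypothetical comparison function
        if score >= threshold:
            yield sketch.name, score

# write results as they arrive; nothing is held in memory:
# writer = csv.writer(sys.stdout)
# for name, score in streaming_search(query, stream_sketches(paths)):
#     writer.writerow([name, score])
</code></pre>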
<p>Last but by no means least, once we had streaming <em>input</em> we needed streaming <em>output</em>, so we implemented a general sketch saving method that supported several standard output methods (to directories and zipfiles, in particular) to offload sketches directly to disk.</p>
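<p>The shape of that saving method, as a hedged sketch (the real sourmash classes differ): a small context manager that appends each sketch to a zipfile as it arrives, so output streams to disk alongside the search.</p>
<pre><code>import json, zipfile

class SaveSketchesToZip:
    """Append each sketch to a zipfile as it is produced (streaming output)."""
    def __init__(self, location):
        self.location = location

    def __enter__(self):
        self.zf = zipfile.ZipFile(self.location, "w")
        self.count = 0
        return self

    def add(self, sketch):
        # offload immediately; nothing accumulates in memory
        self.zf.writestr(f"signatures/{self.count}.sig", json.dumps(sketch))
        self.count += 1

    def __exit__(self, *exc):
        self.zf.close()

# with SaveSketchesToZip("matches.zip") as out:
#     for match in matches:   # e.g. fed by a streaming search
#         out.add(match)
</code></pre>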
<p>Together, what this all meant was that we could finally:</p>
<ul>
<li>take an arbitrarily large collection of on-disk sketches,</li>
<li>select just the ones we wanted without necessarily loading them all,</li>
<li>walk across those sketches one by one, storing no more than a small number of them in memory,</li>
<li>find matches and offload those matches to disk as we went.</li>
</ul>
<p>There were still some suboptimal constraints that had to be obeyed, but they were in the implementation, not in the API, so we "just" needed to iterate on the implementation :).</p>
<p>This was the first crack in the dam of database building (one of our motivating use cases): once we had zipfile collections implemented, we could first build zipfile collections and then use those zipfile collections as our source build for all the other database types that supported fast search. (And, indeed, we now have <a href="https://github.com/sourmash-bio/sourmash/issues/1511#issuecomment-867759491">a snakemake workflow</a> that does exactly that!)</p>
<h3>Round 2 (sourmash 4.2)</h3>
<p>Between v4.1 and v4.2, we had several minor releases that cleaned things up and improved edge case efficiency. </p>
<p>For <a href="https://github.com/sourmash-bio/sourmash/releases/tag/v4.2.0">sourmash v4.2.0</a> in early July 2021, however, we doubled down on the "working with large collections" theme.</p>
<p>First, we introduced <a href="https://sourmash.readthedocs.io/en/latest/command-line.html?highlight=picklist#using-picklists-to-subset-large-collections-of-signatures">"picklists"</a>, which give command-line and API-level support for selecting sketches based on their metadata features (not their content). The initial implementation was slooooooow on large data sets, but this was an important declarative mechanism (that immediately saw extension in unexpected directions, too!)</p>
<p>This was followed (in the same release) by database <em>manifests</em>, a feature that is not user-facing at all (and doesn't show up in the docs, either - oops!). Manifests are simply a spreadsheet-style catalog of the metadata for all the sketches in a particular database, and they can be calculated <em>once</em> and then included in zipfiles. They support direct retrieval of sketches by id, as well as rapid intersection with picklists.</p>
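<p>Conceptually, a manifest is just rows of metadata keyed by sketch identity, and intersecting it with a picklist is a set operation. A minimal illustration (the real manifest columns are richer than this):</p>
<pre><code>import csv

def load_manifest(path):
    with open(path, newline="") as fp:
        return list(csv.DictReader(fp))

def apply_picklist(manifest_rows, picklist_idents):
    """Keep only the manifest rows whose identifier is in the picklist."""
    wanted = set(picklist_idents)
    return [row for row in manifest_rows if row["ident"] in wanted]

# rows = load_manifest("collection-manifest.csv")   # hypothetical filename
# keep = apply_picklist(rows, {"GCF_000005845.2"})
# ...then load only the sketches listed in 'keep' from disk.
</code></pre>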
<p>These two features were relatively minor in terms of new user-facing functionality - although they do support some cool stuff! - but were massive in terms of internal improvements.</p>
<p>For example, it was now virtually instant to take a zipfile collection of 260,000 sketches and pick out the three sketches you were interested in, based on whatever criteria you wanted.</p>
<p>So, as a not so random example, you could run <code>prefetch</code> on a big database (low memory, streaming...) and save only the CSV with match names, and then <em>just use that CSV</em> as a picklist to run further operations on the database - search, gather, etc. There's no need for intermediate collections of sketches in workflows! (This has saved us literally 100s of GBs of disk space already!)</p>
<p>As another not-so-random example, you could load a manifest from a zipfile collection, run your sketch selection (ksize, molecule type, identifier, etc.) on the manifest in memory, and then go back to load <em>only</em> the relevant sketches from disk as you needed them.</p>
<p>Again, the internal implementation is leading the user-facing features here, and there are still some performance issues, but the API support is there and seems flexible enough to support a wide range of optimizations.</p>
<h3>Round 3 (sourmash 4.2.1, maybe?)</h3>
<p>The next release will add some internal support for more/better manifest stuff. In particular, I've been experimenting with a <em>generic</em> lazy loading index class, which lets us do clever things like load a manifest, do selection and filtering on it, and only actually go to the disk to load the index object when we're ready - previous approaches always worked on the loaded index, which is suboptimal when you have thousands of them.</p>
<p>With this new class, we can apply the manifest directly as a picklist and subset down to just the sketches we care about. (As with all of these things, I've been <a href="https://github.com/sourmash-bio/sourmash/pull/1619">playing around</a> with different implementations and throwing different use cases at them, and it's been "interesting" to watch various solutions fall apart under the burden of really large collections!)</p>
<p>One thing this has let me do is (finally!) re-engineer database releases around manifests, using (tada) <a href="https://github.com/ctb/2021-sourmash-mom">manifests of manifests</a>. With this we do the following (sketched in code after the list) -</p>
<ul>
<li>load many manifests from many collections into a single SQLite database;</li>
<li>run our metadata selection (k-mer size, molecule type, picklists, etc.) on this database, using SQL primitives;</li>
<li>and then go grab precisely those sketches we care about, for further downstream processing.</li>
</ul>
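<p>In code, the flow looks roughly like this - a simplified sketch in which the table and column names are invented, not the actual manifests-of-manifests schema:</p>
<pre><code>import csv, sqlite3

db = sqlite3.connect("mom.sqlite")
db.execute("CREATE TABLE IF NOT EXISTS manifest "
           "(collection TEXT, location TEXT, ksize INTEGER, moltype TEXT)")

def add_manifest(collection, manifest_csv):
    """Load one collection's manifest CSV into the shared SQLite database."""
    with open(manifest_csv, newline="") as fp:
        for row in csv.DictReader(fp):
            db.execute("INSERT INTO manifest VALUES (?, ?, ?, ?)",
                       (collection, row["location"],
                        int(row["ksize"]), row["moltype"]))
    db.commit()

def locate(ksize, moltype):
    """Metadata selection runs as SQL; only matching locations come back."""
    cur = db.execute("SELECT collection, location FROM manifest "
                     "WHERE ksize = ? AND moltype = ?", (ksize, moltype))
    return cur.fetchall()
</code></pre>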
<p>In <em>practice</em> this lets us do things like cut a new database release for GTDB quickly and easily - it takes <a href="https://github.com/sourmash-bio/sourmash/issues/1652#issuecomment-877647611">only a minute</a> to verify that we have all of the necessary sketches and return their locations. (And like everything else, it can probably be optimized dramatically.)</p>
<h3>What's next? (sourmash 4.3 or later)</h3>
<p>We've been <a href="https://github.com/sourmash-bio/sourmash/issues/1350">somewhat fixated</a> on trying to provide good user experience (fast, performant, communicative) around searching Extremely Large Collections.</p>
<p>We have a prototype solution for near-realtime search of 50k+ sketches (see <a href="https://github.com/sourmash-bio/sourmash/issues/1226">sourmash#1226</a> and <a href="https://github.com/sourmash-bio/sourmash/issues/1641">sourmash#1641</a>) and at some point that will make it into the codebase. At that point we will be closer to fully exploiting CPU and disk capabilities; right now our speed is mostly bound by our lack of parallelism.</p>
<p>Somewhere down the road we're going to expand our persistent storage options. We support file storage, zipfiles, Redis, and IPFS for SBTs already (thanks again Luiz!) but want to support these for more collection types. Not hard now, just ...work.</p>
<p>And, as a nice cherry on top of the sundae, after all of the <code>Index</code> API refactoring we did back in 4.0 and 4.1, we can now easily support client/server mechanisms via remote procedure calls using only the standard interface - see <a href="https://github.com/sourmash-bio/sourmash/issues/1484">sourmash#1484</a>. This opens the door to using larger in-memory database types, which have been hampered thus far by the loading time.</p>
<h2>Some concluding thoughts</h2>
<p>Why are we putting so much effort into all of this? There are a couple of reasons:</p>
<ul>
<li>sourmash is underpinning a lot of different work in our lab, and these kinds of efficiency enhancements really make a difference when amortized over 5 projects!</li>
<li>routine lightweight search of all public data will unlock a lot of use cases that we can only see dimly right now. But our experience has been that <em>actually building</em> stepping stones towards this dimly-seen future set of use cases is the best way to make them happen (or to figure out why they can't or shouldn't happen).</li>
<li>it's fun! This has been a labor of love during some rough pandemic times, and it's been nice to actually make visible progress on something over the last 18 months...</li>
</ul>
<p>That all having been said, it's been a lot of work to solve the engineering challenges, with only fuzzy use cases to motivate us. Moreover, a lot of the work we describe above is not directly publishable. It's not entirely clear how we'll roll this out in a way that supports people's careers. So it's a bit of a gamble, but hey, that's what tenure's for, right?</p>
<p>(There's definitely some more to discuss here about the tension between grant writing and a focus on slow, careful, and iterative engineering of new capabilities - my tagline in lab for this is "boring in theory, transformative in practice" - but that's another blog post.)</p>
<p>Some other thoughts -</p>
<p><strong>Abstractions sure are convenient!</strong> Figuring out the right APIs internally has led to a renaissance in our internal code, although I think we need to step up our code docs game, too, so that someone other than the core developers can make use of these features :(.</p>
<p><strong>Declarative approaches are awesome.</strong> It's been really nice to redefine our APIs in terms of what we want to have happen, and then implement differently performing classes (storage, search, selection) that we can mix and match depending on our requirements.</p>
<p><strong>Automated testing has been key!</strong> We embarked upon the v4 journey with a codebase at about 85% code coverage, and having those solid building blocks has been critical. We continue to discover API edge cases that need to be resolved, and then we <em>immediately</em> lock them down with more tests. Without these tests, the massive-scale refactoring we've been doing would never have worked. (A <code>git diff --stat v3.5.1</code> shows virtually every source file changed, with 17738 lines added, and 3546 removed, in a codebase with only 50,000 lines of code and tests!)</p>
<p><strong>Python is awesome</strong>. The language supports really nice abstraction layers, provides good language primitives for streaming and lazy evaluation (generators in particular!), and has both a massive stdlib and straightforward installation system that means that adding new capabilities to software built in Python is quite easy.</p>
<p><strong>Rust is awesome</strong>. We're really liking Rust as a high-performance layer under Python. The boundaries are still being negotiated - who owns what objects is a persistent theme, for example, and we're still working on fleshing out Rust support for new storage types - but Python is challenging for compute-focused multithreading while Rust supports it very straightforwardly (and way, way better than C++).</p>
<p>Last but not least, <strong>sqlite3 continues to be amazing</strong>. I'm not even that good at tuning it, and it's already incredibly efficient; if we put time and effort into better schema, we'll probably get an order of magnitude improvement out of it with only a few hours of work. We just don't need that yet :).</p>
<p>--t</p>New sourmash databases are available!2021-06-29T00:00:00+02:002021-06-29T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2021-06-29:/blog/2021-sourmash-new-databases.html<p>Databases are now available for GTDB!</p><p>(Many thanks go to Dr. Tessa Pierce for refitting our database construction
process, to Dr. Luiz Irber for underlying infrastructure work, and to Dr.
Taylor Reiter for updating the docs :)</p>
<p>While we are working on releasing sourmash 4.2, I wanted to drop a
short note - we have some new databases (and database types!)
available for <a href="https://sourmash.readthedocs.io/en/latest/">sourmash</a>,
our genome and metagenome analysis tool.</p>
<p>If you go to the
<a href="https://sourmash.readthedocs.io/en/latest/databases.html">prepared databases</a>
page for sourmash, you'll see that we now make three types of
databases available, for two different collections of GenBank genomes.</p>
<h2>Collections</h2>
<p>We've created two collections of sourmash signatures for
GTDB 06-RS202, the latest release of the <a href="https://gtdb.ecogenomic.org/">Genome Taxonomy Database</a>. (Since every genome in GTDB is in GenBank, these are really just subsets of GenBank.)</p>
<p>The smaller collection contains the 48,000 genomic representatives, a collection of genomes that is non-redundant at the species level.</p>
<p>The larger collection contains all 258k GenBank genomes for which GTDB has calculated taxonomies.</p>
<p>Why do we have this focus on GTDB? It's a nice collection of high quality genomes; it covers most of the bacterial and archaeal species diversity present in GenBank; it's not massively redundant; and it's less monstrously huge than all of GenBank microbial (also see below). And, since sourmash is (mostly) taxonomy agnostic, it doesn't matter whether you are a fan of GTDB or a fan of NCBI taxonomies.</p>
<p>For all these reasons, GTDB has become our default for searches. But again, see below.</p>
<h2>Database types</h2>
<p>We provide three database types: SBT, LCA, and Zipfile collections.</p>
<p>The SBT and LCA databases are the same database types that we've provided for several years, and you probably want one of these. They are compatible with both sourmash 3.5 and sourmash 4.x.</p>
<p>To quote from <a href="https://sourmash.readthedocs.io/en/docs_4.0/command-line.html#storing-and-searching-signatures">the documentation</a>,</p>
<blockquote>
<p>SBT databases are low memory and disk-intensive databases that allow for fast searches using a tree structure, while LCA databases are higher memory and (after a potentially significant load time) are quite fast.</p>
</blockquote>
<p>But!</p>
<p>With <a href="https://github.com/sourmash-bio/sourmash/releases/tag/v4.1.0">sourmash 4.1</a>, we <em>also</em> support a new type of sourmash database - zipfile collections. These are unindexed collections of signatures, and now serve as the basis for our database release process. They are not (yet) that useful for users, is all.</p>
<h2>Do we provide anything else?</h2>
<p>Why, yes, thanks for asking! If you look at our full <a href="https://drive.google.com/drive/folders/1ohyggli2FsOoA2PO9h74FMp8A4mznzjt">google drive folder</a>, you'll see that we also provide full manifests for the content of these databases, along with a report from <code>sourmash lca index</code>.</p>
<h2>What other collections are we planning to provide?</h2>
<p>We've spent much of the last year trying to figure out how to make all GenBank microbial genomes (=~ all non-animal/non-plant) searchable in a useful way - see e.g. <a href="https://github.com/sourmash-bio/sourmash/releases/tag/v4.1.0">sourmash 4.1</a>, which massively sped up search and gather. We'll probably provide that as a zip collection soon (not sure about an SBT, though! and almost certainly not as an LCA database).</p>
<h2>What about protein databases? And support for multiple taxonomies?</h2>
<p>At the present time, I can neither confirm nor deny that we will soon be providing prepared database for protein search. Likewise, I can neither confirm nor deny that we will soon release support for doing taxonomic analysis with either NCBI or GTDB taxonomies, or indeed <em>both at the same time</em>. So please do not engage in unwarranted speculation.</p>
<h2>Questions? Comments? Thoughts? Requests?</h2>
<p>File an <a href="https://github.com/sourmash-bio/sourmash/issues">issue</a> or come chat with us over on our <a href="https://gitter.im/sourmash-bio/community#">new gitter channel, sourmash-bio/community</a>!</p>
<p>And stay tuned!</p>
<p>--titus</p>Moving sourmash towards more community engagement - a funding application2021-06-09T00:00:00+02:002021-06-09T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2021-06-09:/blog/2021-sourmash-czi-application.html<p>CZI EOSS4 application for sourmash support</p><p>We applied for funding from CZI for sourmash a few weeks back, via the <a href="https://chanzuckerberg.com/eoss/">Essential Open Source Software for Science</a> program. Here's the core of the application (lightly edited).</p>
<p>(We'll hear about funding by end of September, I believe.)</p>
<p>Feedback welcome, unless you're alerting me to the presence of typos :)</p>
<h2>Proposal details</h2>
<p>We seek funding for maintenance and user support for the sourmash software, while embarking on an ambitious plan to improve sustainability through improved governance, enhanced inclusivity, and robust community engagement.</p>
<h2>Short description of software project:</h2>
<p>Sourmash is mature software that enables lightweight content search, comparison and classification of microbial genomes and metagenomes. Sourmash works in low memory with compact databases, supports both NCBI and GTDB taxonomies, and can operate on private collections of genomes and metagenomes. The release of v4.1 brings massive-scale search of all GenBank microbial genomes and all public metagenomes to commodity hardware. These features are underpinned by novel data structures and algorithms, including an extension of MinHash that supports containment and the use of min-set-cov to do highly accurate metagenome analysis. Sourmash serves as a robust, reliable, and performant backbone for microbial sequence analysis.</p>
<p>We use development practices based on 30 years of scientific software engineering expertise: we develop in the open, do code review, have tests with 90%+ line coverage, and have a robust release process with semantic versioning. We provide thorough documentation, engage with users via our issue tracker, and use social media to broadcast new features and use cases. The utility of sourmash has been recognized by both users and funding agencies: we are increasingly well cited, the NSF is supporting the development of flexible taxonomies and distant evolutionary classification via protein k-mers, and the NIH is supporting iHMP reanalysis.</p>
<h2>Proposal Summary</h2>
<p>Sourmash is mature software that serves as a stable component of sequence analysis workflows, a fast and lightweight tool for massive-scale search of public and private sequence databases, and a platform for novel data structure and algorithm exploration. Sourmash is explicitly designed to meet the computational needs created by the massive expansion of sequencing capacity in microbiome biology.</p>
<p>We have arrived at an important crossroads with sourmash. We are just now releasing mature support for petabase-scale content search (v4.1.x and v4.2), and are currently writing up our novel data structures and algorithms for publication. We have ongoing projects using sourmash to analyze Human Microbiome Project datasets, including discovering strain-specific markers of Inflammatory Bowel Disease. Simultaneously, grant support for the core development of sourmash is ending, and Dr. Luiz Irber, the core developer behind most of the scaling work, is moving to another job where sourmash will become his part-time project. While sourmash research development will continue, we have no way to robustly support our current user base and grow the developer community with traditional funding, and do not have the governance infrastructure to productively engage with other support mechanisms.</p>
<p>We request support from CZI to maintain our newly released features through continued sourmash core development, while working toward sustainability by growing the project out of the lab and into the community. We propose to use the funding period to expand the sourmash community, define and grow a governance framework, connect to the Python and Rust bioinformatics ecosystem, and train both biologists and bioinformaticians to better engage with open source bioinformatics software. In particular, we see an opportunity to use sourmash to provide one example of how to grow a small project based in a single lab into a more sustainable community-based project. Importantly, this kind of maintenance and community growth does not fall within the scope of traditional funding opportunities.</p>
<p>At the end of this two year period, we will have continued to release and support high performance, high impact software. We will also have expanded our developer and user community, chosen a governance framework, identified a fiscal sponsorship plan, and published our strategies for project growth and sustainability.</p>
<h2>Work Plan</h2>
<h3>Software development activities:</h3>
<p>We propose to follow a “python-dev” model in which maintenance and feature releases proceed on their own timeline, while the roadmap process coordinates the planning and development of related feature sets (e.g. taxonomy extensions and database formats are connected). This separates maintenance updates from the “slow science” process of developing, testing, and evaluating new functionality against scientific use cases, while also ensuring that fully baked new functionality does regularly get released. Software development will proceed under our current “async” model, in which all decisions are discussed and documented openly in GitHub. </p>
<p>Fully 50% of the funded effort on this proposal goes to the “maintenance mode” activities, which are intended to further regularize the development process and support iterative, gradual performance improvement while preventing feature and performance regressions. This will include:</p>
<ul>
<li>regular releases;</li>
<li>continued maintenance of and improvements to the software development and release process;</li>
<li>database updates and releases as new genomes and metagenomes are made public;</li>
<li>regular JOSS publications on major new versions (v4, v5, etc.);</li>
<li>structural improvements to the sourmash core, including a plugin architecture for storage formats, new command-line subcommands, and visualizations;</li>
<li>sketch serialization documentation and format upgrades to store more metadata and support higher-performance binary formats.</li>
</ul>
<h3>Community engagement activities:</h3>
<p>The community engagement activities below seek to build, grow, and support an active and robust user and developer community that includes biologists, bioinformaticians, computer scientists, and software engineers.</p>
<p>As sourmash matured, we focused our efforts toward building sustainable software and developing advanced use cases within the lab first, with documentation for new users added via GitHub issues, blog posts, and feature papers. However, this has resulted in somewhat uneven support resources: e.g. we lack intermediate-level tutorials helping users transition from our introductory tutorials to advanced use cases or Python API usage. We will upgrade our documentation systematically, create a “recipes” site, and construct an FAQ section that is well integrated with the documentation by reorganizing and amending existing content.</p>
<p>We plan to provide a warm, welcoming community forum that encourages new user questions and contributions. This will require engaged moderators, a strong Code of Conduct process, and a large user base, which we have not had the bandwidth to support previously. A key outcome of this funding will be the clear definition of a single support forum for sourmash, as one of the first outputs of our governance process.</p>
<p>Contributors may come from both the user community and the broader bioinformatics/CS community. We routinely source use cases, ideas for new functionality, and requests for performance improvement from the current biology-focused user community, and will encourage deeper and broader contributions through our governance and contributor framework, discussed below.</p>
<p>Similarly, there are many implementation aspects of sourmash that are interesting to, and may provide fodder for, CS and software engineers who are interested in contributing to bioinformatics software. While this is supported within the lab, these challenges are not immediately obvious or accessible to others without some biological background and appropriate documentation. We will build tutorials and documentation that highlight the algorithmic and implementation aspects of sourmash (sketching approximations, scaling issues, indexing formats, performance benchmarking, and quality-of-result benchmarking) and provide guidance for CS researchers who wish to evaluate new algorithms. Our governance and contributor framework will welcome extensions and evaluations and require neither permission nor involvement from sourmash core.</p>
<p>We see great value in further broadening our contributor base, and will continue to improve our current support for first-time OSS contributors by expanding our new contributor issue labels beyond “good first issue”, “good next issue”, and “repeatable quest”. While we do not expect many of these contributors to become long-term sourmash contributors, some may; more importantly, a steady influx of new first-time contributors will ensure that our development documentation remains accurate and useful. In support of this effort, we have budgeted for two 10 hrs/wk undergraduates to continue to contribute. We will also offer first-time contributor collaboratives, run documentation and visualization improvement hackyfests, and contribute to hackathons at BOSC and PyCon.</p>
<h3>Governance activities:</h3>
<p>We will build a Steering Council that guides governance, defines contributor guidelines, authorship considerations, and oversees the roadmap process. As part of this, we will nucleate “sourmash.bio” and move development activities out of the dib-lab organization. The Steering Council will also define the scope of the project and outline contribution mechanisms, most likely via a fiscal sponsor (perhaps the Software Freedom Conservancy).</p>
<h2>Milestones and Deliverables:</h2>
<p>We will deliver regular releases of sourmash under semantic versioning, per http://ivory.idyll.org/blog/2021-sourmash-v4-released.html. We anticipate approximately quarterly releases of major.minor versions, with more frequent patch releases.</p>
<p>We will update our roadmaps for v4.2.x, v5, and beyond on a quarterly basis. All planned features for these versions are discussed in the issue tracker. Each minor release will feature a link to updated roadmaps for the coming features. The issue tracker will continue to be constantly updated and refined in conjunction with releases and roadmaps.</p>
<p>These releases will also see regular refinement and updates of both the Python layer and the Rust layer; a major goal of our project is to expand our Rust contributor pool via CS undergrads and also (potentially) engagement with rust-bio.</p>
<p>We will simultaneously engage in iterative refactoring of our documentation to include not just getting-started docs and tutorials, but also detailed guidelines on how to get started contributing, video guides to sourmash, a “recipe” site that outlines solutions to common use cases, developer-oriented documentation for new plugins and visualizations, and a CS-focused introduction to the problems that sourmash is tackling. Recipes will be in place by mid-2022 and major updates will be delivered on a semi-annual basis.</p>
<p>Each summer (2022 and 2023) we will participate in undergraduate research projects (e.g. the National Summer Undergraduate Research Program) and introduce biology and CS undergraduates to problems in microbial genomics and metagenomics, including but not limited to sourmash. We will also participate in summer training courses (STAMPS at MBL, and DIBSI at UC Davis) as was our usual pre-pandemic practice (2010-2019).</p>
<p>We will offer at least two webinars and four hackfests annually, with our focus varied between attracting new users, attracting new developers, refining our documentation, exploring new functionality and improving our UX, and highlighting new analysis opportunities.</p>
<p>In December of 2021, 2022, and 2023 we will provide a detailed update of our governance progress and future plans. By December 2021, we will have issued invitations to a Steering Council, and begun the process of holding quarterly meetings. By December 2022, we will have engaged with potential fiscal sponsors and identified a path forward.</p>
<p>By mid-2022, we will have designated and seeded a support forum for sourmash.</p>
<p>While this will not be supported by this proposal specifically, we will also have submitted two papers on sourmash by December 2021.</p>
<p>In terms of metrics,</p>
<ul>
<li>We will have engaged with over 1000 new users via hackyfests, webinars, etc. as a direct result of CZI funding.</li>
<li>Our stretch goal is over 500 citations combined for sourmash core papers by Dec 2023.</li>
<li>We hope to be the “stable, boring” option for petabase-scale content search and expect to have seen substantial growth in user support and functionality requests for these use cases.</li>
<li>We also expect to see a dozen or more 3rd-party extension modules adding new format import/export and visualizations to sourmash.</li>
<li>We will have submitted at least two major releases (v4 and v5) to JOSS, one in 2021 and one by the end of 2023.</li>
</ul>
<h2>Value to Biomedical Users:</h2>
<p>As the biomedical field increasingly moves towards large-scale sequencing, both of single genomes (e.g. individuals) and metagenomes (e.g. gut microbiome), lightweight analysis tools are becoming an essential part of core biomedical treatment and research. Sourmash provides a lightweight and robust interface for these analyses. In particular, we note four of our well-developed applications have considerable biomedical relevance for sequencing data analysis generally, and microbiome work specifically:</p>
<p>(1) finding the minimal list of relevant genomes for a microbiome, from all available (800k+) microbial and viral genomes;</p>
<p>(2) searching all microbiome data sets for a specific genome;</p>
<p>(3) detecting and removing contamination in metagenome, genome and transcriptome data sets;</p>
<p>(4) extraction of annotation independent features to support machine learning.</p>
<p>These applications are already under active use for large-scale biomedical data: the NIH has provided short-term funding to Dr. Brown in support of applying sourmash systematically to the Human Microbiome Project data sets, and we have an ongoing project using sourmash to discover strain-specific markers of Inflammatory Bowel Disease using a random forest approach.</p>
<p>Beyond the technical aspects of sourmash, we will work towards being a good example of a scientific open source project in biology/bioinformatics, by intentionally moving towards community governance, rewarding a wide variety of contributions, providing use-case focused tutorials, and guiding sourmash users towards how to support and evolve sourmash.</p>
<h2>Diversity, Equity, and Inclusion Statement:</h2>
<p>We believe that social barriers to contribution are a major cause of the low diversity in scientific OSS, and we are committed to systematically lowering these barriers while also lifting contributors over these barriers.</p>
<p>We also believe that lightweight and robust methods that support large-scale data discovery and reuse can expand bioinformatics into the “lightly resourced” space, e.g. Primarily Undergraduate Institutions; this is an equity issue because so many current methods require substantial resources simply to get started.</p>
<p>Training modules at the DIBSI and STAMPS workshops will introduce sourmash to a diverse range of research-focused participants. NSURP is focused on undergraduates from underrepresented backgrounds, and in 2020 we hosted two Latinx undergraduates. UC Davis is also an HSI and our undergraduate researchers will be recruited with attention to diversity.</p>
<p>We need a stronger CoC response framework, both for forum moderation and for project contributors; currently, the CoC process is based on the BDFL model, which is inadequate. This is important for DEI and antiracism, and improving our CoC process is one of our main goals in finding a fiscal sponsor who can provide a larger framework within which we can operate.</p>
<p>Last but not least, we believe that providing authorship for all contributors, including those who contribute use cases, recipes, and documentation, provides a way to formally recognize contributions that are traditionally undervalued in both open source projects and academia. Recognizing this kind of “invisible labor” is fundamentally an equity issue.</p>Searching all public metagenomes with sourmash2021-06-08T00:00:00+02:002021-06-08T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2021-06-08:/blog/2021-MAGsearch.html<p>Searching all the things!</p><p>In preparation for an NIH/DOE workshop I'm attending today on
"Emerging Solutions in Petabyte Scale Sequence Search", I thought I'd
write down what we're currently doing with sourmash for public
metagenome search. I'm writing this blog post in a hurry, and I may
revise it later as I receive comments and feedback; I'll point to a
diff if I do.</p>
<p>This is based largely on work that was <a href="https://blog.luizirber.org/2020/07/24/mag-results/">done by Dr. Luiz Irber last year</a>, as part of his PhD work with me.</p>
<p>sourmash itself is available (see
<a href="https://sourmash.readthedocs.io">sourmash.readthedocs.io/</a>), and we
just released v4.1.2 yesterday! It's under the BSD 3-clause license
and is fully available via conda and pip.</p>
<h2>In brief - lightweight metagenome search with MAGsearch</h2>
<p>Today, we can use MAGsearch to robustly find matches to 10kb+ sequences (or collections of 10,000 or more k-mers) across all publicly available metagenomes, out to about 93% ANI.</p>
<p>It's particularly useful for -</p>
<ul>
<li>gathering candidates from public metagenomes for e.g. outbreak detection.</li>
<li>finding matches to a particular species or genus so as to study its ecological distribution.</li>
<li>gathering data sets to expand our knowledge of a species' pangenome.</li>
</ul>
<p>A search with ~100 query genomes takes about 17 hours, today, and will search 580,000 metagenomes representing 530 TB of original sequence data.</p>
<h2>How it works underneath</h2>
<p>We use <a href="https://sourmash.readthedocs.io/en/latest/">sourmash</a> to support metagenome containment search with scaled signatures.</p>
<p>sourmash scaled signatures are derived from MinHash techniques. They are compressed representations of k-mer collections, and can reliably be used to find exact matches of ~10kb segments of DNA between any two collections; larger matches can be found out to about 93% ANI.</p>
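<p>The "scaled" idea is simple enough to sketch in a few lines of Python: hash every k-mer and keep only the hashes that fall below a fixed fraction of the 64-bit hash space, so any two sketches retain a comparable, containment-friendly subsample. This toy version uses a stand-in hash function - sourmash itself uses MurmurHash3.</p>
<pre><code>import hashlib

MAX_HASH = 2**64

def kmer_hash(kmer):
    # stand-in 64-bit hash; sourmash actually uses MurmurHash3
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")

def scaled_sketch(seq, ksize=31, scaled=1000):
    """Keep roughly 1/scaled of all k-mer hashes (a 'scaled' sketch)."""
    threshold = MAX_HASH // scaled
    return {h for i in range(len(seq) - ksize + 1)
            if threshold > (h := kmer_hash(seq[i:i + ksize]))}

def containment(query, subject):
    """Fraction of the query sketch found in the subject sketch."""
    return len(query.intersection(subject)) / len(query) if query else 0.0
</code></pre>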
<p>One key aspect here is that search can be done without access to the original data.</p>
<p>We maintain a collection of signatures for ~580,000 public metagenomes with the SRA for k=21, 31, and 51. A search with about 100 genome-sized queries currently takes about 17 hours using 32 threads with 48 GB of RAM (on our HPC).</p>
<p>Our complete collection of signatures is approximately 10 TB total, although this contains far more than the metagenome data - it contains 3.7m signatures, representing 1.3 PB of total data (SRA metagenomes + SRA non-plant/animals + GenBank/RefSeq microbial genomes).</p>
<p>This collection of signatures is automatically updated by <a href="https://github.com/dib-lab/wort">wort</a>, which coordinates a distributed collection of workers to compute signatures as new data arrives at NCBI.</p>
<h2>Simple opportunities for improvement</h2>
<p>MAGsearch is a robust prototype, with many straightforward opportunities for improvement. I would guess that with a few weeks of focused investment, we could get down to ~1 hour per search.</p>
<p>First, the MAGsearch code doesn't do anything special in terms of loading; it's using the default sourmash signature format, which is JSON. For example, binary encodings would decrease the collection size a lot, while also speeding up search (by decreasing the load time).</p>
<p>Second, searching the signatures is done linearly, and uses Rust to do so in parallel. It uses the same Rust code that underlies sourmash (but is several versions behind the latest version). Making use of recent improvements in sourmash Rust code would probably speed this up several fold.</p>
<p>Third, we can now add protein signatures to our collection of DNA signatures, which would enable much more sensitive search. (We'd have to sketch a lot of data, though. :)</p>
<h2>Broader limitations</h2>
<p>The internal data structures we use in sourmash are optimized for relatively small collections of k-mers, because sourmash is built around downsampling k-mer collections. We're slowly improving our internal structures, but supporting <em>all</em> k-mers is not straightforward and is not something on our current roadmap.</p>
<p>Our sketching techniques only support individual k-mer sizes/molecule types. So while we can compute, store and search multiple k-mer sizes for DNA, protein, Dayhoff encodings, etc., they are stored separately and don't "compress" together. This means that signature collections grow quickly in size as we provide more k-mer sizes and molecule types!</p>
<p>We're not quite sure how to provide our current databases to people. Personally I'm not really ready to support MAGsearch as a service, either, but that's partly because of a lack of funding.</p>
<h2>What else does sourmash offer?</h2>
<p>sourmash itself is stable and well tested, and can be used with confidence to do many bioinformatics tasks. It is easy to install (pip/conda), and is reasonably well documented.</p>
<p>Our data structures and algorithms are simple and well-understood and straightforward to (re)implement. While they aren't yet all published, we are happy to explain them and tell you where they will and won't work.</p>
<p>sourmash is fast, and low memory, and requires little disk space for even pretty large collections of signatures.</p>
<p>sourmash has an increasingly useful command-line interface that supports many common k-mer and search operations. In this sense, it can be used as a partial guide for a good "default" set of operations that k-mer-based tools could support. We have paid a fair amount of attention to user experience, too.</p>
<p>Underneath, sourmash has a flexible Python API whose performance-critical internals are slowly being replaced with Rust. This means that we can quickly prototype new functionality while refactoring critical functionality underneath, so sourmash performance is continually improving while we are also tackling new use cases.</p>
<p>We have an open, robust approach to software development, with an increasingly diverse array of contributors. I'm not sure we're ready to take on a lot of new contributors quite yet, because our roadmapping processes are not very mature, but we're working on that.</p>
<p>We use semantic versioning for the sourmash package itself, and we communicate clearly about breaking changes. As a result, sourmash can be cleanly integrated into workflows with simple version pinning requirements.</p>
<p>We support public and private collections of signatures, and all of our primary search and analysis approaches work with multiple databases or signature collections without needing to re-index them or combine them from scratch. </p>
<p>We also support flexible "free-form" taxonomy, and in particular support both NCBI and GTDB taxonomies.</p>
<h2>Where would I like to see petabase-scale search go?</h2>
<p>I wouldn't advocate for sourmash itself (either the software or the underlying techniques) as the one true method for searching all (meta)genomic data. Among other things, sourmash has a lot of other use cases that matter to us!</p>
<p>But I think we have a few experiences to offer to any such effort -</p>
<ul>
<li>we have functioning implementations that support a number of really useful use cases for metagenome search and analysis. It would be nice not to lose those use cases!</li>
<li>high-sensitivity prefiltering approaches are good and enable flexible triage afterwards. We mostly use sourmash as a lightweight way to find all the things that we <em>might</em> care about, before doing more in-depth analysis.</li>
<li>having both command-line and Python APIs has been incredibly useful, and I think it would be a mistake to bypass good APIs in favor of a Web API. Of course, this also increases the developer effort by a lot, but the return is that you enable a lot more flexibility.</li>
<li>riffing more on that, I think it would be a mistake to write a custom Web-hosted indexing and search tool that only works with NCBI formats and taxonomies.</li>
<li>riffing even more on that, it's been great to be able to quickly add databases/collections to search, and supporting both completely private databases as well as rapid updating of public database collections is something that has been really useful in comparison to many other metagenome analysis tools.</li>
<li>simplicity of data structures and algorithms has helped us a lot with sourmash. Software support is fundamentally a game of maintenance and it has been great to be able to reimplement our core data structures and algorithms in multiple languages. In particular, I worry a lot about premature optimization when I look at other packages.</li>
</ul>
<p>Luiz has also done a lot of thinking about distributed computing and decentralization via Dat and IPFS that I think could be valuable, but I'm not expert enough to summarize it myself. Hopefully Luiz will write something up :). (You can already <a href="https://github.com/luizirber/phd/tree/master/thesis">check out his PhD thesis, chapters 4 and 5, for some juicy details and discussion, though!</a>)</p>
<h2>What other tools should we be looking at for large scale search?</h2>
<p>I think <a href="https://www.biorxiv.org/content/10.1101/2020.08.07.241729v2">Serratus</a> did an excellent job of showing some of the possibilities of massive-scale metagenome search!</p>
<p>There's lots of tools out there in various stages of development, but I am particularly interested in <a href="https://www.biorxiv.org/content/10.1101/2020.10.01.322164v1">metagraph</a>.</p>
<p>I'd love to hear about more tools and approaches - please drop them in the comments or on twitter!</p>
<p>--titus</p>sourmash 4.1.0 released!!2021-05-17T00:00:00+02:002021-05-17T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2021-05-17:/blog/2021-sourmash-v4.1.0-released.html<p>sourmash v4.1.0 is here!</p><p>We are pleased to announce that sourmash v4.1 is now out! As usual <a href="https://sourmash.readthedocs.io/en/latest/#installing-sourmash">it can be installed via conda or pip</a>. You can <a href="https://github.com/dib-lab/sourmash/releases/tag/v4.1.0">read the release notes here</a> for details, or just read on here for the highlights!</p>
<h2>One big new command-line feature - zipfile collections.</h2>
<p>One command-line feature that opens up a lot of new opportunities down the line is support for zipfile collections.</p>
<p>Zipfile collections provide a way for sourmash to take in potentially <em>very</em> large collections of signatures. Briefly, you can take a directory hierarchy of signatures and zip them all up, and sourmash can now load the signatures directly from the zip file - so you can distribute collections of signatures, search and gather and compare on them, and so on.</p>
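<p>Since signatures are stored as JSON, loading them straight out of a zip needs nothing beyond the Python standard library. A rough sketch, assuming uncompressed <code>.sig</code> members (real collections may gzip them):</p>
<pre><code>import json, zipfile

def iter_signatures(zip_path):
    """Lazily yield signatures from a zipped collection, one member at a time."""
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith(".sig"):
                yield json.loads(zf.read(name))

# nothing is decompressed until requested, and only one signature is
# ever held in memory:
# for sig in iter_signatures("gtdb-collection.zip"):  # hypothetical filename
#     ...
</code></pre>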
<p>Now, back in v3.3.0, Luiz <a href="http://ivory.idyll.org/blog/2020-sourmash-databases-as-zip-files.html">added .zip as a storage format for Sequence Bloom Trees</a>, indexed databases of signatures. These are fantastic, but because of the nature of SBT indices, they came with some restrictions - the signatures they contained had to be compatible, and big SBTs consumed a lot of disk space and memory. While we're working on fixing that separately, zipfile collections offer an alternative that is not faster but is considerably more convenient.</p>
<p>In particular, unlike SBTs, zipfile collections can store incompatible signatures, and they don't consume any extra memory, and they don't require any ancillary files. This lets us (not so hypothetically...) store k=21, k=31, and k=51 signatures for <a href="https://osf.io/wxf9z/files/">all 300k+ GTDB genomes</a> in a fairly small (~8.5 GB) zipfile. You can also get the GTDB representatives in even smaller files (1.5 GB) and we built SBTs for them, too (2.8 GB each).</p>
<p>The remaining problem is that zipfile collections aren't indexed, and so searching 300k+ signatures is not really that fast because you're doing it linearly. While <code>search</code> can handle it, iterative approaches like <code>gather</code> cannot. To that end, we added interim support via <code>prefetch</code>. Read on!</p>
<h2>Another nifty command line feature: <code>prefetch</code>.</h2>
<p><code>sourmash prefetch</code> is a command that basically does a <code>sourmash search --containment</code>. It only works on scaled signatures (more about that soon, promise), and it's meant as a prefilter for <code>sourmash gather</code>. The idea is, you run <code>prefetch</code> with a metagenome query, and <code>prefetch</code> finds all of the potentially relevant signatures, and then saves them for you. Then, you run <code>sourmash gather</code> on the saved signatures, which winnows them down to the smallest possible list of genomes relevant to your metagenome.</p>
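<p>Under the hood, the winnowing step in <code>gather</code> is a greedy minimum-set-cover: repeatedly take the match that covers the most remaining query hashes, subtract those hashes, and repeat. A toy version over plain hash sets (not sourmash's actual implementation):</p>
<pre><code>def greedy_gather(query_hashes, candidates, min_overlap=3):
    """Winnow candidates (a dict mapping name to hash set) down to a minimal
    list of matches that together cover the query."""
    remaining = set(query_hashes)
    results = []
    while remaining and candidates:
        # pick the candidate covering the most still-unexplained hashes
        name = max(candidates,
                   key=lambda n: len(remaining.intersection(candidates[n])))
        overlap = len(remaining.intersection(candidates[name]))
        if min_overlap > overlap:
            break
        results.append((name, overlap))
        remaining.difference_update(candidates.pop(name))
    return results
</code></pre>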
<p>Why implement <code>prefetch</code>? A couple of reasons -</p>
<ul>
<li>we already had code in some other projects that did this and was quite useful.</li>
<li>it was an easy feature to implement that led to massive speedups when doing certain kinds of parameter exploration.</li>
<li>it's explicitly streaming compatible, because it doesn't need to hold anything in memory long-term - it's meant to walk across whatever (potentially very, very large...) databases and collections you give it, and output any relevant matches. As we're approaching a million genomes in GenBank, this feature seemed ...relevant.</li>
<li>last but not least, <em>internal</em> support for prefetch goes with some excellent internal primitives that can now be further optimized. More about THAT in some future releases :).</li>
<li>prefetch also lets us support some other features, such as reporting ties in <code>sourmash gather</code>. We don't do that yet, but we can do so <em>much</em> more easily now.</li>
</ul>
<p>So how would you use prefetch? You don't need to, really - it now underpins gather, so <code>sourmash gather</code> on a zipfile containing all 300k+ GTDB genomes will actually run much, much faster than it ever would have before, despite using a linear search underneath.</p>
<p>For some speed comparisons of the new features, see <a href="https://github.com/dib-lab/sourmash/issues/1530">sourmash issue #1530</a> - here's the summary, for searching approximately 45,000 signatures from GTDB with a fake metagenome built from 4 genomes -</p>
<table>
<thead>
<tr>
<th>scenario</th>
<th>time (s)</th>
<th>max RAM (MB)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1. index, prefetch</td>
<td>10</td>
<td>215</td>
</tr>
<tr>
<td>2. index, no prefetch</td>
<td>22</td>
<td>214</td>
</tr>
<tr>
<td>3. no index, prefetch</td>
<td>207</td>
<td>81</td>
</tr>
<tr>
<td>4. no index, no prefetch</td>
<td>811</td>
<td>87</td>
</tr>
</tbody>
</table>
<p>So obviously you want to use an index if you have the memory, but if you don't, you definitely want to use prefetch! Happy to discuss the scaling behavior in the comments or over at the GitHub issue, too - the short version is that the time for rows (2) and (4) should scale with the diversity of the metagenome, while (1) and (3) should be mostly independent of diversity (which is what you want!).</p>
<p>Important note: before sourmash 4.1, row (2) above was the only behavior supported. :) All of the behaviors above can be toggled at the command line.</p>
<h2>Last but by no means least: flexible and online output formats for saving signatures</h2>
<p>As we were implementing all this, it turned out to be easy to refactor in some more flexible output formats. You can now specify that <code>sketch</code>, <code>search</code>, <code>gather</code>, and <code>prefetch</code>, as well as many of the <code>sourmash sig</code> manipulation commands, should put their output signatures in a directory (<code>output/</code>), a Zip file (<code>output.zip</code>), or a compressed sig file (<code>output.sig.gz</code>). These are also streaming compatible - the zipfile and directory outputs save matches "as you go", without holding them in memory. And, since all of these can be passed into sourmash as collections to search and manipulate, we have a pleasingly complete set of storage formats!</p>
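<p>To make that concrete, here's a hypothetical sketch; the storage format is picked from the output name you give, so the same command can write to any of the three.</p>
<div class="highlight"><pre><span></span><code>import subprocess

# hypothetical sketch: a trailing slash means "directory", .zip means
# "zipfile", and .sig.gz means "compressed signature file".
for output in ("sigs/", "sigs.zip", "sigs.sig.gz"):
    subprocess.run(["sourmash", "sketch", "dna", "genome.fa",
                    "-o", output], check=True)
</code></pre></div>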
<h2>Other features</h2>
<p>The <a href="https://github.com/dib-lab/sourmash/releases/tag/v4.1.0">release notes</a> should be pretty comprehensive, and they do contain links into the pull requests (and from there, into the issues) that we addressed for this release. In particular, note that as our user base expands we're getting a wider range of issues submitted. Many of these are straightforward to fix - so this release addresses a fair number of those user requests, too.</p>
<p>Notably, this release should address a number of Windows issues around encodings and newlines; we don't yet provide wheels for Windows, but we're getting a lot closer!</p>
<h2>Internal improvements and enhanced flexibility</h2>
<p>For me the really exciting thing is the internal refactoring that underpins the features above. We've significantly reworked the internals to consolidate code, make new features easier to add, better support streaming of large signature collections, and permit many more optimizations. Coincidentally (we swear!) the refactoring also sped up some of our core operations - <code>sourmash gather</code>, in particular, is twice as fast and consumes 80-90% less memory on SBTs! Maybe that's the sign of a good refactoring? Or maybe we just got lucky...</p>
<p>sketchily yours,
--titus</p>sourmash 4.0 is now available! Low low cost if you buy now!2021-03-04T00:00:00+01:002021-03-04T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2021-03-04:/blog/2021-sourmash-v4-released.html<p>sourmash v4.0.0 is here!</p><p>So, we just <a href="https://github.com/dib-lab/sourmash/releases/tag/v4.0.0">released sourmash 4.0</a>, our Python- and Rust-based open source tool for k-mer sketch-based analysis of metagenomes and genomes.</p>
<p>The high notes of this release are -</p>
<ul>
<li>much better user experience design around creating and storing sketches;</li>
<li>removal of several obsolete features that were holding us back;</li>
<li>improved default Python API;</li>
</ul>
<p>but they really don't particularly matter, to be honest :).</p>
<p>What's most cool about this release is...</p>
<h2>Semantic versioning, feature compatibility, and deprecations</h2>
<p>...it's a release where we purposely broke compatibility with previous versions, and went through a whole deprecation effort, and documented it all.</p>
<p>We use semantic versioning for sourmash. What this means is that major versions of sourmash (v3, v4, etc.) can break backwards compatibility, but minor versions (v3.1, v3.2) cannot. In practical terms, it means that when you use sourmash in a workflow or application, you can pin your software install to the major version without worrying about breakage - e.g. specify <code>sourmash >=3,<4</code>.</p>
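<p>For example, a (hypothetical) downstream package might express that pin in its <code>setup.py</code> like so; the same constraint works in a requirements file:</p>
<div class="highlight"><pre><span></span><code>from setuptools import setup

setup(
    name="my-sourmash-workflow",   # hypothetical dependent package
    version="0.1",
    # any sourmash 3.x release is fine, but 4.0 may break us:
    install_requires=["sourmash>=3,<4"],
)
</code></pre></div>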
<p>In the case of v3.x and v4.0, we systematically upgraded and improved sourmash performance and features during 3.x, and reserved breaking features for 4.0. Further, we added warnings and deprecations to v3.5 about features that were <em>going</em> to break in v4.0. Then we wrote a <a href="https://sourmash.readthedocs.io/en/latest/support.html#migrating-from-sourmash-v3-x-to-sourmash-v4-x">migration guide</a>.</p>
<p>It was a lot of work! I probably put 40-80 hours into just this aspect of things over the last three months.</p>
<p>So... why did we do it?</p>
<h2>Why did we do all this work?</h2>
<p>We're not really sure how many people use sourmash outside the lab, but <em>in</em> the lab, we use it quite a bit. It's a pretty effective Swiss army knife tool for hacking and slashing at sequencing data, and a lot of basic questions about taxonomy and k-mer content can be answered with a little creative sourmashing.</p>
<p>And we have 5-6 workflows and pipelines that rely on sourmash.</p>
<p>So the first answer is that we did it for ourselves, so that we could robustly rely on sourmash in our workflows.</p>
<p>But the more complete answer is that we wanted to go through the semantic versioning & deprecation workflow and user communication/documentation stuff, so that we could just bake it into project expectations (for this project and for others). And we wanted to do this because we think this is the right way to do scientific software, and we wanted to communicate our expectations about changing sourmash behavior clearly and unambiguously.</p>
<h2>What do we get out of it?</h2>
<p>We're signaling to our current and prospective user base that we are open to their concerns. This results in improved user communication: for example, after our first release candidate we got some feedback that caused us to explicitly note that (1) numerical results shouldn't change and (2) old sourmash databases are still compatible with the new version.</p>
<p>We're providing a path for ourselves and future developers on how to think about pacing our changes to the software. On the one hand it's frustrating to delay cool and important changes because we're not yet ready to release a big version; on the other hand, we took the time to more completely bake some of the new features and did several rounds of documentation improvement.</p>
<p>And, frankly, I think we ended up with better code reviews and development processes internally, because we had to think explicitly about how each particular change would impact users. (FWIW, our best guess is that we have about 1,000 users.)</p>
<h2>What are the downsides?</h2>
<p>Well, it was a lot of work :). And investment in the future of an academic software project is always a gamble!</p>
<p>Also, we don't have the person power to maintain multiple releases of sourmash, so it does mean we're more or less abandoning people who want to continue using sourmash v3.x. We didn't break any particularly big features, but it does require effort on our users' side to upgrade, so maybe some people will hold off because of that. And while I'll backport fixes to really important bugs if we have any in the next few months, we don't intend to backport performance improvements or new features. So maybe users will suffer a bit from that.</p>
<h2>What's next?</h2>
<p>We have a lot of new features that will probably come out in v4.1 and beyond, now that we can switch our efforts to that! Lots of exciting stuff is coming in the areas of protein k-mers and massive-scale database search!</p>
<p>I'm already looking forward to v5, where we can remove some of the features that we deprecated for v4.0.</p>
<p>And I think it's more than time for a new JOSS paper...</p>
<p>--titus</p>sourmash v4.0.0 release candidate 1 is now available for comment!2021-02-19T00:00:00+01:002021-02-19T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2021-02-19:/blog/2021-sourmash-v4-rc1-now-available.html<p>sourmash v4.0.0 is coming!</p><p>Hello everyone,</p>
<p>we are happy to announce the (imminent ;) release of sourmash 4.0, and present sourmash v3.5.1 and sourmash v4.0.0rc1 (release candidate 1) for your comments and questions!</p>
<p>sourmash is a command-line tool + Python & Rust library for quickly searching, comparing, and analyzing genomic and metagenomic data sets.</p>
<p>If you use sourmash regularly and are interested in upgrading, we are providing you with this release candidate so you can try out the migration guide and the new/revised functionality.</p>
<p>Draft release notes for 4.0.0 are <a href="https://github.com/dib-lab/sourmash/releases/tag/v4.0.0rc1">here</a>, and we have a <a href="https://sourmash.readthedocs.io/en/latest/support.html#migrating-from-sourmash-v3-x-to-sourmash-v4-x">migration guide as well</a>.</p>
<p>Please note that sourmash uses <a href="https://sourmash.readthedocs.io/en/latest/support.html#versioning-and-stability-of-features-and-apis">semantic versioning</a>, so v3.5.1 should not break any features or functionality. You should <a href="https://sourmash.readthedocs.io/en/latest/support.html#version-pinning">version pin</a> your sourmash dependencies to <code>>=3,<4</code> if you want to continue using sourmash as before.</p>
<h2>sourmash v3.5.1 is the last release of v3.x</h2>
<p>sourmash v3.5.1 should be the last release of sourmash v3. It adds warnings for features changed in 4.0.</p>
<p>More info on 3.5.1: https://github.com/dib-lab/sourmash/releases/tag/v3.5.1</p>
<p>You can install it like so from <a href="https://pypi.org/project/sourmash/3.5.1/">PyPI</a>:</p>
<div class="highlight"><pre><span></span><code>pip install sourmash==3.5.1
</code></pre></div>
<p>or with conda from bioconda and conda-forge:</p>
<div class="highlight"><pre><span></span><code>conda install -c conda-forge -c bioconda sourmash=3.5.1
</code></pre></div>
<h2>sourmash v4.0.0 is coming soon!</h2>
<p>sourmash v4.0.0rc1 is a feature-complete release of 4.0, with full migration docs. It contains many improvements and some breaking changes from 3.x.</p>
<p>Please see https://github.com/dib-lab/sourmash/releases/tag/v4.0.0rc1 for details!</p>
<p>To install <a href="https://pypi.org/project/sourmash/4.0.0rc1/">sourmash v4rc1 from PyPI</a>, please use:</p>
<div class="highlight"><pre><span></span><code>pip install --pre sourmash==4.0.0rc1
</code></pre></div>
<p>(You can also install sourmash v3.5.1 from conda to get the
dependencies, and then upgrade to the latest version using pip.)</p>
<h2>Feedback requested!</h2>
<p>We would very much appreciate feedback on the new features in sourmash, as well as comments and questions about upgrading. Please put comments on <a href="https://github.com/dib-lab/sourmash/issues/1338">the migration issue</a>.</p>
<p>C. Titus Brown and Luiz Irber</p>
<p>(for the sourmash development team :)</p>Transition your Python project to use pyproject.toml and setup.cfg! (An example.)2021-02-02T00:00:00+01:002021-02-02T00:00:00+01:00Titus Browntag:ivory.idyll.org,2021-02-02:/blog/2021-transition-to-pyproject.toml-example.html<p>Updating old Python packages, in this year of the PSF 2021!</p><p><em>Thanks to Luiz Irber for all the pre-work on sourmash, as well as the code reviews on screed; and Brett Cannon for a review of an earlier version of this blog post!</em></p>
<p>The future of Python packaging is pyproject.toml, and (for now) setup.cfg, based on <a href="https://snarky.ca/clarifying-pep-518/">PEP 518</a> and (soon) <a href="https://www.python.org/dev/peps/pep-0621/">PEP 621</a>.</p>
<p>For some background, please read <a href="https://snarky.ca/what-the-heck-is-pyproject-toml/">"What the heck is pyproject.toml"</a>, and also <a href="https://discuss.python.org/t/where-to-get-started-with-pyproject-toml/4906">Where to get started with pyproject.toml?</a></p>
<p>The <a href="https://setuptools.readthedocs.io/en/latest/build_meta.html">relevant setuptools docs</a> have been updated to reflect the new toolchain, too!</p>
<p>My takeaway from all of this is:</p>
<ul>
<li>configuration files are better than scripts</li>
<li>a few standard configuration files are better than many</li>
<li>declarative/static is better than procedural/dynamic</li>
</ul>
<p>OK! <em>rubs hands with glee</em> let's do this!</p>
<h2>"But what do I actually <em>do</em>?"</h2>
<p>Brett's post is pretty excellent and was really informative for me, but I have a high tolerance for reading lots of text! It's probably a bit long for people who just want to update their project, though ;).</p>
<p>So I decided to give it a try myself and then post an example!</p>
<p>Recently, I wanted to release a new version of <a href="https://github.com/dib-lab/screed/">screed</a>, in order to get rid of some DeprecationWarnings for the release of <a href="https://github.com/dib-lab/sourmash/">sourmash 4.0</a>. Now, screed is a remarkably ...stable project, by which I mean it does the thing we need it to do and no more, and we're not changing it at all.</p>
<p>BUT. Screed was based on an old school setup.py. So, inspired by Luiz Irber's updating of sourmash to use pyproject.toml, I updated screed similarly. (It was REALLY helpful to have an example!)</p>
<h2>tl;dr</h2>
<p><a href="https://github.com/dib-lab/screed/pull/83/files">Here is the diff.</a></p>
<p>In brief,</p>
<ul>
<li>your <code>pyproject.toml</code> can be very close to boilerplate.<ul>
<li>It's basically the three lines that Brett posted...</li>
<li>the additional stuff for screed has to do with <a href="https://pypi.org/project/setuptools-scm/">setuptools_scm</a>, which we're using to automatically convert git tags like <code>v1.0.4</code> into actual version numbers.</li>
</ul>
</li>
<li>your <code>setup.cfg</code> basically contains almost everything your <code>setup.py</code> contained, just a bit reformatted to fit into the <a href="https://docs.python.org/3/distutils/configfile.html">setup.cfg format</a>.</li>
<li>your new <code>setup.py</code> can now be a really short stub to permit <code>python setup.py ...</code> to continue to work (see the sketch just after this list).</li>
</ul>
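<p>For reference, that stub can be as short as this minimal sketch:</p>
<div class="highlight"><pre><span></span><code>#!/usr/bin/env python
# minimal stub: the real metadata lives in setup.cfg/pyproject.toml;
# this file exists only so that 'python setup.py ...' keeps working.
from setuptools import setup

setup()
</code></pre></div>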
<p>I hope this helps! Comment and ask questions as you have them!</p>
<p>(Also: I just released <a href="https://github.com/dib-lab/screed/releases/tag/v1.0.5">screed v1.0.5!</a> :tada:)</p>
<p>--titus</p>A snakemake hack for checkpoints2021-01-25T00:00:00+01:002021-01-25T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2021-01-25:/blog/2021-snakemake-checkpoints.html<p>snakemake checkpoints r awesome</p><p>As I get deeper and deeper into using the excellent snakemake workflow system for ...everything, I have had to learn how to use checkpoints. I ended up hacking together an approach that made checkpoints easy for me, and now I'm caught between being proud of it and wondering if it's Actually Bad. So I thought I'd share and see what y'all thought.</p>
<p>(Thanks to Taylor Reiter, Tessa Pierce, and Luiz Irber for their comments on early drafts of this blog post!)</p>
<h2>What are checkpoints for?</h2>
<p>By default, snakemake figures out what to run based on the rules in the Snakefile and whatever files are present in the working space. It implements this using a simple but incredibly powerful pattern matching technique that is executed at the very beginning of the run.</p>
<p>The one big problem with doing everything at the beginning of the run is that if you don't know exactly which files are going to be produced by a particular step, you can't write a regular rule to depend on them.</p>
<p>For example, suppose you want to run a BLAST search against a query sequence, and then for each BLAST match you want to download the matching sequence and do more analysis. snakemake could handle doing the BLAST easily enough, but the rule that downloads matching sequences would have somewhere between 0 and N outputs. How many wouldn't be known until the BLAST was done!</p>
<p>There are a few different approaches you can use --</p>
<ul>
<li>simply having multiple workflows (I did this for a while :)</li>
<li><a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#dynamic-files">snakemake dynamic</a></li>
<li><a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#data-dependent-conditional-execution">snakemake checkpoints</a></li>
</ul>
<p>This blog post is about the last option, which is (apparently) the stable approach, going forward!</p>
<h2>What do checkpoints do?</h2>
<p>Briefly put, checkpoints trigger a re-evaluation of the Snakefile rules in light of new information.</p>
<p>For each checkpoint, snakemake looks at the rules that depend on the checkpoint's output, and holds off on evaluating their inputs until the checkpoint has run. Once it completes, those rules are evaluated and new jobs are entered into snakemake's TODO list.</p>
<h2>How do checkpoints work, under the hood?</h2>
<p>It took me a surprisingly large amount of time to figure out these details, so I'm going to share in case others are in a similar boat :).</p>
<p>There is a <code>checkpoints</code> namespace.</p>
<p>When you create a checkpoint, it is entered into this namespace.</p>
<p>When another rule's input <em>refers</em> to a checkpoint to get its outputs, by calling <code>checkpoints.<name>.get(...)</code>, snakemake raises an exception. This exception tells snakemake to defer evaluation of the checkpoint's outputs until the checkpoint has actually run; snakemake tracks the calling rule and waits.</p>
<p>Once the checkpoint is executed, the output becomes available and the rules that depend on it are re-evaluated.</p>
<h2>An example of the syntax</h2>
<p>The syntax is straightforward - you define checkpoints
the same way you do rules, and then you refer to the
checkpoint in <a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-input-functions">an input function</a>.</p>
<div class="highlight"><pre><span></span><code><span class="n">checkpoint</span> <span class="n">a</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span> <span class="n">touch</span><span class="p">(</span><span class="s2">"a.out"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">input_for_b</span><span class="p">(</span><span class="o">*</span><span class="n">wildcards</span><span class="p">):</span>
<span class="k">return</span> <span class="n">checkpoints</span><span class="o">.</span><span class="n">a</span><span class="o">.</span><span class="n">get</span><span class="p">()</span><span class="o">.</span><span class="n">output</span>
<span class="n">rule</span> <span class="n">b</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">input_for_b</span>
<span class="n">run</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'input is: </span><span class="si">{</span><span class="nb">input</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
</code></pre></div>
<p>and if you run <a href="https://github.com/ctb/2021-snakemake-checkpoints-example/blob/latest/Snakefile.example">rule b in this example</a> like so,</p>
<div class="highlight"><pre><span></span><code><span class="c">% snakemake -j 1 -s Snakefile.example b</span>
</code></pre></div>
<p>you will see <code>input is: a.out</code>.</p>
<p>Note this example is a bit useless, though, because in this case you could make <code>checkpoint a</code> a rule; it doesn't do anything here that requires it to be a checkpoint. Specifically, the output of rule <code>a</code> and input of rule <code>b</code> are both known.</p>
<p>Nonetheless, I think it serves as a useful example of the syntax:</p>
<ul>
<li>the output of the checkpoint must be something that fits into the snakemake rules - a filename or a wildcard pattern or something specific.</li>
<li>the rules that depend on this checkpoint need to use a <a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#snakefiles-input-functions">function as an input</a>, so that snakemake can <em>try</em> to run it and generate the exception that lets it know this depends on a checkpoint.</li>
<li>the input function must take a list of potential wildcards, even if there are no wildcards and/or the wildcards aren't used.</li>
</ul>
<h2>A real example: making a spreadsheet dynamically, and then using that spreadsheet</h2>
<p><a href="https://github.com/ctb/2021-snakemake-checkpoints-example/blob/latest/Snakefile.random">Here is an example Snakefile</a> that is closer to how I use checkpoints in real Snakefiles.</p>
<p>Briefly, a rule <code>make_spreadsheet</code> builds a spreadsheet with some filenames in it (here, the entries are random, but it could be doing something useful, like running BLAST).</p>
<p>Then, I define a checkpoint that waits for that file to be created, and ...does nothing.</p>
<p>Last, I define a rule that depends on that checkpoint. This rule reads in all the names from the spreadsheet and then builds a list of output filenames, <code>output-{name}.txt</code>, where <code>{name}</code> is taken from the spreadsheet.</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">make_all_files</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">Checkpoint_MakePattern</span><span class="p">(</span><span class="s2">"output-</span><span class="si">{name}</span><span class="s2">.txt"</span><span class="p">)</span>
</code></pre></div>
<p>The individual files are created by another rule; <code>make_all_files</code> just has the responsibility of laying out the list of files to be created - a list built by the <code>Checkpoint_MakePattern</code> class, discussed a few paragraphs below.</p>
<p>The interesting thing here is that the checkpoint doesn't really do anything; it just requires that the <code>names.csv</code> file exist (triggering the correct upstream rule), and it touches a file (because, as it turns out, checkpoints <em>must</em> have an output.)</p>
<div class="highlight"><pre><span></span><code><span class="c1"># second rule, a checkpoint for rules that depend on contents of "count.csv"</span>
<span class="n">checkpoint</span> <span class="n">check_csv</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span> <span class="s2">"names.csv"</span>
<span class="n">output</span><span class="p">:</span> <span class="c1"># checkpoints _must_ have output.</span>
<span class="n">touch</span><span class="p">(</span><span class="s2">".make_spreadsheet.touch"</span><span class="p">)</span>
</code></pre></div>
<p>The "magic" here is in the <code>Checkpoint_MakePattern</code> class, which I defined. This class takes in and saves a pattern:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">Checkpoint_MakePattern</span><span class="p">:</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">pattern</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">pattern</span> <span class="o">=</span> <span class="n">pattern</span>
</code></pre></div>
<p>and then, when called as part of the input function in <code>make_all_files</code>, it (a) waits for the checkpoint, (b) gets the names from the CSV file (<code>get_names()</code> call), and (c) expands the pattern with the names from the CSV file:</p>
<div class="highlight"><pre><span></span><code> <span class="k">def</span> <span class="fm">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">w</span><span class="p">):</span>
<span class="k">global</span> <span class="n">checkpoints</span>
<span class="c1"># wait for the results of 'check_csv'; this will trigger an</span>
<span class="c1"># exception until that rule has been run.</span>
<span class="n">checkpoints</span><span class="o">.</span><span class="n">check_csv</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="o">**</span><span class="n">w</span><span class="p">)</span>
<span class="c1"># the magic, such as it is, happens here: we create the</span>
<span class="c1"># information used to expand the pattern, using arbitrary</span>
<span class="c1"># Python code.</span>
<span class="n">names</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">get_names</span><span class="p">()</span>
<span class="n">pattern</span> <span class="o">=</span> <span class="n">expand</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">pattern</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="n">names</span><span class="p">,</span> <span class="o">**</span><span class="n">w</span><span class="p">)</span>
<span class="k">return</span> <span class="n">pattern</span>
</code></pre></div>
<p>The only application-specific bit of code is in <code>get_names()</code>, which reads in the CSV:</p>
<div class="highlight"><pre><span></span><code> <span class="k">def</span> <span class="nf">get_names</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'names.csv'</span><span class="p">,</span> <span class="s1">'rt'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fp</span><span class="p">:</span>
<span class="n">names</span> <span class="o">=</span> <span class="p">[</span> <span class="n">x</span><span class="o">.</span><span class="n">rstrip</span><span class="p">()</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">fp</span> <span class="p">]</span>
<span class="k">return</span> <span class="n">names</span>
</code></pre></div>
<p>This function can do pretty much anything it needs to do, and could (in cases where a bunch of output files are created) be replaced with snakemake's <a href="https://snakemake.readthedocs.io/en/stable/project_info/faq.html#how-do-i-run-my-rule-on-all-files-of-a-certain-directory"><code>glob_wildcards</code> function</a>.</p>
<h2>Another example: taking a count from a file.</h2>
<p><a href="https://github.com/ctb/2021-snakemake-checkpoints-example/blob/latest/Snakefile.count">Here is another Snakefile</a> that outputs <code>h+2</code> (where h is the current hour of the day) to a file <code>count.txt</code>.
The number in <code>count.txt</code> is then used to create files named "output-1.txt" through "output-{n}.txt".</p>
<p>Clearly snakemake's runtime analyzer can't know how many files are going to be output up front, so the Snakefile uses a checkpoint to read in the hour from <code>count.txt</code>, and then uses <code>expand</code> to generate the output file patterns:</p>
<div class="highlight"><pre><span></span><code><span class="n">rule</span> <span class="n">make_file</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span>
<span class="s2">"output-</span><span class="si">{n}</span><span class="s2">.txt"</span>
<span class="n">shell</span><span class="p">:</span>
<span class="s2">"echo hello, world > </span><span class="si">{output}</span><span class="s2">"</span>
<span class="n">rule</span> <span class="n">make_all_files</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">Checkpoint_MakePattern</span><span class="p">(</span><span class="s2">"output-</span><span class="si">{n}</span><span class="s2">.txt"</span><span class="p">)</span>
</code></pre></div>
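<p>As in the earlier example, the application-specific bit lives in <code>get_names()</code>, which for this Snakefile might look something like the following sketch (hypothetical; the linked Snakefile has the real details):</p>
<div class="highlight"><pre><span></span><code>    def get_names(self):
        # read n from count.txt and generate the names 1..n, so that
        # the pattern expands to output-1.txt ... output-n.txt
        with open('count.txt', 'rt') as fp:
            n = int(fp.read().strip())
        return [str(i) for i in range(1, n + 1)]
</code></pre></div>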
<h2>A third example - reimplementing <code>dynamic</code></h2>
<p>Luiz made an interesting comment when he read a draft of this blog post: he pointed out that this gets pretty close to the <a href="https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#dynamic-files"><code>dynamic</code> behavior</a>. So I thought I'd (try) to reimplement that!</p>
<p>The result is <a href="https://github.com/ctb/2021-snakemake-checkpoints-example/blob/latest/Snakefile.dynamic">here, in Snakefile.dynamic</a>.</p>
<p>The <code>make_files</code> rule makes a bunch of files (mimicking clustering output, for example). Then the <code>Checkpoint_MakePattern</code> class uses <code>glob_wildcards</code> to figure out what files there are and extract wildcards, which it uses to fill in the pattern:</p>
<div class="highlight"><pre><span></span><code> <span class="k">def</span> <span class="fm">__call__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">w</span><span class="p">):</span>
<span class="k">global</span> <span class="n">checkpoints</span>
<span class="c1"># wait for the results of 'check_csv'; this will trigger an</span>
<span class="c1"># exception until that rule has been run.</span>
<span class="n">checkpoints</span><span class="o">.</span><span class="n">make_files</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="o">**</span><span class="n">w</span><span class="p">)</span>
<span class="c1"># use glob_wildcards to find the (as-yet-unknown) new files.</span>
<span class="n">names</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s1">'output-</span><span class="si">{rs}</span><span class="s1">.txt'</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">pattern</span> <span class="o">=</span> <span class="n">expand</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">pattern</span><span class="p">,</span> <span class="n">name</span><span class="o">=</span><span class="n">names</span><span class="p">,</span> <span class="o">**</span><span class="n">w</span><span class="p">)</span>
<span class="k">return</span> <span class="n">pattern</span>
</code></pre></div>
<p>For example, this rule transforms all of the <code>output-{random}.txt</code> files into <code>output-{random}.round2</code> names:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># final rule that depends on that checkpoint and transforms</span>
<span class="c1"># dynamically created files into something else.</span>
<span class="n">rule</span> <span class="n">make_patterns</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s1">'.make_rs_files.touch'</span><span class="p">,</span>
<span class="n">Checkpoint_MakePattern</span><span class="p">(</span><span class="s1">'output-</span><span class="si">{name}</span><span class="s1">.round2'</span><span class="p">)</span>
</code></pre></div>
<p>A bonus feature is that you can easily compute a summary across all the files like so:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># bonus rule that does something with all the files</span>
<span class="n">rule</span> <span class="n">make_summary</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="s1">'.make_rs_files.touch'</span><span class="p">,</span>
<span class="n">files</span><span class="o">=</span><span class="n">Checkpoint_MakePattern</span><span class="p">(</span><span class="s1">'output-</span><span class="si">{name}</span><span class="s1">.txt'</span><span class="p">)</span>
<span class="n">output</span><span class="p">:</span>
<span class="s1">'output-random.summary'</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> cat </span><span class="si">{input.files}</span><span class="s2"> > </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<h2>This works, but is it a good way to do things?</h2>
<p>The <code>Checkpoint_MakePattern</code> code that I used above gave me a simple way to make use of checkpoints. I largely ignored the internal snakemake mechanism for passing around information that is laid out in the docs and in (e.g.) <a href="https://evodify.com/snakemake-checkpoint-tutorial/">this very useful blog post</a>.</p>
<p>I just write Python code that (a) triggers the checkpoint exception and then (b) Does Something in pure Python to spit out a list of files to be created.</p>
<p>I've used essentially this same code a few times now, and I like it a lot! But I would love feedback as to whether I'm doing something unnatural here :), or if I'm missing something that's really much simpler. Feedback welcome!</p>
<p>--titus</p>Improved workflows-as-applications: tips and tricks for building applications on top of snakemake2020-08-06T00:00:00+02:002020-08-06T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2020-08-06:/blog/2020-improved-workflows-as-applications.html<p>Writing applications around workflow systems, take 2.</p><p>(Thanks to Camille Scott, Phillip Brooks, Charles Reid, Luiz Irber, Tessa Pierce, and Taylor Reiter for all their efforts over the years! Thanks also to Silas Kieser for his work on <a href="https://github.com/metagenome-atlas/atlas">ATLAS</a>, which gave us inspiration and some working code :).)</p>
<p>A while back, <a href="http://ivory.idyll.org/blog/2018-workflows-applications.html">I wrote about workflows as applications</a>, in which I talked about how Camille Scott had written dammit (link below) in the pydoit workflow system, and released it as an application. In doing so, Camille made a fundamental observation: many bioinformatics tools are wrappers that run other bioinformatics tools, and that is literally what workflow tools are designed to do!</p>
<p>Since that post, we've doubled down on workflow systems, improved and adapted our <a href="http://ivory.idyll.org/blog/2020-software-and-workflow-dev-practices.html">good enough in-lab practices for software and workflow development</a>, and written <a href="https://dib-lab.github.io/2020-workflows-paper/">a paper on workflow systems</a> - <a href="https://www.biorxiv.org/content/10.1101/2020.06.30.178673v1">(also on bioRxiv)</a>.</p>
<p>Projects that we write this way end up consisting of large collections of interrelated Python scripts (more on how we manage that later - see <a href="https://github.com/spacegraphcats/spacegraphcats/tree/2e82cc46cd25e71a1158d641760a11fdb940583d/spacegraphcats/search">e.g. the spacegraphcats.search package for an example</a>). This strategy also allows integration of multiple different languages under a single umbrella, including (potentially) R scripts and bash scripts and... whatever else you want :). </p>
<p>As part of this effort, we've developed much improved practices around better (more functional) user experiences with our software. In this blog post, I'm going to talk about some of these - read on for details!</p>
<p>This post extracts experience from the following in-lab projects: </p>
<ul>
<li><a href="https://github.com/dib-lab/dammit">the dammit transcriptome annotator</a>,</li>
<li><a href="https://github.com/dahak-metagenomics/dahak">the dahak metagenomics pipeline</a>,</li>
<li><a href="https://github.com/spacegraphcats/spacegraphcats">the spacegraphcats metagenomics graph query software</a>,</li>
<li><a href="https://github.com/dib-lab/elvers">the elvers de novo transcriptome pipeline</a>,</li>
<li>and <a href="https://github.com/dib-lab/charcoal">the charcoal genome decontamination pipeline</a>.</li>
</ul>
<h2>Some background: how do we build applications on top of snakemake?</h2>
<p>We've done this quite a few times now, and there are 3 parts to the pattern:</p>
<p>first, we build a <a href="https://github.com/spacegraphcats/spacegraphcats/blob/2e82cc46cd25e71a1158d641760a11fdb940583d/spacegraphcats/conf/Snakefile">Snakefile</a> that does the things we want to do, and stuff it into a Python package.</p>
<p>second, we create a Python entry point (<a href="https://github.com/spacegraphcats/spacegraphcats/blob/2e82cc46cd25e71a1158d641760a11fdb940583d/spacegraphcats/__main__.py">see <code>__main__</code> in spacegraphcats</a>) that calls snakemake - in <a href="https://github.com/spacegraphcats/spacegraphcats/blob/2e82cc46cd25e71a1158d641760a11fdb940583d/spacegraphcats/__main__.py#L128">this case</a> it does it by calling the Python API (but see below for better options).</p>
<p>third, in that entry point we <a href="https://github.com/spacegraphcats/spacegraphcats/blob/2e82cc46cd25e71a1158d641760a11fdb940583d/spacegraphcats/__main__.py">load config files, salt in our own overrides, and otherwise customize the snakemake session</a>.</p>
<p>and voila, now when you call that entry point, you run a custom-configured snakemake that runs whatever workflows are needed to create the specified targets! See for example <a href="https://github.com/spacegraphcats/spacegraphcats/blob/master/doc/running-spacegraphcats.md#running-spacegraphcats-search--output-files">the docs on running spacegraphcats</a>.</p>
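<p>Putting those three parts together, a minimal (hypothetical) entry point looks something like this - here calling the snakemake executable via subprocess, which is one of the "better options" discussed below:</p>
<div class="highlight"><pre><span></span><code>#!/usr/bin/env python
"""Hypothetical minimal __main__.py wrapping a packaged Snakefile."""
import os
import subprocess
import sys


def main(args):
    # find the Snakefile that ships inside this package
    thisdir = os.path.dirname(__file__)
    snakefile = os.path.join(thisdir, "conf", "Snakefile")

    # hand the user's targets and options off to snakemake
    cmd = ["snakemake", "-s", snakefile, "--use-conda"] + list(args)
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
</code></pre></div>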
<h2>Problems that we've run into, and their solutions.</h2>
<p>The strategy above works great in general, but there are a few annoying problems that have popped up over time.</p>
<ul>
<li>we want more flexible config than is provided by a single config file.</li>
<li>we want to distribute jobs from our application across clusters.</li>
<li>we don't want to have to manually implement all of snakemake's (many) command line options and functionality.</li>
<li>we want to support better testing!</li>
<li>we want to run our applications from within Snakemake workflows.</li>
</ul>
<p>So, over time, we've come up with the following solutions. Read on!</p>
<h3>Stacking config files</h3>
<p>One thing we've been doing for a while is providing configuration options via a YAML file (see e.g. <a href="https://github.com/spacegraphcats/spacegraphcats/blob/2e82cc46cd25e71a1158d641760a11fdb940583d/spacegraphcats/conf/twofoo.yaml">spacegraphcats config files</a>). But once you've got more than a few config files, you end up with a whole host of options in common and only a few config parameters that you change for each run.</p>
<p>With our newer project, charcoal, I decided to try out stacking config files, so that there's an <strong>installation-wide</strong> set of defaults and config parameters, as well as a <strong>project-specific</strong> config.</p>
<p>This makes it possible to have sensible defaults that can be overridden easily on a per-project basis.</p>
<p>The way this works with snakemake is that you supply one or more JSON or YAML files <a href="https://github.com/dib-lab/charcoal/blob/dfc18387a7f88abb77941a5c0528b924bc43b237/charcoal/conf/system.conf">like this</a> to snakemake. Snakemake then loads them all in order and supplies the parameters <a href="https://github.com/dib-lab/charcoal/blob/dfc18387a7f88abb77941a5c0528b924bc43b237/charcoal/Snakefile">in the Snakefile namespace via the <code>config</code> variable</a>.</p>
<p>The Python code to do this via the wrapper command-line is <a href="https://github.com/dib-lab/charcoal/blob/dfc18387a7f88abb77941a5c0528b924bc43b237/charcoal/__main__.py#L40">pretty straightforward - you make a list of all the config files and supply that to <code>subprocess</code>!</a></p>
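<p>In sketch form (paths are hypothetical; snakemake applies <code>--configfile</code> arguments in order, so later files override earlier ones):</p>
<div class="highlight"><pre><span></span><code>import os
import subprocess


def run_snakemake(project_config, extra_args=()):
    # installation-wide defaults first, then the project config;
    # later config files override earlier ones.
    thisdir = os.path.dirname(__file__)
    system_config = os.path.join(thisdir, "conf", "system.conf")

    cmd = ["snakemake", "-s", os.path.join(thisdir, "Snakefile"),
           "--configfile", system_config, project_config]
    cmd.extend(extra_args)
    return subprocess.run(cmd).returncode
</code></pre></div>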
<h3>Supporting snakemake job management on clusters</h3>
<p>Snakemake conveniently supports <a href="https://snakemake.readthedocs.io/en/v5.1.4/executable.html#cluster-execution">cluster execution</a>, where you can distribute jobs across HPC clusters.</p>
<p>With both spacegraphcats and elvers, we couldn't get this to work at first. This is because we were <a href="https://github.com/spacegraphcats/spacegraphcats/blob/2e82cc46cd25e71a1158d641760a11fdb940583d/spacegraphcats/__main__.py#L128">calling snakemake via its Python API</a>, while the cluster execution engine wanted to call snakemake at the command line and couldn't figure out how to do that properly in our application setup.</p>
<p>The <a href="https://github.com/metagenome-atlas/atlas">ATLAS</a> folk had figured this out, though: ATLAS uses subprocess to run the snakemake executable, and when I was writing charcoal, I <a href="https://github.com/dib-lab/charcoal/blob/dfc18387a7f88abb77941a5c0528b924bc43b237/charcoal/__main__.py#L51">tried doing that instead</a>. It works great, and is surprisingly much easier than using the Python API!</p>
<p>So, now our applications can take full advantage of snakemake's underlying cluster distribution functionality!</p>
<h3>Supporting snakemake's (many) parameters</h3>
<p>With spacegraphcats, the first application we built on snakemake, we implemented a kind of janky parameter passing thing where we <a href="https://github.com/spacegraphcats/spacegraphcats/blob/2e82cc46cd25e71a1158d641760a11fdb940583d/spacegraphcats/__main__.py#L45">just mapped our own parameters over to snakemake parameters explicitly</a>.</p>
<p>However, snakemake has <em>tons</em> of command line arguments that do useful things, and it's really annoying to reimplement them all. So in charcoal, <a href="https://github.com/dib-lab/charcoal/blob/dfc18387a7f88abb77941a5c0528b924bc43b237/charcoal/__main__.py#L67">we switched from argparse to click for argument parsing</a>, and simply pass all "extra" arguments on to snakemake.</p>
<p>This occasionally leads to weird logic like <a href="https://github.com/dib-lab/charcoal/blob/dfc18387a7f88abb77941a5c0528b924bc43b237/charcoal/__main__.py#L29">the code needed to support<code>--no-use-conda</code></a>, where we by default pass <code>--use-conda</code> to snakemake, and then have to override that to turn it off. But by and large it's worked out quite smoothly.</p>
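<p>Here's roughly what that pattern looks like with click - a hypothetical sketch, not charcoal's actual code:</p>
<div class="highlight"><pre><span></span><code>import subprocess
import sys

import click


@click.command(context_settings={"ignore_unknown_options": True})
@click.option("--no-use-conda", is_flag=True, default=False)
@click.argument("snakemake_args", nargs=-1, type=click.UNPROCESSED)
def run(no_use_conda, snakemake_args):
    # anything click doesn't recognize is passed straight to snakemake
    cmd = ["snakemake", "-s", "Snakefile"]
    if not no_use_conda:
        cmd.append("--use-conda")   # on by default; --no-use-conda disables
    cmd.extend(snakemake_args)
    sys.exit(subprocess.run(cmd).returncode)


if __name__ == "__main__":
    run()
</code></pre></div>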
<h3>A drop-in module for a command-line API</h3>
<p>As we build more applications this way, we're starting to recognize commonalities in the use cases. Recently I wanted to upgrade the spacegraphcats CLI to take advantage of lessons learned, and so I <a href="https://github.com/dib-lab/charcoal/blob/dfc18387a7f88abb77941a5c0528b924bc43b237/charcoal/__main__.py">copied the charcoal __main__.py</a> over to <a href="https://github.com/spacegraphcats/spacegraphcats/blob/d049876a2f4c452fe9ea42a0db70b7c2f3b6112d/spacegraphcats/click.py">spacegraphcats.click</a> and started editing it. Somewhat to my surprise, it was really easy to adapt to spacegraphcats - like, 15 minutes easy!</p>
<p>So, we're pretty close to having a "standard" entry point module that we can copy between projects and quickly customize.</p>
<h3>Testing, testing, testing!</h3>
<p>We get a lot of value from writing automated functional and integration tests for our command-line apps; they help pin down functionality and make sure it's still working over time.</p>
<p>However, with spacegraphcats, I really struggled to write good tests. It's hard to test the whole workflow when you have piles of interacting Python scripts in a workflow - e.g. the <a href="https://github.com/spacegraphcats/spacegraphcats/blob/2e82cc46cd25e71a1158d641760a11fdb940583d/spacegraphcats/search/test_workflow.py">workflow tests</a> are terrible: clunky to write and hard to modify.</p>
<p>In contrast, once I had the new command-line API working, I had the tools to make really nice and simple workflow tests that relied on snakemake underneath - see <a href="https://github.com/spacegraphcats/spacegraphcats/blob/d049876a2f4c452fe9ea42a0db70b7c2f3b6112d/spacegraphcats/test_snakemake.py">test_snakemake.py</a>. Now our tests look like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">def</span><span class="w"> </span><span class="n">test_dory_build_cdbg</span><span class="p">()</span><span class="err">:</span>
<span class="w"> </span><span class="k">global</span><span class="w"> </span><span class="n">_tempdir</span>
<span class="w"> </span><span class="n">dory_conf</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">utils</span><span class="p">.</span><span class="n">relative_file</span><span class="p">(</span><span class="s1">'spacegraphcats/conf/dory-test.yaml'</span><span class="p">)</span>
<span class="w"> </span><span class="n">target</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'dory/bcalm.dory.k21.unitigs.fa'</span>
<span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">run_snakemake</span><span class="p">(</span><span class="n">dory_conf</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="o">=</span><span class="k">True</span><span class="p">,</span><span class="w"> </span><span class="n">outdir</span><span class="o">=</span><span class="n">_tempdir</span><span class="p">,</span>
<span class="w"> </span><span class="n">extra_args</span><span class="o">=[</span><span class="n">target</span><span class="o">]</span><span class="p">)</span>
<span class="w"> </span><span class="n">assert</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="mi">0</span>
<span class="w"> </span><span class="n">assert</span><span class="w"> </span><span class="n">os</span><span class="p">.</span><span class="k">path</span><span class="p">.</span><span class="ow">exists</span><span class="p">(</span><span class="n">os</span><span class="p">.</span><span class="k">path</span><span class="p">.</span><span class="k">join</span><span class="p">(</span><span class="n">_tempdir</span><span class="p">,</span><span class="w"> </span><span class="n">target</span><span class="p">))</span>
</code></pre></div>
<p>which is about as simple as you can get - specify config file and a target, run snakemake, check that the file exists.</p>
<p>The one tricky bit in <a href="https://github.com/spacegraphcats/spacegraphcats/blob/d049876a2f4c452fe9ea42a0db70b7c2f3b6112d/spacegraphcats/test_snakemake.py">test_snakemake.py</a> is that the tests should be run in a particular order, because they build on each other. (You can actually run them in any order you want, because snakemake will create the files as needed, but it makes the test steps take longer.)</p>
<p>I ended up using <a href="https://github.com/spacegraphcats/spacegraphcats/blob/d049876a2f4c452fe9ea42a0db70b7c2f3b6112d/spacegraphcats/test_snakemake.py">pytest-dependency</a> to recapitulate which steps in the workflow depended on each other, and now I have a fairly nice granular breakdown of tests, and they seem to work well.</p>
<p>(I'm still stuck on how to ensure that the outputs of the tests have the correct content, but that's a problem for another day :).)</p>
<h3>Using workflows inside of workflows</h3>
<p>Last but not least, we tend to want to run our applications <em>within</em> workflows. This is true even when our applications <em>are</em> workflows :).</p>
<p>However, we ran into a little bit of a problem with paths. Because snakemake relies heavily on file system paths, the applications we built on top of snakemake had fairly hardcoded outputs. For example, spacegraphcats produces lots of directories like <code>genome_name</code>, <code>genome_name_k31_r1</code>, <code>genome_name_k31_r1_search</code>, etc. that have to be in the working directory. This turns into an ugly mess for any reasonably complicated workflow.</p>
<p>So, we took advantage of <a href="https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#configure-working-directory">snakemake's <code>workdir:</code> parameter</a> to provide a command-line feature in our applications that would stuff all of the outputs in a particular directory.</p>
<p>This, however, meant some <em>input</em> locations needed to be adjusted to absolute rather than relative paths. Snakemake handled this automatically for filenames specified in the Snakefile, but for paths loaded from config files, <a href="https://github.com/spacegraphcats/spacegraphcats/blob/2e82cc46cd25e71a1158d641760a11fdb940583d/spacegraphcats/conf/Snakefile#L19">we had to do it manually</a>. This turned out to be quite easy and works robustly!</p>
<p>You can see an example of this usage <a href="https://github.com/dib-lab/2020-ibd/blob/50404a7cfcd41ce1ed809ba8aaddb1939e279e8e/Snakefile#L941">here</a>. The <code>--outdir</code> parameter tells spacegraphcats to just put everything under a particular location.</p>
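<p>A hypothetical Snakefile fragment showing the trick - absolutize the config-supplied inputs <em>before</em> <code>workdir:</code> takes effect:</p>
<div class="highlight"><pre><span></span><code>import os

# input paths come from the user's config file and may be relative to
# wherever they ran the command - make them absolute *before* workdir:
# redirects all relative paths. (config keys here are hypothetical.)
config['input_sequences'] = os.path.abspath(config['input_sequences'])

# everything the workflow writes now lands under the output directory
workdir: config.get('outdir', 'outputs')
</code></pre></div>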
<h2>Concluding thoughts</h2>
<p>I've been pleasantly surprised at how easy it has been to build applications on top of snakemake. We've accumulated some good experience with this, and have some fairly robust and re-usable code that solves many of our problems. I hope you find it useful!</p>
<p>--titus</p>sourmash databases as zip files, in sourmash v3.3.02020-05-07T00:00:00+02:002020-05-07T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2020-05-07:/blog/2020-sourmash-databases-as-zip-files.html<p>Use compressed databases directly!</p><p>The feature that I'm most excited about in <a href="https://github.com/dib-lab/sourmash/releases/tag/v3.3.0">sourmash 3.3.0</a> is the ability to directly use <em>compressed</em> SBT search databases.</p>
<p>Previously, if you wanted to search (say) 100,000 genomes from GenBank, you'd have to download a several-GB .tar.gz file, and then uncompress it to ~20 GB before searching it. The time and disk space requirements for this were major barriers for teaching and use.</p>
<p>In v3.3.0, <a href="https://twitter.com/luizirber/">Luiz Irber</a> fixed this by, first, releasing the <a href="https://lib.rs/crates/niffler">niffler</a> Rust library with <a href="https://twitter.com/pierre_marijon">Pierre Marijon</a>, to read and write compressed files; second, replacing our old khmer Bloom filter nodegraph with a Rust implementation (<a href="https://github.com/dib-lab/sourmash/pull/799">sourmash PR #799</a>); and, third, adding direct zip file storage (<a href="https://github.com/dib-lab/sourmash/pull/648">sourmash #648</a>).</p>
<p>So, as of the latest release, you can do the following:</p>
<div class="highlight"><pre><span></span><code><span class="c1"># install sourmash v3.3.0</span>
<span class="n">conda</span><span class="w"> </span><span class="n">create</span><span class="w"> </span><span class="o">-</span><span class="n">y</span><span class="w"> </span><span class="o">-</span><span class="n">n</span><span class="w"> </span><span class="n">sourmash</span><span class="o">-</span><span class="n">demo</span><span class="w"> </span>\
<span class="w"> </span><span class="o">-</span><span class="n">c</span><span class="w"> </span><span class="n">conda</span><span class="o">-</span><span class="n">forge</span><span class="w"> </span><span class="o">-</span><span class="n">c</span><span class="w"> </span><span class="n">bioconda</span><span class="w"> </span><span class="n">sourmash</span><span class="o">=</span><span class="mf">3.3</span><span class="o">.</span><span class="mi">0</span>
<span class="c1"># activate environment</span>
<span class="n">conda</span><span class="w"> </span><span class="n">activate</span><span class="w"> </span><span class="n">sourmash</span><span class="o">-</span><span class="n">demo</span>
<span class="c1"># download the 25k GTDB release89 guide database (~1.4 GB)</span>
<span class="n">curl</span><span class="w"> </span><span class="o">-</span><span class="n">L</span><span class="w"> </span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">osf</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="mi">5</span><span class="n">mb9k</span><span class="o">/</span><span class="n">download</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="n">gtdb</span><span class="o">-</span><span class="n">release89</span><span class="o">-</span><span class="n">k31</span><span class="o">.</span><span class="n">sbt</span><span class="o">.</span><span class="n">zip</span>
<span class="c1"># grab a genome signature - here, download a demo one from OSF</span>
<span class="n">curl</span><span class="w"> </span><span class="o">-</span><span class="n">L</span><span class="w"> </span><span class="n">https</span><span class="p">:</span><span class="o">//</span><span class="n">osf</span><span class="o">.</span><span class="n">io</span><span class="o">/</span><span class="n">vhnk4</span><span class="o">/</span><span class="n">download</span><span class="w"> </span><span class="o">></span><span class="w"> </span><span class="n">genome</span><span class="o">.</span><span class="n">sig</span>
<span class="c1"># search!</span>
<span class="n">sourmash</span><span class="w"> </span><span class="n">search</span><span class="w"> </span><span class="n">genome</span><span class="o">.</span><span class="n">sig</span><span class="w"> </span><span class="n">gtdb</span><span class="o">-</span><span class="n">release89</span><span class="o">-</span><span class="n">k31</span><span class="o">.</span><span class="n">sbt</span><span class="o">.</span><span class="n">zip</span>
</code></pre></div>
<p>This takes less than 2 GB of disk space total (including conda env), and the search runs in about 3 seconds and 120 MB of RAM.</p>
<p>Using the zip file stuff alone is a slight speed drag (~10-20%?), but the shift to Rust <a href="https://twitter.com/ctitusbrown/status/1257419632572538882">leads to an overall speed increase of about 4x</a>. And you can always unpack the zip file and use the unpacked files directly.</p>
<p>Yay!</p>
<h2>New database releases are coming!</h2>
<p>Over the next few months, we plan to release all our SBT databases as zip files!</p>
<p>As usual, per our semantic versioning guidelines, you'll need sourmash v3.3 or later to use the zip files. However, old databases will continue to work for all sourmash v3.x, and probably v4.x as well (and maybe beyond :).</p>
<p>--titus</p>Software and workflow development practices (April 2020 update)2020-04-20T00:00:00+02:002020-04-20T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2020-04-20:/blog/2020-software-and-workflow-dev-practices.html<p>How we develop software and workflows in the DIB Lab, in 2020.</p><p>Over the last 10-15 years, I've blogged periodically about how my lab develops research software and build scientific workflows. The <a href="http://ivory.idyll.org/blog/2018-repeatability-in-practice.html">last update</a> talked a bit about how we've transitioned to snakemake and conda for automation, but I was spurred by an e-mail conversation into another update - because, y'all, it's going pretty well and I'm pretty happy!</p>
<p>Below, I talk through our current practice of building workflows and software.
These procedures work pretty well for our (fairly small) lab of people who mostly work part-time on workflow and software development. <strong>By far</strong> the majority of our effort is usually spent trying to understand the <strong>results</strong> of our workflows; except in rare cases, I try to guide people to spend at most 20% of their time writing new analysis code - preferably less.</p>
<p>Nothing about these processes ensures that the scientific output is correct or useful, of course. While scientific correctness of computational workflows necessarily depends (often critically) on the correctness of the code underlying those workflows, the code could ultimately be doing the wrong thing scientifically. That having been said, I've found that the processes below let us focus much more cleanly on the scientific value of the code because we don't worry as much about whether the code is correct, and moreover our processes support rapid iteration of software and workflows as we iteratively develop our use cases.</p>
<p>As one side note, I should say that the complexity of the scientific process is one thing that distinguishes research computing from other software engineering projects. <strong>Often we don't actually have a good idea of what we're trying to achieve</strong>, at least not at any level of specificity. This is a recipe for disaster in a software engineering project, but it's our day-to-day life in science! What ...fun? (I mean, it kind of is. But it's also hellishly complicated.)</p>
<h2>Workflows and scripts</h2>
<p>Pretty much every scientific computing project I've worked on in the last (counts on fingers and toes... runs out of toes... 27 years!? eek) has grown into a gigantic mess of scripts and data files. Over the (many) years I've progressively worked on taming these messes using a variety of techniques.</p>
<p>Phillip Brooks, Charles Reid, Tessa Pierce, and Taylor Reiter have been the source of a lot of the workflow approaches I discuss below, although everyone in the lab has been involved in the discussions!</p>
<h3>Store code and configuration in version control</h3>
<p>Since I "grew up" simultaneously in science and open source, I started using version control early on - first RCS, then CVS, then darcs, then Subversion, and finally git. Version control is second nature, and it applies to science too!</p>
<p>The first basic rule of scientific projects is, <strong>put it in git.</strong></p>
<p>This means that I can (almost) always figure out what I was doing a month ago when I got that neat result that I haven't been able to replicate again. More importantly I can see <em>exactly</em> what I changed in the last hour, and either fix it or revert to what last worked.</p>
<p>Over almost 30 years of sciencing, project naming becomes a problem! Especially since I tend to start projects small and grow them (or let them die on the vine if my focus shifts). So my repo names usually start with the year, followed by a few keywords -- e.g. <a href="https://github.com/ctb/2020-long-read-assembly-decontam">2020-long-read-assembly-decontam</a>. While I can't predict which code I'll go back to, I always end up going back to some of it!</p>
<h3>Write scripts using a language that encourages modularity and code sharing</h3>
<p>I've developed scientific workflows in C, bash, Perl, Tcl, Java, and Python. By far my favorite language of these is Python. The main reason I switched wholeheartedly to Python is that, more than any of the others, Python had a nice blend of modularity and reusability. I could quickly pick up a blob of useful code from one script and put it in a shared module for other scripts to use. And it even had its own simple namespace scheme, which encouraged modularity by default!</p>
<p>At the time (late '90s, early '00s) this kind of namespacing was something that wasn't as well supported by other interpreted languages like Perl (v4?) and Tcl. While I was already a knowledgeable programmer, the ease of sharing code combined with such simple modularity encouraged systematic code reuse in my scripts in a new way. When combined with the straightforward C extension module API, Python was a huge win.</p>
<p>Nowadays there are many good options, of course, but Python is still one of them, so I haven't had to change! My lab now uses an increasing amount of R, because of its dominance in stats and viz. And we're starting to use Rust instead of C/C++ for extension modules.</p>
<h3>Automate scientific workflows</h3>
<p>Every project ends up with a mess of scripts.</p>
<p>When you have a pile of scripts, it's usually not clear how to run them in order. When you're actively developing the scripts, it becomes confusing to remember whether your output files have been updated by the latest code. Enter workflows!</p>
<p>I've been using <code>make</code> to run workflows for ages, but about 2 years ago the entire lab switched over to snakemake. This is in part because it's well integrated with Python, and in part because it supports conda environments. It's been lovely! And we now have a body of shared snakemake expertise in the lab that is hard to beat.</p>
<p>snakemake also works really well for combining my own scripts with other programs, which is of course something that we do a <em>lot</em> in bioinformatics.</p>
<p>There are a few problems with snakemake, of course. It doesn't readily scale to hundreds of thousands of jobs, and we're still working out the best way to orchestrate complex workflows on a cluster. But it's proven <a href="https://github.com/ngs-docs/2020-GGG201b-lab">relatively straightforward to teach</a>, and it's nicely designed, with an awful lot of useful features. I've heard good things about nextflow, and if I were going to operate at larger scales, I'd be looking at CWL or WDL.</p>
<h3>New: Work in isolated execution environments</h3>
<p>One problem that we increasingly encounter is the need to run different incompatible versions of software within the same workflow. Usually this manifests in underlying dependencies -- <strong>this</strong> package needs Python 2 while <strong>this other</strong> package requires Python 3.</p>
<p>Previously, tackling this required ugly heavyweight hacks such as VMs or docker containers. I personally spent a few years negotiating with Python virtualenvs, but they only solved some of the problems, and even then only in Python-land.</p>
<p>Now, we are 100% conda, all the time. In snakemake, we can provide <a href="https://github.com/ctb/2020-long-read-assembly-decontam/blob/master/environment.yml">environment config files</a> for running the basic pipeline, with rule/step-specific <a href="https://github.com/ctb/2020-long-read-assembly-decontam/blob/master/conf/env-sourmash.yml">environment files</a> that rely on pinned (specific) versions of software.</p>
<p>Briefly, with <code>--use-conda</code> on the command line and <code>conda:</code> directives <a href="https://github.com/ctb/2020-long-read-assembly-decontam/blob/master/Snakefile#L26">in the Snakefile</a>, snakemake manages creating and updating these environments for you, and activates/deactivates them on a per-rule basis. It's beautiful and Just Works.</p>
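<p>As a minimal sketch of what this looks like (the rule name, file paths, and the <code>annotate-genome</code> command below are hypothetical, just for illustration):</p>
<div class="highlight"><pre><code># hypothetical rule with its own pinned conda environment.
rule annotate:
    input: "outputs/{sample}.fa"
    output: "outputs/{sample}.gff"
    conda: "conf/env-annotate.yml"    # pinned package versions live here
    shell: "annotate-genome {input} -o {output}"
</code></pre></div>
<p>Run <code>snakemake --use-conda</code> and snakemake builds the environment from <code>conf/env-annotate.yml</code> the first time through, then activates it whenever this rule executes.</p>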
<h3>New: Provide quickstart demonstration data sets.</h3>
<p>(This is a brand new approach to my daily practice, supported by the easy configurability of snakemake!)</p>
<p>The problem is this: often I want to develop and rapidly execute workflows on small test data sets, while also periodically running them on bigger "real" data sets to see what the results look like. It turns out this is hard to stage-manage! Enter ...snakemake config files! These are YAML or JSON files that are automatically loaded into your Snakefile namespace.</p>
<p><strong>Digression:</strong> A year or three ago, <a href="http://ivory.idyll.org/blog/2018-workflows-applications.html">I got excited</a> about using workflows as applications. This was a trend that Camille Scott, a PhD student in the lab, had started with <a href="https://dib-lab.github.io/dammit/">dammit</a>, and we've been using it for <a href="https://github.com/spacegraphcats/spacegraphcats/">spacegraphcats</a> and <a href="https://github.com/dib-lab/elvers">elvers</a>.</p>
<p>The basic idea is this: Increasingly, bioinformatics "applications" are workflows that involve running other software packages. Writing your own scripts that stage-manage other software execution is problematic, since you have to reinvent a lot of error handling that workflow engines already have. This is also true of issues like parallelization and versioning.</p>
<p>So why not write your applications as wrappers around a workflow engine? It turns out with both pydoit and snakemake, you can do this pretty easily! So that's an avenue we've been exploring in a few projects.</p>
<p><strong>Back to the problem to be solved:</strong> What I want for workflows is the following:</p>
<ol>
<li>A workflow that is approximately the same, independent of the input data.</li>
<li>Different sets of input data, ready to go.</li>
<li>In particular, a demo data set (a real data set cut down in size, or synthetic data) that exercises most or all of the features of the workflow.</li>
<li>The ability to switch between input data sets quickly and easily <strong>without</strong> changing any source code.</li>
<li>In a perfect world, I would have the ability to develop and run the same workflow code on both my laptop and in an HPC queuing system.</li>
</ol>
<p>This set of functionality is something that snakemake easily supports with its <code>--configfile</code> option - you specify a <em>default</em> config file <a href="https://github.com/ctb/2020-long-read-assembly-decontam/blob/master/Snakefile#L6">in your Snakefile</a>, and then override that with other config files when you want to run for realz. Moreover, with the rule-specific conda environment files (see previous section!), I don't even need to worry about installing the software; snakemake manages it all for me!</p>
<p>With this approach, my workflow development process becomes very fluid. I prototype scripts on my laptop, where I have a full dev environment, and I develop synthetic data sets to exercise various features of the scripts. I bake this demo data set into <a href="https://github.com/ctb/2020-long-read-assembly-decontam/blob/master/test-data/conf.yml">my default snakemake config</a> so that it's what's run by default. For real analyses, I then override this by specifying <a href="https://github.com/ctb/2020-long-read-assembly-decontam/blob/master/conf/conf-necator.yml">a different config file</a> on the command line with <code>--configfile</code>. And this all interacts perfectly well with snakemake's cluster execution approach.</p>
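<p>Schematically, the pattern looks something like this (the config keys and file paths are made up for illustration):</p>
<div class="highlight"><pre><code># at the top of the Snakefile: a default config pointing at demo data.
configfile: "test-data/conf.yml"

SAMPLES = config["samples"]        # e.g. a list of sample names
OUTDIR = config["outdir"]          # e.g. where results should go

rule all:
    input:
        expand(OUTDIR + "/{sample}.sig", sample=SAMPLES)
</code></pre></div>
<p>Running plain <code>snakemake</code> uses the demo config; running <code>snakemake --configfile conf/real-data.yml</code> overrides it for the real analysis, with no code changes.</p>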
<p>As a bonus, the demo data set provides a simple quickstart and example config file for people who want to use your software. This makes <a href="https://github.com/ctb/2020-long-read-assembly-decontam/blob/master/README.md#installing">the installation and quickstart docs</a> really simple and nearly identical across multiple projects!</p>
<p>(Note that I develop on Mac OS X and execute at scale on Linux HPCs. I'd probably be less happy with this approach if I developed on Windows, for which bioconda doesn't provide packages.)</p>
<h2>Libraries and applications</h2>
<p>On the opposite end of the spectrum from "piles of scripts" is research software engineering, where we are trying explicitly to build maintainable and reusable libraries and command-line applications. Here we take a very different approach from the workflow style detailed above, although in recent years I've noticed that we're working across this full spectrum on several projects. (This is perhaps because workflows, done like we are doing them above, start to resemble serious software engineering :).</p>
<p>Whenever we find a core set of functionality that is being used across multiple projects in the lab, we start to abstract that functionality into a library and/or command line application. We do this in part because <a href="http://ivory.idyll.org/blog/automated-testing-and-research-software.html">most scripts have bugs</a> that should be fixed, and we remain ignorant of them until we start reusing the scripts; but it also aids in efficiency and code reuse. It's a nice use-case driven way to develop software!</p>
<p>We've developed several software packages this way. For example, the <a href="https://github.com/dib-lab/khmer/">khmer</a> and <a href="https://github.com/dib-lab/screed/">screed</a> libraries emerged from piles of code that slowly got unified into a shared library.</p>
<p>More recently, the <a href="https://github.com/dib-lab/sourmash/">sourmash</a> project has become the in-lab exemplar of intentional software development practices. We now have 3-5 people working regularly on sourmash, and it's being used by an increasingly wide range of people. Below are some of the key techniques we've been using, which will (in most cases) be readily recognized as matching basic open source development practices!</p>
<p>I want to give an especially big shoutout here to Michael Crusoe, Camille Scott, and Luiz Irber, who have been the three key people leading our adoption of these techniques.</p>
<h3>Automate tests</h3>
<p>Keeping software working is hard. Automated tests are one of the solutions.</p>
<p>We have an increasingly expansive <a href="https://github.com/dib-lab/sourmash/tree/master/tests">set of automated tests</a> for sourmash - over 600 at the moment. It takes about a minute to run the whole test suite on my laptop. If it looks intimidating, that's because we've grown it over the years. We started with one test, and went from there.</p>
<p>We don't really use test-driven development extensively, or at least I don't. I know Camille has used it <a href="http://www.camillescott.org/2017/11/15/pytest-magic/">in her De Bruijn graph work</a>. I tend to reserve it for situations where the code is becoming complicated enough at a class or function level that I can't understand it -- and that's rarely necessary in my work. (Usually it means that I need to take a step back and rethink what I'm doing! I'm a big believer in <a href="https://www.linusakesson.net/programming/kernighans-lever/index.php">Kernighan's Lever</a> - if you're writing code at the limit of your ability to understand it, you'll never be able to debug it!)</p>
<h3>Use code review</h3>
<p>Maintainability, sustainability, and correctness of code are all enhanced by having multiple people's eyes on it.</p>
<p>We basically use <a href="https://guides.github.com/introduction/flow/">GitHub Flow</a>, as I understand it. Every PR runs all the tests on each commit, and we have a checklist to help guide contributors.</p>
<p>We have a two-person sign-off rule on every PR. This can slow down code development when some of us are busy, but on the flip side no one person is solely responsible when bad code makes it into a release :).</p>
<p>Most importantly, it means that our code quality is consistently better than what I would produce working on my own.</p>
<h3>Use semantic versioning</h3>
<p><a href="https://semver.org/">Semantic versioning</a> means that when we release a new version, outside observers can quickly know if they can upgrade without a problem. For example, within the sourmash 3.x series, the only reason for the same command line options to produce different output is if <a href="https://github.com/dib-lab/sourmash/pull/942">there was a bug</a>.</p>
<p><a href="https://github.com/dib-lab/sourmash/issues/655">We are still figuring out some of the details, of course.</a> For example, we have only recently started tracking performance regressions. And it's unclear exactly what parts of our API should be considered public. Since sourmash isn't <em>that</em> widely used, I'm not pushing hard on resolving these kinds of high level issues, but they are a regular background refrain in my mind.</p>
<p>In any case, what semantic versioning does is provide a simple way for people to know if it's safe to upgrade. It also lets us pin down versions in our own workflows, with some assurance that the behavior shouldn't be changing (but performance might improve!) if we pin to a major version.</p>
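<p>In practice, the pinning can be as simple as a version constraint in a conda environment file - a sketch, with the package list trimmed for illustration:</p>
<div class="highlight"><pre><code># environment.yml -- pin to the sourmash 3.x series: behavior stays
# stable under semantic versioning, though performance may improve.
channels:
  - conda-forge
  - bioconda
dependencies:
  - sourmash&gt;=3.3,&lt;4
</code></pre></div>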
<h3>Nail down behavior with tests, then refactor underneath</h3>
<p>I write a lot of hacky code when I'm exploring research functionality. Often this code gets baked into our packages with a limited understanding of its edge cases. As I explore and expand the use cases more and more, I find more of these edge cases. And, if the code is in a library, I nail down the edge cases with <a href="http://ivory.idyll.org/blog/stupidity-driven-testing.html">stupidity-driven testing</a>. This then lets me (or others) refactor the code to be less hacky and more robust, without changing its functionality.</p>
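<p>As a (hypothetical) example of what nailing down an edge case looks like - the module and function here are invented for illustration, not real sourmash API:</p>
<div class="highlight"><pre><code>import pytest

from mymodule import load_signatures   # hypothetical function under test

def test_load_signatures_empty_file(tmp_path):
    # stupidity-driven testing: a user hit a crash on an empty file,
    # so we pin down the intended behavior before refactoring.
    sigfile = tmp_path / "empty.sig"
    sigfile.write_text("")

    with pytest.raises(ValueError):
        load_signatures(str(sigfile))
</code></pre></div>
<p>With a test like this in place, the implementation underneath can be rewritten freely, and the edge case stays fixed.</p>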
<p>For example, I'm currently going through a <a href="https://github.com/dib-lab/sourmash/pull/946">long, slow refactor</a> of some formerly ugly sourmash code that creates a certain kind of indexed database. This code worked reasonably well for years, but as we developed more uses for it, it became clear that there were, ahem, opportunities for refactoring it to be more usable in other contexts.</p>
<p>We don't start with good code. We don't pretend that our code is good (or at least I wouldn't, and can't :). But we iteratively improve upon our code as we work with it.</p>
<h3>Explore useful behavior, then nail it down with tests, and only <strong>then</strong> optimize the heck out of it</h3>
<p>The previous section is how we clean up code, but it turns out it also works really well for <strong>speeding up code</strong>.</p>
<p>There is a really frustrating bias amongst software developers towards <a href="https://wiki.c2.com/?PrematureOptimization">premature optimization</a>, which leads to ugly and unmaintainable code. In my experience, flexibility trumps optimization 80% or more of the time, so I take this to the other extreme and rarely worry about optimizing code. Luckily some people in my lab counterbalance me in this preference, so we occasionally produce performant code as well :).</p>
<p>What we do is get to the point where we have pretty well-specified functionality, and then benchmark, and then refactor and optimize based on the benchmarking.</p>
<p>A really clear example of this applied to sourmash was <a href="https://github.com/dib-lab/sourmash/issues/573">here</a>, when Luiz and Taylor noticed that I'd written really bad code that was recreating large sets again and again in Python. Luiz added a simple "remove_many" method that did the same operation in place and we got a really substantial (order of magnitude?) speed increase.</p>
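<p>To illustrate the shape of that fix with plain Python sets (an analogy only - the real change was inside sourmash's MinHash code):</p>
<div class="highlight"><pre><code>remaining = set(range(1_000_000))
found = set(range(0, 1_000_000, 2))

# before: each subtraction allocates a brand-new set, so doing this
# repeatedly copies the large remaining set over and over.
remaining = remaining - found

# after: remove the elements in place - no copy, same result.
remaining.difference_update(found)
</code></pre></div>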
<p>Critically, this optimization was to a new research algorithm that we developed over the period of years. <strong>First</strong> we got the research algorithm to work. <strong>Then</strong> we spent a lot of time understanding how and why and where it was useful. <strong>During this period</strong> we wrote a whole bunch of tests that nailed down the behavior. And then when Luiz optimized the code, we just dropped in a faster replacement that passed all the tests.</p>
<p>This has become a bit of a trend in recent years. As sourmash has moved from C to C++ to Rust, Luiz has systematically improved the runtimes for various operations. But this has always occurred in the context of well-understood features with lots of tests. Otherwise we just end up breaking our software when we optimize it.</p>
<p>As a side note, whenever I hear someone emphasize the speed of their just-released scientific software, my strong Bayesian prior is that they are really telling me their code is not only full of bugs (all software is!) but that it'll be really hard to find and fix them...</p>
<h3>Collaborate by insisting the tests pass</h3>
<p>Working on multiple independent feature sets at the same time is hard, whether it's only one person or five. Tests can help here, too!</p>
<p>One of the cooler things to happen in sourmash land in the last two years is that <a href="https://twitter.com/olgabot">Olga Botvinnik</a> and some of her colleagues at CZBioHub started contributing substantially to sourmash. This started with Olga's interest in using sourmash for single-cell RNAseq analysis, which presents challenging new scalability problems.</p>
<p>Recently, the CZBioHub folk <a href="https://github.com/dib-lab/sourmash/pull/925">submitted a pull request to significantly change one of our core data structures</a> so as to scale it better. (It's going to be merged soon!) Almost all of our review comments have focused on reviewing the code for understandability, rather than questioning the correctness - this is because the interface for this data structure is pretty well tested at a functional level. <strong>Since the tests pass, I'm not worried that the code is wrong.</strong></p>
<p>What this overall approach lets us do is simultaneously work on multiple parts of the sourmash code base with some basic assurances that it will still work after all the merges are done.</p>
<h3>Distribute via (bio)conda, install via environments</h3>
<p>Installation for end users is hard.
I've spent many, many years writing installation tutorials. Conda just solves this, and is our go-to approach now for supporting user installs.</p>
<p>Conda software installation is awesome and awesomely simple. Even when software isn't yet packaged for conda install (like <a href="https://github.com/spacegraphcats/spacegraphcats/">spacegraphcats</a>, which is research-y enough that I haven't bothered) you <a href="https://github.com/spacegraphcats/spacegraphcats/blob/master/environment.yml">can still install it that way -- see the pip commands, here</a>.</p>
<h3>Put everything in issues</h3>
<p>You can find most design decisions, feature requests, and long-term musings for sourmash in our <a href="https://github.com/dib-lab/sourmash/issues">issue tracker</a>. This is where we discuss almost everything, and it's our primary help forum as well. Having a one-stop shop that ties together design, bugs, code reviews, and documentation updates is really nice. We even try to <a href="https://twitter.com/ctitusbrown/status/1247524069596991488">archive slack conversations</a> there!</p>
<h2>Concluding thoughts</h2>
<p>Academic workflow and software development is a tricky business. We operate in slow moving and severely resource-constrained environments, with a constant influx of people who have a variety of experience, to solve problems that are often poorly understood in the beginning (and maybe at the end). The practices above have been developed for a small lab and are battle-tested over a decade and more.</p>
<p>While your mileage may vary in terms of tools and approaches, I've seen convergence across the social-media enabled biological data science community to similar practices. This suggests these practices solve real problems that are being experienced by multiple labs. Moreover, we're developing a solid community of practice in not only using these approaches but also teaching them to new trainees. Huzzah!</p>
<p>--titus</p>
<p>(Special thanks go to the USDA, the NIH, and the Moore Foundation for funding so much of our software development!)</p>How to give a bad online talk2020-04-13T00:00:00+02:002020-04-13T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2020-04-13:/blog/2020-bad-online-talk.html<p>A bad example...</p><p>Today at lab meeting, I wanted to brainstorm about how to give good online talks, because I'm giving a few remote talks in the next month. Tracy suggested that perhaps I should demonstrate a <em>bad</em> talk first, just to get everyone on the same page.</p>
<p>So I did!</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/F4czvzciTlE" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
<p><a href="https://t.co/fX1YE2yjDp">Direct (YouTube link)</a></p>
<p>...enjoy? It's short, and not TOO painful if you show up with low expectations!</p>
<hr>
<p>First, let me say that we were tremendously ...inspired by Greg Wilson's <a href="https://t.co/p7LwjAmUOB">How to Teach Badly</a> and <a href="https://t.co/AUCbyN5fDV">How to Teach Badly (part 2)</a>!</p>
<p>So here's what I did --</p>
<p>I put together a few slides on some stuff that I'd been working on recently, so it would look reasonable.</p>
<p>My initial screen opened with a private Twitter message up, to mimic inadvertent content sharing :).</p>
<p>I started out with "I didn't have a lot of time to prepare for this meeting so apologies for some of the slides."</p>
<p>My slide theme was very hard to read - bad fonts and colors.</p>
<p>A few slides in I went with "I know we're all busy on time so I'm going to be brief. I'll just skip some of the background and through these first slides quickly."</p>
<p>On the first slide with an image, I had Taylor Reiter break in to ask a question, and I shut her down with "Just hold questions, I'll get to them at the end if we have time."</p>
<p>All of my slide content was just ...terrible. I am especially "proud" of the screenshots of code (I carefully cropped off the code comments).</p>
<p>And of course I spoke quickly, imparted little to no useful information in any way, and took no questions at the end, either...</p>
<hr>
<p>I only informed one or two people in advance that I was doing this, and so I got some good reactions ;). I also got some amazing recommendations for how to make it far, far worse...</p>
<p>Anyway, enjoy! I will write another blog post on what the various suggestions for giving good online talks were -- I'm giving two remote talks in the next month or so, and I'll come back with some specific recommendations, too!</p>
<p>--titus</p>
<p>p.s. Yes, these are real projects and you CAN find them on github :).</p>Some snakemake hacks for dealing with large collections of files2020-03-09T00:00:00+01:002020-03-09T00:00:00+01:00C. Titus Brown and N. Tessa Pierce and Taylor Reitertag:ivory.idyll.org,2020-03-09:/blog/2020-snakemake-hacks-collections-files.html<p>snakemake4life</p><p>This winter quarter I taught my usual graduate-level introductory
bioinformatics lab at UC Davis, GGG 201(b), for the fourth time. The
course lectures are given by Megan Dennis and Fereydoun Hormozdiari,
and I do a largely separate lab that aims to teach the basics of
practical variant calling, de novo assembly, and RNAseq differential
expression.</p>
<p>I also co-developed and co-taught a new course, GGG 298 / Tools for
Data Intensive Research, with Shannon Joslin, a graduate student here
in Genetics & Genomics who (among other things) took GGG 201(b) the
first time I offered it. GGG 298 is a series of ten half-day workshops
where we teach shell, conda, snakemake, git, RMarkdown, etc - you can
see
<a href="https://github.com/ngs-docs/2020-GGG298/">the syllabus for GGG 298 here</a>.</p>
<p>This time around, I did a complete redesign of the
<a href="https://github.com/ngs-docs/2020-GGG201b-lab">GGG 201(b) lab (see syllabus)</a>
to focus on using
<a href="http://snakemake.readthedocs.io/en/stable/">snakemake workflows</a>.</p>
<p>I'm 80% happy with how it went - there's some overall fine tuning to
be done, and snakemake has some corners that need more explaining than
other corners, but I think the basic concepts got through to a lot of
the students. I also think I'm finally teaching people something they
<em>really</em> need to know, which is how to build, automate, place controls
on, and execute complex bioinformatics workflows.</p>
<p>I was traveling the week before last, so I asked Taylor Reiter and
Tessa Pierce to do the first RNAseq lecture for the class (week 8!). As
part of their
<a href="https://github.com/ngs-docs/2020-ggg-201b-rnaseq">brilliant RNAseq materials</a>
for the class (snakemake! salmon! tximeta! DESeq2! RMarkdown!), Tessa
used a cute trick in the Snakefile that I hadn't seen before. It's
"obvious" if you're a Python+snakemake expert, but many people aren't,
and in any case it's always nice to share, right??</p>
<p>Below, I take the opportunity to share several solutions for loading
sample names into the Snakefile.</p>
<p>(These are fairly boilerplate examples that you can use in your own
code with little modification, too!)</p>
<h2>Cute snakemake trick #1: dictionaries for downloads</h2>
<p>The following code snippet is a nice, simple Pythonic way to download
a bunch of files from Web URLs.</p>
<div class="highlight"><pre><span></span><code><span class="c1"># list sample names & download URLs.</span>
<span class="n">sample_links</span> <span class="o">=</span> <span class="p">{</span><span class="s2">"ERR458493"</span><span class="p">:</span> <span class="s2">"https://osf.io/5daup/download"</span><span class="p">,</span>
<span class="s2">"ERR458494"</span><span class="p">:</span><span class="s2">"https://osf.io/8rvh5/download"</span><span class="p">,</span>
<span class="s2">"ERR458495"</span><span class="p">:</span><span class="s2">"https://osf.io/2wvn3/download"</span><span class="p">,</span>
<span class="s2">"ERR458500"</span><span class="p">:</span><span class="s2">"https://osf.io/xju4a/download"</span><span class="p">,</span>
<span class="s2">"ERR458501"</span><span class="p">:</span> <span class="s2">"https://osf.io/nmqe6/download"</span><span class="p">,</span>
<span class="s2">"ERR458502"</span><span class="p">:</span> <span class="s2">"https://osf.io/qfsze/download"</span><span class="p">}</span>
<span class="c1"># the sample names are dictionary keys in sample_links. extract them to a list we can use below</span>
<span class="n">SAMPLES</span><span class="o">=</span><span class="n">sample_links</span><span class="o">.</span><span class="n">keys</span><span class="p">()</span>
<span class="c1"># download yeast rna-seq data from Schurch et al, 2016 study</span>
<span class="n">rule</span> <span class="n">download_all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"rnaseq/raw_data/</span><span class="si">{sample}</span><span class="s2">.fq.gz"</span><span class="p">,</span> <span class="n">sample</span><span class="o">=</span><span class="n">SAMPLES</span><span class="p">)</span>
<span class="c1"># rule to download each individual file specified in sample_links</span>
<span class="n">rule</span> <span class="n">download_reads</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"rnaseq/raw_data/</span><span class="si">{sample}</span><span class="s2">.fq.gz"</span>
<span class="n">params</span><span class="p">:</span>
<span class="c1"># dynamically generate the download link directly from the dictionary</span>
<span class="n">download_link</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">wildcards</span><span class="p">:</span> <span class="n">sample_links</span><span class="p">[</span><span class="n">wildcards</span><span class="o">.</span><span class="n">sample</span><span class="p">]</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> curl -L </span><span class="si">{params.download_link}</span><span class="s2"> -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<h2>Cute snakemake trick #2: loading filenames from the current directory.</h2>
<p>(<strong>I don't recommend this approach.</strong> Read on.)</p>
<p>One of the most common questions I've been asked in the last few weeks
is how to avoid typing all of the sample names into the
Snakefile. (This can matter a lot when you have hundreds of samples!)</p>
<p>After you download the files above, you can get a list of the
downloaded files like so:</p>
<div class="highlight"><pre><span></span><code><span class="n">sample_ids</span> <span class="o">=</span> <span class="n">glob_wildcards</span><span class="p">(</span><span class="s2">"rnaseq/raw_data/</span><span class="si">{sample}</span><span class="s2">.fq.gz"</span><span class="p">)</span>
</code></pre></div>
<p>Now, <code>sample_ids</code> is a Python list that behaves just like <code>SAMPLES</code>,
and it can be used with <code>expand</code>. (Note the <code>.sample</code> at the end: <code>glob_wildcards</code> returns a named tuple of wildcard lists, so you extract the list for your wildcard by name.)</p>
<p>Note, for this example, <code>SAMPLES</code> and <code>sample_ids</code> are going to
contain the same list of sample names. The difference is that <code>sample_ids</code>
is loaded from the directory listing, while <code>SAMPLES</code> has to be
written out in the Snakefile somehow (here, in <code>sample_links</code>).</p>
<p><strong>Why don't I recommend this approach?</strong> You can only use this
approach if the files already exist in the directory. That's fine -
often you don't want to copy them in or download them dynamically! -
but it sets up a particular kind of potential error. If you load the
list of samples from your working directory, and you accidentally
delete one of the sample files, you'll omit it from your workflow
without knowing.</p>
<p>It's much better to <em>independently</em> specify the list of files, so that
if you accidentally delete one, snakemake will complain. That's where
the next trick comes in.</p>
<p>As a bonus, the next approach lets you specify metadata in the
spreadsheet, which is important!</p>
<h2>Cute snakemake trick #3: loading a list of sample names from a spreadsheet.</h2>
<p>This is taken from a really nice, clean
<a href="https://github.com/snakemake-workflows/rna-seq-star-deseq2">example RNAseq workflow that uses STAR and DESeq2</a>,
written by Johannes Köster, Sebastian Schmeier, and Jose Maturana.</p>
<p>Here,
<a href="https://github.com/snakemake-workflows/rna-seq-star-deseq2/blob/master/Snakefile">the Snakefile</a>
loads sample names from a tab-separated values spreadsheet using
pandas; a simplified version of the code follows:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">samples_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s1">'samples.tsv'</span><span class="p">)</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s2">"sample"</span><span class="p">,</span> <span class="n">drop</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">sample_names</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">samples_df</span><span class="p">[</span><span class="s1">'sample'</span><span class="p">])</span>
</code></pre></div>
<p>Here, <code>sample_names</code> is the same as <code>SAMPLES</code> and <code>sample_ids</code>, above - a list that you can use in <code>expand</code> and so on. The difference here is that <code>samples_df</code> is a <a href="https://www.geeksforgeeks.org/python-pandas-dataframe/">Pandas dataframe</a> that contains other information, such as sample metadata; and it's loaded from a TSV file that can be created, visualized, and edited using spreadsheet software.</p>
<h2>Cute snakemake trick #4: loading a list of download links from a spreadsheet.</h2>
<p>The TSV approach is particularly useful for downloading files or
moving files, as the download links or file paths can be included in
the spreadsheet, rather than at the top of the Snakefile (as they were
in cute trick #1).</p>
<p>Considering the same yeast RNAseq data as the first example and a TSV
file containing the sample names and download links, samples can be
downloaded like so:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">samples_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_table</span><span class="p">(</span><span class="s1">'samples.tsv'</span><span class="p">)</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s2">"sample"</span><span class="p">,</span> <span class="n">drop</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">SAMPLES</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">samples_df</span><span class="p">[</span><span class="s1">'sample'</span><span class="p">])</span>
<span class="c1"># download yeast rna-seq data from Schurch et al, 2016 study</span>
<span class="n">rule</span> <span class="n">download_all</span><span class="p">:</span>
<span class="nb">input</span><span class="p">:</span>
<span class="n">expand</span><span class="p">(</span><span class="s2">"rnaseq/raw_data/</span><span class="si">{sample}</span><span class="s2">.fq.gz"</span><span class="p">,</span> <span class="n">sample</span><span class="o">=</span><span class="n">SAMPLES</span><span class="p">)</span>
<span class="c1"># rule to download each individual file specified in samples_df</span>
<span class="n">rule</span> <span class="n">download_reads</span><span class="p">:</span>
<span class="n">output</span><span class="p">:</span> <span class="s2">"rnaseq/raw_data/</span><span class="si">{sample}</span><span class="s2">.fq.gz"</span>
<span class="n">params</span><span class="p">:</span>
<span class="c1"># dynamically grab the download link from the "dl_link" column in the samples data frame</span>
<span class="n">download_link</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">wildcards</span><span class="p">:</span> <span class="n">samples_df</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">wildcards</span><span class="o">.</span><span class="n">sample</span><span class="p">,</span> <span class="s2">"dl_link"</span><span class="p">]</span>
<span class="n">shell</span><span class="p">:</span> <span class="s2">"""</span>
<span class="s2"> curl -L </span><span class="si">{params.download_link}</span><span class="s2"> -o </span><span class="si">{output}</span>
<span class="s2"> """</span>
</code></pre></div>
<p>Enjoy! And comments are welcome!</p>
<p>--titus (and Tessa and Taylor!)</p>Two talks at JGI in May: sourmash, spacegraphcats, and disease associations in the human microbiome.2020-02-17T00:00:00+01:002020-02-17T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2020-02-17:/blog/2020-talks-at-jgi.html<p>Using k-mers and taxonomy to find contamination in metagenomes</p><p>Hello all! I'm giving two metagenomics talks - a tech talk and a bio
talk - at the Joint Genome Institute on May 7, 2020. The abstracts are
below.</p>
<p>The JGI just moved to a new building at LBNL, so these talks are much
more accessible to the UC Berkeley and LBNL communities than they
would have been a year ago. I hope interested people can make it!</p>
<p>The talks will be in the afternoon on May 7th at the
<a href="https://www.lbl.gov/community/integrative-genomics-building/">Integrative Genomics Building</a>,
LBNL Bldg 91-310. I've put the tentative times down. I'll update this
post with final times and contact information for security + parking
passes closer to the day.</p>
<h2>Bio talk: Novel approaches to metagenome analysis reveal microbial signatures of IBD</h2>
<p>(This will be the Science and Technology seminar, 3-4pm on May 7.)</p>
<p>Inflammatory bowel disease (IBD) is a spectrum of diseases
characterized by chronic inflammation of the intestines; it is likely
caused by host-mediated inflammatory responses at least in part
elicited by microorganisms. As of 2015, 1.3% of US adults have been
diagnosed with IBD. To date, although significant microbial
associations have been uncovered, no causative or consistent microbial
signature has been associated with IBD.</p>
<p>In a meta-analysis of six IBD cohorts comprising 2290 gut microbiome
shotgun metagenomes, we sought to uncover microbial signatures of
IBD. We developed a k-mer-based analysis approach based on sourmash
scaled signatures that comprehensively characterizes each metagenome
sample. We demonstrate that this approach explains substantial PCoA
variation across samples, and that patient, study, and diagnosis
account for the majority of variation. We then built an accurate
random forest classifier to predict IBD subtype. This classifier is
built on approximately 14,000 predictive k-mers and outperforms all
previously published work for characterization of IBD subtype. We next
sought to uncover the biological signal of the predictive k-mers. To
determine the origin of the predictive k-mers, we used sourmash gather
to search 400,000 microbial genomes from GenBank as well as recent
human metagenome reanalysis efforts.</p>
<p>We found that 69% of predictive k-mers were contained in 129 genomes,
many of which match known IBD correlates. We reasoned that many
additional predictive k-mers were likely in the pangenomes of these
129 predictive genomes, so we next used spacegraphcats to query
neighborhoods in compact de Bruijn graphs and extract sequences that
were near our predictive genomes in graph space. This increased the
annotated fraction of predictive k-mers to 85%.</p>
<p>This suggests that ~16% of predictive k-mers originate from
strain-variable or accessory components of pangenomes, and that this
variation is hidden from reference-based approaches but is important
for determining IBD subtypes. Interestingly, the fraction of
predictive k-mers associated with the 129 genomes changed
substantially after spacegraphcats queries. For example, a genome from
the genus Bacteroides increased from owning 2.1% to 10.7% of
predictive k-mers, surpassing the genome that was most predictive
prior to spacegraphcats queries (Clostridiales bacterium, 2.9% to
7.4%). We are now working to bioinformatically characterize the genes
associated with the pangenomes.</p>
<p>Our pipeline is lightweight and open source, extensible to similar
comparative metagenomic studies, and has the potential to improve
diagnostic criteria for IBD subtype.</p>
<h2>Tech talk: No k-mer left behind.</h2>
<p>(This is part of the Compute Next Generation talk series at JGI, 2-3pm
on May 7.)</p>
<p>Here at the DIB Lab @ UC Davis, we've developed and implemented a few
techniques that might be of interest to microbiology and metagenomics
computational researchers. In this tech talk, I will dig into the theory
and implementation of our approaches, and discuss some of our current
and future use cases. While there may be some extreme speculation
involved, I will be sure to highlight it as such :).</p>
<p>The first technique is DensityHash, an extension and simplification of
the modulo hash technique proposed as an alternative to MinHash by
Broder (1997). Briefly, we massively downsample k-mers by intersecting
with a subset of hash space. This permits efficient and accurate
estimation of Jaccard similarity and containment on large sequencing
data sets. We have implemented this technique in sourmash
(github.com/dib-lab/sourmash), which offers a pleasant user experience
for comparing samples, searching large databases (e.g. all of
GenBank), estimating the composition of metagenomes, and discovering
contaminated MAGs, among others. We also have a taxonomic module that
slices and dices arbitrary taxonomies, and associates them with hashes
for fun and profit.</p>
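<p>(The core downsampling idea fits in a few lines of Python. Here's a sketch - not sourmash's actual implementation, which uses MurmurHash and also canonicalizes k-mers against their reverse complements:)</p>
<div class="highlight"><pre><code>import hashlib

SCALED = 1000                    # keep roughly 1/1000 of all k-mers
MAX_HASH = 2**64 // SCALED       # keep hashes below this threshold

def kmer_hash(kmer):
    # stand-in 64-bit hash; sourmash itself uses MurmurHash.
    return int.from_bytes(hashlib.sha256(kmer.encode()).digest()[:8], "big")

def sketch(sequence, ksize=31):
    hashes = set()
    for i in range(len(sequence) - ksize + 1):
        h = kmer_hash(sequence[i:i + ksize])
        if h &lt; MAX_HASH:         # the bottom 1/SCALED slice of hash space
            hashes.add(h)
    return hashes
</code></pre></div>
<p>(Because every sketch keeps the same fixed slice of hash space, set intersections and unions of sketches directly estimate containment and Jaccard similarity.)</p>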
<p>The second technique is neighborhood query into large compact De
Bruijn graphs, using dominating sets. Briefly, we implement a
practically efficient linear-time neighborhood clustering on
metagenome compact De Bruijn graphs, and then use this to query and
characterize neighborhoods. This is implemented in spacegraphcats
(github.com/spacegraphcats/spacegraphcats/). Spacegraphcats permits
recovery of accessory elements and strain variation from metagenomes,
for additional fun and profit.</p>
<p>All of our software is open source under the BSD license, developed
openly on GitHub, and implemented in a combination of Python and
Rust. We use automated tests, continuous integration, code coverage
analysis, and pull request review in our development processes.</p>
<p>References:</p>
<p>sourmash: <a href="https://f1000research.com/articles/8-1006">Pierce et al., 2019</a></p>
<p>spacegraphcats: <a href="https://www.biorxiv.org/content/10.1101/462788v3">Brown et al., 2020</a></p>
<hr>
<p>Hope to see you there!</p>
<p>--titus</p>sourmash-oddify: a workflow for exploring contamination in metagenome-assembled genomes2020-01-02T00:00:00+01:002020-01-02T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2020-01-02:/blog/2020-sourmash-oddify.html<p>Using k-mers and taxonomy to find contamination in metagenomes</p><p>(Thanks to Erich Schwarz, Taylor Reiter, and Donovan Parks for brainstorming and feedback on this stuff. Thanks also to Luiz Irber and Phillip Brooks for their work on sourmash!)</p>
<p>Yesterday, <a href="http://ivory.idyll.org/blog/2020-sourmash-gtdb-oddities.html">I posted</a> about using k-mers and taxonomy to investigate Genbank genomes for potential contamination.</p>
<p>The underlying idea is pretty simple: look for subsets of k-mers that don't match the inferred taxonomy of the genome bins they're from, then analyze.</p>
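<p>In simplified Python, the idea looks roughly like this (the data structures are stand-ins for sourmash's LCA databases, not the real API):</p>
<div class="highlight"><pre><code>def find_odd_hashes(bin_hashes, bin_lineage, hash_to_lineage):
    """Flag hashed k-mers whose database lineage disagrees with the
    lineage assigned to the genome bin containing them.

    Lineages are tuples like ("d__Bacteria", "p__Actinobacteriota", ...);
    hash_to_lineage is a stand-in for a sourmash LCA database lookup."""
    odd = set()
    for h in bin_hashes:
        lineage = hash_to_lineage.get(h)
        if lineage is None:
            continue               # novel k-mer, no taxonomic info
        n = min(len(lineage), len(bin_lineage))
        if lineage[:n] != bin_lineage[:n]:
            odd.add(h)             # disagrees at some shared rank
    return odd
</code></pre></div>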
<p>What started me down this path <a href="http://ivory.idyll.org/blog/2017-comparing-genomes-from-metagenomes.html">over two years ago (!!)</a> was the use of the same underlying Tara Oceans metagenomic data for two separate papers, <a href="https://www.nature.com/articles/sdata2017203">Tully et al., 2018</a> and <a href="https://www.nature.com/articles/s41564-018-0176-9">Delmont et al., 2018</a>. Both groups released their data early along with bioRxiv preprints, and it proved to be a treasure trove for my bioinformatics methods development - <a href="https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-lca-subcommands-for-taxonomic-classification">all of the sourmash lca functionality</a> as well a lot of other functionality came from a series of about 14 blog posts examining these genomes.</p>
<p>I last left off with the Tara oceans taxonomic analysis <a href="http://ivory.idyll.org/blog/2017-taxonomic-disagreements-in-tara-mags.html">back around Thanksgiving 2017</a>, with the realization that I needed to dig some more in order to really understand what was going on.</p>
<p>Then, over the 2019 winter break, while updating our Genbank databases, I started playing with <a href="http://ivory.idyll.org/blog/2019-sourmash-lca-db-gtdb.html">making sourmash databases for the GTDB taxonomy</a>, and <a href="http://ivory.idyll.org/blog/2019-sourmash-lca-vs-gtdb-classify.html">while trying to understand why sourmash classifications were different from GTDB classifications</a>, I <a href="http://ivory.idyll.org/blog/2020-sourmash-gtdb-oddities.html">developed a pile of scripts to dig into taxonomically divergent genomes that share sequence</a>.</p>
<p>While corresponding with Donovan Parks about some of the Genbank oddities I found, he pointed out that this approach might be a useful technique for exploring contamination in metagenome-assembled genomes more generally.</p>
<p>Yep!</p>
<h2>The challenge: metagenome-assembled genome analysis</h2>
<p>When people compute metagenome-assembled genomes by assembling metagenomes and then binning the resulting contigs into inferred genomes, they usually assign taxonomy to the genomes using single-copy marker genes. These same genes can also be used in the binning pipeline, and/or in an evaluation step (see e.g. <a href="https://genome.cshlp.org/content/early/2015/05/14/gr.186072.114">CheckM</a>).</p>
<p>What has always worried me (and others!) is that this taxonomic assignment step drags with it many contigs whose only association with the single-copy marker genes is often that they were binned together. And, absent detailed inspection or knowledge of the genes in those contigs, it's been unclear how to evaluate the inclusion of those accessory contigs.</p>
<p>Here I should note that we + collaborators have looked into similar questions using <a href="https://www.biorxiv.org/content/10.1101/462788v2">assembly graph proximity</a>, which may work as well. Regardless, the question of how to QC MAGs is definitely an obsession of mine!</p>
<p>An angle suggested by the <a href="http://ivory.idyll.org/blog/2020-sourmash-gtdb-oddities.html">above Genbank analysis</a> was to look at the accessory contigs by doing k-mer-based taxonomic analysis on them, and then see if the k-mer taxonomy agreed or disagreed with the marker-gene-based taxonomy.</p>
<p>There are many reasons why this might fail - the main one being that you would generally expect the DNA sequence in MAGs to be novel, followed closely by issues of genuine horizontal gene transfer, plasmids, etc. But nothing ventured, nothing gained - and I already had functioning scripts! So I gave it a try.</p>
<h2>Connecting everything into a workflow</h2>
<p>In my lab, we have been using <a href="https://snakemake.readthedocs.io/">the snakemake workflow system</a> a lot. It's an excellent way to tie together a bunch of disparate scripts!</p>
<p>So I put together a workflow, <a href="https://github.com/dib-lab/sourmash-oddify">sourmash-oddify</a>, to automate the analysis of genome bins. The steps are:</p>
<ol>
<li>Given a collection of genome bins,</li>
<li>Assign taxonomy using <a href="https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz848/5626182">GTDB-Tk</a></li>
<li>Build a <a href="https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-lca-index">sourmash taxonomy/LCA database</a> using the resulting taxonomy</li>
<li>Run the <a href="https://github.com/dib-lab/sourmash-oddify/blob/master/scripts/find-oddities.py">find-oddities</a> and <a href="https://github.com/dib-lab/sourmash-oddify/blob/master/scripts/find-oddities-examine.py">find-oddities-examine</a> scripts.</li>
</ol>
<p>and voila! Sprinkle some <a href="https://github.com/dib-lab/sourmash-oddify/blob/master/conf/default.yml">YAML config file magic pixie dust on top</a> and you have a configurable workflow!</p>
<p>Now I needed to run it on some interesting data... hmm, what collections of MAGs do I have lying around... <a href="http://ivory.idyll.org/blog/2019-comparing-binnings.html">hey look, the Tara MAGs!</a></p>
<h2>Running sourmash-oddify on the Tara genomes</h2>
<p>So, I ran this on the 2,631 genomes from Tully et al., 2018, and the 957 genomes from Delmont et al., 2018. The GTDB-Tk step took about 12 hours, and the rest (computing signatures, building the LCA database, extracting oddities, aligning genomes) took about an hour. (The config file is <a href="https://github.com/dib-lab/sourmash-oddify/blob/master/conf/config-tara.yml">here</a>.)</p>
<p>The results on the Delmont data are <a href="https://osf.io/xj87f/">here</a> and <a href="https://osf.io/rt6qm/">here</a>, and the results on the Tully data are <a href="https://osf.io/xqt3n/">here</a> and <a href="https://osf.io/jhq62/">here</a>.</p>
<p>I decided to dig into two results, one from each data set. In both cases, the two genomes were classified in different superkingdoms:</p>
<div class="highlight"><pre><span></span><code> - TOBG_MED-875 (d__Archaea;p__Thermoplasmatota;c__Poseidoniia;o__Poseidoniales;f__Thalassoarchaeaceae;g__MGIIb-O5;;)
- TOBG_SAT-1614 (d__Bacteria;p__Actinobacteriota;c__Acidimicrobiia;o__Microtrichales;f__TK06;g__UBA7388;s__UBA7388 sp002470695;)
- TARA_MED_MAG_00140 (d__Archaea;p__Asgardarchaeota;c__Heimdallarchaeia;;;;;)
- TARA_PON_MAG_00079 (d__Bacteria;p__Patescibacteria;c__CG2-30-54-11;;;;;)
</code></pre></div>
<p>and both pairs shared a lot of sequence between them:</p>
<div class="highlight"><pre><span></span><code><span class="n">TOBG</span><span class="o">:</span>
<span class="n">cluster2</span><span class="o">.</span><span class="mi">0</span><span class="o">:</span><span class="w"> </span><span class="mi">208</span><span class="n">kb</span><span class="w"> </span><span class="n">aln</span><span class="w"> </span><span class="o">(</span><span class="mi">130</span><span class="n">k</span><span class="w"> </span><span class="mi">51</span><span class="o">-</span><span class="n">mers</span><span class="o">)</span><span class="w"> </span><span class="n">across</span><span class="w"> </span><span class="o">(</span><span class="n">root</span><span class="o">);</span><span class="w"> </span><span class="n">longest</span><span class="w"> </span><span class="n">contig</span><span class="o">:</span><span class="w"> </span><span class="mi">115</span><span class="w"> </span><span class="n">kb</span>
<span class="n">weighted</span><span class="w"> </span><span class="n">percent</span><span class="w"> </span><span class="n">identity</span><span class="w"> </span><span class="n">across</span><span class="w"> </span><span class="n">alignments</span><span class="o">:</span><span class="w"> </span><span class="mf">98.1</span><span class="o">%</span>
<span class="n">skipped</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="n">kb</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="n">alignments</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="n">alignments</span><span class="w"> </span><span class="o">(<</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="n">bp</span><span class="w"> </span><span class="n">or</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="mi">95</span><span class="o">%</span><span class="w"> </span><span class="n">identity</span><span class="o">)</span>
<span class="n">TOBG_SAT</span><span class="o">-</span><span class="mi">1614</span><span class="o">:</span><span class="w"> </span><span class="n">removed</span><span class="w"> </span><span class="mi">330</span><span class="n">kb</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mi">2514</span><span class="n">kb</span><span class="w"> </span><span class="o">(</span><span class="mi">13</span><span class="o">%),</span><span class="w"> </span><span class="mi">3</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mi">28</span><span class="w"> </span><span class="n">contigs</span>
<span class="n">TOBG_MED</span><span class="o">-</span><span class="mi">875</span><span class="o">:</span><span class="w"> </span><span class="n">removed</span><span class="w"> </span><span class="mi">238</span><span class="n">kb</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mi">1305</span><span class="n">kb</span><span class="w"> </span><span class="o">(</span><span class="mi">18</span><span class="o">%),</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mi">55</span><span class="w"> </span><span class="n">contigs</span>
<span class="n">TARA</span><span class="o">:</span>
<span class="n">cluster14</span><span class="o">.</span><span class="mi">0</span><span class="o">:</span><span class="w"> </span><span class="mi">1497</span><span class="n">kb</span><span class="w"> </span><span class="n">aln</span><span class="w"> </span><span class="o">(</span><span class="mi">970</span><span class="n">k</span><span class="w"> </span><span class="mi">51</span><span class="o">-</span><span class="n">mers</span><span class="o">)</span><span class="w"> </span><span class="n">across</span><span class="w"> </span><span class="o">(</span><span class="n">root</span><span class="o">);</span><span class="w"> </span><span class="n">longest</span><span class="w"> </span><span class="n">contig</span><span class="o">:</span><span class="w"> </span><span class="mi">11</span><span class="w"> </span><span class="n">kb</span>
<span class="n">weighted</span><span class="w"> </span><span class="n">percent</span><span class="w"> </span><span class="n">identity</span><span class="w"> </span><span class="n">across</span><span class="w"> </span><span class="n">alignments</span><span class="o">:</span><span class="w"> </span><span class="mf">98.9</span><span class="o">%</span>
<span class="n">skipped</span><span class="w"> </span><span class="mi">15</span><span class="w"> </span><span class="n">kb</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="n">alignments</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="mi">37</span><span class="w"> </span><span class="n">alignments</span><span class="w"> </span><span class="o">(<</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="n">bp</span><span class="w"> </span><span class="n">or</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="mi">95</span><span class="o">%</span><span class="w"> </span><span class="n">identity</span><span class="o">)</span>
<span class="n">TARA_PON_MAG_00079</span><span class="o">:</span><span class="w"> </span><span class="n">removed</span><span class="w"> </span><span class="mi">1788</span><span class="n">kb</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mi">6127</span><span class="n">kb</span><span class="w"> </span><span class="o">(</span><span class="mi">29</span><span class="o">%),</span><span class="w"> </span><span class="mi">507</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mi">1767</span><span class="w"> </span><span class="n">contigs</span>
<span class="n">TARA_MED_MAG_00140</span><span class="o">:</span><span class="w"> </span><span class="n">removed</span><span class="w"> </span><span class="mi">2791</span><span class="n">kb</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mi">6746</span><span class="n">kb</span><span class="w"> </span><span class="o">(</span><span class="mi">41</span><span class="o">%),</span><span class="w"> </span><span class="mi">472</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mi">1411</span><span class="w"> </span><span class="n">contigs</span>
</code></pre></div>
<p>As a control, I then took the "cleaned" genomes and re-ran the classification with GTDB-Tk. Three of the four were classified the same as before, indicating that the removed sequence didn’t contain essential marker genes (as I would have guessed). TARA_PON_MAG_00079 wasn't classified as anything by GTDB-Tk, because fewer than 10% of the markers were present. (AFAICT, GTDB-Tk doesn’t give any more details than that in its logs, so I’ll have to dig to figure out what happened.)</p>
<p>TOBG_MED-875 is classified as d__Archaea, while TOBG_SAT-1614 is classified as d__Bacteria. So what is the sequence that is shared? Conveniently, find-oddities-examine.py outputs the contigs it removes, but what next?</p>
<p>I decided to run a quick analysis using Torsten Seemann's <a href="https://github.com/tseemann/prokka">prokka</a>, which did gene calling and gave me protein sequences in FASTA format with a minimum amount of fuss (thanks Torsten!). I took the resulting aa sequences, extracted those over 100 aa in length, shuffled them, and took the first 10. I then BLASTed <a href="https://osf.io/uds2n/">these ten sequences</a> over at <a href="https://blast.ncbi.nlm.nih.gov/Blast.cgi">NCBI BLAST</a>.</p>
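<p>(For concreteness, the "filter by length, shuffle, take 10" step is only a few lines of Python. Here's a minimal sketch, assuming prokka's protein FASTA output and the screed parsing library; the filenames are hypothetical.)</p>
<div class="highlight"><pre><span></span><code>import random
import screed   # FASTA/FASTQ parsing library

# keep only proteins that are at least 100 aa long
long_proteins = [record for record in screed.open('proteins.faa')
                 if len(record.sequence) >= 100]

# shuffle, then take the first ten for BLASTing
random.shuffle(long_proteins)
with open('ten-proteins.faa', 'w') as fp:
    for record in long_proteins[:10]:
        fp.write('>{}\n{}\n'.format(record.name, record.sequence))
</code></pre></div>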
<p>The top hit to these genes in all 10 cases is to the <a href="https://www.ncbi.nlm.nih.gov/biosample/SAMN07618765">TOBG_MED-875 genome in Genbank</a>, which is labeled as a <em>Euryarchaeota archaeon</em>.</p>
<p>However, the second and third hits are generally to a variety of Chloroflexi and/or Acidimicrobiales proteins, in the Bacterial superkingdom. This suggests that the majority of the predicted genes in the DNA shared between TOBG_MED-875 and TOBG_SAT-1614 are bacterial.</p>
<p>Moreover, it suggests that the inclusion of TOBG_MED-875 in Genbank may be messing up some gene taxonomies.</p>
<h2>Summary thoughts</h2>
<p>I think it is safe to argue that two different binned genomes from the same metagenomic samples should not share much genomic DNA, unless they are from closely related species. (In general, I would not trust conclusions about lateral gene transfer based solely on computationally inferred genomes.)</p>
<p><a href="https://github.com/dib-lab/sourmash-oddify">sourmash-oddify</a> is an alpha-stage automated workflow to identify k-mers and DNA segments that don't follow the taxonomy of their containing genomes. I think using it to flag contamination in metagenome-assembled genomes is (or will be :) straightforward.</p>
<p>It uses the GTDB taxonomy assignment pipeline, GTDB-Tk, to generate the taxonomies, uses a Kraken-inspired approach to identify "incoherent" k-mers shared between genomes, and then runs nucmer to align the genomes.</p>
<p>Indications are that at least on some genomes, it correctly identifies contamination.</p>
<p>This is a pretty lightweight workflow, too, especially if you're already using GTDB-Tk!</p>
<h2>What's next?</h2>
<p>I'm not really sure. I have a few ideas for some larger scale analyses, but I'm at the stage where I have 80% of the coding done for a full project, but it's only 20% of the work needed to bring the project to some sort of real fruition.</p>
<p>I think what I'd be looking to do next is to automate the Prokka steps above, and find some way of semi-automatically reaching conclusions about who the contamination belongs to.</p>
<p>I'd like to work with a group or two who have a large collection of pre-publication MAGs to investigate with this approach. I think the best way to mature an approach like this is in tandem with a biology team that really cares about the specifics of the genomes and genes. <a href="mailto:ctbrown@ucdavis.edu">Drop me a line if you're interested!</a></p>
<p>I have other ideas and questions, too - can we use this pipeline on single-cell genomes? Should we work to ingest all genomes everywhere, and would that make this more sensitive in a useful way?</p>
<p>--titus</p>Finding problematic bacterial/archaeal genomes using k-mers and taxonomy2020-01-01T00:00:00+01:002020-01-01T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2020-01-01:/blog/2020-sourmash-gtdb-oddities.html<p>Some things in Genbank look ...odd.</p><p>(Happy New Year, everyone! Thanks on this blog post go out to Erich
Schwarz and Taylor Reiter, for offering helpful suggestions and asking
tough questions as I meandered through this work!)</p>
<p>Yesterday,
<a href="http://ivory.idyll.org/blog/2019-sourmash-lca-vs-gtdb-classify.html">I posted</a>
about using <code>sourmash lca classify</code> to taxonomically classify
bacterial and archaeal genomes quickly, and compared the results to
the full GTDB taxonomy. The tl;dr was that sourmash works pretty well
and returns results consistent with GTDB and GTDB-Tk, but that it
often doesn't classify as precisely as GTDB-Tk.</p>
<p>I was kind of expecting that at the species level, because there is a
limit to the kind of precision that downsampled k-mers can achieve:
the last 1-0.1% of nucleotide similarity can be a bit wobbly with
sourmash (<- technical term).</p>
<p>But I was surprised to see limits at the phylum and superkingdom
levels. <code>sourmash lca classify</code> couldn't classify 235 genomes beyond
the phylum level! What could be causing this?</p>
<h2>Digging into a single case of imprecise classification by sourmash</h2>
<p>I took a closer look at
<a href="https://www.ncbi.nlm.nih.gov/assembly/GCF_001477405.1/">GCF_001477405</a>,
a genome tagged as <em>Staphylococcus sciuri</em> in Genbank. Using <code>sourmash
lca summarize</code>, I output a summary of the taxonomic labels of the
31-mers in this genome, downsampled at 10,000. At the phylum level, I
saw:</p>
<div class="highlight"><pre><span></span><code><span class="mf">67.9</span><span class="err">%</span><span class="w"> </span><span class="mf">199</span><span class="w"> </span><span class="n">d__Bacteria</span><span class="p">;</span><span class="n">p__Firmicutes</span>
<span class="mf">2.0</span><span class="err">%</span><span class="w"> </span><span class="mf">6</span><span class="w"> </span><span class="n">d__Bacteria</span><span class="p">;</span><span class="n">p__Proteobacteria</span>
</code></pre></div>
<p>which suggests that there are about 60k 31-mers (6 hashes, at a
downsampling rate of 10,000) in the genome that belong to genomes in
the phylum Proteobacteria (they're from the
<em>Bradyrhizobium sp003020075</em> genome, if you're interested :).</p>
<p>And, because of the mechanism and thresholds by which <code>sourmash lca
classify</code> works, those 60k k-mers were enough to trigger confusion
about whether the genome belonged to the Firmicutes or the
Proteobacteria.</p>
<h2>The limitations of a naive lowest common ancestor algorithm</h2>
<p>Please indulge me in a brief digression about lowest common ancestor
approaches to classification. Per
<img alt="this figure" src="http://ivory.idyll.org/blog/images/2017-kmers-kraken.jpg">
(from Wood and Salzberg 2014), the algorithm for taxonomic
classification of collections of k-mers looks something like this:</p>
<ol>
<li>Classify all k-mers individually</li>
<li>Collect all the classifications into a single tree</li>
<li>Compute the lowest common ancestor of all the classifications</li>
<li>Assign that classification to the collection</li>
</ol>
<p>This is the algorithm that sourmash uses, with the addition of a
filtering step to remove classifications that are few in number before
computing the lowest common ancestor.</p>
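<p>In code, the naive LCA-plus-filtering idea is compact. Here's a toy sketch (not sourmash's actual implementation; the <code>min_count</code> default is illustrative), with lineages represented as tuples ordered from superkingdom on down:</p>
<div class="highlight"><pre><span></span><code>from collections import Counter

def classify(kmer_lineages, min_count=2):
    # steps 1 & 2: tally the per-k-mer classifications
    counts = Counter(kmer_lineages)
    # the filtering step: drop classifications with few supporting k-mers
    survivors = [lin for lin, n in counts.items() if n >= min_count]
    if not survivors:
        return ()
    # step 3: the LCA is the longest lineage prefix shared by all survivors
    lca = survivors[0]
    for lin in survivors[1:]:
        shared = 0
        for a, b in zip(lca, lin):
            if a != b:
                break
            shared += 1
        lca = lca[:shared]
    return lca   # step 4: assign this to the whole collection

# e.g., for the genome above:
# classify([('d__Bacteria', 'p__Firmicutes')] * 199 +
#          [('d__Bacteria', 'p__Proteobacteria')] * 6)
# returns ('d__Bacteria',) -- no phylum-level call, as described above.
</code></pre></div>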
<p>(Kraken takes a more sophisticated approach than this, where it
computes the highest-weighted root-to-leaf path through the taxonomy
(as pictured in the above figure).)</p>
<p>So that's what's going on with this specific genome: sourmash is doing
the right thing (by our implementation of the LCA algorithm), and
refusing to classify the genome beyond the phylum level, because the
genome has bits and pieces of Firmicutes <em>and</em>
Proteobacteria. Meanwhile, GTDB is appropriately deciding that this is
almost certainly a <em>Staphylococcus sciuri</em>, based on its marker genes.</p>
<h2>Looking systematically at genome composition across 25k genomes</h2>
<p>On the other hand, why the heck is 2% of this genome shared with
<em>Bradyrhizobium sp003020075</em>?? That's a good question...</p>
<p>Rather than dig into this on a case-by-case basis, I decided to look
across 25k of the GTDB genomes - these are the 25k dereplicated
genomes that are part of the GTDB toolkit, and (not coincidentally)
the ones in the databases that
<a href="http://ivory.idyll.org/blog/2019-sourmash-lca-db-gtdb.html">we posted on Monday</a>. These
"LCA" databases contain a dictionary of all of the k-mers in all 25k
genomes, together with their taxonomic lineages - perfect!</p>
<p>So I devised the following algorithm:</p>
<ol>
<li>Gather all 51-mers in the 25k genomes</li>
<li>Identify those that are "taxonomically incoherent" at the superkingdom or phylum level, by which I mean they belong to genomes in both Archaea and Bacteria, or multiple phyla within Archaea or Bacteria.</li>
<li>Find pairs of genomes that belong to different phyla or superkingdoms and contain approximately 100,000 or more 51-mers in common.</li>
</ol>
<p>This algorithm is implemented in the
<a href="https://github.com/dib-lab/sourmash-oddify/blob/master/scripts/find-oddities.py">find-oddities.py script</a>,
if you're interested; it'll run on any sourmash LCA database file, and
takes about a minute to run on the 25k-genome database.</p>
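<p>Stripped way down, the core of the scan looks something like the toy sketch below. (This is not the real script; <code>hashval_to_entries</code> is a hypothetical dict mapping each hashval to (genome, lineage-tuple) pairs, which is roughly the information an LCA database holds.)</p>
<div class="highlight"><pre><span></span><code>from collections import defaultdict
from itertools import combinations

SCALED = 10000                   # downsampling rate of the database
MIN_SHARED = 100000 // SCALED    # ~100kb worth of shared 51-mers

def find_oddities(hashval_to_entries):
    pair_counts = defaultdict(int)
    for hashval, entries in hashval_to_entries.items():
        # map each genome containing this hashval to its phylum-level lineage
        by_genome = {genome: lineage[:2] for genome, lineage in entries}
        # "incoherent" = the hashval spans more than one phylum
        if len(set(by_genome.values())) == 1:
            continue
        for g1, g2 in combinations(sorted(by_genome), 2):
            if by_genome[g1] != by_genome[g2]:   # cross-phylum pairs only
                pair_counts[(g1, g2)] += 1

    # report pairs sharing ~100kb or more of incoherent sequence
    for (g1, g2), n in sorted(pair_counts.items(), key=lambda kv: -kv[1]):
        if n >= MIN_SHARED:
            print('{} & {} share ~{} bp'.format(g1, g2, n * SCALED))
</code></pre></div>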
<p>What does the output look like? This!</p>
<div class="highlight"><pre><span></span><code><span class="n">cluster</span><span class="w"> </span><span class="mh">0</span><span class="w"> </span><span class="n">has</span><span class="w"> </span><span class="mh">2</span><span class="w"> </span><span class="n">assignments</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="mh">47</span><span class="w"> </span><span class="n">hashvals</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="mh">470000</span><span class="w"> </span><span class="n">bp</span>
<span class="w"> </span><span class="n">rank</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="nl">lca:</span><span class="w"> </span><span class="n">superkingdom</span><span class="w"> </span><span class="n">d__Bacteria</span>
<span class="w"> </span><span class="n">Candidate</span><span class="w"> </span><span class="n">genome</span><span class="w"> </span><span class="n">pairs</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">these</span><span class="w"> </span><span class="nl">lineages:</span>
<span class="w"> </span><span class="n">cluster</span><span class="p">.</span><span class="n">pair</span><span class="w"> </span><span class="mf">0.0</span><span class="w"> </span><span class="n">share</span><span class="w"> </span><span class="mh">470000</span><span class="w"> </span><span class="n">bases</span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">GCA_003220225</span><span class="w"> </span><span class="p">(</span><span class="n">d__Bacteria</span><span class="p">;</span><span class="n">p__Methylomirabilota</span><span class="p">;</span><span class="n">c__Methylomirabilia</span><span class="p">;</span><span class="n">o__Ro</span><span class="se">\</span>
<span class="n">kubacteriales</span><span class="p">;</span><span class="n">f__GWA2</span><span class="o">-</span><span class="mh">73</span><span class="o">-</span><span class="mh">35</span><span class="p">;</span><span class="n">g__AR12</span><span class="p">;</span><span class="n">s__AR12</span><span class="w"> </span><span class="n">sp003220225</span><span class="p">;)</span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">GCA_003222275</span><span class="w"> </span><span class="p">(</span><span class="n">d__Bacteria</span><span class="p">;</span><span class="n">p__Acidobacteriota</span><span class="p">;</span><span class="n">c__Vicinamibacteria</span><span class="p">;</span><span class="n">o__Fen</span><span class="o">-</span><span class="se">\</span>
<span class="mh">336</span><span class="p">;</span><span class="n">f__Fen</span><span class="o">-</span><span class="mh">336</span><span class="p">;</span><span class="n">g__AA32</span><span class="p">;</span><span class="n">s__AA32</span><span class="w"> </span><span class="n">sp003222275</span><span class="p">;)</span>
</code></pre></div>
<p>This is flagging two genomes, GCA_003220225 and GCA_003222275, as
sharing approximately 470,000 51-mers. The output is sorted by number
of shared k-mers, descending, and for the particular thresholds and
parameters that I'm using, there are 21 sets of lineages in GTDB that
share 100,000 or more 51-mers across the superkingdom or phylum level.</p>
<h2>Looking at actual alignments</h2>
<p>The big problem with the above approach is that it relies on k-mers,
and downsampled k-mers at that. To dig into the actual genomes, I
decided to actually do some alignments. Briefly, I gathered each pair
of genomes, aligned them with nucmer, and then filtered the resulting
alignments using pymummer; this is implemented in the script
<a href="https://github.com/dib-lab/sourmash-oddify/blob/master/scripts/find-oddities-examine.py">find-oddities-examine.py</a>.</p>
<p>The resulting output is this:</p>
<div class="highlight"><pre><span></span><code><span class="nt">cluster0</span><span class="p">.</span><span class="nc">0</span><span class="o">:</span><span class="w"> </span><span class="nt">557kb</span><span class="w"> </span><span class="nt">aln</span><span class="w"> </span><span class="o">(</span><span class="nt">470k</span><span class="w"> </span><span class="nt">51-mers</span><span class="o">)</span><span class="w"> </span><span class="nt">across</span><span class="w"> </span><span class="nt">d__Bacteria</span><span class="o">;</span><span class="w"> </span><span class="nt">longest</span><span class="w"> </span><span class="nt">contig</span><span class="o">:</span><span class="w"> </span><span class="nt">26</span><span class="w"> </span><span class="nt">kb</span>
<span class="nt">weighted</span><span class="w"> </span><span class="nt">percent</span><span class="w"> </span><span class="nt">identity</span><span class="w"> </span><span class="nt">across</span><span class="w"> </span><span class="nt">alignments</span><span class="o">:</span><span class="w"> </span><span class="nt">97</span><span class="p">.</span><span class="nc">6</span><span class="o">%</span>
<span class="nt">skipped</span><span class="w"> </span><span class="nt">79</span><span class="w"> </span><span class="nt">kb</span><span class="w"> </span><span class="nt">of</span><span class="w"> </span><span class="nt">alignments</span><span class="w"> </span><span class="nt">in</span><span class="w"> </span><span class="nt">97</span><span class="w"> </span><span class="nt">alignments</span><span class="w"> </span><span class="o">(<</span><span class="w"> </span><span class="nt">0</span><span class="w"> </span><span class="nt">bp</span><span class="w"> </span><span class="nt">or</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="nt">95</span><span class="o">%</span><span class="w"> </span><span class="nt">identity</span><span class="o">)</span>
<span class="nt">GCA_003222275</span><span class="o">:</span><span class="w"> </span><span class="nt">removed</span><span class="w"> </span><span class="nt">760kb</span><span class="w"> </span><span class="nt">of</span><span class="w"> </span><span class="nt">6756kb</span><span class="w"> </span><span class="o">(</span><span class="nt">11</span><span class="o">%),</span><span class="w"> </span><span class="nt">154</span><span class="w"> </span><span class="nt">of</span><span class="w"> </span><span class="nt">628</span><span class="w"> </span><span class="nt">contigs</span>
<span class="nt">GCA_003220225</span><span class="o">:</span><span class="w"> </span><span class="nt">removed</span><span class="w"> </span><span class="nt">4739kb</span><span class="w"> </span><span class="nt">of</span><span class="w"> </span><span class="nt">6031kb</span><span class="w"> </span><span class="o">(</span><span class="nt">79</span><span class="o">%),</span><span class="w"> </span><span class="nt">95</span><span class="w"> </span><span class="nt">of</span><span class="w"> </span><span class="nt">167</span><span class="w"> </span><span class="nt">contigs</span>
</code></pre></div>
<p>and hopefully it's pretty self-explanatory.</p>
<p>The script makes some minimal attempt to "cleanse" the genomes of
things that align between them, and that's what the last two lines
are. But, rather than doing anything clever, I just discard any contig
that has an alignment in it. This is obviously wrong in a general
sense but...</p>
<p>...interestingly, it provides an opportunity to see that in this pair
of genomes, one genome has alignments to a bunch of fragments (that's
the first one), while the other has alignments throughout (the second
one). The signature of this is that you can cleanly remove all of the
alignments from the first genome and only get rid of 11% of the
sequence, whereas 79% of the second genome goes away when you
eliminate contigs with alignments.</p>
<p>So, in this case, it's pretty clear that the first genome is probably
contaminated by sequence from the second genome.</p>
<p>There are other situations where it's less clear what's going on:</p>
<div class="highlight"><pre><span></span><code><span class="nt">cluster21</span><span class="p">.</span><span class="nc">0</span><span class="o">:</span><span class="w"> </span><span class="nt">115kb</span><span class="w"> </span><span class="nt">aln</span><span class="w"> </span><span class="o">(</span><span class="nt">100k</span><span class="w"> </span><span class="nt">51-mers</span><span class="o">)</span><span class="w"> </span><span class="nt">across</span><span class="w"> </span><span class="nt">d__Bacteria</span><span class="o">;</span><span class="w"> </span><span class="nt">longest</span><span class="w"> </span><span class="nt">contig</span><span class="o">:</span><span class="w"> </span><span class="nt">1</span><span class="w"> </span><span class="nt">kb</span>
<span class="nt">weighted</span><span class="w"> </span><span class="nt">percent</span><span class="w"> </span><span class="nt">identity</span><span class="w"> </span><span class="nt">across</span><span class="w"> </span><span class="nt">alignments</span><span class="o">:</span><span class="w"> </span><span class="nt">99</span><span class="p">.</span><span class="nc">2</span><span class="o">%</span>
<span class="nt">skipped</span><span class="w"> </span><span class="nt">4</span><span class="w"> </span><span class="nt">kb</span><span class="w"> </span><span class="nt">of</span><span class="w"> </span><span class="nt">alignments</span><span class="w"> </span><span class="nt">in</span><span class="w"> </span><span class="nt">6</span><span class="w"> </span><span class="nt">alignments</span><span class="w"> </span><span class="o">(<</span><span class="w"> </span><span class="nt">0</span><span class="w"> </span><span class="nt">bp</span><span class="w"> </span><span class="nt">or</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="nt">95</span><span class="o">%</span><span class="w"> </span><span class="nt">identity</span><span class="o">)</span>
<span class="nt">GCF_000477555</span><span class="o">:</span><span class="w"> </span><span class="nt">removed</span><span class="w"> </span><span class="nt">1439kb</span><span class="w"> </span><span class="nt">of</span><span class="w"> </span><span class="nt">2775kb</span><span class="w"> </span><span class="o">(</span><span class="nt">52</span><span class="o">%),</span><span class="w"> </span><span class="nt">163</span><span class="w"> </span><span class="nt">of</span><span class="w"> </span><span class="nt">207</span><span class="w"> </span><span class="nt">contigs</span>
<span class="nt">GCF_000427295</span><span class="o">:</span><span class="w"> </span><span class="nt">removed</span><span class="w"> </span><span class="nt">5423kb</span><span class="w"> </span><span class="nt">of</span><span class="w"> </span><span class="nt">6292kb</span><span class="w"> </span><span class="o">(</span><span class="nt">86</span><span class="o">%),</span><span class="w"> </span><span class="nt">88</span><span class="w"> </span><span class="nt">of</span><span class="w"> </span><span class="nt">202</span><span class="w"> </span><span class="nt">contigs</span>
</code></pre></div>
<p>and here we would need to dig further.</p>
<h2>Examining potential contamination across 25k Genbank genomes</h2>
<p>Of the 21 pairs of genomes found with the above approach, it looks
like there are 11 that have cleanly isolatable contigs with
taxonomically incoherent k-mers.</p>
<p>The most interesting one is this:</p>
<div class="highlight"><pre><span></span><code><span class="n">cluster</span><span class="w"> </span><span class="mh">2</span><span class="w"> </span><span class="n">has</span><span class="w"> </span><span class="mh">2</span><span class="w"> </span><span class="n">assignments</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="mh">25</span><span class="w"> </span><span class="n">hashvals</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="mh">250000</span><span class="w"> </span><span class="n">bp</span>
<span class="w"> </span><span class="n">rank</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="nl">lca:</span><span class="w"> </span><span class="n">superkingdom</span><span class="w"> </span><span class="n">d__Bacteria</span>
<span class="w"> </span><span class="n">Candidate</span><span class="w"> </span><span class="n">genome</span><span class="w"> </span><span class="n">pairs</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">these</span><span class="w"> </span><span class="nl">lineages:</span>
<span class="w"> </span><span class="n">cluster</span><span class="p">.</span><span class="n">pair</span><span class="w"> </span><span class="mf">2.0</span><span class="w"> </span><span class="n">share</span><span class="w"> </span><span class="mh">260000</span><span class="w"> </span><span class="n">bases</span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">GCF_002705755</span><span class="w"> </span><span class="p">(</span><span class="n">d__Bacteria</span><span class="p">;</span><span class="n">p__Actinobacteriota</span><span class="p">;</span><span class="n">c__Actinobacteria</span><span class="p">;</span><span class="n">o__Actin</span><span class="se">\</span>
<span class="n">omycetales</span><span class="p">;</span><span class="n">f__Microbacteriaceae</span><span class="p">;</span><span class="n">g__Microbacterium</span><span class="p">;</span><span class="n">s__Microbacterium</span><span class="w"> </span><span class="n">esteraromat</span><span class="se">\</span>
<span class="n">icum_A</span><span class="p">;)</span>
<span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">GCA_003265155</span><span class="w"> </span><span class="p">(</span><span class="n">d__Bacteria</span><span class="p">;</span><span class="n">p__Firmicutes</span><span class="p">;</span><span class="n">c__Bacilli</span><span class="p">;</span><span class="n">o__Mycoplasmatales</span><span class="p">;</span><span class="n">f_</span><span class="se">\</span>
<span class="n">_Mycoplasmoidaceae</span><span class="p">;</span><span class="n">g__Eperythrozoon_A</span><span class="p">;</span><span class="n">s__Eperythrozoon_A</span><span class="w"> </span><span class="n">wenyonii_A</span><span class="p">;)</span>
<span class="n">weighted</span><span class="w"> </span><span class="n">percent</span><span class="w"> </span><span class="n">identity</span><span class="w"> </span><span class="n">across</span><span class="w"> </span><span class="nl">alignments:</span><span class="w"> </span><span class="mf">97.9</span><span class="o">%</span>
<span class="n">skipped</span><span class="w"> </span><span class="mh">45</span><span class="w"> </span><span class="n">kb</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="n">alignments</span><span class="w"> </span><span class="n">in</span><span class="w"> </span><span class="mh">42</span><span class="w"> </span><span class="n">alignments</span><span class="w"> </span><span class="p">(</span><span class="o"><</span><span class="w"> </span><span class="mh">0</span><span class="w"> </span><span class="n">bp</span><span class="w"> </span><span class="k">or</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="mh">95</span><span class="o">%</span><span class="w"> </span><span class="n">identity</span><span class="p">)</span>
<span class="nl">GCA_003265155:</span><span class="w"> </span><span class="n">removed</span><span class="w"> </span><span class="mh">593</span><span class="n">kb</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mh">597</span><span class="n">kb</span><span class="w"> </span><span class="p">(</span><span class="mh">99</span><span class="o">%</span><span class="p">),</span><span class="w"> </span><span class="mh">34</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mh">37</span><span class="w"> </span><span class="n">contigs</span>
<span class="nl">GCF_002705755:</span><span class="w"> </span><span class="n">removed</span><span class="w"> </span><span class="mh">534</span><span class="n">kb</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mh">3626</span><span class="n">kb</span><span class="w"> </span><span class="p">(</span><span class="mh">15</span><span class="o">%</span><span class="p">),</span><span class="w"> </span><span class="mh">139</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="mh">225</span><span class="w"> </span><span class="n">contigs</span>
</code></pre></div>
<p>Here it looks like essentially the entire genome (99%) of <em>Eperythrozoon_A wenyonii_A</em>
is contained in the genome of <em>Microbacterium esteraromaticum_A</em>!</p>
<p>I'm still fine-tuning the approach but I think this is a promising way
to flag Genbank genomes that are candidates for further examination.</p>
<h2>Some concluding thoughts for today</h2>
<p>To summarize, what we're seeing is that whole-genome approaches to
taxonomic classification (either based on phylogeny of marker genes,
or on whole-genome nucleotide comparisons, or both) sometimes disagree
with the taxonomic signal of small bits of the genome's content.</p>
<p>Let me hasten to add: this is a well-known approach to looking at
horizontal gene transfer, and the only small bits of interesting
novelty here are (a) the scaling power of sourmash and (b) the
large-scale application.</p>
<p>Concerning my own initial question: at least some of the imprecise
classifications by sourmash are probably due to cross-genome shared
nucleotides, some of which may be contamination (and others of which
might be legitimate lateral gene transfer, plasmids, etc.) It's hard
to tell without digging in further, of course!</p>
<p>I think it's interesting to contrast compositional approaches like the
above with approaches like average nucleotide identity (ANI). ANI is a
good way to do a comparison of two (or more) genomes, but it's a bulk
measure that (like all bulk measures) elides details. A k-mer based
approach can detect compositional commonalities between genomes, but
of course has its own limitations. Using both seems like a good
opportunity!</p>
<p>I've tried analyses like this with the Genbank taxonomy, but because
that taxonomy isn't constructed using whole-genome comparisons, the
results are too messy for me to look into; I'm too likely to discover
that the problem is an incoherent taxonomic assignment of the whole
genome, rather than a smaller portion of the genome being confused. So
I'm really appreciating GTDB, which resolves a lot of these issues!</p>
<p>Donovan Parks made the excellent point to me in e-mail that many of
the exciting new taxa in the tree of life are based on species known
only from metagenome-assembled genomes, and so some contamination is
not unexpected. (See also
<a href="https://mbio.asm.org/content/10/3/e00725-19.abstract">"Composite Metagenome-Assembled Genomes Reduce the Quality of Public Genome Repositories", Shaiber and Eren, 2019</a>
and
<a href="https://www.biorxiv.org/content/10.1101/808410v2">"Accurate and Complete Genomes from Metagenomes", Chen et al., bioRxiv, 2019</a>
for some relevant discussions.) My interest, at least for the moment,
is in building tools to dig into this quickly and easily; we'll see
where that goes!</p>
<p>--titus</p>
<p>p.s. The full <code>oddities-k51.txt</code> is <a href="https://osf.io/n8vcg/">here</a>, and the full <code>oddities-k51.examine.txt</code> is <a href="https://osf.io/azyst/">here</a>.</p>
<p>p.p.s The command lines to generate the above files are in <a href="https://github.com/dib-lab/sourmash_databases/blob/master/gtdb/run-oddities.sh">this script</a>.</p>How does sourmash's lca classification routine compare with GTDB classifications?2019-12-31T00:00:00+01:002019-12-31T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2019-12-31:/blog/2019-sourmash-lca-vs-gtdb-classify.html<p>GTDB databases again!</p><p>Yesterday
<a href="http://ivory.idyll.org/blog/2019-sourmash-lca-db-gtdb.html">I posted</a>
about the
<a href="https://www.biorxiv.org/content/10.1101/256800v2">GTDB taxonomy</a>; we
are now providing prepared databases that can be used with sourmash's
taxonomy classification routines to classify genomes with GTDB.</p>
<p>The databases we posted are built from the dereplicated 25k GTDB
genomes distributed as part of the
<a href="https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz848/5626182">GTDB-Tk classification toolkit</a>,
and not the full 145k classifications in GTDB. So they are smaller
than they could be, and also potentially lower resolution. Moreover,
sourmash uses k-mers instead of amino acids, which may lead to
different classifications.</p>
<p>A good first question is, how well do classifications with <code>sourmash
lca classify</code> & 25k genomes compare to the full 145k classifications
in GTDB? This is basically a measure of generalizability - how
reliably can we infer the classifications of the 145k genomes from the
25k?</p>
<h2>Comparing <code>sourmash lca classify</code> on Genbank to GTDB</h2>
<p>I classified all 420k Genbank genomes using <code>sourmash lca classify</code>
with k=31, and I then wrote a script to compare the output to the GTDB
taxonomy. This involved some rather nasty identifier conversion which
sometimes failed, but we ended up with a good number of comparable
items:</p>
<div class="highlight"><pre><span></span><code>identifiers in gtdb only: 6901
identifiers in sourmash lca classify only: 247987
identifiers in both: 137185
</code></pre></div>
<p>So, for the comparisons below, we are using the 95% of GTDB
identifiers and 35% of sourmash identifiers that could be harmonized.
(The missing items are due to failed identifier munging, different
versions of Genbank, and my use of genbank-entire instead of RefSeq,
which is the source of the bulk of the sourmash-specific
identifiers.)</p>
<p>Of the 137,185 genomes in common, a straight-up comparison of
classifications gave the following:</p>
<div class="highlight"><pre><span></span><code><span class="n">same</span><span class="o">:</span><span class="w"> </span><span class="mi">79666</span><span class="w"> </span><span class="o">(</span><span class="mf">58.1</span><span class="o">%)</span>
<span class="n">different</span><span class="o">:</span><span class="w"> </span><span class="mi">57519</span><span class="w"> </span><span class="o">(</span><span class="mf">41.9</span><span class="o">%)</span>
</code></pre></div>
<p>The 58.1% identical number is reassuring, but 41.9% disagreement is
not great - what's going on here?</p>
<p>It turns out that, in almost all situations, sourmash <strong>agrees with</strong>
but is <strong>lower resolution than</strong> GTDB.</p>
<div class="highlight"><pre><span></span><code>different but consistent: 57498
rank: superkingdom / count: 201
rank: phylum / count: 36
rank: class / count: 176
rank: order / count: 94
rank: family / count: 2260
rank: genus / count: 54731
rank: species / count: 0
</code></pre></div>
<p>That is, 201 of the sourmash classifications stop at the superkingdom
level, 36 continue only to the phylum level, and so on. Fully 95.1% match
down to the genus level! And these sourmash classifications agree with
the GTDB taxonomy as far as they go - 57,498 of the 57,519 differing
classifications, or 99.96%.</p>
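<p>(Concretely, "different but consistent" means that the shorter lineage is a prefix of the longer one - the two classifications agree at every rank they both assign. This isn't how my actual comparison script is written, but the check boils down to something like this, with lineages as tuples ordered superkingdom to species:)</p>
<div class="highlight"><pre><span></span><code>def compare_lineages(sourmash_lin, gtdb_lin):
    if sourmash_lin == gtdb_lin:
        return 'same'
    n = min(len(sourmash_lin), len(gtdb_lin))
    if sourmash_lin[:n] == gtdb_lin[:n]:
        # lower resolution, but the same taxonomy as far as it goes
        return 'different but consistent'
    return 'inconsistent'
</code></pre></div>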
<p>What about the disagreements?</p>
<div class="highlight"><pre><span></span><code><span class="n">inconsistent</span><span class="o">:</span><span class="w"> </span><span class="mi">21</span>
<span class="w"> </span><span class="n">rank</span><span class="o">:</span><span class="w"> </span><span class="n">superkingdom</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">count</span><span class="o">:</span><span class="w"> </span><span class="mi">0</span>
<span class="w"> </span><span class="n">rank</span><span class="o">:</span><span class="w"> </span><span class="n">phylum</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">count</span><span class="o">:</span><span class="w"> </span><span class="mi">0</span>
<span class="w"> </span><span class="n">rank</span><span class="o">:</span><span class="w"> </span><span class="kd">class</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">count</span><span class="o">:</span><span class="w"> </span><span class="mi">0</span>
<span class="w"> </span><span class="n">rank</span><span class="o">:</span><span class="w"> </span><span class="n">order</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">count</span><span class="o">:</span><span class="w"> </span><span class="mi">0</span>
<span class="w"> </span><span class="n">rank</span><span class="o">:</span><span class="w"> </span><span class="n">family</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">count</span><span class="o">:</span><span class="w"> </span><span class="mi">0</span>
<span class="w"> </span><span class="n">rank</span><span class="o">:</span><span class="w"> </span><span class="n">genus</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">count</span><span class="o">:</span><span class="w"> </span><span class="mi">21</span>
<span class="w"> </span><span class="n">rank</span><span class="o">:</span><span class="w"> </span><span class="n">species</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">count</span><span class="o">:</span><span class="w"> </span><span class="mi">0</span>
</code></pre></div>
<p>So all 21 of the disagreements are at the genus level... whew.</p>
<p>The upshot is that <code>sourmash lca classify</code> seems to work pretty well
as a first-round classification system, and will only lead you astray
at the genus level (and even then only rarely). The species-level
accuracy could potentially be improved by using k=51 instead of k=31,
but that would probably decrease the number being identified, too.</p>
<h2>Comparing <code>sourmash lca classify</code> to GTDB-Tk</h2>
<p>The next question I had was: how does sourmash's computational
performance compare with the
<a href="https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz848/5626182">GTDB-Tk toolkit</a>,
the standard way to classify new genomes using the GTDB taxonomy?</p>
<p>Using the sourmash k=21 LCA database (https://osf.io/9d5rx/), I
analyzed 336 randomly chosen genbank genomes with both sourmash and
GTDB-Tk. As with the full comparison above, the results are pretty
comparable:</p>
<ul>
<li>if sourmash lca classify yields a species-level designation, it is
identical to what GTDB-Tk produces.</li>
<li>at k=21, sourmash lca classify will never disagree with GTDB-Tk. At
worst, it will fail to classify out to the species, genus, etc. level.</li>
</ul>
<p>But how did the compute compare?</p>
<p>sourmash lca classify takes about 2 minutes to compute the signatures
and 35 seconds to classify 336 signatures. GTDB-Tk takes about 2 hours
on the same genomes, using 8 threads.</p>
<p>sourmash lca classify used about 5 GB of RAM, compared to about 120 GB
of RAM for GTDB-Tk.</p>
<h2>Conclusions</h2>
<p>So, it seems like sourmash lca classify is a decent prefilter for
GTDB-Tk, and that if you need to classify a lot of genomes quickly,
you could start with sourmash and then use GTDB-Tk to focus in on the
ones that aren’t classified at the species level.</p>
<p>In summary,</p>
<ol>
<li>sourmash rarely disagrees with GTDB-Tk, and when it does, it's only
at the genus level.</li>
<li>sourmash often fails to classify genomes that GTDB-Tk does.</li>
<li>sourmash is faster and requires less memory than GTDB-Tk. Compute
efficiency is admittedly a focus of our project, so ...that's good?
:)</li>
</ol>
<p>Special thanks go to Taylor Reiter for suggesting that we look into
the GTDB taxonomy for sourmash, and Donovan Parks for corresponding with
me on various GTDB issues!</p>
<p>--titus</p>
<p>p.s. Here's the sourmash command I used to classify genomes:</p>
<p><code>sourmash compute -k 21,31,51 --scaled=1000 *.fna.gz
sourmash lca classify \
--query *.sig \
--db gtdb-release89-k31.lca.json.gz > lca-classify-all-k31.txt</code></p>
<p>p.p.s. To do the comparison, I ran our <a href="https://github.com/dib-lab/2019-sourmash-gtdb/blob/master/bulk-classify-sbt-with-lca.py">sourmash bulk classify</a> script and then <a href="https://github.com/dib-lab/2019-sourmash-gtdb/blob/master/bulk-csv-to-lineages-csv.py">converted the results into a lineage CSV</a>. I separately <a href="https://github.com/dib-lab/sourmash_databases/blob/master/translate_gtdb_gb_foo.py">converted the GTDB taxonomy file</a> to a lineage CSV, and then <a href="https://github.com/dib-lab/2019-sourmash-gtdb/blob/master/compare-bulk-lca-to-gtdb-entire.py">compared the two</a>. Do not try this at home, the scripts are ugly and require a lot of data that's only on our HPC at the moment :)</p>Sourmash LCA databases now available for the GTDB taxonomy2019-12-30T00:00:00+01:002019-12-30T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2019-12-30:/blog/2019-sourmash-lca-db-gtdb.html<p>GTDB databases!</p><p>I am happy to announce that we have made available prepared sourmash
taxonomy ("LCA") databases for release 89 of the
<a href="https://www.biorxiv.org/content/10.1101/256800v2">GTDB taxonomy</a>.</p>
<p>The databases are available for download from the Open Science
Framework in <a href="https://osf.io/wxf9z/">this project</a>. There are prepared
databases available for k=21, k=31, and k=51.</p>
<h2>What is the GTDB taxonomy?</h2>
<p>GTDB is a revised bacterial and archaeal taxonomy based on
phylogenetic relations between proteins from approximately 25k
genomes. You can read more about it
<a href="https://www.biorxiv.org/content/10.1101/256800v2">here</a>.</p>
<p>GTDB is an alternative to the NCBI taxonomy. It is used by (among
others) <a href="https://www.ebi.ac.uk/metagenomics/">MGnify</a>, the EBI
metagenomics resource.</p>
<h2>What is sourmash?</h2>
<p>Sourmash is a research platform and bioinformatics tool for searching
and analyzing genomes, based on a
<a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x">MinHash</a>-inspired
approach that allows genome similarity searches, genome containment
searches, and compositional analysis of k-mers in large sequence data
sets. You can read more about it
<a href="https://f1000research.com/articles/8-1006">here</a>.</p>
<h2>What do these databases let you do?</h2>
<p>There are three immediate uses for these databases:</p>
<ul>
<li>
<p>you can use the
<a href="https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-lca-classify"><code>sourmash lca classify</code></a>
routine (and other LCA commands) to do taxonomic classification of
genomes using the GTDB taxonomy. (See <a href="https://sourmash.readthedocs.io/en/latest/tutorials-lca.html">our tutorial on sourmash lca!</a>)</p>
</li>
<li>
<p>you can do compositional analysis of metagenomes using
<a href="https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-lca-summarize"><code>sourmash lca summarize</code></a>.</p>
</li>
<li>
<p>you can search for genomes in GTDB that are similar to genomes (or
metagenomes) of interest, using
<a href="https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-search"><code>sourmash search</code></a>
and
<a href="https://sourmash.readthedocs.io/en/latest/command-line.html#sourmash-gather"><code>sourmash gather</code></a>.</p>
</li>
</ul>
<h2>How much memory does sourmash need to use these databases?</h2>
<p>LCA databases take up less disk space than SBT databases, but are more
memory intensive. Using these databases requires about 5 GB of RAM.</p>
<p>--titus</p>
<h2>Appendix: How are these databases built?</h2>
<p>We use a fully automated snakemake workflow to build them,
<a href="https://github.com/dib-lab/sourmash_databases/tree/master/gtdb">here</a>. It
takes about 12 hours and under 100 GB of RAM to build the databases from the
genomes under <code>release89/fastani/database/</code>.</p>An initial report on the Common Fund Data Ecosystem2019-08-15T00:00:00+02:002019-08-15T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2019-08-15:/blog/2019-cfde-july-report.html<p>Helping DCCs get FAIRer</p><p>For the past 6 months or so, I've been working with a team of people on a project called the Common Fund Data Ecosystem. This is a targeted effort within the <a href="https://commonfund.nih.gov">NIH Common Fund</a> (CF) to improve the Findability, Accessibility, Interoperability, and Reusability - a.k.a. <a href="https://www.nature.com/articles/sdata201618">"FAIRness"</a> - of the data sets hosted by their Data Coordinating Centers.</p>
<p>(You can see <a href="https://dpcpsi.nih.gov/sites/default/files/CoC_May_2019_2.00PM_Data_Ecosystem_508.pdf">Dr. Vivien Bonazzi's presentation</a> if you're interested in more details on the background motivation of this project.)</p>
<p>I'm thrilled to announce that our first report is <a href="https://figshare.com/articles/2019-July_CFDE_AssessmentReport_pdf/9588374">now available!</a> This is the product of a tremendous data gathering effort (by many people), four interviews, and an ensuing distillation and writing effort with Owen White and Amanda Charbonneau. To quote,</p>
<blockquote>
<p>This assessment was generated from a combination of systematic review of online materials, in-person site visits to the Genotype Tissue Expression (GTEx) DCC and Kids First, and online interviews with Library of Integrated Network-Based Cellular Signatures (LINCS) and Human Microbiome Project (HMP) DCCs. Comprehensive reports of the site visits and online interviews are available in the appendices. We summarize the results within the body of the report.</p>
</blockquote>
<p>The executive summary is just under four pages, and the full report is about 30 - the bulk of the report document (another 100 pages or so) consists of appendices to the main report.</p>
<p>I wanted to highlight a few things about the report in particular.</p>
<h2>1. Putting your data in the cloud ...is just the start.</h2>
<p>This may be obvious to those of us in the weeds, but supporting long-term availability of data through the use of cloud hosting is only one of many steps. Indexing of (meta)data, auth and access, and a host of other issues are all important to spur actual data reuse.</p>
<h2>2. Just, like, talking with people is, y'know, really useful!</h2>
<p>We did a lot of interviewing and found out some surprising things! In partial reaction to <a href="http://ivory.idyll.org/blog/2019-nih-data-commons-update.html">our experience with the Data Commons</a>, we are taking a much lower key and more ethnographic approach to understanding the opportunities and challenges that <em>actually</em> exist on the ground. A lot of the good stuff in the report emerged from these interviews.</p>
<h2>3. Interoperability is contingent on the data sets (and processing pipelines) you're talking about.</h2>
<p>The I in FAIR stands for "Interoperability", and (at least in the context of the CFDE) this is probably the trickiest to measure and evaluate. Why?</p>
<p>Suppose, not-so-hypothetically, that you want to take some data from the GTEx human tissue RNAseq collection, and compare the expression of genes in that data with some data from the Kids First datasets.</p>
<p>At some basic level, you might think "RNAseq is RNAseq, surely you just grab both data sets and go for it", right?</p>
<p>Not so fast!</p>
<p>First, you need to make sure that the raw data is comparable - not all RNAseq can be compared, at least not without removing technical biases. (And I'm honestly not sure what the state of the art is around comparing different protocols, e.g. strand-specific RNAseq to generic RNAseq.)</p>
<p>Second, the processing pipeline used to analyze the
RNAseq data needs to be the same. Practically speaking
this means that you may need to reanalyze all of the raw data.</p>
<p>Third, you need to deal with batch effects. I'm again not actually sure how you do this on data from a variety of different studies.</p>
<p>Fourth, and more fundamental, you need to connect your sample metadata across the various studies so that you are comparing apples to apples. (Spoiler alert: this turns out to be really hard, and seems to be the main conceptual barrier to actual widespread reuse of data across multiple studies.)</p>
<p>There are some techniques and perspectives being developed by various Common Fund DCCs that may help with this, and I hope to talk about them in a future blog post. But it's just hard.</p>
<h2>4. Computational training is second on everybody's list.</h2>
<p>This is something that I first saw when a group of us were talking with a bunch of NSF Science and Technology Centers (STCs): when asked what their challenges were, everyone said "in addition to our primary mission, computational training is really critical." (This broad realization by the STCs led to two funded NSF supplements that are part of Data Carpentry's back story!)</p>
<p>We saw the same thing here - a surprising result of our interviews was the extent to which the Common Fund Data Coordinating Centers felt that computational training could help foster data use and reuse. I say "surprising" not in the sense that it surprised me that training could be important - I've been banging that drum for well over a decade! - but that it was so high on everybody's list. We only had to mention it - "so, what role do you see for training?" - to have people at the DCCs jump on it enthusiastically!</p>
<p>There are many challenges with building training programs with the CF DCCs, but it seems likely that training will be a focus of the CFDE moving forward.</p>
<h2>What's next?</h2>
<p>This is only an interim report, and we've only interviewed four DCCs - we have another five to go. Expect to hear more!</p>
<p>--titus</p>
<p>Brown, C. T., Charbonneau, A., & White, O.. (2019, August 13). 2019-July_CFDE_AssessmentReport.pdf (Version 1). figshare. <a href="https://doi.org/10.6084/m9.figshare.9588374.v1">doi: 10.6084/m9.figshare.9588374.v1</a></p>Comparing two genome binnings quickly with sourmash2019-07-23T00:00:00+02:002019-07-23T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2019-07-23:/blog/2019-comparing-binnings.html<p>Comparing two sets of MAGs, for fun and profit!</p><p>tl;dr? Compare and cluster two collections of 1000+ metagenome-assembled genomes in a few minutes with sourmash!</p>
<hr>
<p>A week ago, someone e-mailed me with an interesting question: how can we compare two collections of genome bins with <a href="http://sourmash.rtfd.io">sourmash</a>?</p>
<p>Why would you want to do this? Well, there's lots of reasons! The main
one that caught my attention is <em>comparing</em> genomes extracted from
metagenomes via two different binning procedures - that's where I
started almost two years ago,
<a href="http://ivory.idyll.org/blog/2017-comparing-genomes-from-metagenomes.html">with two sets of bins extracted from the Tara ocean data</a>. You
might also want to merge bins that were similar to produce a
(hopefully) more complete bin, or you could intersect bins that were
similar to produce a consensus bin that might be higher quality, or
you could identify bins that were in one collection and not in the
other, to round out your collection.</p>
<p>I'm assuming this is done by lots of workflows - I note, for example,
that the <a href="https://github.com/bxlab/metaWRAP">metaWRAP</a> workflow
includes a 'bin refinement' step that must do something like this.</p>
<p>I (ahem) haven't really read up on what others do, because I was mostly
interested in hacking something together myself. So here goes :).</p>
<h2>How do you compare two collections of bins??</h2>
<p>There are a few different strategies. My previous attempts were --</p>
<ul>
<li>
<p><a href="http://ivory.idyll.org/blog/2017-comparing-genomes-from-metagenomes.html">comparing two directories in bulk</a>, focusing on summary statistics;</p>
</li>
<li>
<p><a href="http://ivory.idyll.org/blog/2017-taxonomic-disagreements-in-tara-mags.html">reclassifying each bin set with the taxonomy from the other bin set</a></p>
</li>
</ul>
<p>In both cases, my conclusions ended with "wow, there are some real differences
here" but I never dug deeply into what was going on in detail.</p>
<p>This time, though, I had a bit more experience under my belt and I
realized that a fairly simple thing to do would be to cluster <em>all</em> of
the bins together while tracking the origin of each bin, and then
deconvolving the clusters so that you could dig into each cluster at
arbitrary detail.</p>
<h2>The basic strategy</h2>
<ol>
<li>
<p>Load in two lists of sourmash signatures.</p>
</li>
<li>
<p>Compare them all.</p>
</li>
<li>
<p>Perform some kind of clustering on the all-by-all comparison.</p>
</li>
<li>
<p>Output clusters.</p>
</li>
</ol>
<p>Conveniently, I had already implemented the key bits in a Jupyter
notebook about a year ago
(<a href="https://github.com/ctb/2017-sourmash-cluster/blob/6ea9e2161fe72e6b7e4865070b66ac02a3dec373/species-clustering.ipynb">here</a>),
so it was ready to go! I turned it into a command-line script called
<a href="https://github.com/ctb/2017-sourmash-cluster/blob/afdde27619c36432ee5426c8032e2a785bc57755/cocluster.py">cocluster.py</a>
and tested it out; on data where I knew the answer, it performed fine, grouping
identical bins together and grouping or splitting strain variants depending
on the cut point for the dendrogram.</p>
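<p>The heart of the script is only a handful of lines. Here's a minimal sketch of the approach (not cocluster.py itself), assuming the sourmash Python API and scipy; <code>first_files</code> and <code>second_files</code> are hypothetical lists of signature paths, and the linkage method and cut point are illustrative choices:</p>
<div class="highlight"><pre><span></span><code>from collections import defaultdict
import numpy as np
import sourmash
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# load both collections, remembering where each signature came from
sigs = [(path, origin, sourmash.load_one_signature(path, ksize=31))
        for origin, paths in [('first', first_files), ('second', second_files)]
        for path in paths]

# all-by-all distance matrix (1 - Jaccard similarity)
n = len(sigs)
dmat = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = 1.0 - sigs[i][2].similarity(sigs[j][2])
        dmat[i, j] = dmat[j, i] = d

# hierarchical clustering; cut the dendrogram at a chosen distance
Z = linkage(squareform(dmat), method='single')
labels = fcluster(Z, t=1.0, criterion='distance')

# deconvolve: group members by cluster, tracking each signature's origin
clusters = defaultdict(list)
for (path, origin, _), label in zip(sigs, labels):
    clusters[label].append((origin, path))
</code></pre></div>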
<p>You do have to run it on collections of already-computed signatures;
an example command line for cocluster.py is:</p>
<div class="highlight"><pre><span></span><code>cocluster.py --first podar-ref/?.fa.sig --second podar-ref/*.fa.sig -k 31
</code></pre></div>
<p>This version outputs a dendrogram showing the clustering, as well as a
spreadsheet containing the cluster assignments.</p>
<h2>Speeding it up</h2>
<p>The problem is, it's kind of slow for big data sets where you have to do millions of comparisons!</p>
<p>Since comparing N signatures against N signatures is inherently an N**2
problem, any work we can put into filtering out signatures at the front
end of the analysis will be paid back in serious coin.</p>
<p>So, I <a href="https://github.com/ctb/2017-sourmash-cluster/commit/114fabc0275e4f9deb55bfb0e4add3bd5860035b#diff-250fb0b09f82971f54bb26c454679ed7">added two optimizations</a>.</p>
<p>First, you can now pass in a <code>--threshold</code> argument that specifies, in
basepairs, roughly how many bp need to be shared by a signature from
the first list with <em>any</em> of the signatures in the second list. If this
threshold isn't met, the signature from the first list is dropped. The
same filter is then applied to each signature in the second list, with
respect to the first list.</p>
<p>Second, you can now downsample the signatures by specifying a
<code>--scaled</code> parameter. (Read more about this <a href="https://sourmash.readthedocs.io/en/latest/using-sourmash-a-guide.html#what-resolution-should-my-signatures-be-how-should-i-compute-them">here</a>.) The logic here is that if you're comparing
genomes, you probably don't really need to look at a high resolution to
get a rough estimate of what's going on. This optimization speeds up every
comparison done.</p>
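<p>A rough sketch of that first optimization (one direction only), assuming each signature's MinHash exposes <code>count_common()</code> and a <code>scaled</code> attribute:</p>
<div class="highlight"><pre><span></span><code>def prefilter(first_sigs, second_sigs, threshold_bp):
    # keep a signature only if it shares an estimated threshold_bp or
    # more with at least one signature from the other list
    kept = []
    for s in first_sigs:
        mh = s.minhash
        # estimated shared bp = shared hashes * downsampling rate
        best = max(mh.count_common(t.minhash) * mh.scaled
                   for t in second_sigs)
        if best >= threshold_bp:
            kept.append(s)
    return kept
</code></pre></div>
<p>Run symmetrically on both lists, this cheaply discards signatures that can't possibly share enough sequence with anything on the other side.</p>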
<p>Together, this made it straightforward to apply this stuff to scads of
genomes!</p>
<h2>More/better output</h2>
<p>Last but not least, I <a href="https://github.com/ctb/2017-sourmash-cluster/blob/1b8095722c890f3a43cd50ad40ab1da5717fb2c3/cocluster.py">updated the script</a> to output clusters, and provide summary output too!</p>
<h2>An example!</h2>
<p>Here is an annotated example of the complete workflow - this is done on the reference genome data set from <a href="https://www.ncbi.nlm.nih.gov/pubmed/23387867">Shakya et al., 2013</a>, which we updated in <a href="https://www.biorxiv.org/content/10.1101/155358v3">Awad et al., 2017</a>. This genome collection contains 64 genomes, some of which are strain variants of each other.</p>
<p>Briefly, after computing signatures, cocluster.py
calculates an all-by-all comparison of the two input collections, which results in a matrix like this (not currently output by cocluster.py) --</p>
<p><img alt="comparison matrix" src="https://raw.githubusercontent.com/ctb/2017-sourmash-cluster/master/podar-coclust/podar.cmp.matrix.png"></p>
<p>The dendrogram is then cut at some given phenetic distance - in this case I chose 1.8, based on
visual inspection of this next dendrogram:</p>
<p><img alt="dendrogram annotated with distances" src="https://raw.githubusercontent.com/ctb/2017-sourmash-cluster/master/podar-coclust/podar.coclust.dendro.png"></p>
<p>The cocluster.py script then outputs <a href="https://github.com/ctb/2017-sourmash-cluster/blob/master/podar-coclust/podar.coclust.csv">a cluster details CSV file</a> that lists all of the clusters and their members. (The clustered signatures themselves are also provided, along with singletons.)</p>
<p>And, finally, all of this activity is <a href="https://github.com/ctb/2017-sourmash-cluster/blob/master/podar-coclust/podar.coclust.log">logged</a> and <a href="https://github.com/ctb/2017-sourmash-cluster/blob/master/podar-coclust/podar.coclust.txt">summarized in the results output</a>:</p>
<div class="highlight"><pre><span></span><code>...
total clusters: 60
num 1:1 pairs: 56
num singletons in first: 0
num singletons in second: 0
num multi-sig clusters w/only first: 0
num multi-sig clusters w/only second: 0
num multi-sig clusters mixed: 4
</code></pre></div>
<p>The full set of commands is <a href="https://github.com/ctb/2017-sourmash-cluster/blob/master/podar-coclust/Snakefile">listed in this Snakefile</a>, and commands to repeat it are in the appendix below.</p>
<h2>Playing with real data</h2>
<p>Since both the Tully et al. and the Delmont et al. papers have been
published now, I first re-downloaded the published data and calculated
all the signatures for the 3500 or so genomes -- see the instructions
and <a href="https://github.com/ctb/2019-tara-binning2/blob/master/Snakefile">Snakefile</a> in <a href="https://github.com/ctb/2019-tara-binning2/">github.com/ctb/2019-tara-binning2/</a>.</p>
<p>Once downloaded, computing the signatures takes about 15 minutes, using
<code>snakemake -j 16</code>.</p>
<p>Then, I ran the cocluster script from https://github.com/ctb/2017-sourmash-cluster like so:</p>
<div class="highlight"><pre><span></span><code>./2017-sourmash-cluster/cocluster.py --threshold=50000 -k 31 \
--first ../data/tara/tara-tully/*.sig \
--second ../data/tara/tara-delmont/NON_REDUNDANT_MAGs/*.sig \
--prefix=tara.coclust --cut-point=1.0
</code></pre></div>
<p>This took about 2 minutes to run on my HPC cluster, and produced the
following output with a cut point of 1.0 (which is pretty liberal).</p>
<div class="highlight"><pre><span></span><code>...
total clusters: 2838
num 1:1 pairs: 331
num singletons in first: 1886
num singletons in second: 443
num multi-sig clusters w/only first: 42
num multi-sig clusters w/only second: 4
num multi-sig clusters mixed: 132
</code></pre></div>
<p>When I re-run it with a more stringent cut-point of 0.1, I get:</p>
<div class="highlight"><pre><span></span><code><span class="c">% ./2017-sourmash-cluster/cocluster.py --threshold=50000 -k 31 \</span>
<span class="w"> </span><span class="o">--</span><span class="n">first</span><span class="w"> </span><span class="p">.</span><span class="o">./</span><span class="n">data</span><span class="o">/</span><span class="n">tara</span><span class="o">/</span><span class="n">tara</span><span class="o">-</span><span class="n">tully</span><span class="o">/*</span><span class="p">.</span><span class="n">sig</span><span class="w"> </span><span class="o">\</span>
<span class="w"> </span><span class="o">--</span><span class="nb">second</span><span class="w"> </span><span class="p">.</span><span class="o">./</span><span class="n">data</span><span class="o">/</span><span class="n">tara</span><span class="o">/</span><span class="n">tara</span><span class="o">-</span><span class="n">delmont</span><span class="o">/</span><span class="n">NON_REDUNDANT_MAGs</span><span class="o">/*</span><span class="p">.</span><span class="n">sig</span><span class="w"> </span><span class="o">\</span>
<span class="w"> </span><span class="o">--</span><span class="n">prefix</span><span class="p">=</span><span class="n">tara</span><span class="p">.</span><span class="n">coclust</span><span class="w"> </span><span class="o">--</span><span class="n">cut</span><span class="o">-</span><span class="n">point</span><span class="p">=</span><span class="mf">0.1</span>
<span class="k">...</span>
<span class="n">total</span><span class="w"> </span><span class="s">clusters:</span><span class="w"> </span><span class="s">3520</span>
<span class="n">num</span><span class="w"> </span><span class="s">1:1</span><span class="w"> </span><span class="s">pairs:</span><span class="w"> </span><span class="s">43</span>
<span class="n">num</span><span class="w"> </span><span class="s">singletons</span><span class="w"> </span><span class="s">in</span><span class="w"> </span><span class="s">first:</span><span class="w"> </span><span class="s">2557</span>
<span class="n">num</span><span class="w"> </span><span class="s">singletons</span><span class="w"> </span><span class="s">in</span><span class="w"> </span><span class="s">second:</span><span class="w"> </span><span class="s">906</span>
<span class="n">num</span><span class="w"> </span><span class="s">multi-sig</span><span class="w"> </span><span class="s">clusters</span><span class="w"> </span><span class="s">w/only</span><span class="w"> </span><span class="s">first:</span><span class="w"> </span><span class="s">6</span>
<span class="n">num</span><span class="w"> </span><span class="s">multi-sig</span><span class="w"> </span><span class="s">clusters</span><span class="w"> </span><span class="s">w/only</span><span class="w"> </span><span class="s">second:</span><span class="w"> </span><span class="s">0</span>
<span class="n">num</span><span class="w"> </span><span class="s">multi-sig</span><span class="w"> </span><span class="s">clusters</span><span class="w"> </span><span class="s">mixed:</span><span class="w"> </span><span class="s">8</span>
</code></pre></div>
<p>Basically this means that:</p>
<ul>
<li>when doing stringent clustering, there are 3520 different clusters;</li>
<li>43 of the clusters provide a 1-1 match between bins from the Delmont and Tully studies;</li>
<li>2557 of the Tully signatures don't cluster with anything else;</li>
<li>906 of the Delmont signatures don't cluster with anything else;</li>
<li>there are 6 clusters that contain more than one Tully signature, and no Delmont signatures;</li>
<li>there are 0 clusters that contain more than one Delmont signature, and no Tully signatures;</li>
<li>8 of the clusters have more than two signatures and contain at least
one Tully and at least one Delmont signature.</li>
</ul>
<p>I'll dig into some of these results in a separate blog post!</p>
<p>--titus</p>
<h2>Appendix: repeating the podar analysis</h2>
<p>This workflow will take about 1 minute to run, once the software is installed.</p>
<p>To repeat the analysis of 64 genomes above (see <a href="https://github.com/ctb/2017-sourmash-cluster/tree/master/podar-coclust">output</a>), do the following.</p>
<div class="highlight"><pre><span></span><code># create a new conda environment w/python 3.7
conda create -y -c bioconda -p /tmp/podar-coclust \
python=3.7.3 sourmash snakemake
# activate conda environment
conda activate /tmp/podar-coclust
# grab the cocluster script and podar workflow
git clone https://github.com/ctb/2017-sourmash-cluster/
cd 2017-sourmash-cluster/podar-coclust
# clean out the existing files & run!
snakemake clean
snakemake -j 4 -p all
</code></pre></div>
<p>This last step will download the necessary files, compute the signatures, and run cocluster.py.</p>How to encourage participation in teleconferences2019-06-24T00:00:00+02:002019-06-24T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2019-06-24:/blog/2019-encourage-participation-teleconferences.html<p>Participation is good!</p><p>(and/or how to run effective teleconferences!)</p>
<p>I participate in a lot of teleconferences, and some of them aren't very participatory, for various reasons. Recently a good friend asked for suggestions on how to open up the phone calls, and I came up with the below ideas. What am I missing? What did I get wrong?</p>
<hr>
<p>First, post a meeting agenda with a medium amount of detail, well in advance (> 24 hours).</p>
<ul>
<li>Posting an agenda in advance gives people time to think about things, if they are interested.</li>
<li>The medium amount of detail (up to a paragraph) lets people understand what it’s about, see what the major issues/questions are, and think of questions or comments they may have.</li>
<li>If the agenda is posted > 24 hours in advance, you can reasonably expect people to have read it, and if people want to add things to the agenda <em>on the call</em> you punt them to the next call instead.</li>
</ul>
<p>Basically, if you spring a skeleton agenda on a group with < 3 hours to spare, no one will read it, and even when they do, they won't have room to dig into it.</p>
<hr>
<p>Second, assign duties to multiple people and rotate.</p>
<ul>
<li>Typical meetings need a timekeeper (keeping an eye on the agenda), a facilitator (keeping conversation moving), and a note taker (recording notes and action items). </li>
<li>Assigning these roles is less about authority and more about making sure someone has been given the responsibility.</li>
<li>It also means that at least three different “voices” are heard - two on the call, one in the notes - each time.</li>
<li>Rotating means that you’re not giving someone permanent authority, and also ensures that if someone isn't good at or dislikes one role, they’re not stuck on it. Nor do they necessarily escape practicing :)</li>
<li>Rotating also means that the convener or nominal authority is not always the person driving the conversation.</li>
<li>Having these roles means that at least three people will be engaged in the conversation, even if nobody else is :)</li>
</ul>
<hr>
<p>Third, pause after questions until the silence becomes slightly uncomfortable before proceeding.</p>
<ul>
<li>People who are hesitant to speak will need the time to come forward.</li>
</ul>
<p>(This is an approach that was taught to me during interview training at UC Davis!)</p>
<hr>
<p>Fourth, provide a respectful way for people to indicate they are ready to speak.</p>
<ul>
<li>e.g. type “hand” in chat, or Raise Hand in zoom.</li>
<li>this means it’s not just “first to interrupt” that gets to speak, which biases towards certain types of personalities;</li>
<li>institute a rule that only the facilitator and timekeeper get to interrupt without a ‘hand’ (or maybe not even them).</li>
<li>encourage people to post low-key/non-urgent questions in the chat.</li>
</ul>
<hr>
<p>You can also circulate a set of rules and suggestions for how to participate effectively. Belinda Weaver wrote up <a href="https://software-carpentry.org/blog/2017/11/online-meetings.html">this really great list</a> from the Carpentries - it's a great starting point!</p>
<p>What am I missing? What did I get wrong?</p>
<p>--titus</p>Using GitHub for janky project reporting - some code2019-05-15T00:00:00+02:002019-05-15T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2019-05-15:/blog/2019-github-project-reporting.html<p>We scripted GitHub for lightweight project reporting</p><p>For the <a href="http://ivory.idyll.org/blog/2019-nih-data-commons-update.html">NIH Data Commons</a>, we needed a way for 10 distinct teams to do reporting at the level of about 50-100 milestones per team, on a monthly basis.</p>
<p>Each team was already using different project management software internally, and we didn't want to require them to switch to something new. We also didn't need a lot of innate functionality in the project reporting system - basically, for each milestone we needed two statuses, "started" and "finished".</p>
<p>So we decided to go with something lightweight and simple that would support programmatic update and automated reporting: GitHub!</p>
<p>We chose to use GitHub for project reporting for several reasons. We were already using GitHub for content stuff, and everyone had accounts. We were also using GitHub for authentication control on static Web sites via a Heroku app.</p>
<p>So what we did was use the <a href="https://pygithub.readthedocs.io/en/latest/index.html">PyGithub package</a> to write a script to take the project milestones (which were all in a spreadsheet) and load them into GitHub issues. There was a label for "this task has been started", and when complete, the issue was just closed.</p>
<p>Each issue had some metadata associated with it (this was basically regexp-friendly fields like "id: XYZ") that linked it back to information in the spreadsheet. Other metadata such as the team that "owned" the milestone was layered on with GitHub labels.</p>
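<p>The issue-creation side of that pattern looks roughly like the sketch below. To be clear, this is a minimal stand-in for <code>update-milestones.py</code>, not the actual script, and the CSV column names here are made up:</p>
<div class="highlight"><pre><span></span><code># Minimal sketch of the "milestones to issues" direction, using
# PyGithub. Not the actual update-milestones.py; the CSV columns
# (id, title, description, team) are hypothetical.
import csv
import os

from github import Github

gh = Github(os.environ["GITHUB_TOKEN"])
repo = gh.get_repo("ctb/example-milestones")   # use your own repo here

with open("example-milestones.csv") as fp:
    for row in csv.DictReader(fp):
        # machine-readable metadata goes in the issue body, so the
        # reporting script can recover it later with a regexp
        body = "id: {}\n\n{}".format(row["id"], row["description"])
        repo.create_issue(title=row["title"], body=body,
                          labels=["team-" + row["team"]])
</code></pre></div>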
<p>We then wrote another script that extracted the issue statuses and output a reporting spreadsheet that we could send to the NIH on a monthly basis.</p>
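<p>And the reporting direction looks roughly like this - again a sketch rather than the real <code>milestones-gh-to-csv.py</code>, and the "started" label name is an assumption:</p>
<div class="highlight"><pre><span></span><code># Minimal sketch of the "issues to report CSV" direction.
# Illustrative only; the "started" label name is an assumption.
import csv
import os
import re

from github import Github

gh = Github(os.environ["GITHUB_TOKEN"])
repo = gh.get_repo("ctb/example-milestones")

ID_RE = re.compile(r"^id: (\S+)", re.MULTILINE)

with open("report.csv", "w", newline="") as fp:
    w = csv.writer(fp)
    w.writerow(["milestone id", "status"])
    for issue in repo.get_issues(state="all"):
        match = ID_RE.search(issue.body or "")
        if not match:
            continue                 # skip issues without our metadata
        labels = {label.name for label in issue.labels}
        if issue.state == "closed":
            status = "finished"
        elif "started" in labels:
            status = "started"
        else:
            status = "not started"
        w.writerow([match.group(1), status])
</code></pre></div>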
<p>(Luiz Irber wrote the first version of the scripts as a proof of concept, and then I took over expansion and maintenance as our needs evolved.)</p>
<p>Using GitHub in this way had a number of advantages, some of which were unexpected.</p>
<ul>
<li>
<p>The main advantage was that the user interface for viewing and updating statuses was super easy. Finding issues could be done via GitHub search (and eventually via our project search engine, <a href="http://nih-data-commons.us/centillion/">centillion</a>). Permalinks could be bookmarked, too.</p>
</li>
<li>
<p>Linking between GitHub issues worked nicely: when you put a link from an issue in some other repo to a milestone, a back-link was automatically provided on GitHub.</p>
</li>
<li>
<p>The statuses of milestones were accessible to everyone, i.e. visible across the project.</p>
</li>
<li>
<p>People from any team could watch a milestone they were interested in.</p>
</li>
<li>
<p>Comments and questions could be posted on milestones, and (potentially) could be provided in the monthly rollup.</p>
</li>
<li>
<p>The GitHub Web and project interface went through churn during our project, but the issue API was not affected, so our scripts kept on working.</p>
</li>
<li>
<p>Unlike built-in GitHub projects functionality, this works easily across multiple repositories AND multiple organizations.</p>
</li>
</ul>
<h2>What if we had not used GitHub?</h2>
<p>Within the project, there was some pushback. Most of the pushback amounted to "but we are already using System X, can't we just use that?" But there was no consensus on what to use! Since it was all scriptable, we were expecting to write some status importers (but didn't need to within the first phase of the project). It would have been easy to auto-update issue labels using GitHub project management bridges (and I think at least one group did that without involving us).</p>
<p>GitHub enabled everyone to see each other's milestone statuses without having to give permissions beyond existing GitHub project memberships. I don't know how we would have done that another way.</p>
<p>Because we used a lightweight, informal format with some simple scripts, we could update reporting formats and details quickly. If we'd used a heavier-weight and/or closed-source system, we might have had to put more time into configuration and/or bug workarounds.</p>
<p>GitHub is pretty scriptable, which came in really handy for wonky status update situations, or custom reports. I'm not sure how scriptable and well documented other issue tracking software is.</p>
<h2>So where's the code?</h2>
<p>I've extracted the core code to <a href="https://github.com/ctb/2019-dcppc-bot">github.com/ctb/2019-dcppc-bot</a>, and made a small running example!</p>
<p>There are two scripts, <a href="https://github.com/ctb/2019-dcppc-bot/blob/master/update-milestones.py"><code>update-milestones.py</code></a> and <a href="https://github.com/ctb/2019-dcppc-bot/blob/master/milestones-gh-to-csv.py"><code>milestones-gh-to-csv.py</code></a>. The first script parsed the big CSV file full of milestones, and updated the GitHub issues from it. The second script exports the GitHub issues and statuses for monthly reporting.</p>
<p><a href="https://github.com/ctb/2019-dcppc-bot/blob/master/update-milestones.py#L45"><code>create_issue_body_milestone</code></a> is what created / updated the actual issues.</p>
<p><a href="https://github.com/ctb/2019-dcppc-bot/blob/master/milestones-gh-to-csv.py#L125"><code>extract_report</code></a> built the milestone output reports, which were then output in the <a href="https://github.com/ctb/2019-dcppc-bot/blob/master/milestones-gh-to-csv.py#L196"><code>main</code> function</a>.</p>
<h2>Running stuff</h2>
<p>Create a token by going to GitHub settings, Developer Settings, Personal Access Tokens, Generate New Token.</p>
<p>Copy / paste the string into an environment variable (you'll need to replace the hex string with your own token).</p>
<div class="highlight"><pre><span></span><code><span class="k">export</span><span class="w"> </span><span class="n">GITHUB_TOKEN</span><span class="o">=</span><span class="s1">'a6161b3288894522b8930b67231d833295e7d5ba'</span>
</code></pre></div>
<p>Check that the token works and the repo exists (you'll want to replace <code>ctb/example-milestones</code> with a repository you have write access to!)</p>
<div class="highlight"><pre><span></span><code>./update-milestones.py update example-milestones.csv -f -vv -m ctb/example-milestones
</code></pre></div>
<p>Actually create the issues now, by parsing the <a href="https://github.com/ctb/2019-dcppc-bot/blob/master/example-milestones.csv"><code>example-milestones.csv</code> file</a></p>
<div class="highlight"><pre><span></span><code>./update-milestones.py update example-milestones.csv -f -vv -m ctb/example-milestones --change-github
</code></pre></div>
<p>This will create and/or update issues, e.g. like <a href="https://github.com/ctb/example-milestones/issues/1">ctb/example-milestones #1</a>.</p>
<p>Now, run a report:</p>
<div class="highlight"><pre><span></span><code>./milestones-gh-to-csv.py -m ctb/example-milestones example-milestones.csv
</code></pre></div>
<p>This will generate reports by team, e.g. <a href="https://github.com/ctb/2019-dcppc-bot/blob/master/report-team-White.csv"><code>report-team-White.csv</code></a> and <a href="https://github.com/ctb/2019-dcppc-bot/blob/master/report-team-Brown.csv"><code>report-team-Brown.csv</code></a>.</p>
<p>You can see the final set of issues <a href="https://github.com/ctb/example-milestones/issues?utf8=%E2%9C%93&q=is%3Aissue">here</a>.</p>
<h2>Was this a good idea?</h2>
<p>The project only ran for ~6 months in the end, and I would argue that scripting our own solution was a good investment of time and effort because of the flexibility it gave us. In particular, it let us iterate and converge on an approach that met the needs of the funders without unduly burdening the project managers.</p>
<p>In the long term, we might have tried to identify commercial software that had built-in visualization and exploration functionality. But I wouldn't have wanted to do that on the timeline we had for phase 1.</p>
<p>The code was hideous because it was all done really fast at the last minute before the first reporting period. Changes were done carefully, mostly by me, because I was the one who would suffer the most if we screwed up. If we'd brought our infrastructure engineer into the project earlier, I probably would have asked him to put the time into unit testing and so on, but the code was working well enough for us to just leave it be.</p>
<p>The general idea of using GitHub issues to surface milestone statuses across multiple teams and integrate with individual project trackers is pretty nice and open-sourcey.</p>
<p>The existing code ignores issues without metadata. So while we did not do this, you could salt "issues for reporting" into an existing repository full of issues, and extract info from just the reporting issues just fine.</p>
<p>So: in this case a quick hack worked out ok, and I'm not ashamed of it.</p>
<p>And maybe there are now better ways to do all this with GitHub Projects, but there weren't then :)</p>
<p>Last but not least: you should always be wary of writing code so that you can write code. Before you know it, maintaining your project management system could become someone's full time job... #yakshaving</p>
<p>--titus</p>
<p>Thanks to Luiz Irber for starting the project, and Charles Reid, Matthew Turk, and Tracy Teal for comments on a draft of this post!</p>Some questions and thoughts on journal peer review.2019-04-16T00:00:00+02:002019-04-16T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2019-04-16:/blog/2019-questions-and-thoughts-peer-review.html<p>What's up with current peer review practice?</p><h2>Can I use comments from other people's prior reviews when reviewing a submission to a new journal?</h2>
<p>I just had the dispiriting experience of receiving a paper to review from Journal B, that was unchanged from a prior submission to Journal A. The "dispiriting" part of the experience was that the paper was <em>completely</em> unchanged, despite a host of minor and major comments on the paper from all three reviewers for Journal A.</p>
<p>I ended up writing that I was disappointed that the authors had not seen fit to confront the bigger issues in any way, much less correct even the smallest and easiest errors; and then pasted in my previous review. What I <em>wanted</em> to do was paste in the expert reviews from the other two reviewers for Journal A, but I didn't feel like that was OK.</p>
<p>(If I get the paper back with some revisions, I'll reevaluate it in light of the Journal A reviews, too.)</p>
<p>I think the behavior of the authors is very questionable, too, and I hope they rethink this strategy. If your paper is desk-rejected by a hoity-toity journal without review, that's one thing; if reviewers put in hours of effort and give you detailed comments, you goshdarn well should put in an hour or two of your own time before resubmitting.</p>
<h2>Why don't all journals always send all the reviews to all reviewers?</h2>
<p>David Koslicki visited my lab yesterday, and I was reminded of the mash and MetaPalette situation from a few years back. Briefly:</p>
<p>I was a reviewer on both the mash paper (Ondov et al. (2016)) and the MetaPalette paper (Koslicki and Falush (2016)) and in my final review of MetaPalette I mentioned the mash paper enthusiastically. (Both were already up on biorxiv.)</p>
<p>At some point later on I sent David an e-mail to follow up on some suggestions I'd had, and we realized that he'd never received the text from my review of MetaPalette. He later told me that he thought that receiving my comments would have accelerated his research by a few months, by pointing him at a new area.</p>
<p>So why didn't mSystems send him the review text?!</p>
<p>(There are plenty of journals that are guilty of this.
Nature Biotech is one that I've noted in the past.)</p>
<h2>Isn't it irresponsible not to make some portion of the reviews public when the paper is published?</h2>
<p>Peer reviews often provide important context that can help people understand why the paper is important and interesting. It's fine and dandy to say that that should all be in the final paper, but that's a hard task and often papers are space constrained (...for some reason).</p>
<p>I think journals should make reviews public along with the article.</p>
<p>The biggest argument against this is that it might take some work by someone to properly adjust reviews for fixes from earlier versions. A short term fix might be to have a box for "this is the part of the review that I would like to make public if this paper is accepted".</p>
<h2>Why don't journals behave as if reviews belong to the reviewer?</h2>
<p>I no longer review for PNAS, because they started including a provision that I couldn't make any part of my review available in any form, even anonymously. I can understand that they don't want reputation laundering (e.g. <a href="http://ivory.idyll.org/blog/2013-review-howison.html">my previous behavior in posting reviews</a>, which boosts my own reputation while also being <a href="https://www.americanscientist.org/article/open-science-isnt-always-open-to-all-scientists">a sign of my own privilege</a>), but I see little harm in allowing it to be posted anonymously.</p>
<p>Journals sure are proprietary about work they didn't pay for. That's a bigger theme here, I guess :)</p>
<h2>There is no conclusion other than that peer review seems really broken.</h2>
<p>Anyway. Those are my ranty off the cuff comments for today.</p>
<p>--titus</p>Things to think about when developing shotgun metagenome classifiers2019-04-11T00:00:00+02:002019-04-11T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2019-04-11:/blog/2019-developing-metagenome-classifiers.html<p>Thoughts on goals and tradeoffs in classifying shotgun metagenome data.</p><p>So I was talking to someone about how we think about benchmarking and
developing <a href="https://sourmash.readthedocs.io">sourmash</a>, and then it
got long and kind of interesting, so I decided to write it up as a
blog post.</p>
<p>(I asked Luiz Irber for comments, and he had the best reaction ever:
"many feels, no time to write them, mostly agree, publish")</p>
<hr>
<p>When benchmarking, often people end up comparing <em>their</em> tool to tools developed to tackle different problems. To no big surprise, the first tool ends up winning out.</p>
<p>Here are questions that we asked ourselves, or decisions we made implicitly, when developing sourmash. Many of these have direct or indirect implications for benchmarking.</p>
<p><strong>Are we developing a library, a command line application, or a Web site?</strong> It's hard to do more than one at a time well. We've decided to focus on command line with sourmash, as a light wrapping around a Python library (which was a light wrapping around a C++ library, and will soon be a light wrapping around a Rust library). I think after 3 years we've reached a level of maturity where we could also support a Web site (but we don't really have the focus in the lab to do a good job of it, and would prefer to support someone else if they want to do it).</p>
<p><strong>How sensitive to coverage do we want to be?</strong> Phillip Brooks showed that sourmash is really specific and very sensitive, until you have fewer than (approximately) 10,000 reads from your genome of interest. Once your data set has fewer than 10,000 reads from a genome in it, we can't really detect that genome. (This is of course a tradeoff in terms of speed, underlying approach, database size, etc., and we're happy with that tradeoff.)</p>
<p><strong>Do we envision our tool being used in isolation, or as one part of an exploratory pipeline?</strong> We are firmly in the camp of using sourmash to do hypothesis generation, following which more compute intensive approaches are probably appropriate. For example, sourmash can tell you <em>which</em> known species are in your metagenome, but we haven't focused too much on assessing <em>how much of those species' genomes</em> are there - after all, that's (fairly) easy to do once you narrow down the list of possible genomes. And again, there are tradeoffs with many of the other design considerations below. But if we wanted to have a single software package that did everything we would design it differently (and it would be a lot harder, since you'd probably want to use multiple methods).</p>
<p><strong>Do we envision our tool being used by programmers?</strong> We really like having scriptable tools in our lab. That means the tool has decent command line behavior, has a high level Python API, and consumes and emits standard formats. This may not be what everyone wants to focus on though!</p>
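<p>As a taste of what "scriptable" means here, below is a tiny sketch using the MinHash class from the sourmash Python API - method names as I remember them from the 2.x-era API, so check the docs for your version:</p>
<div class="highlight"><pre><span></span><code># Tiny sketch of the high-level Python API mentioned above: build two
# MinHash sketches and estimate their similarity. API names as of the
# sourmash 2.x era -- check the docs for your version.
import sourmash

# scaled=1 keeps every k-mer hash, which is sensible for a toy string;
# real genomes typically use something like scaled=1000
mh1 = sourmash.MinHash(n=0, ksize=31, scaled=1)
mh2 = sourmash.MinHash(n=0, ksize=31, scaled=1)

mh1.add_sequence("ATGGCATTAACGCATCGGACTAGCATCGAGCTACGGCAT")
mh2.add_sequence("ATGGCATTAACGCATCGGACTAGCATCGAGCTACGGCAT")

print("estimated similarity:", mh1.similarity(mh2))
</code></pre></div>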
<p><strong>Do we care about speed?</strong> Premature optimization can make your codebase ugly and complex. We've chosen (for now) to instead go with a fairly simple code base, which we then test the bejeezus out of. It supports optimization (Luiz Irber has done some amazing things with a profiler :) but we are against trading simple code for speed, because this is a research platform.</p>
<p><strong>What are our desired memory, disk, and time performance metrics?</strong> Do we care about one over the other? In general, we have chosen to prioritize low memory over performance, and performance over low disk space. But this isn't clear cut, and depends a lot on what methods we find interesting and implementable.</p>
<p><strong>What's our desired database resolution?</strong> Do we want ALL the genomes? Or just some genomes? We made the decision with sourmash to go for ALL the genomes. This causes problems when you think about the next few questions...</p>
<p><strong>What's our desired taxonomic resolution?</strong> We implicitly settled on strain-level resolution as our goal for <code>sourmash gather</code>, largely because of the algorithm we chose. (It works quite well for that!) But, unsurprisingly, <code>sourmash gather</code> performs quite poorly when looking at organisms from novel genera and families. It's actually quite hard to do both well.</p>
<p><strong>Who updates the database?</strong> And is it easy and straightforward to build new databases, or not? We worked hard on a friendly and flexible database building toolchain, because we expect new genomes to come out on a (very) regular basis (and we wanted to include them in our databases, based on our desired resolution).</p>
<p><strong>Do we want to support private databases, or not?</strong> We really like the idea of people using our tool in the privacy of their own lab to search their own collections of genomes. This means that we need to forego certain requirements (e.g. an NCBI taxid).</p>
<p><strong>Do we want a big centralized database, or not?</strong> One of our big concerns about models for database distribution & update that require one massive database, that can only realistically be updated by one group, is that they tend to go stale over time (as the group loses interest, etc.) Maintenance is not the strong suit of academic researchers :). So Luiz Irber has been working on IPFS and Dat-based models for database decentralization. This will (soon) permit incremental database updates without massive database download, among other things.</p>
<p><strong>What's our publication model?</strong> Do we want others to use our software for cool things? Or are we trying to publish our own innovative methods that we try and get into high impact-factor journals? Are we building a platform for others to build their own tools? Are we playing around with different methods and ideas and so on? We're not particularly interested in high-impact factor journals for sourmash, and we have a surprising number of people just using it do their own thing, so we've opted for providing citation handles via JOSS and F1000Research.</p>
<p><strong>How do we decide what functionality belongs in sourmash?</strong> Did we have explicit use cases that we decided up front? Or do we <a href="https://github.com/dib-lab/sourmash/issues/208">discover them as we go</a>? I'm much more comfortable doing iterations, finding users, and waiting for inspiration to strike, than I am in planning out sourmash years in advance. But then again, I'm an academic researcher and this fits our needs; we're not trying to serve a particular community.</p>
<p><strong>What's our contribution model?</strong> Are you interested in supporting collaboration and community development? Or are you interested in limiting external contributions to potential use cases? We are interested in both, but it adds a certain level of chaos and coordination challenges to the situation.</p>
<p>--titus</p>Our submission for the NHGRI Human Genome Reference Center call2019-04-10T00:00:00+02:002019-04-10T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2019-04-10:/blog/2019-nih-hgrc-proposal.html<p>We wrote a big grant proposal!</p><p>For the past month, I've been consumed in writing and submitting a grant for the NHGRI <a href="https://grants.nih.gov/grants/guide/rfa-files/rfa-hg-19-004.html">Human Genome Reference Center</a> funding opportunity. This is a planned $12.5m / 5 year effort to coordinate the new <a href="https://www.genome.gov/pages/about/nachgr/september2018agendadocuments/sept2018council_hg_reference_program.pdf">Human Genome Reference Program</a> (also see <a href="https://www.genome.gov/pages/research/sequencing/humangenomereferenceprogram/hgrp_webinar_faq.pdf">the Frequently Asked Questions</a>).</p>
<p>We submitted this grant proposal a week ago Tuesday! I joined with three collaborators on this grant proposal: <a href="https://curoverse.com/">Curii</a>, <a href="https://jimb.stanford.edu/giab">Genome in a Bottle</a>, and <a href="http://arep.med.harvard.edu/">the Church Lab</a>. We also partnered with the Personal Genome Project and the Open Humans platform.</p>
<p>Since we're all open-science-y people, we agreed to make the grant public after submission. I was thinking about how best to present it in a blog post, but then I remembered that grants are supposed to stand on their own with respect to the RFA. So ...here it is, with only a little bit of organization to make it more approachable!</p>
<p>The HGRC call asked for what was in effect one R01 and two R21 grants, along with another R01-sized grant on top. The first R01 was Component 1, a 12 page section discussing how the center would maintain, improve, and provide the Human Genome reference. The first R21 was a 6 page Component 2, describing the community outreach plans of the center, to do training and gather feedback. The second R21 was the 6 page Component 3, describing the logistical coordination of the rest of the Human Genome Reference Program (running meetings, providing materials, etc.) And the last R01 was an overarching summary of the three components, 12 pages in length.</p>
<p>The end PDF submission was over 300 pages in length. Good fun...</p>
<p>One last comment before I provide the links: just like reading someone else's submitted dissertation, your sole responsibility in reading someone else's ALREADY SUBMITTED GRANT is to make nice noises, like "Hey, that's great congrats on submitting it!" and "There are some great ideas in there!" You don't say "ooh, look, a typo on p3! (How unprofessional! Sucks that you can't fix it now!)" or "Gosh I would have written that completely differently." Basically, you should just be nice - we're going to go through a NIH review panel experience, and I'm sure they'll be properly critical :)</p>
<h2>The Actual Grant</h2>
<p><a href="https://github.com/dib-lab/2019-nih-hgrc-grant/blob/master/HGRCOverall.ResearchStrategy.pdf">Research Strategy - OVERALL</a> - a high level overview of the thing.</p>
<p><a href="https://github.com/dib-lab/2019-nih-hgrc-grant/blob/master/HGRCProject1.ResearchStrategy.pdf">Research Strategy - Component 1: Maintain, Improve, Provide the Human Reference Genome</a> - check out our cool validation strategy with Genome-in-a-Bottle-like data sets!</p>
<p><a href="https://github.com/dib-lab/2019-nih-hgrc-grant/blob/master/HGRCProject2.ResearchStrategy.pdf">Research Strategy - Component 2: Do Community Outreach and Needs</a> - here we proposed not just doing outreach but also building a community of practice!</p>
<p><a href="https://github.com/dib-lab/2019-nih-hgrc-grant/blob/master/HGRCProject3.ResearchStrategy.pdf">Research Strategy - Component 3: Provide Logistical Coordination for Human Genome Reference Program</a> - here we added a standardization effort!</p>
<p>Enjoy!</p>
<p>--titus</p>News from the NIH Data Commons Pilot Phase Consortium2019-04-09T00:00:00+02:002019-04-09T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2019-04-09:/blog/2019-nih-data-commons-update.html<p>The NIH Data Commons Pilot Phase Consortium is dead! (Long live the NIH Data Commons!)</p><p>You may recall that about a year and a half ago, <a href="http://ivory.idyll.org/blog/2017-commonspilot-kickoff.html">I got involved</a> in the <a href="https://commonfund.nih.gov/commons">NIH Data Commons</a>.</p>
<p>Between then and now, we built <a href="https://public.nihdatacommons.us/ProjectExecutionPlan/">a project execution plan</a>, <a href="https://public.nihdatacommons.us/deliverables/">ran Phase 1 for six months</a>, and then in October took a planned work moratorium for the purpose of doing future planning.</p>
<p>Then, in February, we received word that the NIH Data Commons Pilot Phase Consortium (DCPPC) would not continue in its current form. Here's what we received:</p>
<blockquote>
<p>The NIH <a href="https://datascience.nih.gov/">Office of Data Science Strategy</a> has been asked to lead the next phase of trans-NIH data ecosystem development as described in the NIH <a href="https://datascience.nih.gov/strategicplan">Strategic Plan for Data Science</a>. The deliverables from the DCPPC will inform next steps, but we will not pursue a second phase of the DCPPC. New initiatives may emerge from the ODSS and/or from the ICs in response to the Strategic Plan, but they will communicate their plans as they are established.</p>
</blockquote>
<p>My award finished at the end of March, and I thought it would be a good time to update y'all (especially since I've been receiving questions!)</p>
<h2>What did the NIH Data Commons Pilot Phase Consortium achieve?</h2>
<p>I think we achieved quite a lot in our fairly short stint! (And there's a fair amount of public material that was made available as part of it, although it's not well advertised.)</p>
<p>I'm going to focus on things my team helped with, because that's what I know best. There were lots of technical prototypes as well, but those were produced by other teams and are not mine to discuss. (See <a href="https://public.nihdatacommons.us/deliverables/">the list of deliverables and their reviews for more info</a>. Happy to connect you to the authors if you're interested - drop me a line at ctbrown@ucdavis.edu.)</p>
<p>First off, here is the <a href="https://public.nihdatacommons.us/">top link to the public site that we created for the end of the first Pilot Phase</a>. There are links and documents in there that I continue to find useful, and expect to find useful for many years to come.</p>
<p>I'm particularly happy with how the <a href="http://nih-data-commons.us/use-case-library/">Use Case Library</a> effort was proceeding. I think we set a good path for collaboratively developing use cases for Phase 2, and even without a Phase 2 I will be making use of this approach and this material for other projects.</p>
<p>The Centillion search engine that my team built was pretty cool!! <a href="https://public.nihdatacommons.us/KC9_Centillion/">See the October writeup of it, here</a> and also the public GitHub page, <a href="https://github.com/dcppc/centillion/">here</a>.</p>
<p>The <a href="https://public.nihdatacommons.us/OnCommonsing/">"On Commonsing"</a> document we wrote up after a workshop on "Data Commonses" is something that I will be coming back to regularly!</p>
<p>People interested in pragmatic standards development might be interested in <a href="https://public.nihdatacommons.us/WhyAreMultipleStacksNecessary/">Why Multiple Stacks are Necessary</a>.</p>
<p>I continue to think the <a href="https://fairshake.cloud/">FAIRshake</a> portal is unreasonably cool... check out <a href="https://fairshake.cloud/project/">the projects</a>.</p>
<p>Personally, I learned a lot about interoperability and creating and growing community from this experience, and I think the same is true of most of the other participants. Completely apart from the technical and infrastructure efforts, the coordination and community aspects of this Pilot Phase seem likely to have long-term positive impacts on how many of us deal with these kinds of projects in the future.</p>
<h2>So what's next?</h2>
<p>I'm not sure!</p>
<p>I think it's fair to say that the problems the NIH Data Commons effort was tackling are not going away (you can see more about these problems in <a href="https://osf.io/58uef/">my talk slides</a> from <a href="https://www.dtls.nl/2018/08/27/c-titus-brown-keynote-speaker-at-dtl-communitieswork-2018/">my 2018 talk at the Dutch Techcentre for Life Sciences</a>). And the NIH and broader biomedical research community will certainly be working on many things in this area. And I may not be involved but I'm sure to have opinions. So, stay tuned!</p>
<p>--titus</p>Critically assessing open science - the CAOS meeting.2019-04-08T00:00:00+02:002019-04-08T00:00:00+02:00C. Titus Browntag:ivory.idyll.org,2019-04-08:/blog/2019-themes-caos-open-science.html<p>A summary of the CAOS open science meeting</p><p>The "Critical Assessment of Open Science" meeting, or CAOS, was
convened by Sage Bionetworks in New Orleans in early February. About
30 open science practitioners and advocates were invited by Sage to a
day long meeting in New Orleans to consider the last 10 years of
progress and failures in open science. The meeting was attended by
scientists, policy experts, funders, and others. While the emphasis
was on the biosciences, many themes were discussed in a broader
context of all of science.</p>
<p>You can read more about the motivation for the meeting, and see a
series of summary blog posts,
<a href="http://sagebionetworks.org/in-the-news/a-critical-assessment-of-open-science/">here</a>.</p>
<p><em>This</em> post is my attempt to summarize the entire meeting, based on notes
I took during the meeting.</p>
<hr>
<p>The meeting was organized in a series of
<a href="https://en.wikipedia.org/wiki/Call_and_response_(music)">"call and response"</a>
engagements, in which two participants "called" for 5 minutes to one
of five broad themes, and then a responder summarized, contextualized,
and responded to their call. There were multiple such calls &
responses in each session, for about 5 sessions. Audience
participation was lively!</p>
<p>The meeting was held under
<a href="https://www.chathamhouse.org/chatham-house-rule">Chatham House rules</a>,
so below I am reporting <em>my</em> takeaways without reference to specific
individual comments or revealing details. There should be some form of
publication output in the future so you can see who attended and get a
more global view of the meeting; I'll link to that below when it is
out.</p>
<p>Thank you to Sage Bionetworks for coordinating this meeting & inviting me!</p>
<h2>Main themes that emerged (for me)</h2>
<p>We hoped that open science would lead to new and better practices; what
we too often got was practices that fed into the same broken system.</p>
<p>As the value of analytics and data becomes ever more apparent, there
is ever more commercial interest in capturing that value
in closed systems. Often, the data creators and/or owners seem to be
unaware of this capture, especially when the data is secondary to
their primary mission (e.g. in universities). This lack of awareness
Has Consequences.</p>
<p>Governance and sustainability of open institutions (especially open
source projects) is on a lot of people's minds. Sage has a large
team focused on this! (<a href="https://twitter.com/wilbanks">John Wilbanks</a> says "call me!")</p>
<p>We talked a fair bit about the challenge of convincing individuals and
groups that <strong>increased opportunity for unpredictable serendipity</strong>
was worth giving up <strong>predictable (but smaller) gains in
fame/power/money</strong>.</p>
<p>The invisibility of successful "open" came up repeatedly - the modern
data science ecosystem is built on R and Python, preprints in the life
sciences, open & FAIR data, and open source especially. That
successful open practices achieve near instant adoption is wonderful;
that they are not highlighted as successes of open in the open science
community is unfortunate; and their invisibility means that their
sustainability is often not strongly considered.
(You can see <a href="http://sagebionetworks.org/in-the-news/recognizing-the-successes-of-open-science/">a longer blog post by me on this topic, here.</a>)</p>
<p>It was great to see multiple statements about how the idea of one
consortium/community building THE platform for analysis in an area was
a non-starter. Functional interoperability, collaboration, and
ecosystem thinking within and across platforms is seen as critical,
even by the most senior researchers.</p>
<p>In concert with that, I see that every functional system is a
compromise between various requirements and design
considerations. Therefore building multiple differently functioning
systems is a good ecosystem bet.</p>
<p>Several different people referred to the <strong>increased attack surface</strong>
that open practices offer: e.g. by making your methods and data open,
you increase the ability of others to attack your conclusions. While
this is an important aspect of open science, it is also something that
discourages everyone, with disproportionate negative impact on already
marginalized populations. Sharing within "club" structures, or gated
communities, was seen as one possible solution.</p>
<p>We noted the need for & challenge of placing "do no harm" restrictions
on use and reuse of data; community codes of conduct were discussed as
one example of a governance structure that (combined with
not-entirely-open communities) could enforce such restrictions.</p>
<p>Diversity and inclusion was a frequently mentioned topic. Lack of
diversity in communities can be seen as empirical evidence of missing
structure in communities that is not clearly visible from within; I
think this is important when it comes to formal governance discussions
that can externalize internal culture (hopefully accurately).</p>
<p>Another interesting theme was the extent to which some saw that
grassroots communities of practice could be an antidote to the
<a href="https://en.wikipedia.org/wiki/The_Monkey%27s_Paw">"monkey's paw"</a> or "shitty genie" of requirements generation. Often,
engineers building infrastructure want detailed use cases and
requirements specification, which then leads to the wrong thing being
built (and the associated blame), while if the engineers are brought
into the community of practice they are more likely to build the right
thing due to shared understanding and iterative/continuous
participation.</p>
<p>The challenge of analyzing all the interesting data sets was
frequently mentioned. While not discussed at the meeting, in my view,
training is a way to bring prepared minds and hands to tackle the
analysis of interesting data sets. This training needs to be built in
rather than bolted on to projects, however.</p>
<h2>My own POV: the critical role of communities of practice</h2>
<p>Again and again, I saw that communities of practice presented a key
ingredient to solutions for problems in governance, training,
infrastructure, methods, etc. Communities of practice bring the people
to the problems! Fundamentally, I think open systems do not work
without a community of practice underpinning them.</p>
<p>Creating, growing, and sustaining these communities is, I think, one
of the most important tasks to be tackled. More on that as I have
time to write.</p>
<h2>Concluding thoughts</h2>
<p>One of the organizers closed out the meeting by asking everyone to
highlight one theme that surprised and/or dismayed them. This was a
productive if depressing way to extract essential takeaways!</p>
<p>"The cavalry isn't coming." One of the more sobering conclusions from
this part of meeting was that, given the seniority of the people in
the room, we had no one but ourselves to blame for failing at open in
the next decade. If we couldn't figure out how to coordinate and
incentivize open, then it was unlikely that someone else would step in
to help us out. <strong>We are the cavalry.</strong> (And existing, closed,
institutions are more resilient than we realized.)</p>
<p>Consumers are often very happy to trade data for convenience. This is a
challenge for open!</p>
<p>Open science can be weaponized by opponents of science, e.g. reproducibility
challenges can lead to the conclusion that all science is wrong; there
are many politicians eager to attack science. The dangers of further
delegitimizing science in the eyes of the world are real!</p>
<p>While scientists always start in and often revert to competitive mode,
they can also switch to cooperative mode with ease, given the proper
incentives and structure. (I personally recommend reading Kathleen
Fitzpatrick's book <a href="https://www.amazon.com/Generous-Thinking-Radical-Approach-University/dp/1421429462">Generous Thinking</a>, which focuses on this issue!)</p>
<p>A generational (?) concern was that DIY biology will eat all of biology,
and that this meeting could be viewed as a bunch of PDP-11 engineers
discussing the intricacies and importance of time sharing system design.
I personally think millennials are more sophisticated about data ownership,
more invested in sharing (and more sophisticated about its tradeoffs), and
are likely to seriously upset current apple carts, but I'm an optimist :).</p>
<p>There was a repeated concern that open biomedical science <em>has</em> to
translate into better outcomes, and a shared concern that open science
is an ideology built on practices that don't really work 80% of the time.</p>
<p>My own (depressing) conclusion was that it is not possible for open to
be truly open, and that completely open institutions are extremely
vulnerable to attack (for my previous thoughts on this in open source
projects, see
<a href="http://ivory.idyll.org/blog/2018-how-open-is-too-open.html">"How open is too open?"</a>). There
are gates that must be kept (hodor)! I'll expand on this theme in
another blog post when I have time!</p>
<hr>
<p>In general, I'm happy to expand on themes as time permits, if people
have questions!</p>
<hr>
<p>Immediately after writing this, I happened to revisit Denisse
Alejandra's article,
<a href="https://medium.com/@denalbz/reimagining-open-science-through-a-feminist-lens-546f3d10fa65">"Reimagining Open Science Through a Feminist Lens"</a>,
and I was encouraged by the overlap and relevance of a lot of what was
discussed at the CAOS meeting to this reimagination!</p>
<p>--titus</p>Sustaining open source: thinking about communities of effort2019-03-02T00:00:00+01:002019-03-02T00:00:00+01:00C. Titus Browntag:ivory.idyll.org,2019-03-02:/blog/2019-communities-of-effort.html<p>Thinking about how to sustain open source.</p><p>I just finished a day at the SIAM CSE 2019 conference, where I gave a talk
as part of a mini-symposium on software sustainability (<a href="http://ivory.idyll.org/blog/2018-siam-abstract.html">my abstract</a>,
and <a href="https://osf.io/2gzhy/">my talk slides</a>; see the <a href="http://ivory.idyll.org/blog/tag/cpr.html">'cpr' tag</a> for all my recent blog posts on this topic.)</p>
<p>When I was outlining the talk, I spent a fair amount of time noodling
about how I wanted to approach the subject. I have a lot of
disorganized thoughts that I think can be put together in interesting ways,
but for a 20 minute talk, I really needed to pick a narrow focus.</p>
<p>Here's what I ended up with. I'm curious for reactions!</p>
<h2>Defining a term, "communities of effort"</h2>
<p>I'll start by defining "communities of effort" as a community formed in
pursuit of a common goal. The goal can be definite or indefinite in
time, and may not be clearly defined, but it's something that (generally
speaking) the community is aligned on.</p>
<p>The term "effort" here refers to <a href="http://ivory.idyll.org/blog/2018-labor-and-engaged-effort.html">focused or engaged attention</a>,
and in this sense in particular, I mean the focused attention applied
towards the common goal.</p>
<p>One rational goal of such a community is to achieve the goal without
wasting effort through duplication or redundancy in work. This connects with
my earlier blog post on <a href="http://ivory.idyll.org/blog/2018-anti-sisyphean-league.html">the open source anti-Sisyphean League</a>, a
term coined by Cory Doctorow: the idea is that there are a number of
rocks to be rolled up hills, and (in an open community) there is no
reason for people to roll those rocks up the hill independently, since
they can take advantage of each other's efforts.</p>
<p>This community of effort directs itself towards achieving the goal,
applying the available effort to the task. Here, effort is a <em>finite</em>
resource that is consumable - you cannot apply the same effort to more
than one task, and the effort that is applied towards one task is not
available to be applied to another task. (Of course, the available
effort can be <em>renewed</em> or <em>increased</em> - more on that later.)</p>
<h2>Effort as a common pool resource</h2>
<p>The trickiest and most uncertain link is this: I think that the effort
applied towards the common goal is, to some extent, directed by the
community. That is, the available effort - which consists of work
by individuals towards the collective goal - is at the very least
loosely coordinated with the community, if not coordinated more closely.</p>
<p>(This may be because the community needs to be involved in order to
decrease redundancy. Not sure.)</p>
<p>If this is true - that effort is coordinated by the community rather
than the individual, and so is non-excludable; and that it is a finite,
consumable resource, and thus rivalrous - then effort becomes a common
pool resource.</p>
<p>Common pool resources are well known to anyone who has heard of the
tragedy of the commons: they are resources that are subject to this
tragedy, of being consumable by many in an unregulated way.</p>
<h2>What are some examples of these "communities of effort"?</h2>
<p>A prime example is open source projects like Python. They're rooted
in a community approach; they're not not run by a corporation or a
government agency; and any structure (like a nonprofit) is created
after they already exist (and usually after they are successful!)</p>
<p>I think the Carpentries training community is another good example.
This is a community of people interested in teaching and training in
data science and software engineering that essentially self-assembled,
and is aligned around their mission (of teaching and training). The
non-profit structure around it is, again, an ex post facto creation.</p>
<p>Data analysis commons, in which methods, data, compute resources, and
data analysis interfaces are coordinated to address the data analysis
needs of a community, would be another example.</p>
<p>(Wikipedia might be another, but I'm less familiar with how it works.)</p>
<h2>Why do we care about these communities?</h2>
<p>Well, these communities are amazingly <em>effective</em>, in at least
some cases. For example, Python and R between them are essentially
<em>the</em> modern data science languages - both are open source, both
are community coordinated.</p>
<p>More generally, it is probably not an exaggeration to say that the
products of open source communities of effort underlie the vast
majority of Silicon Valley software, as well as most research software.</p>
<p>Sustaining, growing, and supporting these communities is pretty
important!</p>
<h2>How do these communities get started, and why are they effective?</h2>
<p>One feature of successful communities of effort - those that seem to
succeed in growing their pool of available effort - is that they are
often very organic in their approach to tackling their mission. This
is probably an effect of the community-based approach, in that the
members of these communities are to a reasonably significant extent
self-motivated and self-directed to solve their problems, and so the
solutions are often bottom-up created with only a light level of
coordination on top. (I'll revisit this in terms of governance in a
bit.)</p>
<p>The other kind of fun thing is that these days it's pretty easy to bootstrap
a community of effort: with some enthusiasm and a site like GitHub, you can
spin up a new community project quite quickly.</p>
<p>Last but not least, many (most? all?) communities of effort have at
least one person who has placed their effort at the service of the
community mission. These are the leaders and/or maintainers of the
project.</p>
<h2>So what's the problem? It's all good, right?</h2>
<p>Well... there are a few things I don't really understand.</p>
<p>For one, the formation of large groups of people who sustain a collective
to pursue a common goal violates basic tenets of collective action - at least,
as I understand them. The idea here is that, if there is a large group
of people pursuing a common goal, then the smart (economically rational) thing
for someone to do is ...not do any work at all, because the individual will
reap the benefits of the group work. So, what's different with these
communities of effort?</p>
<p>Sustainability and in particular <em>maintenance</em> is a big question, too;
these communities often rely on one or a few core maintainers to make things
happen, and it is really unclear why these maintainers (who are often
unpaid or underpaid) would take on these tasks. Yes, they get kudos and
reputation, but kudos and reputation do not put food on the table... why do
they do it?</p>
<p>(One thought - perhaps the creation of a successful community
of effort really depends on there being at least one person who ignores
short-term economic rationality? So then you just don't see all the failed
attempts where someone decides not to be irrational and hence not bother?
Another thought is that perhaps the key aspect of many of these communities
being <em>open</em> means that the maintainer-type folk realize that
no one else is tackling the common goal, and since they need the goal met
as well, they might as well do it?)</p>
<h2>Does framing the problem as a common pool resource problem yield any solutions?</h2>
<p>I think it does.</p>
<p>First, once you recognize effort as the limiting resource, the
question of how to maintain and increase that resource comes to the
forefront. There are a number of possible mechanisms, including
investing in making the community easy or rewarding to join, welcoming
new contributors, and/or providing special methods or data or access
to community members. In this view, these activities become more central
than they are if you are thinking only about the overall goal or mission
of the community.</p>
<p>Second, Elinor Ostrom outlined some design principles for
sustainability of common pool resources based on empirical studies, in
<a href="https://www.amazon.com/Governing-Commons-Evolution-Institutions-Collective/dp/0521405998">Governing the Commons</a>. One of these principles is about making
collective choice arrangements that allow most of the appropriators
(members of the community) to participate in the decision making
process.</p>
<p>Basically, this boils down to rewarding people who invest effort with
some level of influence over how that effort is applied towards the
community goals. This both incentivizes participation by conferring a
sense of collective ownership, and also seems to enable a form of
organic communication in which the people applying effort feed the
results of their work back into the overall community direction. This
is, to my mind, one of the things that makes these communities so
effective.</p>
<p>This mode of governance <strong>by</strong> members of the community <strong>for</strong> the
community goal leads to another interesting thought. Funders
participate in these communities in indirect ways, by seeking to fund
(or being sought out to fund) effort within the community. Rarely is
the direction of this support directly dictated by the funder; it's
usually laundered through the community member(s) being supported.
This is both good and bad - it limits the degree to which funders (and
companies) can directly influence the project, but also means that
funders may not be able to easily identify the uses to which their
money will be put.</p>
<h2>Who is part of the community of effort?</h2>
<p>Anyone who contributes their effort is part of the community, and hence
should get some form of influence over governance (by the above design
principle).</p>
<p>Extractive contributors - people who draw on the community's effort
without contributing to it, especially to the <em>maintenance</em> effort - would not,
however, be considered part of the community. See
<a href="http://ivory.idyll.org/blog/2018-how-open-is-too-open.html">How open is too open?</a> for this argument.</p>
<p>People who are using the product of the community but not costing the
community any effort (e.g. consumers of the source code) would also
not be part of the community, unless they contribute in some way to the
project.</p>
<p>One interesting result of this kind of thinking is that, for data
analysis commons, people who provide data or methods, train others,
or contribute documentation are all contributing effort. This
provides a rational basis for including this kind of work within the
community, and within its governance; these people are, in a direct
sense, contributing to the sustainability of the community of effort.</p>
<h2>Is academia a good home for these communities of effort?</h2>
<p>I note that the leadership and governance model in basic research, at
least, is often not inclusive of the people who are doing the work,
and instead centers on reputation and hierarchy. I don't think
universities and colleges focused on basic research are likely to be
a good part of the support network for communities of effort, in
general.</p>
<p>I have been quite impressed with what I've seen of extension efforts at
universities, which are faculty-level investments of time and energy
in communities. I'm planning to look more into the idea of a
digital extension model.</p>
<h2>Some final thoughts</h2>
<p>I think it's important to recognize that (these days at least) there
are lots of competing projects in which people can invest their time
and effort, and it's probably not a bad thing to frame it as a
competition between these communities for people's time and
attention. Communities that do a good job of attracting contributors
and incentivizing the inclusion of effort can win out and potentially
be more sustainable than communities that do a lousy job. (This has
potentially dire implications for some scientific research communities,
which are not always very welcoming or inclusive. I'm not sad about this.)</p>
<p>This framing also puts <em>soft skills</em> front and center in the equation,
and I think this is also a good outcome.</p>
<h2>Open / unaddressed questions</h2>
<p>Two open questions.</p>
<p>First, what are communities of effort <em>not</em> good at? I would venture
that any boring or maintenance-level jobs would tend to be addressed
poorly by these communities, due to how human enthusiasm works.</p>
<p>Second, I want to return to the missing link mentioned above - that
these communities seemingly depend on one or more people placing their
effort in service of the community. What are the reasons why people do
this, and how do we support and maintain it? Inquiring minds want to
know... It would be nice if we had a reasonably comprehensive picture
of why this occurs, because it doesn't seem like rational behavior
on the face of it. (I'm very thankful that people do this, of course,
which is why I want to better support this path!)</p>
<h2>Acknowledgements</h2>
<p>I gratefully acknowledge Adam Resnick, Matt Trunnell, Josh Greenberg,
Nadia Eghbal, Luiz Irber, and Tracy Teal, with whom I've had inspiring
conversations on these fronts.</p>
<p>The NIH (via the Data Commons funding) and the Moore Foundation
provided funding to me to think about, read about, and explore these
issues.</p>
<p>Comments welcome!</p>
<p>--titus</p>
<h1>My recent reading re sustaining open communities</h1>
<p>2019-03-01, by C. Titus Brown</p>
<p>What has Titus been reading lately?</p>
<p>I've been interested in the sustainability of open communities for a
while, and with the NIH Data Commons effort, was finally able to start
connecting some dots and finding some reading material. As far as I
can tell, the relevant literature is fragmented across a bunch of
fields that include social and technical studies, sociology,
economics, and political science. This reading has led me in some interesting
directions (see <a href="http://ivory.idyll.org/blog/2018-oss-framework-cpr.html">a high level post</a> as well as <a href="http://ivory.idyll.org/blog/tag/cpr.html">my overall collection of CPR blog posts</a>).</p>
<p>Every time I talk to someone in depth about it, I get some more
reading. Notwithstanding that, some of the reading I <em>have</em> already
found is really interesting! And (at the encouragement of Josh
Greenberg, among others) I thought I'd post a few of the books and
links that I've found inspiring. I'll also give my "first read"
impressions of them - this is very different from what I'd do if I
were a true scholar, but hopefully those impressions will help motivate
people to look into these books more!</p>
<p>Acknowledgements are due first -
<a href="https://twitter.com/michael_nielsen/status/1009075233368596482">Michael Nielsen is a rich source of references</a>,
as are
<a href="https://twitter.com/CameronNeylon/status/1009238646044545024">Cameron Neylon</a>
and
<a href="https://twitter.com/nayafia/status/1028053008867676160">Nadia Eghbal</a>. Thank
you!!</p>
<hr>
<p><a href="https://www.amazon.com/Governing-Commons-Evolution-Institutions-Collective/dp/0521405998">Governing the Commons</a>,
by Dr. Elinor Ostrom. This is a classic book that covers the topic she
is most well known for, and for which she received the Nobel Prize in
Economics. The first third of the book is a bit of a heavy slog, but
the last two thirds contains a number of fascinating case studies
about how common pool resources can be managed sustainably and how
these "commons" can be governed in such a way to promote
sustainability.</p>
<hr>
<p><a href="https://www.amazon.com/Logic-Collective-Action-Printing-Appendix/dp/0674537513">Logic of Collective Action</a>,
by Dr. Mancur Olson. This is another classic book that talks about how
there is no incentive for large groups of people to act collectively
to reach a common goal. I have a long blog post on this in
waiting... but it needs a lot of editing first.</p>
<hr>
<p><a href="https://www.amazon.com/Social-Life-Things-Commodities-Anthropology/dp/0521357268">The Social Life of Things</a>,
edited by Dr. Arjun Appadurai. This is a collection of mind-blowing
studies of how commodities acquire and retain value. Two particularly
pertinent quotes out of (literally) a hundred I noted down:</p>
<p>First,</p>
<blockquote>
<p>Here Renfrew shows us very persuasively that the decisive factors in
technological innovation (which is critical to the development of new
commodities) are often social and political rather than simply
technical. (p34)</p>
</blockquote>
<p>Second,</p>
<blockquote>
<p>This circuit ensures barrenness and death instead of fertility and
prosperity. It is based on the transformation of reciprocity into
commodity exchange. (p53, referencing Taussig, 1980:224).</p>
</blockquote>
<hr>
<p><a href="https://www.amazon.com/Fractivism-Corporate-Chemical-Experimental-Futures/dp/0822369028">Fractivism</a>,
by Dr. Sara Wylie. This is a recent book by Professor Wylie, who is
at Northeastern, and it talks about her efforts to coalesce a
community around collecting reports of fracking. Among the themes of
this book that stuck with me are that people are hungry for community,
and that technology can be used in very intentional ways to build this
community. (Thanks to Gabriella Coleman for the introduction to this
work and to Dr. Wylie!)</p>
<hr>
<p><a href="https://www.amazon.com/Generous-Thinking-Radical-Approach-University/dp/1421429462">Generous Thinking</a>,
by Dr. Kathleen Fitzpatrick. Dr. Fitzpatrick is a professor at
Michigan State who ironically enough arrived just as I left for UC
Davis! Her book is about how to change the way we engage with each
other in the university to embrace a more generous kind of
engagement - one in which we take inspiration from each other, rather
than diving straight into a critical analysis.</p>
<hr>
<p>Josh Greenberg pointed me at
<a href="https://en.wikipedia.org/wiki/Club_good">"club goods"</a>, and in
particular the matrix of excludable/rivalrous types of
goods. <a href="https://twitter.com/ctitusbrown/status/1047409685001904134">See my tweet for a hand drawn version that some people liked and some people hated :)</a>.</p>
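<p>For reference, here's my rendering of the standard textbook 2x2 (not
necessarily Josh's exact framing):</p>
<table>
<thead>
<tr>
<th></th>
<th>excludable</th>
<th>non-excludable</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>rivalrous</strong></td>
<td>private goods</td>
<td>common pool resources</td>
</tr>
<tr>
<td><strong>non-rivalrous</strong></td>
<td>club goods</td>
<td>public goods</td>
</tr>
</tbody>
</table>
<p>In this framing, effort looks a lot like a common pool resource:
it's rivalrous (effort spent on one task isn't available for another),
and it's hard to exclude people from drawing on it.</p>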
<hr>
<p><a href="http://worldaftercapital.org/">World after Capital</a>, by Albert
Wenger. This (free!) book has a lot of great discussion about the role of
attention in the modern world.</p>
<hr>
<p>That's all for now. I have about 50 other links and books to look at
after this, too! More ...not soon.</p>
<p>Feedback welcome!</p>
<p>--titus</p>